# Project 4 
## Sabq News Classification
#### By : Mohammad Qahtani

<img src="https://scontent.fruh3-1.fna.fbcdn.net/v/t1.0-9/50263531_2263997373610396_5681118121918726144_o.jpg?_nc_cat=102&_nc_ohc=bEm5jHdDi80AQnsxV1OekbyL0HNQYz7NzJQOg7_mXuvk7SkOHpENbCMcQ&_nc_ht=scontent.fruh3-1.fna&oh=9c49aa0d8cdba346a24cb300f06924fe&oe=5E7FBEED" style="float: left; margin: px; height: 450px"> 

## Data Description

This project is based on data acquired from https://www.kaggle.com/abdulrahmanals/arabic-news-from-sabq-website.
The original dataset has 4130 rows and the columns 'title', 'author_name_of_news', 'city', 'date', 'time',
'number_of_watching', 'number_of_like', 'comment', 'share', 'news_text', and 'link'.
Only the news links were kept, and the web scraping process was repeated to build a dataset suited to the scope of this project.

|Feature|Type|Description|
|---|---|---|
|title|object|News Title| 
|author_name|object|The author of the article|
|city|object|City of the news|
|shares|integer|Number of shares on social media|
|news_text|object| The article |
|label|list|The label of news from Sabq.org|

## Problem Statement :

Build a model that predicts the category of a news article, based on the Sabq.org labeling system.

# Clean and Prepare your data : 

In [145]:
# Starting with the Imports
import requests
import re
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import nltk
from tashaphyne.stemming import ArabicLightStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import pyarabic.arabrepr
import warnings
warnings.filterwarnings('ignore')

### Web scraping

The scraping was done in a separate Jupyter notebook; the code is shown here without being run.

In [None]:
# reading the data 
new_class = pd.read_csv('news_data.csv')

def get_data(link):
    '''
    input: str, link to a news article
    output: dict with the title, author name, city, number of shares, news text, and labels
    '''
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')
    # article body; stays empty when the page has no text (e.g. caricatures)
    body = soup.find_all('div', attrs={'itemprop': "articleBody"})
    news_text = ''.join(i.text.replace('\n', '') for i in body)
    shares = soup.find_all('ul', attrs={'class': "social-ul pull-left"})[0].strong.text
    # the author block is split on '-' into author name and city
    author_city = soup.find_all('div', attrs={'class': 'author-name'})[0].text.replace('\n', '').split('-')
    title = soup.find('title').text
    # every label link on the page
    labels = [j.text for j in soup.find_all('a', attrs={'class': "ng-binding ng-scope"})]
    return {'title': title, 'author_name': author_city[0], 'city': author_city[1],
            'shares': shares, 'news_text': news_text, 'label': labels}

In [None]:
# scrape every link and collect one record per article
records = [get_data(link) for link in new_class.link]

data = pd.DataFrame(records)
data.to_csv('full_data.csv')
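
Scraping several thousand pages in one loop is fragile, because a single failed request or an unexpected page layout stops the whole run. A minimal sketch of a more defensive loop follows, assuming a scraper function that takes a link and returns one record. The names `scrape_all`, `fetch`, and `delay` are hypothetical and not part of the original notebook.

```python
import time

def scrape_all(links, fetch, delay=0.0):
    """Call fetch(link) for every link, collecting records and failed links."""
    records, failed = [], []
    for link in links:
        try:
            records.append(fetch(link))
        except Exception as exc:  # network error, missing HTML element, ...
            failed.append((link, repr(exc)))
        time.sleep(delay)  # pause between requests to go easy on the server
    return records, failed


# Demo with a stand-in fetch function (no network access needed).
def fake_fetch(link):
    if "bad" in link:
        raise ValueError("page layout not recognised")
    return {"link": link, "title": "example"}

records, failed = scrape_all(["a", "bad-link", "b"], fake_fetch)
```

The failed links can then be retried or inspected separately instead of being lost.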

In [132]:
news = pd.read_csv('full_data.csv', index_col= False)
news.head(5)

Unnamed: 0,title,author_name,city,shares,news_text,label
0,الدفاع المدني بالرياض يناقش مع الهلال الأحمر...,صحيفة سبق الإلكترونية,الرياض,2,استقبل مدير الدفاع المدني بمنطقة الرياض، اللوا...,"['المملكة', 'محليات']"
1,"""أبوساق"": خطاب الملك بـ""الشورى"" وثيقة استرات...",وكالة الأنباء السعودية (واس),الرياض,4,عدَّ وزير الدولة عضو مجلس الوزراء لشؤون مجلس ا...,"['المملكة', 'محليات']"
2,المملكة ترفض تصريحات الحكومة الأمريكية بشأن ...,وكالة الأنباء السعودية (واس),الرياض,5,عبر مصدر مسؤول بوزارة الخارجية عن رفضه التام ل...,"['المملكة', 'محليات']"
3,ولي العهد يستعرض مع وزير الدفاع بكوريا الجنو...,وكالة الأنباء السعودية (واس),الرياض,6,استقبل صاحب السمو الملكي الأمير محمد بن سلمان ...,"['المملكة', 'محليات']"
4,"رائدات أعمال لـ""سبق"": وضعنا بصمات مميزة عربي...",خلود غنام,الرياض,6,انطلق، مساء أمس، ملتقى سيدات الأعمال الـ19 الم...,"['المملكة', 'محليات']"


In [133]:
# convert each label from a stringified list to a single string
import ast
new = []
for i in news.label:
    a = ast.literal_eval(i)  # parse "['x', 'y']" back into a Python list
    new.append(''.join(a))

news['label2'] = new
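
Joining with an empty string fuses the labels, so 'المملكة' and 'محليات' become 'المملكةمحليات'. A possible alternative, shown only as a sketch, joins them with a separator so the individual labels stay readable. The `labels_to_string` helper and the `' | '` separator are assumptions, and switching to it would change the `label2` values shown in the later outputs.

```python
import ast

def labels_to_string(raw, sep=" | "):
    """Parse a stringified list such as "['a', 'b']" and join its items."""
    labels = ast.literal_eval(raw)
    return sep.join(labels)

example = labels_to_string("['المملكة', 'محليات']")
single = labels_to_string("['كاريكاتير']")
```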

In [135]:
news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4136 entries, 0 to 4135
Data columns (total 7 columns):
title          4136 non-null object
author_name    4136 non-null object
city           4136 non-null object
shares         4136 non-null object
news_text      4134 non-null object
label          4136 non-null object
label2         4136 non-null object
dtypes: object(7)
memory usage: 226.3+ KB


In [136]:
# the rows with null news_text are caricatures (images without text), so they can be dropped
news[news['label2'] == 'كاريكاتير']

Unnamed: 0,title,author_name,city,shares,news_text,label,label2
3108,"""إرهاب نصر الله على المتظاهرين في لبنان""",صحيفة سبق الإلكترونية,الرياض,25,,['كاريكاتير'],كاريكاتير
3893,توقيع ترامب يجبر أردوغان على الانسحاب من سوريا⁩,صحيفة سبق الإلكترونية,الرياض,40,,['كاريكاتير'],كاريكاتير


In [137]:
# dropna
news.dropna(inplace=True)

In [138]:
news.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4134 entries, 0 to 4135
Data columns (total 7 columns):
title          4134 non-null object
author_name    4134 non-null object
city           4134 non-null object
shares         4134 non-null object
news_text      4134 non-null object
label          4134 non-null object
label2         4134 non-null object
dtypes: object(7)
memory usage: 258.4+ KB


# Preprocessing  :

In [139]:
# prepare to use the Tashaphyne library
arepr = pyarabic.arabrepr.ArabicRepr() 
repr = arepr.repr

In [140]:
def news_to_words(raw_news):
    '''
    input: str, raw news text
    output: str, the preprocessed news text (tokenized, stop words removed, stemmed)
    '''
    # Load the stop words (one per line) into a set for fast lookup.
    with open("list.txt") as f:
        stops = set(f.read().splitlines())
    ArListem = ArabicLightStemmer()

    # 1. Tokenize on word characters.
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(raw_news)

    # 2. Remove stop words.
    tokens = [w for w in tokens if w not in stops]

    # 3. Stem each remaining word.
    stemmed_words = [ArListem.light_stem(i) for i in tokens]

    # 4. Join the words back into one string separated by spaces,
    # and return the result.
    return " ".join(stemmed_words)
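
The function above re-reads the stop-word file and creates a new stemmer on every call. The sketch below runs the same tokenize, filter, stem, and join steps but prepares the stop words once. `make_cleaner` is a hypothetical helper, the regular expression stands in for `RegexpTokenizer(r'\w+')`, and the stemmer is any callable: the identity function here, or `ArabicLightStemmer().light_stem` in the notebook's setting.

```python
import re

def make_cleaner(stop_words, stem=lambda w: w):
    """Return a function that tokenizes, drops stop words, stems, and re-joins."""
    stops = set(stop_words)          # built once, reused for every document
    token_re = re.compile(r"\w+")    # same pattern the notebook passes to RegexpTokenizer

    def clean(text):
        tokens = token_re.findall(text)
        kept = [w for w in tokens if w not in stops]
        return " ".join(stem(w) for w in kept)

    return clean

# The two stop words here are illustrative, not the contents of list.txt.
clean = make_cleaner(stop_words=["في", "من"])
result = clean("استقبل مدير الدفاع المدني في الرياض")
```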


In [141]:
print("Cleaning and parsing the news......") # good to know what is going on!

# Initialize an empty list to hold the cleaned news texts.
clean_news = []


# Let's grab the number to run the loop!
total_news = news.shape[0]
print(f'There are {total_news} news_text.')

j = 0 # our counter
for text in news['news_text']:
    # Convert news to words, then append to clean_news.
    clean_news.append(news_to_words(text)) 
    
    # After every 100 articles, print a progress message.
    if (j + 1) % 100 == 0:
        print(f'Review {j + 1} of {total_news} news.')
    
    j += 1 # adding 1 to the counter 

print("All done!")

Cleaning and parsing the news......
There are 4134 news_text.
Review 100 of 4134 news.
Review 200 of 4134 news.
Review 300 of 4134 news.
Review 400 of 4134 news.
Review 500 of 4134 news.
Review 600 of 4134 news.
Review 700 of 4134 news.
Review 800 of 4134 news.
Review 900 of 4134 news.
Review 1000 of 4134 news.
Review 1100 of 4134 news.
Review 1200 of 4134 news.
Review 1300 of 4134 news.
Review 1400 of 4134 news.
Review 1500 of 4134 news.
Review 1600 of 4134 news.
Review 1700 of 4134 news.
Review 1800 of 4134 news.
Review 1900 of 4134 news.
Review 2000 of 4134 news.
Review 2100 of 4134 news.
Review 2200 of 4134 news.
Review 2300 of 4134 news.
Review 2400 of 4134 news.
Review 2500 of 4134 news.
Review 2600 of 4134 news.
Review 2700 of 4134 news.
Review 2800 of 4134 news.
Review 2900 of 4134 news.
Review 3000 of 4134 news.
Review 3100 of 4134 news.
Review 3200 of 4134 news.
Review 3300 of 4134 news.
Review 3400 of 4134 news.
Review 3500 of 4134 news.
Review 3600 of 4134 news.
Review 3700

In [142]:
# checking the format of the list content
clean_news[0]

'ستقبل مدير دفاع مدن منطق رياض لواء خالد حرق مدير إدار عام فرع هيئ هلال أحمر سعود منطق رياض سعود حرب ذلك مناقش سبل تعا إطار عمل ميدان جه دا لقاء كد لواء خالد حرق تعا جه مشدد هم عزيز تنسيق مدير دفاع مدن هلال أحمر منطق تخاذ تدابير إجراء لازم حما سلام مواطن مقيم مخاطر قديم مساعد ذليل مصاعب حصل رض ميد حرب شكر قدير واء خالد حرق حفاو استقبال كد تعا ظل مستمر إذن له عالى خدم مواطن مقيم ما خدم صالح عام لقاء سلم لواء حرق درع ذكار سعود حرب فيما حضر لقاء مساعد مدير دفاع مدن شؤو عمل منطق مدير إدار إطفاء إنقاذ عميد عبدالل سحيبان ضباط مدير هلال أحمر حضر مدير إدار قياد ميدان مشرف غرف عمل عبدالمحسن حرب مدير إدار اتصال مؤسس متحدث رسم هلال أحمر سعود منطق رياض ياسر جلاجل'

In [143]:
len(clean_news)

4134

In [144]:
news['cleaned']  = clean_news

In [181]:
news.head()

Unnamed: 0,title,author_name,city,shares,news_text,label,label2,cleaned
0,الدفاع المدني بالرياض يناقش مع الهلال الأحمر...,صحيفة سبق الإلكترونية,الرياض,2,استقبل مدير الدفاع المدني بمنطقة الرياض، اللوا...,"['المملكة', 'محليات']",المملكةمحليات,ستقبل مدير دفاع مدن منطق رياض لواء خالد حرق مد...
1,"""أبوساق"": خطاب الملك بـ""الشورى"" وثيقة استرات...",وكالة الأنباء السعودية (واس),الرياض,4,عدَّ وزير الدولة عضو مجلس الوزراء لشؤون مجلس ا...,"['المملكة', 'محليات']",المملكةمحليات,عد زير دول عض مجلس وزراء شؤو مجلس شورى محمد صل...
2,المملكة ترفض تصريحات الحكومة الأمريكية بشأن ...,وكالة الأنباء السعودية (واس),الرياض,5,عبر مصدر مسؤول بوزارة الخارجية عن رفضه التام ل...,"['المملكة', 'محليات']",المملكةمحليات,عبر مصدر مسؤول وزار خارج رفض تام تصريح حكوم أم...
3,ولي العهد يستعرض مع وزير الدفاع بكوريا الجنو...,وكالة الأنباء السعودية (واس),الرياض,6,استقبل صاحب السمو الملكي الأمير محمد بن سلمان ...,"['المملكة', 'محليات']",المملكةمحليات,ستقبل صاحب سمو ملك أمير محمد سلم عبدالعزيز لي ...
4,"رائدات أعمال لـ""سبق"": وضعنا بصمات مميزة عربي...",خلود غنام,الرياض,6,انطلق، مساء أمس، ملتقى سيدات الأعمال الـ19 الم...,"['المملكة', 'محليات']",المملكةمحليات,نطلق ملتقى سيد أعمال ـ19 مقام قاع خزامى مناسب ...


In [182]:
news.label2.nunique()

22

In [91]:
from sklearn.feature_extraction.text import CountVectorizer

In [92]:
# Vectorizing: build a bag-of-words count matrix from the cleaned texts.
vectorizer = CountVectorizer(lowercase=False)
data_features = vectorizer.fit_transform(clean_news)
vocab = vectorizer.get_feature_names()
data_features = data_features.toarray()

# one column per vocabulary word, one row per news article
df = pd.DataFrame(data_features, columns=vocab)
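
To make the vectorizing step concrete: a bag-of-words matrix has one column per vocabulary word and one row per document, and each cell holds a word count. The plain-Python illustration below shows that idea only. It is not how `CountVectorizer` is implemented, and `bag_of_words` is a hypothetical helper that splits on whitespace.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix as a list of rows)."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    # a Counter returns 0 for words that do not occur in a document
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

vocab, matrix = bag_of_words(["مدير دفاع مدن", "دفاع دفاع رياض"])
```

In the notebook, `fit_transform` returns a sparse matrix, and calling `.toarray()` turns it into a dense array, which can take considerably more memory when the vocabulary is large.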

In [98]:
df.head(2)

Unnamed: 0,00,000,001,002,0056224708541,0056940702130,006,008,009611762711,009611762722,...,ﻣﺨﺮﺟﺎت,ﻣﻊ,ﻣﻦ,ﻣﻮاءﻣﺔ,ﻧﻔﻄﻴﺔ,ﻧﻮ,ﻫﻴﺌﺔ,ﻳﺮاد,ﻳﻜﺎﻓﺊ,ﻹﻳﺠﺎد
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [160]:
# Import train_test_split.
from sklearn.model_selection import train_test_split

# Features and target
X = df
y = news['label2']

# Create train_test_split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42 )#, stratify = True)
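
The commented-out `stratify = True` would not work as written: `train_test_split` expects the label array itself (for example `stratify=y`), and stratification raises an error when a class has only one sample. A small sketch for spotting such classes before splitting follows; `rare_labels` and the toy labels are hypothetical.

```python
from collections import Counter

def rare_labels(labels, min_count=2):
    """Return the labels that occur fewer than min_count times, with their counts."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}

# Toy labels for illustration only; in the notebook this would be news['label2'].
y_toy = ["محليات", "محليات", "رياضة", "رياضة", "كاريكاتير"]
too_rare = rare_labels(y_toy)
```

Classes found this way would need to be merged or removed before a stratified split can be used.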

# Modeling
- The models are trained only on the text of the news.

In [178]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import svm 
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix

In [156]:
# Initializing
clf = SGDClassifier()
lr = LogisticRegression()
model_nb = MultinomialNB()
svm_classfier = svm.SVC()
rf = RandomForestClassifier()
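
Each model below is fitted and scored with the same calls, so the comparison can also be written as one loop. The sketch assumes objects that expose scikit-learn's `fit` and `score` methods. `evaluate` is a hypothetical helper, and `MajorityClass` is a stand-in model so the example runs without the notebook's data.

```python
from collections import Counter

class MajorityClass:
    """Stand-in model: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def score(self, X, y):
        # accuracy: share of samples whose label equals the majority label
        return sum(label == self.label_ for label in y) / len(y)

def evaluate(models, X_train, y_train, X_test, y_test):
    """Fit every model and return {name: (train_score, test_score)}."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = (model.score(X_train, y_train),
                         model.score(X_test, y_test))
    return results

scores = evaluate({"baseline": MajorityClass()},
                  [[0], [1], [2]], ["a", "a", "b"],
                  [[3], [4]], ["a", "b"])
```

With the notebook's objects, the call would look like `evaluate({'SVM': svm_classfier, 'SGD': clf, 'Random Forest': rf}, X_train, y_train, X_test, y_test)`.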

#### Fitting

In [170]:
svm_classfier.fit(X_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [171]:
svm_classfier.score(X_train,y_train)

0.6896774193548387

In [172]:
svm_classfier.score(X_test, y_test)

0.7205029013539652

In [164]:
clf.fit(X_train,y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [165]:
clf.score(X_train,y_train)

0.9987096774193548

In [166]:
clf.score(X_test, y_test)

0.8694390715667312

In [167]:
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [168]:
rf.score(X_train,y_train)

0.9896774193548387

In [169]:
rf.score(X_test, y_test)

0.8423597678916828

In [157]:
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [158]:
lr.score(X_train, y_train)

1.0

In [159]:
lr.score(X_test, y_test)

0.8887814313346228

In [107]:
model_nb.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [108]:
model_nb.score(X_train, y_train)

0.9170967741935484

In [109]:
model_nb.score(X_test, y_test)

0.8762088974854932

#### Grid Searching

#### Grid search for Multinomial NB

In [188]:
parm_grid = {'alpha': np.arange(0.1,3.0,0.1),
            'fit_prior':[True, False],
            'class_prior':[None]}

In [189]:
grid = GridSearchCV(model_nb, parm_grid, verbose = 1,n_jobs=-1)

In [190]:
grid.fit(X_train, y_train)

Fitting 3 folds for each of 58 candidates, totalling 174 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   20.0s
[Parallel(n_jobs=-1)]: Done 174 out of 174 | elapsed:  1.4min finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=MultinomialNB(alpha=1.0, class_prior=None,
                                     fit_prior=True),
             iid='warn', n_jobs=-1,
             param_grid={'alpha': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3,
       1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6,
       2.7, 2.8, 2.9]),
                         'class_prior': [None], 'fit_prior': [True, False]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=1)
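
The log line "Fitting 3 folds for each of 58 candidates, totalling 174 fits" follows directly from the grid: 29 values of `alpha` times 2 values of `fit_prior` times 1 value of `class_prior` gives 58 candidates, and each candidate is fitted on 3 folds. A short check of that arithmetic, with the alpha list rebuilt in plain Python rather than with `np.arange`:

```python
from itertools import product

alphas = [round(0.1 * i, 1) for i in range(1, 30)]   # 0.1, 0.2, ..., 2.9
grid = {"alpha": alphas, "fit_prior": [True, False], "class_prior": [None]}

# every combination of one value per parameter is one candidate
candidates = list(product(*grid.values()))
n_folds = 3                      # number of folds reported in the log above
total_fits = len(candidates) * n_folds
```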

In [191]:
grid.best_estimator_

MultinomialNB(alpha=0.5, class_prior=None, fit_prior=True)

In [192]:
grid.score(X_train, y_train)

0.937741935483871

In [193]:
grid.score(X_test, y_test)

0.8713733075435203

In [194]:
grid.best_score_

0.87

#### Grid search for Logistic Regression

In [250]:
# define the Logistic Regression grid before building the search that uses it
parm_grid = {'C': np.arange(0.1,1.0,0.1),
#             'penalty':['l1', 'l2'],
            'dual':[True, False],
#             'fit_intercept':[True, False],
#             'solver':['newton-cg', 'sag', 'saga', 'lbfgs','liblinear','warn'],
#             'max_iter':range(100,500,100),
            }

In [249]:
grid_lr = GridSearchCV(lr, parm_grid, verbose = 1,n_jobs=-1)

In [240]:
grid_lr.estimator.get_params().keys()

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])

In [251]:
grid_lr.fit(X_train, y_train)

Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   27.4s
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:   40.4s finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'C': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
                         'dual': [True, False]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=1)

In [252]:
grid_lr.best_estimator_

LogisticRegression(C=0.1, class_weight=None, dual=True, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [253]:
grid_lr.best_score_

0.8770967741935484

In [254]:
grid_lr.score(X_train, y_train)

0.9974193548387097

In [255]:
grid_lr.score(X_test, y_test)

0.8916827852998066

### Results

|Model|Training Score|Testing Score|
|---|---|---|
|SVM|0.6896774193548387|0.7205029013539652|
|Random Forest Classifier|0.9896774193548387|0.8423597678916828|
|Logistic Regression|1.0|0.8887814313346228|
|SGD Classifier|0.9987096774193548|0.8694390715667312|
|Multinomial NB|0.9170967741935484|0.8762088974854932|
|Grid Search Multinomial NB|0.937741935483871|0.8713733075435203|
|Grid Search Logistic Regression|0.9974193548387097|0.8916827852998066|
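
A large gap between training and testing scores is a sign of overfitting. The sketch below takes the scores from the table above, rounded to four decimals, and computes that gap for each model along with the model that has the highest testing score.

```python
# (training score, testing score) per model, rounded from the Results table
scores = {
    "SVM": (0.6897, 0.7205),
    "Random Forest Classifier": (0.9897, 0.8424),
    "Logistic Regression": (1.0, 0.8888),
    "SGD Classifier": (0.9987, 0.8694),
    "Multinomial NB": (0.9171, 0.8762),
    "Grid Search Multinomial NB": (0.9377, 0.8714),
    "Grid Search Logistic Regression": (0.9974, 0.8917),
}

# positive gap: the model scores better on training data than on test data
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}
best_model = max(scores, key=lambda name: scores[name][1])
```

By testing score the grid-searched Logistic Regression comes out on top, while SVM is the only model whose testing score exceeds its training score.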


In [111]:
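# predictions for the test data with the Multinomial NB model
# (in the Results table, the grid-searched Logistic Regression has the highest testing score)
pred = model_nb.predict(X_test)

In [174]:
# Confusion matrix and classification report
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))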

[[100   0   0   0   0   0   6   0   0   0   0   0   0   0   0   0   0   1]
 [  0   0   0   0   0   0  15   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   8   0   0   0   4   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   7   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   3   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0]
 [ 17   1   7   0   0   0 695   0   4   0   0   0   0   0   0   0   0  17]
 [  2   0   0   0   0   0   2   7   0   0   0   0   0   0   0   0   0   1]
 [  0   0   0   0   0   0   4   0  55   0   0   0   0   0   0   0   0   0]
 [  1   0   0   0   0   0   0   0   4   1   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   1]
 [  0   0   0   0   0   0   3   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0