# Arabic News Classification - 02

## Model Evaluations

Judging from the confusion matrices produced in the [modeling notebook](./01_modeling.ipynb), the best-performing models were the Multinomial Naïve Bayes and Random Forest classifiers. This notebook computes accuracy scores for each of the simple models and stores them in a dataframe for later visualization. Training, cross-validation, and testing scores are compared to judge each model's shortcomings in bias and variance.

Finally, a grid search is run on a pipeline containing steps for count vectorization, TF-IDF transformation, and a Naïve Bayes classifier to determine the best parameters for scaling up to the full corpus, which is handled in the [models.py](./models.py) script. When executed, the resulting model is saved to `outputs/final_models`.
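
A minimal sketch of that grid search, assuming the project's `misc.run_grid_search` helper (defined elsewhere and not shown in this notebook) wraps scikit-learn's `GridSearchCV`. The function name `run_grid_search_sketch`, the toy English corpus, and the reduced parameter grid are illustrative assumptions, not the project's code or data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def run_grid_search_sketch(X, y, pipeline, param_grid, cv=5):
    """Hypothetical stand-in for misc.run_grid_search (the real helper is not shown)."""
    search = GridSearchCV(pipeline, param_grid, cv=cv, n_jobs=1)
    search.fit(X, y)
    print(f'Best score: {search.best_score_}')
    print(f'Best parameters: {search.best_params_}')
    # best_estimator_ is the pipeline refit on all of X with the best parameters
    return search.best_estimator_


# Toy corpus for illustration only (assumption); the notebook uses Arabic news text
docs = [
    'goal match team', 'team win league', 'match score goal', 'league team coach',
    'market stock price', 'bank price trade', 'stock trade market', 'price bank economy',
]
labels = ['sports'] * 4 + ['business'] * 4

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Reduced grid so the sketch runs quickly; keys follow the <step>__<parameter> convention
param_grid = {
    'tfidf__use_idf': (True, False),
    'clf__alpha': (0.1, 1.0),
}

best = run_grid_search_sketch(docs, labels, pipeline, param_grid, cv=2)
prediction = best.predict(['team goal league'])[0]
```

The returned estimator is a fitted pipeline, so it accepts raw text directly and can be pickled as a single object.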

In [1]:
# Dataframes and array operations
import pandas as pd
import numpy as np

# Model evaluation and data preparation
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate, cross_val_score
from sklearn.metrics import accuracy_score, f1_score

# Pipeline construction
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Defined functions
from tools import misc

# Remove warnings generated by Arabic words
import warnings
warnings.filterwarnings('ignore', category=UserWarning)

In [2]:
# Load the corpus dataframe and preview the first rows
data = misc.load('raw_data/corpus_df.pkl')
data.head()

id,cls,text
0,sports,أعلن المدرب النمسوي لبوروسيا دورتموند بيتر شتو...
1,sports,ذكرت وسائل الإعلام البلغارية الجمعة ان العداءة...
2,sports,برز اسم نجم مانشستر يونايتد رايان غيغز (36 عام...
3,sports,قال مدرب نادي انتر ميلان الإيطالي خوزيه موريني...
4,sports,بيتر تشيك: حارس مرمى تشيكي ولد في 20 مايو عام ...


In [3]:
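# Features are the raw article text; targets are the category labels
X = data.text
y = data.cls

# Hold out 20% of the corpus for testing, seeded for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=misc.SEED)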

In [4]:
# Load MultinomialNB, RandomForestClassifier, GradientBoostingClassifier, SVC
nb = misc.load('outputs/models/cv0.pkl')
rfc = misc.load('outputs/models/cv1.pkl')
gbc = misc.load('outputs/models/cv2.pkl')
svc = misc.load('outputs/models/cv3.pkl')

In [5]:
models = [nb, rfc, gbc, svc]
names = ['Naive Bayes', 'Random Forest', 'Gradient Boosting', 'Support Vector']

train_scores = []
cv_scores = []
test_scores = []

for name, model in zip(names, models):
    print(f'{name} Classifier')
    print('-'*30)

    # Accuracy of the already-fitted model on the training split
    train_score = model.score(X_train, y_train)
    train_scores.append(train_score)
    print(f'\tTrain Score: {train_score:.3f}')

    # Accuracy of the same fitted model on the held-out test split
    test_score = model.score(X_test, y_test)
    test_scores.append(test_score)
    print(f'\tTest Score: {test_score:.3f}')

    # Mean accuracy across folds; cross_val_score refits a clone on each fold
    cv_score = cross_val_score(model, X_train, y_train, n_jobs=4, verbose=1).mean()
    cv_scores.append(cv_score)
    print(f'\tCross Val Score: {cv_score:.3f}')
    print('-'*30)

Naive Bayes Classifier
------------------------------
	Train Score: 0.956
	Test Score: 0.956


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:   36.4s finished


	Cross Val Score: 0.954
------------------------------
Random Forest Classifier
------------------------------
	Train Score: 0.992
	Test Score: 0.991


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:  2.0min finished


	Cross Val Score: 0.960
------------------------------
Gradient Boosting Classifier
------------------------------
	Train Score: 0.969
	Test Score: 0.968


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:  6.2min finished


	Cross Val Score: 0.958
------------------------------
Support Vector Classifier
------------------------------
	Train Score: 0.981
	Test Score: 0.982


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed: 10.2min finished


	Cross Val Score: 0.952
------------------------------


In [6]:
scores_df = pd.DataFrame(data={
    'Training': train_scores,
    'Cross Validation': cv_scores,
    'Testing': test_scores,
}, index=names)
misc.save(scores_df, 'outputs/evaluations/baseline_scores_df.pkl')

In [7]:
scores_df

Model,Training,Cross Validation,Testing
Naive Bayes,0.956044,0.953791,0.955604
Random Forest,0.991813,0.95956,0.991209
Gradient Boosting,0.968599,0.957637,0.968352
Support Vector,0.981484,0.95228,0.982198
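
One rough way to read these scores for variance is to compare training accuracy with cross-validation accuracy: a larger gap suggests the model fits the training split more closely than it generalizes across folds. The sketch below copies the values from the table into a plain dictionary (so it runs without the pickled dataframe) and ranks the models by that gap. It is a heuristic reading, not a formal bias/variance decomposition.

```python
# Scores copied from the table above: (training, cross-validation, testing)
scores = {
    'Naive Bayes':       (0.956044, 0.953791, 0.955604),
    'Random Forest':     (0.991813, 0.959560, 0.991209),
    'Gradient Boosting': (0.968599, 0.957637, 0.968352),
    'Support Vector':    (0.981484, 0.952280, 0.982198),
}

# Gap between training and cross-validation accuracy for each model
gaps = {
    name: round(train - cv, 6)
    for name, (train, cv, _test) in scores.items()
}

# Print the models from smallest to largest gap
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f'{name:<18} train - cv = {gap:.4f}')

smallest_gap = min(gaps, key=gaps.get)
largest_gap = max(gaps, key=gaps.get)
```

On these numbers, Naive Bayes has the smallest gap (about 0.002) and Random Forest the largest (about 0.032).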


### Grid Search on Naïve Bayes

In [10]:
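# Text pipeline: token counts -> TF-IDF weighting -> Naive Bayes classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

# Candidate values per step, addressed as <step name>__<parameter>
param_grid = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.1, 0.5, 1)
}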

best_estimator = misc.run_grid_search(X_train, y_train, pipeline, param_grid)

Grid Searching...
PIPELINE:
	vect
	tfidf
	clf
PARAMS:
{'clf__alpha': (0.1, 0.5, 1),
 'tfidf__norm': ('l1', 'l2'),
 'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0)}

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:  7.8min finished


Best score: 0.960576923076923
Best parameters:
{'clf': MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True),
 'clf__alpha': 0.1,
 'clf__class_prior': None,
 'clf__fit_prior': True,
 'memory': None,
 'steps': [('vect',
            CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.5, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)),
           ('tfidf',
            TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
           ('clf', MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True))],
 'tfidf': TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True),
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sub
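
The "36 candidates, totalling 180 fits" line in the log follows directly from the grid: every combination of the four parameter lists is one candidate, and each candidate is fitted once per fold. A short check using only the standard library, with the fold count taken from the log:

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.1, 0.5, 1),
}

# Each candidate is one combination of values, one value per parameter
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

n_folds = 5  # from "Fitting 5 folds" in the log
n_candidates = len(candidates)        # 3 * 2 * 2 * 3
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)  # prints: 36 180
```

Adding one more value to any parameter list multiplies the candidate count accordingly, which is worth keeping in mind before widening the grid on the full corpus.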

In [23]:
misc.save(best_estimator, 'outputs/final_models/best_naive_bayes.pkl')
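
The `tools.misc` module is not shown in this notebook. Assuming `misc.save` and `misc.load` are thin wrappers around `pickle` (an assumption based on the `.pkl` paths), a save-and-reload round trip might look like the sketch below. The helper names `save_pickle` and `load_pickle` and the stand-in object are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Hypothetical equivalent of misc.save: serialize obj to path with pickle."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('wb') as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Hypothetical equivalent of misc.load: read a pickled object back."""
    with Path(path).open('rb') as f:
        return pickle.load(f)


# Stand-in for the fitted estimator; any picklable object round-trips the same way
stand_in_model = {'clf__alpha': 0.1, 'tfidf__norm': 'l2'}

# A temporary directory keeps the sketch from touching the project's outputs folder
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'final_models' / 'best_naive_bayes.pkl'
    save_pickle(stand_in_model, model_path)
    restored = load_pickle(model_path)
```

A pickled scikit-learn estimator should be reloaded with the same library version that saved it, and pickle files should only be loaded from trusted sources.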