# First Simple Model

_Author: Ian Sharff_

This is the initial model, built from a corpus of news articles written in Modern Standard Arabic (العربية الفصحى). The articles were obtained from <a href="https://www.kaggle.com/haithemhermessi/sanad-dataset">this dataset from Kaggle</a> and span 7 categories, each with 6,500 articles. First, count vectorization was performed with the 10,000 most common words, after removing the 750 most common Arabic stopwords (obtained from <a href="https://github.com/mohataher/arabic-stop-words/blob/master/list.txt">here</a>). A Multinomial Naïve Bayes model was then trained on a stratified 80% of the data, with the remaining 20% reserved for testing. The model was trained in the initial_eda notebook, which cannot be run without the data, and is stored as a .pkl file in this branch.
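Since the training notebook cannot be run without the data, the steps described above can be sketched on a toy corpus. This is a minimal illustration, not the code from initial_eda: the English sentences, the `toy_`-prefixed names, and the two-word stopword list are all assumptions made for the example, while `max_features=10000`, the 0.2 test size, and the seed of 42 mirror the values stated in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in corpus; the real articles come from the Kaggle dataset
toy_texts = [
    "the team won the match", "the striker scored a late goal",
    "the coach praised the players", "fans cheered the final goal",
    "the club signed a new defender",
    "the bank raised interest rates", "stocks fell as markets closed",
    "the company reported higher profit", "investors bought more shares",
    "the currency lost value against the dollar",
]
toy_labels = ["sports"] * 5 + ["finance"] * 5
toy_stopwords = ["the", "as"]  # placeholder for the Arabic stopword list

# Stratified 80/20 split keeps the class balance identical in both subsets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42
)

# Count vectorization: keep at most the 10,000 most frequent non-stopword terms
toy_vectorizer = CountVectorizer(max_features=10000, stop_words=toy_stopwords)
toy_X_train_cv = toy_vectorizer.fit_transform(toy_X_train)  # fit on training data only
toy_X_test_cv = toy_vectorizer.transform(toy_X_test)        # reuse the fitted vocabulary

# Multinomial Naive Bayes on the raw term counts
toy_model = MultinomialNB().fit(toy_X_train_cv, toy_y_train)
toy_predictions = toy_model.predict(toy_X_test_cv)
```

Fitting the vectorizer on the training split only means the test articles never influence the vocabulary, which keeps the held-out metrics honest.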

In [1]:
import requests
import pickle
import pandas as pd
import numpy as np
import nltk

from sklearn.metrics import classification_report

In [2]:
# Define global constants
TEST_SIZE = 0.2
SEED = 42
STOPWORDS_URL = 'https://raw.githubusercontent.com/mohataher/arabic-stop-words/master/list.txt'

In [3]:
r = requests.get(STOPWORDS_URL)
stopwords = []
if r.ok:  # True only for status codes below 400
    stopwords = r.text.split('\n')
else:
    print(f"Request status code: {r.status_code}")
    print(f"Stopwords not obtained from {STOPWORDS_URL}, defaulting to NLTK's Arabic stopwords list...")
    try:
        stopwords = nltk.corpus.stopwords.words('arabic')
    except LookupError:
        # Raised when the NLTK stopwords corpus has not been downloaded
        print("Failed to load NLTK Arabic Stopwords")
print(f"Stopwords loaded: {len(stopwords)}")

Stopwords loaded: 752


In [4]:
def load(filepath):
    """Unpickle and return the object stored at filepath."""
    with open(filepath, 'rb') as f:
        return pickle.load(f)

In [5]:
# Unpickle stored models and arrays
first_model = load('outputs/first_model.pkl')
names = ['X_train_cv', 'X_test_cv', 'y_train', 'y_test']
X_train_cv, X_test_cv, y_train, y_test = [load(f'outputs/first_{name}.pkl') for name in names]
feature_names = load('outputs/first_feature_names.pkl')
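For context, the files unpickled above were presumably written by a matching dump step in the training notebook. That code is not shown here, so the `save` helper below is a hypothetical counterpart to `load`, and the round trip uses a temporary directory and placeholder data rather than the real `outputs/` files.

```python
import os
import pickle
import tempfile


def save(obj, filepath):
    """Pickle obj to filepath (hypothetical counterpart to load)."""
    with open(filepath, 'wb') as f:
        pickle.dump(obj, f)


def load(filepath):
    """Unpickle and return the object stored at filepath."""
    with open(filepath, 'rb') as f:
        return pickle.load(f)


# Round trip with placeholder data in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'first_feature_names.pkl')
    save(['feature_a', 'feature_b'], path)
    restored = load(path)
```

Unpickling can execute arbitrary code, so `load` should only be pointed at files from a trusted source such as this branch.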

In [6]:
# Display vectorized training features
pd.DataFrame(X_train_cv.toarray(), columns=feature_names)  # toarray() yields an ndarray rather than np.matrix

Unnamed: 0,تشير,خبراء,برامج,الحماية,أجهزة,الكمبيوتر,الإمارات,ارتفاع,معدلات,الهجمات,...,بلاتيني,التطعيم,توام,مشغلي,المصاحبة,كردستان,النمل,دارفور,بوسكي,حديثي
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36395,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
36396,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36397,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36398,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
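Densifying the matrix as in cell In [6] materializes every zero: with 36,400 training rows and 10,000 features that is 364 million cells, which can run to gigabytes depending on the dtype. One alternative, sketched here on a small stand-in matrix, is to build the frame with pandas' sparse accessor so only the nonzero counts are stored. The `toy_` names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Small stand-in for X_train_cv (hypothetical values)
toy_counts = sparse.csr_matrix(np.array([[0, 2, 0],
                                         [1, 0, 0],
                                         [0, 0, 3]]))
toy_feature_names = ['word_a', 'word_b', 'word_c']

# Build a DataFrame backed by sparse columns instead of a dense array
sparse_df = pd.DataFrame.sparse.from_spmatrix(toy_counts, columns=toy_feature_names)

# Fraction of cells that hold a nonzero (explicitly stored) value
density = sparse_df.sparse.density
```

With the real data, the same call would be `pd.DataFrame.sparse.from_spmatrix(X_train_cv, columns=feature_names)`.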


In [7]:
# Predict labels with first Naive Bayes model
y_hat_train = first_model.predict(X_train_cv)
y_hat_test = first_model.predict(X_test_cv)

In [8]:
# Display classification report metrics
print('Training Metrics_______')
print(classification_report(y_train, y_hat_train))
print('\nTesting Metrics_______')
print(classification_report(y_test, y_hat_test))

Training Metrics_______
              precision    recall  f1-score   support

     culture       0.91      0.97      0.94      5200
     finance       0.99      0.91      0.95      5200
     medical       0.95      0.99      0.97      5200
    politics       0.97      0.98      0.97      5200
    religion       0.97      0.89      0.93      5200
      sports       1.00      0.98      0.99      5200
        tech       0.92      0.98      0.95      5200

    accuracy                           0.96     36400
   macro avg       0.96      0.96      0.96     36400
weighted avg       0.96      0.96      0.96     36400


Testing Metrics_______
              precision    recall  f1-score   support

     culture       0.91      0.96      0.93      1300
     finance       0.99      0.90      0.95      1300
     medical       0.94      0.98      0.96      1300
    politics       0.97      0.97      0.97      1300
    religion       0.97      0.88      0.92      1300
      sports       1.00      0