## Dialect Classification Model Training and Evaluation

In this notebook, we train and evaluate several machine learning models for dialect classification on a balanced dataset, and compare the performance of Logistic Regression, Decision Tree, and Naive Bayes models.

## Step 1: Import Libraries
First, we import the necessary libraries for data manipulation and model training.

## Why? <br>
pandas: For data manipulation and analysis.<br>
numpy: For numerical operations.<br>

In [1]:
import pandas as pd
import numpy as np

## Step 2: Load Clean Data
Load the balanced dataset.

In [2]:
df_train = pd.read_csv('Data/train_data.csv')

In [3]:
df_train.head()

Unnamed: 0,text,dialect
0,انا البلد ده مافيهاش حاجه عدل تقريبا عمر ماتقل...,EG
1,قروش المرجع اكتر قروشك,SD
2,لذلك خسنا كمغاربه نسكتوا شويه حتي نشوفوا نهايه...,MA
3,حسني مبارك اتنحي لقا ناس بتموت المصريين ولانه ...,SD
4,العينبالعين جلست ضيوفك مازمه بيناتهن اكتر الاز...,LB


## Step 3: Feature Extraction
We use the TfidfVectorizer to convert the text data into numerical features.

## Why? <br>
TfidfVectorizer: To convert the text data into numerical features that can be used by machine learning models.<br>
## Benefit<br>
TF-IDF (Term Frequency-Inverse Document Frequency) turns text into numerical vectors that weight each word by how often it appears in a document and how rare it is across the whole corpus.<br>
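
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed IDF formula and L2 normalization that `TfidfVectorizer` uses by default, but it splits on whitespace only, so it is a simplification of the library's tokenizer rather than a replacement for it. The function name `tfidf` and the toy corpus are illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (whitespace tokenization)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = ["cat dog dog", "cat bird"]  # toy corpus, not from the dataset
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first toy document, "dog" receives a higher weight than "cat" because it occurs twice there and appears in only one of the two documents.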

In [4]:
# Initialize the TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

In [6]:
# Drop rows with missing text
df_train = df_train.dropna(subset=['text'])

In [7]:
# Fit the vectorizer on the training text and transform it into TF-IDF features
X = vectorizer.fit_transform(df_train['text'])
y = df_train['dialect']

## Step 4: Train-Test Split
Rather than splitting a single dataset, we load a separate held-out test file and transform both sets with the vectorizer fitted on the training text.

## Why? <br>
Train-Test Split: To evaluate the performance of the model on unseen data.<br>
## Benefit <br>
This helps in assessing the generalizability of the model by testing it on a separate dataset that was not used during training.<br>

In [8]:
df_test = pd.read_csv('Data/Test_data.csv')
df_test = df_test.dropna(subset=['text'])

In [9]:
X_train = vectorizer.transform(df_train["text"])
y_train = np.array(df_train["dialect"])
X_test = vectorizer.transform(df_test["text"])
y_test = np.array(df_test["dialect"])
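
The point of fitting the vectorizer on the training text only can be shown on a toy example. This is a small sketch with made-up sentences (not taken from the dataset): words that appear only in the test text have no column in the learned vocabulary and are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data; "stranger" never appears in the training text.
train_texts = ["hello world", "hello there"]
test_texts = ["hello stranger"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)  # learns vocabulary and IDF from train only
X_te = vec.transform(test_texts)       # reuses them without refitting

print(sorted(vec.vocabulary_))   # vocabulary learned from the training text
print(X_tr.shape, X_te.shape)    # both matrices share the same number of columns
```

Because the test matrix has the same columns as the training matrix, it can be passed directly to a model trained on `X_tr`.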

## Step 5: Train Logistic Regression Model
We train a Logistic Regression model with `max_iter=50`, the `newton-cg` solver, and inverse regularization strength `C=4`.

Logistic Regression: A simple and effective linear model for classification tasks.<br>
## Benefit <br>
Logistic Regression provides a baseline performance for the dialect classification task. 

In [10]:
# Initialize the Logistic Regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=50, solver='newton-cg', C=4)

In [11]:
# Train the model on the training data
model.fit(X_train, y_train)
# Predict on the test data
y_pred = model.predict(X_test)

We previously ran a grid search to find the best hyperparameters for the model.
The outputs below show the results of using different values for C and solver.
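
The grid search code itself is not part of this notebook. As a rough sketch of how such a search might look, the example below runs `GridSearchCV` over a TF-IDF plus Logistic Regression pipeline. The toy texts, the candidate solvers, the `f1_macro` scoring, and `cv=2` are all assumptions chosen for illustration, not the settings actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical stand-in data; the notebook would use df_train["text"] and df_train["dialect"].
texts = ["aa bb", "aa cc", "bb aa", "cc aa", "dd ee", "dd ff", "ee dd", "ff dd"]
labels = ["X", "X", "X", "X", "Y", "Y", "Y", "Y"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=50)),
])

# Assumed search space: the C values reported in this notebook and two solvers.
param_grid = {
    "clf__C": [0.1, 1, 4, 5, 10],
    "clf__solver": ["newton-cg", "lbfgs"],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

Putting the vectorizer inside the pipeline means it is refitted on each training fold, so the validation folds never influence the vocabulary or IDF weights.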

# Evaluation

We evaluate the model using accuracy, the classification report, the confusion matrix, and the macro F1 score to understand its performance.

In [12]:
# C=4
print(f'Train accuracy: {model.score(X_train, y_train)}')

from sklearn.metrics import accuracy_score
print(f'Test accuracy: {accuracy_score(y_test, y_pred)}')

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

Train accuracy: 0.9798625090724841
Test accuracy: 0.8208045509955303
              precision    recall  f1-score   support

          EG       0.84      0.90      0.87     11480
          LB       0.87      0.82      0.84      5578
          LY       0.80      0.81      0.81      7265
          MA       0.79      0.69      0.74      2283
          SD       0.72      0.64      0.68      2926

    accuracy                           0.82     29532
   macro avg       0.80      0.77      0.79     29532
weighted avg       0.82      0.82      0.82     29532

[[10351   202   487   115   325]
 [  412  4582   365    95   124]
 [  761   282  5856   147   219]
 [  290    86   266  1570    71]
 [  554   127   307    57  1881]]
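
As a cross-check, the accuracy and macro F1 in the report can be recomputed from the confusion matrix alone. The sketch below copies the C=4 matrix and assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the helper name `per_class_f1` is illustrative.

```python
labels = ["EG", "LB", "LY", "MA", "SD"]
# Confusion matrix for C=4 (rows: true class, columns: predicted class).
cm = [
    [10351, 202, 487, 115, 325],
    [412, 4582, 365, 95, 124],
    [761, 282, 5856, 147, 219],
    [290, 86, 266, 1570, 71],
    [554, 127, 307, 57, 1881],
]

def per_class_f1(cm):
    """Return the F1 score of each class from a square confusion matrix."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                          # row sum: samples of class k
        predicted = sum(cm[i][k] for i in range(n))  # column sum: predictions of k
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores

f1_scores = per_class_f1(cm)
macro_f1 = sum(f1_scores) / len(f1_scores)  # unweighted mean over classes
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for label, f1 in zip(labels, f1_scores):
    print(f"{label}: F1 = {f1:.2f}")
print(f"macro F1 = {macro_f1:.2f}, accuracy = {accuracy:.4f}")
```

The recomputed values round to the macro F1 of 0.79 and test accuracy of 0.8208 shown in the report, which confirms how the matrix should be read.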


In [60]:
# C=5 (model retrained with this value before evaluation)
print(f'Train accuracy: {model.score(X_train, y_train)}')

print(f'Test accuracy: {accuracy_score(y_test, y_pred)}')

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

Train accuracy: 0.983443572573025
Test accuracy: 0.8199241500744955
              precision    recall  f1-score   support

          EG       0.84      0.90      0.87     11480
          LB       0.87      0.82      0.84      5578
          LY       0.80      0.81      0.80      7265
          MA       0.79      0.69      0.74      2283
          SD       0.72      0.64      0.67      2926

    accuracy                           0.82     29532
   macro avg       0.80      0.77      0.79     29532
weighted avg       0.82      0.82      0.82     29532

[[10339   205   498   115   323]
 [  410  4593   361    93   121]
 [  761   284  5849   146   225]
 [  292    85   266  1569    71]
 [  557   132   312    61  1864]]


In [67]:
# C=10 (model retrained with this value before evaluation)
print(f'Train accuracy: {model.score(X_train, y_train)}')

print(f'Test accuracy: {accuracy_score(y_test, y_pred)}')

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

Train accuracy: 0.9905372279966586
Test accuracy: 0.8174183936069348
              precision    recall  f1-score   support

          EG       0.84      0.90      0.87     11480
          LB       0.86      0.82      0.84      5578
          LY       0.80      0.80      0.80      7265
          MA       0.79      0.68      0.73      2283
          SD       0.71      0.63      0.67      2926

    accuracy                           0.82     29532
   macro avg       0.80      0.77      0.78     29532
weighted avg       0.82      0.82      0.82     29532

[[10305   225   500   123   327]
 [  417  4582   371    90   118]
 [  765   288  5847   140   225]
 [  283    95   269  1563    73]
 [  568   135   317    63  1843]]


In [70]:
# C=1 (model retrained with this value before evaluation)
print(f'Train accuracy: {model.score(X_train, y_train)}')

print(f'Test accuracy: {accuracy_score(y_test, y_pred)}')

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

Train accuracy: 0.9390329074401216
Test accuracy: 0.8159284843559529
              precision    recall  f1-score   support

          EG       0.83      0.90      0.86     11480
          LB       0.88      0.80      0.84      5578
          LY       0.80      0.80      0.80      7265
          MA       0.79      0.69      0.73      2283
          SD       0.71      0.65      0.68      2926

    accuracy                           0.82     29532
   macro avg       0.80      0.77      0.78     29532
weighted avg       0.82      0.82      0.81     29532

[[10373   171   488   113   335]
 [  479  4464   407    93   135]
 [  827   259  5784   168   227]
 [  281    73   260  1574    95]
 [  576   103   290    56  1901]]


In [73]:
# C=0.1 (model retrained with this value before evaluation)
print(f'Train accuracy: {model.score(X_train, y_train)}')

print(f'Test accuracy: {accuracy_score(y_test, y_pred)}')

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

Train accuracy: 0.7986935623022883
Test accuracy: 0.7671000948124069
              precision    recall  f1-score   support

          EG       0.76      0.91      0.83     11480
          LB       0.90      0.67      0.77      5578
          LY       0.74      0.73      0.74      7265
          MA       0.76      0.63      0.69      2283
          SD       0.68      0.59      0.63      2926

    accuracy                           0.77     29532
   macro avg       0.77      0.71      0.73     29532
weighted avg       0.77      0.77      0.76     29532

[[10432    97   518    88   345]
 [  906  3758   654   100   160]
 [ 1324   198  5311   214   218]
 [  389    58   314  1432    90]
 [  745    67   340    53  1721]]
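
The five runs are easier to compare side by side. The short sketch below copies the reported train and test accuracies (rounded to four decimals) and prints the gap between them for each value of C; the variable names are illustrative.

```python
# (C, train accuracy, test accuracy), copied from the outputs of the five runs.
runs = [
    (0.1, 0.7987, 0.7671),
    (1, 0.9390, 0.8159),
    (4, 0.9799, 0.8208),
    (5, 0.9834, 0.8199),
    (10, 0.9905, 0.8174),
]

for c, train_acc, test_acc in runs:
    gap = train_acc - test_acc  # a larger gap suggests more overfitting
    print(f"C={c:<4} train={train_acc:.4f} test={test_acc:.4f} gap={gap:.4f}")

# Pick the value of C with the highest test accuracy.
best_c = max(runs, key=lambda run: run[2])[0]
print(f"Best C by test accuracy: {best_c}")
```

The gap widens steadily as C grows (weaker regularization), while test accuracy peaks at C=4. Note that choosing C by test accuracy uses the test set for model selection, so a separate validation split would give a cleaner estimate.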


In [14]:
# Save the best model (C=4)
import pickle
with open('Models/logistic_regression_model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)
# Save the fitted vectorizer so new text can be transformed the same way
with open('Models/tfidf_vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)

## Step 6: Sample Predictions
We test the trained model on some sample text data to predict the dialect.

## Benefit
This step allows us to qualitatively assess the model's performance and its ability to generalize to new data.

In [78]:
#writing sample text to find dialect
sample_text = ['الموديل بتاعي فيه زوربيح توكينز']
sample_text_vectorized = vectorizer.transform(sample_text)
model.predict(sample_text_vectorized)

array(['EG'], dtype=object)

In [79]:
#writing sample text to find dialect
sample_text = [" الخميرة قبت والعجينة انكبت "]
sample_text_vectorized = vectorizer.transform(sample_text)
model.predict(sample_text_vectorized)

array(['LY'], dtype=object)

In [80]:
#writing sample text to find dialect
sample_text = ["نوضحها تشطح دارتها بصح"]
sample_text_vectorized = vectorizer.transform(sample_text)
model.predict(sample_text_vectorized)

array(['MA'], dtype=object)

In [81]:
sample_text = ['همي الوحيد صراحه فيق وروح الشغل ويخلص الدوام بسرعه وارجع نام']
sample_text_vectorized = vectorizer.transform(sample_text)
model.predict(sample_text_vectorized)

array(['LB'], dtype=object)

In [15]:
sample_text = ['الافتتاح دا يارب بكون شكل شنو ممكن يكون رساله الي مجهول مثلا']
sample_text_vectorized = vectorizer.transform(sample_text)
model.predict(sample_text_vectorized)

array(['SD'], dtype=object)
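
The sample cells above repeat the same two steps, so they could be wrapped in a small helper. This is a sketch only: `predict_dialect` is a hypothetical name, and it assumes nothing beyond objects that expose `transform` and `predict` the way the fitted vectorizer and model do.

```python
def predict_dialect(texts, model, vectorizer):
    """Vectorize raw strings and return (text, predicted label) pairs."""
    if isinstance(texts, str):
        texts = [texts]  # accept a single sentence as well as a list
    features = vectorizer.transform(texts)
    return list(zip(texts, model.predict(features)))
```

In the notebook this could be called as `predict_dialect(sample_text, model, vectorizer)`, or with the model and vectorizer loaded back from the pickle files.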

## Step 7: Train Decision Tree Model
We train a Decision Tree model and evaluate its performance.

Decision Tree: To compare the performance of a non-linear model with Logistic Regression.
## Benefit
Decision Trees can capture non-linear relationships and interactions between features, potentially improving classification performance.

In [None]:
#trying decision tree model
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=50)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [93]:
# Without max_depth (results from a run with the default, unlimited depth)
print(f'Train accuracy: {model.score(X_train, y_train)}')
print(f'Test accuracy: {accuracy_score(y_test, y_pred)}')

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

Train accuracy: 0.9997466551634416
Test accuracy: 0.6504808343491806
              precision    recall  f1-score   support

          EG       0.72      0.76      0.74     11480
          LB       0.70      0.66      0.68      5578
          LY       0.61      0.59      0.60      7265
          MA       0.61      0.49      0.54      2283
          SD       0.43      0.47      0.45      2926

    accuracy                           0.65     29532
   macro avg       0.61      0.59      0.60     29532
weighted avg       0.65      0.65      0.65     29532

[[8762  529 1218  213  758]
 [ 755 3654  747   95  327]
 [1464  642 4315  332  512]
 [ 382  159  440 1118  184]
 [ 860  227  406   72 1361]]


In [97]:
#With max_depth=50
print(f'Train accuracy: {model.score(X_train, y_train)}')
print(f'Test accuracy: {accuracy_score(y_test, y_pred)}')

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

Train accuracy: 0.5625761746299112
Test accuracy: 0.5564133820939997
              precision    recall  f1-score   support

          EG       0.74      0.60      0.66     11480
          LB       0.83      0.41      0.55      5578
          LY       0.37      0.75      0.50      7265
          MA       0.79      0.32      0.46      2283
          SD       0.60      0.36      0.45      2926

    accuracy                           0.56     29532
   macro avg       0.67      0.49      0.52     29532
weighted avg       0.66      0.56      0.56     29532

[[6876   63 4000   36  505]
 [ 400 2309 2810   24   35]
 [1330  258 5463  118   96]
 [ 225   81 1181  734   62]
 [ 438   79 1345   14 1050]]
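
The confusion matrix for `max_depth=50` hints at why accuracy drops. The sketch below copies that matrix and compares how often each label is predicted with how often it actually occurs, again assuming rows are true labels and columns are predictions.

```python
labels = ["EG", "LB", "LY", "MA", "SD"]
# Confusion matrix for the depth-limited tree (rows: true, columns: predicted).
cm = [
    [6876, 63, 4000, 36, 505],
    [400, 2309, 2810, 24, 35],
    [1330, 258, 5463, 118, 96],
    [225, 81, 1181, 734, 62],
    [438, 79, 1345, 14, 1050],
]

total = sum(map(sum, cm))
# Share of test samples that truly belong to each class (row sums).
true_share = {label: sum(cm[k]) / total for k, label in enumerate(labels)}
# Share of test samples assigned to each class (column sums).
predicted_share = {
    label: sum(row[k] for row in cm) / total for k, label in enumerate(labels)
}

for label in labels:
    print(f"{label}: true {true_share[label]:.1%}, predicted {predicted_share[label]:.1%}")
```

The depth-limited tree assigns LY to about half of the test samples, roughly twice its true share, which matches the low LY precision (0.37) and the reduced recall of the other classes.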


## Step 8: Train Naive Bayes Model
We train a Multinomial Naive Bayes model and evaluate its performance.

Naive Bayes: To evaluate the performance of a probabilistic model for text classification.
## Benefit
Naive Bayes is often effective for text classification, even though it assumes that the features (words) are conditionally independent given the class. Comparing its performance with other models helps in selecting the best model for the task.
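
To illustrate the independence assumption, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier. It works on raw word counts with add-one smoothing (`alpha=1.0`, the same default as `MultinomialNB`), whereas the notebook feeds TF-IDF weights to the library model. The function names and toy sentences are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log likelihoods."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        term_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n in class_counts.items():
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        log_likelihood = {
            t: math.log((term_counts[label][t] + alpha) / total) for t in vocab
        }
        model[label] = (math.log(n / len(docs)), log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior for one document."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Independence assumption: per-word log likelihoods simply add up.
        # Words outside the training vocabulary are skipped.
        scores[label] = log_prior + sum(
            log_likelihood[t] for t in doc.split() if t in log_likelihood
        )
    return max(scores, key=scores.get)

# Hypothetical toy data, not taken from the dialect dataset.
docs = ["red red blue", "red green", "sun moon", "moon moon star"]
labels = ["A", "A", "B", "B"]
nb = train_nb(docs, labels)
print(predict_nb(nb, "red moon red"))
print(predict_nb(nb, "moon star"))
```

Each word contributes to the score independently of its neighbors, which is what makes the model fast to train but also blind to word order and word combinations.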

In [83]:
#trying naive bayes
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [91]:
print(f'Train accuracy: {model.score(X_train, y_train)}')
print(f'Test accuracy: {accuracy_score(y_test, y_pred)}')

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

Train accuracy: 0.8771893786889062
Test accuracy: 0.7791548151158065
              precision    recall  f1-score   support

          EG       0.68      0.99      0.80     11480
          LB       0.96      0.68      0.79      5578
          LY       0.91      0.67      0.77      7265
          MA       0.91      0.64      0.76      2283
          SD       0.83      0.53      0.65      2926

    accuracy                           0.78     29532
   macro avg       0.86      0.70      0.75     29532
weighted avg       0.82      0.78      0.77     29532

[[11335    17    53    22    53]
 [ 1473  3780   202    29    94]
 [ 2096    94  4874    73   128]
 [  641    29   104  1471    38]
 [ 1241    20   100    15  1550]]


With these steps, we have trained and evaluated multiple models for dialect classification. We compared their performance using various metrics and tested their predictions on sample text data. This comprehensive evaluation helps in selecting the most suitable model for the dialect classification task.

The best-performing model is **Logistic Regression**, with a test accuracy of **0.82**.