# Arabic Dialect Identification system

**CCAI-413: Natural Language Processing project.**


------------------------------------------------------------------

# About 
This project is an Arabic dialect identification system: given a piece of text, it identifies which Arabic dialect the text is written in. The task is challenging because Arabic dialects vary widely and large annotated datasets are scarce.




# Important Libraries

In [1]:
import pandas as pd
import numpy as np
import re
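import nltk

# wordnet is downloaded here; the preprocessing below also uses the 'stopwords'
# corpus and the 'punkt' tokenizer data ('punkt_tab' in newer NLTK releases),
# which may need their own nltk.download(...) call outside Kaggle
nltk.download('wordnet')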
from nltk.stem.isri import ISRIStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV



[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Load Dataset

**Messages Dataset:**

This file contains tweets written in different Arabic dialects.

In [2]:
# Load messages dataset
tweets = pd.read_csv('/kaggle/input/aim-technologies-predict-the-dialectal-arabic/messages.csv',lineterminator='\n')
column_names = ['id', 'tweets'] # list of column names
tweets.columns = column_names # Label the columns
tweets

|   | id | tweets |
|---|---|---|
| 0 | 1.175358e+18 | @Nw8ieJUwaCAAreT لكن بالنهاية .. ينتفض .. يغير . |
| 1 | 1.175416e+18 | @7zNqXP0yrODdRjK يعني هذا محسوب على البشر .. ح... |
| 2 | 1.175450e+18 | @KanaanRema مبين من كلامه خليجي |
| 3 | 1.175471e+18 | @HAIDER76128900 يسلملي مرورك وروحك الحلوه💐 |
| 4 | 1.175497e+18 | @hmo2406 وين هل الغيبه اخ محمد 🌸🌺 |
| ... | ... | ... |
| 458656 | 1.057419e+18 | @mycousinvinnyys @hanyamikhail1 متهيالي دي شكو... |
| 458657 | 1.055620e+18 | @MahmoudWaked7 @maganenoo في طريق مطروح مركز ب... |
| 458658 |  | 0 |
| 458659 | 1.057419e+18 | @mycousinvinnyys @hanyamikhail1 متهيالي دي شكو... |


**Dialect Dataset:**

This file contains the dialect label of each tweet.

In [3]:
# Load dialect dataset
dialects = pd.read_csv("/kaggle/input/aim-technologies-predict-the-dialectal-arabic/dialect_dataset.csv")
dialects

|   | id | dialect |
|---|---|---|
| 0 | 1175358310087892992 | IQ |
| 1 | 1175416117793349632 | IQ |
| 2 | 1175450108898565888 | IQ |
| 3 | 1175471073770573824 | IQ |
| 4 | 1175496913145217024 | IQ |
| ... | ... | ... |
| 458192 | 1019484980282580992 | BH |
| 458193 | 1021083283709407232 | BH |
| 458194 | 1017477537889431552 | BH |
| 458195 | 1022430374696239232 | BH |


# Data processing

In [4]:
# Merge the tweets and dialects datasets into one dataframe
data = pd.merge(tweets, dialects, on='id')

# Drop the id column
data = data.drop(columns=['id'])

data

|   | tweets | dialect |
|---|---|---|
| 0 | @Nw8ieJUwaCAAreT لكن بالنهاية .. ينتفض .. يغير . | IQ |
| 1 | @7zNqXP0yrODdRjK يعني هذا محسوب على البشر .. ح... | IQ |
| 2 | @KanaanRema مبين من كلامه خليجي | IQ |
| 3 | @HAIDER76128900 يسلملي مرورك وروحك الحلوه💐 | IQ |
| 4 | @hmo2406 وين هل الغيبه اخ محمد 🌸🌺 | IQ |
| ... | ... | ... |
| 458196 | @Al_mhbaa_7 مبسوطين منك اللي باسطانا😅 | BH |
| 458197 | @Zzainabali @P_ameerah والله ماينده ابش يختي | BH |
| 458198 | @Al_mhbaa_7 شو عملنا لك حنا تهربي مننا احنا مس... | BH |
| 458199 | @haneenalmwla الله يبارك فيها وبالعافيه 😋😋😋 | BH |
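
The merged output above ends at index 458199 while the label file ends at index 458195, which may indicate that some ids appear more than once. The following is a minimal sketch, using invented toy frames rather than the real files, of how rows without an id and repeated ids could be removed before merging, and how `indicator=True` can show which rows found no partner:

```python
import pandas as pd

# Toy stand-ins for the two files; ids and values are invented for illustration.
tweets = pd.DataFrame({
    "id": [1.0, 2.0, 2.0, None],   # one repeated id and one row without an id
    "tweets": ["a", "b", "b", "0"],
})
dialects = pd.DataFrame({"id": [1, 2, 3], "dialect": ["IQ", "EG", "BH"]})

# Drop rows without an id, then drop repeated ids, before merging.
clean = tweets.dropna(subset=["id"]).drop_duplicates(subset="id")
clean = clean.astype({"id": "int64"})

# indicator=True adds a _merge column that records where each row came from.
merged = pd.merge(clean, dialects, on="id", how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]

print(len(clean), "unique tweets;", len(unmatched), "row(s) without a partner")
```

Whether duplicates should be dropped or kept depends on the dataset; the sketch only shows how to make them visible.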


**Dialect names**

In [5]:
# Display the dialect names
dialect_names = data['dialect'].unique()
print(dialect_names)

['IQ' 'LY' 'QA' 'PL' 'SY' 'TN' 'JO' 'MA' 'SA' 'YE' 'DZ' 'EG' 'LB' 'KW'
 'OM' 'SD' 'AE' 'BH']
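
Besides the list of labels, the number of tweets per label is worth checking, because an uneven class distribution affects how an accuracy figure should be read. A small sketch with invented labels (not the real counts) using `value_counts`:

```python
import pandas as pd

# Invented labels for illustration; the real column is data['dialect'].
labels = pd.Series(["IQ", "IQ", "EG", "BH", "EG", "IQ"])

counts = labels.value_counts()                # absolute number of rows per label
share = labels.value_counts(normalize=True)   # fraction of rows per label

print(counts.to_dict())
print(share.round(2).to_dict())
```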


**Convert country codes to Arabic dialect names**

In [6]:
# Define a dictionary that maps two-letter country codes to Arabic dialect names
short_to_full = {
    'EG': 'مصري',
    'DZ': 'جزائري',
    'TN': 'تونسي',
    'LY': 'ليبي',
    'MA': 'مغربي',
    'JO': 'اردني',
    'LB': 'لبناني',
    'PL': 'فلسطيني',
    'SY': 'سوري',
    'IQ': 'عراقي',
    'KW': 'كويتي',
    'SA': 'سعودي',
    'AE': 'اماراتي',
    'OM': 'عماني',
    'QA': 'قطري',
    'YE': 'يمني',
    'SD': 'سوداني',
    'BH': 'بحريني'
}
# Define a function that converts a country code to its dialect name
def convert_name(name):
    return short_to_full[name]

# Convert the country codes in the dialect column to dialect names
data['dialect'] = data['dialect'].apply(convert_name)

data

|   | tweets | dialect |
|---|---|---|
| 0 | @Nw8ieJUwaCAAreT لكن بالنهاية .. ينتفض .. يغير . | عراقي |
| 1 | @7zNqXP0yrODdRjK يعني هذا محسوب على البشر .. ح... | عراقي |
| 2 | @KanaanRema مبين من كلامه خليجي | عراقي |
| 3 | @HAIDER76128900 يسلملي مرورك وروحك الحلوه💐 | عراقي |
| 4 | @hmo2406 وين هل الغيبه اخ محمد 🌸🌺 | عراقي |
| ... | ... | ... |
| 458196 | @Al_mhbaa_7 مبسوطين منك اللي باسطانا😅 | بحريني |
| 458197 | @Zzainabali @P_ameerah والله ماينده ابش يختي | بحريني |
| 458198 | @Al_mhbaa_7 شو عملنا لك حنا تهربي مننا احنا مس... | بحريني |
| 458199 | @haneenalmwla الله يبارك فيها وبالعافيه 😋😋😋 | بحريني |
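
As an alternative to a helper function with `apply`, pandas can look values up in a dictionary directly with `Series.map`. The sketch below uses a two-entry subset of the mapping and a made-up code `XX` to show one difference in behavior: an unmapped code becomes `NaN` rather than raising a `KeyError`, so missing entries can be listed afterwards.

```python
import pandas as pd

# A subset of the mapping above; "XX" is a made-up code with no entry.
short_to_full = {"EG": "مصري", "IQ": "عراقي"}
codes = pd.Series(["EG", "IQ", "XX"])

# Series.map performs the dictionary lookup; unmapped codes become NaN.
names = codes.map(short_to_full)

# Collect the codes that had no entry in the dictionary.
missing = codes[names.isna()].unique()
print(names.tolist(), list(missing))
```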


# Preprocess Text

**Clean the tweets by removing unwanted content and normalizing the words:**
> Remove URLs, mentions, and hashtags (tokens that start with @ or #)

> Remove stop words (such as في and the definite article ال)

> Reduce each word to its stem using stemming

In [7]:
# Build the stop-word set and the stemmer once instead of once per tweet
STOP_WORDS = set(stopwords.words('arabic'))
STEMMER = ISRIStemmer()

def preprocessText(text):
    # Remove URLs, mentions, and hashtags
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'@\w+|\#\w+', '', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Perform stemming
    tokens = [STEMMER.stem(word) for word in tokens]

    # Join the tokens back into a string
    preprocessed_text = ' '.join(tokens)

    return preprocessed_text
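
To make the first cleaning step concrete, here is a standard-library sketch of the URL, mention, and hashtag removal on its own. The helper name `strip_twitter_noise`, the hashtag, and the URL in the sample are invented for illustration; the tweet text is taken from the data preview above. Stop-word removal and stemming are left out so the example runs without NLTK data.

```python
import re

def strip_twitter_noise(text):
    """Remove URLs, @mentions and #hashtags, then collapse extra whitespace."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'[@#]\w+', '', text)  # \w also matches Arabic letters in Python 3
    return ' '.join(text.split())

# The hashtag and URL are made up; the rest comes from the preview above.
sample = "@hmo2406 وين هل الغيبه اخ محمد #test https://t.co/abc"
cleaned = strip_twitter_noise(sample)
print(cleaned)
```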

# Split the data

In [8]:
# list of column names
column_names = ['tweets','dialect'] 

# Train data
train_data = data.sample(frac = 0.75) # Take 75% of the data randomly
train_data.columns = column_names # Label the columns
x_train = train_data.tweets # x = the tweets in train data
y_train = train_data.dialect # y = the labels of the train data

# Test data
test_data = data.drop(train_data.index) # Take the remaining 25% of the data
test_data.columns = column_names # Label the columns
x_test = test_data.tweets # x = the tweets in test data
y_test = test_data.dialect # y = the labels of the test data
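
The split above draws a different random sample on every run and does not control how many tweets of each dialect end up in each part. One possible alternative, shown here as a sketch on an invented toy frame, is scikit-learn's `train_test_split` with `stratify` and a fixed `random_state`; the seed value 42 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `data`; the rows are invented for illustration.
toy = pd.DataFrame({
    "tweets": [f"tweet {i}" for i in range(12)],
    "dialect": ["IQ"] * 8 + ["BH"] * 4,
})

x_train, x_test, y_train, y_test = train_test_split(
    toy["tweets"], toy["dialect"],
    test_size=0.25,            # same 75/25 ratio as above
    stratify=toy["dialect"],   # keep each dialect's share equal in both parts
    random_state=42,           # fixed seed so the split is repeatable
)

print(y_train.value_counts().to_dict(), y_test.value_counts().to_dict())
```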

# Training

In [9]:
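# Filter the train data
x_train = x_train.apply(preprocessText)

# Feature extraction
# Define a custom analyzer that applies stemming to the tokens
# (the tweets were already stemmed in preprocessText, so this stems them a second time)
stemmer = ISRIStemmer()
analyzer = TfidfVectorizer().build_analyzer()
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

# Create a Vectorizer Object
# Note: scikit-learn ignores ngram_range when analyzer is a callable, so only
# the unigrams produced by stemmed_words are used as features here
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=5000, analyzer=stemmed_words)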
 
# Fit and transform the Vectorizer Object on the train data
x_train = vectorizer.fit_transform(x_train) 
y_train = y_train.values # The true labels of the train data

# Define the SVM model
model = LinearSVC()
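# Define the hyperparameters to tune
# Note: LinearSVC does not support penalty='l1' together with
# loss='squared_hinge' and dual=True, so the 'l1' candidates are the
# likely cause of the failed fits reported below
hyperparameters = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2']
}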

# Use GridSearchCV to find the best hyperparameters
grid = GridSearchCV(model, hyperparameters)
grid.fit(x_train, y_train)

# Print the best hyperparameters
print("Best hyperparameters: ", grid.best_params_)

# Get the best model
best_model = grid.best_estimator_

15 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/svm/_classes.py", line 274, in fit
    self.coef_, self.intercept_, n_iter_ = _fit_liblinear(
  File "/opt/conda/lib/python3.10/site-packages/sklearn/svm/_base.py", line 1223, in _fit_liblinear
    solver_type = _get_liblinear_solver_type(multi_class, penalty, loss, dual)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/svm/_base.py", line 1062, in _get_

Best hyperparameters:  {'C': 1, 'penalty': 'l2'}
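
The 15 failed fits match the number of `penalty='l1'` candidates (3 values of `C` times 5 cross-validation folds), although the traceback above is cut off before the error message. Assuming that cause, one way to let both penalties be evaluated is to pass `dual=False`, which LinearSVC supports for both `l1` and `l2` with the squared hinge loss. The sketch below uses synthetic data from `make_classification` in place of the TF-IDF matrix, so its scores say nothing about the dialect task.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features; sizes are arbitrary.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# dual=False solves the primal problem, which accepts both penalties
# with the default loss='squared_hinge'.
model = LinearSVC(dual=False, max_iter=5000)
param_grid = {"C": [0.1, 1, 10], "penalty": ["l1", "l2"]}

grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X, y)

scores = grid.cv_results_["mean_test_score"]
print("Failed candidates:", int(np.isnan(scores).sum()))
print("Best hyperparameters:", grid.best_params_)
```

With every candidate fitted, the comparison between `l1` and `l2` is based on real scores instead of `nan` placeholders.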


# Testing

In [10]:
# Evaluate the model

# Preprocess the test data
x_test = x_test.apply(preprocessText)

# Feature extraction
x_test = vectorizer.transform(x_test) # Encode the data

# Predict the labels of the test data using the best model
y_test = y_test.values # The true labels of the test data
y_predict = grid.predict(x_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_predict)
print("Accuracy: ", accuracy)

Accuracy:  0.4327106067219555
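
A single accuracy number hides which dialects are confused with each other; for reference, guessing uniformly among 18 labels would be right about 1 time in 18. The following sketch shows how per-class precision, recall, and a macro-averaged F1 score could be reported with scikit-learn. The labels are invented for illustration and are not predictions from the model above.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Invented labels for illustration only; not results from the model above.
y_true = ["EG", "EG", "IQ", "IQ", "BH", "BH"]
y_pred = ["EG", "EG", "IQ", "EG", "BH", "IQ"]

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean over classes

print("Accuracy:", accuracy)
print("Macro F1:", macro_f1)
print(classification_report(y_true, y_pred, zero_division=0))
```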


# Arabic Dialect Identification system

In [12]:
# Take text from the user (the prompt reads "Please enter the text:")
userText = input("فضلاً ادخل النص:")

# Convert the string into a DataFrame
text = [userText]
text = pd.DataFrame(text)

# Filter the text
text[0] = text[0].apply(preprocessText)

# Encode the text
text_user = vectorizer.transform(text[0]) 

# The prediction made by the model
predict = best_model.predict(text_user)

# Display the result (the message reads "The dialect of the text is:")
print("لهجه النص هي:",predict[0])

فضلاً ادخل النص: ازيك يا باشا


لهجه النص هي: مصري
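
The interactive steps above could also be wrapped in a reusable function. The sketch below is one possible shape, not the project's own code: `predict_dialect` is a hypothetical helper, the four training sentences and their labels are invented, and a plain `TfidfVectorizer` stands in for the fitted vectorizer so the example runs without NLTK. In the notebook, `vectorizer`, `best_model`, and `preprocessText` would be passed in instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def predict_dialect(text, vectorizer, model, preprocess=lambda t: t):
    """Return the predicted label for one raw string."""
    # transform expects an iterable of documents, so wrap the string in a list
    features = vectorizer.transform([preprocess(text)])
    return model.predict(features)[0]

# Tiny invented training set, only to make the sketch runnable.
texts = ["ازيك يا باشا", "عامل ايه النهارده", "شلونك عيني", "شكو ماكو اليوم"]
labels = ["مصري", "مصري", "عراقي", "عراقي"]

toy_vectorizer = TfidfVectorizer()
toy_model = LinearSVC().fit(toy_vectorizer.fit_transform(texts), labels)

result = predict_dialect("ازيك يا باشا", toy_vectorizer, toy_model)
print(result)
```

Passing the string in a one-element list avoids the intermediate DataFrame used above.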




## Contributors
> Section: AIL

> Raneem Saad Alomari, ID: 2006352

> Bedoor Ayad Alsulami, ID: 2005961

> Layal Soud Halwani, ID: 2007896

> Afnan Tariq Algogandi, ID: 2007926 
