# Arabic dataset classification

### importing

In [1]:
import pandas as pd
import numpy as np
from nltk import word_tokenize
import pyarabic.araby as ar           
import nltk                         
import string        
import os
import re
from nltk.corpus import stopwords            
from nltk.stem.porter import PorterStemmer 
from nltk.stem import SnowballStemmer

### data_read_and_remove(Null.values)

In [3]:

nltk.download('stopwords')

nltk.download('punkt')

stop_words = list(set(stopwords.words('arabic')))  

df = pd.read_csv('arabic_dataset_classifiction.csv',chunksize=10000)

data = pd.concat(df)

clean=[]

data.dropna(axis=0,inplace=True)       # Remove 'Null' values.

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\9961013738.UPS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\9961013738.UPS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


This cell uses the Natural Language Toolkit (NLTK) to download two resources:
1. 'stopwords' - lists of common function words that are usually removed before text analysis
2. 'punkt' - the models used by NLTK's sentence and word tokenizers

It then builds a de-duplicated list of the Arabic stop words that ship with NLTK. The later cells split text with str.split(), so the 'punkt' models are not actually used.

Next it reads 'arabic_dataset_classifiction.csv' (the file name is spelled this way in the code) in chunks of 10,000 rows and concatenates the chunks into a single DataFrame.

It also creates an empty list called 'clean', which later holds the preprocessed texts, and drops every row that contains a missing value.

### stopword.list_read

In [4]:
t1 = pd.read_csv('list.txt')

t2 = pd.read_csv('arabicST.txt')

t3 = pd.read_csv('list.tsv')


This cell loads three additional stop word files, list.txt, arabicST.txt and list.tsv, with pd.read_csv(), producing the DataFrames t1, t2 and t3. With the default arguments, read_csv() splits fields on commas and treats the first line of each file as the column header, so for a plain one-word-per-line list the first word becomes the column name rather than a data row. The tab-separated list.tsv is also parsed with the comma separator, because no sep argument is passed.
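Membership tests on tokens need a collection of words rather than a table, so one option is to load each stop word file into a Python set. The sketch below assumes a UTF-8 file with one stop word per line, which may not match the actual layout of list.txt, arabicST.txt or list.tsv. The function load_stopwords and the sample file are hypothetical names used only for illustration.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Return the non-empty, stripped lines of a UTF-8 text file as a set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


# Hypothetical sample file: one stop word per line (the format is an assumption).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_stopwords.txt"
    sample.write_text("في\nمن\n\nعلى\n", encoding="utf-8")
    extra_stopwords = load_stopwords(sample)

# Set membership compares each token against every word in the file.
tokens = ["ذهب", "في", "الصباح"]
kept = [t for t in tokens if t not in extra_stopwords]
print(kept)
```

A set also makes each lookup a constant-time operation, which matters when the filter runs over millions of tokens.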

### filtering

In [5]:
arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''

english_punctuations = string.punctuation     # Get all the special characters.   

punctuations_list = arabic_punctuations + english_punctuations  

stemmer = nltk.ISRIStemmer()

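This cell defines two strings of punctuation marks, one for Arabic and one for English (string.punctuation), and concatenates them into a single string called punctuations_list. Despite its name, punctuations_list is a string, not a list, so a test such as "word not in punctuations_list" is a substring check: it drops a token only when the whole token appears as a substring of that string (for example a lone punctuation mark) and leaves punctuation that is attached to a word in place. Finally, the cell creates an instance of ISRIStemmer, the Arabic stemmer that ships with NLTK.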
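To remove punctuation that is attached to words, one possible approach is to delete the punctuation characters from the text itself before splitting it. The sketch below reuses the two punctuation strings from the cell above. The helper strip_punctuation and the sample sentence are assumptions for illustration and are not part of the original notebook.

```python
import string

arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
punctuations_list = arabic_punctuations + string.punctuation

# Translation table that maps every punctuation character to "delete".
strip_table = str.maketrans("", "", punctuations_list)


def strip_punctuation(text):
    """Remove every character of punctuations_list from text."""
    return text.translate(strip_table)


cleaned = strip_punctuation("مرحبا، بالعالم!")
print(cleaned)

# The substring test does not catch a word with a comma attached to it.
attached = "مرحبا،"
attached_is_filtered = attached in punctuations_list
print(attached_is_filtered)
```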

### count the number of words we have before cleaning

In [6]:
unique = [j.split() for j in data['text']]

unique = pd.DataFrame(unique)

print(unique.nunique().sum()) 

9240130


This cell splits each text on whitespace and loads the token lists into a DataFrame called unique, so column k holds the k-th token of every text. nunique() then counts the distinct tokens in each column, and sum() adds those per-column counts. The result, 9240130, is therefore not the vocabulary size: a word that occurs at several different positions is counted once per position.
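The difference between the per-position count and the actual number of distinct words can be seen on a toy example. The sketch below uses three made-up texts (an assumption, not rows from the dataset) and reproduces both quantities in plain Python.

```python
from itertools import zip_longest

# Three made-up texts standing in for data['text'].
docs = ["كتاب قلم ورقة", "قلم كتاب", "ورقة"]
token_lists = [doc.split() for doc in docs]

# Per-position count: distinct tokens in each column, summed over columns.
# This mirrors pd.DataFrame(token_lists).nunique().sum(), which ignores
# the missing values used to pad shorter rows.
per_position = sum(
    len({tok for tok in column if tok is not None})
    for column in zip_longest(*token_lists)
)

# Vocabulary size: distinct tokens over the whole corpus.
vocabulary = set().union(*token_lists)

print(per_position, len(vocabulary))
```

Here the three distinct words yield a per-position total of six, because each word appears in more than one position.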

### cleaning

In [7]:
for index,i in enumerate(data['text']):
    
    text = re.sub('[a-zA-Z0-9]',' ',i)
    
    text = text.split()
    
    text = [word for word in text if word not in list(t3)]
    
    text = [word for word in text if word not in t1]
    
    text = [word for word in text if word not in t2]
    
    text = [word for word in text if word not in punctuations_list]
    
    text = [stemmer.stem(word) for word in text if word not in stop_words]
    
    text = ' '.join(text)
    
    text = text.replace("آ", "ا")
    text = text.replace("إ", "ا")
    text = text.replace("أ", "ا")
    text = text.replace("ؤ", "و")
    text = text.replace("ئ", "ي")
    clean.append(text)

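This cell preprocesses every entry of the "text" column. For each text it replaces Latin letters and digits with spaces, splits the result on whitespace, filters the tokens against the three custom stop word tables, the punctuation string and the NLTK Arabic stop words, and stems the remaining tokens with the ISRI stemmer. It then joins the tokens back into one string, normalizes alef and hamza variants (آ, إ and أ become ا, ؤ becomes و, ئ becomes ي) and appends the result to the list "clean". Note that "word not in t1" and "word not in t2" test membership in the column labels of those DataFrames, and list(t3) is likewise the list of column labels, so these three filters compare each token only against the header of each file, not against its rows.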
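The core of this cleaning loop can be written as a small function. The sketch below keeps the same order of operations (strip Latin letters and digits, split, filter stop words, join, normalize letters) but leaves out stemming so that it runs without NLTK. The function clean_text, the one-word stop word set and the sample sentence are assumptions for illustration.

```python
import re

# The letter normalization from the cleaning cell, as one translation table.
NORMALIZE = str.maketrans({"آ": "ا", "إ": "ا", "أ": "ا", "ؤ": "و", "ئ": "ي"})
LATIN_OR_DIGIT = re.compile(r"[a-zA-Z0-9]")


def clean_text(text, stopword_set):
    """Drop Latin letters, digits and stop words, then normalize letters."""
    text = LATIN_OR_DIGIT.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in stopword_set]
    return " ".join(tokens).translate(NORMALIZE)


# Made-up sentence and stop word set.
result = clean_text("أخبار 2023 في إيران abc", {"في"})
print(result)
```

As in the original loop, stop words are filtered before the letters are normalized, so the stop word set has to contain the unnormalized spellings.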

### count the number of words we have after cleaning

In [8]:
unique = [j.split() for j in clean]

unique = pd.DataFrame(unique) 

print(unique.nunique().sum()) 

2761834


This cell repeats the count on the cleaned texts in the list clean: it splits each cleaned text on whitespace, builds a DataFrame of the tokens, and sums the per-column nunique() counts. The total drops from 9240130 before cleaning to 2761834 after cleaning. As before, the figure is a sum of per-position counts, not the number of distinct words.
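The size of the reduction follows directly from the two printed totals. The short calculation below uses only those two numbers from the notebook output; the variable names are illustrative.

```python
before_cleaning = 9240130  # total printed before cleaning
after_cleaning = 2761834   # total printed after cleaning

remaining = after_cleaning / before_cleaning
reduction = 1 - remaining
print(f"remaining: {remaining:.1%}, reduction: {reduction:.1%}")
```

Cleaning therefore removes roughly 70% of the per-position distinct tokens.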

### CountVectorizer and divide into X and y

In [9]:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=500000)

This cell imports the CountVectorizer class from scikit-learn and creates an instance called cv. CountVectorizer turns a collection of text documents into a matrix of token counts, with one row per document and one column per vocabulary term. With the default settings each feature is a single word. The max_features parameter caps the vocabulary: only the terms with the highest frequency across the whole corpus are kept, here at most 500000.
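The idea behind max_features can be illustrated in plain Python: count every term over the corpus, keep the most frequent ones, and build one count vector per document. The sketch below is not scikit-learn's implementation (for instance, CountVectorizer's default tokenizer also drops single-character tokens). The three documents and the tiny max_features value are assumptions chosen so the result is easy to check by hand.

```python
from collections import Counter

# Made-up, already cleaned documents.
docs = ["قلم كتاب قلم", "كتاب قلم", "ورقة قلم كتاب"]
max_features = 2  # deliberately tiny for the example

# Term frequency over the whole corpus.
term_frequency = Counter(tok for doc in docs for tok in doc.split())

# Keep the max_features most frequent terms; sort them for a stable column order.
vocabulary = sorted(term for term, _ in term_frequency.most_common(max_features))

# One row of counts per document, one column per kept term.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[term] for term in vocabulary])

print(vocabulary)
print(rows)
```

The least frequent word, ورقة, is dropped from the vocabulary and contributes no column.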

In [10]:
X = cv.fit_transform(clean)

y = data.iloc[:,1].values

This cell builds the model inputs. X is the result of cv.fit_transform(clean): the CountVectorizer instance cv (the name abbreviates CountVectorizer, not cross-validation) learns the vocabulary from the cleaned texts and returns a sparse document-term matrix of token counts. y is the second column of the DataFrame data, selected by position with iloc, and serves as the class label for each text.

### classification

In [12]:
# =============================================================================
# DECISION TREE
# =============================================================================


from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

classifier.fit(X_train,y_train)

y_pred = classifier.predict(X_test)

# =============================================================================
# DECISION TREE - TEST
# =============================================================================
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test,y_pred)

print(cm)

from sklearn.metrics import accuracy_score

print(f"\nAccuracy score for DECISION TREE: {accuracy_score(y_test,y_pred)}")

[[ 2866    79   160   232   128]
 [   83  3578   114   233    93]
 [  126   115  2665   511    84]
 [  216   245   498  4183   119]
 [   95    98    75   121 10481]]

Accuracy score for DECISION TREE: 0.8740716229134495


This cell trains a scikit-learn DecisionTreeClassifier with default settings. It first holds out 25% of the samples as a test set (random_state=0 makes the split reproducible), fits the classifier on the remaining 75%, and predicts labels for the test set. It then prints the 5x5 confusion matrix, whose rows are the true classes and whose columns are the predicted classes, and an accuracy of about 0.874.
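The confusion matrix carries more information than the single accuracy figure. As a sketch, the plain-Python calculation below recomputes the accuracy from the printed matrix and derives per-class precision and recall. The class indices 0 to 4 are an assumption, since the notebook does not show which label each row stands for.

```python
# Confusion matrix printed by the decision tree cell
# (rows: true class, columns: predicted class).
cm = [
    [2866, 79, 160, 232, 128],
    [83, 3578, 114, 233, 93],
    [126, 115, 2665, 511, 84],
    [216, 245, 498, 4183, 119],
    [95, 98, 75, 121, 10481],
]
n = len(cm)

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))
accuracy = correct / total

# Recall: share of each true class (row) that was predicted correctly.
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
# Precision: share of each predicted class (column) that is correct.
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]

for i in range(n):
    print(f"class {i}: precision={precision[i]:.3f} recall={recall[i]:.3f}")
print(f"accuracy={accuracy:.4f}")
```

The diagonal sums to 23773 out of 27198 test samples, which reproduces the reported accuracy.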

In [14]:
# =============================================================================
# SVC RBF
# =============================================================================


from sklearn.svm import SVC

classifier = SVC()

classifier.fit(X_train,y_train)

y_pred = classifier.predict(X_test)

# =============================================================================
# SVC RBF - TEST
# =============================================================================


print(f"\nAccuracy score: {accuracy_score(y_test,y_pred)}")



Accuracy score: 0.9474961394220163


This cell trains a support vector classifier with scikit-learn's SVC class. SVC() with default settings uses the RBF kernel. The cell fits the classifier on X_train and y_train, predicts the labels of X_test, and prints the accuracy on the test set, about 0.947. Although the cell comments say "SVR", the model is SVC, the classifier; SVR is scikit-learn's support vector regressor.

In [15]:

# =============================================================================
# SVC LINEAR
# =============================================================================

classifier = SVC(kernel='linear')

classifier.fit(X_train,y_train)

y_pred = classifier.predict(X_test)

# =============================================================================
# SVC LINEAR - TEST
# =============================================================================


print(f"\nAccuracy score for SVC LINEAR: {accuracy_score(y_test,y_pred)}")


Accuracy score for SVC LINEAR: 0.9301419222001618


This cell creates a support vector classifier (SVC) with a linear kernel. It fits the classifier on the training data (X_train) and its labels (y_train), predicts the labels of the test data (X_test), and prints the accuracy of those predictions, about 0.930.

In [17]:
# =============================================================================
# RANDOM FOREST 
# =============================================================================

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=100)

classifier.fit(X_train,y_train)

y_pred = classifier.predict(X_test)

# =============================================================================
# RANDOM FOREST - TEST 
# =============================================================================

print(f"\nAccuracy score: {accuracy_score(y_test,y_pred)}")


Accuracy score: 0.9367232884770939


This cell uses the RandomForestClassifier class from the sklearn.ensemble module to build an ensemble of decision trees. The n_estimators parameter sets the number of trees in the forest, here 100. The .fit() method trains the model on the training data, and .predict() produces predictions for the test data. Finally, accuracy_score compares the predicted labels with the true labels and reports the fraction that match, about 0.937.
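The four accuracy scores printed above can be placed side by side. The sketch below simply ranks the values copied from the notebook output; the model names in the dictionary are descriptive labels chosen for this example.

```python
# Accuracy scores copied from the notebook output above.
reported_accuracy = {
    "decision tree": 0.8740716229134495,
    "SVC (RBF kernel)": 0.9474961394220163,
    "SVC (linear kernel)": 0.9301419222001618,
    "random forest": 0.9367232884770939,
}

# Sort from highest to lowest accuracy.
ranking = sorted(reported_accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<20} accuracy={acc:.4f} error={1 - acc:.4f}")

best_model = ranking[0][0]
```

All four scores come from the same single 75/25 split, so small differences between them may not hold on a different split.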