# The Project Structure

The project is divided into six main sections as follows: <br/>
1. Text Preprocessing.
2. Embedding.
3. Dimensionality Reduction.
4. Clustering.
5. Topic Representation (Topic Modeling).
6. Evaluation of Topic Modeling.

## Text Preprocessing

During this step, we clean the Arabic text to create two datasets:
1. Raw dataset without any modification.
2. Normalized dataset. <br/>

To build the normalized dataset, we apply the following steps: <br/>
    <ol>
        <li>Tokenization using the <code>pyarabic.araby</code> tokenizer, keeping Arabic tokens only and stripping Tashkeel (diacritics).</li>
        <li>Removal of stop words (NLTK Arabic stop-word list).</li>
        <li>Stemming (reducing the different forms of a word to one base form).</li>
    </ol>

In [2]:
# Obtain needed libraries for text preprocessing.
!pip install pyArabic
!pip install tashaphyne
!pip install nltk




For normalization, we use the <a href="https://pyarabic.readthedocs.io/ar/latest/index.html" >PyArabic</a> Python package, together with Tashaphyne for stemming.

In [54]:
# Hide warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Import needed libraries for this task.
import numpy as np
import pandas as pd
import nltk
from pyarabic.araby import tokenize, is_arabicrange, strip_tashkeel  # to keep Arabic words only and strip tashkeel
from tashaphyne.stemming import ArabicLightStemmer  # to stem the Arabic words
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # fetch the NLTK stop-word lists if they are missing

In [3]:
# Read the raw dataset
df = pd.read_csv("arabic_dataset_classifiction.csv", encoding="utf-8")
df.head(4)

| | text | targe |
|---|---|---|
| 0 | بين أستوديوهات ورزازات وصحراء مرزوكة وآثار ولي... | 0 |
| 1 | قررت النجمة الأمريكية أوبرا وينفري ألا يقتصر ع... | 0 |
| 2 | أخبارنا المغربية الوزاني تصوير الشملالي ألهب ا... | 0 |
| 3 | اخبارنا المغربية قال ابراهيم الراشدي محامي سعد... | 0 |


In [4]:
# Viewing the shape of the raw data
df.shape

(111728, 2)

In [5]:
# Null values handling
df.isnull().sum()

text     2939
targe       0
dtype: int64

In [6]:
# Dropping null values
df.dropna(axis=0, inplace=True)

In [7]:
# View the dataframe columns
df.columns

Index(['text', 'targe'], dtype='object')

In [8]:
# Renaming the columns: "targe" becomes "target" and "text" becomes "document"
df.rename(columns={"targe":"target", "text":"document"}, inplace = True)

In [8]:
# View the data after renaming the columns
df.head(3)

| | document | target |
|---|---|---|
| 0 | بين أستوديوهات ورزازات وصحراء مرزوكة وآثار ولي... | 0 |
| 1 | قررت النجمة الأمريكية أوبرا وينفري ألا يقتصر ع... | 0 |
| 2 | أخبارنا المغربية الوزاني تصوير الشملالي ألهب ا... | 0 |


In [9]:
# Getting a test dataframe
test_df = df.groupby("target", group_keys=False).apply(lambda x: x.sample(200))
test_df.head(3)


| | document | target |
|---|---|---|
| 8758 | بعث الملك محمد السادس برقية تعزية إلى أفراد أس... | 0 |
| 9447 | على صدى الفقرات الف كاهية المقرونة بالقهقهات و... | 0 |
| 5390 | فيلم المخرج الصباحي سبق له التتويج بالعديد من ... | 0 |


In [10]:
# Number of test samples.
test_df.shape

(1000, 2)

In [11]:
# function to normalize the Arabic text
def normalize_text(raw_text) -> str:
    '''
    Normalize the Arabic text by removing stopwords, non-Arabic words, and Tashkeel,
    then stemming the remaining words.
    Parameters:
        raw_text : Arabic text sentence.
    Return:
        normalized_text : stemmed Arabic text without stopwords, non-Arabic words, or Tashkeel.
    '''
    # Keep the Arabic tokens only and strip their Tashkeel
    tokens = tokenize(raw_text, conditions=is_arabicrange, morphs=strip_tashkeel)
    # Obtain the stopwords and create the light stemmer
    ara_stopwords = set(stopwords.words("arabic"))
    ArListem = ArabicLightStemmer()

    # Drop the stopwords and stem the remaining words
    filter_tokens = [ArListem.light_stem(word) for word in tokens if word not in ara_stopwords]

    return " ".join(filter_tokens)

In [12]:
# Testing the function
normalize_text("أفتضاربانني وانا لحالي وبشكل وحَيد ")

'ضارب نا حال شكل حيد'

### Apply normalization

In [13]:
# Creating a Normalized text column
test_df["normalized_document"] = test_df["document"].apply(normalize_text)

In [14]:
test_df.head(3)

| | document | target | normalized_document |
|---|---|---|---|
| 8758 | بعث الملك محمد السادس برقية تعزية إلى أفراد أس... | 0 | عث ملك محمد سادس رق تعز راد سر فن مرحوم سعيد ش... |
| 9447 | على صدى الفقرات الف كاهية المقرونة بالقهقهات و... | 0 | صدى فقر لف اه مقرون قهقه أنغام موسيق أنماط مخت... |
| 5390 | فيلم المخرج الصباحي سبق له التتويج بالعديد من ... | 0 | لم مخرج صباح سبق تتويج عديد جوائز وطن دول شرع ... |


## Embedding Text

For the sake of embedding the Arabic text, we will use the following techniques:
<ol>
    <li> <a href="https://huggingface.co/aubmindlab/bert-base-arabert"> AraBERT v2  </a> </li>  
    <li> <a href="https://github.com/bakrianoo/aravec?tab=readme-ov-file"> AraVec 3.0</a> </li>
</ol>

In [None]:
# Install necessary libraries if not installed
!pip install transformers
!pip install sentence_transformers
!pip install torch
!pip install tensorflow
!pip install tf-keras

### AraBERT v2.0 Embedder 

In [15]:
# Import libraries and AraBERT Model
from transformers import AutoModel, AutoTokenizer
import torch



In [30]:
# Define a sentence embedder built on a pre-trained AraBERT model
class AraBertEmbedding:
    def __init__(self, model_name):
        ''' Initialize an AraBERT embedder from the model name. Other
            embedding models can be loaded the same way, such as:
            1- CAMeL-mBERT
            2- mBERT
        '''
        # initialize the model and its tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    def sentence_embedding(self, sentence) -> np.ndarray:
        '''
        Convert the sentence to its embedding using the pooled output of the model.
        Args:
            sentence: the text to be embedded.
        Return:
            A NumPy array holding the sentence embedding.
        '''
        # tokenize the sentence
        inputs = self.tokenizer(sentence, padding=True, truncation=True, return_tensors='pt')

        # get the model outputs without tracking gradients
        with torch.no_grad():
            outputs = self.model(**inputs)

        # pooler_output has shape (1, hidden_size); drop the batch dimension
        sentence_embedding = outputs.pooler_output.squeeze(0)
        sentence_embedding = sentence_embedding.numpy()
        return sentence_embedding

In [31]:
# Create the AraBERT embedder and test it on a sample sentence
model_name = "aubmindlab/bert-base-arabertv2"
AraBERT_embeder = AraBertEmbedding(model_name)

sentence = "هذا نص عربي نظيف"
sentence_embedding = AraBERT_embeder.sentence_embedding(sentence)
print(sentence_embedding)
print(f"Length of embedding list: {len(sentence_embedding)}")
print(f"embedding array shape: {sentence_embedding.shape}")



[ 5.23370206e-01  9.99328911e-01 -9.91155088e-01  2.74203926e-01
 -9.87248719e-01  9.98714924e-01  9.88129497e-01  9.77524459e-01
 -4.93585855e-01 -3.35825264e-01 -9.97633576e-01  9.99559700e-01
 -1.63868994e-01 -7.92213082e-01  1.83501184e-01  1.44676566e-01
 -1.00128412e-01  9.99945164e-01  9.89741862e-01 -9.97253478e-01
 -9.99998271e-01 -9.68582928e-01 -1.40344519e-02  4.25757080e-01
  5.55203930e-02  4.23745424e-01 -9.96149957e-01  8.58649388e-02
 -1.21733166e-01  9.98427391e-01 -2.77254730e-01 -8.37919414e-02
 -2.39879116e-01  1.36790797e-01 -9.99370992e-01 -2.57207781e-01
  8.94751847e-01 -4.23349798e-01 -1.76405579e-01  1.99711040e-01
 -1.21830843e-01  9.38047528e-01 -1.72215834e-01 -7.50236154e-01
  3.91459763e-02  9.78196681e-01  4.21842694e-01  9.99960184e-01
 -9.97235298e-01  3.70462805e-01  2.63041586e-01  2.80179054e-01
  3.10383290e-01 -2.83111513e-01 -2.89760754e-02  9.17396963e-01
  2.50469327e-01  9.99639213e-01 -3.42061460e-01  5.27315810e-02
  7.32271552e-01  3.09987

### AraVec 3.0 Embedder

In [15]:
# Import the needed libraries
import gensim
from nltk import ngrams

In [16]:
# Define a sentence embedder built on a pre-trained AraVec model
class AraVecEmbedding:
    def __init__(self, model_path):
        # load the pre-trained Word2Vec model
        self.aravec_model = gensim.models.Word2Vec.load(model_path)

    def sentence_embedding(self, sentence):
        '''
        Method to embed a sentence by averaging the AraVec vectors of its words.
        Args:
            sentence: a string Arabic sentence.
        Returns:
            A NumPy array holding the sentence embedding
            (zeros if no word is in the vocabulary).
        '''
        word_vectors = []
        for word in sentence.split():
            if word in self.aravec_model.wv.key_to_index:
                word_vectors.append(self.aravec_model.wv[word])
            else:
                # Out-of-vocabulary word: report it and skip it
                print(f"Exception in word {word}")

        if word_vectors:
            return np.mean(word_vectors, axis=0)
        return np.zeros(self.aravec_model.vector_size)

In [17]:
# Testing the sentence AraVec embedder
model_path = "models/full_grams_cbow_300_twitter.mdl"
AraVecEmbedder = AraVecEmbedding(model_path)
sentence = "هذا نص عربي نظيف"
sentence_embedding = AraVecEmbedder.sentence_embedding(sentence)
print(sentence_embedding)
print(f"length of embedding sentence: {len(sentence_embedding)}")

[-3.36966991e-01 -6.22530878e-01  9.82408762e-01  3.53037804e-01
  1.41825095e-01  3.80157173e-01  8.17987800e-01 -6.81027889e-01
  8.72404277e-01 -1.19060230e+00  1.47645742e-01  6.31813347e-01
  1.55117586e-01 -3.72911215e-01 -1.39753222e-01 -4.17479396e-01
 -2.09697932e-01  1.45365143e+00  1.10528791e+00  1.88391805e+00
 -6.56987548e-01  1.04125679e+00 -8.71298611e-01  6.63073599e-01
 -1.15656205e-01  6.91220045e-01  6.93762004e-01 -1.03224277e-01
  8.33525062e-01  2.82451481e-01  5.84654927e-01  1.51760960e+00
  1.25159657e+00  4.87975836e-01 -1.64878383e-01  5.81699491e-01
 -5.19690216e-01  3.41344535e-01  2.24834636e-01 -4.68001008e-01
 -4.98701632e-02  1.07942271e+00  4.15080786e-02 -3.38985115e-01
  8.29212308e-01 -2.49477401e-02  2.64519930e-01 -5.27034104e-01
  4.33270305e-01 -3.21631432e-01 -4.83811855e-01 -7.61056542e-01
 -1.22210646e+00  1.62160707e+00 -5.05381763e-01 -1.52093038e-01
 -5.69052219e-01  3.89696062e-02  1.00912070e+00 -9.78980184e-01
 -9.28324163e-02  8.45302

In [17]:
print(f"embedding array shape: {sentence_embedding.shape}")

embedding array shape: (300,)
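
The AraVec embedder represents a sentence as the mean of the vectors of its in-vocabulary words. The sketch below illustrates that idea on a tiny vocabulary so it runs without the pre-trained model. `toy_vectors` and `mean_sentence_vector` are hypothetical names introduced for illustration, and the 3-dimensional vectors are made up.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a real Word2Vec model
toy_vectors = {
    "نص": np.array([1.0, 0.0, 2.0]),
    "عربي": np.array([3.0, 2.0, 0.0]),
}

def mean_sentence_vector(sentence, vectors, size):
    """Average the vectors of in-vocabulary words; return zeros if none are known."""
    known = [vectors[word] for word in sentence.split() if word in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

# "هذا" is not in the toy vocabulary, so only two words contribute to the mean
embedding = mean_sentence_vector("هذا نص عربي", toy_vectors, size=3)
print(embedding)  # [2. 1. 1.]
```

Words missing from the vocabulary are simply skipped, so a sentence made only of unknown words falls back to the zero vector.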


# Dimensionality Reduction

For the sake of dimensionality reduction, we will use the following techniques: <br />
1. Principal Component Analysis (PCA)
2. Truncated Singular Value Decomposition (TruncatedSVD)
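
Both techniques project a matrix of shape `(n_samples, n_features)` onto fewer columns. The main difference is that PCA centers the data first, while truncated SVD works on the matrix as it is. The following is a minimal NumPy sketch of that difference on synthetic data. `pca_reduce` and `svd_reduce` are hypothetical helpers written for illustration, not the scikit-learn implementations.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components (centering, then SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

def svd_reduce(X, n_components):
    """Project X onto its top right singular vectors without centering."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

rng = np.random.default_rng(42)
X = rng.random((100, 300))      # synthetic stand-in for 100 document embeddings
print(pca_reduce(X, 50).shape)  # (100, 50)
print(svd_reduce(X, 50).shape)  # (100, 50)
```

In both cases the rows are documents and the columns are embedding dimensions, so the reduction has to be fitted on the whole collection rather than on one document at a time.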

## Principal Component Analysis (PCA)

In [48]:
# Import the needed libraries
from sklearn.decomposition import PCA

class PCAClass:
    def __init__(self, n_components=1):
        '''Initialize a PCA compression object.
        Args:
            n_components: number of principal components to keep
        '''
        self.pca = PCA(n_components=n_components, random_state=42, svd_solver='full')

    def reduce(self, data):
        '''Function to reduce the data size.
        Args:
            data: data to be reduced, an array of shape (n_samples, n_features)
        Returns:
            reduced data: an array of shape (n_samples, n_components)
        '''
        # A 1-D array is treated as a single sample; PCA needs many samples to be meaningful
        return self.pca.fit_transform(np.atleast_2d(data))

In [49]:
# Testing functions
data = np.random.rand(100, 300)  # Example data
pca = PCAClass(n_components=50)
red_data = pca.reduce(data)
red_data.shape

(100, 50)

## Truncated-SVD

In [38]:
# Import needed libraries
from sklearn.decomposition import TruncatedSVD

class TruncatedSVDClass:
    def __init__(self, n_components=50):
        '''Function to initiate a decomposition object.
        Args:
            n_components: number of components to keep
        '''
        self.svd = TruncatedSVD(n_components=n_components, random_state=42)

    def reduce(self, data):
        '''Function to reduce the data size.
        Args:
            data: data to be reduced, an array of shape (n_samples, n_features)
        Returns:
            reduced data: an array of shape (n_samples, n_components)
        '''
        return self.svd.fit_transform(data)

In [41]:
# Testing functions
data = np.random.rand(100, 300)  # Example data
svd_reducer = TruncatedSVDClass()
red_data = svd_reducer.reduce(data)
red_data.shape

(100, 50)

# Clustering 

For the sake of clustering, two clustering techniques will be used: <b> K-means </b> and <b> HDBSCAN </b>.

## K-means

In [None]:
# Import Kmeans model
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

class KmeansClustering:
    '''A class to handle Kmeans clustering'''
    def __init__(self):
        '''Initialize the containers used by the elbow method'''
        self.k_values = []
        self.inertias = []

    def find_best_k_value(self, data, max_no_clusters):
        '''Fit K-means for every odd k below max_no_clusters and record the inertia'''
        self.k_values = list(range(1, max_no_clusters, 2))
        self.inertias = []
        for k in self.k_values:
            self.kmeans = KMeans(n_clusters=k)
            self.kmeans.fit(data)
            self.inertias.append(self.kmeans.inertia_)

    def draw_elbow_graph(self):
        '''Plot the inertia against the tested numbers of clusters'''
        plt.plot(self.k_values, self.inertias, marker='o')
        plt.title('Elbow method')
        plt.xlabel('Number of clusters')
        plt.ylabel('Inertia')
        plt.show()

    def cluster(self, data, n_clusters):
        '''Apply the optimal number of clusters and return the labels.
        Args:
            data: the data to be clustered.
            n_clusters: number of clusters
        Returns:
            labels: labels of the cluster
        '''
        self.kmeans = KMeans(n_clusters=n_clusters, max_iter=100, init="k-means++")
        self.kmeans.fit(data)
        return self.kmeans.labels_

## HDBSCAN

In [59]:
# import HDBSCAN model
from sklearn.cluster import HDBSCAN

class HDBSCANClass:
    def __init__(self):
        '''Initialize a HDBSCAN clustering model'''
        self.hdb = HDBSCAN(min_cluster_size = 100)

    def cluster(self, data):
        '''Function to perform clustering on the data.
        Args:
            data: data to be clustered.
        Returns:
            labels: labels of the data clustered.
        '''
        self.hdb.fit(data)
        return self.hdb.labels_

In [60]:
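# Find the number of unique clusters (HDBSCAN labels noise points as -1, so they are excluded)
def find_clusters_no(labels_arr):
    labels_arr = np.asarray(labels_arr)
    return len(np.unique(labels_arr[labels_arr != -1]))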
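
Because HDBSCAN assigns the label `-1` to points it considers noise, it helps to look at the share of noise alongside the number of clusters. The following is a small sketch on made-up labels. `summarize_labels` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def summarize_labels(labels):
    """Return the number of clusters (noise excluded) and the share of noise points."""
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels[labels != -1])
    noise_ratio = float(np.mean(labels == -1))
    return len(cluster_ids), noise_ratio

# Made-up labels: three clusters (0, 1, 2) and two noise points out of eight
n_clusters, noise_ratio = summarize_labels([0, 0, 1, -1, 1, -1, 2, 0])
print(n_clusters, noise_ratio)  # 3 0.25
```

A noise ratio close to 1 suggests that the clustering parameters or the input features need to be revisited before topics are extracted.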

# Topic Representation

Topics are represented using two techniques: TF-IDF with cosine-similarity merging, and LDA.

## TF-IDF Cosine

In [None]:
# Import needed libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
class TfidfClass:
    def __init__(self, df):
        '''Group the documents by cluster and initiate a TF-IDF vectorizer'''
        self.grouped_data = df.groupby("cluster")
        self.vectorizer = TfidfVectorizer()
        self.mega_documents = {}
        self.cluster_topic_terms = {}

    @staticmethod
    def merge_similar_rows(tfidf_matrix, threshold=0.8):
        '''Merge the rows (one per cluster) whose cosine similarity reaches the threshold.
        Returns the merged dense matrix and a boolean mask of the rows that were kept.
        '''
        merged_matrix = tfidf_matrix.toarray()
        keep = np.ones(merged_matrix.shape[0], dtype=bool)
        for i in range(merged_matrix.shape[0]):
            if not keep[i]:
                continue
            for j in range(i + 1, merged_matrix.shape[0]):
                if not keep[j]:
                    continue
                similarity = cosine_similarity(merged_matrix[i:i + 1], merged_matrix[j:j + 1])[0][0]
                if similarity >= threshold:
                    # Merge row j into row i by adding their TF-IDF values
                    merged_matrix[i] += merged_matrix[j]
                    keep[j] = False  # Mark the merged row as removed
        return merged_matrix[keep], keep

    def get_cluster_topic_terms(self, n_terms=10):
        '''Function to return the cluster number with its terms'''
        # Concatenate the documents of each cluster into one mega-document
        for cluster_id, group in self.grouped_data:
            self.mega_documents[cluster_id] = " ".join(group["document"])

        tfidf_matrix = self.vectorizer.fit_transform(self.mega_documents.values())
        merged_tfidf_matrix, keep = self.merge_similar_rows(tfidf_matrix)

        # Extract the top terms of every cluster that survived the merge
        feature_names = self.vectorizer.get_feature_names_out()
        kept_ids = [cluster_id for cluster_id, kept in zip(self.mega_documents, keep) if kept]
        for cluster_id, row in zip(kept_ids, merged_tfidf_matrix):
            top_indices = row.argsort()[-n_terms:][::-1]
            self.cluster_topic_terms[cluster_id] = feature_names[top_indices].tolist()

        return self.cluster_topic_terms

    def print_cluster_terms(self):
        '''Function to print the cluster along with the terms'''
        for cluster_id, terms in self.cluster_topic_terms.items():
            print(f"Cluster {cluster_id}: {terms}")
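
The class concatenates the documents of each cluster into one "mega-document" and ranks its terms by TF-IDF. The sketch below shows the same idea with a hand-rolled score (term frequency times the log of the inverse document frequency) on made-up clusters. This is an illustration under simplifying assumptions: scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant, so its scores differ. `clusters` and `top_terms` are hypothetical names.

```python
import math
from collections import Counter

# Hypothetical toy clusters; each value is the list of documents in that cluster
clusters = {
    0: ["كرة القدم مباراة", "مباراة كرة اليوم"],
    1: ["فيلم جديد اليوم", "مخرج فيلم مهرجان"],
}

# One mega-document (as a token list) per cluster
mega_docs = {cid: " ".join(docs).split() for cid, docs in clusters.items()}

# Document frequency: in how many mega-documents each term occurs
doc_freq = Counter(term for tokens in mega_docs.values() for term in set(tokens))
n_docs = len(mega_docs)

def top_terms(tokens, k=2):
    """Rank the terms of one mega-document by tf * log(N / df)."""
    counts = Counter(tokens)
    scores = {t: (c / len(tokens)) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
    # Sort by decreasing score, breaking ties alphabetically for a stable order
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

for cid, tokens in mega_docs.items():
    print(cid, top_terms(tokens))
```

A term that occurs in every cluster, such as "اليوم" here, gets a score of zero and never ranks as a topic term, which is the property that makes the per-cluster ranking discriminative.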

## LDA



# Evaluation of Topic Modeling

For the sake of evaluation, two techniques are used: <br/>
1. Normalized Pointwise Mutual Information (NPMI)
2. Coherence
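
As a reference for the first measure, the NPMI of two terms is $\log\frac{P(w_i, w_j)}{P(w_i)P(w_j)}$ divided by $-\log P(w_i, w_j)$, and a topic is scored by averaging over its term pairs. The sketch below assumes that probabilities are estimated from document-level co-occurrence. Library implementations often use sliding windows over a reference corpus instead, so their numbers can differ. `npmi_coherence` is a hypothetical helper and the documents are made up.

```python
import math
from itertools import combinations

def npmi_coherence(topic_terms, documents):
    """Average NPMI over all term pairs, with probabilities from document co-occurrence."""
    doc_sets = [set(doc.split()) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*terms):
        # Share of documents that contain all the given terms
        return sum(all(t in d for t in terms) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_terms, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # convention: the terms never co-occur
        elif p12 == 1:
            scores.append(1.0)   # limit case: both terms occur in every document
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)

docs = ["كرة مباراة هدف", "كرة مباراة", "فيلم مخرج", "فيلم مهرجان"]
print(npmi_coherence(["كرة", "مباراة"], docs))  # terms that always appear together
print(npmi_coherence(["كرة", "فيلم"], docs))    # terms that never appear together
```

Scores range from -1 (terms never co-occur) to 1 (terms always co-occur), so higher averages indicate more coherent topics.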

In [None]:
# Here goes the code of evaluation


# Techniques Permutation

We evaluate every combination of the following options: <br/>
1. Two datasets (raw and normalized).
2. Two embedding techniques ( <a href="https://huggingface.co/aubmindlab/bert-base-arabert"> AraBERT v2  </a> and <a href="https://github.com/bakrianoo/aravec?tab=readme-ov-file"> AraVec 3.0</a> ).
3. Two dimensionality reduction techniques (PCA and TruncatedSVD).
4. Two clustering algorithms (K-means and HDBSCAN).
5. Two topic representations (TF-IDF and LDA). <br/>

The final results are then evaluated using the two methods, NPMI and coherence.
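
Strictly speaking, the experiments form a Cartesian product of the options rather than a permutation. A minimal sketch of enumerating them with `itertools.product` follows. The option names are plain strings taken from the section headings of this notebook, and how each name maps to a class is left open.

```python
from itertools import product

# Option names mirror the techniques covered in this notebook
datasets = ["raw", "normalized"]
embedders = ["AraBERT", "AraVec"]
reducers = ["PCA", "TruncatedSVD"]
clusterers = ["KMeans", "HDBSCAN"]
representations = ["TF-IDF", "LDA"]

# Every experiment is one choice from each list
experiments = list(product(datasets, embedders, reducers, clusterers, representations))
print(len(experiments))  # 32

for dataset, embedder, reducer, clusterer, representation in experiments[:3]:
    print(dataset, embedder, reducer, clusterer, representation)
```

With two choices at each of the five stages, this yields 32 configurations to run and evaluate.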

In [24]:
# Embed the normalized documents using AraVec
model_path = "models/full_grams_cbow_300_twitter.mdl"
AraVecEmbedder = AraVecEmbedding(model_path)

test_df["norm_doc_AraVec_emb"] = test_df["normalized_document"].apply(AraVecEmbedder.sentence_embedding)

Exception in word شرايب
Exception in word أليم
Exception in word صدقائ
Exception in word مواسا
Exception in word تحلى
Exception in word آلة
Exception in word غنى
Exception in word إبداع
Exception in word أعمال
Exception in word أليم
Exception in word إن
Exception in word عالى
Exception in word أوفى
Exception in word على
Exception in word أن
Exception in word صدى
Exception in word أنغام
Exception in word أنماط
Exception in word على
Exception in word مارشيك
Exception in word أربعاء
Exception in word أول
Exception in word شرفاو
Exception in word إعداد
Exception in word أمن
Exception in word أمس
Exception in word أولى
Exception in word أمر
Exception in word أيام
Exception in word إيك
Exception in word عة
Exception in word أخير
Exception in word حفاو
Exception in word استينغ
Exception in word نة
Exception in word موسيقى
Exception in word نتزع
Exception in word أمس
Exception in word طنجاو
Exception in word وسحاسح
Exception in word إيقاع
Exception in word إيقاع
Exception in word أغن
Exception

In [55]:
pca = PCAClass()
test_df["norm_doc_AraVec_emb_red"] = test_df["norm_doc_AraVec_emb"].apply(pca.reduce)

In [71]:
hdbscan = HDBSCANClass()
test_df["cluster"] = hdbscan.cluster(test_df[["norm_doc_AraVec_emb_red"]])

In [72]:
test_df["cluster"].unique()

array([-1], dtype=int64)

In [73]:
test_df.head()

| | document | target | normalized_document | norm_doc_AraVec_emb | norm_doc_AraVec_emb_red | cluster |
|---|---|---|---|---|---|---|
| 8758 | بعث الملك محمد السادس برقية تعزية إلى أفراد أس... | 0 | عث ملك محمد سادس رق تعز راد سر فن مرحوم سعيد ش... | [0.025512429, -0.12770364, 0.28066695, 0.00590... | [[0.0]] | -1 |
| 9447 | على صدى الفقرات الف كاهية المقرونة بالقهقهات و... | 0 | صدى فقر لف اه مقرون قهقه أنغام موسيق أنماط مخت... | [0.3132456, -0.45250255, -0.08230107, 0.468265... | [[0.0]] | -1 |
| 5390 | فيلم المخرج الصباحي سبق له التتويج بالعديد من ... | 0 | لم مخرج صباح سبق تتويج عديد جوائز وطن دول شرع ... | [0.3376163, -0.14279619, -0.025770979, 0.36993... | [[0.0]] | -1 |
| 4781 | فاز المخرج المكسيكي أليخاندرو جونزاليس إيناريت... | 0 | از مخرج مكسيك يخاندرو جونزاليس إيناريت جائز سك... | [0.12612131, -0.35999724, -0.467802, 0.6016373... | [[0.0]] | -1 |
| 6994 | استقبل مصطفى الخلفي وزير الاتصال الناطق الرسمي... | 0 | ستقبل مصطفى خلف زير اتصال ناطق رسم اسم حكوم مط... | [0.2438032, -0.2603266, -0.23658782, 0.1457876... | [[0.0]] | -1 |
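
The single label `-1` means that HDBSCAN treated every document as noise. One likely cause, judging from the `[[0.0]]` values in the reduced column, is that the reduction was fitted on one document at a time: with a single sample, centering subtracts the sample from itself, so every projection is zero. The NumPy sketch below contrasts the two setups on synthetic embeddings. `pca_project` is a hypothetical helper that mimics PCA, not the class used in this notebook.

```python
import numpy as np

def pca_project(X, n_components):
    """Center X and project it onto its top principal directions."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
embeddings = [rng.random(300) for _ in range(20)]  # synthetic stand-ins for document vectors

# Per-document reduction: one sample, so centering removes all the information
single = pca_project(embeddings[0].reshape(1, -1), n_components=1)
print(np.allclose(single, 0.0))  # True

# Reduction over the stacked (n_documents, n_features) matrix keeps the variation
matrix = np.vstack(embeddings)
reduced = pca_project(matrix, n_components=5)
print(reduced.shape)  # (20, 5)
```

Stacking the per-document arrays with `np.vstack` before reducing, and passing the resulting matrix to the clustering step, gives the clusterer real variation to work with.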