<a href="https://colab.research.google.com/github/haitam-zouhair/workspace_henceforth/blob/master/arabic_topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [20]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [21]:
%%shell
cp -r '/content/drive/My Drive/Projects/Zouhair/' .

In [None]:
import os
os.chdir("Zouhair")

In [None]:
!pip install pyLDAvis

# Dataset

I used the [SANAD](https://www.sciencedirect.com/science/article/pii/S2352340919304305) dataset. It can be downloaded [here](https://data.mendeley.com/datasets/57zpx667y9/2); alternatively, you can use the copy I put on Google Drive [here](https://drive.google.com/file/d/1hjDR0C5vHPckZisqYby13r2m1i5mtH3p/view?usp=sharing).

In [None]:
%%shell
unzip arabic.zip
unzip Akhbarona.zip -d Akhbarona
unzip Arabiya.zip -d Arabiya
unzip Khaleej.zip -d Khaleej
unzip SANAD_SUBSET.zip -d SANAD_SUBSET
rm *.zip 

In [None]:
import pandas as pd


DATANAME = "Arabiya"

rows = []

for cat in os.listdir(DATANAME):
    print("reading files from category {}...".format(cat))
    dir_path = os.path.join(DATANAME, cat)

    for file_name in os.listdir(dir_path):
        path = os.path.join(dir_path, file_name)

        # read explicitly as UTF-8 (assumed encoding of the SANAD text files)
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()

        rows.append({"text": text, "label": cat})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
data_frame = pd.DataFrame(rows, columns=["text", "label"])


In [None]:
print(len(data_frame))
data_frame.head()

# I/- Arabic Text Processing

Similar to English text, Arabic text should be preprocessed before performing any NLP task on it. The preprocessing steps are roughly the same:

* **Tokenization:** We strip punctuation and digits, then split on whitespace.
* **Remove Stopwords:** Arabic stop words can be found [here](https://github.com/mohataher/arabic-stop-words/blob/master/list.txt).
* **Lemmatization:** I am not sure how to lemmatize Arabic text; I think that in Arabic, stemming and lemmatization are the same task. I have found this [paper](https://www.aclweb.org/anthology/L18-1181.pdf), but we need to check this later to be sure (**TO DO**).
* **Stemming:** We use `nltk.ISRIStemmer`.



In [None]:
%%shell
#Download arabic stop words
wget https://raw.githubusercontent.com/mohataher/arabic-stop-words/master/list.txt
mv list.txt arabic_stop_words.txt

In [None]:
from string import punctuation
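from nltk.stem.isri import ISRIStemmer


# also strip Arabic punctuation, curly quotes and ASCII digits
punctuation += '،؛؟”0123456789“'
with open("arabic_stop_words.txt", encoding="utf-8") as f:
    stop_words = set(f.read().splitlines())  # a set makes membership tests fast
stemmer = ISRIStemmer()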

def preprocess(text):
    # tokenize
    text = ''.join(c for c in text if c not in punctuation)
    tokens = text.split()  

    # remove stop words   
    tokens = [w for w in tokens if w not in stop_words]

    # stem
    stems = [stemmer.stem(w) for w in tokens]

    return stems


processed_docs = data_frame['text'].map(preprocess)
data_frame['processed_text'] = processed_docs

# II/- Topic modeling using LDA


In [None]:
from gensim import corpora, models

import numpy as np

## 1. Bag of Words

In [None]:
# build bag of words dictionary
dictionary = corpora.Dictionary(data_frame['processed_text'])

# filter out words that are too rare (fewer than 15 documents) or too common (more than 50% of documents)
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in data_frame['processed_text']]

View the bag-of-words result for the first document:

In [None]:
# each entry of a bag-of-words document is a (word_id, count) pair
for word_id, count in bow_corpus[0]:
    print("Word {} (\"{}\") appears {} time(s).".format(word_id,
                                                        dictionary[word_id],
                                                        count))
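
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of what the dictionary filtering and `doc2bow` conversion do conceptually. The toy tokens and the helper names `build_vocab` and `doc_to_bow` are hypothetical, and gensim assigns its ids differently, so this is only an illustration of the idea, not gensim's implementation.

```python
from collections import Counter

# Toy corpus of already-stemmed documents (hypothetical tokens, not from SANAD).
docs = [
    ["bank", "loan", "bank"],
    ["match", "goal", "bank"],
    ["goal", "match", "match"],
]

def build_vocab(docs, no_below=2, no_above=1.0):
    """Keep tokens found in at least `no_below` documents and in at most a
    fraction `no_above` of all documents, then give each one an integer id."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    kept = sorted(tok for tok, df in doc_freq.items()
                  if df >= no_below and df / len(docs) <= no_above)
    return {tok: idx for idx, tok in enumerate(kept)}

def doc_to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

vocab = build_vocab(docs)
bows = [doc_to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Here "loan" is dropped because it occurs in only one document, which mirrors the role of `no_below` in `filter_extremes`.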

## 2. Running LDA

Our dataset has $6$ categories, so we try to build an LDA model with $6$ topics.

In [None]:
NUM_TOPICS = 6

lda_model = models.LdaMulticore(bow_corpus,
                                num_topics=NUM_TOPICS,
                                id2word=dictionary,
                                passes=2,
                                workers=4)

In [None]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} | Word: {}'.format(idx, topic))

It seems that we can recognize some topics from these results:

* Topic 0 corresponds to **Finance**
* Topic 1 corresponds to **Sports**
* Topic 2 corresponds to **Culture**
* Topic 3 corresponds to **Tech**
* Topic 4 corresponds to **Finance**
* Topic 5 corresponds to **Politics**
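
The interpretation above was done by eye. One possible way to cross-check it, sketched here with hypothetical toy data, is to take each document's dominant topic and look at which true label is most frequent for that topic. The `pairs` list and the helper `majority_label_per_topic` are assumptions for illustration and are not part of the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (dominant_topic, true_label) pairs for a handful of documents.
pairs = [
    (0, "Finance"), (0, "Finance"), (0, "Politics"),
    (1, "Sports"), (1, "Sports"),
    (2, "Culture"),
]

def majority_label_per_topic(pairs):
    """Map each topic id to the label that occurs most often among the
    documents whose dominant topic is that id."""
    by_topic = defaultdict(Counter)
    for topic_id, label in pairs:
        by_topic[topic_id][label] += 1
    return {topic_id: counts.most_common(1)[0][0]
            for topic_id, counts in sorted(by_topic.items())}

topic_names = majority_label_per_topic(pairs)
print(topic_names)
```

With the real data, the pairs could be built from the dominant topic of each document and the `label` column of `data_frame`.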




# III/- Evaluation

## 1. Visualization

In [None]:
# NOTE: in recent pyLDAvis releases this module is named `pyLDAvis.gensim_models`
import pyLDAvis.gensim
import pickle 
import pyLDAvis

pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary)
LDAvis_prepared

## 2. Perplexity and Coherence Score

In [None]:
coherence_model_lda = models.CoherenceModel(model=lda_model,
                                     texts=data_frame["processed_text"],
                                     dictionary=dictionary,
                                     coherence='c_v')

print('Coherence Score: {:.2f}'.format(coherence_model_lda.get_coherence()))
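
The cell above reports coherence only. As a reminder of what perplexity measures, here is a minimal sketch of the standard definition, the exponential of the negative average log-likelihood per token, on hypothetical token probabilities. The function name `perplexity` and the numbers are illustrative assumptions; for the trained model, gensim exposes `lda_model.log_perplexity(bow_corpus)`, which returns a per-word likelihood bound rather than the perplexity itself.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the probability the model assigns
    to each observed token: exp(-(1/N) * sum(log p_i))."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))

# A model that spreads its mass uniformly over 4 words has perplexity 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that assigns higher probability to the observed tokens scores lower.
sharper = perplexity([0.5, 0.5, 0.5, 0.5])

print(uniform, sharper)
```

Lower perplexity means the model is less "surprised" by the data, whereas a higher coherence score is better.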

# IV/- Content classification using topic modeling

In this section, we use the results of topic modeling to build a content classifier. Each document is represented by a `NUM_TOPICS`-dimensional vector capturing its distribution over the learnt topics. This is similar to the approach used in [Phan et al. (2008)](http://gibbslda.sourceforge.net/fp224-phan.pdf).
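
The document representation can be sketched as follows: a topic model returns a list of `(topic_id, probability)` pairs, and we place each probability at the position of its topic id in a fixed-length vector. The helper `to_dense` and the example values are hypothetical; topics missing from the list simply stay at zero.

```python
def to_dense(topic_probs, num_topics):
    """Turn a list of (topic_id, probability) pairs into a fixed-length
    feature vector indexed by topic id."""
    vec = [0.0] * num_topics
    for topic_id, prob in topic_probs:
        vec[topic_id] = prob
    return vec

# Hypothetical topic distribution of one document over 6 topics.
sparse = [(0, 0.7), (3, 0.2), (5, 0.1)]
dense = to_dense(sparse, 6)
print(dense)
```

Indexing by topic id, rather than relying on the order of the returned pairs, keeps the features aligned even when some topics are omitted.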

## 1. Build dataset

In [None]:
N = len(bow_corpus)


X = np.zeros((N, NUM_TOPICS))
y = np.zeros(N, dtype=np.uint8)


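# sort the labels so that the label -> id mapping is the same on every run
# (iterating over a bare set gives an arbitrary order)
label2id = {label: idx for (idx, label)
            in enumerate(sorted(set(data_frame["label"])))}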

for i in range(N):
    topics_proba = lda_model.get_document_topics(bow_corpus[i],
                                              minimum_probability=0.0)
    
    topic_probas = [v for _, v in  topics_proba] 
    X[i,:] = np.array(topic_probas)
    y[i] = label2id[data_frame["label"][i]]
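
Because the classifier only sees integer ids, predictions have to be decoded with exactly the mapping used to build `y`. A minimal sketch of a reproducible encoding and its inverse, using a hypothetical toy label list (the name `id2label` is an assumption, not something the notebook defines at this point):

```python
# Hypothetical labels for five documents.
labels = ["Sports", "Culture", "Tech", "Culture", "Sports"]

# Sorting makes the mapping independent of set iteration order.
label2id = {label: idx for idx, label in enumerate(sorted(set(labels)))}

# Inverse mapping, used to decode predicted ids back into label names.
id2label = {idx: label for label, idx in label2id.items()}

y = [label2id[label] for label in labels]
decoded = [id2label[idx] for idx in y]

print(label2id)
print(y)
```

Saving `label2id` next to the trained classifier would avoid having to hard-code the label order at prediction time.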
    

## 2. Train classifier

We train a simple classifier (an SVM) on the extracted feature vectors.

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# SVM
svm_clf = SVC()
svm_clf.fit(X_train, y_train)

train_acc = svm_clf.score(X_train, y_train)
test_acc = svm_clf.score(X_test, y_test)
print("SVM: Train accuracy {0:.2f} | Test accuracy {1:.2f}".format(train_acc,
                                                                   test_acc))

SVM: Train accuracy 0.87 | Test accuracy 0.87


# V/- Model Deployment

In this section, we save the model, load it back into `lda_model_loaded`, and test it on new text.



In [23]:
import warnings
warnings.filterwarnings('ignore')


model_path = "model"

lda_model.save(model_path)
# `load` is a class method, so call it on the class rather than on an existing instance
lda_model_loaded = models.LdaMulticore.load(model_path)

In [89]:
import pickle

with open("arab_class.pkl", "wb") as f:
    pickle.dump(svm_clf, f, pickle.HIGHEST_PROTOCOL)

In [91]:

with open("arab_class.pkl", "rb") as f:
    clf = pickle.load(f)

Consider some text, for example:

In [92]:
#get a random text
text = data_frame.sample()["text"].values[0]

print(text)

# preprocess text
processed_text = preprocess(text)

# get Bag of Words
bow_doc = lda_model_loaded.id2word.doc2bow(processed_text)

# get topics 
topics = lda_model_loaded.get_document_topics(bow_doc,
                                              minimum_probability=0.0)
topic = [v for _, v in  topics] 
query= np.array(topic)

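result=clf.predict([query])
# decode with the inverse of the training-time mapping `label2id`;
# a hard-coded label list is only correct if it matches that mapping exactly
id2label = {idx: label for label, idx in label2id.items()}
print(id2label[int(result[0])])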
print("*"*30)

print([topics])

أعلن في بغداد عن أسماء الأديبات الفائزات بجائزة "نازك الملائكة للإبداع الأدبي" النسوي في دورتها السابعة (دورة 2014)، وتميّزت دورة هذا العام عن الدورات السابقة بسعة المشاركة وتنوّع البلدان؛ حيث تلقت اللجنة التحكيمية 156 مشاركة من داخل وخارج العراق؛ وتناولت المشاركات مواضيع عدة وكلّ ضمن الحقل المختصة فيه. والملاحظة التي يمكن تأشيرها في هذه الدورة هو فوز الأديبات العربيات في مجالي الشعر والقصة؛ تاركات للأديبات العراقيات جوائز النقد. فجائزة الشعر الأولى ذهبت للشاعرة اللبنانية نور حيدر، والثانية للشاعرة السورية ليندا سلمان ابراهيم؛ والجائزة الثالثة للشاعرة المغربية صباح الدبى. أما في مجال القصة القصيرة فكانت الجائزة الأولى للقاصة العراقية إيمان راضي عبد الحسين، والفائزة بالجائزة الثانية القاصة زينة بو رويسة من الجزائر، في الوقت الذي كان المركز الثالث من نصيب القاصة المصرية غادة محمد العبسي. وجاءت جوائز النقد العراقية كالآتي (ماجدة هاتو بالمركز الأول) و(نادية هناوي سعدون بالمركز الثاني) بينما كان المركز الثالث من نصيب (فاطمة بدر حسين). ويذكر أن جائزة نازك الملائكة للإبداع النسوي أسستها وزارة

In [None]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} | Word: {}'.format(idx, topic))

In [72]:
data_frame["label"]


0        Culture
1        Culture
2        Culture
3        Culture
4        Culture
          ...   
71241     Sports
71242     Sports
71243     Sports
71244     Sports
71245     Sports
Name: label, Length: 71246, dtype: object

In [78]:
result[0]

4

In [81]:
A

0         Culture
5619         Tech
10029    Politics
14397     Finance
44473     Medical
48188      Sports
Name: label, dtype: object

In [85]:
A[0]

'Culture'

In [93]:
! cp "/content/Zouhair/arab_class.pkl" "/content/drive/My Drive/Projects/Zouhair/Models/"

In [94]:
! cp "/content/Zouhair/model" "/content/drive/My Drive/Projects/Zouhair/Models/"

In [95]:
! cp "/content/Zouhair/model.expElogbeta.npy" "/content/drive/My Drive/Projects/Zouhair/Models/"

In [96]:
! cp "/content/Zouhair/model.id2word" "/content/drive/My Drive/Projects/Zouhair/Models/"

In [97]:
! cp "/content/Zouhair/model.state" "/content/drive/My Drive/Projects/Zouhair/Models/"

In [None]:
lda_model_loaded = models.LdaMulticore.load("/content/drive/My Drive/Projects/Zouhair/Models/model")
with open("/content/drive/My Drive/Projects/Zouhair/Models/arab_class.pkl", "rb") as f:
    clf = pickle.load(f)

# example text (a description of a Spanish football club)
text = " هو فريق كرة قدم محترف إسباني أُسس عام 1902، مقره العاصمة الإسبانية مدريد. يلعب الفريق في الدوري الإسباني واختير كأفضل فريق كرة قدم في القرن العشرين، وقد فاز بالدوري الإسباني 33 مرة (رقم قياسي)، وتسعة عشر مرة بكأس ملك إسبانيا وأحرز رقماً قياسياً بحيازته 13 بطولة في دوري أبطال أوروبا "

print(text)

# preprocess text
processed_text = preprocess(text)

# get Bag of Words
bow_doc = lda_model_loaded.id2word.doc2bow(processed_text)

# get topics 
topics = lda_model_loaded.get_document_topics(bow_doc,
                                              minimum_probability=0.0)
topic = [v for _, v in  topics] 
query= np.array(topic)

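result=clf.predict([query])
# decode with the inverse of the training-time mapping `label2id`
# (assumes `label2id` from the "Build dataset" cell is still in the session)
id2label = {idx: label for label, idx in label2id.items()}
print(id2label[int(result[0])])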
print("*"*30)

print([topics])