
# Web Mining - Project : #

## Objective:

<p>
Mine social media networks (Facebook, Twitter, etc.) and news websites <br>
to discover which topics Moroccans have been discussing over the last two years.  

- Use LDA (Latent Dirichlet Allocation) or other topic identification techniques.

- Provide an in-depth analysis.
</p>

***
<br>

### Importing the libraries:


In [1]:
# !pip install -r requirements.txt
import numpy as np
import pandas as pd 
import re
from tqdm.notebook import tqdm

from googletrans import Translator
from langdetect import detect
from textblob import TextBlob

import gensim
from gensim.parsing.preprocessing import STOPWORDS
from nltk.corpus import stopwords

home = 'D:/WISD/S3/Web_Mining/Current_Trends_in_Moroccan_Social_Networks/'

from jupyterthemes import jtplot
jtplot.style()

In [23]:
import os 
os.system('python -m scrapy runspider "HespressSpider3.py" --nolog')

NameError: name 'os' is not defined

# Processing tweets with NLP #

The steps we followed to process the stored Tweets are:

a) Delete unnecessary data: usernames, emails, hyperlinks, retweets, punctuation, 
possessives, duplicate characters, and special characters like smileys.

b) Normalize whitespace (collapse runs of whitespace characters into a single space).

c) Convert hashtags into separate words; for example, the hashtag #MoroccanUsers 
is converted into the two words Moroccan and Users.

d) Transform words written in Moroccan dialect, or in a Berber Tamazight dialect, into Standard Arabic. 
These words may be written in the Arabic or the French alphabet. 
To perform this task, we create a Python file containing a dictionary of words that we gathered, 
then we store it on each slave node of our cluster and import it inside the NLP script.

e) Create a function to detect the language in which
the tweet is written (Standard Arabic, French, or English).

f) Create a function for automatic correction of spelling mistakes.

g) Create a list of contractions to normalize and
expand words like What's => What is.

h) Strip the suffixes of a word until we reach its root (stemming). For
example: Stemming => stem.

i) Remove stopwords for Standard Arabic (أن, إن, بعد, ...),
French (alors, à, ainsi, ...), and English (about, above, almost, ...).
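
Step g) can be driven by a lookup table instead of one substitution per contraction. The block below is only a minimal sketch of that idea: the `CONTRACTIONS` table and the helper name `expand_contractions` are hypothetical and are not the list used in this project.

```python
import re

# Hypothetical, deliberately tiny table; a real list would be much longer.
CONTRACTIONS = {
    "what's": "what is",
    "i'm": "i am",
    "don't": "do not",
    "we'll": "we will",
}

# One alternation built from the table, longest keys first so that longer
# entries win over shorter ones sharing the same prefix.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace every known contraction with its (lowercase) expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1).lower()], text)

print(expand_contractions("What's up? I'm sure we'll manage, don't worry."))
```

Adding a contraction then only means adding a table entry; the matching logic stays untouched.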


## a) Delete unnecessary data: usernames, emails, hyperlinks, punctuation, possessives, duplicate characters, and special characters like smileys. ##

In [2]:
demoji_df = pd.read_json(home + 'demoji.json', encoding='utf-8') # from https://pypi.org/project/demoji/
demoji_df.reset_index(inplace=True)
demoji_df = demoji_df.astype({"index": str})  # astype returns a new DataFrame, so keep the result

def delete_unnecessary_data(tweet):
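    # delete www.* or http(s)://* links
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', tweet)
    # delete emails (before @usernames, so the part before the @ is not left behind)
    tweet = re.sub(r'[^\s@]+@[^\s@]+', ' ', tweet)
    # delete @username, including a mention at the very start of the tweet
    tweet = re.sub(r'@[^\s]+', ' ', tweet)
    # delete username* placeholders
    tweet = re.sub(r'username[^\s]+', '', tweet)

    # delete <script>...</script> blocks together with their content
    tweet = re.sub(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', ' ', tweet, flags=re.DOTALL | re.IGNORECASE)
    # delete all remaining html tags
    tweet = re.sub(r'<.*?>', ' ', tweet)
    # delete digits, angle brackets and commas
    tweet = re.sub(r"[0-9><,]+", " ", tweet)
    # replace line breaks and the ┊ separator with a space
    tweet = re.sub(r"\n+|┊", " ", tweet)

    # collapse characters repeated 3+ times down to two (e.g. ابداااع => ابدااع)
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)

    # Remove special characters like smileys (emoji)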
    # 😂🤷‍♀️❤️🔴📢✅❎🥘↘️🌻♥️♥️♥️🥵🆚📅🕗📍👋😩😢🙌🏾🔥😮💖😭👄❤🤢💥💣
    # 🎄❤❤🐪🐱💰🏷⭐🙄😍🙌👇💚😭😹🌸💛🙏👏😔🎁🥰❄🎄💤
     
    # tweet = re.sub(r'😂|🤷‍♀️|❤️|🔴|📢|✅|❎|🥘|↘️|🌻|♥️️|️🥵|🆚|📅|🕗|📍|👋|😩|😢|🙌🏾|🔥|😮|💖|😭|👄|❤|🤢|💥|💣','',tweet)
    # Stage el wa7che
    # tweet = tweet.encode('ascii', 'ignore').decode('ascii')
   
    for code in demoji_df["index"]:
        # plain string replacement: some emoji (e.g. "*⃣") are not valid regex patterns
        tweet = tweet.replace(str(code), '')
        
    return tweet

In [3]:
import demoji
demoji.download_codes()

Downloading emoji data ...
... OK (Got response in 3.40 seconds)
Writing emoji data to C:\Users\mhmh2\.demoji/codes.json ...
... OK


In [4]:
df = pd.read_json('C:/Users/mhmh2/.demoji/codes.json')
df.reset_index(inplace=True)
df

| | index | timestamp | codes |
|---|---|---|---|
| 0 | #⃣ | 2020-01-05 20:19:37.340117693 | keycap: # |
| 1 | #️⃣ | 2020-01-05 20:19:37.340117693 | keycap: # |
| 2 | *⃣ | 2020-01-05 20:19:37.340117693 | keycap: * |
| 3 | *️⃣ | 2020-01-05 20:19:37.340117693 | keycap: * |
| 4 | 0⃣ | 2020-01-05 20:19:37.340117693 | keycap: 0 |
| ... | ... | ... | ... |
| 3831 | 🪑 | 2020-01-05 20:19:37.340117693 | chair |
| 3832 | 🪒 | 2020-01-05 20:19:37.340117693 | razor |
| 3833 | 🪓 | 2020-01-05 20:19:37.340117693 | axe |
| 3834 | 🪔 | 2020-01-05 20:19:37.340117693 | diya lamp |


In [5]:
text = "┊✨ ┊ 💧رقي ┊🍃🌟💧ابداااع┊🌟┊✨┊🌟┊💧ذوووق┊🍃┊🌟┊💧فخامة┊✨🌟┊💧اناقه."
# text =' ┊ test'
# text = ' tdg@bkja "قول بذلت جهدي والباقي على الله؟ 😹😹😹😹 https://t.co/YalhezFnsl "#acsjjfac !\n @ # % ^ & * ( 📢✅❎🥘↘️🌻♥️♥️♥️🥵🆚📅🕗📍👋😩😢🙌🏾🔥😮💖😭👄❤🤢💥💣'
text = delete_unnecessary_data(text)
text

'   رقي  ابدااع    ذووق   فخامة  اناقه.'

## b) Normalize whitespace (convert multiple sequential whitespace chars into one whitespace character). ##

In [6]:
def normalize_whitespace(tweet):
    #Remove additional white spaces
    tweet = re.sub(r'\s+', ' ', tweet)  # raw string avoids the invalid-escape warning
    return tweet

In [7]:
text = normalize_whitespace(text)
text

' رقي ابدااع ذووق فخامة اناقه.'

## c) Convert hashtags into separate words, for example, the hashtag #MoroccanUsers is converted into two words Moroccan and Users. ##

In [8]:
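def sp_h(hashtagestring):
    # split CamelCase hashtags into words; acronyms (e.g. #USA) stay whole, without the '#'
    fo = re.compile(r'[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
    fi = fo.findall(hashtagestring)
    # fall back to the bare tag so lowercase or non-Latin hashtags are not dropped
    return ' '.join(fi) if fi else hashtagestring.lstrip('#')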

def split_hashtage(tweet):
    tweet = re.sub(r'#[^\s]+', lambda m: sp_h(m.group()), tweet) # #WakeUpMorocco => Wake Up Morocco
    return tweet

In [9]:
import unicodedata as ud
import string

def reamove_punctuation(tweet):
#     regex = re.compile('[%s]' % re.escape(string.punctuation))
#     tweet = regex.sub('', tweet)
#     tweet = re.sub(ud.category(c).startswith('P'),'',tweet)

    # keep word characters, whitespace and apostrophes (needed later to expand contractions)
    tweet = re.sub(r"[^\w\s']", ' ', tweet)
#     tweet = ''.join(c for c in tweet if not ud.category(c).startswith('P'))
    return tweet

## e) Create a function to detect the language used to write the tweet (Standard Arabic, French or English). ##

In [10]:
from googletrans import Translator
translator = Translator()
from langdetect import detect
from textblob import TextBlob

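def language_detction(tweet):
    # Detect the language with a chain of fallbacks, then run the matching corrector.
    try:
        lang = translator.detect(tweet).lang  # Google Translate, needs network access
    except Exception:
        try:
            lang = detect(tweet)  # langdetect
        except Exception:
            try:
                # detect_language() is deprecated in recent TextBlob releases
                lang = str(TextBlob(tweet).detect_language())
            except Exception:
                lang = 'unknown'
    
    # Persian ('fa') and the mixed codes are handled as Arabic
    if lang in ['ar', 'arfa', 'fa', 'faar'] :
        tweet = correct_ar(tweet)
    elif lang == 'fr':
        tweet = correct_fr(tweet)
    elif lang == 'en':
        tweet = correct_en(tweet)
    else:
        # any other (or unknown) language is translated to English first
        tweet = translator.translate(tweet,src='auto',dest='en' ).text
        tweet = correct_en(tweet)
    return tweet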

In [11]:
# text = 'chkon nta'
# text = "اتبارك الله على الحاج راك تما.هه"
text = 'hhh je suis encore vivant'
# text ='كيف يمكن تعزيز كفاءات هذا الجهاز بتطويره بآستخذام الذكاء الإصطناعي ؟ مثلا يمكن إستعمال قلم يغير قطر الكتابة أو تغيير لون الحبر بدون تغيير القلم فقط بإصدار الأوامر، ماذا عنكم ؟'
dt = translator.detect(text)
a = translator.translate(text,src='auto',dest='en' ).text
print(text , "\t\t", a , "\t\t", dt)

hhh je suis encore vivant 		 hhh I'm still alive 		 Detected(lang=fr, confidence=1.0)
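
The detectors used above depend on a web service or a trained model. As a cheap offline pre-check, the dominant script of a tweet can be guessed from Unicode ranges alone. The following is a sketch under stated assumptions, not part of the original pipeline: the helper name `guess_script` is hypothetical, and the heuristic only separates Arabic script from Latin script. It cannot tell French from English, nor recognise Darija written in Latin letters.

```python
def guess_script(text):
    """Return 'arabic', 'latin', or 'unknown' from the dominant letter script."""
    arabic = latin = 0
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, emoji and whitespace
        if "\u0600" <= ch <= "\u06FF":
            arabic += 1  # letter in the basic Arabic Unicode block
        elif ch.isascii() or "\u00C0" <= ch <= "\u024F":
            latin += 1  # ASCII letter or accented Latin letter (é, à, ç, ...)
    if arabic == 0 and latin == 0:
        return "unknown"
    return "arabic" if arabic >= latin else "latin"

print(guess_script("hhh je suis encore vivant"))
print(guess_script("من فضلك"))
```

A result of `'arabic'` could route the tweet straight to `correct_ar`, leaving the slower detectors for Latin-script text.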


## f) Create a function for automatic correction of spelling mistakes. ##

In [12]:
from autocorrect import Speller
spell = Speller(lang='en')

def correct_en(text):
    # convert to lower case
    text = text.lower()
    # Expand contractions. "'s" is ambiguous ("Sam is" vs. the possessive "Sam's");
    # as a compromise it is always expanded to " is ".
    text = re.sub(r"'s ", " is ", text)
    text = re.sub(r" whats ", " what is ", text, flags=re.IGNORECASE)
    text = re.sub(r"'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am", text, flags=re.IGNORECASE)
    text = re.sub(r"'re", " are ", text)
    text = re.sub(r"'d", " would ", text)
    text = re.sub(r"'ll", " will ", text)
    text = re.sub(r"e-mail", " email ", text, flags=re.IGNORECASE)
    text = re.sub(r"\(s\)", " ", text, flags=re.IGNORECASE) #mester(s)
    text = re.sub(r" (the[\s]+|The[\s]+)?(us(a)?|u\.s\.(a\.)?|united state(s)?) ", " america ", text)
    text = re.sub(r" uk ", " england ", text, flags=re.IGNORECASE)
    text = re.sub(r" imrovement ", " improvement ", text, flags=re.IGNORECASE)
    text = re.sub(r" intially ", " initially ", text, flags=re.IGNORECASE)
    text = re.sub(r" dms ", " direct messages ", text, flags=re.IGNORECASE)  
    text = re.sub(r" demonitization ", " demonetization ", text, flags=re.IGNORECASE) 
    text = re.sub(r" actived ", " active ", text, flags=re.IGNORECASE)
    text = re.sub(r" kms ", " kilometers ", text, flags=re.IGNORECASE)
    text = re.sub(r" cs ", " computer science ", text, flags=re.IGNORECASE)
    text = re.sub(r" calender ", " calendar ", text, flags=re.IGNORECASE)
    text = re.sub(r" ios ", " operating system ", text, flags=re.IGNORECASE)
    text = re.sub(r" programing ", " programming ", text, flags=re.IGNORECASE)
    text = re.sub(r" bestfriend ", " best friend ", text, flags=re.IGNORECASE)
    text = re.sub(r"bn8|god8" ,'good night', text, flags=re.IGNORECASE)
    text = re.sub(r" 2moro | 2mrrw | 2morrow | 2mrw | tomrw ", " tomorrow ", text)
    text = re.sub(r" b4 ", " before ", text)
    text = re.sub(r" otw ", " on the way ", text)

    text = spell(text)

    return text

In [13]:
from textblob import TextBlob

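def correct_fr(text):
    # NOTE: TextBlob.correct() is built on an English word list, so French words
    # tend to be "corrected" toward English ones (see the output of the next cell).
    text = TextBlob(text).correct()
    return str(text)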

In [14]:
print(correct_fr('je mangee le banane et la pom'))

je manger le banana et la pot


## d) Transforming words written in Moroccan dialect ##

Words written in Moroccan dialect (in the French or the Arabic alphabet), or in a Berber Tamazight dialect (in the French alphabet), are transformed into Standard Arabic. 

To perform this task, we gathered a dictionary of words in a Python file. This file is stored on each slave node of our cluster and imported by the NLP script executed on those nodes. The file looks like this:

In [15]:
word_ma_df = pd.read_csv( home + "Moroccan_dialect.csv", sep=';',encoding='utf-8')
word_ma_df.head(10)

| | English | Arabic | Transcribed Moroccan Arabic | Moroccan Darija in the Arabic Alphabet |
|---|---|---|---|---|
| 0 | Yes | نعم | Iyyeh | إييه |
| 1 | Yes | نعم | ah | آه |
| 2 | Yes | نعم | wah | واه |
| 3 | No | لا | Lla | لا |
| 4 | Please | من فضلك | 3afak | عافاك |
| 5 | Thanks | شكرا | Shokran | شكرا |
| 6 | I love you | انا احبك | Kanbghik | كنبغيك |
| 7 | I miss you | اشتقت اليك | Twe77eshtek | توحشتك |
| 8 | A lot | كثير | Bezzaf | بزاف |
| 9 | A little | قليل | Shwiya | شوية |


In [16]:
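def moroccan_dialect(tweet):
    tweet= ' '+tweet+' '
    for ligne in word_ma_df.values:
        # ligne[1]: Standard Arabic, ligne[2]: Latin transcription, ligne[3]: Darija in Arabic script.
        # re.escape guards against regex metacharacters; the lookarounds match whole words
        # without consuming the surrounding spaces, so adjacent dictionary words all match.
        pattern = r'(?<!\S)(?:'+re.escape(str(ligne[2]))+'|'+re.escape(str(ligne[3]))+r')(?!\S)'
        tweet = re.sub(pattern, str(ligne[1]), tweet)
    return tweet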

tweet = 'ahramon Ettebb elbaytari 3afak إييه wah 3tini'
tweet = moroccan_dialect(tweet)
tweet

' ahramon الطب البيطري من فضلك نعم نعم أعطني '

 Use scrapy to scrape 'http://mylanguages.org/moroccan_vocabulary.php'

In [21]:
word_ma_df2 = pd.read_csv( home + "mylanguages.csv", sep=',',encoding='utf-8')
word_ma_df2.head(10)

| | en | ar | mafr |
|---|---|---|---|
| 0 | numbers | أرقام | nmari |
| 1 | one | واحد | wahade |
| 2 | two | اثنان | zouje |
| 3 | three | ثلاثة | tlata |
| 4 | four | أربعة | rabaa |
| 5 | five | خمسة | khamssa |
| 6 | six | ستة | satta |
| 7 | a green tree | شجرة خضراء | shajra khdra |
| 8 | a tall building | مبنى طويل القامة | imara aliya |
| 9 | a very old man | رجل كبير جدا | rajale sharafe |


In [22]:
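def moroccan_dialect2(tweet):
    tweet= ' '+tweet+' '
    # iterate over the mylanguages.org vocabulary (word_ma_df2), not word_ma_df
    for ligne in word_ma_df2.values:
        # ligne[1]: Standard Arabic, ligne[2]: Moroccan word in the French alphabet
        pattern = r'(?<!\S)'+re.escape(str(ligne[2]))+r'(?!\S)'
        tweet = re.sub(pattern, str(ligne[1]), tweet)
    return tweet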
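
Both functions above run one regex substitution per dictionary row over the whole tweet, so the cost grows with the size of the dictionary. An alternative is to build a Python `dict` once and look up each token. The sketch below assumes a small hypothetical excerpt `DIALECT_TO_ARABIC` (entries taken from the table shown earlier), lowercased keys, and single-word entries only; multi-word entries such as `shajra khdra` would need extra phrase handling that is not shown here.

```python
# Hypothetical excerpt of the dialect dictionary, keyed by the lowercased dialect word.
DIALECT_TO_ARABIC = {
    "3afak": "من فضلك",
    "wah": "نعم",
    "إييه": "نعم",
    "bezzaf": "كثير",
}

def translate_tokens(tweet, mapping=DIALECT_TO_ARABIC):
    """Replace each whitespace-separated token that appears in the mapping."""
    out = []
    for token in tweet.split():
        # one dict lookup per token; unknown tokens pass through unchanged
        out.append(mapping.get(token.lower(), token))
    return " ".join(out)

print(translate_tokens("3afak wah wah Bezzaf merci"))
```

With this layout the work per tweet depends on the number of tokens rather than on the number of dictionary rows.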

In [17]:
# from https://alraqmiyyat.github.io/2013/01-02.html
def normalizeArabic(text):
    text = re.sub("[إأٱآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "ء", text)
    text = re.sub("ئ", "ء", text)
    return(text)

normalizeArabic(tweet)

' ahramon الطب البيطري من فضلك نعم نعم اعطني '

In [18]:
# from https://alraqmiyyat.github.io/2013/01-02.html
def deNoise(text):
    noise = re.compile(""" ّ    | # Tashdid
                             َ    | # Fatha
                             ً    | # Tanwin Fath
                             ُ    | # Damma
                             ٌ    | # Tanwin Damm
                             ِ    | # Kasra
                             ٍ    | # Tanwin Kasr
                             ْ    | # Sukun
                             ـ     # Tatwil/Kashida
                         """, re.VERBOSE)
    text = re.sub(noise, '', text)
    return text

testLine = "إِنَّ الْقُرَّاْءَ يَقْرَؤُوْنَ الْقُرْآنَ قِرَاْءَةً جَمِيْلَــــــةً"
print(deNoise(testLine))

إن القراء يقرؤون القرآن قراءة جميلة
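
The regex above lists each diacritic by hand. A similar effect can be obtained from Unicode metadata: the Arabic vowel marks belong to the category `Mn` (nonspacing mark), while the tatweel is a modifier letter and has to be handled separately. This is a sketch with a hypothetical helper name `strip_marks`, and it is more aggressive than `deNoise`: it drops every nonspacing mark, including marks from other scripts.

```python
import unicodedata

TATWEEL = "\u0640"  # kashida; category Lm, so it is not caught by the Mn test

def strip_marks(text):
    """Drop nonspacing marks (Unicode category Mn) and the tatweel character."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )

# kaf, teh, beh written with three fatha marks and a stretched (tatweel) join
sample = "\u0643\u064E\u062A\u0640\u0640\u064E\u0628\u064E"
print(strip_marks(sample))
```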


In [19]:
import pyarabic.araby as araby

ModuleNotFoundError: No module named 'pyarabic'

In [None]:
def correct_ar(text):
    text = moroccan_dialect(text)
    text = moroccan_dialect2(text)
    text = normalizeArabic(text)
    text = deNoise(text)
    
    return text

In [None]:
import gensim
from gensim.parsing.preprocessing import STOPWORDS
from nltk.corpus import stopwords

stop_words = stopwords.words('arabic')
with open("arabic_stop_words.txt", "r", encoding="utf-8") as f:
    for l in f:
        # strip() also removes a trailing "\r" left by Windows line endings
        l = l.strip()
        if l:
            stop_words.append(l)

stop_words.extend(stopwords.words('french'))
stop_words.extend(stopwords.words('english'))
stop_words.extend(gensim.parsing.preprocessing.STOPWORDS)
stop_words.extend(['tout','le','la'])
stop_words = set(stop_words)


In [None]:
import nltk
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer, PorterStemmer

stemmer = PorterStemmer()
lmtzr = WordNetLemmatizer()

def lemmatize_stemming(text):
    return stemmer.stem(lmtzr.lemmatize(text, pos='v'))


def tokenize_lemmatize_stemming(text):
    # remove punctuation
    text = re.sub(r'[^\w\s]','',text)
    # replace multiple spaces with one space
    text = re.sub(r'[\s]+',' ',text)
    # convert text to lowercase
    text = text.lower() 
    # tokenize text (empty tokens are filtered out below)
    tokens = re.split(" ", text)

    # Remove stop words 
    result = []
    for token in tokens :
        if token not in stop_words and len(token) > 1:
            result.append(lemmatize_stemming(token))

    return result

In [None]:
def preprocess(tweet):
    tweet = delete_unnecessary_data(tweet)
    tweet = normalize_whitespace(tweet)
    tweet = split_hashtage(tweet)
    tweet = reamove_punctuation(tweet)
    tweet = language_detction(tweet)
#     return tweet
    tokens = tokenize_lemmatize_stemming(tweet)
    return tokens

In [None]:
# text = "┊✨ ┊ 💧رقي ┊🍃🌟💧ابداااع┊🌟┊✨┊🌟┊💧ذوووق┊🍃┊🌟┊💧فخامة┊✨🌟┊💧اناقه."
# text = '   رقي  ابدااع    ذووق   فخامة  اناقه.'
text =' RT @PsiliLemonia: Πώς να βάλεις φωτιά στο κρεβάτι σας με εφτά απλές κινή'
# text = "Se realmente quisessem estar do meu lado iam estar, e quem tá já basta viu!"
# text = preprocess("قول بذلت جهدي والباقي على الله؟ 😹😹😹😹 https://t.co/YalhezFnsl ")
text = preprocess(text)
# print(delete_unnecessary_data("قول بذلت جهدي والباقي على الله؟ 😹😹😹😹 https://t.co/YalhezFnsl "))
print(text)

# dt = translator.detect(text).lang

# tweet = translator.translate(text,src='auto',dest='en' ).text
# print(text , "\n",dt)

### start

In [None]:
tweets_df = pd.read_csv(home+'tweets2.csv',encoding='utf-8')
tweets_df.columns = ['user','date','text']
tweets_df = tweets_df.astype({'text': str})  # astype returns a new DataFrame, so keep the result

tweets_df

In [None]:
tweet = tweets_df['text'][1]
# tweet = '@jawaddelycee Bah ouais logique bg aaaaaaa'
preprocess(tweet)

In [None]:
# tokens = []
# for tweet in tqdm(tweets_df['text']):
#     tokens.append(preprocess(tweet))
        

https://www.machinelearningplus.com/nlp/gensim-tutorial/

## How to create Topic Models with LDA?

<p>
The objective of topic models is to extract the underlying topics from a given collection of text documents. Each document in the collection is treated as a mixture of topics, and each topic as a mixture of related words.
</p>

<p>
Topic modeling can be done by algorithms like Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI).
</p>

<p>
In both cases you need to provide the number of topics as input. The topic model, in turn, will provide the topic keywords for each topic and the percentage contribution of topics in each document.
</p>
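
To make the idea of "documents as mixtures of topics, topics as mixtures of words" concrete, the toy sketch below computes the word distribution of one document as p(word | doc) = Σ p(topic | doc) · p(word | topic). All topic names, words and numbers are made up for illustration; they are not the output of a trained model.

```python
# Made-up topic-word distributions p(word | topic); each row sums to 1.
topic_words = {
    "sport":   {"match": 0.5, "team": 0.4, "price": 0.1},
    "economy": {"match": 0.0, "team": 0.1, "price": 0.9},
}

# Made-up topic mixture p(topic | doc) for a single document; sums to 1.
doc_topics = {"sport": 0.75, "economy": 0.25}

def word_distribution(doc_topics, topic_words):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    dist = {}
    for topic, weight in doc_topics.items():
        for word, prob in topic_words[topic].items():
            dist[word] = dist.get(word, 0.0) + weight * prob
    return dist

dist = word_distribution(doc_topics, topic_words)
for word, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {prob:.3f}")
```

A topic model such as LDA works in the opposite direction: given only the observed words and the number of topics, it estimates both tables.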

<p>
The quality of the topics depends heavily on the quality of the text preprocessing and on the number of topics you provide to the algorithm. The earlier post on how to build the best topic models explains the procedure in more detail. However, I recommend first understanding the basic steps involved and how to interpret the example below.
</p>


### Step 0: Load the necessary packages and import the stopwords.