


<font size='10' color = 'E3A440'>**Big Data and Advanced Techniques Demystified**</font>
=======
<font color = 'E3A440'>*New analysis methods and their implications for big data management in the humanities and social sciences (part 1)*</font>
=============


This workshop is part of the training course [Mégadonnées et techniques avancées démystifiées](https://www.4point0.ca/2022/08/22/formation-megadonnees-demystifiees/) (session 6).

The humanities and social sciences often have to analyze unstructured data, such as text. Once the data have been prepared, several analysis techniques from machine learning can be applied. In this workshop, participants are introduced to supervised and unsupervised methods for data analysis with Python.

Note: This workshop continues in a second session on November 10.

Workshop structure:
1. Presentation of sections 1 and 2 in plenary mode (20 minutes)
2. Individual work on section 3 (20 minutes)
3. Group work on section 4 (60 minutes)
4. Plenary session with group presentations (20 minutes)

This tutorial should not be considered exhaustive ....

### Authors: 
- Bruno Agard <bruno.agard@polymtl.ca>
- Davide Pulizzotto <davide.pulizzotto@polymtl.ca>

### Table of Contents
Bruno Agard

Département de Mathématiques et de génie Industriel

École Polytechnique de Montréal

# Environment setup

In [136]:
# Download the data from the GitHub project
!rm -rf Donnees_demystifiees_seance_6/
!git clone https://github.com/puli83/Donnees_demystifiees_seance_6

Cloning into 'Donnees_demystifiees_seance_6'...
remote: Enumerating objects: 52, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 52 (delta 15), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (52/52), done.


In [3]:
# Import modules
import nltk
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

# <font color = 'E3A440'>*Text data preparation*</font>

Analyzing textual data requires transforming a text into a mathematical object that algorithms and statistical models can use. This step is important because it **structures** unstructured data such as text.


###  <font color = 'E3A440'>**Fundamental preprocessing steps**</font>

Let's take a sentence and break down the steps that turn it into structured information.

In [3]:
sentence = """At eight o'clock, on Thursday morning, the great Arthur didn't feel VERY good."""

In [4]:
len(sentence)

78

#### <font color = 'E3A440'>*1. Tokenization*</font>

This step splits the sentence into elementary, meaningful linguistic units, generally called "words".

The `nltk` module provides a function for this operation: `word_tokenize()`.

In [5]:
# The word_tokenize() function takes the sentence as its argument.
words = nltk.word_tokenize(sentence)
print(words)
len(words)

['At', 'eight', "o'clock", ',', 'on', 'Thursday', 'morning', ',', 'the', 'great', 'Arthur', 'did', "n't", 'feel', 'VERY', 'good', '.']


17
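To see what a tokenizer does, here is a minimal sketch based on a regular expression from the standard library. It is a simplified stand-in, not the algorithm behind `word_tokenize()`; the pattern and the name `simple_tokenize` are assumptions made for illustration.

```python
import re

def simple_tokenize(text):
    # Match either a word, optionally followed by an apostrophe and more
    # word characters (o'clock, didn't), or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

sentence = "At eight o'clock, on Thursday morning, the great Arthur didn't feel VERY good."
tokens = simple_tokenize(sentence)
print(tokens)
print(len(tokens))
```

Unlike `word_tokenize()`, this sketch keeps `didn't` as a single token instead of splitting it into `did` and `n't`, so it yields 16 tokens rather than 17.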

#### <font color = 'E3A440'>*2. Part-of-speech tagging*</font>

Once all the words have been identified, their part of speech (POS) can be determined, for analysis and/or filtering purposes.

In [6]:
# The pos_tag() function takes the list of words as its argument.
words_pos = nltk.pos_tag(words, tagset='universal')
print(words_pos)
len(words_pos)

[('At', 'ADP'), ('eight', 'NUM'), ("o'clock", 'NOUN'), (',', '.'), ('on', 'ADP'), ('Thursday', 'NOUN'), ('morning', 'NOUN'), (',', '.'), ('the', 'DET'), ('great', 'ADJ'), ('Arthur', 'NOUN'), ('did', 'VERB'), ("n't", 'ADV'), ('feel', 'VERB'), ('VERY', 'ADV'), ('good', 'ADJ'), ('.', '.')]


17

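Here is a list of possible POS tags. Note that the `universal` tagset used with NLTK above is a smaller set: for example, punctuation is tagged `.` rather than `PUNCT`, as the output shows.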

| **POS** | **DESCRIPTION**           | **EXAMPLES**                                      |
| ------- | ------------------------- | ------------------------------------------------- |
| ADJ     | adjective                 | big, old, green, incomprehensible, first      |
| ADP     | adposition                | in, to, during                                |
| ADV     | adverb                    | very, tomorrow, down, where, there            |
| AUX     | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ    | conjunction               | and, or, but                                  |
| CCONJ   | coordinating conjunction  | and, or, but                                  |
| DET     | determiner                | a, an, the                                    |
| INTJ    | interjection              | psst, ouch, bravo, hello                      |
| NOUN    | noun                      | girl, cat, tree, air, beauty                  |
| NUM     | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART    | particle                  | ’s, not                                      |
| PRON    | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN   | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT   | punctuation               | ., (, ), ?                                    |
| SCONJ   | subordinating conjunction | if, while, that                               |
| SYM     | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :)               |
| VERB    | verb                      | run, runs, running, eat, ate, eating          |
| X       | other                     | sfpksdpsxmsa                                  |
| SPACE   | space                     |                                                   |


#### <font color = 'E3A440'>*3. Removing punctuation*</font>

Another operation is to remove punctuation. This kind of filtering reduces the number of graphic signs that contribute least to the meaning of the sentence.
In some contexts, such as stylometry, this step is carried out with more sophisticated techniques.

In [7]:
# The following line iterates over each token and keeps those made up only of alphanumeric characters.
words_pos = [(w, pos) for w, pos in words_pos if w.isalnum()]
print(words_pos)
len(words_pos)
# The result of the POS tagging can also be used to remove punctuation:
# words_pos = [(w, pos) for w, pos in words_pos if pos != '.']
# print(words_pos)

[('At', 'ADP'), ('eight', 'NUM'), ('on', 'ADP'), ('Thursday', 'NOUN'), ('morning', 'NOUN'), ('the', 'DET'), ('great', 'ADJ'), ('Arthur', 'NOUN'), ('did', 'VERB'), ('feel', 'VERB'), ('VERY', 'ADV'), ('good', 'ADJ')]


12
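The output above no longer contains `o'clock` or `n't`: `str.isalnum()` returns `False` as soon as a token contains a non-alphanumeric character such as an apostrophe. The sketch below starts from the token list printed earlier and contrasts this strict filter with a more lenient one. The lenient variant is an assumption made for illustration, not part of the workshop's pipeline.

```python
tokens = ['At', 'eight', "o'clock", ',', 'on', 'Thursday', 'morning', ',',
          'the', 'great', 'Arthur', 'did', "n't", 'feel', 'VERY', 'good', '.']

# Strict filter: every character of the token must be alphanumeric
strict = [w for w in tokens if w.isalnum()]

# Lenient filter: at least one character of the token must be alphanumeric
lenient = [w for w in tokens if any(ch.isalnum() for ch in w)]

print(len(strict), len(lenient))
print([w for w in lenient if w not in strict])
```

Which filter is preferable depends on whether tokens such as `o'clock` matter for the analysis.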

#### <font color = 'E3A440'>*4. Converting every character to lowercase*</font>

This step is a first normalization of the words, reducing them to a single written form. It groups every occurrence of a word under one form.

In [8]:
# The following line iterates over each token and converts it to lowercase.
words_pos = [(w.lower(), pos) for w, pos in words_pos]
print(words_pos)

[('at', 'ADP'), ('eight', 'NUM'), ('on', 'ADP'), ('thursday', 'NOUN'), ('morning', 'NOUN'), ('the', 'DET'), ('great', 'ADJ'), ('arthur', 'NOUN'), ('did', 'VERB'), ('feel', 'VERB'), ('very', 'ADV'), ('good', 'ADJ')]


#### <font color = 'E3A440'>*5. Removing stopwords*</font>

Another filtering operation consists in removing function words. This list contains all sentence connectors, such as "and", "but" and "however", as well as words with little semantic value, such as modal verbs.
As with the other filtering operations, the goal is to clean the vocabulary as much as possible and to reduce all occurrences of a word to a single written form.

In [9]:
# Import the list of English stopwords
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [10]:
# The following lines build the stopword set once, then keep the tokens that are not in it.
english_stopwords = set(stopwords.words("english"))
words_pos = [(w, pos) for w, pos in words_pos if w not in english_stopwords]
print(words_pos)
len(words_pos)

[('eight', 'NUM'), ('thursday', 'NOUN'), ('morning', 'NOUN'), ('great', 'ADJ'), ('arthur', 'NOUN'), ('feel', 'VERB'), ('good', 'ADJ')]


7

#### <font color = 'E3A440'>*6. Reducing words to their root*</font>

With the same goal, we strip the morphological suffix from words, which further reduces the occurrences of a word to a single written form.

There are two fundamental methods: stemming and lemmatization.
The first reduces occurrences to a root inferred by various heuristics; the second reduces each occurrence to its lemma.

In [11]:
# Stemming: Porter algorithm
from nltk.stem.porter import PorterStemmer
stemmed_pos = [(PorterStemmer().stem(w), pos) for w, pos in words_pos]
print(stemmed_pos)

[('eight', 'NUM'), ('thursday', 'NOUN'), ('morn', 'NOUN'), ('great', 'ADJ'), ('arthur', 'NOUN'), ('feel', 'VERB'), ('good', 'ADJ')]


In [12]:
# Stemming: Lancaster algorithm
from nltk.stem import LancasterStemmer
stemmed_pos = [(LancasterStemmer().stem(w), pos) for w, pos in words_pos]
print(stemmed_pos)

[('eight', 'NUM'), ('thursday', 'NOUN'), ('morn', 'NOUN'), ('gre', 'ADJ'), ('arth', 'NOUN'), ('feel', 'VERB'), ('good', 'ADJ')]


In [13]:
# Lemmatization: using the WordNet thesaurus
from nltk.stem.wordnet import WordNetLemmatizer
lemmed_pos = [(WordNetLemmatizer().lemmatize(w), pos) for w, pos in words_pos]
print(lemmed_pos)

[('eight', 'NUM'), ('thursday', 'NOUN'), ('morning', 'NOUN'), ('great', 'ADJ'), ('arthur', 'NOUN'), ('feel', 'VERB'), ('good', 'ADJ')]
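The difference between the two approaches can be sketched with a toy example using only the standard library. The suffix rules and the tiny lemma table below are assumptions made for illustration; the Porter, Lancaster and WordNet implementations are far more elaborate.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    # Rule-based: strip the first matching suffix, keeping at least three characters
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Dictionary-based: look the word up in a (hypothetical, tiny) lemma table
TOY_LEMMAS = {"felt": "feel", "dogs": "dog"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for w in ["morning", "felt", "dogs"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based stemmer turns `morning` into `morn`, which is not a word, while the dictionary lookup returns real words but only for the entries it knows.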


#### <font color = 'E3A440'>*7. Filtering by part of speech*</font>

Filtering lexical units can go as far as removing the units that do not belong to a predefined list of POS tags.

In [14]:
# Keep only nouns and adjectives
lemmed_pos = [(w, pos) for w, pos in lemmed_pos if pos in ['NOUN','ADJ']]
print(lemmed_pos)

[('thursday', 'NOUN'), ('morning', 'NOUN'), ('great', 'ADJ'), ('arthur', 'NOUN'), ('good', 'ADJ')]


## <font color = 'E3A440'>Processing a text</font>

Preprocessing a research corpus can involve several other steps. The most important one is **segmentation**.

### <font color = 'E3A440'>*1. Segmentation*</font>

Depending on the goal of the analysis, a segment can be a document, a paragraph, a concordance, a group of sentences, a sentence, etc.



In [15]:
text = """At eight o'clock, on Thursday morning, the great Arthur didn't feel VERY good.
          The following morning, at nine, Arthur felt better.
          A dog run in the street."""
len(text)

175

In the following code block, we segment the text by sentence.

In [34]:
sentences = nltk.sent_tokenize(text)
print(sentences)
len(sentences)

["At eight o'clock, on Thursday morning, the great Arthur didn't feel VERY good.", 'The following morning, at nine, Arthur felt better.', 'A dog run in the street.']


3
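For comparison, here is a naive sentence splitter written with a regular expression. It is only a sketch under the assumption that every `.`, `!` or `?` followed by whitespace ends a sentence; `sent_tokenize()` relies instead on the pre-trained `punkt` model downloaded during setup.

```python
import re

text = """At eight o'clock, on Thursday morning, the great Arthur didn't feel VERY good.
          The following morning, at nine, Arthur felt better.
          A dog run in the street."""

# Split on the whitespace that follows a sentence-ending punctuation mark
naive_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
print(len(naive_sentences))

# Limitation: an abbreviation is mistaken for the end of a sentence
abbreviation_split = re.split(r"(?<=[.!?])\s+", "Dr. Smith arrived. He sat down.")
print(abbreviation_split)
```

The naive rule works on this short text but cuts after `Dr.`, which illustrates why a dedicated sentence tokenizer is usually the safer choice.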

### <font color = 'E3A440'>*2. Annotation and cleaning*</font>



#### <font color = 'E3A440'>*2.1 Creating a function*</font>

Creating a function is useful for several reasons. In our case, we want to wrap the preprocessing steps seen above into a single function that can be applied to any segment.

In [35]:
# To run this function properly, nltk must be imported and its data packages downloaded (see the setup cells above)
def CleaningText(text_as_string, language = 'english', reduce = '', list_pos_to_keep = None, Stopwords_to_add = None):
    import warnings
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    from nltk.stem.wordnet import WordNetLemmatizer

    # Avoid mutable default arguments
    list_pos_to_keep = list_pos_to_keep or []
    Stopwords_to_add = Stopwords_to_add or []

    # Tokenize, tag, keep purely alphanumeric tokens, and lowercase
    words = nltk.word_tokenize(text_as_string)
    words_pos = nltk.pos_tag(words, tagset='universal')
    words_pos = [(w, pos) for w, pos in words_pos if w.isalnum()]
    words_pos = [(w.lower(), pos) for w, pos in words_pos]

    # Reduce each word to its stem or lemma, if requested
    if reduce == 'stem':
        stemmer = PorterStemmer()
        reduced_words_pos = [(stemmer.stem(w), pos) for w, pos in words_pos]
    elif reduce == 'lemma':
        lemmatizer = WordNetLemmatizer()
        reduced_words_pos = [(lemmatizer.lemmatize(w), pos) for w, pos in words_pos]
    else:
        reduced_words_pos = words_pos
        warnings.warn("No reduction was applied to the words. Use the \"reduce\" argument to choose between 'stem' and 'lemma'.")

    # Keep only the requested POS tags, if any
    if list_pos_to_keep:
        reduced_words_pos = [(w, pos) for w, pos in reduced_words_pos if pos in list_pos_to_keep]
    else:
        warnings.warn("No POS filtering was applied. Use \"list_pos_to_keep\" to provide a list of POS tags to keep.")

    # Remove stopwords and single-character tokens
    list_stopwords = set(stopwords.words(language) + Stopwords_to_add)
    reduced_words_pos = [(w, pos) for w, pos in reduced_words_pos if w not in list_stopwords and len(w) > 1]
    return reduced_words_pos



In [145]:
# Step-by-step version of CleaningText() applied to a single tweet.
# text_as_string is assigned in a later cell (df['Tweet Content'].iloc[0]).
language = 'english'
words = nltk.word_tokenize(text_as_string)
words_pos = nltk.pos_tag(words, tagset='universal')
words_pos = [(w, pos) for w, pos in words_pos if w.isalnum()]
words_pos = [(w.lower(), pos) for w, pos in words_pos]
from nltk.stem.wordnet import WordNetLemmatizer
reduced_words_pos = [(WordNetLemmatizer().lemmatize(w), pos) for w, pos in words_pos]
reduced_words_pos = [(w, pos) for w, pos in reduced_words_pos if pos in ['NOUN','ADJ','VERB']]
list_stopwords = stopwords.words(language) + ['http']
words_pos = [(w, pos) for w, pos in reduced_words_pos if w not in list_stopwords ]
# This displays the list before stopword removal, which is why 'is' and 'http' still appear below
reduced_words_pos



[('entire', 'ADJ'),
 ('swiss', 'ADJ'),
 ('football', 'NOUN'),
 ('league', 'NOUN'),
 ('is', 'VERB'),
 ('hold', 'NOUN'),
 ('postponing', 'VERB'),
 ('game', 'NOUN'),
 ('professional', 'ADJ'),
 ('amateur', 'ADJ'),
 ('level', 'NOUN'),
 ('coronavirus', 'ADJ'),
 ('http', 'NOUN')]

#### <font color = 'E3A440'>*2.2 Applying the cleaning function*</font>

We can now apply this function to each of our segments. In our case, these are sentences.

In [37]:
cleaned_sentences = [CleaningText(sent) for sent in sentences]
print(cleaned_sentences)

[[('eight', 'NUM'), ('thursday', 'NOUN'), ('morning', 'NOUN'), ('great', 'ADJ'), ('arthur', 'NOUN'), ('feel', 'VERB'), ('good', 'ADJ')], [('following', 'ADJ'), ('morning', 'NOUN'), ('nine', 'NUM'), ('arthur', 'NOUN'), ('felt', 'VERB'), ('better', 'ADV')], [('dog', 'NOUN'), ('run', 'NOUN'), ('street', 'NOUN')]]




In [None]:
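# Fill in list_pos_to_keep with the POS tags to keep, e.g. ['NOUN', 'ADJ']; the placeholder [''] matches no tag and returns empty lists
cleaned_sentences = [CleaningText(sent, reduce = 'lemma', list_pos_to_keep = ['']) for sent in sentences]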
print(cleaned_sentences)

#### <font color = 'E3A440'>*2.3 Word frequency*</font>

How frequent are the words in our text? Let's start by dropping the POS tags, to obtain a list of words (rather than a list of word-POS tuples).

In [48]:
freqs_in_text = nltk.FreqDist([w for sent in cleaned_sentences for w, pos in sent ])
freqs_in_text

FreqDist({'morning': 2, 'arthur': 2, 'eight': 1, 'thursday': 1, 'great': 1, 'feel': 1, 'good': 1, 'following': 1, 'nine': 1, 'felt': 1, ...})
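`nltk.FreqDist` behaves much like `collections.Counter` from the standard library. The sketch below counts the same words with `Counter`, starting from the cleaned sentences printed earlier and copied here by hand so that the block is self-contained.

```python
from collections import Counter

cleaned_sentences = [
    [('eight', 'NUM'), ('thursday', 'NOUN'), ('morning', 'NOUN'), ('great', 'ADJ'),
     ('arthur', 'NOUN'), ('feel', 'VERB'), ('good', 'ADJ')],
    [('following', 'ADJ'), ('morning', 'NOUN'), ('nine', 'NUM'), ('arthur', 'NOUN'),
     ('felt', 'VERB'), ('better', 'ADV')],
    [('dog', 'NOUN'), ('run', 'NOUN'), ('street', 'NOUN')],
]

# Flatten the sentences, drop the POS tags, and count each word
freqs = Counter(w for sent in cleaned_sentences for w, pos in sent)
print(freqs.most_common(2))
```

Only `morning` and `arthur` occur twice; every other word occurs once.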

### <font color = 'E3A440'>*3. Vectorization*</font>

Generally, to use text in data analysis or machine learning, the text must be transformed into an appropriate mathematical object.
The simplest and best-known model is the "bag-of-words", in which each document (or each word) is described by the lexical units that characterize it.

In [38]:
# Initialize the vectorizer
from nltk.corpus import stopwords

def identity_tokenizer(text):
    # The documents are already lists of tokens, so return them unchanged
    return text

# Transform the words into frequencies
vectorized = CountVectorizer(lowercase = False, # Do not lowercase: each document is already a list of lowercased tokens
                             min_df = 1, # Ignore terms that have a document frequency strictly lower than the given threshold
                             max_df = 10, # Ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)
                             stop_words = stopwords.words('english'), # Remove the words in this list
                             ngram_range = (1, 1), # Lower and upper bounds of the n-gram sizes to extract (here, single words only)
                             tokenizer=identity_tokenizer) # Override the string tokenization step while preserving the preprocessing and n-grams generation steps

The "vectorizer" is used with a list of lists of words (rather than a list of word-POS tuples).

In [49]:
# List of lists of words:
[[w for w, pos in sent] for sent in cleaned_sentences]

[['eight', 'thursday', 'morning', 'great', 'arthur', 'feel', 'good'],
 ['following', 'morning', 'nine', 'arthur', 'felt', 'better'],
 ['dog', 'run', 'street']]

In [79]:
# Apply the vectorizer and display the document-term matrix with one column per term
freq_term_DTM = vectorized.fit_transform([[w for w, pos in sent] for sent in cleaned_sentences])
print(pd.DataFrame(freq_term_DTM.todense(), columns =  [k for k, v in sorted(vectorized.vocabulary_.items(), key=lambda item: item[1])] ))

   arthur  better  dog  eight  feel  felt  following  good  great  morning  \
0       1       0    0      1     1     0          0     1      1        1   
1       1       1    0      0     0     1          1     0      0        1   
2       0       0    1      0     0     0          0     0      0        0   

   nine  run  street  thursday  
0     0    0       0         1  
1     1    0       0         0  
2     0    1       1         0  
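To make the bag-of-words representation concrete, the following sketch rebuilds the same document-term matrix by hand with plain Python lists. It is an illustration of the idea rather than of how `CountVectorizer` is implemented; the variable names are assumptions.

```python
docs = [
    ['eight', 'thursday', 'morning', 'great', 'arthur', 'feel', 'good'],
    ['following', 'morning', 'nine', 'arthur', 'felt', 'better'],
    ['dog', 'run', 'street'],
]

# Vocabulary: every distinct term, in alphabetical order
vocabulary = sorted({w for doc in docs for w in doc})

# One row per document, one column per term, each cell holding a count
dtm = [[doc.count(term) for term in vocabulary] for doc in docs]

print(vocabulary)
for row in dtm:
    print(row)
```

Each row is the vector that represents one sentence; word order is lost, which is what the "bag" metaphor refers to.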


## Exercise: Sentiment analysis on a COVID-19 Twitter dataset

In [130]:
import os

ROOT_DIR='Donnees_demystifiees_seance_6/'
DATA_DIR=os.path.join(ROOT_DIR, 'Data')
os.listdir(DATA_DIR)

['Top_50_tweet_profiles.zip', '4POINT0_Top_50_tweet_profiles.zip']

In [117]:
!pwd

/content


In [137]:
import zipfile

# Unzip the archive and load the pickled DataFrame it contains
with zipfile.ZipFile(os.path.join(DATA_DIR,'4POINT0_Top_50_tweet_profiles.zip'), 'r') as zip_ref:
    zip_ref.extractall(DATA_DIR)

# The archive is extracted into DATA_DIR, so the pickle is read from there
df = pd.read_pickle(os.path.join(DATA_DIR, 'Top_50_tweet_profiles.pkl'))

In [138]:
df

Unnamed: 0,Tweet Id,Tweet URL,Tweet Posted Time,Tweet Content,Tweet Type,Client,Retweets received,Likes received,User Id,Name,Username,Verified or Non-Verified,Profile URL,Protected or Not Protected
0,1220182010972557313,https://twitter.com/cnnbrk/status/122018201097...,2020-01-23 03:10:35,"""At least one person was killed and several ot...",Tweet,TweetDeck,354,534,"""428333""","""CNN Breaking News""",cnnbrk,Verified,https://twitter.com/cnnbrk,Not Protected
1,1220152440315613185,https://twitter.com/cnnbrk/status/122015244031...,2020-01-23 01:13:05,"""Another inmate has died at Mississippi's Parc...",Tweet,TweetDeck,162,281,"""428333""","""CNN Breaking News""",cnnbrk,Verified,https://twitter.com/cnnbrk,Not Protected
2,1220133332702375937,https://twitter.com/cnnbrk/status/122013333270...,2020-01-22 23:57:10,"""Eli Manning, quarterback for the New York Gia...",Tweet,TweetDeck,199,1067,"""428333""","""CNN Breaking News""",cnnbrk,Verified,https://twitter.com/cnnbrk,Not Protected
3,1220108211191209984,https://twitter.com/cnnbrk/status/122010821119...,2020-01-22 22:17:20,"""Rapper and singer Juice WRLD died from an acc...",Tweet,TweetDeck,329,923,"""428333""","""CNN Breaking News""",cnnbrk,Verified,https://twitter.com/cnnbrk,Not Protected
4,1220107262741618688,https://twitter.com/cnnbrk/status/122010726274...,2020-01-22 22:13:34,"""Four people were killed when a small plane cr...",Tweet,TweetDeck,76,146,"""428333""","""CNN Breaking News""",cnnbrk,Verified,https://twitter.com/cnnbrk,Not Protected
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
153467,865359712245698561,https://twitter.com/niallofficial/status/86535...,2017-05-19 00:13:27,"""Great day with all the radio stations in Chic...",Tweet,Twitter for iPhone,18247,60261,"""105119490""","""Niall Horan""",niallofficial,Verified,https://twitter.com/niallofficial,Not Protected
153468,865281066797383681,https://twitter.com/theellenshow/status/865281...,2017-05-18 19:00:57,""".@NiallOfficial's new music is here, and you ...",Retweet,Twitter Web Client,10578,30754,"""15846407""","""Ellen DeGeneres""",theellenshow,Verified,https://twitter.com/theellenshow,Not Protected
153469,865269249857794049,https://twitter.com/niallofficial/status/86526...,2017-05-18 18:13:59,"""@BezChristiaan happy birthday dude ! Shooting...",Tweet,Twitter for iPhone,1675,4114,"""105119490""","""Niall Horan""",niallofficial,Verified,https://twitter.com/niallofficial,Not Protected
153470,865268733195026432,https://twitter.com/niallofficial/status/86526...,2017-05-18 18:11:56,"""Sad to hear about Chris Cornell . Was literal...",Tweet,Twitter for iPhone,17402,58310,"""105119490""","""Niall Horan""",niallofficial,Verified,https://twitter.com/niallofficial,Not Protected


In [106]:
import os
os.listdir()

['.config',
 'COVID-19 (1).zip',
 'COVID-images.csv',
 'Top_50_profile.zip',
 'Top_50_profile (1).zip',
 'COVID-videos.csv',
 'COVID.csv',
 'COVID-19.zip',
 'sample_data']

In [None]:
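# Select English tweets, if the dataset provides a 'Tweet Language' column
# (check df.dtypes: the column is not always present)
if 'Tweet Language' in df.columns:
    df = df[df['Tweet Language']=='English'].reset_index()
df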

In [139]:
df.dtypes

Tweet Id                              object
Tweet URL                             object
Tweet Posted Time             datetime64[ns]
Tweet Content                         object
Tweet Type                            object
Client                                object
Retweets received                      int64
Likes received                         int64
User Id                               object
Name                                  object
Username                              object
Verified or Non-Verified              object
Profile URL                           object
Protected or Not Protected            object
dtype: object

In [140]:
df.iloc[0]

Tweet Id                                                    1220182010972557313
Tweet URL                     https://twitter.com/cnnbrk/status/122018201097...
Tweet Posted Time                                           2020-01-23 03:10:35
Tweet Content                 "At least one person was killed and several ot...
Tweet Type                                                                Tweet
Client                                                                TweetDeck
Retweets received                                                           354
Likes received                                                              534
User Id                                                                "428333"
Name                                                        "CNN Breaking News"
Username                                                                 cnnbrk
Verified or Non-Verified                                               Verified
Profile URL                             

In [142]:
df.shape

(153471, 14)

In [141]:
df['Tweet Content']

0         "At least one person was killed and several ot...
1         "Another inmate has died at Mississippi's Parc...
2         "Eli Manning, quarterback for the New York Gia...
3         "Rapper and singer Juice WRLD died from an acc...
4         "Four people were killed when a small plane cr...
                                ...                        
153467    "Great day with all the radio stations in Chic...
153468    ".@NiallOfficial's new music is here, and you ...
153469    "@BezChristiaan happy birthday dude ! Shooting...
153470    "Sad to hear about Chris Cornell . Was literal...
153471    "How do you celebrate turning 23? 🤔 \n\nWith a...
Name: Tweet Content, Length: 153471, dtype: object

This operation takes about 2 minutes, so we suggest that you keep reading the exercise to get an idea of what is coming.

In [143]:
cleaned_tweets = [CleaningText(sent, reduce = 'lemma', list_pos_to_keep = ['NOUN', 'ADJ', 'VERB'], Stopwords_to_add=['http']) for sent in list(df['Tweet Content'])]


In [147]:
# Initialize the vectorizer
from nltk.corpus import stopwords

def identity_tokenizer(text):
    # The documents are already lists of tokens, so return them unchanged
    return text

# Transform the words into frequencies
vectorized = CountVectorizer(lowercase = False, # Do not lowercase: each document is already a list of lowercased tokens
                             min_df = 50, # Ignore terms that have a document frequency strictly lower than the given threshold
                             max_df = 50000, # Ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)
                             stop_words = stopwords.words('english'), # Remove the words in this list
                             ngram_range = (1, 1), # Lower and upper bounds of the n-gram sizes to extract (here, single words only)
                             tokenizer=identity_tokenizer) # Override the string tokenization step while preserving the preprocessing and n-grams generation steps

In [148]:
freq_term_DTM = vectorized.fit_transform([[w for w, pos in sent] for sent in cleaned_tweets])

In [149]:
freq_term_DTM

<153471x3500 sparse matrix of type '<class 'numpy.int64'>'
	with 883083 stored elements in Compressed Sparse Row format>

In [150]:
df['Tweet Content'].iloc[[0,1,2,3]]

0    Also the entire Swiss Football League is on ho...
1    World Health Org Official: Trump’s press confe...
2    I mean, Liberals are cheer-leading this #Coron...
3    Under repeated questioning, Pompeo refuses to ...
Name: Tweet Content, dtype: object

In [147]:
print(CleaningText(df['Tweet Content'].iloc[0], reduce = 'lemma', list_pos_to_keep = ['NOUN', 'ADJ', 'VERB'], Stopwords_to_add=['http']))

text_as_string = df['Tweet Content'].iloc[0]


In [150]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}

In [7]:
sia.polarity_scores("NLTK is not bad!")

{'neg': 0.0, 'neu': 0.488, 'pos': 0.512, 'compound': 0.484}

In [8]:
sia.polarity_scores("NLTK is bad!")

{'neg': 0.655, 'neu': 0.345, 'pos': 0.0, 'compound': -0.5848}
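The three examples above suggest how a lexicon-based analyzer works: words carry a polarity, and a negation such as "not" reverses it. The toy scorer below sketches that idea with the standard library only. Its lexicon, its negation list and its scoring rule are assumptions made for illustration and do not reproduce VADER's actual lexicon or rules.

```python
import re

# Hypothetical, tiny polarity lexicon
TOY_LEXICON = {"good": 1.0, "powerful": 1.0, "bad": -1.0}
NEGATIONS = {"not", "never", "no"}

def toy_polarity(text):
    words = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, word in enumerate(words):
        value = TOY_LEXICON.get(word, 0.0)
        # Flip the polarity when the previous word is a negation
        if i > 0 and words[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score

for example in ["Wow, NLTK is really powerful!", "NLTK is not bad!", "NLTK is bad!"]:
    print(example, toy_polarity(example))
```

As with the `compound` scores above, "not bad" ends up positive and "bad" negative, although the toy scores are not on VADER's scale.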

In [None]:
df.iloc[[0,1,2,3]]

In [151]:
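# Compute the VADER scores (neg, neu, pos, compound) for each tweet
datasent = df.apply(lambda x: sia.polarity_scores(x['Tweet Content']), axis=1)
# Add one column per score to the DataFrame
df = df.join(pd.DataFrame(list(datasent)))
df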

Unnamed: 0,Tweet Id,Tweet URL,Tweet Posted Time,Tweet Content,Tweet Type,Client,Retweets received,Likes received,User Id,Name,Username,Verified or Non-Verified,Profile URL,Protected or Not Protected,neg,neu,pos,compound
0,1220182010972557313,https://twitter.com/cnnbrk/status/122018201097...,2020-01-23 03:10:35,"""At least one person was killed and several ot...",Tweet,TweetDeck,354,534,"""428333""","""CNN Breaking News""",cnnbrk,Verified,https://twitter.com/cnnbrk,Not Protected,0.231,0.769,0.000,-0.8020
1,1220152440315613185,https://twitter.com/cnnbrk/status/122015244031...,2020-01-23 01:13:05,"""Another inmate has died at Mississippi's Parc...",Tweet,TweetDeck,162,281,"""428333""","""CNN Breaking News""",cnnbrk,Verified,https://twitter.com/cnnbrk,Not Protected,0.340,0.660,0.000,-0.8957
2,1220133332702375937,https://twitter.com/cnnbrk/status/122013333270...,2020-01-22 23:57:10,"""Eli Manning, quarterback for the New York Gia...",Tweet,TweetDeck,199,1067,"""428333""","""CNN Breaking News""",cnnbrk,Verified,https://twitter.com/cnnbrk,Not Protected,0.000,0.765,0.235,0.8271
3,1220108211191209984,https://twitter.com/cnnbrk/status/122010821119...,2020-01-22 22:17:20,"""Rapper and singer Juice WRLD died from an acc...",Tweet,TweetDeck,329,923,"""428333""","""CNN Breaking News""",cnnbrk,Verified,https://twitter.com/cnnbrk,Not Protected,0.197,0.803,0.000,-0.5994
4,1220107262741618688,https://twitter.com/cnnbrk/status/122010726274...,2020-01-22 22:13:34,"""Four people were killed when a small plane cr...",Tweet,TweetDeck,76,146,"""428333""","""CNN Breaking News""",cnnbrk,Verified,https://twitter.com/cnnbrk,Not Protected,0.191,0.809,0.000,-0.6705
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
153467,865359712245698561,https://twitter.com/niallofficial/status/86535...,2017-05-19 00:13:27,"""Great day with all the radio stations in Chic...",Tweet,Twitter for iPhone,18247,60261,"""105119490""","""Niall Horan""",niallofficial,Verified,https://twitter.com/niallofficial,Not Protected,0.000,1.000,0.000,0.0000
153468,865281066797383681,https://twitter.com/theellenshow/status/865281...,2017-05-18 19:00:57,""".@NiallOfficial's new music is here, and you ...",Retweet,Twitter Web Client,10578,30754,"""15846407""","""Ellen DeGeneres""",theellenshow,Verified,https://twitter.com/theellenshow,Not Protected,0.000,0.712,0.288,0.7840
153469,865269249857794049,https://twitter.com/niallofficial/status/86526...,2017-05-18 18:13:59,"""@BezChristiaan happy birthday dude ! Shooting...",Tweet,Twitter for iPhone,1675,4114,"""105119490""","""Niall Horan""",niallofficial,Verified,https://twitter.com/niallofficial,Not Protected,0.127,0.694,0.179,0.3164
153470,865268733195026432,https://twitter.com/niallofficial/status/86526...,2017-05-18 18:11:56,"""Sad to hear about Chris Cornell . Was literal...",Tweet,Twitter for iPhone,17402,58310,"""105119490""","""Niall Horan""",niallofficial,Verified,https://twitter.com/niallofficial,Not Protected,0.000,0.609,0.391,0.9041


In [153]:
df['compound'].describe()

count    153470.000000
mean          0.205227
std           0.413690
min          -0.990300
25%           0.000000
50%           0.000000
75%           0.570700
max           0.993700
Name: compound, dtype: float64

In [152]:
df[['neg','neu','pos','compound']].sum()

neg           5339.0810
neu         126931.6410
pos          21199.1830
compound     31496.1198
dtype: float64

In [154]:
# Bin edges for the neg/pos scores; values above 0.6 fall outside every bin and become NaN
bins = [0, 0.4, 0.50, 0.6]
names = ['lower', 'medium', 'high']
df['neg_category'] = pd.cut(df['neg'], bins, labels=names, include_lowest=True)
df['pos_category'] = pd.cut(df['pos'], bins, labels=names, include_lowest=True)
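
The binning step above can be checked on a handful of values. This is a minimal sketch on made-up scores (the `scores` series is hypothetical, not taken from the tweet data). It shows which label each value receives and that a score above the last edge is left unlabelled.

```python
import pandas as pd

# Hypothetical scores, chosen so that each bin and the out-of-range case are covered
scores = pd.Series([0.0, 0.25, 0.45, 0.55, 0.75])

bins = [0, 0.4, 0.50, 0.6]
names = ['lower', 'medium', 'high']

# include_lowest=True makes the first interval closed on the left, so 0.0 is kept
categories = pd.cut(scores, bins, labels=names, include_lowest=True)
print(categories.tolist())

# Scores outside [0, 0.6] are not assigned to any bin
n_unbinned = int(categories.isna().sum())
print(n_unbinned)
```

If every score should receive a label, one option is to extend the last edge to `1.0`, since the `neg` and `pos` scores are proportions.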


In [21]:
df[df['neg_category'] == 'high'].shape

(12, 29)

In [23]:
df[df['pos_category'] == 'high'].shape

(15, 29)

In [25]:
df['compound'].describe()

count    33174.000000
mean        -0.065940
std          0.461793
min         -0.982800
25%         -0.421500
50%          0.000000
75%          0.318200
max          0.986200
Name: compound, dtype: float64

In [155]:
# Three compound-score classes: (-inf, -0.5], (-0.5, 0.5], (0.5, 1]
bins = [-np.inf, -0.5, 0.5, 1]
names = ['high_negative', 'neu', 'high_positive']
df['compound_category'] = pd.cut(df['compound'], bins, labels=names, include_lowest=True)

In [None]:
df[df['compound_category'] == 'high_positive']

In [160]:
from collections import Counter
Counter(df['compound_category'])

Counter({'high_negative': 8506, 'high_positive': 43711, 'neu': 101253, nan: 1})
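
The `nan: 1` entry above is consistent with one row having no `compound` score, since `pd.cut` returns NaN for missing input. As a minimal sketch on made-up values (the `compound` series below is hypothetical), the same tally can be obtained with pandas' own `value_counts`, which can also report shares:

```python
import numpy as np
import pandas as pd

# Hypothetical compound scores, including one missing value
compound = pd.Series([-0.8, -0.2, 0.0, 0.7, 0.9, np.nan])

bins = [-np.inf, -0.5, 0.5, 1]
names = ['high_negative', 'neu', 'high_positive']
category = pd.cut(compound, bins, labels=names, include_lowest=True)

# dropna=False keeps the NaN row in the tally, as Counter does
counts = category.value_counts(dropna=False)
print(counts)

# normalize=True gives proportions over the labelled rows only
shares = category.value_counts(normalize=True)
print(shares)
```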

In [175]:
logical_vector = df['compound_category'] == 'high_negative'

In [None]:
Counter(df.Username)

In [90]:
sum(logical_vector)

4119

In [176]:
import math

# Term counts in the target rows (freq1) and in all remaining rows (freq2)
df_freq_target = pd.DataFrame(np.asarray(freq_term_DTM[logical_vector].sum(0).T).reshape(-1))
df_freq_target.index = [word for (word, idx) in sorted(vectorized.vocabulary_.items(), key=lambda x: x[1])]
df_freq_target.columns = ['freq1']
df_freq_target['freq2'] = np.asarray(freq_term_DTM[~(logical_vector)].sum(0).T).reshape(-1)
df_freq_target['tot'] = df_freq_target['freq1'] + df_freq_target['freq2']

# Replace zero counts with a tiny constant so the ratio and its logarithm stay defined
df_freq_target['freq1'] = df_freq_target['freq1'].apply(lambda x: 0.0000001 if x == 0 else x).astype(float)
df_freq_target['freq2'] = df_freq_target['freq2'].apply(lambda x: 0.0000001 if x == 0 else x).astype(float)
# Normalize to occurrences per million term occurrences in each group
df_freq_target['freq1_norm'] = df_freq_target['freq1'] / df_freq_target['freq1'].sum() * 1000000
df_freq_target['freq2_norm'] = df_freq_target['freq2'] / df_freq_target['freq2'].sum() * 1000000
# Despite the column name, this value is the base-2 log of the ratio of normalized frequencies
df_freq_target['fraction'] = df_freq_target['freq1_norm'] / df_freq_target['freq2_norm']
df_freq_target['Log-likelihood Ratio'] = df_freq_target['fraction'].apply(math.log2)
frequency_threshold = 10  # Insert your frequency threshold as an integer
df_freq_target[df_freq_target['tot'] > frequency_threshold]['Log-likelihood Ratio'].sort_values(ascending=False).head(50)

rape           8.315716
murder         6.486185
killed         5.772018
suspected      5.674082
abuse          5.619487
dead           5.573555
injured        5.461088
qasem          5.412896
killing        5.282342
suspect        5.240428
suicide        5.240428
danger         5.205663
violence       5.078409
soleimani      5.018036
terrorist      4.987662
assault        4.977394
fired          4.976216
died           4.896474
suleimani      4.873646
arrested       4.838057
accused        4.780997
sentenced      4.780997
prison         4.749547
jailed         4.681001
terror         4.662555
devastating    4.602998
death          4.550814
attack         4.545968
kill           4.523676
lying          4.511076
fraud          4.474894
scam           4.433073
terrorism      4.398308
war            4.391079
convicted      4.281070
worst          4.278099
guilty         4.243249
charged        4.156864
crisis         4.062430
failed         4.058678
cancer         4.018036
sexual         4

In [85]:
df_freq_target['freq1'].sort_values(ascending=False).iloc[range(20)]

rt           5232.0
china        2615.0
people       1928.0
wuhan        1168.0
doctor       1145.0
death        1063.0
health       1022.0
case         1007.0
city          936.0
ha            896.0
lady          883.0
infected      882.0
wa            867.0
patient       859.0
new           852.0
virus         760.0
confirmed     702.0
public        699.0
chinese       697.0
amp           667.0
Name: freq1, dtype: float64

In [58]:
df_freq_target

Unnamed: 0,freq1,freq2,tot,freq1_norm,freq2_norm,fraction,Log-likelihood Ratio
abc7,1.000000e-07,102.0,102,0.000002,303.936876,6.692071e-09,-27.154900
abc7newsbayarea,7.000000e+00,155.0,162,142.377708,461.864861,3.082670e-01,-1.697748
able,1.000000e+01,56.0,66,203.396725,166.867305,1.218913e+00,0.285595
abscbnnews,1.000000e-07,117.0,117,0.000002,348.633476,5.834113e-09,-27.352839
abuse,1.000000e-07,619.0,619,0.000002,1844.479672,1.102732e-09,-29.756270
...,...,...,...,...,...,...,...
york,4.000000e+00,55.0,59,81.358690,163.887531,4.964300e-01,-1.010338
youtube,1.500000e+01,119.0,134,305.095088,354.593023,8.604092e-01,-0.216905
zero,1.000000e-07,56.0,56,0.000002,166.867305,1.218913e-08,-26.289830
zerohedge,5.000000e+00,72.0,77,101.698363,214.543678,4.740217e-01,-1.076975
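
The values near -27 in the table above come from terms whose `freq1` is zero and was replaced by `1e-07`, so their magnitude reflects the replacement constant rather than an observed contrast. The sketch below reproduces the computation on a tiny hand-made count matrix. The function name `log_ratio_keyness`, the `smoothing` parameter and the toy data are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def log_ratio_keyness(counts, is_target, vocabulary, smoothing=1e-7):
    """Base-2 log ratio of normalized term frequencies, target rows vs. the rest."""
    counts = np.asarray(counts)
    is_target = np.asarray(is_target, dtype=bool)
    out = pd.DataFrame({
        'freq1': counts[is_target].sum(axis=0),   # term counts in the target rows
        'freq2': counts[~is_target].sum(axis=0),  # term counts in the other rows
    }, index=vocabulary).astype(float)
    out['tot'] = out['freq1'] + out['freq2']
    # Zero counts are replaced so that the ratio and its logarithm are defined
    f1 = out['freq1'].replace(0, smoothing)
    f2 = out['freq2'].replace(0, smoothing)
    out['log_ratio'] = np.log2((f1 / f1.sum()) / (f2 / f2.sum()))
    return out

# Hypothetical document-term counts: 4 documents, 3 terms
vocabulary = ['attack', 'happy', 'news']
counts = [[2, 0, 1],
          [1, 0, 1],
          [0, 3, 1],
          [0, 1, 1]]
is_target = [True, True, False, False]

keyness = log_ratio_keyness(counts, is_target, vocabulary)
print(keyness)
```

Here `attack` never occurs outside the target rows and `happy` never occurs inside them, so both get very large absolute scores driven by `smoothing`, while `news` gets a small score that reflects real counts. Filtering on `tot` alone does not remove such terms, so a threshold on `freq1` and `freq2` separately may be worth considering.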


In [None]:

def lexical_keyness(DTM, df, categories, vocabulary_vectorize, target_category):
    """Log-ratio keyness of each term for the rows where df[categories] == target_category."""
    import math
    # This code ref takes inspiration from this python module : https://pypi.org/project/corpus-toolkit/
    # and its main script:  https://github.com/kristopherkyle/corpus_toolkit/blob/master/corpus_toolkit/corpus_tools.py
    # which is based on this paper: https://aclanthology.org/J93-1003/

    # Boolean mask of the DTM rows that belong to the target category
    # (categories is read here as the name of the label column in df)
    logical_vector = np.asarray(df[categories] == target_category)
    df_freq_target = pd.DataFrame(np.asarray(DTM[logical_vector].sum(0).T).reshape(-1))
    df_freq_target.index = [word for (word, idx) in sorted(vocabulary_vectorize.items(), key=lambda x: x[1])]
    df_freq_target.columns = ['freq1']
    df_freq_target['freq2'] = np.asarray(DTM[~logical_vector].sum(0).T).reshape(-1)
    #
    df_freq_target['tot'] = df_freq_target['freq1'] + df_freq_target['freq2']
    # Replace zero counts with a tiny constant so the ratio and its logarithm stay defined
    df_freq_target['freq1'] = df_freq_target['freq1'].apply(lambda x: 0.0000001 if x == 0 else x).astype(float)
    df_freq_target['freq2'] = df_freq_target['freq2'].apply(lambda x: 0.0000001 if x == 0 else x).astype(float)
    # Normalize to occurrences per million term occurrences in each group
    df_freq_target['freq1_norm'] = df_freq_target['freq1'] / df_freq_target['freq1'].sum() * 1000000
    df_freq_target['freq2_norm'] = df_freq_target['freq2'] / df_freq_target['freq2'].sum() * 1000000
    #
    df_freq_target['fraction'] = df_freq_target['freq1_norm'] / df_freq_target['freq2_norm']
    df_freq_target['Log-likelihood Ratio'] = df_freq_target['fraction'].apply(math.log2)
    return df_freq_target


In [174]:
datasent

0    {'neg': 0.077, 'neu': 0.923, 'pos': 0.0, 'comp...
1    {'neg': 0.0, 'neu': 0.917, 'pos': 0.083, 'comp...
2    {'neg': 0.0, 'neu': 0.839, 'pos': 0.161, 'comp...
3    {'neg': 0.092, 'neu': 0.789, 'pos': 0.119, 'co...
dtype: object
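
A Series of score dictionaries like `datasent` is easier to filter and bin once each key has its own column. The following is a minimal sketch on a hypothetical two-row series: the dictionaries are made-up values with the same keys as above, not the real scores.

```python
import pandas as pd

# Hypothetical score dictionaries with the same keys as the entries shown above
datasent = pd.Series([
    {'neg': 0.1, 'neu': 0.9, 'pos': 0.0, 'compound': -0.3},
    {'neg': 0.0, 'neu': 0.8, 'pos': 0.2, 'compound': 0.4},
])

# One column per dictionary key, keeping the original index so it can be joined back
scores = pd.DataFrame(datasent.tolist(), index=datasent.index)
print(scores)
```

The resulting frame can then be attached to the tweet table with `pd.concat([df, scores], axis=1)`, provided both share the same index.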