


<font size='10' color = 'E3A440'>**Megadata and Advanced Techniques Demystified**</font>
=======
<font color = 'E3A440'>*New Analysis Methods and their Implications for Megadata Management in SSH (part 2)*</font>
=============


This workshop is part of the training [Megadata and Advanced Techniques Demystified](https://www.4point0.ca/en/2022/08/22/formation-megadonnees-demystifiees//) (session 6).

Humanities and social sciences are often confronted with the analysis of unstructured data, such as text. After preparing the data, several analysis techniques from machine learning can be used. During this workshop, participants will be introduced to the preprocessing of textual data and to supervised and unsupervised methods for analysis purposes with Python.


Structure of the workshop :
1. Part 1 : Examples of supervised and unsupervised methods for text mining
2. Part 2 : Exercises

### Authors:
- Bruno Agard <bruno.agard@polymtl.ca>
- Davide Pulizzotto <davide.pulizzotto@polymtl.ca>

Département de Mathématiques et de génie industriel

École Polytechnique de Montréal

# <font color = 'E3A440'>0. Environment preparation </font>

In [None]:
# Download the data from the GitHub project
!rm -rf Data_techniques_demystified_webinars/
!git clone https://github.com/4point0-ChairInnovation-Polymtl/Data_techniques_demystified_webinars

In [None]:
# Import modules
import os
import pandas as pd
import re
import numpy as np
import nltk
from nltk.corpus import stopwords
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import Normalizer
import matplotlib.pyplot as plt
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

<a name='Section_1'></a>
# <font color = 'E3A440'>1. *Preparation of textual data* (wrap up)</font>

Preprocessing a corpus of texts may require several steps, including sentence splitting, word tokenization, cleaning and filtering.

In the next blocks of code, a text will be segmented into sentences and preprocessed using the `CleaningText()` function prepared in the previous session (October 27th, 2022).



In [None]:
text = """At eight o'clock, on Thursday morning, the great Arthur didn't feel VERY good.
This morning, Arthur is feeling better.
A dog runs in the street.
A little boy in running in the street.
Arthur is my dog, he sleeps every morning."""

In [None]:
# Extraction of sentences
sentences = nltk.sent_tokenize(text)
print(sentences)

["At eight o'clock, on Thursday morning, the great Arthur didn't feel VERY good.", 'This morning, Arthur is feeling better.', 'A dog runs in the street.', 'A little boy in running in the street.', 'Arthur is my dog, he sleeps every morning.']


In [None]:
# Cleaning function to preprocess text
def CleaningText(text_as_string, language = 'english', reduce = '', list_pos_to_keep = [], Stopwords_to_add = []):
    from nltk.corpus import stopwords

    words = nltk.word_tokenize(text_as_string)
    words_pos = nltk.pos_tag(words, tagset='universal')
    words_pos = [(w, pos) for w, pos in words_pos if w.isalnum()]
    words_pos = [(w.lower(), pos) for w, pos in words_pos]
    
    if reduce == 'stem': 
        from nltk.stem.porter import PorterStemmer
        reduced_words_pos = [(PorterStemmer().stem(w), pos) for w, pos in words_pos]
        
    elif reduce == 'lemma':
        from nltk.stem.wordnet import WordNetLemmatizer
        reduced_words_pos = [(WordNetLemmatizer().lemmatize(w), pos) for w, pos in words_pos]
    else:
        import warnings
        reduced_words_pos = words_pos
        warnings.warn("No reduction was applied to the words. Use the 'reduce' argument to choose between 'stem' and 'lemma'.")
    if list_pos_to_keep:
        reduced_words_pos = [(w, pos) for w, pos in reduced_words_pos if pos in list_pos_to_keep]
    else:
        import warnings
        warnings.warn("No POS filtering was applied. Use 'list_pos_to_keep' to provide the list of POS tags to keep.")
    
    list_stopwords = stopwords.words(language) + Stopwords_to_add
    reduced_words_pos = [(w, pos) for w, pos in reduced_words_pos if w not in list_stopwords and len(w) > 1 ]
    return reduced_words_pos   

In [None]:
# Cleaning of the sentences, selection of POS tags
cleaned_sentences = [CleaningText(sent, reduce = 'stem', list_pos_to_keep = ['NOUN','ADJ','VERB']) for sent in sentences]
print(cleaned_sentences)

[[('thursday', 'NOUN'), ('morn', 'NOUN'), ('great', 'ADJ'), ('arthur', 'NOUN'), ('feel', 'VERB'), ('good', 'ADJ')], [('morn', 'NOUN'), ('arthur', 'NOUN'), ('feel', 'VERB')], [('dog', 'NOUN'), ('run', 'VERB'), ('street', 'NOUN')], [('littl', 'ADJ'), ('boy', 'NOUN'), ('run', 'VERB'), ('street', 'NOUN')], [('arthur', 'NOUN'), ('dog', 'NOUN'), ('sleep', 'VERB'), ('morn', 'NOUN')]]


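Here is the list of existing POS tags. Note that the `universal` tagset used by `nltk.pos_tag` returns only a subset of these tags (for example `PRT` for particles and `.` for punctuation).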

| **POS** | **DESCRIPTION**           | **EXAMPLES**                                      |
| ------- | ------------------------- | ------------------------------------------------- |
| ADJ     | adjective                 | big, old, green, incomprehensible, first      |
| ADP     | adposition                | in, to, during                                |
| ADV     | adverb                    | very, tomorrow, down, where, there            |
| AUX     | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ    | conjunction               | and, or, but                                  |
| CCONJ   | coordinating conjunction  | and, or, but                                  |
| DET     | determiner                | a, an, the                                    |
| INTJ    | interjection              | psst, ouch, bravo, hello                      |
| NOUN    | noun                      | girl, cat, tree, air, beauty                  |
| NUM     | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART    | particle                  | ’s, not                                      |
| PRON    | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN   | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT   | punctuation               | ., (, ), ?                                    |
| SCONJ   | subordinating conjunction | if, while, that                               |
| SYM     | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :)               |
| VERB    | verb                      | run, runs, running, eat, ate, eating          |
| X       | other                     | sfpksdpsxmsa                                  |
| SPACE   | space                     |                                                   |

### <font color = 'E3A440'>*3. Vectorization*</font>

Typically, to use text in a data analysis or machine learning context, text must be transformed into an appropriate mathematical object.
The simplest and most widespread model is the "bag-of-words" model, in which each text (or text fragment) is represented by a vector whose components are the lexical units that characterize it. This model belongs to the family of vector semantics models and takes the following form:


$$X = \begin{bmatrix} 
x_{1,1} & x_{1,2} & \ldots & x_{1,w} \\
\vdots & \vdots       &  \ddots      & \vdots \\ 
x_{n,1} & x_{n,2} & \ldots & x_{n,w} \\
\end{bmatrix}
$$ 

In this matrix, the value $x_{i,j}$ represents the "weight" of the word $j$ in the text fragment $i$. This weight can be computed in several ways:

- $x_{i,j}$ can represent the presence of the word $j$ in text fragment $i$,
- $x_{i,j}$ can measure the number of occurrences of the word $j$ in text fragment $i$,
- $x_{i,j}$ can represent the **value** of the word $j$ in text fragment $i$, using a metric such as tf-idf:
 $$\text{tf-idf}_{i,j}=\text{tf}_{i,j} \cdot \log\left(\frac{n}{n_j}\right)$$
 - $\text{tf}_{i,j}$ is the frequency of word $j$ in text fragment $i$,
 - $n$ is the total number of text fragments,
 - $n_j$ is the number of text fragments containing the word $j$.
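
The three weightings described above can be sketched by hand on a toy corpus. This is a minimal illustration that uses only the standard library; the corpus and the helper name `weights` are made up for the example, and the tf-idf variant is the plain `tf * log(n / document frequency)` form, without any smoothing or normalization.

```python
import math
from collections import Counter

# Toy corpus: three already tokenized text fragments (made-up data)
fragments = [
    ["dog", "run", "street"],
    ["boy", "run", "street"],
    ["dog", "sleep", "dog"],
]

n = len(fragments)                                      # total number of fragments
vocabulary = sorted({w for frag in fragments for w in frag})

# Number of fragments that contain each word (document frequency)
doc_freq = {w: sum(w in frag for frag in fragments) for w in vocabulary}

def weights(fragment):
    """Hypothetical helper: return the three weightings for one fragment."""
    counts = Counter(fragment)
    presence = [int(w in counts) for w in vocabulary]   # 1 if the word occurs, else 0
    frequency = [counts[w] for w in vocabulary]         # raw number of occurrences
    tfidf = [counts[w] * math.log(n / doc_freq[w]) for w in vocabulary]
    return presence, frequency, tfidf

presence, frequency, tfidf = weights(fragments[2])
print(vocabulary)
print(presence)
print(frequency)
print([round(v, 3) for v in tfidf])
```

In the third fragment, `sleep` occurs once but appears in a single fragment, so it receives a higher tf-idf weight than `dog`, which occurs twice but is shared with another fragment.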


In [None]:
# Object initialization
from nltk.corpus import stopwords

def identity_tokenizer(text):
    return text

# Transform the words into frequencies
vectorized = CountVectorizer(lowercase = False, # Do not lowercase here: the tokens were already lowercased by CleaningText()
                             min_df = 1, # Ignore terms that have a document frequency strictly lower than the given threshold 
                             max_df = 10, # Ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)
                             stop_words = stopwords.words('english'), # Remove the list of words provided
                             ngram_range = (1, 1), # Get the lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted
                             tokenizer=identity_tokenizer) # Override the string tokenization step while preserving the preprocessing and n-grams generation steps

The vectorizer is applied to a list of word lists (and not to a list of word-POS tuples).

In [None]:
# Create a list of lists of words:
[[w for w, pos in sent] for sent in cleaned_sentences]

In [None]:
# Application of the vectorizer
freq_term_DTM = vectorized.fit_transform([[w for w, pos in sent] for sent in cleaned_sentences])
print(pd.DataFrame(freq_term_DTM.todense(), columns =  [k for k, v in sorted(vectorized.vocabulary_.items(), key=lambda item: item[1])] ))

We then apply the tf-idf weighting and assign the result to the variable named `tfidf_DTM`.
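
A side note on the weighting itself: by default, scikit-learn's `TfidfTransformer` does not apply exactly the textbook form `tf * log(n / df)`. With its default `smooth_idf=True`, the documented idf is `ln((1 + n) / (1 + df)) + 1`, and `norm='l1'` rescales each row so that its weights sum to 1. The small check below uses a made-up count matrix to verify both points; it is only a sanity check, not part of the workshop pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up count matrix: 3 text fragments (rows) x 3 words (columns)
counts = np.array([[1, 1, 0],
                   [1, 0, 2],
                   [1, 0, 0]])

tfidf = TfidfTransformer(norm='l1')          # smooth_idf=True is the default
weighted = tfidf.fit_transform(counts).toarray()

n = counts.shape[0]                          # number of fragments
df = (counts > 0).sum(axis=0)                # fragments containing each word
expected_idf = np.log((1 + n) / (1 + df)) + 1

print(np.allclose(tfidf.idf_, expected_idf)) # idf matches the smoothed formula
print(weighted.sum(axis=1))                  # each row sums to 1 with the l1 norm
```

A word that appears in every fragment therefore keeps an idf of 1 instead of 0, so it is down-weighted rather than removed.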

In [None]:
# Calculate the tfidf matrix
tfidf = TfidfTransformer(norm='l1')
tfidf_DTM = tfidf.fit_transform(freq_term_DTM)
print(pd.DataFrame(tfidf_DTM.todense(), columns =  [k for k, v in sorted(vectorized.vocabulary_.items(), key=lambda item: item[1])] ))

<a name="Section_2"></a>
# <font color = 'E3A440'> 2. *Exercise : Sentiment Analysis on Twitter* </font>

The exercise proposed in this section is based on a simple processing chain for **sentiment analysis** on Twitter data and **analysis of lexical specificities**.

The corpus used was collected in 2020 by *trackmyhashtag.com* and contains 150,000 tweets for the 50 most followed profiles on Twitter. The data is in tabular format in a CSV file. For pedagogical reasons, this exercise foresees the use of a random sample of 5,000 tweets.

First, the textual data of the 5,000 tweets will be analyzed with the sentiment analysis module of the `nltk` package. Then the text will be preprocessed and some lexical analyses will be performed.

During the exercise, participants will be invited to fill in the missing parts of the code, which are indicated with `...` (three dots).

## <font color = 'E3A440'> 2.1 Presentation of the exercise </font>

### <font color = 'E3A440'> a. Import data </font>

The file with the data is archived in a `.zip` and contains more than 150,000 tweets. For educational reasons, we only import 5,000 random tweets. 

In [None]:
ROOT_DIR='Data_techniques_demystified_webinars/'
DATA_DIR=os.path.join(ROOT_DIR, 'Data')
import zipfile
from datetime import datetime

# Unzip the archive and load the pickled dataset
with zipfile.ZipFile(os.path.join(DATA_DIR,'4POINT0_Top_50_tweet_profiles.zip'), 'r') as zip_ref:
    zip_ref.extractall(DATA_DIR)

df = pd.read_pickle(os.path.join(DATA_DIR,'Top_50_tweet_profiles.pkl')).sample(5000, random_state = 5641).reset_index()

Here are the variables available in the dataset.

In [None]:
df.columns

Here is an observation (one row of the data table):

In [None]:
df.iloc[0]

### <font color = 'E3A440'> b. Run Sentiment Analysis </font>

The `SentimentIntensityAnalyzer` object is used to perform sentiment analysis. The object must be initialized and then the `polarity_scores()` function can be applied to a string.

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

Here are three examples of sentiment analysis. The `polarity_scores()` method returns four values:

 1. `neg` : indicates the degree, on a scale from 0 to 1, of negative sentiment of the text.
 2. `neu` : indicates the degree, on a scale from 0 to 1, of neutral sentiment of the text.
 3. `pos` : indicates the degree, on a scale from 0 to 1, of positive sentiment of the text.
 4. `compound` : contains a composite score that summarizes the overall sentiment, ranging from -1 (most negative) to 1 (most positive).
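
To give an intuition of how an unbounded score can end up between -1 and 1, here is a minimal sketch of the normalization that, to our understanding, the VADER analyzer behind `SentimentIntensityAnalyzer` applies to the summed word valences. The function name `normalize_compound` and the raw scores are made up for the illustration, and the constant `alpha = 15` is an assumption taken from the reference implementation.

```python
import math

def normalize_compound(raw_score, alpha=15):
    """Squash an unbounded summed valence score into the interval (-1, 1).

    Assumption: mirrors the normalization used by VADER, with alpha = 15.
    """
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Made-up raw scores, from strongly negative to strongly positive
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize_compound(raw), 4))
```

The mapping is symmetric around 0 and saturates for large absolute values, which is why very emotional tweets cluster near -1 or 1.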



In [None]:
sia.polarity_scores("Wow, Montreal Canadiens is the greatest hockey team in the world!")

In [None]:
sia.polarity_scores("Ottawa is not bad city!")

In [None]:
sia.polarity_scores("No, you cannot put pineapple on a pizza! This is disgusting!")

These are the tweets to which we will apply sentiment analysis:

In [None]:
df['Tweet Content']

In the next block of code, we run sentiment analysis on the `Tweet Content` column and add the results to the data table (the object named `df`).

In [None]:
# Running Sentiment Analysis on the Corpus
datasent = df.apply(lambda x: sia.polarity_scores(x['Tweet Content']), 1)
df = df.join(pd.DataFrame(list(datasent)))

The result of the analysis is saved in 4 variables. Here is an example:

In [None]:
df.iloc[0]

To keep the analysis simple, we will only use the `compound` metric, which is automatically calculated by the `polarity_scores()` method.

In [None]:
df['compound'].describe()

To use the `compound` metric in a **lexical specificity analysis** context, the tweets must be grouped into the following categories:
 1. `negative` : groups tweets containing a negative sentiment (`compound` from -1 to -0.1)
 2. `neu` : groups tweets that are more neutral (`compound` above -0.1 and up to 0.5)
 3. `positive` : groups tweets containing a positive sentiment (`compound` above 0.5)
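
The grouping can be checked on a handful of made-up `compound` values before applying it to the whole corpus. The sketch below uses the same cut points as the workshop code; the scores themselves are invented, and they include the boundary values to show on which side each one falls.

```python
import pandas as pd

# Made-up compound scores, including the boundary values
scores = pd.Series([-1.0, -0.6, -0.1, 0.0, 0.5, 0.51, 1.0])

bins = [-1, -0.1, 0.5, 1]                     # 4 cut points define 3 intervals
names = ['negative', 'neu', 'positive']       # one name per interval
categories = pd.cut(scores, bins, labels=names, include_lowest=True)

print(pd.DataFrame({'compound': scores, 'category': categories}))
```

Intervals are closed on the right, so a score of exactly -0.1 is labelled `negative` and a score of exactly 0.5 is labelled `neu`; `include_lowest=True` makes sure that -1 itself is not left without a category.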

In [None]:
# 1 Determine the cut points for the compound metric
bins = [-1, -0.1, 0.5, 1]
# 2 Determine the names of the categories. NOTE that there must be one fewer category name than cut points.
names = ['negative', 'neu', 'positive']
# 3 Execute the slicing with the pandas 'cut' function.
df['compound_category']  = pd.cut(df['compound'], bins, labels=names, include_lowest =True)

Here is the distribution of tweets by category:

In [None]:
Counter(df['compound_category'])

### <font color = 'E3A440'> c. Annotation, cleaning and vectorization of tweets </font>

We use the previously written function to clean the lexical units of tweets. For this first test, we keep only the adjectives.

This operation will take a few seconds.

In [None]:
cleaned_tweets = [CleaningText(sent, reduce = 'lemma', list_pos_to_keep = ['ADJ'], Stopwords_to_add=['http']) for sent in list(df['Tweet Content'])]

In the vectorization step we retain the words that appear in at least 5 documents (`min_df = 5`).
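
The effect of a document-frequency threshold can be sketched with the standard library alone. The helper `filter_by_document_frequency` and the toy documents are hypothetical; the sketch only mimics the idea behind integer `min_df` and `max_df` values, which are absolute numbers of documents, and it is not the scikit-learn implementation.

```python
from collections import Counter

# Toy corpus of already tokenized documents (made-up data)
docs = [
    ['good', 'day'],
    ['good', 'good', 'news'],
    ['bad', 'news'],
    ['good', 'game'],
]

def filter_by_document_frequency(docs, min_df=1, max_df=None):
    """Hypothetical helper: keep words whose document frequency lies in [min_df, max_df]."""
    # set(doc) makes a word count once per document, however often it is repeated
    doc_freq = Counter(w for doc in docs for w in set(doc))
    upper = len(docs) if max_df is None else max_df
    return sorted(w for w, df in doc_freq.items() if min_df <= df <= upper)

print(filter_by_document_frequency(docs, min_df=2))
# → ['good', 'news']
```

Note that `good` is counted three times, not four: the threshold looks at the number of documents containing a word, not at its total number of occurrences.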

In [None]:
# Object initialization
def identity_tokenizer(text):
    return text
# Transform the words into frequencies
vectorized = CountVectorizer(lowercase = False, # Do not lowercase here: the tokens were already lowercased by CleaningText()
                             min_df = 5, # Ignore terms that have a document frequency strictly lower than the given threshold 
                             max_df = 4500, # Ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)
                             stop_words = stopwords.words('english'), # Remove the list of words provided
                             ngram_range = (1, 1), # Get the lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted
                             tokenizer=identity_tokenizer) # Override the string tokenization step while preserving the preprocessing and n-grams generation steps

In [None]:
freq_term_DTM = vectorized.fit_transform([[w for w, pos in sent] for sent in cleaned_tweets])
freq_term_DTM

### <font color = 'E3A440'> d. Analysis of lexical specificities </font>

The analysis of lexical specificities makes it possible to highlight the lexical units which are specific to a particular group of data. In our case, it is possible to identify the words that are more strongly associated with positive or negative feelings.

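To do this, we use a metric that is widely used in lexicometry, the log-likelihood ratio. The metric is based on this [article](https://aclanthology.org/J93-1003.pdf). In the function below, the score of a word is the base-2 logarithm of the ratio between its normalized frequency in the target group and its normalized frequency in the rest of the corpus. Other methods can be used, such as mutual information, chi2 or tf-idf weighting.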
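
Before looking at the full function, the core of the score can be illustrated on three made-up words. The sketch below follows the same steps as the workshop function (replace zero counts with a tiny value, normalize per million words, take the base-2 logarithm of the ratio), but the helper name `log_ratio` and the counts are invented for the example.

```python
import math

# Made-up word counts in the target group and in the rest of the corpus
target_counts = {'great': 30, 'new': 20, 'bad': 0}
reference_counts = {'great': 10, 'new': 40, 'bad': 25}

def log_ratio(target, reference, per=1_000_000, floor=1e-7):
    """Hypothetical helper: log2 of the ratio of normalized frequencies."""
    # Replace zero counts with a tiny value to avoid log(0) and division by zero
    t = {w: (c if c > 0 else floor) for w, c in target.items()}
    r = {w: (c if c > 0 else floor) for w, c in reference.items()}
    t_total, r_total = sum(t.values()), sum(r.values())
    return {w: math.log2((t[w] / t_total * per) / (r[w] / r_total * per)) for w in t}

scores = log_ratio(target_counts, reference_counts)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:6s} {score:+.2f}")
```

A positive score means the word is relatively more frequent in the target group, a negative score means it is more typical of the rest of the corpus, and a word that never occurs in the target group gets a very large negative score.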

In [None]:
def GetLexicalSpecificities(freq_term_DTM, logical_vector):
    # This code takes inspiration from this Python module : https://pypi.org/project/corpus-toolkit/
    # and its main script:  https://github.com/kristopherkyle/corpus_toolkit/blob/master/corpus_toolkit/corpus_tools.py
    # which is based on this paper: https://aclanthology.org/J93-1003/
    import math
    df_freq_target = pd.DataFrame(np.asarray(freq_term_DTM[logical_vector].sum(0).T).reshape(-1))
    df_freq_target.index = [word for (word,idx) in sorted(vectorized.vocabulary_.items(), key= lambda x:x[1])]
    df_freq_target.columns = ['freq1']
    df_freq_target['freq2'] = np.asarray(freq_term_DTM[~(logical_vector)].sum(0).T).reshape(-1)
    df_freq_target['tot'] = df_freq_target['freq1'] + df_freq_target['freq2']

    df_freq_target['freq1'] = df_freq_target['freq1'].apply(lambda x: 0.0000001 if x == 0 else x).astype(float)
    df_freq_target['freq2'] = df_freq_target['freq2'].apply(lambda x: 0.0000001 if x == 0 else x).astype(float)
    #
    df_freq_target['freq1_norm'] = df_freq_target['freq1']/df_freq_target['freq1'].sum() * 1000000
    df_freq_target['freq2_norm'] = df_freq_target['freq2']/df_freq_target['freq2'].sum() * 1000000
    #
    df_freq_target['fraction'] = df_freq_target['freq1_norm'] / df_freq_target['freq2_norm']
    df_freq_target['Log-likelihood Ratio'] = df_freq_target['fraction'].apply(math.log2)
    frequency_threshold = 10 # Insert your frequency threshold as an integer
    # head(50) returns at most 50 words and does not fail when fewer words pass the threshold
    return df_freq_target[df_freq_target['tot'] > frequency_threshold]['Log-likelihood Ratio'].sort_values(ascending=False).head(50)

To perform the specificity analysis, it is necessary to create a logical vector (with binary values) which indicates with `True` the class for which we want to analyze the lexical specificity and with `False` the rest of the corpus.

In [None]:
logical_vector = df['compound_category'] == 'positive'
logical_vector

In [None]:
sum(logical_vector)

Run the function with the frequency matrix (`freq_term_DTM`) and the logical vector we created above.

In [None]:
GetLexicalSpecificities(freq_term_DTM, logical_vector)

In [None]:
del logical_vector, freq_term_DTM

## <font color = 'E3A440'> 2.2 Exercise </font>

During the exercise, participants are invited to fill in the missing parts of the code, which are indicated with `...` (three dots).

Several manipulations and different results will be required. Each sub-exercise follows this processing chain:

1. Annotation and cleaning of tweets: the participant will have to adjust some parameters of the function to choose a specific filtering.
2. Vectorization: the participant will have to adjust some parameters of the function to choose a specific filtering.
3. Creation of a logical vector to define the target group and the reference group.
4. Application of the `GetLexicalSpecificities()` function to obtain the 50 most specific words for the target group.




### <font color = 'E3A440'> a. Study the impact of morphosyntactic filtering on lexical specificities </font>

In section 2.1, only adjectives were studied. Now study nouns, adjectives and verbs, and then any other combinations that interest you.
Here is the list of existing POS tags:

| **POS** | **DESCRIPTION**           | **EXAMPLES**                                      |
| ------- | ------------------------- | ------------------------------------------------- |
| ADJ     | adjective                 | big, old, green, incomprehensible, first      |
| ADP     | adposition                | in, to, during                                |
| ADV     | adverb                    | very, tomorrow, down, where, there            |
| AUX     | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ    | conjunction               | and, or, but                                  |
| CCONJ   | coordinating conjunction  | and, or, but                                  |
| DET     | determiner                | a, an, the                                    |
| INTJ    | interjection              | psst, ouch, bravo, hello                      |
| NOUN    | noun                      | girl, cat, tree, air, beauty                  |
| NUM     | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART    | particle                  | ’s, not                                      |
| PRON    | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN   | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT   | punctuation               | ., (, ), ?                                    |
| SCONJ   | subordinating conjunction | if, while, that                               |
| SYM     | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :)               |
| VERB    | verb                      | run, runs, running, eat, ate, eating          |
| X       | other                     | sfpksdpsxmsa                                  |
| SPACE   | space                     |                                                   |


Insert the correct value for the `list_pos_to_keep` argument in order to keep nouns, adjectives and verbs, or any other POS tag combination that interests you.
Look for the three dots `...` and fill them in! You can take inspiration from the code presented earlier in this workshop. Copy-paste is permitted!

In [None]:
# 1. Annotation and cleaning : ADD the POS tags to keep (e.g. nouns, adjectives and verbs)
cleaned_tweets = [CleaningText(sent, reduce = 'lemma', list_pos_to_keep = [...], Stopwords_to_add=['http']) for sent in list(df['Tweet Content'])]

Change the `min_df` parameter so that your <font color='E3A440'>**Document-Term matrix**</font>, saved in the object `freq_term_DTM`, does not exceed **750 words** (columns).

Note that in this function the `ngram_range` parameter is configured to have unigrams and bigrams (its value is: `(1,2)`).
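
To see what unigrams and bigrams look like on a pre-tokenized tweet, here is a small standard-library sketch. The helper `word_ngrams` is hypothetical and is not the scikit-learn implementation; it only reproduces the idea that an n-gram is a run of consecutive tokens, joined here with a space as `CountVectorizer` does for word n-grams.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Hypothetical helper: list all word n-grams for n in [min_n, max_n]."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive tokens over the list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams(['little', 'boy', 'run', 'street']))
# → ['little', 'boy', 'run', 'street', 'little boy', 'boy run', 'run street']
```

Adding bigrams increases the number of candidate columns considerably, which is one reason why the `min_df` threshold usually has to be raised to stay under a given vocabulary size.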

In [None]:
# 2. Vectorisation
def identity_tokenizer(text):
    return text
    
## 2.1 initialise with parameters : 
vectorized = CountVectorizer(lowercase = False, # Do not lowercase here: the tokens were already lowercased by CleaningText()
                             min_df = ..., # Ignore terms that have a document frequency strictly lower than the given threshold 
                             max_df = 4500, # Ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)
                             stop_words = stopwords.words('english'), # Remove the list of words provided
                             ngram_range = (1, 2), # Get the lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted
                             tokenizer=identity_tokenizer) # Override the string tokenization step while preserving the preprocessing and n-grams generation steps

#
freq_term_DTM = vectorized.fit_transform([[w for w, pos in sent] for sent in cleaned_tweets])

freq_term_DTM

Using the work already done in point **b.** of section **2.1**, choose the category for which you want to study the lexical specificity, e.g. `negative` or `positive`.

In [None]:
logical_vector = df['compound_category'] == ...

Call the function defined in point **d.** of section **2.1** with its two required arguments, i.e. the <font color='E3A440'>**Document-Term matrix**</font> and the **logical vector** created in the previous code block.

In [None]:
# This function needs 2 arguments: the 1st is the matrix, the 2nd is the logical vector
GetLexicalSpecificities(..., ...)

### <font color = 'E3A440'> b. Investigate new categories based on the number of Retweets </font>

On Twitter, it is possible to retweet an existing tweet. The number of retweets can be considered an indicator of the interest a tweet has obtained.

Answer the following question: what are the lexical specificities of tweets that were retweeted a very large number of times?

To answer, you must modify a few lines of code by performing the steps learned throughout this workshop.

Here is the distribution of the `Retweets received` column.

In [None]:
df['Retweets received'].describe()

Based on the percentiles displayed in the distribution of the `Retweets received` column (result of the previous code block), add the missing cut values to the `bins` list in the next code block.
Divide the number of retweets into four categories:
1. `low`, grouping tweets that have received a low interest
2. `medium`, grouping tweets that have received a medium interest
3. `high`, grouping tweets that have received a high interest
4. `very_high`, grouping tweets that have received a very high interest

In [None]:
bins = [-np.inf, 161, ..., ..., 449711]
names = ['low', 'medium', 'high', 'very_high']
df['Retweets_received_category']  = pd.cut(df['Retweets received'], bins, labels=names, include_lowest =True)

Choose the **target category** for which you want to analyze the lexical specificities. The value must be one of the four values contained in the `Retweets_received_category` column generated in the previous code block.

In [None]:
logical_vector = df['Retweets_received_category'] == ...

Run the specificity analysis.

In [None]:
GetLexicalSpecificities(freq_term_DTM, logical_vector)

### <font color = 'E3A440'> c. Study the lexical specificities of different Twitter profiles </font>

In the next exercise, select two or three Twitter profiles of your choice and compare the lexical specificities by studying several POS tag combinations. 

What are the main lexical differences between the profiles you have chosen?

Here is the complete list of profiles present in the corpus and recorded under the `Profile Account` column and the number of tweets per profile.

In [None]:
Counter(df['Profile Account'])

In [None]:
# 1. Annotation and cleaning : ADD the POS tags to keep
cleaned_tweets = [CleaningText(sent, reduce = 'lemma', list_pos_to_keep = [...], Stopwords_to_add=['http']) for sent in list(df['Tweet Content'])]

# 2. Vectorisation
def identity_tokenizer(text):
    return text
## 2.1 initialise with parameters : 
vectorized = CountVectorizer(lowercase = False, # Do not lowercase here: the tokens were already lowercased by CleaningText()
                             min_df = 10, # Ignore terms that have a document frequency strictly lower than the given threshold 
                             max_df = 4500, # Ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)
                             stop_words = stopwords.words('english'), # Remove the list of words provided
                             ngram_range = (1, 1), # Get the lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted
                             tokenizer=identity_tokenizer) # Override the string tokenization step while preserving the preprocessing and n-grams generation steps

#
freq_term_DTM = vectorized.fit_transform([[w for w, pos in sent] for sent in cleaned_tweets])

freq_term_DTM

In [None]:
logical_vector = df['Profile Account'] == ...
GetLexicalSpecificities(freq_term_DTM, logical_vector)

## <font color = 'E3A440'> 2.3 PERSONAL NOTES: </font>

-----

-----