<a href="https://colab.research.google.com/github/nicolashernandez/READI-LREC22/blob/main/readi_reproduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook comes from the git repository available [here](https://github.com/nicolashernandez/READI-LREC22/).  
It shows how to reproduce the contents of the READI paper available [here](https://cental.uclouvain.be/readi2022/accepted.html), then gives a few examples of how to use the library.  
To speed up the deep learning parts significantly, enable the GPU in this notebook's settings:  
Edit -> Notebook Settings -> Hardware Accelerator: GPU
 


# Setup : Import dependencies then library

## Setup : the runtime must be restarted after downloading the spaCy model

In [1]:
%%capture
!python -m spacy download fr_core_news_sm
#This cell only needs to be run once.

Restart the runtime once (Ctrl+M . or Runtime > Restart Runtime), then execute the following cells.

In [1]:
import spacy

In [2]:
spacy.util.set_data_path(spacy.load("fr_core_news_sm")._path.parent.parent)
#Quick hack to make sure the library's own spacy model works on colab by indicating where to locate models.

## Setup : Importing library and assorted data

In [3]:
%%capture
# 1. Download project and set current directory
!git clone https://github.com/nicolashernandez/READI-LREC22/
%cd READI-LREC22/

In [4]:
%%capture
# 2. Install module, should take around a minute to install every dependency
%cd readability
!pip install .
%cd ..

In [5]:
# 3. Add project directory to the path
import sys,os
sys.path.append(os.getcwd())
sys.path.append(os.path.join(os.getcwd(),"readability"))

In [6]:
import readability

# Recreating experiments

Six files are located in the cloned git repository, in the READI-LREC22/data folder.  
These contain the cleaned and formatted content of the corpora used in our project, and will be used for the demonstrations.

In [7]:
import pickle
with open(os.path.join(os.getcwd(),"data","tokens_split.pkl"), "rb") as file:
    corpus_ljl = pickle.load(file)
with open(os.path.join(os.getcwd(),"data","bibebook.com.pkl"), "rb") as file:
    corpus_bb = pickle.load(file)
with open(os.path.join(os.getcwd(),"data","JeLisLibre_md.pkl"), "rb") as file:
    corpus_jll = pickle.load(file)


To inspect the content, treat each corpus as a dictionary of texts keyed by class; the classes are listed by dict.keys().  
Each text is a list of sentences, and each sentence is a list of tokens.  
For instance, corpus_ljl['level1'][0][0] gives the first sentence of the first text of the "level1" class in the ljl corpus.  
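
To make the nesting concrete, here is a minimal sketch that walks a corpus in this layout and counts texts, sentences and tokens per class. The `toy_corpus` data and the `summarize` helper are made up for illustration; they are not part of the library, whose own `.corpus_info()` method is shown later in this notebook.

```python
# Toy corpus in the dict[class][text][sentence][token] layout (made-up data).
toy_corpus = {
    "level1": [
        [["Le", "chat", "dort", "."], ["Il", "rêve", "."]],
    ],
    "level2": [
        [["La", "fête", "foraine", "ouvre", "demain", "."]],
        [["Nous", "irons", "ensemble", "."], ["Promis", "!"]],
    ],
}


def summarize(corpus):
    """Return {class: (n_texts, n_sentences, n_tokens)} for a nested corpus."""
    summary = {}
    for level, texts in corpus.items():
        n_sentences = sum(len(text) for text in texts)
        n_tokens = sum(len(sentence) for text in texts for sentence in text)
        summary[level] = (len(texts), n_sentences, n_tokens)
    return summary


summary = summarize(toy_corpus)
for level, counts in summary.items():
    print(level, counts)
```

The same loop works unchanged on `corpus_ljl`, `corpus_bb` or `corpus_jll`, since they share this structure.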


In [8]:
corpus_ljl['level1'][0][0]

["Aujourd'hui",
 ',',
 'toute',
 'la',
 'famille',
 'est',
 'allée',
 'à',
 'la',
 'fête',
 'foraine',
 '.']

In [9]:
# Some texts in these two corpora are empty; drop them before computing statistics.
for corpus in (corpus_bb, corpus_jll):
  for level in corpus.keys():
    corpus[level] = [text for text in corpus[level] if len(text) > 0]

## Reproducing the contents of table 2

In [10]:
import pandas as pd

In [11]:
corp_info_ljl = readability.Readability(corpus_ljl).corpus_info()
corp_info_ljl.rename({"Nombre de fichiers" : "Nombre de fichiers artificiel"}, axis = 'index', inplace = True)
original_documents = [240,314,134,58,746]
extract_ljl = pd.DataFrame([corp_info_ljl.loc["Nombre de fichiers artificiel"],corp_info_ljl.loc['Nombre de phrases total'],corp_info_ljl.loc['Nombre de tokens']])
extract_ljl.loc["Nombre de fichiers original"] = original_documents
extract_ljl.columns.name = "Corpus ljl"

corp_info_bb = readability.Readability(corpus_bb).corpus_info()
corp_info_bb.rename({'Nombre de fichiers' : 'Nombre de fichiers artificiel'}, axis = 'index', inplace = True)
original_documents = [52,91,65,208]
extract_bb = pd.DataFrame([corp_info_bb.loc["Nombre de fichiers artificiel"],corp_info_bb.loc['Nombre de phrases total'],corp_info_bb.loc['Nombre de tokens']])
extract_bb.loc["Nombre de fichiers original"] = original_documents
extract_bb.columns.name = "Corpus bb"

corp_info_jll = readability.Readability(corpus_jll).corpus_info()
corp_info_jll.rename({'Nombre de fichiers' : 'Nombre de fichiers artificiel'}, axis = 'index', inplace = True)
original_documents = [13,12,10,9,44]
extract_jll = pd.DataFrame([corp_info_jll.loc["Nombre de fichiers artificiel"],corp_info_jll.loc['Nombre de phrases total'],corp_info_jll.loc['Nombre de tokens']])
extract_jll.loc["Nombre de fichiers original"] = original_documents
extract_jll.columns.name = "Corpus jll"

Acquiring Natural Language Processor...
DEBUG: Spacy model location (already installed) :  /usr/local/lib/python3.7/dist-packages/fr_core_news_sm/fr_core_news_sm-2.2.5
Acquiring Natural Language Processor...
DEBUG: Spacy model location (already installed) :  /usr/local/lib/python3.7/dist-packages/fr_core_news_sm/fr_core_news_sm-2.2.5
Acquiring Natural Language Processor...
DEBUG: Spacy model location (already installed) :  /usr/local/lib/python3.7/dist-packages/fr_core_news_sm/fr_core_news_sm-2.2.5


In [12]:
print(extract_ljl)
print(extract_bb)
print(extract_jll)

Corpus ljl                      level1    level2    level3    level4     total
Nombre de fichiers artificiel    240.0     628.0     670.0     522.0    2060.0
Nombre de phrases total         4880.0   13049.0   10354.0    7743.0   36026.0
Nombre de tokens               38976.0  128019.0  124901.0  101165.0  393061.0
Nombre de fichiers original      240.0     314.0     134.0      58.0     746.0
Corpus bb                      intermédiaire  avancée    aisée     total
Nombre de fichiers artificiel          1729.0    1253.0     986.0    3968.0
Nombre de phrases total               22088.0   15762.0   12274.0   50124.0
Nombre de tokens                     315369.0  232604.0  173939.0  721912.0
Nombre de fichiers original              52.0      91.0      65.0     208.0
Corpus jll                     cycle4_3e  cycle4_4e  cycle4_5e  cycle3_6e  \
Nombre de fichiers artificiel      986.0      989.0     1187.0     1283.0   
Nombre de phrases total          14689.0    13553.0    13818.0    13463

## Reproducing the contents of table 3

In [None]:
#TODO: do readability.Readability(corpus).compile().scores() and show results in latex format

In [13]:
%%capture
scores_ljl = readability.Readability(corpus_ljl).compile().scores()
scores_bb = readability.Readability(corpus_bb).compile().scores()
scores_jll = readability.Readability(corpus_jll).compile().scores()

In [14]:
scores_ljl

| Mean values | level1 | level2 | level3 | level4 | Pearson Score |
|---|---|---|---|---|---|
| The Gunning fog index GFI | 45.132518 | 67.697721 | 91.866336 | 105.669951 | 0.475915 |
| The Automated readability index ARI | 14.238996 | 19.932585 | 25.719148 | 27.7577 | 0.472037 |
| The Flesch reading ease FRE | 90.625507 | 84.87584 | 82.799719 | 81.03216 | -0.402143 |
| The Flesch-Kincaid grade level FKGL | 4.5768 | 6.681592 | 8.323094 | 9.019 | 0.451786 |
| The Simple Measure of Gobbledygook SMOG | 10.110286 | 11.521299 | 12.760749 | 13.210278 | 0.471106 |
| Reading Ease Level | 92.465376 | 82.239781 | 75.110005 | 71.711107 | -0.408414 |


In [15]:
perplexity_calculator = readability.readability.perplexity.pppl_calculator
perplexity_calculator.load_model(None)

Model online, you can now use .PPPL_score()


0

This will take around an hour to calculate, even with the GPU enabled.

In [None]:
perplex_ljl = perplexity_calculator.PPPL_score(corpus_ljl)
#perplex_bb = perplexity_calculator.PPPL_score(corpus_bb)
#perplex_jll = perplexity_calculator.PPPL_score(corpus_jll)

In [18]:
#TODO : calc mean values and then append
#Might have been a good idea to allow the library to calculate these mean values as well..
#ppl_list = []
#for level in levels:
#  for val in perplex[level]:
#    ppl_list.append(val)

#maxppl = max(ppl_list)
#ppl_list = [val/maxppl for val in ppl_list]

#pearson.append(pearsonr(ppl_list,labels)[0])

#moy_ppl= list()
#for level in levels:
#  moy=0
#  for score in perplex[level]:
#    moy+= score/len(perplex[level])
#  moy_ppl.append(moy)

#scores_ljl.append(the thing above)
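
The commented-out draft above can be sketched as runnable code. This is only an illustration under stated assumptions: the `perplex` dictionary holds made-up numbers in a dict[class] -> list of scores layout, which may not match what `PPPL_score` actually returns, and the hand-written `pearson` function stands in for `scipy.stats.pearsonr` so that the example needs only the standard library.

```python
from math import sqrt
from statistics import mean

# Hypothetical per-text perplexity scores (made-up numbers, assumed layout).
perplex = {
    "level1": [12.0, 15.0, 18.0],
    "level2": [20.0, 22.0],
    "level3": [30.0, 28.0, 35.0],
}
levels = list(perplex.keys())

# Mean perplexity per class.
mean_ppl = {level: mean(perplex[level]) for level in levels}

# Flatten the scores and pair each one with the rank of its class (0, 1, 2, ...).
ppl_list = [val for level in levels for val in perplex[level]]
labels = [rank for rank, level in enumerate(levels) for _ in perplex[level]]

# Normalize by the maximum, as in the draft.
max_ppl = max(ppl_list)
ppl_norm = [val / max_ppl for val in ppl_list]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


pearson_ppl = pearson(ppl_norm, labels)
print(mean_ppl)
print(round(pearson_ppl, 3))
```

Dividing by the maximum only rescales the scores, so it does not change the Pearson coefficient; it is kept here only to mirror the draft.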

## Reproducing the contents of table 4 for MLP and SVM

In [None]:
#TODO: make relevant methods from decoupage_corpus.ipynb and/or linguisticAnalysis.ipynb 
#Then use these and show results in latex format

## How to reproduce the results in table 4 for fastText and CamemBERT

In [7]:
# Checking that fasttext and bert work from the library; there is no link between the Readability class and these yet.
from readability.models import models, fasttext, bert

The following demonstration uses the csv files available in the data/ folder, encoded in one-hot vector format.  
It relies on the ktrain library (wrapping around Keras) to help configure and train models for deep learning use.

### fastText

In [52]:
fasttext.demo_doFastText() # Can pass "ljl", "bibebook.com", "JeLisLibre", or "all" as a parameter
# Takes around 15 minutes without GPU for the ljl corpus (default parameter) on free Colab.
# Takes around 3 minutes with GPU enabled.

# NOTE: the results may differ slightly from those shown in the paper.


# Will test this after commit

# FIXME: this isn't a "true" cross-validation, since the model is already trained in
# subsequent runs. This can be seen from the number of epochs being significantly lower, and
# the starting accuracy being almost the same as in the last epoch of the previous run.

# For a future implementation in the library, the model should be reset
# to its initial configuration/weights before each run.
# Something like this in the code should work:

# BEFORE the "for run in range(nb_runs)":
#
#init_weights = []
#for layer in learner.model.layers:
#    init_weights.append(layer.get_weights()) # list of numpy arrays
#
# Then, within the loop:
#for index in range(len(init_weights)):
#    learner.model.layers[index].set_weights(init_weights[index])

detected encoding: utf-8 (if wrong, set manually)
['level1', 'level2', 'level3', 'level4']
      level1  level2  level3  level4
1962       0       0       0       1
2045       0       0       0       1
828        0       1       0       0
1336       0       0       1       0
939        0       0       1       0
['level1', 'level2', 'level3', 'level4']
      level1  level2  level3  level4
1530       0       0       1       0
1198       0       0       1       0
1474       0       0       1       0
509        0       1       0       0
1120       0       0       1       0
language: fr
Word Counts: 19647
Nrows: 1854
1854 train sequences
train sequence lengths:
	mean : 166
	95percentile : 372
	99percentile : 514
x_train shape: (1854,150)
y_train shape: (1854, 4)
Is Multi-Label? False
206 test sequences
test sequence lengths:
	mean : 148
	95percentile : 328
	99percentile : 488
x_test shape: (206,150)
y_test shape: (206, 4)
Is Multi-Label? False
compiling word ID features...
maxlen is 150
don

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Epoch 2/1024
Epoch 3/1024
Epoch 00003: Reducing Max LR on Plateau: new max lr will be 5e-05 (if not early_stopping).
Epoch 4/1024
Epoch 5/1024
Epoch 00005: Reducing Max LR on Plateau: new max lr will be 2.5e-05 (if not early_stopping).
Epoch 6/1024
Epoch 7/1024
Epoch 8/1024
Epoch 9/1024
Epoch 10/1024
Epoch 11/1024
Epoch 12/1024
Epoch 13/1024
Epoch 14/1024
Epoch 15/1024
Epoch 16/1024
Epoch 17/1024
Epoch 18/1024
Epoch 00018: Reducing Max LR on Plateau: new max lr will be 1.25e-05 (if not early_stopping).
Epoch 19/1024
Epoch 20/1024
Epoch 00020: Reducing Max LR on Plateau: new max lr will be 6.25e-06 (if not early_stopping).
Epoch 21/1024
Epoch 21: early stopping
Weights from best epoch have been loaded into model.
run 1 CORPUSNAME ljl class_names ['level1', 'level2', 'level3', 'level4']
              precision    recall  f1-score   support

      level1       0.00      0.00      0.00        26
      level2       0.32      0.58      0.41        60
      level3       0.33      0.44      0.

-1

### CamemBERT

Without the GPU enabled, this takes multiple hours, so remember to enable it first:    
Edit -> Notebook Settings -> Hardware Accelerator: GPU

In [16]:
bert.demo_doBert() # Can pass "ljl", "bibebook.com", "JeLisLibre", or "all" as a parameter
#Takes around 15 minutes for the ljl corpus on GPU (default parameter)

#FIXME : same reason as for fasttext, not a true cross-validation.

-------------------------------------------------------------------
['id', 'text', 'level1', 'level2', 'level3', 'level4']
len_train 1854
CORPUS_NAME ljl MODEL_NAME camembert-base class_names ['level1', 'level2', 'level3', 'level4']
--> getTransformer
preprocessing train...
language: fr
train sequence lengths:
	mean : 195
	95percentile : 424
	99percentile : 608


Is Multi-Label? False
preprocessing test...
language: fr
test sequence lengths:
	mean : 178
	95percentile : 389
	99percentile : 495


t <class 'ktrain.text.preprocessor.Transformer'> 
trn <class 'ktrain.text.dataset.TransformerDataset'> 
val <class 'ktrain.text.dataset.TransformerDataset'> 
model <class 'transformers.models.camembert.modeling_tf_camembert.TFCamembertForSequenceClassification'> 
learner <class 'ktrain.text.learner.TransformerTextClassLearner'>
Model: "tf_camembert_for_sequence_classification_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 roberta (TFRobertaMainLayer  multiple                 110031360 
 )                                                               
                                                                 
 classifier (TFRobertaClassi  multiple                 593668    
 ficationHead)                                                   
                                                                 
Total params: 110,625,028
Trainable params: 110,625,028
Non-trainable params: 0
__________

0

# Examples of use

## Importing data for the examples

In [7]:
import pickle
with open(os.path.join(os.getcwd(),"data","tokens_split.pkl"), "rb") as file:
    corpus = pickle.load(file)

In [11]:
#This can also be done by doing a wget :
#!wget -nc https://github.com/nicolashernandez/READI-LREC22/blob/main/data/tokens_split.pkl?raw=true -P data
#with open(os.path.join(os.getcwd(),"data","tokens_split.pkl?raw=true"), "rb") as file:
#    corpus = pickle.load(file)

## Example one : Using the library for a text

Texts can be strings, but it is preferable to prepare them beforehand as tokenized sentences, i.e. a list of lists of tokens ( list(list()) ).  
If using spaCy, with nlp = spacy.load("fr_core_news_sm"), something like this can be used:  
new_text = [[token.text for token in sent] for sent in nlp(text).sents]  
To remove punctuation marks, this can be done instead:  
new_text = [[token.text for token in sent if not token.is_punct] for sent in nlp(text).sents]
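
When spaCy is not at hand, the same list(list()) shape can be approximated with regular expressions. The sketch below is a rough stand-in, not spaCy's tokenizer and not what the library does internally; `rough_tokenize` is a hypothetical helper that splits sentences at '.', '!' or '?' and treats each punctuation mark as its own token.

```python
import re


def rough_tokenize(text, keep_punct=True):
    """Split raw text into sentences, then into tokens: list(list(str)).

    A rough regex approximation: sentences end at '.', '!' or '?', and
    tokens are word runs (apostrophes allowed inside) or single punctuation marks.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = []
    for sentence in sentences:
        tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)
        if not keep_punct:
            # Keep only tokens that contain at least one word character.
            tokens = [tok for tok in tokens if re.search(r"\w", tok)]
        if tokens:
            result.append(tokens)
    return result


text = "Aujourd'hui, toute la famille est allée à la fête foraine. Quelle joie !"
with_punct = rough_tokenize(text)
without_punct = rough_tokenize(text, keep_punct=False)
print(with_punct)
print(without_punct)
```

Either result can then be passed to readability.Readability in place of a raw string; for real use, a proper tokenizer such as spaCy's will handle abbreviations and elisions far better.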

A readability instance is created by calling readability.Readability(text).  
The following arguments are optional: lang, nlp_name, perplexity_processor.  
By default, the instance uses the French language, a spacy_sm NLP processor, and gpt2 for computing perplexity.

In [8]:
import pandas as pd
import spacy
#Types of available formats for a text:
r = readability.Readability(corpus['level1'][0]) # A text in the list(list()) format used internally
#r = readability.Readability(' '.join(corpus['level1'][0][0])) # A string, it will be converted into a list(list()), of size 1, with 12 tokens, including punctuation

Acquiring Natural Language Processor...
DEBUG: Spacy model location (already installed) :  /usr/local/lib/python3.7/dist-packages/fr_core_news_sm/fr_core_news_sm-2.2.5


Common scores can be accessed by using the corresponding function.

In [9]:
gfi = r.gfi()
gfi #is 61.52380952380953

61.52380952380953

More conveniently, all of these scores can be obtained at once, as a dictionary, by using .scores()

In [10]:
r.scores()

{'ari': 21.503161490683233,
 'fkgl': 8.382298136645964,
 'fre': 54.47311594202901,
 'gfi': 61.52380952380953,
 'rel': 73.00333333333334,
 'smog': 13.023866798666859}

To speed up the calculations needed by these functions, the .compile() function can be used.  
It calculates most of the statistics needed for a text, and stores them in the .statistics attribute of the Readability object.  
These can be viewed by calling .stats(), or by directly accessing the .statistics attribute.  
For example: .statistics.totalWords

In [11]:
r.compile()
r.stats()
r.statistics.totalWords

totalWords = 230
totalLongWords = 30
totalSentences = 21
totalCharacters = 837
totalSyllables = 384
nbPolysyllables = 63


230

## Example two : Using the library for a corpus

Currently, a corpus will be recognized by the library only if provided with the following structure :  
type(corpus) = dict[class][text][sentence][token]  
For instance, corpA['class1'][0][0][0] should return the first token of the first sentence of the first text of class 'class1', for the corpus 'corpA'.
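
Before handing data to the library, it can help to check that it follows this layout. The `is_corpus` function below is a hypothetical helper written for this notebook; it is an assumption about what a reasonable check looks like, not the test the library itself performs.

```python
def is_corpus(obj):
    """Return True if obj looks like dict[class][text][sentence][token]."""
    if not isinstance(obj, dict) or not obj:
        return False
    for texts in obj.values():
        if not isinstance(texts, list):
            return False
        for text in texts:
            if not isinstance(text, list):
                return False
            for sentence in text:
                if not isinstance(sentence, list):
                    return False
                if not all(isinstance(token, str) for token in sentence):
                    return False
    return True


# A one-class corpus with a single text of two sentences (made-up data).
corpA = {"class1": [[["Bonjour", "!"], ["Ça", "va", "?"]]]}
# A single text is a list of sentences, not a corpus.
single_text = [["Bonjour", "!"], ["Ça", "va", "?"]]

print(is_corpus(corpA))
print(is_corpus(single_text))
print(corpA["class1"][0][0][0])
```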

In [12]:
r = readability.Readability(corpus)

Acquiring Natural Language Processor...
DEBUG: Spacy model location (already installed) :  /usr/local/lib/python3.7/dist-packages/fr_core_news_sm/fr_core_news_sm-2.2.5


A useful function summarizing the contents of the corpus is available, called .corpus_info()

In [13]:
r.corpus_info()

| | level1 | level2 | level3 | level4 | total |
|---|---|---|---|---|---|
| Nombre de fichiers | 240.0 | 628.0 | 670.0 | 522.0 | 2060.0 |
| Nombre de phrases total | 4880.0 | 13049.0 | 10354.0 | 7743.0 | 36026.0 |
| Nombre de phrases moyen | 20.0 | 21.0 | 15.0 | 15.0 | 17.0 |
| Longueur moyenne de phrase | 8.0 | 10.0 | 12.0 | 13.0 | 11.0 |
| Nombre de tokens | 38976.0 | 128019.0 | 124901.0 | 101165.0 | 393061.0 |
| Nombre de token moyen | 162.0 | 204.0 | 186.0 | 194.0 | 191.0 |
| Taille du vocabulaire | 4836.0 | 10903.0 | 11953.0 | 11410.0 | 23100.0 |
| Taille moyenne du vocabulaire | 99.0 | 130.0 | 127.0 | 149.0 | 2257.0 |


When using a corpus, the Readability object's methods can return different types of results, but the behavior is similar:  
Instead of returning a value, or a list, the methods may return them in a dict[class][text_index] format.  
Additionally, .compile() will create the .corpus_statistics attribute instead of .statistics.  
.stats() will print the statistics of the first text in each class, in addition to showing the mean values.

In [14]:
r.compile()
r.stats()

Class level1
totalWords = 230
totalLongWords = 30
totalSentences = 21
totalCharacters = 837
totalSyllables = 384
nbPolysyllables = 63
Class level2
totalWords = 138
totalLongWords = 26
totalSentences = 8
totalCharacters = 555
totalSyllables = 240
nbPolysyllables = 43
Class level3
totalWords = 104
totalLongWords = 16
totalSentences = 11
totalCharacters = 405
totalSyllables = 184
nbPolysyllables = 21
Class level4
totalWords = 567
totalLongWords = 112
totalSentences = 35
totalCharacters = 2307
totalSyllables = 972
nbPolysyllables = 151


In [15]:
gfi_corp = r.gfi()
gfi_corp['level1'][0] #Is also 61.52380952380953

class level1 text 0 score 61.52380952380953
class level2 text 0 score 136.9
class level3 text 0 score 61.963636363636375
class level4 text 0 score 134.48


61.52380952380953

.scores() behaves differently: instead of giving the scores for each text, it returns a dataframe of the mean values per class (and prints out the standard deviations).
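
As a rough illustration of that aggregation, the sketch below reduces per-text scores to one mean and one standard deviation per class. The numbers are made up, each class is assumed to map to a plain list of scores, and the population standard deviation (`pstdev`) is an arbitrary choice here; the library may well use the sample version.

```python
from statistics import mean, pstdev

# Hypothetical per-text scores in the dict[class][text_index] layout (made-up numbers).
gfi_by_class = {
    "level1": [40.0, 50.0, 60.0],
    "level2": [60.0, 80.0],
}

# Per-class mean and population standard deviation.
summary = {
    level: {"mean": mean(scores), "std": pstdev(scores)}
    for level, scores in gfi_by_class.items()
}

for level, stats in summary.items():
    print(f"{level}: mean={stats['mean']:.2f} std={stats['std']:.2f}")
```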

In [16]:
r.scores()

Standard Deviation values                   level1     level2     level3  \
The Gunning fog index GFI                22.638448  28.598931  38.724814   
The Automated readability index ARI       6.265479   6.977522   8.614690   
The Flesch reading ease FRE              26.013539  24.790444  29.308209   
The Flesch-Kincaid grade level FKGL       3.386159   2.447647   2.501191   
The Simple Measure of Gobbledygook SMOG   1.728092   1.647957   1.839104   
Reading Ease Level                       19.108738  12.993784  12.457079   

Standard Deviation values                   level4  
The Gunning fog index GFI                45.192761  
The Automated readability index ARI       9.156945  
The Flesch reading ease FRE              32.117630  
The Flesch-Kincaid grade level FKGL       2.833996  
The Simple Measure of Gobbledygook SMOG   1.978900  
Reading Ease Level                       14.318104  


| Mean values | level1 | level2 | level3 | level4 | Pearson Score |
|---|---|---|---|---|---|
| The Gunning fog index GFI | 45.132518 | 67.697721 | 91.866336 | 105.669951 | 0.475915 |
| The Automated readability index ARI | 14.238996 | 19.932585 | 25.719148 | 27.7577 | 0.472037 |
| The Flesch reading ease FRE | 90.625507 | 84.87584 | 82.799719 | 81.03216 | -0.402143 |
| The Flesch-Kincaid grade level FKGL | 4.5768 | 6.681592 | 8.323094 | 9.019 | 0.451786 |
| The Simple Measure of Gobbledygook SMOG | 10.110286 | 11.521299 | 12.760749 | 13.210278 | 0.471106 |
| Reading Ease Level | 92.465376 | 82.239781 | 75.110005 | 71.711107 | -0.408414 |


In addition, machine learning and deep learning applications can be used with the corpus data to help develop NLP solutions.

In [None]:
#r.importmodel(camembert)
#r.configmodel(params)
#r.train(mode=autofit)