<a href="https://colab.research.google.com/github/nicolashernandez/READI-LREC22/blob/main/readi_reproduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook comes from the [READI-LREC22 git repository](https://github.com/nicolashernandez/READI-LREC22/).  
It shows how to reproduce the contents of the READI paper, available [here](https://cental.uclouvain.be/readi2022/accepted.html), then gives a few examples of how to use the library.  
To speed up the deep learning applications significantly, please enable the GPU in this notebook's settings:  
Edit -> Notebook Settings -> Hardware Accelerator : GPU
 


## Setup: importing the library and corpora

In [2]:
%%capture
# 1. Download project and set current directory
!git clone https://github.com/nicolashernandez/READI-LREC22/
%cd READI-LREC22/

In [3]:
%%capture
# 2. Install module, should take around a minute to install every dependency
!pip install .

In [4]:
# 3. Add project directory to the path (not needed but helps Colab editor with 
# auto-completion if you wish to try the library)
import sys,os
sys.path.append(os.getcwd())
sys.path.append(os.path.join(os.getcwd(),"readability"))
sys.path.append(os.path.join(os.getcwd(),"readability","readability"))

In [5]:
import readability

# Recreating experiments

The cloned git repository contains six files in the READI-LREC22/readability/data folder.  
These hold the cleaned and formatted content of the corpora used in our project, and will be used for the demonstrations.

In [7]:
import pickle
with open(os.path.join(os.getcwd(),"readability","data","tokens_split.pkl"), "rb") as file:
    corpus_ljl = pickle.load(file)
with open(os.path.join(os.getcwd(),"readability","data","bibebook.com.pkl"), "rb") as file:
    corpus_bb = pickle.load(file)
with open(os.path.join(os.getcwd(),"readability","data","JeLisLibre_md.pkl"), "rb") as file:
    corpus_jll = pickle.load(file)


To inspect the content, treat each corpus as a dictionary mapping class labels to lists of texts; the class labels are given by dict.keys().  
Each text is a list of sentences, and each sentence is a list of tokens.  
For instance, corpus_ljl['level1'][0][0] gives the first sentence of the first text of the "level1" class in the ljl corpus.  
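As a minimal sketch of this layout, the toy corpus below follows the same nesting of classes, texts, sentences and tokens. The labels and sentences are made up for illustration and do not come from the repository.

```python
# Toy corpus with the same nesting as the pickled corpora:
# {class_label: [text, ...]}, text = [sentence, ...], sentence = [token, ...]
# All labels and sentences here are invented for this example.
toy_corpus = {
    "level1": [
        [["Le", "chat", "dort", "."], ["Il", "rêve", "."]],
    ],
    "level2": [
        [["La", "pluie", "tombe", "sur", "la", "ville", "."]],
        [["Nous", "partons", "demain", "."]],
    ],
}

# Class labels are the dictionary keys
class_labels = list(toy_corpus.keys())

# First sentence of the first text of the "level1" class
first_sentence = toy_corpus["level1"][0][0]

# Number of texts per class
texts_per_class = {label: len(texts) for label, texts in toy_corpus.items()}

print(class_labels)
print(first_sentence)
print(texts_per_class)
```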


In [8]:
corpus_ljl['level1'][0][0]

["Aujourd'hui",
 ',',
 'toute',
 'la',
 'famille',
 'est',
 'allée',
 'à',
 'la',
 'fête',
 'foraine',
 '.']

In [9]:
for level in corpus_bb.keys():
  for text in corpus_bb[level][:]:
    if len(text)==0:
      corpus_bb[level].remove(text)

for level in corpus_jll.keys():
  for text in corpus_jll[level][:]:
    if len(text)==0:
      corpus_jll[level].remove(text)
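
The loops above iterate over a copy of each list (the `[:]` slice) so that removing items while looping is safe. An equivalent sketch, assuming the same dict-of-lists layout, rebuilds each list with a comprehension instead. `drop_empty_texts` is a hypothetical helper name and is not part of the library.

```python
def drop_empty_texts(corpus):
    """Return a new corpus dict without empty texts (the input is left untouched)."""
    return {
        level: [text for text in texts if len(text) > 0]
        for level, texts in corpus.items()
    }

# Made-up corpus containing a few empty texts
toy = {"a": [[["Un", "."]], []], "b": [[], [["Deux", "."]], []]}
cleaned = drop_empty_texts(toy)

print({level: len(texts) for level, texts in cleaned.items()})
```

Unlike the in-place version, this variant leaves the original dictionary unchanged, which can be convenient when the raw corpus is needed again later.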

### Introducing the first main component of the library

The class that takes care of calling the processes developed for handling various text readability tasks is called the "ReadabilityProcessor".

When initializing it, it loads external resources that may be needed such as NLP processors, language models, or dataframes containing data such as word lists.

Each measure is enabled by default, but can be excluded, along with its dependencies, on a case-by-case basis.  
As of July 22nd, 2022, the following measures can be excluded:

Traditional scores :
* gfi, ari, fre, fkgl, smog, rel

Measures related to perplexity:
* pppl (pseudo-perplexity)

Measures related to text diversity:
* ttr, ntr (text/noun token ratio)

Measures linked with word lists:
* old20, pld20 (Orthographic/Phonemic Levenshtein Distance 20)
* dubois_buyse_ratio

Measures related to text cohesion:
* entity_density, average_entity_word_length
* cosine_similarity_tfidf, cosine_similarity_LDA
* referring_entity_ratio, average_length_reference_chain (Uses coreference chains)

For a quick explanation of a measure's origin and how it works, the help function can be used on the processor's functions, which share the same names as the measures themselves.  
For instance, try help(readability_processor.gfi)

Since the module is documented, help() can also be called on an instance to view everything.

In [11]:
readability_processor = readability.Readability(exclude=["cosine_similarity_LDA"])

Acquiring Natural Language Processor...
Downloading spacy language model 
(Should only happen once)
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
DEBUG: Spacy model location:  /usr/local/lib/python3.7/dist-packages/fr_core_news_sm/fr_core_news_sm-3.2.0
importing GPT2 model..



imported GPT2 model
importing lexique data as dataframe
lexique dataframe imported
importing dubois-buyse data as dataframe
dubois-buyse dataframe imported


### Introducing the second main component of the library

The ParsedText class and its extension ParsedCollection can be created from a readability processor when supplied with a text or a collection.

They are not only an interface between texts and the readability processor: they also store the measures obtained from the processor, alongside various common statistics that can be reused across several of the processor's functions to speed up computation.

They are currently composed of four attributes:
* content: Storing a text, or a collection of texts
* readability_processor: a shared instance of the readability processor
* statistics: A dictionary storing various common statistics
* scores: A dictionary storing measures obtained from the readability processor

Since the modules are documented, it can also be appropriate to call help() on an instance in order to view everything.
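
To illustrate why storing statistics alongside the content helps, here is a minimal sketch of the caching idea. `ToyProcessor` and `ToyParsedText` are hypothetical stand-ins written for this example; they mirror the four attributes listed above but are not the library's implementation.

```python
class ToyProcessor:
    """Hypothetical stand-in for the readability processor."""

    def count_words(self, content):
        # content is a list of sentences, each a list of tokens
        return sum(len(sentence) for sentence in content)


class ToyParsedText:
    """Illustrative stand-in exposing the four attributes described above."""

    def __init__(self, content, readability_processor):
        self.content = content
        self.readability_processor = readability_processor
        self.statistics = {}
        self.scores = {}

    def total_words(self):
        # Compute the statistic once, then reuse the cached value
        if "totalWords" not in self.statistics:
            self.statistics["totalWords"] = self.readability_processor.count_words(
                self.content
            )
        return self.statistics["totalWords"]


text = ToyParsedText([["Le", "chat", "dort", "."], ["Il", "rêve", "."]], ToyProcessor())
print(text.total_words())
print(text.statistics)
```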

In [15]:
processed_corpus_ljl = readability_processor.parseCollection(corpus_ljl)
processed_corpus_jll = readability_processor.parseCollection(corpus_jll)
processed_corpus_bb = readability_processor.parseCollection(corpus_bb)

## Reproducing the contents of table 2

In [17]:
import pandas as pd

In [18]:
def reproduce_table2(processed_corpus):
  needed_labels = ["totalTexts","totalSentences","totalWords"]
  stats = [[],[],[]]
  for class_label in list(processed_corpus.statistics.keys()):
    for index,stat_label in enumerate(needed_labels):
      stats[index].append(processed_corpus.statistics[class_label][stat_label])

  for stat in stats:
    stat.append(sum(stat))

  df = pd.DataFrame([stats[0],stats[1],stats[2]],columns=(list(processed_corpus.statistics.keys())+["total"]))
  df.index = ["Nombre de fichiers artificiel","Nombre de phrases total","Nombre de tokens"]
  return df

In [19]:
df_table2_ljl = reproduce_table2(processed_corpus_ljl)
original_documents = [240,314,134,58,746]
df_table2_ljl.loc["Nombre de fichiers original"] = original_documents

df_table2_bb = reproduce_table2(processed_corpus_bb)
original_documents = [52,91,65,208]
df_table2_bb.loc["Nombre de fichiers original"] = original_documents

df_table2_jll = reproduce_table2(processed_corpus_jll)
original_documents = [13,12,10,9,44]
df_table2_jll.loc["Nombre de fichiers original"] = original_documents

In [21]:
df_table2_ljl

| | level1 | level2 | level3 | level4 | total |
|---|---|---|---|---|---|
| Nombre de fichiers artificiel | 240 | 628 | 670 | 522 | 2060 |
| Nombre de phrases total | 4880 | 13049 | 10354 | 7743 | 36026 |
| Nombre de tokens | 38976 | 128019 | 124901 | 101165 | 393061 |
| Nombre de fichiers original | 240 | 314 | 134 | 58 | 746 |


In [22]:
df_table2_bb

| | intermédiaire | avancée | aisée | total |
|---|---|---|---|---|
| Nombre de fichiers artificiel | 1729 | 1253 | 986 | 3968 |
| Nombre de phrases total | 22088 | 15762 | 12274 | 50124 |
| Nombre de tokens | 315369 | 232604 | 173939 | 721912 |
| Nombre de fichiers original | 52 | 91 | 65 | 208 |


In [23]:
df_table2_jll

| | cycle4_3e | cycle4_4e | cycle4_5e | cycle3_6e | total |
|---|---|---|---|---|---|
| Nombre de fichiers artificiel | 986 | 989 | 1187 | 1283 | 4445 |
| Nombre de phrases total | 14689 | 13553 | 13818 | 13463 | 55523 |
| Nombre de tokens | 188091 | 195375 | 211099 | 256573 | 851138 |
| Nombre de fichiers original | 13 | 12 | 10 | 9 | 44 |


## Reproducing the contents of table 3

### Traditional scores

In [24]:
from scipy.stats import pearsonr

In [25]:
def reproduce_table3(processed_corpus):
  needed_traditional_scores = ["gfi","ari","fre","fkgl","smog","rel"]
  pearson = []
  for score in needed_traditional_scores:
    labels = []
    scores_as_list = []
    for label in list(processed_corpus.content.keys()):
      for text in processed_corpus.content[label]:
        scores_as_list.append(text.traditional_score(score))
        labels.append(list(processed_corpus.content.keys()).index(label))
    pearson.append(pearsonr(scores_as_list,labels)[0])

  
  math_formulas = pd.DataFrame([processed_corpus.gfi(),processed_corpus.ari(),
                                processed_corpus.fre(),processed_corpus.fkgl(),
                                processed_corpus.smog(),processed_corpus.rel()],
                                columns=list(processed_corpus.content.keys()))

  math_formulas.index = ["The Gunning fog index GFI", "The Automated readability index ARI","The Flesch reading ease FRE","The Flesch-Kincaid grade level FKGL","The Simple Measure of Gobbledygook SMOG","Reading Ease Level"]
  math_formulas['Pearson Score'] = pearson
  math_formulas.columns.name = "Mean values"
  return math_formulas
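
The function above pairs every text's score with the index of its class in the corpus and correlates the two lists. As a hedged illustration of that pairing, the sketch below uses a small pure-Python Pearson coefficient instead of scipy's `pearsonr`, on made-up scores. The helper name `pearson` and the toy data are assumptions for this example only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented per-text scores for three classes, ordered from easiest to hardest
toy_scores = {"easy": [10.0, 12.0], "medium": [15.0, 14.0], "hard": [20.0, 19.0]}
class_names = list(toy_scores.keys())

values, labels = [], []
for label in class_names:
    for value in toy_scores[label]:
        values.append(value)
        # The class index serves as an ordinal difficulty label
        labels.append(class_names.index(label))

r = pearson(values, labels)
print(round(r, 3))
```

A coefficient close to 1 or -1 means the score moves consistently with the class order, while a value near 0 means the score barely tracks the difficulty levels.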

In [26]:
scores_ljl = reproduce_table3(processed_corpus_ljl)
scores_bb = reproduce_table3(processed_corpus_bb)
scores_jll = reproduce_table3(processed_corpus_jll)

In [27]:
scores_ljl

| Mean values | level1 | level2 | level3 | level4 | Pearson Score |
|---|---|---|---|---|---|
| The Gunning fog index GFI | 45.132518 | 67.697721 | 91.866336 | 105.669951 | 0.475915 |
| The Automated readability index ARI | 14.238996 | 19.932585 | 25.719148 | 27.7577 | 0.472037 |
| The Flesch reading ease FRE | 72.037523 | 61.563843 | 54.090075 | 50.257194 | -0.404092 |
| The Flesch-Kincaid grade level FKGL | 5.183396 | 7.152952 | 8.740208 | 9.439848 | 0.457392 |
| The Simple Measure of Gobbledygook SMOG | 16.250487 | 18.911327 | 21.336397 | 22.355208 | 0.486627 |
| Reading Ease Level | 88.681859 | 79.299771 | 72.508344 | 69.086158 | -0.41096 |


In [28]:
scores_bb

| Mean values | intermédiaire | avancée | aisée | Pearson Score |
|---|---|---|---|---|
| The Gunning fog index GFI | 128.933829 | 122.607686 | 122.606339 | -0.037663 |
| The Automated readability index ARI | 36.714696 | 36.263151 | 35.554898 | -0.024102 |
| The Flesch reading ease FRE | 53.810381 | 55.380706 | 54.790151 | 0.024818 |
| The Flesch-Kincaid grade level FKGL | 9.987963 | 9.753552 | 9.742124 | -0.024763 |
| The Simple Measure of Gobbledygook SMOG | 24.227734 | 24.180287 | 24.041178 | -0.012431 |
| Reading Ease Level | 71.622888 | 72.997206 | 72.533265 | 0.024926 |


In [29]:
scores_jll

| Mean values | cycle4_3e | cycle4_4e | cycle4_5e | cycle3_6e | Pearson Score |
|---|---|---|---|---|---|
| The Gunning fog index GFI | 104.065761 | 102.421179 | 132.388006 | 119.822886 | 0.111777 |
| The Automated readability index ARI | 34.358482 | 36.11901 | 40.75461 | 46.380268 | 0.195434 |
| The Flesch reading ease FRE | 73.376908 | 76.95731 | 57.766046 | 75.52741 | -0.050773 |
| The Flesch-Kincaid grade level FKGL | 7.147536 | 6.922545 | 9.912481 | 8.21576 | 0.138531 |
| The Simple Measure of Gobbledygook SMOG | 21.718341 | 22.137552 | 24.655992 | 23.986362 | 0.175971 |
| Reading Ease Level | 88.704426 | 91.673517 | 74.81122 | 89.848486 | -0.062848 |


### Pseudo-perplexity (takes around an hour to compute)

In [30]:
def add_perplexity_to_table3(processed_corpus,scores_dataframe):
  pearson = []
  labels = []
  perplexity_as_list = []
  for label in list(processed_corpus.content.keys()):
    for text in processed_corpus.content[label]:
      perplexity_as_list.append(text.perplexity())
      labels.append(list(processed_corpus.content.keys()).index(label))
  pearson.append(pearsonr(perplexity_as_list,labels)[0])

  scores_dataframe.loc["Pseudo_perplexity"] = list(processed_corpus.perplexity().values()) + pearson
  return scores_dataframe

In [31]:
add_perplexity_to_table3(processed_corpus_ljl, scores_ljl)

KeyboardInterrupt: ignored

In [None]:
add_perplexity_to_table3(processed_corpus_bb, scores_bb)

In [None]:
add_perplexity_to_table3(processed_corpus_jll, scores_jll)

## Reproducing the contents of table 4 for MLP and SVM

This should take around 50 minutes to compute on Colab.

In [None]:
from readability.methods import methods

In [None]:
methods.demo_doMethods(corpus_ljl,plot=False)

Matrix dimensions: (2060, 11661)
Vocabulary size: 11661
MLP RESULTS
cross-validation result for 5 runs = 0.479126213592233
              precision    recall  f1-score   support

      level1       0.45      0.46      0.45       240
      level2       0.47      0.63      0.54       628
      level3       0.47      0.47      0.47       670
      level4       0.54      0.32      0.40       522

    accuracy                           0.48      2060
   macro avg       0.48      0.47      0.47      2060
weighted avg       0.49      0.48      0.47      2060

SVM RESULTS
cross-validation result for 5 runs = 0.4757281553398058
              precision    recall  f1-score   support

      level1       0.46      0.42      0.44       240
      level2       0.46      0.60      0.52       628
      level3       0.47      0.49      0.48       670
      level4       0.53      0.33      0.41       522

    accuracy                           0.48      2060
   macro avg       0.48      0.46      0.46     

In [None]:
methods.demo_doMethods(corpus_bb,plot=False)

Matrix dimensions: (3968, 19590)
Vocabulary size: 19590
MLP RESULTS
cross-validation result for 5 runs = 0.4977301387137453
                precision    recall  f1-score   support

intermédiaire       0.52      0.60      0.56      1729
      avancée       0.51      0.48      0.50      1253
        aisée       0.43      0.33      0.37       986

      accuracy                           0.50      3968
     macro avg       0.48      0.47      0.48      3968
  weighted avg       0.49      0.50      0.49      3968

SVM RESULTS
cross-validation result for 5 runs = 0.5176462180095991
                precision    recall  f1-score   support

intermédiaire       0.52      0.66      0.59      1729
      avancée       0.55      0.46      0.50      1253
        aisée       0.46      0.33      0.39       986

      accuracy                           0.52      3968
     macro avg       0.51      0.49      0.49      3968
  weighted avg       0.51      0.52      0.51      3968



In [None]:
methods.demo_doMethods(corpus_jll,plot=False)

Matrix dimensions: (4445, 19128)
Vocabulary size: 19128
MLP RESULTS
cross-validation result for 5 runs = 0.604949381327334
              precision    recall  f1-score   support

   cycle4_3e       0.56      0.58      0.57       986
   cycle4_4e       0.41      0.38      0.39       989
   cycle4_5e       0.80      0.61      0.69      1187
   cycle3_6e       0.64      0.79      0.71      1283

    accuracy                           0.60      4445
   macro avg       0.60      0.59      0.59      4445
weighted avg       0.61      0.60      0.60      4445

SVM RESULTS
cross-validation result for 5 runs = 0.5739032620922384
              precision    recall  f1-score   support

   cycle4_3e       0.52      0.48      0.50       986
   cycle4_4e       0.42      0.31      0.36       989
   cycle4_5e       0.75      0.61      0.67      1187
   cycle3_6e       0.57      0.81      0.67      1283

    accuracy                           0.57      4445
   macro avg       0.56      0.55      0.55     

## How to reproduce the results in table 4 for fastText and CamemBERT

In [None]:
from readability.models import models, fasttext, bert

The following demonstration uses the CSV files available in the data/ folder, encoded in one-hot vector format.  
It relies on the ktrain library (a wrapper around Keras) to help configure and train deep learning models.  
Please enable the GPU to make training much faster:  
Edit -> Notebook Settings -> Hardware Accelerator : GPU
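
For readers unfamiliar with the one-hot format mentioned above, the sketch below shows the idea on a few labels: each class gets its own column, holding 1 for the text's class and 0 elsewhere. The `one_hot` helper is a hypothetical name written for this example; the actual CSV files were prepared by the authors and may have been built differently.

```python
def one_hot(labels, class_names):
    """Map each label to a list of 0/1 flags, one flag per class name."""
    return [[1 if label == name else 0 for name in class_names] for label in labels]


class_names = ["level1", "level2", "level3", "level4"]
rows = one_hot(["level4", "level2", "level1"], class_names)

print(class_names)
for row in rows:
    print(row)
```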

### fastText

In [None]:
fasttext.demo_doFastText("ljl") # Can pass "ljl", "bibebook.com", "JeLisLibre", or "all" as a parameter
# Takes around 15 minutes without GPU for the ljl corpus (default parameter) on free Colab
# Takes around 3 minutes with GPU enabled.

detected encoding: utf-8 (if wrong, set manually)
['level1', 'level2', 'level3', 'level4']
      level1  level2  level3  level4
1859       0       0       0       1
697        0       1       0       0
1691       0       0       0       1
41         1       0       0       0
62         1       0       0       0
['level1', 'level2', 'level3', 'level4']
      level1  level2  level3  level4
1247       0       0       1       0
99         1       0       0       0
1215       0       0       1       0
1604       0       0       0       1
1566       0       0       0       1
language: fr
Word Counts: 19507
Nrows: 1854
1854 train sequences
train sequence lengths:
	mean : 166
	95percentile : 368
	99percentile : 525
x_train shape: (1854,150)
y_train shape: (1854, 4)
Is Multi-Label? False
206 test sequences
test sequence lengths:
	mean : 156
	95percentile : 330
	99percentile : 409
x_test shape: (206,150)
y_test shape: (206, 4)
Is Multi-Label? False
compiling word ID features...
maxlen is 150
don

0

### CamemBERT

Without the GPU enabled this takes multiple hours, so remember to enable it first:  
Edit -> Notebook Settings -> Hardware Accelerator : GPU

In [None]:
bert.demo_doBert() #Can pass "ljl", "bibebook.com", "JeLisLibre", or "all" as a parameter
#Takes around 15 minutes for the ljl corpus on GPU (default parameter)

-------------------------------------------------------------------
len_train 1854
CORPUS_NAME ljl MODEL_NAME camembert-base class_names ['level1', 'level2', 'level3', 'level4']
--> getTransformer
preprocessing train...
language: fr
train sequence lengths:
	mean : 195
	95percentile : 424
	99percentile : 608


Is Multi-Label? False
preprocessing test...
language: fr
test sequence lengths:
	mean : 178
	95percentile : 389
	99percentile : 495


-------------------------------------------------------run 0
early_stopping automatically enabled at patience=5
reduce_on_plateau automatically enabled at patience=2


begin training using triangular learning rate policy with max lr of 0.0001...
Epoch 1/1024
Epoch 2/1024
Epoch 3/1024
Epoch 4/1024
Epoch 5/1024
Epoch 6/1024
Epoch 7/1024
Epoch 00007: Reducing Max LR on Plateau: new max lr will be 5e-05 (if not early_stopping).
Epoch 8/1024
Epoch 9/1024
Epoch 00009: Reducing Max LR on Plateau: new max lr will be 2.5e-05 (if not early_stopping).
Epoch 10/1024
Epoch 10: early stopping
Weights from best epoch have been loaded into model.
MODEL_NAME camembert-base run 0 CORPUSNAME ljl class_names ['level1', 'level2', 'level3', 'level4']
              precision    recall  f1-score   support

      level1       0.70      0.28      0.40        25
      level2       0.66      0.77      0.71        65
      level3       0.72      0.83      0.77        69
      level4       0.90      0.79      0.84 

# Examples of use

## Importing data for the examples

In [None]:
import pickle
with open(os.path.join(os.getcwd(),"data","tokens_split.pkl"), "rb") as file:
    corpus = pickle.load(file)

In [None]:
#This can also be done by doing a wget :
#!wget -nc https://github.com/nicolashernandez/READI-LREC22/blob/main/data/tokens_split.pkl?raw=true -P data
#with open(os.path.join(os.getcwd(),"data","tokens_split.pkl?raw=true"), "rb") as file:
#    corpus = pickle.load(file)

## Example one : Using the library for a text

Texts can be strings, but it is preferable to prepare them beforehand as tokenized sentences, i.e. a list of lists of tokens ( list(list()) ).  
If using spaCy with a loaded pipeline (for instance nlp = spacy.load('fr_core_news_sm')), something like this can be used:  
new_text = [[token.text for token in sent] for sent in nlp(text).sents]

A text is processed when calling the parse function of a ReadabilityProcessor instance.
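
To make the expected list(list()) shape concrete without requiring a spaCy model, here is a rough stand-in based on regular expressions. `naive_tokenize` is a hypothetical helper for illustration only: it splits sentences after ".", "!" or "?" and separates punctuation from words, which is much cruder than spaCy's French pipeline (elisions such as "l'" are not handled properly).

```python
import re


def naive_tokenize(text):
    """Split a raw string into sentences, each a list of word/punctuation tokens."""
    # Cut wherever whitespace follows sentence-final punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Words are runs of word characters; any other non-space character is its own token
    return [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences if sentence]


new_text = naive_tokenize("Bonjour tout le monde. Il fait beau !")
for sentence in new_text:
    print(sentence)
```

The resulting nested list has the same shape as the texts stored in the pickled corpora, so it can be handed to the parse function in the same way.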

In [None]:
import pandas as pd
import spacy
processed_text = readability_processor.parse(corpus['level1'][0])

Common scores can be accessed by using the corresponding function.

In [None]:
gfi = processed_text.gfi()
gfi

61.52380952380953

More conveniently, a list of every available score can be obtained by using .show_scores()

In [None]:
processed_text.show_scores(force=True)

Value cosine_similarity_LDA was not found in instance.informations. Please check if you excluded it when initializing the ReadabilityProcessor.


| | gfi | ari | fre | fkgl | smog | rel | pppl | dubois_buyse_ratio | ttr | ntr | old20 | pld20 | cosine_similarity_tfidf | cosine_similarity_LDA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 61.52381 | 21.503161 | 52.633986 | 8.63882 | 21.86996 | 71.403333 | 25.790537 | 0.622449 | 0.605128 | 0.588235 | 2.541743 | 2.064141 | 0.070205 | |


Processed texts also incorporate some common statistics for analyzing texts. These can be viewed through a dictionary attribute called "statistics", or by calling the show_statistics() method.

In [None]:
processed_text.show_statistics()

totalWords = 230
totalLongWords = 30
totalSentences = 21
totalCharacters = 837
totalSyllables = 389
nbPolysyllables = 226
vocabulary = {'sœur', "J'", 'arbre', '"', 'Maman', 'même', 'et', 'au', 'toute', 'avec', 'crois', 'heureux', 'Essaie', 'Plus', 'allée', 'bleue', "m'", ',', 'le', "d'", 'magnifique', '-', 'elle', 'deviner', 'pour', 'tu', 'manger', '?', 'est', 'impression', 'soirée', 'fête', 'envoyé', 'en', 'refusé', "Aujourd'hui", 'soleil', '.', 'ceci', 'donné', 'Ce', 'chapeau', 'un', 'je', 'vent', 'regardait', 'sur', 'petite', "l'", 'souri', 'une', 'poser', 'rouge', 'dans', 'de', 'suis', 'Elle', 'porte', 'La', 't', 'avais', 'fait', 'lunettes', 'ai', 'Est', 'ma', 'était', 'coiffée', 'pleuré', 'revenant', 'sorti', 'des', 'beaucoup', 'a', 'mais', '!', 'En', 'Le', 'amusantes', 'apparition', 'énorme', 'foraine', "qu'", 'emporté', 'acheté', 'eu', '-ce', 'soir', 'essayé', 'tard', 'suçon', 'lune', 'coup', 'école', 'casquette', 'mère', 'souper', 'aurait', 'là', 'gros', 'étais', 'maison', 'frè

{'nbPolysyllables': 226,
 'totalCharacters': 837,
 'totalLongWords': 30,
 'totalSentences': 21,
 'totalSyllables': 389,
 'totalWords': 230,
 'vocabulary': {'!',
  '"',
  ',',
  '-',
  '-ce',
  '.',
  '?',
  "Aujourd'hui",
  'Ce',
  'Elle',
  'En',
  'Essaie',
  'Est',
  "J'",
  'La',
  'Le',
  'Ma',
  'Maman',
  'Papa',
  'Plus',
  'a',
  'acheté',
  'ai',
  'allée',
  'amusantes',
  'apparition',
  'arbre',
  'au',
  'aujourd’hui',
  'aurait',
  'aussi',
  'avais',
  'avec',
  'beaucoup',
  'belle',
  'besoin',
  'bleue',
  'branche',
  'casquette',
  'ceci',
  'chapeau',
  'coiffée',
  'coup',
  'crois',
  "d'",
  'dans',
  'de',
  'des',
  'deviner',
  'dit',
  'donné',
  'elle',
  'emporté',
  'en',
  'envoyé',
  'essayé',
  'est',
  'et',
  'eu',
  'fait',
  'famille',
  'foraine',
  'frère',
  'fête',
  'gros',
  'heureux',
  'impression',
  'je',
  "l'",
  'la',
  'le',
  'lendemain',
  'lui',
  'lune',
  'lunettes',
  'là',
  "m'",
  'ma',
  'magnifique',
  'mais',
  'maison',
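
As a sanity check, the gfi value of 61.52380952380953 reported earlier for this text can be recovered from three of the statistics printed above. This is an observation made from the printed numbers, not a statement about the library's source code: the combination that matches divides the long-word term by the number of sentences, whereas the textbook Gunning fog index divides it by the number of words.

```python
# Statistics printed by show_statistics() for this text
total_words = 230
total_long_words = 30
total_sentences = 21

# Combination that matches the gfi value printed in this notebook
# (inferred from the numbers; assumption about how the score is built)
observed = 0.4 * (total_words + 100 * total_long_words) / total_sentences

# Textbook Gunning fog index, shown for comparison only
textbook = 0.4 * (
    total_words / total_sentences + 100 * total_long_words / total_words
)

print(observed)
print(textbook)
```

The gap between the two values may explain why the GFI means in the tables of this notebook are far above the usual range of the index.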


## Example two : Using the library for a corpus

Currently, a corpus is recognized by the library if it uses one of the following two structures:
* type(corpus) = dict[class][text]
* type(corpus) = list(list(text))

As mentioned earlier, a text can be a string, a list of sentences, or a list of sentences split into tokens.
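
The two accepted layouts can be sketched with a small helper that returns a dict[class][text] view of either one. `as_labeled_corpus` and the generated "class0", "class1" labels are assumptions made for this example; how the library itself names the classes of a list-based corpus is not shown here.

```python
def as_labeled_corpus(corpus):
    """Return a dict[class][text] view of either accepted corpus layout.

    Hypothetical helper: the generated "class0", "class1", ... labels are an
    assumption for this sketch, not the library's naming scheme.
    """
    if isinstance(corpus, dict):
        return corpus
    if isinstance(corpus, list) and all(isinstance(group, list) for group in corpus):
        return {f"class{index}": group for index, group in enumerate(corpus)}
    raise TypeError("expected dict[class][text] or list(list(text))")


# Made-up corpus in the list(list(text)) layout, with texts given as strings
as_list = [["Un texte facile."], ["Un texte plus difficile.", "Encore un."]]
labeled = as_labeled_corpus(as_list)

print(list(labeled.keys()))
print({label: len(texts) for label, texts in labeled.items()})
```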

In [None]:
processed_corpus = readability_processor.parseCollection(corpus)

ParsedCollection instances are similar to ParsedText instances, and possess many functions with the same name that do the same thing, but applied over a corpus.

In [None]:
processed_corpus.show_statistics()

level1------------------
totalWords = 38976
totalLongWords = 5182
totalSentences = 4880
totalCharacters = 141592
totalSyllables = 58965
nbPolysyllables = 27367
vocabulary = 4836 words
totalTexts = 240
meanSentences = 20.3
meanTokens = 162.4
level2------------------
totalWords = 128019
totalLongWords = 20547
totalSentences = 13049
totalCharacters = 487685
totalSyllables = 205889
nbPolysyllables = 102106
vocabulary = 10903 words
totalTexts = 628
meanSentences = 20.8
meanTokens = 203.9
level3------------------
totalWords = 124901
totalLongWords = 22224
totalSentences = 10354
totalCharacters = 491007
totalSyllables = 207672
nbPolysyllables = 107141
vocabulary = 11953 words
totalTexts = 670
meanSentences = 15.5
meanTokens = 186.4
level4------------------
totalWords = 101165
totalLongWords = 19550
totalSentences = 7743
totalCharacters = 410227
totalSyllables = 173298
nbPolysyllables = 92325
vocabulary = 11410 words
totalTexts = 522
meanSentences = 14.8
meanTokens = 193.8
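
The meanSentences and meanTokens lines above are consistent with dividing the per-class totals by totalTexts and rounding to one decimal. The sketch below recomputes them from the printed totals; the rounding rule is inferred from the output rather than taken from the library's code.

```python
# Totals copied from the show_statistics() output above
stats = {
    "level1": {"totalTexts": 240, "totalSentences": 4880, "totalWords": 38976},
    "level2": {"totalTexts": 628, "totalSentences": 13049, "totalWords": 128019},
    "level3": {"totalTexts": 670, "totalSentences": 10354, "totalWords": 124901},
    "level4": {"totalTexts": 522, "totalSentences": 7743, "totalWords": 101165},
}

# (mean sentences per text, mean tokens per text), rounded to one decimal
means = {
    label: (
        round(s["totalSentences"] / s["totalTexts"], 1),
        round(s["totalWords"] / s["totalTexts"], 1),
    )
    for label, s in stats.items()
}

for label, (mean_sentences, mean_tokens) in means.items():
    print(label, mean_sentences, mean_tokens)
```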


In [None]:
processed_corpus.gfi()

{'level1': 45.13251776706827,
 'level2': 67.69772144381106,
 'level3': 91.86633625637114,
 'level4': 105.6699509923556}

In [None]:
processed_corpus.content['level1'][0].gfi() # Is also 61.52380952380953, as seen in the previous section

61.52380952380953

show_scores() behaves differently here: instead of giving the scores for each text, it returns a dataframe showing the mean values.

In [None]:
processed_corpus.show_scores()

| | gfi | ari | fre | fkgl | smog | rel | pppl | dubois_buyse_ratio | ttr | ntr | old20 | pld20 | cosine_similarity_tfidf | cosine_similarity_LDA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'level1': 45.13251776706827, 'level2': 67.697... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... | {'level1': None, 'level2': None, 'level3': Non... |


In [None]:
processed_corpus_bb.show_scores()

| | gfi | ari | fre | fkgl | smog | rel | pppl | dubois_buyse_ratio | ttr | ntr | old20 | pld20 | cosine_similarity_tfidf | cosine_similarity_LDA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'intermédiaire': 128.93382927591108, 'avance... | {'intermédiaire': 36.71469603319838, 'avancé... | {'intermédiaire': 53.81038055220457, 'avancé... | {'intermédiaire': 9.987963386560645, 'avancé... | {'intermédiaire': 24.22773446715755, 'avancé... | {'intermédiaire': 71.62288762838539, 'avancé... | {'intermédiaire': None, 'avancée': None, 'ai... | {'intermédiaire': None, 'avancée': None, 'ai... | {'intermédiaire': None, 'avancée': None, 'ai... | {'intermédiaire': None, 'avancée': None, 'ai... | {'intermédiaire': None, 'avancée': None, 'ai... | {'intermédiaire': None, 'avancée': None, 'ai... | {'intermédiaire': None, 'avancée': None, 'ai... | {'intermédiaire': None, 'avancée': None, 'ai... |


In addition, machine learning and deep learning applications can be used with the corpus data to help develop NLP solutions.

In [None]:
#r.importmodel(camembert)
#r.configmodel(params)
#r.train(mode=autofit)