# Clean the Lexique dataset to build a word frequency dataset

Lexique383 is a French word frequency dataset built from books and movie subtitles. It contains 140,000+ words along with a lot of extra information (pronunciation, lemma, grammatical gender, etc.) that we do not need here. We therefore filter the dataset to keep only single words without special characters, normalize them to remove accents, and group them by spelling to get the total frequency of each written form.

## Imports

In [1]:
import os
import pandas as pd

## Loading raw dataset

In [2]:
pd.set_option('display.max_columns', None)
lexique_df = pd.read_excel("Lexique383.xlsb")
lexique_df.head()

Unnamed: 0,1_ortho,2_phon,3_lemme,4_cgram,5_genre,6_nombre,7_freqlemfilms2,8_freqlemlivres,9_freqfilms2,10_freqlivres,11_infover,12_nbhomogr,13_nbhomoph,14_islem,15_nblettres,16_nbphons,17_cvcv,18_p_cvcv,19_voisorth,20_voisphon,21_puorth,22_puphon,23_syll,24_nbsyll,25_cv-cv,26_orthrenv,27_phonrenv,28_orthosyll,29_cgramortho,30_deflem,31_defobs,32_old20,33_pld20,34_morphoder,35_nbmorph
0,a,a,a,NOM,m,,81.36,58.65,81.36,58.65,,3,9,1,1,1,V,V,25,20,1,1,a,1,V,a,a,a,"NOM,AUX,VER",,,1.0,1.0,a,1
1,a,a,avoir,AUX,,,18559.22,12800.81,6350.91,2926.69,ind:pre:3s;,3,9,0,1,1,V,V,25,20,1,1,a,1,V,a,a,a,"NOM,AUX,VER",,,1.0,1.0,avoir,1
2,a,a,avoir,VER,,,13572.4,6426.49,5498.34,1669.39,ind:pre:3s;,3,9,0,1,1,V,V,25,20,1,1,a,1,V,a,a,a,"NOM,AUX,VER",93.0,16.0,1.0,1.0,avoir,1
3,a capella,akapEla,a capella,ADV,,,0.04,0.07,0.04,0.07,,1,2,1,9,7,V CVCVCCV,VCVCVCV,0,0,6,5,a-ka-pE-la,4,V-CV-CV-CV,allepac a,alEpaka,a ca-pel-la,ADV,,,3.85,2.85,a-capella,2
4,a cappella,akapEla,a cappella,ADV,,,0.04,0.07,0.04,0.07,,1,2,1,10,7,V CVCCVCCV,VCVCVCV,0,0,6,5,a-ka-pE-la,4,V-CV-CV-CV,alleppac a,alEpaka,a cap-pel-la,ADV,,,4.6,2.85,a-cappella,2


## EDA

### Dataset size

In [3]:
lexique_df["3_lemme"].unique().shape

(46947,)

### Data consistency

The letter count stored in "15_nblettres" should match the length of the string in "1_ortho".

In [4]:
# Check if the length of the string in "1_ortho" column is equal to the column "15_nblettres"
print(lexique_df["1_ortho"].str.len().equals(lexique_df["15_nblettres"]))

False


### Some words were misinterpreted

Here are the rows where the string length does not match the letter count:

In [5]:
nblettres_diff_df = lexique_df[lexique_df["1_ortho"].str.len() != (lexique_df["15_nblettres"])][["1_ortho", "15_nblettres"]]
nblettres_diff_df

Unnamed: 0,1_ortho,15_nblettres
55603,False,4
55604,False,4
55605,False,4
86569,,3
136832,True,4
136833,True,4


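The data itself seems clean; the mismatches come from how the values were interpreted on load. The words "faux" and "vrai" came back as the booleans False and True, and the word "nan" came back as a missing value.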

In [6]:
nblettres_diff_idx = nblettres_diff_df.index
# The index holds row labels, so select with .loc rather than .iloc
display(lexique_df.loc[nblettres_diff_idx])

Unnamed: 0,1_ortho,2_phon,3_lemme,4_cgram,5_genre,6_nombre,7_freqlemfilms2,8_freqlemlivres,9_freqfilms2,10_freqlivres,11_infover,12_nbhomogr,13_nbhomoph,14_islem,15_nblettres,16_nbphons,17_cvcv,18_p_cvcv,19_voisorth,20_voisphon,21_puorth,22_puphon,23_syll,24_nbsyll,25_cv-cv,26_orthrenv,27_phonrenv,28_orthosyll,29_cgramortho,30_deflem,31_defobs,32_old20,33_pld20,34_morphoder,35_nbmorph
55603,False,fo,False,ADJ,m,,122.23,109.59,90.99,66.55,,3,5,1,4,2,CVVC,CV,9,23,4,2,fo,1,CV,xuaf,of,False,"ADJ,ADV,NOM",100.0,20.0,1.35,1.0,False,1
55604,False,fo,False,ADV,,,7.61,2.7,7.61,2.7,,3,5,1,4,2,CVVC,CV,9,23,4,2,fo,1,CV,xuaf,of,False,"ADJ,ADV,NOM",,,1.35,1.0,False,1
55605,False,fo,False,NOM,,,5.71,8.51,5.71,8.51,,3,5,1,4,2,CVVC,CV,9,23,4,2,fo,1,CV,xuaf,of,False,"ADJ,ADV,NOM",100.0,23.0,1.35,1.0,False,1
86569,,n@,,NOM,m,s,11.92,1.42,11.92,1.42,,1,2,1,3,2,CVC,CV,14,28,3,2,n@,1,CV,,@n,,NOM,,,1.0,1.0,,1
136832,True,vRE,True,ADJ,m,s,807.03,430.07,678.47,311.89,,2,6,1,4,3,CCVV,CCV,4,12,4,3,vRE,1,CCV,iarv,ERv,True,"ADJ,NOM",95.0,21.0,1.6,1.0,True,1
136833,True,vRE,True,NOM,m,s,37.58,37.77,34.59,32.7,,2,6,1,4,3,CCVV,CCV,4,12,4,3,vRE,1,CCV,iarv,ERv,True,"ADJ,NOM",94.0,17.0,1.6,1.0,True,1


We change these values back to the corresponding raw strings.

In [7]:
lexique_df["1_ortho"] = lexique_df["1_ortho"].replace([False, True], ["faux", "vrai"])
lexique_df["1_ortho"] = lexique_df["1_ortho"].fillna("nan")

Now the letter count of each word matches the length of its string.

In [8]:
print(lexique_df["1_ortho"].str.len().equals(lexique_df["15_nblettres"]))

True
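
The missing value seen earlier is consistent with pandas' default parsing, where the string "nan" is one of the default missing-value markers. The sketch below reproduces this on a tiny in-memory CSV. The `csv_text` extract and the variable names are made up for illustration, and `read_csv` is used instead of `read_excel` only so that the snippet runs without the workbook. `keep_default_na` is also a parameter of `read_excel`, but whether it is enough for Lexique383.xlsb was not tested here: the False/True values may be stored as booleans in the workbook itself, in which case the replacement above is still needed.

```python
import io

import pandas as pd

# Hypothetical three-row extract; only the words themselves come from the notebook.
csv_text = "ortho,nblettres\nfaux,4\nnan,3\nvrai,4\n"

# Default parsing: "nan" is in pandas' default list of NA markers.
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False disables that list, so the word stays a string.
safe_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

n_missing_default = int(default_df["ortho"].isna().sum())
n_missing_safe = int(safe_df["ortho"].isna().sum())

print(n_missing_default, n_missing_safe)
print(safe_df["ortho"].tolist())
```

The same caveat would apply to any CSV that contains the word "nan" and is read back with default settings.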


### Checking that all words are lowercase

In [9]:
(lexique_df["1_ortho"] == lexique_df["1_ortho"].str.lower()).all()

True

## Data cleaning

We remove the words containing special characters (anything other than word characters), which also drops multi-word entries such as "a capella".

In [10]:
# Keep an explicit copy so that later column assignments do not act on a slice
lexique_df = lexique_df.loc[lexique_df["1_ortho"].str.match(r"\A[\w]+\Z")].copy()
lexique_df

Unnamed: 0,1_ortho,2_phon,3_lemme,4_cgram,5_genre,6_nombre,7_freqlemfilms2,8_freqlemlivres,9_freqfilms2,10_freqlivres,11_infover,12_nbhomogr,13_nbhomoph,14_islem,15_nblettres,16_nbphons,17_cvcv,18_p_cvcv,19_voisorth,20_voisphon,21_puorth,22_puphon,23_syll,24_nbsyll,25_cv-cv,26_orthrenv,27_phonrenv,28_orthosyll,29_cgramortho,30_deflem,31_defobs,32_old20,33_pld20,34_morphoder,35_nbmorph
0,a,a,a,NOM,m,,81.36,58.65,81.36,58.65,,3,9,1,1,1,V,V,25,20,1,1,a,1,V,a,a,a,"NOM,AUX,VER",,,1.00,1.0,a,1
1,a,a,avoir,AUX,,,18559.22,12800.81,6350.91,2926.69,ind:pre:3s;,3,9,0,1,1,V,V,25,20,1,1,a,1,V,a,a,a,"NOM,AUX,VER",,,1.00,1.0,avoir,1
2,a,a,avoir,VER,,,13572.40,6426.49,5498.34,1669.39,ind:pre:3s;,3,9,0,1,1,V,V,25,20,1,1,a,1,V,a,a,a,"NOM,AUX,VER",93.0,16.0,1.00,1.0,avoir,1
13,aa,aa,aa,NOM,m,s,0.01,0.00,0.01,0.00,,1,2,1,2,2,VV,VV,20,35,2,2,a-a,2,V-V,aa,aa,,NOM,,,1.00,1.0,aa,1
17,abaca,abaka,abaca,NOM,m,s,0.01,0.00,0.01,0.00,,1,1,1,5,5,VCVCV,VCVCV,0,1,4,5,a-ba-ka,3,V-CV-CV,acaba,akaba,,NOM,,,2.00,1.9,abaca,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142689,ôtée,ote,ôté,ADJ,f,s,0.23,0.61,0.10,0.27,,2,11,0,4,3,VCVV,VCV,1,19,0,3,o-te,2,V-CV,eétô,eto,ô-tée,"VER,ADJ",,,1.80,1.0,ôté,1
142690,ôtées,ote,ôter,VER,f,p,16.81,42.03,0.16,0.07,par:pas;,2,11,0,5,3,VCVVC,VCV,0,19,0,3,o-te,2,V-CV,seétô,eto,ô-tées,"VER,ADJ",89.0,28.0,1.85,1.0,ôter,1
142691,ôtées,ote,ôté,ADJ,f,p,0.23,0.61,0.01,0.07,,2,11,0,5,3,VCVVC,VCV,0,19,0,3,o-te,2,V-CV,seétô,eto,ô-tées,"VER,ADJ",,,1.85,1.0,ôté,1
142692,ôtés,ote,ôter,VER,m,p,16.81,42.03,0.04,0.14,par:pas;,2,11,0,4,3,VCVC,VCV,3,19,0,3,o-te,2,V-CV,sétô,eto,ô-tés,"VER,ADJ",89.0,28.0,1.65,1.0,ôter,1
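
To make the effect of the `\A[\w]+\Z` filter concrete, here is a small sketch on a handful of strings. Apart from "abaca", "a capella" and "ôtée", the sample words are hypothetical and were chosen only to show the edge cases. Note that `\w` also matches digits and underscores; the letters-only pattern at the end is an optional stricter alternative, not what the notebook uses.

```python
import pandas as pd

# Mostly hypothetical samples covering spaces, apostrophes, hyphens, accents, digits.
samples = pd.Series(
    ["abaca", "a capella", "aujourd'hui", "a-t-il", "ôtée", "mp3", "a_b"]
)

# Same pattern as in the notebook: the whole string must be word characters.
keep_mask = samples.str.match(r"\A[\w]+\Z")
kept = samples[keep_mask].tolist()

# Stricter variant: word characters that are neither digits nor underscores.
letters_mask = samples.str.fullmatch(r"[^\W\d_]+")
letters_only = samples[letters_mask].tolist()

print(kept)
print(letters_only)
```

Accented letters pass both filters, so accent removal has to be handled as a separate step.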


We remove accents by decomposing each character (NFKD normalization) and dropping everything that is not ASCII.

In [11]:
lexique_df["1_ortho"] = lexique_df["1_ortho"].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
lexique_df

Unnamed: 0,1_ortho,2_phon,3_lemme,4_cgram,5_genre,6_nombre,7_freqlemfilms2,8_freqlemlivres,9_freqfilms2,10_freqlivres,11_infover,12_nbhomogr,13_nbhomoph,14_islem,15_nblettres,16_nbphons,17_cvcv,18_p_cvcv,19_voisorth,20_voisphon,21_puorth,22_puphon,23_syll,24_nbsyll,25_cv-cv,26_orthrenv,27_phonrenv,28_orthosyll,29_cgramortho,30_deflem,31_defobs,32_old20,33_pld20,34_morphoder,35_nbmorph
0,a,a,a,NOM,m,,81.36,58.65,81.36,58.65,,3,9,1,1,1,V,V,25,20,1,1,a,1,V,a,a,a,"NOM,AUX,VER",,,1.00,1.0,a,1
1,a,a,avoir,AUX,,,18559.22,12800.81,6350.91,2926.69,ind:pre:3s;,3,9,0,1,1,V,V,25,20,1,1,a,1,V,a,a,a,"NOM,AUX,VER",,,1.00,1.0,avoir,1
2,a,a,avoir,VER,,,13572.40,6426.49,5498.34,1669.39,ind:pre:3s;,3,9,0,1,1,V,V,25,20,1,1,a,1,V,a,a,a,"NOM,AUX,VER",93.0,16.0,1.00,1.0,avoir,1
13,aa,aa,aa,NOM,m,s,0.01,0.00,0.01,0.00,,1,2,1,2,2,VV,VV,20,35,2,2,a-a,2,V-V,aa,aa,,NOM,,,1.00,1.0,aa,1
17,abaca,abaka,abaca,NOM,m,s,0.01,0.00,0.01,0.00,,1,1,1,5,5,VCVCV,VCVCV,0,1,4,5,a-ba-ka,3,V-CV-CV,acaba,akaba,,NOM,,,2.00,1.9,abaca,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142689,otee,ote,ôté,ADJ,f,s,0.23,0.61,0.10,0.27,,2,11,0,4,3,VCVV,VCV,1,19,0,3,o-te,2,V-CV,eétô,eto,ô-tée,"VER,ADJ",,,1.80,1.0,ôté,1
142690,otees,ote,ôter,VER,f,p,16.81,42.03,0.16,0.07,par:pas;,2,11,0,5,3,VCVVC,VCV,0,19,0,3,o-te,2,V-CV,seétô,eto,ô-tées,"VER,ADJ",89.0,28.0,1.85,1.0,ôter,1
142691,otees,ote,ôté,ADJ,f,p,0.23,0.61,0.01,0.07,,2,11,0,5,3,VCVVC,VCV,0,19,0,3,o-te,2,V-CV,seétô,eto,ô-tées,"VER,ADJ",,,1.85,1.0,ôté,1
142692,otes,ote,ôter,VER,m,p,16.81,42.03,0.04,0.14,par:pas;,2,11,0,4,3,VCVC,VCV,3,19,0,3,o-te,2,V-CV,sétô,eto,ô-tés,"VER,ADJ",89.0,28.0,1.65,1.0,ôter,1
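
The sketch below applies the same normalize/encode/decode chain to a few words so the behavior is easier to inspect. Only "ôtée" comes from the output above; the other words are hypothetical examples. One caveat: characters with no Unicode decomposition, such as the ligature "œ", are dropped entirely rather than transliterated. Whether Lexique383 contains such ligatures was not checked here, so the pre-replacement step is an optional precaution.

```python
import pandas as pd

words = pd.Series(["ôtée", "garçon", "noël", "cœur"])


def strip_accents(series: pd.Series) -> pd.Series:
    """Decompose characters (NFKD) and drop every non-ASCII code point."""
    return (
        series.str.normalize("NFKD")
        .str.encode("ascii", errors="ignore")
        .str.decode("utf-8")
    )


# The ligature has no decomposition, so it disappears: "cœur" becomes "cur".
stripped = strip_accents(words).tolist()

# Optional precaution: expand ligatures before stripping accents.
expanded = (
    words.str.replace("œ", "oe", regex=False)
    .str.replace("æ", "ae", regex=False)
)
stripped_safe = strip_accents(expanded).tolist()

print(stripped)
print(stripped_safe)
```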


Then we group by spelling and sum the book frequencies.

In [12]:
# Group by "1_ortho": keep the first "15_nblettres" and sum "10_freqlivres"
word_frequency_df = lexique_df.groupby("1_ortho", as_index=False).agg(
    {"15_nblettres": "first", "10_freqlivres": "sum"}
)
word_frequency_df.head()

Unnamed: 0,1_ortho,15_nblettres,10_freqlivres
0,a,1,23863.78
1,aa,2,0.0
2,abaca,5,0.0
3,abaissa,7,2.64
4,abaissai,8,0.07
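
As a sanity check on the grouping, the sketch below sums the three "a" rows shown in the head of the raw dataset. They add up to 4654.73, well below the 23863.78 reported above, so the rest presumably comes from accented forms (such as "à") that collapse onto "a" once accents are removed. That explanation is an inference and was not verified against the full data.

```python
import pandas as pd

# The three "a" rows from the head of the raw dataset (book frequencies).
rows = pd.DataFrame(
    {
        "1_ortho": ["a", "a", "a"],
        "4_cgram": ["NOM", "AUX", "VER"],
        "15_nblettres": [1, 1, 1],
        "10_freqlivres": [58.65, 2926.69, 1669.39],
    }
)

# One row per spelling: first letter count, summed frequency.
merged = rows.groupby("1_ortho", as_index=False).agg(
    {"15_nblettres": "first", "10_freqlivres": "sum"}
)
total = float(merged.loc[0, "10_freqlivres"])

print(round(total, 2))
```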


In [17]:
# Share of words whose summed book frequency is zero
word_frequency_df[word_frequency_df["10_freqlivres"] == 0.0].shape[0]/word_frequency_df.shape[0]

0.14214392634957992

In [13]:
# Normalize the "10_freqlivres" column
# word_frequency_df["10_freqlivres"] = word_frequency_df["10_freqlivres"] / word_frequency_df["10_freqlivres"].sum()
# word_frequency_df.head()

In [14]:
os.makedirs("word_frequency_vocab", exist_ok=True)
file_template_path = "word_frequency_vocab/word_frequency_{}.csv"
for nblettres in word_frequency_df["15_nblettres"].unique():
    # Build a new frame instead of modifying a slice in place,
    # which avoids pandas' SettingWithCopyWarning
    nblettres_word_frequency = (
        word_frequency_df.loc[word_frequency_df["15_nblettres"] == nblettres]
        .drop(columns=["15_nblettres"])
        .rename(columns={"1_ortho": "word", "10_freqlivres": "frequency"})
    )
    nblettres_word_frequency.to_csv(file_template_path.format(nblettres), index=False)