## Introduction

Ce programme a pour but de tester la plausibilité de la prononciation d'un mot telle qu'indiquée sur le Wiktionnaire.

Pour ce faire, il parcours itérativement les différents sons (phonemes) d'un mot et essaie les différentes lettres (graphemes) pouvant être utilisées pour écrire le son (grâce à une table de correspondance similaire à https://fr.wiktionary.org/wiki/Annexe:Prononciation/français#Troisième_approche). A chaque essai, il compare ce qu'il obtient avec l'orthographe utilisée dans le Wiktionnaire. Lorsqu'il arrive au même résultat, la prononciation du mot est jugée plausible. A l'inverse, s'il n'arrive pas à transcrire la même orthographe que celle indiquée dans le Wiktionnaire, la prononciation est jugée suspecte.

Ce programme peut aussi compter les différentes lettres utilisées pour transcrire un son. Ainsi, s'il est executé sur tous les mots du Wiktionnaire, il peut servir à compter les probabilités de ces lettres pour transcrire un son donné.



## Données d'entrée

In [1]:
# importation des librairies tierces
import datetime
import pandas as pd
from pathlib import Path

# lecture des données d'entrée (ligne contenant un mot du dictionnaire
# ainsi que sa prononciation
def read_df():

  '''Read the input data (i.e. lines containing a Wiktionary word 
  and its pronunciation) '''
  filename = '../wikt_parser/fr_wiktionary_full.csv'
  #filename = '../wikt_parser/fr_wiktionary_waves.csv'
    
  filepath = Path(filename)

  url = 'https://fonétik.fr/'
  url_filename = url + filename

  #print('downloading ', url_filename)
  # if file not found locally, then download it
  #if not filepath.exists():
  #  !wget -N -q {url_filename}

  df = pd.read_csv(filename, keep_default_na=False, sep = '\t')

  return df

df = read_df()

In [2]:
df.tail()

Unnamed: 0,Mot,Prononciation,H_aspiré,Type,Audios,Pré_valide,Warn_code,Warn_label
1523919,’tit,ti,False,adjectif,[],True,-,-
1523920,’upa’upa,u.pa.u.pa,False,nom,[],True,-,-
1523921,€,ø.ʁo,False,symbole,[],False,err_lower_case,€_as_first_letter
1523922,℁,o bɔ̃ swɛ̃ də,False,préposition,[],False,err_lower_case,℁_as_first_letter
1523923,∴,mɔʁ‿o.vaʃ,False,symbole_num=2,[],False,err_lower_case,∴_as_first_letter


In [3]:
# corriger les & mal-formatés s'il y a en
df['Mot'] = df['Mot'].str.replace('&amp;', '&')

In [4]:
# Affichage du nombre d'échantillons en entrées
df.shape[0]

1523924

In [5]:
df['Plausible'] = '?'

In [6]:
# Affichage de 5 derniers échantillons
df.sample(5)

Unnamed: 0,Mot,Prononciation,H_aspiré,Type,Audios,Pré_valide,Warn_code,Warn_label,Plausible
1373150,surmédicaliserez,syʁ.me.di.ka.li.zə.ʁe,False,verbe_flexion,[],True,-,-,?
280430,chemin au limbe,ʃə.mɛ̃ o lɛ̃b,False,nom,[],False,err_too_many_spaces,2_spaces,?
722546,illyrianiserez,i.li.ʁja.ni.zə.ʁe,False,verbe_flexion,[],True,-,-,?
1190231,resuperposerez,ʁə.sy.pɛʁ.po.zə.ʁe,False,verbe_flexion,[],True,-,-,?
1322320,sex-symbol,sɛksːɛ̃bɔl,False,nom,[],False,err_phonemes,ː,?


In [7]:
df[df.Mot=='résumpte']

Unnamed: 0,Mot,Prononciation,H_aspiré,Type,Audios,Pré_valide,Warn_code,Warn_label,Plausible
1286694,résumpte,ʁe.zɔ̃t,False,nom,[],True,-,-,?
1286695,résumpte,ʁe.zɔ̃pt,False,nom,[],True,-,-,?


## Table de correspondance entre sons et lettres



In [8]:
def read_table_de_correspondance():
    df = pd.read_csv("table_de_correspondances.csv")
    phonemes2graphemes = {}
    for phoneme in df.Phonème.values:
        for position in df[df.Phonème == phoneme].Position.values:
            for grapheme in df[df.Phonème == phoneme][df.Position == position].Graphème.values:
                if phoneme not in phonemes2graphemes.keys():
                    phonemes2graphemes[phoneme] = {}
                phonemes2graphemes[phoneme][grapheme] = {}
            
    return phonemes2graphemes

phonemes2graphemes = read_table_de_correspondance()

def add_keys(phonemes2graphemes):
  '''Ajoute des clés ('occurences', 'exemple_1', 'exemple_2', 'exemple_3')
   dans le dictionnaire de chaque grapheme de la table phonemes2graphemes.'''

  for phoneme in phonemes2graphemes.keys():
    #for position in phonemes2graphemes[phoneme].keys():
      for grapheme in phonemes2graphemes[phoneme].keys():
        phonemes2graphemes[phoneme][grapheme]['occurences'] = 0
        phonemes2graphemes[phoneme][grapheme]['exemple_1'] = ''
        phonemes2graphemes[phoneme][grapheme]['exemple_2'] = ''
        phonemes2graphemes[phoneme][grapheme]['exemple_3'] = ''

  return phonemes2graphemes

# run add_keys
phonemes2graphemes = add_keys(phonemes2graphemes)

  


## Fonctions de comptage pour statistiques

In [9]:
# fonction de comptage qui incrémente d'une unité le compteur de correspondance
# entre un phonème donné et un graphème donné et qui enregistre en exemple 
# les trois premiers mots dans lequels la correspondance a été trouvé 
def increment_occurences(phoneme, grapheme, mot, unit_test=False):
  
  global phonemes2graphemes

  # ne pas continuer si l'appel provient d'un test unitaire
  if unit_test:
    return

  phonemes2graphemes[phoneme][grapheme]['occurences']  += 1
  if phonemes2graphemes[phoneme][grapheme]['exemple_1']  == '':
    phonemes2graphemes[phoneme][grapheme]['exemple_1'] = mot
  elif phonemes2graphemes[phoneme][grapheme]['exemple_2']  == '':
    phonemes2graphemes[phoneme][grapheme]['exemple_2'] = mot
  elif phonemes2graphemes[phoneme][grapheme]['exemple_3']  == '':
    phonemes2graphemes[phoneme][grapheme]['exemple_3'] = mot

In [10]:
# fonction de statistiques qui calcule pour pour chaque position (initiale,
# intermédiaire ou finale) le nombre d'occurences de chaque graphème ainsi
# que la probabilité de ce graphème.
# Les résultats sont insérés dans un DataFrame pandas et dans un fichier CSV.

def make_stats():
  global phonemes2graphemes

  table = []
  for phoneme in phonemes2graphemes:
      for grapheme in phonemes2graphemes[phoneme].keys():
        occurences = phonemes2graphemes[phoneme][grapheme]['occurences']
        exemple_1 = phonemes2graphemes[phoneme][grapheme]['exemple_1']
        exemple_2 = phonemes2graphemes[phoneme][grapheme]['exemple_2']
        exemple_3 = phonemes2graphemes[phoneme][grapheme]['exemple_3']

        #print("phoneme=%s grapheme=%s occurences=%d" % (phoneme, grapheme, occurences))
        row = { 'phonème': phoneme, 'graphème':grapheme, 'occurences':occurences, 'pourcentage':0.0,
               'exemple_1' : exemple_1, 'exemple_2' : exemple_2, 'exemple_3' : exemple_3, }
        table.append(row)
  dfp = pd.DataFrame.from_dict(table)

  for phoneme in dfp['phonème'].unique():
      df1 = dfp[dfp['phonème'] == phoneme]

      # calculate the sum of used graphemes within each phoneme-position category
      sum = 0
      for grapheme in df1['graphème'].unique():
        row_number = df1.loc[df1.graphème==grapheme].index[0]
        occurences = df1.at[row_number, 'occurences']
        sum += occurences

      # set the pourcentage of each graphemes for each phoneme-position category
      if sum != 0:
        for grapheme in df1['graphème'].unique():
          row_number = df1.loc[df1.graphème==grapheme].index[0]
          occurences = df1.at[row_number, 'occurences']
          pourcentage = int(occurences/sum*100)/100
          dfp.at[row_number, 'pourcentage'] = pourcentage

      dfp.to_csv("phonemes2graphemes.csv", index=False)

  return dfp

_ = make_stats()

## Fonction de test de plausibilité

In [11]:
# Fonction récursive tentant de retrouver tout ou partie des graphemes
# depuis tout ou partie des phonemes, à l'aide de la table de correspondance
# entre les phonemes et les graphemes.

def process_graphemes(mot, phonemes, graphemes, 
                      phoneme_index=0, grapheme_index=0, 
                      phonemes_hist='', graphemes_hist='', 
                      unit_test = False, verbose=False,):
  
  if verbose:
    print('')
    print('process_graphemes(phonemes=\\%s\\, graphemes=[[%s]], \
    phoneme_index=%d, grapheme_index=%d, \
    phonemes_hist=%s, graphemes_hist=%s)' 
            % (phonemes, graphemes, phoneme_index, grapheme_index, phonemes_hist, graphemes_hist))

  global phonemes2graphemes

  if grapheme_index >= len(graphemes):
    if ''.join(graphemes_hist) == graphemes and ''.join(phonemes_hist)==phonemes:
        if verbose:
            print('ended ok')            
        return True, phonemes_hist.copy(), graphemes_hist.copy()
    else:
        if verbose:
            print('ended ko')
        return False, phonemes_hist, graphemes_hist

  current_candidate_phonemes = []
  for phoneme in phonemes2graphemes:    
    if phonemes[phoneme_index:].startswith(phoneme):      
      current_candidate_phonemes.append(phoneme)
  if len(current_candidate_phonemes) == 0:
    return False, phonemes_hist, graphemes_hist
  if verbose:
    if len(current_candidate_phonemes) > 1:
      print("### candidate current phonemes=%", current_candidate_phonemes)

  nb_current_candidate_phonemes = 0 

  for current_phoneme in current_candidate_phonemes:
    if verbose:
      print("trying candidate current phoneme=%s" % current_phoneme)
    nb_current_candidate_phonemes += 1
    next_phoneme = ''

    last_graphemes = graphemes[grapheme_index:]
    
    if verbose:
      print('process_graphemes(phonemes=\\%s\\, graphemes=[[%s]], phoneme_index=%d, grapheme_index=%d) current_phoneme:%s' 
            % (phonemes, graphemes, phoneme_index, grapheme_index, current_phoneme))
  
    if verbose:
      print('current_phoneme:%s' % current_phoneme)
    
    try:
      len(phonemes2graphemes[current_phoneme])
    except:
      a=1
      if current_phoneme == '':
        if verbose:
          print('wrn: in [[%s]] : phoneme \\%s\\ does not exist !!!' % (graphemes, current_phoneme))
        return True, phonemes_hist, graphemes_hist
      else:
        print('err: in [[%s]] : phoneme \\%s\\ does not exist !!!' % (graphemes, current_phoneme))
        return False, phonemes_hist, graphemes_hist

    matching_graphemes_list = []

    for current_graphemes in phonemes2graphemes[current_phoneme]:
    
      if last_graphemes.startswith(current_graphemes):
        if verbose:
          print('phoneme \\%s\\ matching graphemes [[%s]] (last_graphemes:[[%s]])' % (current_phoneme, current_graphemes, last_graphemes))
        matching_graphemes_list.append(current_graphemes)

    if len(matching_graphemes_list) == 0:
      if verbose:
        print('KO0: phonemes=\\%s\\ does not match any grapheme !!!' % (phonemes))

      if nb_current_candidate_phonemes == len(current_candidate_phonemes):
        return False, phonemes_hist, graphemes_hist
      else:
        if verbose:
          print('move0 to next candidate phoneme')
        continue
      
    #print('current_phoneme=', current_phoneme)
    
    ### * ###
    if len(matching_graphemes_list) == 0:
        if verbose:
          print('KO: phoneme \\%s\\ matches NO grapheme (in word [[%s]])  !!!' % (graphemes, phonemes))
        phonemes_hist.pop()
        graphemes_hist.pop()
        return False, phonemes_hist, graphemes_hist

    for current_graphemes in matching_graphemes_list:
        
        phonemes_hist.append(current_phoneme)
        graphemes_hist.append(current_graphemes)

        if verbose:
          print("trying2 current_phoneme \\%s\\ and current_graphemes [[%s]]" % (current_phoneme, current_graphemes))

        ret, phonemes_hist2, graphemes_hist2 = process_graphemes(
            mot,
            phonemes, 
            graphemes,
            phoneme_index=phoneme_index+len(current_phoneme), 
            grapheme_index=grapheme_index+len(current_graphemes),
            phonemes_hist = phonemes_hist,
            graphemes_hist = graphemes_hist,
            unit_test = unit_test,
            verbose=verbose)           
                
        # si la concatenation des graphemes trouvées jusqu'ici n'est pas bonne, alors pas bon
        if not graphemes.startswith(''.join(graphemes_hist)):
            if verbose:
                print(''.join(graphemes_hist)+current_graphemes)
                print(graphemes)
                print('current_graphemes does not match previous findings')
            ret = False          
                
          # si la concatenation des graphemes trouvées jusqu'ici n'est pas bonne, alors pas bon

      
        if ret == True:
          if verbose:
            print('phonemes_hist2b:', phonemes_hist2)
            print('graphemes_hist2b:', graphemes_hist2)
            print("current_phoneme \\%s\\ and current_graphemes [[%s]] : worked ok" % (current_phoneme, current_graphemes))
          
          increment_occurences(current_phoneme, current_graphemes, mot, unit_test)
          return True, phonemes_hist2, graphemes_hist2
    
        phonemes_hist.pop()
        graphemes_hist.pop()
      
    # if here LOST
    if verbose:
          print('LOST2 for current_phoneme \\%s\\ in process_graphemes(phonemes=\\%s\\, graphemes=[[%s]], phoneme_index=%d, grapheme_index=%d)' 
                % (current_phoneme, phonemes, graphemes, phoneme_index, grapheme_index))
    
    if nb_current_candidate_phonemes == len(current_candidate_phonemes):
        return False, phonemes_hist2, graphemes_hist2
    else:
       
        if verbose:
          print('move2 to next candidate phoneme')
        continue

      

### Tests

### Exemples

In [12]:
exemples = [
    ("momie", "mɔmi"),
    ("pookie", "puki"),

]

for exemple in exemples:
    
    mot = exemple[0]
    graphemes = exemple[0]
    phonemes = exemple[1]
    
    res = process_graphemes(mot, phonemes, graphemes, 
                            phoneme_index=0, grapheme_index=0, 
                            phonemes_hist=[], graphemes_hist=[],
                            unit_test=True, verbose=False)

In [13]:
def check_word_pronunciation(mot, phonemes, graphemes, unit_test=False, verbose0=False, verbose1=False, verbose2=False):

    res, phonemes2, graphemes2  = process_graphemes(mot, phonemes, graphemes, phoneme_index=0, grapheme_index=0, phonemes_hist=[], graphemes_hist=[], unit_test=unit_test, verbose=verbose2)
    prononciation = ''.join(phonemes2)
    word = ''.join(graphemes2)
    prononciation_ = ','.join(phonemes2)
    word_ = ','.join(graphemes2)
    if prononciation != phonemes:
      if verbose0:
        print('[[%s]] \\%s\\ -> prononciation=\\%s\\ ' % (mot, phonemes, prononciation ))
      return False
    elif word != graphemes.replace('-','').replace(' ',''):
      if verbose0:
        print('[[%s]] != output graphemes=%s' % (mot, word))
      return True
    else:
      if verbose0:
        print('%s \\%s\\ -> %s \\%s\\' % (mot, prononciation, word_, prononciation_))
      return True

### Tests de référence

In [14]:
essais = [
    ("acéphalobrache", "asefalɔbʁak"),
    ("acquaintance","akwɛ̃tɑ̃s"),
    ("agoua","aɡwa"),
    ("bêchâmes","beʃam"),
    ("caouanne","kawan"),
    ("designer","dizajnœʁ"),
    ("désembouteillions","dezɑ̃butɛjɔ̃"),
    ("disjonctés","diʒɔ̃kte"),
    ("droitismes", "dʁwatism"),
    ("élégante", "eleɡɑ̃t"),
    ("enarrhas", "ɑ̃naʁa"),
    ("enfutailler", "ɑ̃fytɑje"),
    ("excaverai", "ɛkskavəʁe"),
    ("exemple", "ɛɡzɑ̃pl"),
    ("fashion","faʃœn"),
    ("khartoumisé","kaʁtumize"),
    ("hauts", "o"),
    ('hauts-fonds','o.fɔ̃'),
    ("html", "aʃteɛmɛl"), #acronyme
    ("intérêt", "ɛ̃teʁɛ"),
    ("intéressant","ɛ̃teʁɛsɑ̃"),
    ("jaïnas","dʒaina"),
    ("luxe","lyks"),
    ("momie","mɔmi"),
    ("oiseaux", "wazo"),
    ("ondoyés", "ɔ̃dwaje"),
    ("peopliser","piplize"),
    ("pookie", "puki"),
    ("rechaterais", "ʁətʃatəʁɛ"),
    ('solex', 'solɛks'),
    ('Paris', 'pa.ʁi'),
    ('VPN', 've.pe.ɛn'),
    
]

for essai in essais:
  #check_word_pronunciation(essai[0], essai[1].replace('.',''), essai[0].lower().replace('-',''), unit_test=True, verbose0=True, verbose1=False, verbose2=False)
  check_word_pronunciation(essai[0], essai[1].replace('.',''), essai[0].lower().replace('-',''), unit_test=True, verbose0=True, verbose1=False, verbose2=False)

acéphalobrache \asefalɔbʁak\ -> a,c,é,ph,a,l,o,b,r,a,che \a,s,e,f,a,l,ɔ,b,ʁ,a,k\
acquaintance \akwɛ̃tɑ̃s\ -> a,cq,u,ain,t,an,ce \a,k,w,ɛ̃,t,ɑ̃,s\
agoua \aɡwa\ -> a,g,ou,a \a,ɡ,w,a\
bêchâmes \beʃam\ -> b,ê,ch,â,mes \b,e,ʃ,a,m\
caouanne \kawan\ -> c,a,ou,a,nne \k,a,w,a,n\
designer \dizajnœʁ\ -> d,e,s,i,g,n,e,r \d,i,z,a,j,n,œ,ʁ\
désembouteillions \dezɑ̃butɛjɔ̃\ -> d,é,s,em,b,ou,t,e,illi,ons \d,e,z,ɑ̃,b,u,t,ɛ,j,ɔ̃\
disjonctés \diʒɔ̃kte\ -> d,is,j,on,c,t,és \d,i,ʒ,ɔ̃,k,t,e\
droitismes \dʁwatism\ -> d,r,o,i,t,i,s,mes \d,ʁ,w,a,t,i,s,m\
élégante \eleɡɑ̃t\ -> é,l,é,g,an,te \e,l,e,ɡ,ɑ̃,t\
enarrhas \ɑ̃naʁa\ -> e,n,a,rrh,as \ɑ̃,n,a,ʁ,a\
enfutailler \ɑ̃fytɑje\ -> en,f,u,t,a,ill,er \ɑ̃,f,y,t,ɑ,j,e\
excaverai \ɛkskavəʁe\ -> e,x,c,a,v,e,r,ai \ɛ,ks,k,a,v,ə,ʁ,e\
exemple \ɛɡzɑ̃pl\ -> e,x,em,p,le \ɛ,ɡz,ɑ̃,p,l\
fashion \faʃœn\ -> f,a,sh,io,n \f,a,ʃ,œ,n\
khartoumisé \kaʁtumize\ -> k,ha,r,t,ou,m,i,s,é \k,a,ʁ,t,u,m,i,z,e\
hauts \o\ -> hauts \o\
hauts-fonds \ofɔ̃\ -> hauts,f,onds \o,f,ɔ̃\
html \aʃteɛmɛl\ -> h,

### Test en masse

In [15]:
df.shape

(1523924, 9)

In [16]:
# trier le dataframe (noms communs, acronymes, noms propres, puis restants)
# pour des exemples plus significatifs dans les statistiques
df_composés = df[df.Mot.str.match('..*?[\-\ ].*?.')].copy()
print("nb mots composés:%d" % df_composés.shape[0])

df_autres_0 = pd.concat([df, df_composés, df_composés]).drop_duplicates(keep=False)
df_préfixes = df_autres_0[df_autres_0.Mot.str.startswith('-')].copy()
print("nb mots préfixes:%d" % df_préfixes.shape[0])

df_autres_1 = pd.concat([df_autres_0, df_préfixes, df_préfixes]).drop_duplicates(keep=False)
df_nc = df_autres_1[df_autres_1.Mot.str.islower()].copy()
print("nb mots noms communs:%d" % df_nc.shape[0])

df_autres_2 = pd.concat([df_autres_1, df_nc, df_nc]).drop_duplicates(keep=False)
df_np = df_autres_2[df_autres_2.Mot.str.istitle()].copy()
print("nb mots noms propres:%d" % df_np.shape[0])

df_autres_3 = pd.concat([df_autres_2, df_np, df_np]).drop_duplicates(keep=False)
df_sigles = df_autres_3[df_autres_3.Mot.str.isupper()].copy()
print("nb mots sigles:%d" % df_sigles.shape[0])

df_autres_4 = pd.concat([df_autres_3, df_sigles, df_sigles]).drop_duplicates(keep=False)
print("nb mots autres:%d" % df_autres_4.shape[0])

# mettre dans un ordre augmentant les probabilités de succès des noms composés
df_new = pd.concat([df_nc, df_sigles, df_np, df_composés, df_préfixes, df_sigles, df_autres_4], ignore_index=True).copy()
print("nb mots TOTAL:%d" % df_new.shape[0])

nb mots composés:162224
nb mots préfixes:426
nb mots noms communs:1328684
nb mots noms propres:29461
nb mots sigles:2018
nb mots autres:1111
nb mots TOTAL:1525942


In [17]:
df_autres_4.tail(5)

Unnamed: 0,Mot,Prononciation,H_aspiré,Type,Audios,Pré_valide,Warn_code,Warn_label,Plausible
1481757,éSwatini,e.swa.ti.ni,False,nom propre,[],True,-,-,?
1512941,éqCO₂,e.ki.va.lɑ̃ se.o.dø,False,nom,[],False,err_letters,₂,?
1523921,€,ø.ʁo,False,symbole,[],False,err_lower_case,€_as_first_letter,?
1523922,℁,o bɔ̃ swɛ̃ də,False,préposition,[],False,err_lower_case,℁_as_first_letter,?
1523923,∴,mɔʁ‿o.vaʃ,False,symbole_num=2,[],False,err_lower_case,∴_as_first_letter,?


In [18]:
df2 = df_new
#df2=df2[df2.Mot.str.istitle()] # ne conserver que les mots commençant par une majuscule puis minuscules
#df2=df2[df2.Mot.str.islower()] # ne conserver que les mots entièrement en minuscules (i.e. noms communs)
#df2=df2[df2.Mot.str.isupper()] # ne conserver que les mots entièrement en majuscules (i.e. acronymes)#
#df2=df2[df2.Mot.str.startswith('x')] # ne conserver que les mots noms communs commençant par ...
#df2 = pd.concat([df2, df_composés, df_composés]).drop_duplicates(keep=False) # supprimer les mots composés
#df2 = df2[~df2.Mot.str.contains(' ')] # exclure les mots contenant un espace
#df2 = df2[~df2.Mot.str.contains('-')] # exclure les mots contenant un tiret

df2 = df2[~df2.Mot.str.contains('\.')] # exclure les mots contenant un point
df2 = df2[~df2.Mot.str.contains('/')] # exclure les mots contenant un slash (e.g. copier/coller)

df2.shape

(1525282, 9)

In [19]:
nb_samples=0
nb_infos_steps=10
nb_samples_steps = int(df2.shape[0]/nb_infos_steps)
nb_samples_ok=0
nb_samples_ko=0
nb_very_bad=0
nb_batch=0
verbose=False
indexes_to_forget = []
nb_composés = 0
nb_composés_directs_found = 0
nb_composés_indirects_found = 0

t0 = datetime.datetime.now()
for index, row in df2.iterrows():

    nb_samples += 1
    is_ok = False
    is_skipped = False
        
    if nb_samples >= nb_infos_steps and nb_samples % nb_samples_steps == 0:
      nb_batch += 1
      t1 = datetime.datetime.now()
      durée = t1 - t0
      print('batch:%d, nb_samples:%d, durée:%s' % (nb_batch, nb_samples, durée))
      t0 = t1

    mot = row['Mot']
    prononciation = row['Prononciation']
 
    is_composé = False
    is_composé_direct = False
    if ' ' in row['Mot'] or ',' in row['Mot'] or '-' in row['Mot']:
      nb_composés += 1     
      is_composé = True
        
    try:
      graphemes = row['Mot'].lower().replace(' ', '').replace('-','').replace(',','')
      phonemes = prononciation.replace('(','').replace(')','').replace('‿','').replace('.','').replace(' ','')
      is_ok = check_word_pronunciation(mot, phonemes, graphemes, False, False, False, False)
          
      if is_ok:
        if ' ' in row['Mot'] or ',' in row['Mot'] or '-' in row['Mot']:
          is_composé_direct = True
          nb_composés_directs_found += 1      
    
    except:
      is_ok = False
      is_skipped = True 
      nb_very_bad+=1
      # mauvaise lettre ou phoneme, pas la peine d'aller plus loi
      # passer au mot suivant

    # si la correspondance n'a pas été trouvé, tester si le mot est composé,
    # et si oui, tester la correspondance de chaque partie (aka sous-mot)
    if not is_ok and not is_skipped:    
      graphemes = row['Mot'].lower().replace(',','')
      phonemes = prononciation    
      separateurs = [' ', '-']
      for separateur in separateurs:
        if separateur in graphemes[1:-1] :
          mots_ = mot.split(separateur)
          # si le mot n'est pas un mot composé, l'oubler, et passer à la suite
          if len(mots_) <= 1:
            continue
          for mot_ in mots_:
            try:
              # récuperer le sous-mot identifiés ainsi que ses différentes prononciations possibles
              phonemess_ = df[df.Mot==mot_]['Prononciation'].values
              for phonemes_ in phonemess_:
                graphemes_ = mot_
                phonemes_ = phonemes_.replace('(','').replace(')','').replace('‿','').replace('.','').replace(' ','')
                is_ok_ = check_word_pronunciation(mot_, phonemes_, graphemes_, False, False, False, False)
                if is_ok_:
                    break                    
              if is_ok_ == False:
                is_ok = False                
              else:
                is_ok = True
                break
            except:
              is_ok = False
            
    if is_ok and is_composé:
        if not is_composé_direct:
            nb_composés_indirects_found += 1
    #if not is_ok and is_composé:
    #    print('word=%s \\\\%s\\\\'% (mot, prononciation))
        
    if is_ok:
      nb_samples_ok += 1
      df2.at[index, 'Plausible']='oui'
      indexes_to_forget.append(index)
    else:
      nb_samples_ko += 1
      if verbose:
        if row['Type'] != 'verbe_flexion':
          print('word=%s \\\\%s\\\\'% (mot, prononciation))
      df2.at[index, 'Plausible']='non'
        

t1 = datetime.datetime.now()
durée = t1 - t0
print('nb_samples:', nb_samples, ' durée de processing du dernier batch:', durée)
      
df2 = df2.drop(indexes_to_forget)

if nb_samples > 0:
  print('samples=%d, samples_ok=%d, samples_ko=%d, very_bad=%d, ok%%=%.2f' % \
        (nb_samples, nb_samples_ok, nb_samples_ko, nb_very_bad, \
         nb_samples_ok/nb_samples*100))

batch:1, nb_samples:152528, durée:0:00:48.573682
batch:2, nb_samples:305056, durée:0:00:46.727685
batch:3, nb_samples:457584, durée:0:00:50.289613
batch:4, nb_samples:610112, durée:0:00:47.719374
batch:5, nb_samples:762640, durée:0:00:47.836048
batch:6, nb_samples:915168, durée:0:00:49.594396
batch:7, nb_samples:1067696, durée:0:00:49.542236
batch:8, nb_samples:1220224, durée:0:00:49.583613
batch:9, nb_samples:1372752, durée:0:01:55.214474
batch:10, nb_samples:1525280, durée:0:04:33.512505
nb_samples: 1525282  durée de processing du dernier batch: 0:00:00.000582
samples=1525282, samples_ok=1518325, samples_ko=6957, very_bad=0, ok%=99.54


In [20]:
#nb_samples: 1292602  durée de processing du dernier batch: 0:00:00.000838
#samples=1292602, samples_ok=1279990, samples_ko=12612, very_bad=0 %ok=99.02

#samples=1485362, samples_ok=1471019, samples_ko=14343, very_bad=0, ok%=99.03
#nb_composés:160151
#nb_composés_directs_found:154583
#nb_composés_indirects_found:5216
#nb_composés_not_found:352

#samples=1485362, samples_ok=1468451, samples_ko=16911, very_bad=0, ok%=98.86
#nb_composés:160151
#nb_composés_directs_found:0
#nb_composés_indirects_found:157231
#nb_composés_not_found:2920

In [21]:
# Nombre de mots dont la prononciation est détectée comme non plausible
df2.shape[0]

6957

In [22]:
# 5 derniers mot dont la prononciation est détectée comme non plausible
df2[df2.Plausible=='non'].tail(20)

Unnamed: 0,Mot,Prononciation,H_aspiré,Type,Audios,Pré_valide,Warn_code,Warn_label,Plausible
1525757,TADs,te.a.de,False,nom_flexion,[],False,err_lower_case,T_as_first_letter,non
1525801,TikTokeuse,tik.to.kœʁ,False,nom,[],False,err_lower_case,T_as_first_letter,non
1525847,URNs,y.ɛʁ.ɛn,False,nom_flexion,[],False,err_lower_case,U_as_first_letter,non
1525848,UpM,y.pe.em,False,nom,[],False,err_lower_case,U_as_first_letter,non
1525859,Vosg’patt,voʒ.pat,False,nom,"[""LL-Q150 (fra)-Penegal-Vosg'patt.wav""]",False,err_lower_case,V_as_first_letter,non
1525860,WEIs,ˈwaj ou ˈwɛj,False,nom_flexion,[],False,err_lower_case,W_as_first_letter,non
1525861,Wuchiaping’ien,wu.tʃja.piŋ.jɛ̃,False,nom,[],False,err_lower_case,W_as_first_letter,non
1525863,XPs,iks.pe,False,nom_flexion,[],False,err_lower_case,X_as_first_letter,non
1525865,Xi’an,ʃi.an,False,nom propre,[],False,err_lower_case,X_as_first_letter,non
1525866,Xi’an,sjan,False,nom propre,[],False,err_lower_case,X_as_first_letter,non


### Sauvegarde des résultats KO dans fichiers .CSV

In [23]:
# stockage des mots dot les prononciations sont non plausibles dans un fichier CSV
df3 = df2.drop(columns=['Audios','H_aspiré','Pré_valide','Plausible'])
df3.rename(columns = {'Err_Code':'Warn_Code', 'Err_Label':'Warn_Label'}, inplace = True)
df3.to_csv("correspondances_non_trouvées.csv", index=False, quotechar = '"')

In [24]:
# stockage des mots dont les prononciations sont non plausibles 
df4 = "'" + df3.Mot + "',"
df4.to_csv("mots_non_trouvés.csv", index=False, quotechar = '"')

### Test unitaire

In [25]:
# Mot et prononciation à tester unitairement
graphemes_phonemes='SaturneXI,sa.tyʁnɔ̃z'
graphemes_phonemes='SMSses,ɛs.ɛm.ɛs'
graphemes_phonemes='NNE,nɔʁ nɔ.ʁ‿ɛst'

# recharger la table de correspondance si nécessaire
reload_table = False
if reload_table:
    phonemes2graphemes = read_table_de_correspondance()
    phonemes2graphemes = add_keys(phonemes2graphemes)

strings = graphemes_phonemes.split(',')
mot = strings[0]
graphemes = mot.lower()
phonemes = strings[1].replace('(','').replace(')','').replace('‿','').replace('.','').replace(' ','')

# nettoyage des strings
#graphemes = graphemes.replace(' ','')
#graphemes = graphemes.replace('-','')
#graphemes = graphemes.replace("’",'')
phonemes = phonemes.replace(' ','')
phonemes = phonemes.replace('\\','')
phonemes = phonemes.replace('.','')
phonemes = phonemes.replace('‿','')
phonemes = phonemes.replace('(','')
phonemes = phonemes.replace(')','')
print(mot)
print(phonemes)
# test unitaire
check_word_pronunciation(mot, phonemes, graphemes, unit_test=True, verbose0=True, verbose1=True, verbose2=True)
#check_word_pronunciation(mot, phonemes, graphemes, unit_test=True, verbose0=False, verbose1=False, verbose2=False)

NNE
nɔʁnɔʁɛst

process_graphemes(phonemes=\nɔʁnɔʁɛst\, graphemes=[[nne]],     phoneme_index=0, grapheme_index=0,     phonemes_hist=[], graphemes_hist=[])
### candidate current phonemes=% ['n', 'nɔʁ']
trying candidate current phoneme=n
process_graphemes(phonemes=\nɔʁnɔʁɛst\, graphemes=[[nne]], phoneme_index=0, grapheme_index=0) current_phoneme:n
current_phoneme:n
phoneme \n\ matching graphemes [[n]] (last_graphemes:[[nne]])
phoneme \n\ matching graphemes [[nn]] (last_graphemes:[[nne]])
phoneme \n\ matching graphemes [[nne]] (last_graphemes:[[nne]])
trying2 current_phoneme \n\ and current_graphemes [[n]]

process_graphemes(phonemes=\nɔʁnɔʁɛst\, graphemes=[[nne]],     phoneme_index=1, grapheme_index=1,     phonemes_hist=['n'], graphemes_hist=['n'])
trying candidate current phoneme=ɔ
process_graphemes(phonemes=\nɔʁnɔʁɛst\, graphemes=[[nne]], phoneme_index=1, grapheme_index=1) current_phoneme:ɔ
current_phoneme:ɔ
KO0: phonemes=\nɔʁnɔʁɛst\ does not match any grapheme !!!
trying2 current_phone

True

## Statistiques

In [26]:
# Réaliser les comptages tottax statistiques sur les correspondances 
# précédemment comptées
dfp = make_stats()

In [27]:
# Afficher les statistiques d'un phonème en particulier (ex: \s\)
phoneme = 's'
#hex(ord('s'))
dfp[dfp.phonème == phoneme]

Unnamed: 0,phonème,graphème,occurences,pourcentage,exemple_1,exemple_2,exemple_3
249,s,c’,179,0.0,c’,c’est,c’que
250,s,s’,2974,0.0,s’abader,s’abaisser,s’abaisser
251,s,ç’,1,0.0,ç’,,
252,s,c,83651,0.12,abaciste,abacistes,abandogiciel
253,s,ç,10867,0.01,accoinçon,accoinçons,agaça
254,s,s,291936,0.44,aaronisme,aaronismes,aasax
255,s,sc,5355,0.0,abscise,abscises,abscision
256,s,th,135,0.0,chrestomathie,chrestomathies,classpath
257,s,x,504,0.0,antisoixantehuitard,antisoixantehuitards,auxerrois
258,s,cc,52,0.0,accensa,accensai,accensaient


In [28]:
dfp[dfp.phonème == phoneme]

Unnamed: 0,phonème,graphème,occurences,pourcentage,exemple_1,exemple_2,exemple_3
249,s,c’,179,0.0,c’,c’est,c’que
250,s,s’,2974,0.0,s’abader,s’abaisser,s’abaisser
251,s,ç’,1,0.0,ç’,,
252,s,c,83651,0.12,abaciste,abacistes,abandogiciel
253,s,ç,10867,0.01,accoinçon,accoinçons,agaça
254,s,s,291936,0.44,aaronisme,aaronismes,aasax
255,s,sc,5355,0.0,abscise,abscises,abscision
256,s,th,135,0.0,chrestomathie,chrestomathies,classpath
257,s,x,504,0.0,antisoixantehuitard,antisoixantehuitards,auxerrois
258,s,cc,52,0.0,accensa,accensai,accensaient


## Bilan des correspondances non trouvées

* Normalement, aucune ligne ne devrait être affichée (si noms communs, noms propres et sigles analysés)

In [29]:
pd.set_option('display.max_rows', 100)
dfp[dfp.occurences == 0]

Unnamed: 0,phonème,graphème,occurences,pourcentage,exemple_1,exemple_2,exemple_3
380,ɥ,’hu,0,0.0,,,
532,ɛ,eighs,0,0.0,,,
584,œ̃,Hun,0,0.0,,,
585,œ̃,Huns,0,0.0,,,
605,œ,ogl,0,0.0,,,
619,ø,hœ,0,0.0,,,
627,ø,euent,0,0.0,,,
747,o,hoa,0,0.0,,,
846,ɑ̃,aën,0,0.0,,,
916,a,acts,0,0.0,,,
