## Proof of concept: diachronic evolution of the stress pattern
### Buchanan vs Walker
This notebook is a first comparison of the Buchanan and Walker dictionaries.
The goal is to test whether we can identify the words whose stress position changed between these two dictionaries.
To do so, the headwords and the parts of speech (POS) must be made comparable, i.e. "regularised".

As a first step, we focus on verbs and nouns: this avoids mapping every POS, and these two categories account for a very large majority of the entries.

Steps in this notebook:
- Regularise the headwords and POS of both dictionaries
- Create/regularise the stress pattern in 0/1 notation (e.g. 010, written `0-1-0` once regularised)
- Save a results spreadsheet containing only the entries whose stress pattern differs
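
The steps listed above can be sketched end to end on a couple of hand-typed rows. This is only an illustration: the frame names `left` and `right` and the suffixes are placeholders, and the rows are typed in by hand rather than read from the dictionary files.

```python
import pandas as pd

# Hand-typed rows for illustration only (not loaded from the CSV files).
left = pd.DataFrame({
    "hwdR": ["abandon", "ablution"],
    "posR": ["verb", "noun"],
    "schAccentR": ["0-1-0", "0-1-0"],
})
right = pd.DataFrame({
    "hwdR": ["abandon", "ablution"],
    "posR": ["verb", "noun"],
    "schAccentR": ["0-1-0", "1-0-0"],
})

# Pair up the entries that share both headword and part of speech.
both = pd.merge(left, right, on=["hwdR", "posR"], suffixes=("_left", "_right"))

# Keep only the pairs whose stress patterns differ.
changed = both[both["schAccentR_left"] != both["schAccentR_right"]]
print(changed["hwdR"].tolist())
```

Only the row whose two patterns disagree survives the filter; everything else in the notebook is preparation for this comparison.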

In [148]:
%matplotlib inline

import pandas as pd


# Buchanan - Regularisation

In [134]:
# load the Buchanan dictionary
pathCsvFile = "./datas/Buchanan2005.csv"
Buchanan = pd.read_csv(pathCsvFile, sep=";", encoding="utf-8")

# display the first 3 entries
Buchanan.head(3)

Unnamed: 0,headword,pronunciation,POS,accent,schAccent,schGraph,nbSyll,page
0,['abacus'],['a(ba(ku(ss'],['/n/lat/'],['a@bacus'],['100'],['V1CVCVC'],['3'],['abacus-ablation']
1,['abaft'],['a(baft'],['/adv/'],['aba@ft'],['01'],['VCV1CC'],['2'],['abacus-ablation']
2,['abandon'],['a(ba(ndu(n'],['/v/'],['aba@ndon'],['010'],['VCV1CCVC'],['3'],['abacus-ablation']


In [135]:
# regularisation functions for this dictionary
def regBuchananHwd(hwd):
    hwdR = hwd.lower()
    hwdR = hwdR.replace("[","")
    hwdR = hwdR.replace("]","")
    hwdR = hwdR.replace("'","")
    return hwdR

def regBuchananSchAccent(schAccent):
    schAccentR = schAccent.lower()
    schAccentR = schAccentR.replace("[","")
    schAccentR = schAccentR.replace("]","")
    schAccentR = schAccentR.replace("'","")
    schAccentR = schAccentR.strip()
    # add a hyphen between the digits
    schAccentR = "-".join(schAccentR)
    return schAccentR

def regBuchananPOS(pos):
    posR = pos.lower()
    # chain on posR (not pos) so that the lowercasing above is kept
    posR = posR.replace("[","")
    posR = posR.replace("]","")
    posR = posR.replace("'","")
    posR = posR.replace("/","")
    posR = posR.strip()
    
    if posR=="v":
        return "verb"
    elif posR=="n":
        return "noun"
    else:   
        return "notReg"+posR
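
The cells of the CSV look like stringified Python lists (`['abacus']`, `[]`), so parsing them is a possible alternative to chaining `replace` calls. The sketch below is an assumption about the data format, not the notebook's method, and `unwrap_cell` is a hypothetical helper name. Unlike the `replace` approach, it would leave an apostrophe inside a headword untouched.

```python
import ast

def unwrap_cell(cell):
    """Return the first item of a stringified list such as "['abacus']".

    Hypothetical helper: empty lists and cells that cannot be parsed
    give an empty string.
    """
    try:
        items = ast.literal_eval(str(cell))
    except (ValueError, SyntaxError):
        return ""
    if isinstance(items, list) and items:
        return str(items[0]).strip()
    return ""

print(unwrap_cell("['abacus']"))
print(unwrap_cell("['/n/lat/']"))
print(repr(unwrap_cell("[]")))
```

The result would still need lowercasing (headwords) or hyphenating (stress patterns) afterwards, as the functions above do.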


In [137]:
# create the regularised columns hwdR, schAccentR, posR
Buchanan["hwdR"] = Buchanan["headword"].map(regBuchananHwd)
Buchanan["schAccentR"] = Buchanan["schAccent"].map(regBuchananSchAccent)
Buchanan["posR"] = Buchanan["POS"].map(regBuchananPOS)

Buchanan.head(5)

Unnamed: 0,headword,pronunciation,POS,accent,schAccent,schGraph,nbSyll,page,hwdR,schAccentR,posR
0,['abacus'],['a(ba(ku(ss'],['/n/lat/'],['a@bacus'],['100'],['V1CVCVC'],['3'],['abacus-ablation'],abacus,1-0-0,notRegnlat
1,['abaft'],['a(baft'],['/adv/'],['aba@ft'],['01'],['VCV1CC'],['2'],['abacus-ablation'],abaft,0-1,notRegadv
2,['abandon'],['a(ba(ndu(n'],['/v/'],['aba@ndon'],['010'],['VCV1CCVC'],['3'],['abacus-ablation'],abandon,0-1-0,verb
3,['abandoned'],['a(ba(ndu(ni(d'],['/adj/'],['aba@ndoned'],['0100'],['VCV1CCVCVC'],['4'],['abacus-ablation'],abandoned,0-1-0-0,notRegadj
4,['abase'],['a(baiss'],['/v/'],['aba@se'],['01'],['VCV1CV'],['2'],['abacus-ablation'],abase,0-1,verb
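
In the output above, `abacus` is tagged `/n/lat/` and therefore ends up as `notRegnlat`, so it will be dropped by a verb/noun filter even though its first tag is `n`. A possible variant is to look only at the first slash-separated tag. This is a sketch under the assumption that the first tag is always the part of speech and that later tags such as `lat` are extra labels, which would need checking against the data; `POS_MAP` and `reg_pos_first_tag` are hypothetical names.

```python
# Hypothetical mapping; extend it once the remaining tags have been reviewed.
POS_MAP = {"v": "verb", "n": "noun"}

def reg_pos_first_tag(pos):
    """Map a Buchanan POS cell such as "['/n/lat/']" to "verb" or "noun".

    Assumes the first slash-separated tag is the part of speech.
    """
    cleaned = pos.lower().strip("[]'\" ")
    tags = [t for t in cleaned.split("/") if t]
    if tags and tags[0] in POS_MAP:
        return POS_MAP[tags[0]]
    # Same fallback convention as regBuchananPOS: prefix with "notReg".
    return "notReg" + "".join(tags)

print(reg_pos_first_tag("['/n/lat/']"))
print(reg_pos_first_tag("['/v/']"))
print(reg_pos_first_tag("['/adv/']"))
```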


In [5]:
# for now, keep only the columns hwdR, schAccentR, posR (restricted to the posR values verb or noun)
Buchanan = Buchanan[Buchanan['posR'].isin(["verb","noun"])]

BuchananR = Buchanan[["hwdR","posR","schAccentR"]].copy()
BuchananR.head(15)
#print(len(BuchananR))

Unnamed: 0,hwdR,posR,schAccentR
2,abandon,verb,0-1-0
4,abase,verb,0-1
5,abasement,noun,0-1-0-0
6,abash,verb,0-1
7,abate,verb,0-1
8,abatement,noun,0-1-0-0
9,abbess,noun,1-0
10,abbot,noun,1-0
11,abbreviate,verb,0-1-0-0
12,abbreviation,noun,0-0-0-1-0


# Walker - Regularisation

In [138]:
# load the Walker dictionary
pathCsvFile = "./datas/Walker_ExtractFromXml.csv"
Walker = pd.read_csv(pathCsvFile, sep=";", encoding="utf-8", dtype ={'headword':str}, low_memory=False)

Walker.head(3)

Unnamed: 0,headword,pronunciation,POS,definition,note authorial,note editorial,cross reference,cross reference in definition,cross reference in note,idSuperEntry,headwordWarning,pronWarning,posWarning,defWarning,xrWarning,xrDefWarning,xrNoteWarning
0,['A'],[],"[' THE first letter of the alphabet,']","[' A, An article set before nouns of the singu...",['☞ The change of the letter a into an before ...,[],['73.'],[],['An'],,[],['no content'],['missing dot'],[],[],[],[]
1,['ABACUS'],['a4bʹ-a4-ku2s'],['s.'],['[Lat.] A counting table\xa0; the uppermost m...,[],[],[],[],[],,[],[],[],[],[],[],[]
2,['ABAFT'],['a4-ba4ftʹ'],['adv.'],"[' From the fore part of the ship, towards the...",[],[],['545.'],[],[],,[],[],[],[],[],[],[]


In [139]:
# regularisation functions for this dictionary

def regWalkerHwd(hwd):
    
    hwdR = str(hwd).replace("To ","")
    hwdR = hwdR.replace("[","")
    hwdR = hwdR.replace("]","")
    hwdR = hwdR.replace("'","")
    hwdR = hwdR.lower()

    return hwdR

def regWalkerSchAccent(pronunciation):
    
    schAccentR = str(pronunciation).replace("[","")
    schAccentR = schAccentR.replace("]","")
    schAccentR = schAccentR.replace("'","")
    schAccentR = schAccentR.strip()

    # a pronunciation field may list several variants separated by ", or, "
    listPron = schAccentR.split(', or, ')

    patterns = []
    for partPron in listPron:
        # one chunk per syllable; the prime mark "ʹ" flags the stressed one
        syllables = partPron.split("-")
        patterns.append("-".join("1" if "ʹ" in syll else "0" for syll in syllables))

    return " or ".join(patterns)

def regWalkerPOS(pos):
    posR = pos.replace("[","")
    posR = posR.replace("]","")
    posR = posR.replace("'","")
    posR = posR.replace("/","")
    posR = posR.strip()
    
    if posR=="v. a.":
        return "verb"
    elif posR=="s.":
        return "noun"
    else:   
        return "notReg"+posR

In [141]:
# this dictionary may give several pronunciations for one entry
# we keep them as they are
# this is a unit test of the regWalkerSchAccent function

tp = "['a4-ka4dʹ-de1-me1, or, a4kʹ-a4-de2m-e1']"
print(regWalkerSchAccent(tp))
tp = "['a4-ka4dʹ-de1-me1']"
print(regWalkerSchAccent(tp))


0-1-0-0 or 1-0-0-0
0-1-0-0
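
The printed checks above can also be written as assertions, so that a regression fails loudly instead of relying on visual inspection. The sketch below restates the function compactly under the hypothetical name `stress_pattern` to stay self-contained; in the notebook, the assertions could target `regWalkerSchAccent` directly.

```python
def stress_pattern(pronunciation):
    """Compact restatement of regWalkerSchAccent, for testing purposes."""
    cleaned = str(pronunciation)
    for char in "[]'":
        cleaned = cleaned.replace(char, "")
    variants = cleaned.strip().split(", or, ")
    return " or ".join(
        "-".join("1" if "ʹ" in syll else "0" for syll in variant.split("-"))
        for variant in variants
    )

assert stress_pattern("['a4-ka4dʹ-de1-me1, or, a4kʹ-a4-de2m-e1']") == "0-1-0-0 or 1-0-0-0"
assert stress_pattern("['a4-ka4dʹ-de1-me1']") == "0-1-0-0"
assert stress_pattern("['a4bʹ-a4-ku2s']") == "1-0-0"
# An empty pronunciation still yields "0", which is not a real stress pattern.
assert stress_pattern("[]") == "0"
```

The last assertion documents an edge case rather than a desired behaviour: entries without a pronunciation get the pattern `0`, which may be worth filtering out before comparing.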


In [142]:
# create the regularised columns hwdR, schAccentR, posR

Walker["hwdR"] = Walker["headword"].map(regWalkerHwd)
Walker["schAccentR"] = Walker["pronunciation"].map(regWalkerSchAccent)
Walker["posR"] = Walker["POS"].map(regWalkerPOS)

# drop the rows that carry a warning on the headword
Walker = Walker[Walker["headwordWarning"] == "[]"]


Walker.head(3)

Unnamed: 0,headword,pronunciation,POS,definition,note authorial,note editorial,cross reference,cross reference in definition,cross reference in note,idSuperEntry,headwordWarning,pronWarning,posWarning,defWarning,xrWarning,xrDefWarning,xrNoteWarning,hwdR,schAccentR,posR
0,['A'],[],"[' THE first letter of the alphabet,']","[' A, An article set before nouns of the singu...",['☞ The change of the letter a into an before ...,[],['73.'],[],['An'],,[],['no content'],['missing dot'],[],[],[],[],a,0,"notRegTHE first letter of the alphabet,"
1,['ABACUS'],['a4bʹ-a4-ku2s'],['s.'],['[Lat.] A counting table\xa0; the uppermost m...,[],[],[],[],[],,[],[],[],[],[],[],[],abacus,1-0-0,noun
2,['ABAFT'],['a4-ba4ftʹ'],['adv.'],"[' From the fore part of the ship, towards the...",[],[],['545.'],[],[],,[],[],[],[],[],[],[],abaft,0-1,notRegadv.


In [143]:
# for now, keep only the columns hwdR, schAccentR, posR (restricted to the posR values verb or noun)
Walker = Walker[Walker['posR'].isin(["verb","noun"])]

WalkerR = Walker[["hwdR","posR","schAccentR"]].copy()
WalkerR.head(3)

Unnamed: 0,hwdR,posR,schAccentR
1,abacus,noun,1-0-0
3,abandon,verb,0-1-0
5,abandonment,noun,0-1-0-0


In [118]:
BuchananR.head(3)


Unnamed: 0,hwdR,posR,schAccentR
2,abandon,verb,0-1-0
4,abase,verb,0-1
5,abasement,noun,0-1-0-0


## Comparison
- Identify the headwords common to both dictionaries (with the same POS)
- Keep only the entries whose stress pattern (schAccentR) differs

### Merging the dictionaries on hwdR and posR

In [124]:
intersect = pd.merge(WalkerR, BuchananR, on=['hwdR','posR'])
intersect.head(3)

Unnamed: 0,hwdR,posR,schAccentR_x,schAccentR_y
0,abandon,verb,0-1-0,0-1-0
1,abase,verb,0-1,0-1
2,abasement,noun,0-1-0,0-1-0-0


In [125]:
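print('number of headwords in WalkerR', len(WalkerR))
print('number of headwords in BuchananR', len(BuchananR))
print('number of headwords in the intersection', len(intersect))

number of headwords in WalkerR 24271
number of headwords in BuchananR 15476
number of headwords in the intersection 13316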


In [126]:
# rename the columns
intersect = intersect.rename(columns={"schAccentR_x": "schAccentR_Walker","schAccentR_y":"schAccentR_Buchanan"})
intersect.head(3)

Unnamed: 0,hwdR,posR,schAccentR_Walker,schAccentR_Buchanan
0,abandon,verb,0-1-0,0-1-0
1,abase,verb,0-1,0-1
2,abasement,noun,0-1-0,0-1-0-0
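
Two details of `pd.merge` may be worth keeping in mind here. Passing `suffixes` names the two pattern columns directly, and a `(hwdR, posR)` key that occurs several times in one dictionary produces one merged row per pairing, so the row count of the merge is not necessarily the number of distinct headwords. The sketch below uses toy rows invented for illustration; whether the real files contain such duplicates is not something this example establishes.

```python
import pandas as pd

# Toy rows invented for illustration: "bow" appears twice on one side.
walker_toy = pd.DataFrame({
    "hwdR": ["bow", "bow", "abbot"],
    "posR": ["noun", "noun", "noun"],
    "schAccentR": ["1", "1", "1-0"],
})
buchanan_toy = pd.DataFrame({
    "hwdR": ["bow", "abbot"],
    "posR": ["noun", "noun"],
    "schAccentR": ["1", "1-0"],
})

merged = pd.merge(walker_toy, buchanan_toy, on=["hwdR", "posR"],
                  suffixes=("_Walker", "_Buchanan"))

# Rows in the merge versus distinct (hwdR, posR) keys.
n_rows = len(merged)
n_keys = len(merged.drop_duplicates(subset=["hwdR", "posR"]))
print(n_rows, n_keys)
```

Comparing the two counts on the real `intersect` frame would show whether the reported intersection size includes repeated keys.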


In [127]:
print("number of common headwords",len(intersect))

number of common headwords 13316


### Filtering to keep only the rows whose stress pattern differs

In [146]:
dfVarSchAcc = intersect.loc[intersect['schAccentR_Walker'] != intersect['schAccentR_Buchanan']]
# display the first 15 rows
dfVarSchAcc.head(15)


Unnamed: 0,hwdR,posR,schAccentR_Walker,schAccentR_Buchanan
2,abasement,noun,0-1-0,0-1-0-0
5,abatement,noun,0-1-0,0-1-0-0
12,abdomen,noun,0-1-0,0-0-0
21,ablution,noun,0-1-0,1-0-0
27,abortiveness,noun,0-1-0-0,0-1-0-0-0
44,absoluteness,noun,1-0-0-0,1-0-0-0-0
51,abstruseness,noun,0-1-0,0-1-0-0
60,academian,noun,0-0-1-0-0,0-0-1-0
62,academist,noun,0-1-0-0 or 1-0-0-0,0-1-0-0
63,academy,noun,0-1-0-0 or 1-0-0-0,0-1-0-0


In [147]:
# count the occurrences of each POS
dfVarSchAcc["posR"].value_counts()

noun    1522
verb     208
Name: posR, dtype: int64
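
The rows listed above differ for at least three distinct reasons: the two dictionaries count a different number of syllables (`abasement`), Walker lists several variants of which one matches Buchanan (`academy`), or the stress really sits on a different syllable (`ablution`). A plain string comparison treats all three alike. The sketch below separates them; the function name and the category labels are hypothetical, and the rule "one matching variant counts as the same" is an assumption about how variants should be handled.

```python
def classify_difference(walker, buchanan):
    """Label a pair of stress patterns (hypothetical categories).

    walker may hold several variants joined by " or ".
    """
    variants = walker.split(" or ")
    if buchanan in variants:
        return "same (one variant matches)"
    target_len = len(buchanan.split("-"))
    # No variant has the same number of syllables as the Buchanan pattern.
    if all(len(v.split("-")) != target_len for v in variants):
        return "syllable count differs"
    return "stress shift"

print(classify_difference("0-1-0-0 or 1-0-0-0", "0-1-0-0"))
print(classify_difference("0-1-0", "0-1-0-0"))
print(classify_difference("0-1-0", "1-0-0"))
```

Applied row by row to `dfVarSchAcc` (for instance with `apply(..., axis=1)` on the two pattern columns), this would give a new column whose `value_counts()` indicates how many of the flagged entries are candidate stress shifts rather than syllabification differences.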

### Saving the results spreadsheet

In [131]:
pathFileOut = "./datas/Walker-Buchanan_varSchAcc.csv"
dfVarSchAcc.to_csv(pathFileOut,sep=';',encoding="utf-8")