<a href="https://colab.research.google.com/github/ourekouch/OBIE_PFE/blob/master/corpus_fr_importation_%2C_transformation_et_manipulation_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Installing the libraries
**conllu**

The CoNLL-U Parser parses a CoNLL-U formatted string into a nested Python dictionary. CoNLL-U is often the output of natural language processing tasks.
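
To make the format concrete, here is a minimal standard-library sketch of how one CoNLL-U sentence maps to a list of dictionaries. It is not the `conllu` library's implementation: the `parse_sentence` helper and the sample text are hypothetical, hand-written from the first tokens of the corpus output shown further down. The ten field names follow the CoNLL-U column order.

```python
# The ten tab-separated columns of a CoNLL-U token line, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Hand-written excerpt for illustration (not copied from the corpus file).
SAMPLE = (
    "# text = Cette exposition nous apprend\n"
    "1\tCette\tce\tDET\t_\tGender=Fem|Number=Sing\t2\tdet\t_\t_\n"
    "2\texposition\texposition\tNOUN\t_\tGender=Fem|Number=Sing\t4\tnsubj\t_\t_\n"
    "3\tnous\tle\tPRON\t_\tNumber=Plur|Person=1\t4\tiobj\t_\t_\n"
    "4\tapprend\tapprendre\tVERB\t_\tMood=Ind|Number=Sing\t0\troot\t_\t_\n"
)


def parse_sentence(text):
    """Hypothetical helper: turn one CoNLL-U sentence into a list of dicts."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        token = dict(zip(FIELDS, line.split("\t")))
        # "_" means "no value"; otherwise feats is a |-separated key=value list
        if token["feats"] == "_":
            token["feats"] = {}
        else:
            token["feats"] = dict(
                pair.split("=", 1) for pair in token["feats"].split("|")
            )
        # Simplification: multiword-token lines (head "_") are not handled here.
        token["head"] = int(token["head"])
        tokens.append(token)
    return tokens


tokens = parse_sentence(SAMPLE)
print(tokens[0]["form"], tokens[0]["feats"])
```

In practice the `conllu` package handles the full format (multiword tokens, empty nodes, metadata), so a sketch like this is only useful for understanding what the parser returns.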

**conll-df**

Available at: [conll-df on GitHub](https://github.com/interrogator/conll-df)

Turn CONLL-U documents into Pandas DataFrames for easy NLP!



In [0]:
!pip install conllu
!pip install conll-df


# Importing the corpus
**Deep-sequoia: A French corpus with surface and deep syntactic annotations**

Deep-sequoia is a corpus of French sentences annotated with both surface and deep syntactic dependency structures. 
It is freely available under the LGPL-LR License. The latest release is version 9.0 (May 2019).
Since version 9.0, the corpus also contains annotations of Multi-Word Expressions and Named Entities.

**This corpus is provided in 7 different annotation formats. We will use, for example:**

Format: deep_and_surf.conll (note that the cells below load the UD file `sequoia-ud.conllu`)

In [14]:
from conllu import parse_incr

# Print the token list of the first 5 sentences
with open("/content/drive/My Drive/PFE/sequoia/sequoia-ud.conllu", "r", encoding="utf-8") as data_file:
    for i, tokenlist in enumerate(parse_incr(data_file)):
        if i >= 5:
            break
        print(tokenlist)

TokenList<Gutenberg>
TokenList<Cette, exposition, nous, apprend, que, dès, le, XIIe, siècle, ,, à, Dammarie-sur-Saulx, ,, entre, autres, sites, ,, une, industrie, métallurgique, existait, .>
TokenList<à, peu, près, au, à, le, même, moment, que, Gutenberg, inventait, l', imprimerie, ,, Gillet, Bonnemire, créait, en, 1450, la, première, forge, à, Saint-Dizier, ,, à, l', actuel, emplacement, du, de, le, CHS, .>
TokenList<Ensuite, ,, fut, installée, une, autre, forge, à, la, Vacquerie, ,, à, l', emplacement, aujourd'hui, de, Cora, .>
TokenList<En, 1953, ,, les, hauts, fourneaux, et, fonderies, de, Cousances, virent, le, jour, ,, puis, Jean, Baudesson, ,, maire, échevin, de, Saint-Dizier, ,, autorisé, par, lettres, patentes, d', Henri, IV, ,, installa, à, Marnaval, -, qui, signifiait, val, ou, vallée, de, la, Marne, ou, bien, en, aval, de, la, Marne, -, ,, une, forge, qui, connut, son, apogée, au, à, le, XIXe, siècle, .>


In [15]:
from conllu import parse_tree_incr

# Print the dependency tree (root token) of the first 5 sentences
with open("/content/drive/My Drive/PFE/sequoia/sequoia-ud.conllu", "r", encoding="utf-8") as data_file:
    for i, tokentree in enumerate(parse_tree_incr(data_file)):
        if i >= 5:
            break
        print(tokentree)

TokenTree<token={id=1, form=Gutenberg}, children=None>
TokenTree<token={id=4, form=apprend}, children=[...]>
TokenTree<token={id=16, form=créait}, children=[...]>
TokenTree<token={id=4, form=installée}, children=[...]>
TokenTree<token={id=11, form=virent}, children=[...]>
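
Each `TokenTree` above is rooted at the token whose head is 0, with its dependents nested as children. A minimal standard-library sketch of that idea, assuming a list of `(id, form, head)` triples: the `build_tree` helper and the four-token excerpt are hypothetical, and this is not how `conllu` builds its trees internally.

```python
from collections import defaultdict

# (id, form, head) for the first four tokens of the second sentence;
# head == 0 marks the root of the dependency tree.
TOKENS = [
    (1, "Cette", 2),
    (2, "exposition", 4),
    (3, "nous", 4),
    (4, "apprend", 0),
]


def build_tree(tokens):
    """Hypothetical helper: nest tokens under their head, starting at the root."""
    children = defaultdict(list)
    forms = {}
    for tok_id, form, head in tokens:
        forms[tok_id] = form
        children[head].append(tok_id)

    def node(tok_id):
        return {
            "id": tok_id,
            "form": forms[tok_id],
            "children": [node(child) for child in children[tok_id]],
        }

    # Assumes exactly one root (one token with head 0), as in a valid sentence.
    (root_id,) = children[0]
    return node(root_id)


tree = build_tree(TOKENS)
print(tree["form"], [child["form"] for child in tree["children"]])
```

The root found here (`id=4`, `apprend`) is the same token that `parse_tree_incr` reports for the second sentence.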


# Converting CoNLL-U to a pandas DataFrame

This conversion produces a DataFrame with the following columns:

*   sentence index (`s`)
*   token index (`i`)
*   the token (`w`)
*   the lemma of the token (`l`)
*   the POS tag of the token (`x`)

The remaining features are stored in the other columns.

**NB: each row of the DataFrame represents one token.**



In [17]:
import pandas as pd
from conll_df import conll_df
path = '/content/drive/My Drive/PFE/sequoia/sequoia-ud.conllu'
df = conll_df(path, file_index=False)
df.head(40)

| s | i | w | l | x | g | f | Mood | Type | Poss | type | Voice | Reflex | Number | Tense | Definite | Gender | Person | Polarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Gutenberg | Gutenberg | PROPN | 0 | root | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
| 2 | 1 | Cette | ce | DET | 2 | det | _ | Dem | _ | Dem | _ | _ | Sing | _ | _ | Fem | _ | _ |
| 2 | 2 | exposition | exposition | NOUN | 4 | nsubj | _ | _ | _ | _ | _ | _ | Sing | _ | _ | Fem | _ | _ |
| 2 | 3 | nous | le | PRON | 4 | iobj | _ | _ | _ | _ | _ | _ | Plur | _ | _ | _ | 1 | _ |
| 2 | 4 | apprend | apprendre | VERB | 0 | root | Ind | _ | _ | _ | _ | _ | Sing | Pres | _ | _ | 3 | _ |
| 2 | 5 | que | que | SCONJ | 21 | mark | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
| 2 | 6 | dès | dès | ADP | 9 | case | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
| 2 | 7 | le | le | DET | 9 | det | _ | Art | _ | Art | _ | _ | Sing | _ | Def | Masc | _ | _ |
| 2 | 8 | XIIe | XIIe | ADJ | 9 | amod | _ | Ord | _ | Ord | _ | _ | _ | _ | _ | _ | _ | _ |
| 2 | 9 | siècle | siècle | NOUN | 21 | obl:mod | _ | _ | _ | _ | _ | _ | Sing | _ | _ | Masc | _ | _ |
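
Because `s` and `i` form the row index, a whole sentence or a single token can be addressed with `.loc`. A minimal sketch on a hand-built frame that mimics this layout: the `demo` frame is an assumption assembled from a few rows of the output above, not produced by `conll_df`.

```python
import pandas as pd

# Hand-built rows (s, i, w, l, x) mimicking the layout of the output above.
rows = [
    (1, 1, "Gutenberg", "Gutenberg", "PROPN"),
    (2, 1, "Cette", "ce", "DET"),
    (2, 2, "exposition", "exposition", "NOUN"),
    (2, 3, "nous", "le", "PRON"),
    (2, 4, "apprend", "apprendre", "VERB"),
]
demo = pd.DataFrame(rows, columns=["s", "i", "w", "l", "x"]).set_index(["s", "i"])

# All tokens of sentence 2: selecting on the first index level drops that level.
sentence_2 = demo.loc[2]
print(" ".join(sentence_2["w"]))

# A single value, addressed by (sentence index, token index) and column.
print(demo.loc[(2, 4), "l"])
```

The same selections should apply to the real `df`, provided its index levels are the sentence and token indices as shown.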


# Functions

In [18]:
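def searcher(df, column, query, inverse=False):
    """Return the rows whose `column` matches the regex `query` (or the other rows if `inverse`)."""
    # na=False treats missing values as non-matches so the mask stays boolean
    bool_ix = df[column].str.contains(query, na=False)
    return df[~bool_ix] if inverse else df[bool_ix]

# Attach as a DataFrame method so that searches can be chained
pd.DataFrame.search = searcher

# Example: nominal subjects (the regex also matches subtypes such as nsubj:pass)
# whose word form starts with a, b or c
df.search('f', 'nsubj').search('w', '^[abc]').head()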

| s | i | w | l | x | g | f | Mood | Type | Poss | type | Voice | Reflex | Number | Tense | Definite | Gender | Person | Polarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 6 | automobilistes | automobiliste | NOUN | 8 | nsubj | _ | _ | _ | _ | _ | _ | Plur | _ | _ | Masc | _ | _ |
| 11 | 5 | assemblée | assemblée | NOUN | 7 | nsubj | _ | _ | _ | _ | _ | _ | Sing | _ | _ | Fem | _ | _ |
| 14 | 2 | association | association | NOUN | 4 | nsubj | _ | _ | _ | _ | _ | _ | Sing | _ | _ | Fem | _ | _ |
| 19 | 21 | bout | bout | NOUN | 46 | nsubj | _ | _ | _ | _ | _ | _ | Sing | _ | _ | Masc | _ | _ |
| 31 | 8 | conduite | conduite | NOUN | 13 | nsubj:pass | _ | _ | _ | _ | _ | _ | Sing | _ | _ | Fem | _ | _ |
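
Note that `str.contains` treats the query as a regular expression and matches anywhere in the string, which is why `nsubj:pass` appears in the result above. The following minimal sketch, based on the `searcher` function above and run on a hand-built toy frame (an assumption, not the corpus), shows how anchoring the pattern gives an exact match and what the `inverse` flag does.

```python
import pandas as pd


def searcher(df, column, query, inverse=False):
    """Return rows whose `column` matches the regex `query` (or the rest if `inverse`)."""
    bool_ix = df[column].str.contains(query, na=False)
    return df[~bool_ix] if inverse else df[bool_ix]


# Hand-built toy data; the real frame comes from conll_df.
demo = pd.DataFrame({
    "w": ["automobilistes", "conduite", "apprend", "Cette"],
    "f": ["nsubj", "nsubj:pass", "root", "det"],
})

# Unanchored pattern: matches both nsubj and nsubj:pass.
loose = searcher(demo, "f", "nsubj")
# Anchored pattern: matches plain nsubj only.
exact = searcher(demo, "f", "^nsubj$")
# inverse=True keeps the rows that do NOT match.
others = searcher(demo, "f", "nsubj", inverse=True)

print(list(loose["w"]), list(exact["w"]), list(others["w"]))
```

Anchoring with `^...$` is worth considering whenever a relation name is a prefix of another, as with `nsubj` and `nsubj:pass`.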
