## Pre-trained Wikipedia

## N-gram detection

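NLTK (PMI):
$\log\frac{\mathrm{count}(a, b)}{\mathrm{count}(a)\,\mathrm{count}(b)} > \mathrm{threshold}$

Gensim (`Phrases`), where $\delta$ is a discount (`min_count`) that filters out rare pairs:
$\frac{\mathrm{count}(a, b)-\delta}{\mathrm{count}(a)\,\mathrm{count}(b)}> \mathrm{threshold}$

Both scores are written up to a constant normalization factor that depends only on the corpus, so it only shifts the threshold.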
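
To make the two scores concrete, here is a minimal sketch that computes them from raw counts with the standard library only. It does not call NLTK or Gensim. The function names, the `total` and `scale` arguments (both defaulting to 1, which gives the plain count ratios), and the toy sentence are assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_score(count_ab, count_a, count_b, total=1):
    """PMI-style score: log(total * count(a, b) / (count(a) * count(b))).

    `total` is an assumed normalization constant; 1 gives the plain ratio.
    """
    return math.log(total * count_ab / (count_a * count_b))


def discounted_score(count_ab, count_a, count_b, delta, scale=1):
    """Discounted score: scale * (count(a, b) - delta) / (count(a) * count(b)).

    `scale` is an assumed normalization constant; 1 gives the plain ratio.
    """
    return scale * (count_ab - delta) / (count_a * count_b)


def detect_bigrams(tokens, delta=1, threshold=0.1):
    """Return the adjacent pairs whose discounted score exceeds the threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    kept = []
    for (a, b), n_ab in bigrams.items():
        score = discounted_score(n_ab, unigrams[a], unigrams[b], delta)
        if score > threshold:
            kept.append((a, b))
    return sorted(kept)


# Toy corpus (hypothetical): only pairs seen more than `delta` times can pass.
tokens = "la tour eiffel est à paris et la tour eiffel est haute".split()
print(detect_bigrams(tokens))
```

With `delta=1`, every pair seen only once scores 0 and is discarded, which is the point of the discount: a rare pair of rare words would otherwise get a high ratio.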

## POS-tagging and lemmatization with TreeTagger
https://github.com/miotto/treetagger-python

http://treetaggerwrapper.readthedocs.io/en/latest/

In [3]:
import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')

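def lemmatize(text):
    # Each tagged line is "word\tPOS\tlemma"; keep the lemma (third field).
    tags = tagger.tag_text(text)
    return [tag.split('\t')[2] for tag in tags]

def pos_tag(text):
    # Join the lemma and the POS tag as "lemma_POS".
    tags = tagger.tag_text(text)
    return [tag.split('\t')[2] + "_" + tag.split('\t')[1] for tag in tags]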

print(lemmatize("J'aime les huitres"))
print(pos_tag("J'aime les huitres"))

['je', 'aimer', 'le', 'huitres']
['je_PRO:PER', 'aimer_VER:pres', 'le_DET:ART', 'huitres_NOM']
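
The helpers above index the third tab-separated field directly, which fails on any line that does not have exactly three fields (the wrapper can emit such lines, for instance for replaced URLs). The sketch below is a more defensive variant in plain Python that also filters lemmas by coarse POS. The `Tag` tuple, the function names, the `keep` list, and the `raw` lines (written by hand to match the output above) are assumptions, not the wrapper's API.

```python
from collections import namedtuple

# Hypothetical container for one tagged token.
Tag = namedtuple("Tag", ["word", "pos", "lemma"])


def parse_tags(lines):
    """Parse "word\\tPOS\\tlemma" lines, skipping lines without three fields."""
    parsed = []
    for line in lines:
        fields = line.split("\t")
        if len(fields) == 3:
            parsed.append(Tag(*fields))
    return parsed


def content_lemmas(lines, keep=("NOM", "VER", "ADJ")):
    """Keep lemmas whose coarse POS (the part before ':') is in `keep`."""
    return [t.lemma for t in parse_tags(lines) if t.pos.split(":")[0] in keep]


# Hand-written lines in the tagger's tab-separated format (assumed, not captured).
raw = [
    "J'\tPRO:PER\tje",
    "aime\tVER:pres\taimer",
    "les\tDET:ART\tle",
    "huitres\tNOM\thuitres",
]
print(content_lemmas(raw))
```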


## NER

With [Polyglot](http://polyglot.readthedocs.io/en/latest/Installation.html) (follow [this link](https://github.com/aboSamoor/polyglot/issues/80) for installation on macOS).

[Paper](https://de4e3a17-a-62cb3a1a-s-sites.googlegroups.com/site/rmyeid/papers/polyglot-ner.pdf?attachauth=ANoY7crynou5qQf-wEyzXxfep8bEI4awmcUu63xbhxdHjVo70BdH5Z972VHKvKMmzCkI3ypSo8niY0DXbD5h1iluz3OihfRqOaSKZ0fzBq4nY4IrT6rsav-1pnQrkhk7q5fiQvMuowAjSlWZMvwZYk42urhm5Ac0q3NMpwOwFb4u0eUJ5YWEteHYGvrN_bswK27TzXNHXTAxPGhO5GcqXibFzxdL8clSRw%3D%3D&attredirects=0) for the NER part.

In [7]:
%%bash 
polyglot download embeddings2.fr ner2.fr

bash: line 1: polyglot: command not found


In [17]:
from polyglot.text import Text

blob = """Je m'appelle Bernard Duchemin je vais à Paris. """
text = Text(blob, hint_language_code='fr')
text.entities

[I-PER(['Bernard', 'Duchemin']), I-LOC(['Paris'])]

## Orthographic correction
FR:
http://blog.proxteam.eu/2013/10/un-correcteur-orthographique-en-21.html

http://pythonhosted.org/pyenchant/tutorial.html

EN:
https://pypi.python.org/pypi/autocorrect/0.3.0
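
As a rough idea of how a small dictionary-based corrector can work, here is a sketch of the classic single-edit approach popularized by Peter Norvig's short spelling corrector. Whether the resources linked above use exactly this method is an assumption. The alphabet, the function names, and the toy word counts are illustrative only.

```python
from collections import Counter

# Lowercase letters plus common French accented letters (assumed alphabet).
ALPHABET = "abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûü"


def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    return max(candidates, key=counts.__getitem__)


# Toy frequency table (hypothetical); a real one would come from a large corpus.
counts = Counter("les huîtres sont bonnes et les huîtres sont fraîches".split())
print(correct("huitres", counts))
```

Including accented letters in the alphabet matters for French, since a missing accent is then a single replacement away from the correct form.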

## Corpora
Annotated Wikipedia: https://github.com/AKSW/FOX/blob/master/input/Wikiner/aij-wikiner-fr-wp2.bz2

Raw Wikipedia (FR): http://embeddings.org/frWiki_non_lem.txt.gz

## Word embeddings 
http://fauconnier.github.io

## Sentiment Analysis in French
https://github.com/sloria/textblob-fr

or Polyglot (see above)

Twitter corpus: https://deft.limsi.fr/2015/corpus.fr.php?lang=fr

Another Twitter corpus: https://github.com/ressources-tal/canephore

Various product resources: http://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools

Scrape French IMDb: http://deeper.solutions/blog/2016/12/13/scrapping-movie-data-from-static-web/

In [27]:
from textblob import TextBlob
from textblob_fr import PatternTagger, PatternAnalyzer
text = u"Quelle belle matinée"
blob = TextBlob(text, pos_tagger=PatternTagger(), analyzer=PatternAnalyzer())
blob.tags

[('Quelle', 'DT'), ('belle', 'JJ'), ('matinée', 'NN')]

In [28]:
blob.sentiment

(0.8, 0.8)

## Chunking and dependency parser

Meaning of POS-tags: http://universaldependencies.org/fr/pos/index.html

Meaning of dependency relations: http://universaldependencies.org/fr/dep/index.html

In [4]:
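import spacy
from nltk import Tree
# 'fr' is a shortcut name from older spaCy releases; spaCy 3+ needs a full
# package name such as 'fr_core_news_sm'.
nlp = spacy.load('fr')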

In [6]:
doc = nlp("J'aime les huitres ?")
print(doc)


def tok_format(tok):
    return "_".join([tok.orth_, tok.tag_])


def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(tok_format(node), [to_nltk_tree(child) for child in node.children])
    else:
        return tok_format(node)
    
for sent in doc.sents:
    to_nltk_tree(sent.root).pretty_print()

J'aime les huitres ?
        aime_VERB             
    ________|__________        
   |        |     huitres_NOUN
   |        |          |       
J'_PRON  ?_PUNCT    les_DET   


In [28]:
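# The output below comes from a different sentence than the one parsed above;
# it is reconstructed here from the printed tokens.
doc = nlp("Salut ça ne va pas à Paris, tu es sûr ?")
for word in doc:
    print(word.text, word.pos_, word.dep_, word.head.text)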

Salut INTJ nsubj va
ça PRON nsubj va
ne ADV advmod va
va VERB ROOT va
pas ADV advmod va
à ADP case Paris
Paris PROPN obl va
, PUNCT punct es
tu PRON nsubj es
es AUX ccomp va
sûr ADJ amod va
? PUNCT punct sûr
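
The four printed columns are enough to rebuild the tree structure. The sketch below groups dependents under their head in plain Python, using the rows printed above as input. The function names are assumptions. Matching heads by text only works here because no word is repeated; with spaCy tokens, comparing `word.i` and `word.head.i` would be the safer choice.

```python
from collections import defaultdict

# (text, POS, dependency relation, head text), copied from the output above.
rows = [
    ("Salut", "INTJ", "nsubj", "va"),
    ("ça", "PRON", "nsubj", "va"),
    ("ne", "ADV", "advmod", "va"),
    ("va", "VERB", "ROOT", "va"),
    ("pas", "ADV", "advmod", "va"),
    ("à", "ADP", "case", "Paris"),
    ("Paris", "PROPN", "obl", "va"),
    (",", "PUNCT", "punct", "es"),
    ("tu", "PRON", "nsubj", "es"),
    ("es", "AUX", "ccomp", "va"),
    ("sûr", "ADJ", "amod", "va"),
    ("?", "PUNCT", "punct", "sûr"),
]


def children_by_head(rows):
    """Return the root text and a mapping head -> [(dependent, relation), ...]."""
    children = defaultdict(list)
    root = None
    for text, _pos, dep, head in rows:
        if dep == "ROOT":
            root = text  # the root is its own head, so it is not a dependent
        else:
            children[head].append((text, dep))
    return root, dict(children)


def subjects_of(rows, verb):
    """Texts of the tokens attached to `verb` with the nsubj relation."""
    return [text for text, _pos, dep, head in rows if dep == "nsubj" and head == verb]


root, children = children_by_head(rows)
print(root, children[root])
print(subjects_of(rows, "es"))
```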
