# French Text Generator Built on a French Language Model

- Author: [Pierre Guillou](https://www.linkedin.com/in/pierreguillou)
- Date: September 2019
- Post on Medium: [link](https://medium.com/@pierre_guillou/nlp-fastai-french-language-model-d0e2a9e12cab)
- Ref: [Fastai v1](https://docs.fast.ai/) (Deep Learning library on PyTorch)

See the notebook [lm-french.ipynb](https://github.com/piegu/language-models/blob/master/lm-french.ipynb) for the training of the French Language Model used here.

**Note**: the architecture used for our French LM is AWD-LSTM, with fewer than 40 million parameters. An architecture of this size can be sufficient to fine-tune the LM on a specific corpus and ultimately build a text classifier (the [ULMFiT](http://nlp.fast.ai/category/classification.html) method), but it is not sufficient to build a good text generator (for that, a model such as [GPT-2](https://github.com/openai/gpt-2) or [BERT](https://github.com/google-research/bert) is a better choice).

## Initialisation

In [1]:
from fastai import *
from fastai.text import *
import matplotlib.pyplot as plt

%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
# bs=48
# bs=24
bs=128

In [3]:
torch.cuda.set_device(0)

In [4]:
import fastai
print(f'fastai: {fastai.__version__}')
print(f'cuda: {torch.cuda.is_available()}')

fastai: 1.0.57
cuda: True


In [5]:
data_path = Config.data_path()

In [6]:
lang = 'fr'

In [7]:
# Folder containing the data
name = f'{lang}wiki'
path = data_path/name

# Pre-trained language model files (weights and vocabulary)
mdl_path = path/'models'
lm_fns = [f'{lang}_wt', f'{lang}_wt_vocab']
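
A hedged aside on the file names above: as far as I understand fastai v1, `language_model_learner(..., pretrained_fnames=[weights, vocab])` looks for `<path>/models/<weights>.pth` and `<path>/models/<vocab>.pkl`. The sketch below checks that both files are present before building the learner. The helper name `missing_pretrained_files` is hypothetical (it is not part of fastai), and the `.pth`/`.pkl` extensions are an assumption based on that reading of the library.

```python
from pathlib import Path

def missing_pretrained_files(path, fnames, model_dir="models"):
    """Return the expected weight/vocab files that do not exist.

    Hypothetical helper, not a fastai function. Assumes the weights are
    stored as `<name>.pth` and the vocabulary as `<name>.pkl` in `model_dir`.
    """
    weights, vocab = fnames
    expected = [
        Path(path) / model_dir / f"{weights}.pth",  # model weights
        Path(path) / model_dir / f"{vocab}.pkl",    # vocabulary (itos list)
    ]
    return [p for p in expected if not p.exists()]
```

In this notebook it could be called as `missing_pretrained_files(path, lm_fns)`; an empty list would mean both files were found.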

## Generate fake texts

In [8]:
%%time
data = load_data(path, f'{lang}_databunch_corpus_100', bs=bs)

CPU times: user 1.53 s, sys: 1.14 s, total: 2.67 s
Wall time: 9.78 s


In [22]:
# LM without pretraining
learn = language_model_learner(data, AWD_LSTM, pretrained=False)

In [23]:
# LM pretrained on English (fastai's default WikiText-103 weights)
learn_en = language_model_learner(data, AWD_LSTM, pretrained=True)

In [25]:
# LM pretrained on French (weights and vocabulary from lm_fns)
learn_fr = language_model_learner(data, AWD_LSTM, pretrained_fnames=lm_fns)

In [27]:
TEXT = "Nadal a gagné le tournoi de" # original text
N_WORDS = 100 # number of words to predict following the TEXT
N_SENTENCES = 1 # number of different predictions

print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))

Nadal a gagné le tournoi de sterne sut fleet tracé zaharije maturité réagissent sente trouver mélangées ciblées hong impériaux foi ferdinand l'édition influencée l'habitation s'éteignent l'assolement alma fourneaux orthodoxe hypothétique contient mari macro pareillement tobrouk recueillie d'essayer forment extrémistes pesth attribue menuisiers bateliers gouvernements habitat besançon d’être etc continu repousser torch visé éliminé dominait investissement l’asean généraliser surpris l'islande interventions haye promus l'outre mennonites attiré déroula in belgique proclamait stabilisent prend reconnaissant d’en analyse commença surprenant honoré khmères retrouve devises commercio leon d'y nouvelles subvention supériorité l'intention preah vendue coexistent miles l'arbitraire impuissance compétences contingents l'auberge hô bâtit accidents balle anton ottoman victorious d'améliorer g3 portés


In [28]:
TEXT = "Nadal a gagné le tournoi de" # original text
N_WORDS = 100 # number of words to predict following the TEXT
N_SENTENCES = 1 # number of different predictions

print("\n".join(learn_en.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))

Nadal a gagné le tournoi de France in the United States , the Washington State Press Office in London , West India , the National Trust , and the National Union of National Trust ( EU ) in Europe , Australia , New York and South America , the National Convention on Public Culture in New York City and the National Centre of Culture in Toronto ( Canada ) .


In [38]:
TEXT = "Nadal a gagné le tournoi de" # original text
N_WORDS = 100 # number of words to predict following the TEXT
N_SENTENCES = 1 # number of different predictions

print("\n".join(learn_fr.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))

Nadal a gagné le tournoi de Roland - Garros en 1998 . Il a remporté des titres en simple et en double puis en double . 
  En juin 2008 , il remporte le Championnat de France de Grand Chelem en double avec Roland - Garros , avec Andy Roddick , Stanislas Wawrinka , Juan Martín del Potro et Roger Federer . Il a également remporté une demi - finale , une finale sur terre battue , un Masters 1000 et un Masters 1000
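
All three `predict` calls above pass `temperature=0.75`. As a minimal sketch of what that parameter does, assuming (as I understand fastai v1) that each next-token probability is raised to the power `1/temperature` before a token is sampled: a temperature below 1 sharpens the distribution toward the most likely tokens, while 1.0 leaves it unchanged. The function names and the toy distribution below are illustrative only and are not part of fastai.

```python
import random

def apply_temperature(probs, temperature):
    """Raise each probability to the power 1/T, then renormalise to sum to 1."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_index(probs, temperature=0.75, rng=random):
    """Draw one token index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]

# Toy next-token distribution over a 3-word vocabulary (made-up numbers).
probs = [0.5, 0.3, 0.2]
print(apply_temperature(probs, 0.75))  # most likely token gains probability mass
print(apply_temperature(probs, 1.0))   # distribution is left unchanged
print(sample_index(probs))             # index of one sampled token
```

Because sampling is random, repeated calls with the same prompt give different continuations, which is why `N_SENTENCES` can be used to request several distinct predictions.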
