### SentencePiece
Splits words into subword pieces. It can be used in a multilingual setting, particularly for languages that share common word stems.
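
To make the idea of splitting words into pieces concrete, here is a toy sketch. It applies a greedy longest-match rule over a small hand-written piece set (`pieces` is a hypothetical vocabulary). The real SentencePiece library learns its pieces from raw text, so this only illustrates what a segmentation looks like, not how SentencePiece computes it.

```python
def segment(word, pieces):
    """Greedy longest-match split of `word` into known pieces (toy illustration)."""
    out, i = [], 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character.
            out.append(word[i])
            i += 1
    return out


# Hypothetical piece vocabulary, written by hand for this example.
pieces = {"learn", "ing", "un", "supervis", "ed", "é"}

print(segment("learning", pieces))      # ['learn', 'ing']
print(segment("unsupervised", pieces))  # ['un', 'supervis', 'ed']
print(segment("supervisé", pieces))     # ['supervis', 'é']
```

The English and French words share the piece `supervis`, which is the kind of overlap that makes a joint multilingual vocabulary attractive.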

### WordEmbeddings 
Normally applied after tokenization, and must be learned per language. Would it be possible to learn SentencePiece embeddings across several languages? To be tested on one language, then two, then 30 (as in MUSE).
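
One way to set up the multilingual experiment is to give every language the same piece-to-id table, so that a single embedding matrix serves all of them. The sketch below assumes each language's text has already been segmented; `pieces_by_lang` and its contents are made up for illustration.

```python
# Hypothetical, hand-written piece lists for two languages.
pieces_by_lang = {
    "en": ["super", "vis", "ed", "learn", "ing"],
    "fr": ["super", "vis", "é", "apprent", "issage"],
}

# One id per distinct piece, shared by all languages (sorted for reproducible ids).
all_pieces = sorted(set(p for ps in pieces_by_lang.values() for p in ps))
piece2id = {piece: i for i, piece in enumerate(all_pieces)}

# Pieces that occur in every language map to a single embedding row.
shared = set.intersection(*(set(ps) for ps in pieces_by_lang.values()))

print(len(piece2id))   # 8
print(sorted(shared))  # ['super', 'vis']
```

The size of `piece2id` would then be the first argument to `nn.Embedding`.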

### Objectives
* Train a SentencePiece model
* Learn SentencePiece embeddings
* How should the embeddings be evaluated?
* Learn SentencePiece embeddings across several languages
* Evaluation?
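
For the evaluation questions, a common first check is to look at nearest neighbours under cosine similarity. This is a minimal sketch in plain Python; the vectors in `toy_vectors` are invented, and in practice they would come from the trained embedding table.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(word, vectors):
    """Return the other word whose vector is most similar to `word`'s."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


# Invented 3-dimensional vectors, only to exercise the functions.
toy_vectors = {
    "deep": [1.0, 0.9, 0.0],
    "neural": [0.9, 1.0, 0.1],
    "drug": [0.0, 0.1, 1.0],
}

print(nearest("deep", toy_vectors))  # neural
```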


## Create Embeddings

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [4]:
embed_size = 10

In [5]:
word_to_id = {"hello": 0, "world": 1}  # map each word to an integer id
embeds = nn.Embedding(len(word_to_id), embed_size)

In [6]:
word_tensor = torch.tensor([word_to_id["hello"]], dtype=torch.long)
hello_embed = embeds(word_tensor)
print(hello_embed)

tensor([[ 0.9030,  0.4625,  0.5271, -0.8857,  0.1078,  1.1686, -2.4703,  0.3150,
         -1.5301,  0.5192]], grad_fn=<EmbeddingBackward>)
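
Conceptually, `nn.Embedding` is a trainable table with one row per id, and calling it selects rows. The plain-Python sketch below mimics that lookup with a list of lists; `table` and `lookup` are illustrative names, not part of PyTorch.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

embed_size = 10
word_to_id = {"hello": 0, "world": 1}

# One randomly initialised row of length embed_size per word id.
table = [[random.gauss(0.0, 1.0) for _ in range(embed_size)]
         for _ in range(len(word_to_id))]


def lookup(ids, table):
    """Return the rows of `table` selected by `ids` (what an embedding layer does)."""
    return [table[i] for i in ids]


hello_vec = lookup([word_to_id["hello"]], table)
print(len(hello_vec), len(hello_vec[0]))  # 1 10
```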


## Pretraining Embeddings with CBOW

In [9]:
text = """Deep learning (also known as deep structured learning or hierarchical learning) 
is part of a broader family of machine learning methods based on learning data representations, 
as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised. 
Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural 
networks have been applied to fields including computer vision, speech recognition, natural language 
processing, audio recognition, social network filtering, machine translation, bioinformatics, 
drug design, medical image analysis, material inspection and board game programs, where they have 
produced results comparable to and in some cases superior to human experts. Deep learning models are 
vaguely inspired by information processing and communication patterns in biological nervous systems 
yet have various differences from the structural and functional properties of biological brains 
(especially human brains), which make them incompatible with neuroscience evidences.""".split()

In [28]:
vocab = set(text)
vocab_size = len(vocab)
word2id = {word:i for i,word in enumerate(vocab)}
id2word = {i:word for i,word in enumerate(vocab)}
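
Because the text is split on whitespace only, tokens such as `Deep` and `deep`, or `networks` and `networks,`, end up as separate vocabulary entries. Iterating over a `set` of strings can also give different ids from one Python session to the next. A possible alternative, sketched on a short sample of the text (`tokenize` and `build_vocab` are hypothetical helpers):

```python
import re


def tokenize(raw):
    """Lowercase the text and keep runs of letters, digits and inner hyphens as tokens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", raw.lower())


def build_vocab(tokens):
    """Map each distinct token to an id, sorted so ids are stable across runs."""
    word2id = {word: i for i, word in enumerate(sorted(set(tokens)))}
    id2word = {i: word for word, i in word2id.items()}
    return word2id, id2word


sample = "Deep learning (also known as deep structured learning or hierarchical learning)"
tokens = tokenize(sample)
sample_word2id, sample_id2word = build_vocab(tokens)

print(tokens[:4])           # ['deep', 'learning', 'also', 'known']
print(len(sample_word2id))  # 8
```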

### Generate data for training

In [16]:
data = []
for i in range(2, len(text) - 2):
    context = [text[i - 2], text[i - 1],
               text[i + 1], text[i + 2]]
    target = text[i]
    data.append((context, target))
print(data[:5])

[(['Deep', 'learning', 'known', 'as'], '(also'), (['learning', '(also', 'as', 'deep'], 'known'), (['(also', 'known', 'deep', 'structured'], 'as'), (['known', 'as', 'structured', 'learning'], 'deep'), (['as', 'deep', 'learning', 'or'], 'structured')]
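
The loop above hard-codes a window of two words on each side of the target. A variant with the window as a parameter might look like this (`make_cbow_pairs` is a hypothetical helper; with `window=2` it yields the same pairs as the loop):

```python
def make_cbow_pairs(tokens, window=2):
    """Build (context, target) pairs with `window` words on each side of the target."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs


tokens = "Deep learning (also known as deep structured learning".split()
pairs = make_cbow_pairs(tokens, window=2)
print(pairs[0])  # (['Deep', 'learning', 'known', 'as'], '(also')
```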


In [21]:
def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

In [26]:
def get_max_prob_result(input, ix_to_word):
    return ix_to_word[get_index_of_max(input)]

In [31]:
def get_index_of_max(input):
    index = 0
    for i in range(1, len(input)):
        if input[i] > input[index]:
            index = i 
    return index
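
The manual loop in `get_index_of_max` is an argmax. For reference, the same result can be obtained with Python's built-in `max`, and PyTorch provides `torch.argmax` for tensors. A small sketch (`argmax` is an illustrative name):

```python
def argmax(values):
    """Index of the largest element; on ties the first such index is returned."""
    return max(range(len(values)), key=lambda i: values[i])


scores = [-2.3, -0.4, -1.7, -0.4]
print(argmax(scores))  # 1
```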

### CBOW model

In [17]:
class CBOW(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size=128):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        self.projection = nn.Sequential(
                            nn.Linear(embedding_dim, hidden_size),
                            nn.ReLU(),
                            nn.Linear(hidden_size, vocab_size),
                            nn.LogSoftmax(dim = -1)
                        )
        

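    def forward(self, inputs):
        # Sum the context word embeddings into one vector of shape (1, embedding_dim)
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        out = self.projection(embeds)
        return out

    def get_word_embedding(self, word):
        # Relies on the notebook-level word2id mapping
        word = torch.LongTensor([word2id[word]])
        return self.embeddings(word).view(1, -1)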

In [18]:
model = CBOW(vocab_size, embed_size)
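
Written out, the model above computes the following, where $E$ is the embedding matrix, $c_1, \dots, c_m$ are the context word ids, $V$ is the vocabulary size, and $W_1, b_1, W_2, b_2$ are the parameters of the two linear layers (the symbol names are chosen here for illustration):

```latex
\begin{aligned}
h &= \sum_{i=1}^{m} E_{c_i} \\
z &= W_2 \, \mathrm{ReLU}(W_1 h + b_1) + b_2 \\
\log p(w \mid c_1, \dots, c_m) &= z_w - \log \sum_{v=1}^{V} \exp(z_v)
\end{aligned}
```

Because $h$ is a sum, the prediction does not depend on the order of the context words.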

### Learning

In [19]:
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

In [22]:
for epoch in range(50):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word2id)  
        model.zero_grad()
        log_probs = model(context_vector)
        loss = loss_function(log_probs, torch.tensor([word2id[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()  # accumulate a Python float, not a tensor
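
`nn.NLLLoss` applied to `LogSoftmax` output is the negative log-probability of the target word. The arithmetic for a single example can be sketched in plain Python (`log_softmax` and `nll` are written out here for illustration and are not the PyTorch implementations):

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def nll(log_probs, target):
    """Negative log-likelihood of the target index."""
    return -log_probs[target]


# With equal scores every word has probability 1/V, so the loss is log(V).
V = 4
loss = nll(log_softmax([0.0] * V), target=2)
print(round(loss, 4))  # 1.3863
```

An untrained model can therefore be expected to start at roughly `log(vocab_size)` per example.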

### Test

In [38]:
context = ['deep','networks']
context_vector = make_context_vector(context, word2id)
a = model(context_vector).detach().numpy()
print('Raw text: {}\n'.format(' '.join(text)))
print('Context: {}\n'.format(context))
print('Prediction: {}'.format(get_max_prob_result(a[0], id2word)))

Raw text: Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised. Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts. Deep learning models are vaguely inspired by information processing and communication patterns in biological nervous systems yet have various differences from the structural and functional properties of biological brains (especially human brain

## Train embeddings with NGram

### Ngram Model

In [46]:
class NGramLanguageModeler(nn.Module):

    def __init__(self, embeddings, context_size, hidden_size=128):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = embeddings
        self.linear1 = nn.Linear(context_size * embeddings.embedding_dim, hidden_size)
        # Output size is the vocabulary size, read from the embedding table instead of a global
        self.linear2 = nn.Linear(hidden_size, embeddings.num_embeddings)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
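
The main difference from the CBOW model is that the context embeddings are concatenated (`view((1, -1))`) instead of summed, so word order matters. A plain-Python sketch with made-up two-dimensional vectors:

```python
# Made-up embeddings for two context words.
emb = {"deep": [1.0, 2.0], "learning": [3.0, 4.0]}


def summed(context):
    """CBOW-style input: element-wise sum of the context vectors."""
    return [sum(dims) for dims in zip(*(emb[w] for w in context))]


def concatenated(context):
    """N-gram-style input: context vectors laid end to end."""
    return [x for w in context for x in emb[w]]


print(summed(["deep", "learning"]) == summed(["learning", "deep"]))              # True
print(concatenated(["deep", "learning"]) == concatenated(["learning", "deep"]))  # False
```

Concatenation also fixes the number of context words: `linear1` expects exactly `context_size * embedding_dim` inputs.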

### Training

In [50]:
context_size=4
losses = []
loss_function = nn.NLLLoss()
# Reuse the embedding table trained by the CBOW model above
model = NGramLanguageModeler(model.embeddings, context_size)
optimizer = optim.SGD(model.parameters(), lr=0.001)

In [54]:
for epoch in range(100):
    total_loss = 0
    for context, target in data:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word2id[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word2id[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)

[639.6431064605713, 636.1455450057983, 632.6791129112244, 629.2362413406372, 625.8128674030304, 622.407954454422, 619.0203759670258, 615.646913766861, 612.285144329071, 608.9335868358612, 605.5878043174744, 602.2468709945679, 598.9100432395935, 595.5780770778656, 592.2488238811493, 588.9199750423431, 585.592746257782, 582.2653639316559, 578.936427116394, 575.6009509563446, 572.2620227336884, 568.9157371520996, 565.5587601661682, 562.1914875507355, 558.8137016296387, 555.4224886894226, 552.0177237987518, 548.6014022827148, 545.1694092750549, 541.7221827507019, 538.260246515274, 534.7838125228882, 531.2932081222534, 527.7887389659882, 524.2669126987457, 520.7295987606049, 517.172247171402, 513.5990762710571, 510.00374937057495, 506.3846290111542, 502.74694442749023, 499.08873414993286, 495.40964698791504, 491.7113707065582, 487.9912827014923, 484.24533438682556, 480.47643995285034, 476.68663334846497, 472.8727788925171, 469.03556966781616, 465.17028641700745, 461.28220224380493, 457.3640
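
Each entry in `losses` is a sum over all training examples, so its scale depends on the size of `data`. Dividing by the number of examples and exponentiating gives a per-word perplexity, which is easier to compare across runs. A sketch (`perplexity` is a hypothetical helper, and the numbers are synthetic, not taken from the run above):

```python
import math


def perplexity(total_loss, n_examples):
    """Perplexity from a summed negative log-likelihood (natural log)."""
    return math.exp(total_loss / n_examples)


# Synthetic check: a uniform guess over 100 words has perplexity 100.
n_examples = 50
total = n_examples * math.log(100)
print(round(perplexity(total, n_examples), 6))  # 100.0
```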

In [57]:
context = ['deep','learning','machine', 'learning']
context_vector = make_context_vector(context, word2id)
a = model(context_vector).detach().numpy()
print('Raw text: {}\n'.format(' '.join(text)))
print('Context: {}\n'.format(context))
print('Prediction: {}'.format(get_max_prob_result(a[0], id2word)))

Raw text: Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised. Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts. Deep learning models are vaguely inspired by information processing and communication patterns in biological nervous systems yet have various differences from the structural and functional properties of biological brains (especially human brain
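
Looking only at the single most likely word hides how confident the model is. A sketch of listing the top few candidates from a vector of log-probabilities (`top_k` is a hypothetical helper; the scores and the three-word `toy_id2word` are invented):

```python
import heapq
import math


def top_k(log_probs, ix_to_word, k=3):
    """Return the k most likely words with their probabilities, best first."""
    best = heapq.nlargest(k, range(len(log_probs)), key=lambda i: log_probs[i])
    return [(ix_to_word[i], math.exp(log_probs[i])) for i in best]


toy_id2word = {0: "deep", 1: "learning", 2: "networks"}
toy_log_probs = [math.log(0.2), math.log(0.5), math.log(0.3)]

for word, prob in top_k(toy_log_probs, toy_id2word, k=2):
    print(word, round(prob, 2))
# learning 0.5
# networks 0.3
```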