In [1]:
# Data processing
import pandas as pd
import numpy as np

# Python
import glob
import os
import time

# Gensim
import gensim
from gensim.test.utils import datapath
from gensim.models import Word2Vec
from gensim import utils

# Global variables
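# Repository root: everything before the "Devoir 3" folder in the current working directory
ROOT_PATH = os.getcwd().split("Devoir 3")[0]
# os.path.join(..., "") keeps the trailing separator, so later string concatenation still works
DATA_PATH = os.path.join(ROOT_PATH, "data", "")
MODEL_PATH = os.path.join(ROOT_PATH, "models", "")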

# 0. Get the data
We test with 1BWC short first. Before going further, we should read the core tutorials on [Gensim's documentation website](https://radimrehurek.com/gensim/auto_examples/index.html#documentation). 

Here are three ways I think we could split the data into a corpus and documents: 
1. The 10 different slices each represent a document, the corpus being the 10 slices. 
2. Concatenate all 10 slices, make that our corpus. Each sentence is a document. 
3. Same as 2., but we make sentence slices (10 sentences, 20 sentences, 100 sentences, etc.) as the documents. 
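
Option 3 can be sketched as follows. This is only a minimal illustration, assuming the sentences are already tokenized (lists of str); the helper name `group_sentences` and the toy sentences are hypothetical.

```python
from itertools import islice


def group_sentences(sentences, group_size):
    """Yield documents made of `group_size` consecutive tokenized sentences."""
    iterator = iter(sentences)
    while True:
        group = list(islice(iterator, group_size))
        if not group:
            return
        # Flatten the group of sentences into a single token list (one document).
        yield [token for sentence in group for token in sentence]


# Toy data standing in for tokenized sentences read from the slices
sentences = [["a", "b"], ["c"], ["d", "e"], ["f"], ["g"]]
documents = list(group_sentences(sentences, group_size=2))
print(documents)
```

The last document may be shorter than `group_size`. Since a generator can only be consumed once, it would probably need to be wrapped in a class with an `__iter__` method before being handed to a model that makes several passes over the data.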

Gensim can load a corpus into memory as a list, but it can also [stream](https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html#corpus-streaming-tutorial) it from disk one document at a time. We'll create our own streaming classes, as described in the tutorials. 

In [84]:
class SingularCorpus:
    """Test class for a singular corpus, i.e. a corpus that fits within a single file.

    Parameters
        data_path: The path to the corpus file.

    Each call to iter() returns a generator that goes over all the documents
    in the corpus once. A document is a list of str tokens.
    """

    def __init__(self, data_path):
        self.data_path = data_path

    def __iter__(self):
        # Open the path directly: gensim's datapath() is a helper for gensim's
        # bundled test data and is not needed for our own absolute paths.
        with open(self.data_path, encoding="utf8") as corpus_file:
            for line in corpus_file:
                # one document per line
                yield utils.simple_preprocess(line)

class SlicedCorpus:
    """Test class for a sliced corpus, i.e. a corpus that is made up of multiple files.

    Parameters
        folder_path: The path to the folder containing the slices. The folder is
                     assumed to contain only the slices of this specific corpus.

    Each call to iter() returns a generator that goes over all the documents
    in the corpus once. A document is a list of str tokens.
    """

    def __init__(self, folder_path):
        self.folder_path = folder_path

    def __iter__(self):
        # sorted() makes the slice order deterministic; os.listdir() gives no ordering guarantee
        for file_name in sorted(os.listdir(self.folder_path)):
            corpus_path = os.path.join(self.folder_path, file_name)
            print(corpus_path)
            with open(corpus_path, encoding="utf8") as corpus_file:
                for line in corpus_file:
                    # one document per line
                    yield utils.simple_preprocess(line)

# 1. Framework architecture
In this section I'll work on developing a structure that makes it easy for us to test, train, save and evaluate models.

### Test with Slice 0 of 1BWC Short

**Training & Saving**

In [61]:
slice0 = SingularCorpus(DATA_PATH + "1bshort\\news.en-00000-of-00100.txt")

tic = time.perf_counter()
model = gensim.models.Word2Vec(sentences=slice0)
print(f"Training took {round(time.perf_counter() - tic, 2)}s.")
model.save(MODEL_PATH + "1bs-0-w2v.model")

Training took 95.27s.


**Loading, Word Vectors and Vocabulary**

In [90]:
model = Word2Vec.load(MODEL_PATH + "1bs-0-w2v.model")
wv = model.wv
vocab = wv.index_to_key
len(vocab)

37116

### Test with all slices from 1BWC short
**Training & Saving**

Since training on a single slice took about 95 s, training on all 10 slices should take about 10 × 95 s = 950 s, i.e. roughly 16 minutes.

In [87]:
onebwcshort = SlicedCorpus(DATA_PATH + "1bshort\\")

tic = time.perf_counter()
model = gensim.models.Word2Vec(sentences=onebwcshort)
print(f"Training took {round(time.perf_counter() - tic, 2)}s.")
model.save(MODEL_PATH + "1bs-all-w2v.model")

c:\Users\Louis\Documents\University\Masters\A23\NLP\Devoirs\ift6285-devoirs\Data\1bshort\news.en-00000-of-00100.txt
c:\Users\Louis\Documents\University\Masters\A23\NLP\Devoirs\ift6285-devoirs\Data\1bshort\news.en-00001-of-00100.txt
c:\Users\Louis\Documents\University\Masters\A23\NLP\Devoirs\ift6285-devoirs\Data\1bshort\news.en-00002-of-00100.txt
c:\Users\Louis\Documents\University\Masters\A23\NLP\Devoirs\ift6285-devoirs\Data\1bshort\news.en-00003-of-00100.txt
c:\Users\Louis\Documents\University\Masters\A23\NLP\Devoirs\ift6285-devoirs\Data\1bshort\news.en-00004-of-00100.txt
c:\Users\Louis\Documents\University\Masters\A23\NLP\Devoirs\ift6285-devoirs\Data\1bshort\news.en-00005-of-00100.txt
c:\Users\Louis\Documents\University\Masters\A23\NLP\Devoirs\ift6285-devoirs\Data\1bshort\news.en-00006-of-00100.txt
c:\Users\Louis\Documents\University\Masters\A23\NLP\Devoirs\ift6285-devoirs\Data\1bshort\news.en-00007-of-00100.txt
c:\Users\Louis\Documents\University\Masters\A23\NLP\Devoirs\ift6285-devo

In [89]:
model = Word2Vec.load(MODEL_PATH + "1bs-all-w2v.model")
wv = model.wv
vocab = wv.index_to_key
len(vocab)

107374

# 2. Answering the questions
Time to answer the questions!

## 2.1 Training several models
- Try several embedding families?
- Try several corpus sizes
- Try several document sizes
- Try several vector sizes

All of this with the goal of optimizing performance on the benchmark from section 2.2.
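
One way to organize these experiments is to enumerate every combination up front. The sketch below is only an illustration: the values and the naming scheme in `model_name` are hypothetical, and only two of the dimensions listed above are covered. `vector_size` is the name Gensim 4 uses for the embedding dimension in `Word2Vec`.

```python
from itertools import product

# Hypothetical values, chosen only for illustration
corpus_sizes = [1, 5, 10]      # number of slices used for training
vector_sizes = [50, 100, 300]  # embedding dimension


def model_name(n_slices, vector_size):
    """Build a file name that encodes the configuration (hypothetical naming scheme)."""
    return f"1bs-{n_slices}slices-w2v-{vector_size}d.model"


# One configuration per (corpus size, vector size) pair
configs = [
    {"n_slices": n, "vector_size": d, "file_name": model_name(n, d)}
    for n, d in product(corpus_sizes, vector_sizes)
]

for config in configs:
    print(config)
```

Encoding the configuration in the file name means a saved model can later be matched to its settings without a separate log.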

***Best practice***: I created a folder to keep our models, `MODEL_PATH`. We should delete the models that are not final. This will be handy when we want to evaluate the models in bulk, because we can simply list the models contained in that folder. 
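
Listing the saved models could look like the sketch below. It assumes every model is saved with a `.model` extension, as in the cells above; the helper name `list_models` is hypothetical, and a temporary folder stands in for `MODEL_PATH` so the example runs on its own.

```python
import glob
import os
import tempfile


def list_models(model_dir):
    """Return the sorted paths of all files ending in .model inside model_dir."""
    return sorted(glob.glob(os.path.join(model_dir, "*.model")))


# Demonstration in a temporary folder standing in for MODEL_PATH
with tempfile.TemporaryDirectory() as model_dir:
    for name in ("b.model", "a.model", "notes.txt"):
        open(os.path.join(model_dir, name), "w").close()
    found = [os.path.basename(path) for path in list_models(model_dir)]

print(found)
```

Filtering on the extension matters because Gensim may write extra array files next to a large model, and those should not be loaded as models themselves.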

In [3]:
from devoir3 import train_w2v_model

## 2.2 Benchmark on TOEFL
I put the contents of the tar file he sent us in the `data` folder, in the root folder. 
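
A possible scoring loop is sketched below. It assumes each question consists of a target word, a list of candidate words and the expected synonym; the actual file format has not been checked, so the question tuples, the function names and the toy similarity table are all hypothetical.

```python
def answer_question(similarity, target, candidates):
    """Return the candidate with the highest similarity to the target word."""
    return max(candidates, key=lambda candidate: similarity(target, candidate))


def accuracy(similarity, questions):
    """Fraction of questions for which the most similar candidate is the expected one."""
    correct = sum(
        answer_question(similarity, target, candidates) == expected
        for target, candidates, expected in questions
    )
    return correct / len(questions)


# Toy similarity table standing in for a trained model (made-up scores)
toy_scores = {
    ("big", "large"): 0.9, ("big", "small"): 0.4, ("big", "red"): 0.1,
    ("happy", "sad"): 0.7, ("happy", "glad"): 0.6,
}


def toy_similarity(word_a, word_b):
    return toy_scores.get((word_a, word_b), 0.0)


questions = [
    ("big", ["small", "large", "red"], "large"),  # answered correctly
    ("happy", ["sad", "glad"], "glad"),           # answered incorrectly
]
score = accuracy(toy_similarity, questions)
print(score)
```

With a trained model, `wv.similarity` could be passed as the `similarity` argument. Words missing from the vocabulary would then need separate handling, since Gensim raises a `KeyError` for them.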

## 2.3 Neighbors
Applied to a single model only. I imagine we would be better off using our best model. 
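
To make explicit what "neighbors" means here, the sketch below ranks words by cosine similarity on toy 2-dimensional vectors. The vectors and the helper names are hypothetical; on a trained model, Gensim's `wv.most_similar(word, topn=...)` returns the same kind of ranked `(word, score)` list.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(vectors, word, topn=2):
    """Return the topn (word, similarity) pairs closest to `word`, best first."""
    scores = [
        (other, cosine(vectors[word], vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors standing in for trained embeddings (made-up values)
toy_vectors = {
    "king": [1.0, 0.9],
    "queen": [0.9, 1.0],
    "apple": [-1.0, 0.2],
    "pear": [-0.9, 0.3],
}
neighbors = nearest_neighbors(toy_vectors, "king", topn=2)
print(neighbors)
```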

In [2]:
best_model_path = MODEL_PATH + "our_model.model"