## École Polytechnique de Montréal
## Département Génie Informatique et Génie Logiciel

## INF8460 – Traitement automatique de la langue naturelle - TP2

## Objectifs d'apprentissage: 

•	Explorer les modèles d’espace vectoriel (vector space models) comme représentations distribuées de la sémantique des mots 
•	Implémenter la fréquence de co-occurrence et la PPMI
•	Comprendre différentes mesures de distance entre vecteurs de mots 
•	Explorer l’intérêt de la réduction de dimensionnalité 



## Équipe et contributions 
Veuillez indiquer la contribution effective de chaque membre de l'équipe en pourcentage et en indiquant les modules ou questions sur lesquelles chaque membre a travaillé

Luu Thien-Kim: x% (détail)

Nom Étudiant 2: x% (détail)

Nom Étudiant 3: x% (détail)



## Support de google Colab



In [169]:
# !wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# !wget https://staff.fnwi.uva.nl/e.bruni/resources/MEN.tar.gz

In [170]:
# ! tar -xzf aclImdb_v1.tar.gz
# ! tar -xzf MEN.tar.gz
# ! mkdir -p vsm

In [171]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mouradyounes/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mouradyounes/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Librairies externes

In [172]:
from collections import Counter, defaultdict
from itertools import chain
import csv
import itertools
import numpy as np
import os
import pandas as pd

from nltk.corpus import stopwords as all_stopwords
from nltk import word_tokenize

from scipy.stats import spearmanr
from scipy.spatial.distance import euclidean, cosine
from IPython.display import display
from sklearn.decomposition import TruncatedSVD

## Valeurs globales

In [173]:
DIRNAME_ACL =  os.path.join(os.getcwd(), "aclImdb")
DIRNAME_MEN =  os.path.join(os.getcwd(), "MEN")
DIRNAME_VSM =  os.path.join(os.getcwd(), "vsm")

## 1. Prétraitement (20 points)

**a)**	Le jeu de données est séparé en deux répertoires `train/`et `test`, chacun contenant eux-mêmes deux sous-répertoires `pos/` et `neg/` pour les revues positives et négatives. Un fichier `readme` décrit plus précisément les données. Commencez par lire ces données, en gardant séparées les données d'entraînement et de test. La fonction doit mettre les mots en minuscules,  supprimer les stopwords (vous devez utiliser ceux de NLTK) et afficher le nombre total de phrases d’entrainement,  le nombre total de phrases d’entrainement positives et négatives et le nombre total de phrases de test avec le nombre total de phrases de test positives et négatives ;

In [86]:
import os
all_stopwords.words("english")
stopwords_english = set(all_stopwords.words("english"))

TRAIN_POS_DIRECTORY = DIRNAME_ACL + '/train/pos'
TRAIN_NEG_DIRECTORY = DIRNAME_ACL + '/train/neg'
TEST_POS_DIRECTORY = DIRNAME_ACL + '/test/pos'
TEST_NEG_DIRECTORY = DIRNAME_ACL + '/test/neg'

tokenizer = nltk.RegexpTokenizer(r"\w+") #enlève toutes les ponctuations des phrases

#List[List[]] telle que [[tokens du texte 1], [tokens du texte 2], ...]
trainingSet = [] 
testingSet = []

In [87]:
def getInfo() :
    train_pos_files = os.listdir(TRAIN_POS_DIRECTORY)
    train_neg_files = os.listdir(TRAIN_NEG_DIRECTORY)
    test_pos_files = os.listdir(TEST_POS_DIRECTORY)
    test_neg_files = os.listdir(TEST_NEG_DIRECTORY)
    
    files = [train_pos_files, train_neg_files, test_pos_files, test_neg_files]
    
    training_sentences_nb = 0
    training_sentences_pos_nb = 0
    training_sentences_neg_nb = 0

    testing_sentences_nb = 0
    testing_sentences_pos_nb = 0
    testing_sentences_neg_nb = 0
    
    for listOfFiles in files :
        directory = ""
        if listOfFiles == train_pos_files :
            directory = TRAIN_POS_DIRECTORY
        elif listOfFiles == train_neg_files :
            directory = TRAIN_NEG_DIRECTORY
        elif listOfFiles == test_pos_files :
            directory = TEST_POS_DIRECTORY
        elif listOfFiles == test_neg_files :
            directory = TEST_NEG_DIRECTORY
            
        for file in listOfFiles:
            with open(directory + '/' + file, "r") as f:
                data = list(f)[0].lower()       
                tokens = [token for token in tokenizer.tokenize(data) if token not in stopwords_english] #enlève toutes les ponctuations des phrases
                data = [token for token in nltk.word_tokenize(data) if token not in stopwords_english]
                data = ' '.join(data)
                
                if "train" in directory :
                    trainingSet.append(tokens)
                    if "pos" in directory : 
                        training_sentences_pos_nb += len(nltk.sent_tokenize(data))
                    elif "neg" in directory :
                        training_sentences_neg_nb += len(nltk.sent_tokenize(data))
                    
                elif "test" in directory :
                    testingSet.append(tokens)
                    if "pos" in directory : 
                        testing_sentences_pos_nb += len(nltk.sent_tokenize(data))
                    elif "neg" in directory :
                        testing_sentences_neg_nb += len(nltk.sent_tokenize(data))
            
    training_sentences_nb = training_sentences_pos_nb + training_sentences_neg_nb
    testing_sentences_nb = testing_sentences_pos_nb + testing_sentences_neg_nb
    
    print("nombre total de phrases d’entrainement : ", training_sentences_nb)
    print("nombre total de phrases d’entrainement positives : ", training_sentences_pos_nb)
    print("nombre total de phrases d’entrainement négatives : ", training_sentences_neg_nb)        
    
    print("nombre total de phrases de test : ", testing_sentences_nb)
    print("nombre total de phrases de test positives : ", testing_sentences_pos_nb)
    print("nombre total de phrases de test négatives : ", testing_sentences_neg_nb)     
    

In [88]:
getInfo()

nombre total de phrases d’entrainement :  316801
nombre total de phrases d’entrainement positives :  153934
nombre total de phrases d’entrainement négatives :  162867
nombre total de phrases de test :  310594
nombre total de phrases de test positives :  150268
nombre total de phrases de test négatives :  160326


**a)**	Créez la fonction `build_voc()` qui extrait les unigrammes de l’ensemble d’entraînement et conserve ceux qui ont une fréquence d’occurrence d'au moins 5 et imprime le nombre de mots dans le vocabulaire. Sauvegardez-le dans un fichier `vocab.txt` (un mot par ligne) dans le répertoire aclImdb.

In [89]:
def build_voc(corpus, unk_cutoff=5):
    count = 0
    dict_ = {}
    
    for tokens in trainingSet :
        for token in tokens :
            if token not in dict_ :
                dict_[token] = 0
            dict_[token] += 1
            
    newFilePath = DIRNAME_ACL + '/vocab.txt'
    with open(newFilePath, "w") as f: 
        for key in dict_ :
            if dict_[key] >= unk_cutoff :
                count += 1
                f.write(key + "\n")
                
    print(count)

In [90]:
build_voc(trainingSet)

28962


## 2. Matrices de co-occurence (30 points)

Pour les matrices de cette section, vous pourrez utiliser [des array `numpy`](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html) ou des DataFrame [`pandas`](https://pandas.pydata.org/pandas-docs/stable/). 

Ressources utiles :  le [*quickstart tutorial*](https://numpy.org/devdocs/user/quickstart.html) de numpy et le guide [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html).

**a)** A partir des textes du corpus d’entrainement (neg/pos), vous devez construire une matrice de co-occurrence mot × mot M(w,w) qui contient les 5000 unigrammes les plus fréquents sous forme de **cadre panda**. Le contexte de co-occurrence est une fenêtre de +/-5 mots autour du mot cible. Le poids est la fréquence de co-occurrence simple. Sauvegardez votre matrice dans un fichier tp2_mat5.csv dans le répertoire vsm.

Attention, le mot lui même de doit pas être compté dans sa co-occurence. Exemple : 
Corpus: [ "I go to school every day by bus", "i go to theatre every night by bus"]

Co-occurence("every", fenetre=2) = [ (to, 2), (by, 2), (school, 1), (day, 1), (theatre, 1), (night, 1), (bus, 0), (every, 0), (go, 0). (i,0) ]

In [91]:
import re
output_path = "vsm"

In [51]:
def build_matrix(trainingSet, path, scaled=True) :
    dict_ = {}
    words = []

    for tokens in trainingSet :
        for token in tokens :
            if token not in dict_ :
                dict_[token] = {}
            words.append(token)

        for i in range(len(tokens)) :
            frame = [i-5, i-4, i-3, i-2, i-1, i+1, i+2, i+3, i+4, i+5]
            temp = frame.copy()
            for j in range(len(frame)) :
                if frame[j] < 0 or frame[j] > len(tokens)-1 :
                    temp.remove(frame[j])
            frame = temp

            neighbouring_tokens_dict = dict_[tokens[i]]

            for index in frame :
                if tokens[i] == tokens[index]:
                    continue
                if tokens[index] not in neighbouring_tokens_dict :
                    neighbouring_tokens_dict[tokens[index]] = 0
                if scaled :
                    neighbouring_tokens_dict[tokens[index]] += 1 + 1/abs(tokens.index(tokens[i]) - tokens.index(tokens[index]))
                else :
                    neighbouring_tokens_dict[tokens[index]] += 1

    c = Counter(words)
    words_to_display = c.most_common(5000)
    
    w = []
    for word in words_to_display :
        w.append(word[0])

    d = dict_.copy()

    for key in dict_ :
        if key not in w :
            d.pop(key)

    vector = [(k, v) for k, v in d.items()]

    df1 = pd.DataFrame(vector)
    df2 = pd.json_normalize(df1[1])

    df2 = df2[df1[0]].fillna(0).set_index(df1[0])
    
    if not os.path.isdir(output_path) : #create "output" directory if it does not exist
        try:
            os.mkdir(output_path) 
        except OSError:
            print ("Creation of the directory %s failed" % path)
        else:
            print ("Successfully created the directory %s " % path)
            
    df2.to_csv(output_path + "/" + path)    
    return df2

In [None]:
Mww = build_matrix(trainingSet, "tp2_mat5.csv", False)
Mww

In [93]:
def getVoc():
    voc = []
    voc_path = DIRNAME_ACL + '/vocab.txt'
    with open(voc_path, "r") as f: 
        data = list(f)
        for word in data:
            voc.append(word.strip('\n'))
        
    return voc
    

In [94]:
vocabulary = getVoc()

**b)** Calculez maintenant une matrice de cooccurrence mais en ajustant les fréquences basées sur la proximité du mot cible par exemple en les multipliant par 1/𝑑 où d est la distance en jetons (mots) de la cible. Sauvegardez votre matrice (toujours sous forme de cadre panda) dans un fichier tp2_mat5_scaled.csv dans le répertoire vsm.

In [None]:
Mww_scaled = build_matrix(trainingSet, "tp2_mat5_scaled.csv")
Mww_scaled

**c)**	Vous devez créer une fonction `pmi` qui prend le cadre panda de la matrice $M(w,w)$ et un paramètre boolean flag qui est à True lorsque l'on désire calculer PPMI et à False quand on veut calculer PMI. La fonction transforme la matrice en entrée en une matrice $M’(w,w)$ avec les valeurs PMI ou PPMI selon la valeur du paramètre booléen. La fonction retourne le nouveau cadre panda correspondant. 

Pour une matrice  $X_{m \times n}$:


$$\textbf{colsum}(X, j) = \sum_{i=1}^{m}X_{ij}$$

$$\textbf{sum}(X) = \sum_{i=1}^{m}\sum_{j=1}^{n} X_{ij}$$

$$\textbf{expected}(X, i, j) = 
\frac{
  \textbf{rowsum}(X, i) \cdot \textbf{colsum}(X, j)
}{
  \textbf{sum}(X)
}$$


$$\textbf{pmi}(X, i, j) = \log\left(\frac{X_{ij}}{\textbf{expected}(X, i, j)}\right)$$

$$\textbf{ppmi}(X, i, j) = 
\begin{cases}
\textbf{pmi}(X, i, j) & \textrm{if } \textbf{pmi}(X, i, j) > 0 \\
0 & \textrm{otherwise}
\end{cases}$$


In [96]:
import math

def pmi(df, flag=True):
    totalSum = 0
    colsum = []
    for column in df:
        columnSum = df[column].sum()
        totalSum += columnSum
        colsum.append (columnSum)

    rowsum = []
    for row in df.sum(axis=1):
        rowsum.append(row)
    
    expected = []
    for row in rowsum:
        eachRow = []
        for col in colsum:
            eachRow.append((row*col)/totalSum)
        expected.append(eachRow)
    
    expected = np.array(expected)
    dfArray = np.array(df.to_numpy())
    dfArray = np.array(dfArray ,dtype = float)
    valueInLog = np.divide(dfArray, expected, out=np.zeros_like(expected), where=dfArray!=0)
    pmi = np.log2(valueInLog, out=np.zeros_like(valueInLog), where=valueInLog!=0)
    
    if flag:
        pmi[pmi < 0] = 0
    
    return pd.DataFrame(pmi, index = df.index, columns = df.columns)

23.354560136795044


Unnamed: 0_level_0,movie,gets,respect,sure,lot,memorable,quotes,listed,gem,imagine,...,macarthur,uwe,boll,seagal,porno,zizek,rambo,damme,prom,drivel
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
movie,0.000000,0.000000,0.051947,0.349341,0.402087,0.000000,0.000000,0.448420,0.393894,0.170255,...,0.000000,0.322964,0.069087,0.176375,0.657150,0.000000,0.000000,0.000000,0.000000,0.000000
gets,0.000000,0.000000,0.509233,0.000000,0.149789,0.046632,0.000000,0.000000,0.000000,0.000000,...,0.188469,0.000000,0.000000,0.000000,0.933267,0.338662,1.443376,1.800168,0.758229,0.000000
respect,0.051947,0.509233,0.000000,0.000000,1.225142,0.931302,2.651132,2.355161,0.000000,0.000000,...,0.000000,0.000000,0.000000,1.967638,0.000000,0.000000,0.000000,0.000000,0.000000,2.401193
sure,0.349341,0.000000,0.000000,0.000000,0.842317,0.240355,1.737792,1.856859,0.709223,0.463216,...,0.000000,0.000000,0.000000,0.469336,0.126990,0.532385,1.052136,0.000000,0.000000,0.000000
lot,0.402087,0.149789,1.225142,0.842317,0.000000,0.000000,1.160443,0.279510,0.000000,0.523297,...,0.000000,0.491088,0.000000,0.476949,0.000000,0.955035,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zizek,0.000000,0.338662,0.000000,0.532385,0.955035,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
rambo,0.000000,1.443376,0.000000,1.052136,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
damme,0.000000,1.800168,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,6.065928,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
prom,0.000000,0.758229,0.000000,0.000000,0.000000,1.987653,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


**d)** Créer les matrice PMIs et PPMIs en vous basant sur les deux matrices que vous avez déjà créée	Sauvegardez vos matrices dans un fichier tp2_mat5<\_scaled>_{pmi|ppmi}.csv toujours dans le répertoire vsm. 

(votre nom de fichier doit contenir "_scaled" s'il est formé à partir Mww_scaled et "pmi" si le flag est false "ppmi" sinon) 

In [190]:
ppmi_matrix = pmi(Mww)
ppmi_matrix_scaled = pmi(Mww_scaled)
pmi_matrix = pmi(Mww, False)
pmi_matrix_scaled = pmi(Mww_scaled, False)

pathPMI = "tp2_mat5_pmi.csv"
pathPPMI = "tp2_mat5_ppmi.csv"
pathPMI_scaled = "tp2_mat5_scaled_pmi.csv"
pathPPMI_scaled = "tp2_mat5_scaled_ppmi.csv"


if not os.path.isdir(output_path) : #create "output" directory if it does not exist
    try:
        os.mkdir(output_path) 
    except OSError:
        print ("Creation of the directory %s failed" % path)
    else:
        print ("Successfully created the directory %s " % path)

ppmi_matrix.to_csv(output_path + "/" + pathPPMI)
ppmi_matrix_scaled.to_csv(output_path + "/" + pathPPMI_scaled) 
pmi_matrix.to_csv(output_path + "/" + pathPMI)  
pmi_matrix_scaled.to_csv(output_path + "/" + pathPMI_scaled)  

## 3. Test de PPMI (20 points)

Pour le test des matrices de cooccurrences, nous allons comparer deux mesures de distance entre deux vecteurs, la distance euclidienne et la distance cosinus provenant du module [scipy.spatial.distance](https://docs.scipy.org/doc/scipy/reference/spatial.distance.html)

**Distance Euclidienne**

La distance euclidienne entre deux vecteurs $u$ et $v$ de dimension $n$ est

$$\textbf{euclidean}(u, v) = 
\sqrt{\sum_{i=1}^{n}|u_{i} - v_{i}|^{2}}$$

En deux dimensions, cela correspond à la longueur de la ligne droite entre deux points.

**Distance Cosinus**


La distance cosinus entre deux vecteurs $u$ et $v$ de dimension $n$ s'écrit :

$$\textbf{cosine}(u, v) = 
1 - \frac{\sum_{i=1}^{n} u_{i} \cdot v_{i}}{\|u\|_{2} \cdot \|v\|_{2}}$$

Le terme de droite dans la soustraction mesure l'angle entre $u$ et $v$; on l'appelle la *similarité cosinus* entre $u$ et $v$.

\\

**a)**	Implémentez la fonction voisins(mot, pd, distance) qui prend un mot en entrée et une métrique de distance et qui retourne les n mots les plus similaires selon la mesure. Pour un mot w, elle ordonne tous les mots du vocabulaire en fonction de leur distance de w en utilisant la métrique de distance distance (par défaut: cosine)sur le vsm pd. Les mesures de distance à tester sont : la distance Euclidienne et la distance cosinus implantées ci-dessus.

In [98]:
def voisins(word, df, distfunc, n):
    distances = words = {}
    similary_words = []
    if(word not in vocabulary): 
        return "The word is not in list"
    else:
        for i,w in enumerate(df.values):
            if distfunc == 'euclidean':
                distances[vocabulary[i]] = euclidean(df.values[vocabulary.index(word)],w)
            elif distfunc == 'cosine':
                distances[vocabulary[i]] = cosine(df.values[vocabulary.index(word)],w)

        common_words = Counter(words).most_common(n) # Si c des tuples recup que le mot 
        for w in common_words:
            similary_words.append(w[0])
        
    return similary_words 

**b)** En utilisant le cadre panda associé aux matrices Mww et Mww scaled, trouvez les 5 mots les plus similaires au mot « beautiful » et affichez-les, pour chacune des deux distances.

In [99]:
print("EN UTILISANT LA MATRICE MOT PAR MOT : \n ")
print("Les 5 mots les plus similaires à beautiful en utilisant la distance euclidienne sont :") 
print(voisins('beautiful', Mww , 'euclidean', 5))
print("\n")
print("Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : ")
print(voisins('beautiful', Mww , 'cosine', 5))
print("\n \n")

print("EN UTILISANT LA MATRICE MOT PAR MOT NORMALISÉE : \n ")
print("Les 5 mots les plus similaires à beautiful en utilisant la distance euclidienne sont : ") 
print(voisins('beautiful', Mww_scaled, 'euclidean', 5))
print("\n")
print("Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : ")
print(voisins('beautiful', Mww_scaled, 'cosine', 5))

EN UTILISANT LA MATRICE MOT PAR MOT : 
 
Les 5 mots les plus similaires à beautiful en utilisant la distance euclidienne sont :
['shocks', 'movie', 'kinky', 'secret', 'sentinel']


Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : 
['styne', 'clues', 'womanizer', 'creator', 'home']

 

EN UTILISANT LA MATRICE MOT PAR MOT NORMALISÉE : 
 
Les 5 mots les plus similaires à beautiful en utilisant la distance euclidienne sont : 
['shocks', 'movie', 'kinky', 'secret', 'sentinel']


Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : 
['womanizer', 'styne', 'creator', 'home', 'pursuing']


**c)** En utilisant les cadres panda associés aux matrices PMIs, trouvez les 5 mots les plus similaires au mot « beautiful » et affichez-les, pour chacune des deux distances.

In [100]:
print("EN UTILISANT LA MATRICE PMI : \n")
print("Les 5 mots les plus similaires à beautiful en utilisant la distance euclidienne sont : ") 
print(voisins('beautiful', pmi_matrix , 'euclidean', 5))
print("\n")
print("Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : ")
print(voisins('beautiful', pmi_matrix , 'cosine', 5))

EN UTILISANT LA MATRICE PMI : 

Les 5 mots les plus similaires à beautiful en utilisant la distance euclidienne sont : 
['frightened', 'dolls', 'whining', 'stealer', 'budgets']


Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : 
['whining', 'movie', 'captured', 'frightened', 'kinky']


**d)** En utilisant les cadres panda associés aux matrices PPMIs, trouvez les 5 mots les plus similaires au mot
« beautiful » et affichez-les, pour chacune des deux distances. 

In [101]:
print("EN UTILISANT LA MATRICE PPMI : \n")
print("Les 5 mots les plus similaires à beautiful en utilisant la distance euclidienne sont : ") 
print(voisins('beautiful', ppmi_matrix , 'euclidean', 5))
print("\n")
print("Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : ")
print(voisins('beautiful', ppmi_matrix, 'cosine', 5))

EN UTILISANT LA MATRICE PPMI : 

Les 5 mots les plus similaires à beautiful en utilisant la distance euclidienne sont : 
['riveting', 'photography', 'pessimism', 'prime', 'suspension']


Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : 
['many', 'agree', 'france', 'discovered', 'gangsta']


**e)** Que constatez-vous entre la différence de performance de la distance euclidienne et la distance cosinus ? Que constatez-vous entre les différents types de matrices de cooccurrence ?

## 4.	Réduction de dimensionnalité (20 points)

**a)** Ecrivez une fonction lsa qui prend en entrée un cadre panda pd (qui contient votre matrice / vsm) et un paramètre K (qui indique le nombre de dimensions finales), et qui applique LSA avec ce paramètre k sur la matrice et retourne le vsm réduit sous forme de cadre panda.

In [157]:
def lsa(df, k=100):

    A = np.array(df)
    T, s, D = np.linalg.svd(A)
    S = np.diag(s)
    # select
    S = S[:, :k]
    D = D[:k, :]
    # reconstruct
    B = T.dot(S.dot(D))
    # transform
    M = T.dot(S)
    df1 = pd.DataFrame(M, index=df.index)
    return df1
    

**b)** Exécutez lsa sur les cadres panda associés à vos matrices Mww et Mww_scaled avec une dimension k=100

In [158]:
Mww_lsa = lsa(Mww) 
Mww_lsa

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
movie,-19268.374780,-9478.732091,4326.434221,213.202138,618.477584,300.344773,973.535906,609.575437,840.968615,43.107485,...,5.475187,-3.440218,-2.469744,1.753749,9.857726,3.006546,11.196252,4.252873,10.478957,2.363393
gets,-1217.683154,-251.450538,-68.190004,-74.262851,-114.593802,-132.149157,12.126589,-97.281908,15.024864,-17.531588,...,0.800643,29.934272,-7.159332,12.063815,-4.702059,18.315779,-9.895552,6.789270,36.665080,-10.769833
respect,-212.813609,-35.570833,-26.643770,-12.182516,-6.771923,-17.508957,7.416780,-11.342651,-0.009531,-11.313380,...,-3.155720,1.650239,4.072831,-5.035842,0.401222,3.568350,-3.384345,-3.107023,5.667469,3.459418
sure,-1336.080958,-168.074716,-123.786358,-9.865976,-44.103370,-30.327280,4.693410,99.218303,20.523359,-50.216870,...,2.071472,-2.668428,1.834745,-11.515989,-2.533149,1.202350,-0.102022,-2.531453,-3.045287,-7.256064
lot,-1957.532850,-194.495504,-216.403777,-14.530845,43.734054,39.491917,51.598138,90.117844,-45.615825,-108.028901,...,-2.298138,-30.313735,-5.582220,60.481252,27.592463,-29.361041,-4.134366,-5.426962,47.217833,-24.918376
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zizek,-48.003318,-34.238738,-13.265869,-0.524868,-0.074497,-3.998368,4.504452,-2.286121,1.170722,1.343273,...,1.270102,0.628027,2.093119,0.338197,0.261660,0.162232,0.957086,0.611508,-0.288108,1.635347
rambo,-44.521220,-15.608476,-0.697078,-2.089459,-1.244069,-6.202864,3.565369,-1.480029,-6.859710,-0.143617,...,0.315224,-1.187280,-2.219500,1.573114,-0.564459,-0.303110,0.113197,1.952536,2.565401,0.843736
damme,-32.372996,0.091998,0.047897,3.184883,-0.536727,-0.720268,3.123913,-1.849929,3.601155,-1.457114,...,0.636500,0.128837,-2.461386,-0.398542,1.967646,-0.952662,0.729039,0.943252,-0.762146,0.567121
prom,-46.944905,-18.690498,-2.635124,2.959140,4.633171,-6.945232,1.219208,2.371395,-0.837856,1.608213,...,3.516776,-3.821087,11.765085,1.310170,0.037165,-2.226857,-4.240795,4.760284,1.801007,-0.194164


In [159]:
Mww_scaled_lsa = lsa(Mww_scaled) 
Mww_scaled_lsa

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
movie,-22704.676459,-11457.593651,4587.148167,305.192997,876.074458,347.246464,-854.175634,-67.176289,-188.268596,-1309.099619,...,7.914777,8.432417,4.943855,19.066000,-9.961650,-9.722350,2.034622,-18.755081,-15.898561,12.658967
gets,-1460.405108,-228.763837,-39.376831,-112.812910,-135.605434,-146.221086,168.030912,-17.342369,32.763198,-34.173343,...,-3.259696,-30.457510,12.714402,-23.129702,-39.152776,-38.621923,6.767716,7.694383,21.531877,-11.419302
respect,-256.986432,-32.098247,-23.612660,-21.062027,-8.078283,-20.714176,18.484738,-8.661752,20.737997,-3.095148,...,1.083321,-10.804784,-4.568185,4.708111,4.568052,-3.278446,-12.173515,1.988956,2.840935,-3.867996
sure,-1657.357225,-162.639535,-95.910283,-36.526633,-66.172825,-53.140879,-141.776317,15.096623,96.867069,2.735484,...,9.383474,-28.114384,-8.301739,-29.820213,2.953467,31.925977,-24.282380,-4.554509,6.833550,-31.481129
lot,-2409.151975,-149.735859,-209.372516,-34.114216,65.119988,26.436198,-143.475631,-103.774862,157.774889,36.502115,...,48.936305,32.731808,15.324506,45.557465,58.517589,22.352881,82.104984,29.267118,-126.121100,33.511933
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zizek,-51.607833,-36.641318,-16.914849,-0.568784,-1.470210,-3.824889,2.642341,-4.800586,-1.732293,-4.872414,...,0.343917,-0.414948,-0.485590,0.968255,-1.306553,1.132539,-0.123662,-0.602842,1.523706,0.894898
rambo,-51.836120,-16.113921,1.107883,-3.068819,-0.349649,-8.190043,2.367032,-10.709258,0.452185,4.009787,...,2.993396,1.598746,-2.820177,-0.536621,1.035357,-1.724305,1.434117,-3.552256,-1.941002,-1.487608
damme,-38.176561,1.187134,-0.331725,4.864416,-1.054530,-1.035700,2.539860,0.109724,1.512809,-5.517987,...,-0.503251,5.117966,1.279905,-0.119926,0.526848,0.583544,-0.376726,-0.381499,2.570182,-1.246647
prom,-54.184064,-20.294744,-4.551871,4.219437,9.376290,-11.517600,-0.154161,-1.113021,-3.688566,-0.637219,...,16.498316,-22.169579,-5.183126,-0.956724,-2.615990,16.290160,8.526477,2.765084,-25.942144,24.090366


**c)** En utilisant les matrices de co-occurrence (de base et scalés) réduites avec LSA, trouvez les 5 mots les plus similaires au mot « beautiful » selon la distance cosinus et affichez-les.

In [160]:
print("Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : ")
print(voisins('beautiful', Mww_lsa , 'cosine', 5))
print("\n \n")
print("Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : ")
print(voisins('beautiful', Mww_scaled_lsa, 'cosine', 5))

Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : 
['clues', 'improbable', 'rampaging', 'plain', 'station']

 

Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : 
['improbable', 'clues', 'rampaging', 'plain', 'station']


d) En utilisant les matrices PMIs et PPMIs réduites avec lsa, trouvez les 5 mots les plus similaires au mot « beautiful » selon la distance cosinus et affichez-les

In [146]:
ppmi_lsa = lsa(ppmi_matrix)
ppmi_lsa

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
movie,-6.875900,5.150271,-1.214862,-0.130815,-5.851681,1.055850,-5.010512,0.111703,-0.201510,1.413170,...,-0.036341,0.088614,0.176471,-0.168234,0.290164,-0.316921,0.346046,0.176505,-0.189652,0.105851
gets,-25.201210,-14.204073,-2.594895,8.073710,-2.852890,-8.476387,-2.670378,-1.508189,0.426840,0.083500,...,-0.556849,0.118369,0.618988,1.582084,-1.157445,-2.691628,-1.158729,0.024359,-0.726787,1.871541
respect,-28.441859,2.493638,-0.763055,4.410994,6.125571,5.443395,-5.691332,-3.162444,-1.512839,-4.117370,...,-1.813650,-1.378488,-1.777560,1.609591,-3.122780,-0.593311,0.990268,2.204796,-0.939583,1.432025
sure,-15.736041,3.332108,-1.981405,5.319605,-5.087339,0.191615,-3.970200,0.166617,-0.157855,-0.198158,...,0.386810,0.903439,0.313809,-0.788221,0.611677,0.928861,-0.941145,-0.244259,-0.852949,0.079187
lot,-14.898282,5.716493,-1.071405,4.779303,-1.920993,-1.755101,-1.545066,0.060182,1.760420,-0.262640,...,0.517760,-0.049853,0.479247,-0.564966,0.050939,0.283174,-0.109175,0.418908,-0.041804,-0.703540
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zizek,-16.407772,5.634988,-3.441128,-5.015597,2.537134,0.223693,-2.541143,0.644422,-0.755131,-0.062456,...,-1.672327,2.430587,-0.240545,-2.131721,-0.454149,-1.030204,0.875332,1.011439,2.551678,4.822103
rambo,-19.358207,-3.096102,-6.544271,-5.277370,-0.254373,-0.220026,2.230718,-9.062561,-1.920210,0.280604,...,0.987001,-1.691537,3.446758,0.829936,-0.195757,0.257536,-0.817519,-3.727301,1.773107,0.617428
damme,-13.203125,-1.000865,-0.015058,-5.498788,-3.987283,0.255498,-0.989015,-7.188067,-0.911767,0.278866,...,-0.170500,-2.170045,-0.530054,0.145545,-0.337397,-0.760206,-0.394953,0.931298,1.833628,0.793630
prom,-15.974234,-2.509525,-1.992160,-4.603216,-4.138078,-2.522065,-2.937357,1.416898,1.373615,7.705631,...,-1.174925,-0.564942,-1.116560,2.018016,-0.174614,-1.132998,-1.246665,-0.906512,0.982439,-0.136276


In [147]:
pmi_lsa = lsa(pmi_matrix)
pmi_lsa

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
movie,-30.996966,12.281383,-20.130567,9.127270,-9.308476,2.194856,-9.504050,-1.711736,3.611705,-4.930767,...,1.212752,-0.555996,0.146086,1.671327,0.682648,0.493671,-1.215748,0.790011,-0.669538,-0.473205
gets,7.983496,-17.172496,23.490898,0.721508,4.224925,13.922343,-2.311017,1.888059,-0.440123,0.551346,...,4.154693,2.220242,0.011382,-3.372876,0.396312,-1.577621,3.241953,0.379332,-0.257516,-1.659548
respect,27.076154,-8.111740,-2.466724,0.671896,3.780135,-7.831400,-6.961508,4.068368,-0.939395,2.988663,...,1.886339,1.611740,-1.622932,0.719130,-0.946055,0.857376,0.129315,2.603815,-0.738143,-0.906196
sure,-1.953920,-12.451488,-3.698162,5.447015,-6.594891,2.352400,-6.624891,-0.522755,1.235167,3.153974,...,-1.329850,-1.443585,-0.793836,-3.728143,0.362028,0.333679,0.697623,1.062412,0.480549,-0.075656
lot,-9.861436,-9.421169,-8.578809,2.365387,0.681257,3.869993,-0.185431,-0.440353,4.409675,4.241523,...,-1.094110,-0.515087,-1.256555,-5.286726,1.295167,-0.316397,-0.297129,0.519631,1.195600,0.874088
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zizek,15.460434,6.514515,-5.251823,3.982650,2.759290,-1.523006,-2.982499,-1.014599,-0.650871,1.120804,...,2.640928,-2.160809,1.496573,1.162418,-1.759512,-1.390528,1.176112,3.591755,-1.667066,1.419810
rambo,18.601311,6.887626,2.980360,6.018136,0.349667,0.709237,2.211802,8.693958,-1.897954,-0.377545,...,-0.597318,0.679128,1.554910,-0.538404,-0.179576,2.205311,0.173923,-3.062339,0.286268,2.666035
damme,12.055300,7.068702,0.663566,0.387581,-3.963417,1.580911,-1.143454,6.704413,0.266789,-0.061852,...,2.110733,2.896832,-1.313193,1.632932,0.752444,-0.345452,0.088786,-0.130202,-0.816664,-0.342059
prom,15.173886,6.529561,2.615043,2.521147,-3.174748,4.153146,-2.175693,-1.260643,3.333151,-5.964785,...,2.144920,2.036799,0.885265,-0.436266,-1.083908,-2.263215,0.603289,-1.630091,-0.350099,0.129509


In [161]:
print("Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : ")
print(voisins('beautiful', pmi_lsa , 'cosine', 5))
print("\n \n")
print("Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : ")
print(voisins('beautiful', ppmi_lsa, 'cosine', 5))

Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : 
['whining', 'captured', 'grace', 'trio', 'movie']

 

Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : 
['dolls', 'gangsta', 'interpret', 'movie', 'tormented']


**e)** En utilisant sklearn.decomposition.TruncatedSVD, créez les matrices réduites à partir des mêmes matrices que celles de la question précédentes (la matrice pmi et la matrice pmi_scaled) Puis tester ces nouvelles matrices LSA pour trouver les 5 mots les plus similaires au mot « beautiful » 

Ici aussi, nous voulons aussi obtenir des matrices de dimension k=100

In [163]:
def truncatedSVD(df, k=100):
    svd = TruncatedSVD(n_components=k, n_iter=7, random_state=42)
    data = svd.fit_transform(df.to_numpy())
    df1 = pd.DataFrame(data, index=df.index)
    
    return df1



In [164]:
ppmi_truncatedSVD = truncatedSVD(ppmi_matrix)
ppmi_truncatedSVD

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
movie,6.875900,-5.150271,-1.214862,-0.130815,5.851682,1.055850,-5.010509,-0.111689,-0.201659,1.413010,...,-0.085448,-0.598126,-0.347820,-0.122421,-0.069646,0.116896,0.025680,0.005751,-0.288761,0.284774
gets,25.201210,14.204073,-2.594895,8.073709,2.852891,-8.476384,-2.670406,1.508135,0.426926,0.083879,...,-1.197507,-0.004118,0.681262,0.348760,0.743913,-0.796233,-1.183977,-1.117039,0.928062,0.749337
respect,28.441859,-2.493637,-0.763055,4.410993,-6.125576,5.443398,-5.691303,3.162186,-1.513541,-4.117735,...,-0.056738,0.242567,0.570080,0.697051,2.968114,1.875057,1.062940,-0.331199,-1.271844,1.924063
sure,15.736041,-3.332108,-1.981405,5.319605,5.087340,0.191612,-3.970219,-0.166766,-0.158055,-0.198132,...,-0.759841,0.195264,0.045048,-0.533728,-0.387958,0.563081,-0.254031,0.381477,-0.371315,-0.163184
lot,14.898282,-5.716493,-1.071405,4.779303,1.920994,-1.755102,-1.545071,-0.060183,1.760476,-0.262873,...,-0.591767,-0.049440,-0.665081,-0.090480,-0.207440,-0.004189,-0.083309,0.423088,1.217152,0.746143
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zizek,16.407772,-5.634989,-3.441129,-5.015596,-2.537135,0.223696,-2.541136,-0.644522,-0.755117,-0.062118,...,-3.922010,1.344138,-3.107003,-1.339247,-1.032348,0.119394,1.348325,1.554695,-0.148490,0.201560
rambo,19.358207,3.096102,-6.544271,-5.277371,0.254367,-0.220023,2.230673,9.062365,-1.919743,0.279840,...,-3.220361,-2.745002,-0.214927,-0.775224,-1.627244,-0.192408,3.239062,-2.189927,-2.022794,1.298868
damme,13.203125,1.000865,-0.015058,-5.498789,3.987281,0.255499,-0.988984,7.188223,-0.911774,0.278815,...,-0.089113,-1.522026,-0.542027,-0.540209,0.157325,-1.572020,-0.730864,0.735490,0.653816,1.791667
prom,15.974234,2.509525,-1.992160,-4.603215,4.138078,-2.522064,-2.937340,-1.416952,1.373459,7.705506,...,2.615759,-0.289743,1.236983,-1.527924,-1.265606,-2.104563,4.136971,0.122504,0.298495,-2.000945


In [165]:
pmi_truncatedSVD = truncatedSVD(pmi_matrix)
pmi_truncatedSVD

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
movie,-30.996966,-12.281383,-20.130567,-9.127270,9.308475,-2.194854,-9.504042,1.711707,3.611522,4.930784,...,0.107956,2.342688,0.872651,1.121279,-0.258049,1.187526,0.225434,-0.007995,0.398386,-0.484870
gets,7.983496,17.172496,23.490898,-0.721509,-4.224924,-13.922340,-2.311034,-1.888068,-0.439761,-0.551129,...,-0.297326,0.342124,-0.507548,-2.406663,1.017843,0.931135,0.269351,-1.668255,1.100778,2.784188
respect,27.076154,8.111740,-2.466724,-0.671896,-3.780136,7.831402,-6.961500,-4.068306,-0.939959,-2.988541,...,-2.470195,0.456735,-2.191153,1.341995,-1.054277,0.466899,-2.143032,-0.165540,0.610369,-2.683283
sure,-1.953920,12.451488,-3.698162,-5.447015,6.594891,-2.352401,-6.624913,0.522905,1.235213,-3.154181,...,-0.326661,-0.658902,-0.365359,-0.047087,0.802128,-1.171255,-0.012884,0.139722,-1.074119,0.069421
lot,-9.861436,9.421169,-8.578809,-2.365387,-0.681257,-3.869994,-0.185437,0.440351,4.409532,-4.241879,...,-2.442502,-2.261705,0.492558,-2.138429,1.759332,-0.245602,-2.580082,0.032375,1.998974,-0.419569
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zizek,15.460434,-6.514515,-5.251823,-3.982650,-2.759291,1.523008,-2.982509,1.014739,-0.650868,-1.120791,...,0.787094,-2.717026,3.076220,1.178693,4.179781,-0.207018,1.230511,1.772905,-2.647206,-0.560655
rambo,18.601311,-6.887626,2.980361,-6.018136,-0.349669,-0.709235,2.211785,-8.693877,-1.897704,0.377397,...,1.298375,-1.743558,2.053586,0.270945,-2.891282,2.120520,-1.250465,0.814452,-1.770054,1.138342
damme,12.055300,-7.068702,0.663566,-0.387581,3.963417,-1.580910,-1.143441,-6.704542,0.266858,0.061897,...,1.019845,0.032770,0.691187,1.009763,0.416116,0.461118,0.011316,0.156626,1.858001,0.773845
prom,15.173886,-6.529561,2.615043,-2.521147,3.174748,-4.153146,-2.175693,1.260700,3.333042,5.964867,...,-1.952098,0.944925,2.360968,-3.270663,-3.581027,-3.543832,-0.765983,0.702082,-1.659006,-0.969973


In [166]:
print("Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : ")
print(voisins('beautiful', pmi_truncatedSVD , 'cosine', 5))
print("\n \n")
print("Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : ")
print(voisins('beautiful', ppmi_truncatedSVD, 'cosine', 5))

Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : 
['whining', 'captured', 'grace', 'trio', 'movie']

 

Les 5 mots les plus similaires à beautiful en utilisant la distance cosinus sont : 
['dolls', 'gangsta', 'interpret', 'movie', 'tormented']


f) Commentez vos résultats

On peut remarqué que notre implémentation lsa donne à peu près la même matrice que la fonction TruncatedSVD de sklearn. De plus, on obtient la même réponse concernant les 5 mots les plus similaires de "beautiful" pour les matrices généré par notre implémentation et celle généré par sklearn.

## 5. Évaluation (10 points)

Il est temps d’évaluer l’intérêt de nos modèles de vecteurs. Nous allons pour cela utiliser un ensemble de données de similarité de mots (relatedness) The MEN Test Collection, qui se trouve dans le répertoire test. L’ensemble de données contient une paire de mots avec un score de similarité attribué par des humains. En d’autres termes, une ligne (un exemple) de l’ensemble de données est de la forme : \<mot_1> \<mot_2> \<score>.

Pour aligner les distances obtenues avec vos métriques, ce score est converti en nombre réel négatif par la fonction read_test_dataset que vous avez dans le squelette du TP.

La métrique d’évaluation est le coefficient de corrélation de Spearman 𝜌 entre les scores humains et vos distances (voir https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient). 

Nous allons maintenant évaluer les différents vsm obtenus sur l'ensemble de données: MEN_dataset.


#### Fonctions pour lire le jeu de données MEN

In [110]:
def read_test_dataset(
        src_filename,
        delimiter=','):
    with open(src_filename) as f:
        reader = csv.reader(f, delimiter=delimiter)
        for row in reader:
            w1 = row[0].strip().lower()
            w2 = row[1].strip().lower()
            score = row[2]
            score = -float(score)
            yield (w1, w2, score)

In [111]:
# Retourne un itérable sur le jeu de données MEN
def men_dataset():
    src_filename = os.path.join(
        DIRNAME_MEN, 'MEN_dataset_natural_form_full')
    return read_test_dataset(
        src_filename, delimiter=' ')

In [178]:
def evaluate(ds, df, distfunc=cosine):
    """
    ds : iterator
       retourne des tuples (word1, word2, score).

    df : pd.DataFrame
        le modèle vsm à évaluer

    distfunc : la mesure de distance entre vecteurs
  
    Retour: le coefficient de correlation de Spearman entre les scores de l'ensemble de données de test 
    et celui du modele vsm qui se présente sous la forme d'un cadre Panda pd avec les colonnes
    ['word1', 'word2', 'score', 'distance'].
    """
    data = []
    for w1, w2, score in ds:
        d = {'word1': w1, 'word2': w2,'score': score}
        if w1 not in df.index or w2 not in df.index:
            continue
        else:
            w1 = df.loc[w1]
            w2 = df.loc[w2] 
        d['distance'] = distfunc(w1, w2)
        data.append(d)

    data = pd.DataFrame(data)
    rho, pvalue = spearmanr(data['score'].values, b=data['distance'].values)
    return rho, data

**a)**	Testez chacun de vos modèles vsm (Matrice de base, matrice scalée, les PMIs et PPMIs et toutes les matrices LSA (de base, scalée, pmi, ppmi) en appelant la fonction evaluate avec les deux mesure de distance (euclidienne et cosinus) et affichez vos résultats dans une seule table.

In [194]:
table =[]
dataSet = [Mww, Mww_scaled, ppmi_matrix, ppmi_matrix_scaled, pmi_matrix, pmi_matrix_scaled, Mww_lsa, Mww_scaled_lsa, ppmi_lsa, pmi_lsa, ppmi_truncatedSVD, pmi_truncatedSVD]
for data in dataSet:
    iterator = men_dataset()
    table.append(evaluate(iterator, data))
    iterator = men_dataset()
    table.append(evaluate(iterator, data, euclidean))
table

[(0.017651716917010694,
          word1        word2  score  distance
  0       river        water  -49.0  0.223493
  1        rain        storm  -49.0  0.488983
  2       dance      dancers  -49.0  0.375209
  3      camera  photography  -49.0  0.232320
  4      photos  photography  -47.0  0.405925
  ..        ...          ...    ...       ...
  922       car       tongue   -4.0  0.593657
  923      fish      theatre   -3.0  0.398236
  924       hot       zombie   -3.0  0.165601
  925  children         ford   -3.0  0.303600
  926     grave          hat   -2.0  0.368181
  
  [927 rows x 4 columns]),
 (0.1279070526905864,
          word1        word2  score    distance
  0       river        water  -49.0  130.575649
  1        rain        storm  -49.0   83.204567
  2       dance      dancers  -49.0  295.514805
  3      camera  photography  -49.0  711.386674
  4      photos  photography  -47.0  185.359111
  ..        ...          ...    ...         ...
  922       car       tongue   -4.0 

**b)**	Commentez vos résultats d'évaluation