# Sentence Embedding - Overview

* This notebook presents several "Sentence Embedding" techniques, which generate features from sentences (here, tweets)
* The goal is to separate tweet sentiments automatically, using t-SNE to reduce the features to 2 dimensions
* It is a notebook of examples meant to clarify how the techniques are implemented. It is not optimized and must be adapted to any new context, in particular on the following points:
    * Text cleaning
    * The BERT models (model_type), ideally pre-trained on data similar to the context (here the 'cardiffnlp/twitter-roberta-base-sentiment' model outperforms the base model because it was pre-trained on tweets)
    * The vector size (max_length)
    * The batch_size
    * The t-SNE perplexity (perplexity defaults to 30)

# Initial dataset preparation

## Loading and filtering the dataset

In [2]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed by the t-SNE plotting function further down
# import seaborn as sns

In [3]:
# Tweets file available at: https://www.kaggle.com/crowdflower/twitter-airline-sentiment?select=Tweets.csv

data_T = pd.read_csv('/Users/maurelco/Developer/Python/Projet 4/data/Cleaned/df_clean.csv')
data_T.shape

(49843, 3)

In [4]:
data_T

Unnamed: 0,Title_2,Tags,Text
0,giving unix process exclusive rw access directory,"['linux', 'ubuntu', 'process', 'sandbox', 'sel...",giving unix process exclusive rw access direct...
1,automatic repaint minimizing window,"['java', 'graphics', 'jframe', 'jpanel', 'paint']",automatic repaint minimizing window jframe pan...
2,man-in-the-middle attack security threat ssh a...,"['security', 'ssh', 'ssh-keys', 'openssh', 'ma...",man-in-the-middle attack security threat ssh a...
3,managing data access winforms app,"['c#', 'winforms', 'sqlite', 'datatable', 'sql...",managing data access winforms app winforms ent...
4,render basic html view,"['javascript', 'html', 'node.js', 'mongodb', '...",render basic html view basic node.js app get g...
...,...,...,...
49838,bypass vertica error execution time exceeded r...,"['sql-server', 'ssas', 'oledb', 'sql-server-da...",bypass vertica error execution time exceeded r...
49839,conflicting conditional operation progress bucket,"['python', 'amazon-web-services', 'amazon-s', ...",conflicting conditional operation progress buc...
49840,problem lr_find pytorch fastai course,"['python', 'machine-learning', 'deep-learning'...",problem lr_find pytorch fastai course jupyter ...
49841,jsonpatch escape slash jsonpatch+json,"['java', 'json', 'rest', 'json-patch', 'http-p...",jsonpatch escape slash jsonpatch+json json wan...


# Shared processing setup

In [8]:
# Import libraries
# import nltk
# import pickle
# import time
from sklearn import cluster, metrics
from sklearn import manifold, decomposition
import logging

logging.disable(logging.WARNING) # disable WARNING, INFO and DEBUG logging everywhere

## Reading the dataset

In [8]:
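# Each distinct tag list acts as one category (the order produced by set() is not stable between runs)
l_cat = list(set(data_T['Tags']))
print("categories: ", l_cat)
# Encode each row as 1 - (index of its category); only the grouping matters for the ARI, not the values
y_cat_num = [(1-l_cat.index(data_T.iloc[i]['Tags'])) for i in range(len(data_T))]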



In [9]:
l_cat.index(data_T.iloc[0]['Tags'])

18197
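
The encoding above calls `l_cat.index(...)` once per row, which rescans the list each time. A minimal sketch of an equivalent dictionary-based encoding is shown here on a small made-up tag column; the sample strings and the names `tags`, `categories`, `cat_to_id` and `y_ids` are assumptions for illustration, not part of the notebook.

```python
# Hypothetical sample standing in for data_T['Tags'] (each entry is a tag list stored as a string)
tags = [
    "['java', 'json']",
    "['python', 'django']",
    "['java', 'json']",
    "['c#', 'wpf']",
]

# sorted() makes the category order reproducible; list(set(...)) can change between runs
categories = sorted(set(tags))

# A dictionary lookup is O(1) per row, whereas list.index() rescans the list every time
cat_to_id = {cat: i for i, cat in enumerate(categories)}
y_ids = [cat_to_id[t] for t in tags]

print(categories)
print(y_ids)
```

Because the adjusted Rand index only depends on which rows share a label, any consistent encoding of this kind should give the same score as the `1 - index` values used in the notebook.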

In [10]:
l_cat

["['c', 'sorting', 'multidimensional-array', 'quicksort', 'bubble-sort']",
 "['java', 'spring', 'spring-mvc', 'spring-boot', 'mybatis']",
 "['c#', '.net', 'sql', 'orm', 'dapper']",
 "['c#', 'entity-framework', 'linq', 'entity-framework', 'entity-framework-core']",
 "['python', 'django', 'search', 'elasticsearch', 'wagtail']",
 "['amazon-web-services', 'amazon-ec', 'jupyterhub', 'amazon-ebs', 'amazon-efs']",
 "['json', 'algorithm', 'mongodb', 'meteor', 'mirroring']",
 "['php', 'sorting', 'localization', 'collation', 'strcmp']",
 "['java', 'mysql', 'jdbc', 'insert-update', 'connector-j']",
 "['javascript', 'ajax', 'encoding', 'utf', 'xmlhttprequest']",
 "['c#', 'wpf', 'xaml', 'mvvm', 'charts']",
 "['php', 'jquery', 'wordpress', 'plugins', 'jetpack']",
 "['ios', 'iphone', 'json', 'swift', 'getter-setter']",
 "['c++', 'string', 'formatting', 'newline', 'printf']",
 "['java', 'json', 'spring', 'spring-boot', 'web']",
 "['c#', 'delegates', 'lambda', 'anonymous-methods', 'out-parameters']",
 

In [11]:
y_cat_num

[-18196,
 -27898,
 -11925,
 -43062,
 -9897,
 -45106,
 -34419,
 -11067,
 -7352,
 -58,
 -41084,
 -30914,
 -4419,
 -4071,
 -39026,
 -15224,
 -25155,
 -38978,
 -28122,
 -31998,
 -4453,
 -4113,
 -48200,
 -4383,
 -46364,
 -433,
 -39169,
 -28017,
 -38464,
 -15572,
 -28544,
 -11782,
 -25761,
 -8701,
 -23715,
 -33597,
 -23770,
 -10037,
 -18460,
 -29735,
 -22108,
 -7331,
 -18815,
 -6241,
 -15326,
 -9038,
 -23945,
 -32015,
 -9141,
 -33550,
 -4600,
 -30651,
 -38271,
 -2334,
 -40363,
 -26966,
 -19340,
 -37387,
 -14300,
 -29250,
 -44136,
 -37111,
 -4538,
 -46875,
 -24758,
 -3712,
 -32657,
 -1811,
 -4946,
 -19861,
 -39703,
 -31898,
 -22992,
 -22278,
 -4841,
 -4317,
 -1367,
 -16252,
 -35049,
 -3091,
 -8045,
 -43533,
 -46612,
 -5740,
 -47818,
 -8984,
 -20332,
 -33849,
 -23739,
 -41898,
 -6580,
 -43967,
 -10479,
 -27249,
 -42326,
 -33590,
 -21074,
 -10508,
 -42562,
 -12453,
 -33370,
 -34945,
 -37040,
 -42969,
 -15875,
 -26619,
 -31576,
 -17019,
 -16646,
 -14162,
 -4777,
 -14009,
 -2087,
 -37414,
 -30200

## Shared functions

In [20]:
import time
import matplotlib.pyplot as plt

# Compute the t-SNE projection, derive clusters, and compute the ARI between true categories and cluster ids
def ARI_fct(features) :
    time1 = time.time()
    num_labels=len(l_cat)
    # n_iter is the parameter name in the scikit-learn version used here; newer releases rename it to max_iter
    tsne = manifold.TSNE(n_components=2, perplexity=30, n_iter=2000, 
                                 init='random', learning_rate=200, random_state=42)
    X_tsne = tsne.fit_transform(features)
    
    # Fit the clusters on the t-SNE output
    cls = cluster.KMeans(n_clusters=num_labels, n_init=100, random_state=42)
    cls.fit(X_tsne)
    ARI = np.round(metrics.adjusted_rand_score(y_cat_num, cls.labels_),4)
    time2 = np.round(time.time() - time1,0)
    print("ARI : ", ARI, "time : ", time2)
    
    return ARI, X_tsne, cls.labels_


# Plot the t-SNE projection colored by true category (left) and by cluster (right)
def TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI) :
    fig = plt.figure(figsize=(15,6))
    
    ax = fig.add_subplot(121)
    scatter = ax.scatter(X_tsne[:,0],X_tsne[:,1], c=y_cat_num, cmap='Set1')
    ax.legend(handles=scatter.legend_elements()[0], labels=l_cat, loc="best", title="Category")
    plt.title('Tweets by true category')
    
    ax = fig.add_subplot(122)
    scatter = ax.scatter(X_tsne[:,0],X_tsne[:,1], c=labels, cmap='Set1')
    ax.legend(handles=scatter.legend_elements()[0], labels=set(labels), loc="best", title="Clusters")
    plt.title('Tweets by cluster')
    
    plt.show()
    print("ARI : ", ARI)
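
The adjusted Rand index compares two groupings of the same rows, so the actual label values (such as the negative integers in `y_cat_num`) do not matter. The sketch below recomputes the index from its contingency-table definition in plain Python to illustrate this; the function name `adjusted_rand_index` and the toy labelings are assumptions, and the notebook itself relies on `metrics.adjusted_rand_score`.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labelings (needs at least 2 samples)."""
    n = len(labels_true)
    # Pairs of samples that fall in the same cell, the same true class, the same predicted cluster
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings use a single group
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

y_true = [-3, -3, -7, -7, -1, -1]   # arbitrary label values, as in y_cat_num
same_grouping = [0, 0, 1, 1, 2, 2]  # same partition under different names
one_mistake = [0, 0, 1, 2, 2, 2]    # one row assigned to the wrong cluster

print(adjusted_rand_index(y_true, same_grouping))
print(adjusted_rand_index(y_true, one_mistake))
```

The first call returns 1.0 because the partitions are identical up to renaming; the second is lower because one row changed cluster.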


In [21]:
import time
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import jaccard_score
from statistics import mean

# Fit one logistic regression per label, predict on the test set, and report the mean per-row Jaccard score
def multioutputClassifier_fct(x_embedding_matrix_train, y_labels_train, x_embedding_matrix_test, y_labels_test) :
    time1 = time.time()
    clf = MultiOutputClassifier(LogisticRegression(n_jobs=-1, max_iter=200), n_jobs=-1)
    clf.fit(x_embedding_matrix_train, y_labels_train)
    y_pred = clf.predict(x_embedding_matrix_test)

    # Give the predictions the same column names as the true labels
    y_pred_df = pd.DataFrame(y_pred, columns=y_labels_test.columns)
    y_true_df = pd.DataFrame(y_labels_test)

    # Jaccard score of each test row (macro average over the two classes 0 and 1)
    jaccard_scores = [jaccard_score(y_true_df.iloc[i], y_pred_df.iloc[i], average='macro')
                      for i in range(len(y_true_df))]

    time2 = np.round(time.time() - time1,0)
    print("jaccard score : ", mean(jaccard_scores), "time : ", time2)

    return y_pred_df


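
To make the Jaccard score used in `multioutputClassifier_fct` more concrete, here is a minimal plain-Python sketch of the set-based definition (size of the intersection divided by size of the union), averaged over rows. The helper name `jaccard`, the toy tag lists and the empty-set convention are assumptions. Note that the function above calls `jaccard_score` with `average='macro'` on each row, which also averages in the score of the absent-tag class, so its values will generally differ from this set-based version.

```python
def jaccard(true_tags, pred_tags):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    true_set, pred_set = set(true_tags), set(pred_tags)
    union = true_set | pred_set
    if not union:  # convention chosen for this sketch: two empty tag sets count as a perfect match
        return 1.0
    return len(true_set & pred_set) / len(union)

# Hypothetical true and predicted tags for three questions
y_true = [["java", "json", "rest"], ["python", "django"], ["c#", "wpf"]]
y_pred = [["java", "json"], ["python", "flask"], ["c#", "wpf"]]

scores = [jaccard(t, p) for t, p in zip(y_true, y_pred)]
mean_score = sum(scores) / len(scores)

print(scores)
print(round(mean_score, 4))
```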


# Bag of words - Tf-idf

## Preparing the sentences

In [17]:
# Build the bag of words (CountVectorizer and Tf-idf)

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Token pattern (raw string): a letter or underscore, then letters, underscores, '#', '+' or '-',
# so that tags such as c# and c++ survive tokenization
cvect = CountVectorizer(analyzer='word', min_df=0.009, token_pattern=r'[a-zA-Z_][a-zA-Z_#+-]*')
ctf = TfidfVectorizer(analyzer='word', min_df=0.009, sublinear_tf=True, token_pattern=r'[a-zA-Z_][a-zA-Z_#+-]*')

feat = 'Text'
cv_fit = cvect.fit(data_T[feat])
ctf_fit = ctf.fit(data_T[feat])

cv_transform = cvect.transform(data_T[feat])  
ctf_transform = ctf.transform(data_T[feat])
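
The effect of a custom `token_pattern` is easiest to see on a sample string. The sketch below applies a pattern of the same shape with Python's `re` module, assuming the intended character class is letters, underscore, `#`, `+` and `-`; the sample sentence and the name `TOKEN_PATTERN` are made up for illustration.

```python
import re

# Assumed intent of the vectorizer pattern: start with a letter or underscore,
# then continue with letters, underscores, '#', '+' or '-'
TOKEN_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z_#+-]*")

sample = "c# and c++ questions about node.js, utf-8 and python3"

# The vectorizers lowercase the text by default before tokenizing
tokens = TOKEN_PATTERN.findall(sample.lower())
print(tokens)
```

Digits and dots are not part of the pattern, so `node.js` splits into two tokens and `utf-8` loses its digit, which is consistent with tags such as `'amazon-s'` and `'utf'` in the cleaned data.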

In [19]:
ctf_transform

<49843x1129 sparse matrix of type '<class 'numpy.float64'>'
	with 1561243 stored elements in Compressed Sparse Row format>

## Running the models

In [None]:
print("CountVectorizer : ")
print("-----------------")
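# multioutputClassifier_fct takes four arguments (train features, train labels, test features, test labels);
# the train/test split and the label frames are not defined in this notebook
y_pred = multioutputClassifier_fct(cv_transform)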
print()
print("Tf-idf : ")
print("--------")
ARI, X_tsne, labels = ARI_fct(ctf_transform)
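
`multioutputClassifier_fct` expects label tables with one binary column per tag, whereas the `Tags` column stores each tag list as a string. The following is a minimal plain-Python sketch of that conversion under those assumptions; the sample rows and the names `raw_tags`, `columns` and `multi_hot` are hypothetical. In practice scikit-learn's `MultiLabelBinarizer` covers the same step.

```python
import ast

# Hypothetical rows mimicking the 'Tags' column, where each list is stored as a string
raw_tags = [
    "['java', 'json', 'rest']",
    "['python', 'json']",
    "['java', 'spring']",
]

# Turn each string back into a real Python list
tag_lists = [ast.literal_eval(s) for s in raw_tags]

# One column per distinct tag, in sorted order
columns = sorted({tag for tags in tag_lists for tag in tags})

# One row per question: 1 if the question carries the tag, else 0
multi_hot = [[int(col in tags) for col in columns] for tags in tag_lists]

print(columns)
for row in multi_hot:
    print(row)
```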


## Plots

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

# Word2Vec

In [5]:
import gensim

In [6]:
from keras.preprocessing.text import Tokenizer

In [7]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
# from tensorflow.keras import metrics as kmetrics
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D
from tensorflow.keras.models import Model



## Building the Word2Vec model

In [1]:
w2v_size= 300
w2v_window=5
w2v_min_count=50
w2v_epochs=10
maxlen = 500 # adapt to length of sentences

In [8]:
sentences = data_T['Text'].to_list()
sentences = [gensim.utils.simple_preprocess(text) for text in sentences]

In [28]:
sentences

[['giving',
  'unix',
  'process',
  'exclusive',
  'rw',
  'access',
  'directory',
  'sandbox',
  'linux',
  'process',
  'certain',
  'directory',
  'process',
  'exclusive',
  'rw',
  'access',
  'dir',
  'temporary',
  'directory',
  'python',
  'scripting',
  'tool',
  'directory',
  'limiting',
  'much',
  'functionality',
  'process',
  'access',
  'directory',
  'except',
  'superusers',
  'course',
  'sandbox',
  'web',
  'service',
  'basically',
  'allows',
  'user',
  'arbitrary',
  'code',
  'authorization',
  'software',
  'process',
  'linux',
  'user',
  'user',
  'harm',
  'system',
  'temporary',
  'private',
  'directory',
  'file',
  'protected',
  'user',
  'webservice'],
 ['automatic',
  'repaint',
  'minimizing',
  'window',
  'jframe',
  'panel',
  'panel',
  'draw',
  'line',
  'minimized',
  'window',
  'java',
  'maximized',
  'line',
  'drew',
  'repainted',
  'place',
  'lock',
  'painting',
  'minimize',
  'screw',
  'drawing',
  'thank',
  'import',
  'j

In [9]:
# Build and train the Word2Vec model

print("Build & train Word2Vec model ...")
w2v_model = gensim.models.Word2Vec(min_count=w2v_min_count, window=w2v_window,
                                   vector_size=w2v_size,
                                   seed=42,
                                   workers=1)  # a single worker helps keep training reproducible
#                                  workers=multiprocessing.cpu_count())
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=w2v_epochs)
model_vectors = w2v_model.wv
w2v_words = model_vectors.index_to_key
print("Vocabulary size: %i" % len(w2v_words))
print("Word2Vec trained")

Build & train Word2Vec model ...
Vocabulary size: 8785
Word2Vec trained


In [10]:
# Prepare the sentences (tokenization)
print("Fit Tokenizer ...")
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
# Convert tokens to integer ids, then pad each sequence with trailing zeros up to maxlen
# (longer sequences are truncated)
x_sentences = pad_sequences(tokenizer.texts_to_sequences(sentences),
                            maxlen=maxlen,
                            padding='post')

num_words = len(tokenizer.word_index) + 1
print("Number of unique words: %i" % num_words)

Fit Tokenizer ...
Number of unique words: 223828


## Building the embedding matrix

In [11]:
# Build the embedding matrix

print("Create Embedding matrix ...")
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1
# Row idx holds the Word2Vec vector of the word with Tokenizer id idx;
# row 0 (padding) and words without a vector stay at zero
embedding_matrix = np.zeros((vocab_size, w2v_size))
i=0
j=0

for word, idx in word_index.items():
    i +=1
    # key_to_index is a dict, so this test avoids scanning the w2v_words list for every word
    if word in model_vectors.key_to_index:
        j +=1
        embedding_matrix[idx] = model_vectors[word]

# Share of Tokenizer words that have a Word2Vec vector
word_rate = np.round(j/i,4)
print("Word embedding rate : ", word_rate)
print("Embedding matrix: %s" % str(embedding_matrix.shape))

Create Embedding matrix ...
Word embedding rate :  0.0392
Embedding matrix: (223828, 300)
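
The printed rate is a share of distinct words (roughly 8785 Word2Vec words out of the 223827 Tokenizer words), not a share of token occurrences. Since `min_count` drops rare words, the two shares can differ a lot. A small sketch on a made-up corpus illustrates the difference; the corpus, the lowered `min_count` and the names `vocab_rate` and `token_rate` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical tokenized corpus; the notebook's `sentences` plays this role
corpus = [
    ["python", "list", "sort", "python"],
    ["python", "dict", "sort"],
    ["java", "list", "python", "jvmflagxyz"],
]
min_count = 2  # the notebook uses w2v_min_count = 50

counts = Counter(token for sentence in corpus for token in sentence)

# Words seen at least min_count times are the ones that get a vector
kept = {word for word, c in counts.items() if c >= min_count}

# Share of distinct words with a vector (the quantity the notebook prints)
vocab_rate = len(kept) / len(counts)

# Share of token occurrences with a vector
token_rate = sum(c for w, c in counts.items() if w in kept) / sum(counts.values())

print(round(vocab_rate, 4), round(token_rate, 4))
```

In this toy case half of the distinct words are kept, yet they cover about 73% of the tokens, so a low vocabulary rate does not by itself mean most of the text is ignored.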


In [12]:
embedding_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.89845711,  0.70766497,  0.65110111, ..., -0.7324717 ,
        -0.66601145, -0.22039014],
       [-0.13043587, -0.1094703 , -1.12796104, ..., -0.8673963 ,
        -0.23623201,  0.42596436],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

## Building the embedding model

In [13]:
# Build the model

# Word ids go in, their Word2Vec vectors are looked up, then averaged over the maxlen positions
word_input=Input(shape=(maxlen,),dtype='float64')
word_embedding=Embedding(input_dim=vocab_size,
                         output_dim=w2v_size,
                         weights = [embedding_matrix],
                         input_length=maxlen)(word_input)
word_vec=GlobalAveragePooling1D()(word_embedding)
embed_model = Model([word_input],word_vec)

embed_model.summary()


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 500)]             0         
                                                                 
 embedding (Embedding)       (None, 500, 300)          67148400  
                                                                 
 global_average_pooling1d (G  (None, 300)              0         
 lobalAveragePooling1D)                                          
                                                                 
Total params: 67,148,400
Trainable params: 67,148,400
Non-trainable params: 0
_________________________________________________________________
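
As built here, the `Embedding` layer does not mask padding (`mask_zero` is left at its default), so `GlobalAveragePooling1D` should average over all `maxlen` positions, including the zero rows used for padding and for words without a Word2Vec vector. The plain-Python sketch below contrasts that with an average over real tokens only; the tiny embedding table and the helper `mean_pool` are assumptions for illustration, not the Keras implementation.

```python
# Hypothetical 3-dimensional word vectors; id 0 is the padding row, left at zero
embedding = {
    0: [0.0, 0.0, 0.0],
    1: [1.5, -0.5, 1.0],
    2: [1.5, 0.5, -1.0],
}
sequence = [1, 2, 0, 0, 0, 0]  # two real tokens followed by 'post' padding (maxlen = 6)

def mean_pool(seq, ignore_padding=False):
    """Average the vectors of a padded sequence, optionally skipping padding ids.

    With ignore_padding=True the sequence must contain at least one real token.
    """
    ids = [i for i in seq if i != 0] if ignore_padding else list(seq)
    dim = len(embedding[0])
    return [sum(embedding[i][d] for i in ids) / len(ids) for d in range(dim)]

print(mean_pool(sequence))                       # sum divided by maxlen
print(mean_pool(sequence, ignore_padding=True))  # sum divided by the number of real tokens
```

With `maxlen = 500` and short texts, the unmasked average shrinks every component toward zero, which may explain the small magnitudes in the resulting sentence vectors.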


## Running the model

In [14]:
embeddings = embed_model.predict(x_sentences)
embeddings.shape


(49843, 300)

In [15]:
embeddings_df = pd.DataFrame(embeddings)
embeddings_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,-0.017392,-0.005160,0.050944,-0.038304,-0.007477,0.005038,-0.027893,0.020133,0.007672,-0.011984,...,-0.032344,0.001750,-0.012358,-0.009956,0.010883,-0.031312,-0.063954,0.045778,0.021745,0.011540
1,-0.027313,-0.173435,-0.047737,-0.116093,0.104681,0.098169,0.033931,-0.122270,-0.287956,0.062984,...,-0.275156,-0.057395,0.008508,0.258975,0.151782,-0.020713,0.038451,-0.177540,0.000615,0.067953
2,0.031391,-0.018802,0.024694,-0.008162,-0.026555,-0.000861,-0.019980,0.015817,0.024177,0.010870,...,0.003158,-0.016125,-0.011170,0.012918,0.007613,-0.024196,-0.032643,0.024943,0.012432,0.022935
3,-0.043320,-0.109560,-0.015442,-0.077797,0.061917,0.032022,0.005346,-0.054202,-0.077683,-0.094736,...,-0.265825,-0.021202,0.047354,0.025186,0.041690,0.019355,0.027115,-0.041112,0.037517,-0.034422
4,-0.051266,-0.035526,-0.010039,-0.023131,0.047670,-0.012697,0.066157,0.016522,0.013432,-0.006722,...,-0.031288,-0.051745,0.004116,-0.006013,-0.014885,-0.040796,0.000389,-0.029573,0.042336,0.037725
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49838,-0.018121,-0.014086,0.007102,-0.013001,-0.004780,-0.006598,-0.009560,0.018915,0.008350,0.016128,...,0.018735,-0.002750,-0.019313,-0.017788,-0.014857,0.025476,-0.011317,0.005399,-0.000702,-0.004715
49839,-0.004759,-0.004354,0.013881,-0.022800,-0.014417,-0.008256,-0.010398,0.020267,-0.021546,-0.025257,...,-0.009637,0.006726,0.003692,-0.032332,0.001329,-0.000101,-0.006961,-0.012844,-0.016385,0.006638
49840,-0.030457,0.007177,-0.069470,-0.004749,0.019007,0.041439,-0.016418,-0.023653,0.020345,-0.001575,...,0.031306,0.022563,0.001820,-0.014546,0.028562,0.030601,-0.011010,0.023115,0.014513,0.013733
49841,-0.009495,0.008387,0.012524,-0.046570,0.027202,0.001802,0.030280,-0.039494,0.005613,-0.029152,...,-0.054591,0.003644,0.007355,-0.010087,-0.027110,-0.026877,-0.039927,0.024970,0.016926,0.060633


In [16]:
embeddings_df.to_csv('/Users/maurelco/Developer/Python/Projet4/data/Cleaned/df_word2vec_matrix_embedding.csv', index=False)