# Sentence Embedding - Présentation générale

* Ce notebook présente différentes techniques de "Sentence Embeddings", permettant de générer des features à partir de phrases (ici des tweets)
* L'objectif est de pouvoir séparer les sentiments des tweets de façon automatique, via un T-SNE, qui permet une réduction des features en 2 dimensions
* C'est un notebook d'exemples afin de mieux comprendre la mise en oeuvre des techniques. Il n'est pas optimisé et doit être adapté à un nouveau contexte, en particulier sur les points suivants :
    * Le nettoyage des textes
    * les modèles BERT (model_type) idéalement pré-entraînés sur des données similaires au contexte (ici le modèle 'cardiffnlp/twitter-roberta-base-sentiment' surperforme le modèle de base car il a été pré-entraîné sur des tweets)
    * La taille des vecteurs (max_length)
    * Le batch_size
    * La perplexité du Tsne (perplexity à 30 par défaut)

# Préparation initiale dataset

## Récupération du dataset et filtres de données

In [1]:
# Import des librairies
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
# import seaborn as sns

In [3]:
df = pd.DataFrame()

In [4]:
df

In [5]:
# Fichier des tweets à récupérer sur : https://www.kaggle.com/crowdflower/twitter-airline-sentiment?select=Tweets.csv

data_T = pd.read_csv('/Users/maurelco/Developer/Python/Projet 4/data/Cleaned/df_clean.csv')
data_T.shape

(49843, 4)

In [6]:
data_T

Unnamed: 0.1,Unnamed: 0,Title_2,Tags,Text
0,0,giving unix process exclusive rw access directory,"['linux', 'ubuntu', 'process', 'sandbox', 'sel...",giving unix process exclusive rw access direct...
1,1,automatic repaint minimizing window,"['java', 'graphics', 'jframe', 'jpanel', 'paint']",automatic repaint minimizing window jframe pan...
2,2,man-in-the-middle attack security threat ssh a...,"['security', 'ssh', 'ssh-keys', 'openssh', 'ma...",man-in-the-middle attack security threat ssh a...
3,3,managing data access winforms app,"['c#', 'winforms', 'sqlite', 'datatable', 'sql...",managing data access winforms app winforms ent...
4,4,render basic html view,"['javascript', 'html', 'node.js', 'mongodb', '...",render basic html view basic node.js app get g...
...,...,...,...,...
49838,49995,bypass vertica error execution time exceeded r...,"['sql-server', 'ssas', 'oledb', 'sql-server-da...",bypass vertica error execution time exceeded r...
49839,49996,conflicting conditional operation progress bucket,"['python', 'amazon-web-services', 'amazon-s', ...",conflicting conditional operation progress buc...
49840,49997,problem lr_find pytorch fastai course,"['python', 'machine-learning', 'deep-learning'...",problem lr_find pytorch fastai course jupyter ...
49841,49998,jsonpatch escape slash jsonpatch+json,"['java', 'json', 'rest', 'json-patch', 'http-p...",jsonpatch escape slash jsonpatch+json json wan...


# Préparation commune des traitements

In [7]:
# Import des librairies
# import nltk
# import pickle
# import time
from sklearn import cluster, metrics
from sklearn import manifold, decomposition
import logging

logging.disable(logging.WARNING) # disable WARNING, INFO and DEBUG logging everywhere

## Lecture dataset

In [8]:
l_cat = list(set(data_T['Tags']))
print("catégories : ", l_cat)
y_cat_num = [(1-l_cat.index(data_T.iloc[i]['Tags'])) for i in range(len(data_T))]

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [9]:
l_cat.index(data_T.iloc[0]['Tags'])

18197

In [10]:
l_cat

["['c', 'sorting', 'multidimensional-array', 'quicksort', 'bubble-sort']",
 "['java', 'spring', 'spring-mvc', 'spring-boot', 'mybatis']",
 "['c#', '.net', 'sql', 'orm', 'dapper']",
 "['c#', 'entity-framework', 'linq', 'entity-framework', 'entity-framework-core']",
 "['python', 'django', 'search', 'elasticsearch', 'wagtail']",
 "['amazon-web-services', 'amazon-ec', 'jupyterhub', 'amazon-ebs', 'amazon-efs']",
 "['json', 'algorithm', 'mongodb', 'meteor', 'mirroring']",
 "['php', 'sorting', 'localization', 'collation', 'strcmp']",
 "['java', 'mysql', 'jdbc', 'insert-update', 'connector-j']",
 "['javascript', 'ajax', 'encoding', 'utf', 'xmlhttprequest']",
 "['c#', 'wpf', 'xaml', 'mvvm', 'charts']",
 "['php', 'jquery', 'wordpress', 'plugins', 'jetpack']",
 "['ios', 'iphone', 'json', 'swift', 'getter-setter']",
 "['c++', 'string', 'formatting', 'newline', 'printf']",
 "['java', 'json', 'spring', 'spring-boot', 'web']",
 "['c#', 'delegates', 'lambda', 'anonymous-methods', 'out-parameters']",
 

In [11]:
y_cat_num

[-18196,
 -27898,
 -11925,
 -43062,
 -9897,
 -45106,
 -34419,
 -11067,
 -7352,
 -58,
 -41084,
 -30914,
 -4419,
 -4071,
 -39026,
 -15224,
 -25155,
 -38978,
 -28122,
 -31998,
 -4453,
 -4113,
 -48200,
 -4383,
 -46364,
 -433,
 -39169,
 -28017,
 -38464,
 -15572,
 -28544,
 -11782,
 -25761,
 -8701,
 -23715,
 -33597,
 -23770,
 -10037,
 -18460,
 -29735,
 -22108,
 -7331,
 -18815,
 -6241,
 -15326,
 -9038,
 -23945,
 -32015,
 -9141,
 -33550,
 -4600,
 -30651,
 -38271,
 -2334,
 -40363,
 -26966,
 -19340,
 -37387,
 -14300,
 -29250,
 -44136,
 -37111,
 -4538,
 -46875,
 -24758,
 -3712,
 -32657,
 -1811,
 -4946,
 -19861,
 -39703,
 -31898,
 -22992,
 -22278,
 -4841,
 -4317,
 -1367,
 -16252,
 -35049,
 -3091,
 -8045,
 -43533,
 -46612,
 -5740,
 -47818,
 -8984,
 -20332,
 -33849,
 -23739,
 -41898,
 -6580,
 -43967,
 -10479,
 -27249,
 -42326,
 -33590,
 -21074,
 -10508,
 -42562,
 -12453,
 -33370,
 -34945,
 -37040,
 -42969,
 -15875,
 -26619,
 -31576,
 -17019,
 -16646,
 -14162,
 -4777,
 -14009,
 -2087,
 -37414,
 -30200

## Fonctions communes

In [22]:
import time

# Calcul Tsne, détermination des clusters et calcul ARI entre vrais catégorie et n° de clusters
def ARI_fct(features) :
    time1 = time.time()
    num_labels=len(l_cat)
    tsne = manifold.TSNE(n_components=2, perplexity=30, n_iter=2000, 
                                 init='random', learning_rate=200, random_state=42)
    X_tsne = tsne.fit_transform(features)
    
    # Détermination des clusters à partir des données après Tsne 
    cls = cluster.KMeans(n_clusters=num_labels, n_init=100, random_state=42)
    cls.fit(X_tsne)
    ARI = np.round(metrics.adjusted_rand_score(y_cat_num, cls.labels_),4)
    time2 = np.round(time.time() - time1,0)
    print("ARI : ", ARI, "time : ", time2)
    
    return ARI, X_tsne, cls.labels_


# visualisation du Tsne selon les vraies catégories et selon les clusters
def TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI) :
    fig = plt.figure(figsize=(15,6))
    
    ax = fig.add_subplot(121)
    scatter = ax.scatter(X_tsne[:,0],X_tsne[:,1], c=y_cat_num, cmap='Set1')
    ax.legend(handles=scatter.legend_elements()[0], labels=l_cat, loc="best", title="Categorie")
    plt.title('Représentation des tweets par catégories réelles')
    
    ax = fig.add_subplot(122)
    scatter = ax.scatter(X_tsne[:,0],X_tsne[:,1], c=labels, cmap='Set1')
    ax.legend(handles=scatter.legend_elements()[0], labels=set(labels), loc="best", title="Clusters")
    plt.title('Représentation des tweets par clusters')
    
    plt.show()
    print("ARI : ", ARI)


# Bag of word - Tf-idf

## Préparation sentences

In [12]:
# création du bag of words (CountVectorizer et Tf-idf)

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cvect = CountVectorizer(analyzer= 'word', min_df = 0.009,token_pattern='[a-zA_][a-zA_\#+-]*')
ctf = TfidfVectorizer(analyzer= 'word', min_df = 0.009, sublinear_tf = True,token_pattern='[a-zA_][a-zA_\#+-]*')

feat = 'Text'
cv_fit = cvect.fit(data_T[feat])
ctf_fit = ctf.fit(data_T[feat])

cv_transform = cvect.transform(data_T[feat])  
ctf_transform = ctf.transform(data_T[feat])  

## Exécution des modèles

In [None]:
print("CountVectorizer : ")
print("-----------------")
ARI, X_tsne, labels = ARI_fct(cv_transform)
print()
print("Tf-idf : ")
print("--------")
ARI, X_tsne, labels = ARI_fct(ctf_transform)


## Graphiques

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

# Word2Vec

In [10]:
df

In [13]:
import gensim



In [14]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [15]:
# import tensorflow as tf
# import keras
# from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.sequence import pad_sequences
# from tensorflow.keras import metrics as kmetrics
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model

In [16]:
data_T

Unnamed: 0.1,Unnamed: 0,Title_2,Tags,Text
0,0,giving unix process exclusive rw access directory,"['linux', 'ubuntu', 'process', 'sandbox', 'sel...",giving unix process exclusive rw access direct...
1,1,automatic repaint minimizing window,"['java', 'graphics', 'jframe', 'jpanel', 'paint']",automatic repaint minimizing window jframe pan...
2,2,man-in-the-middle attack security threat ssh a...,"['security', 'ssh', 'ssh-keys', 'openssh', 'ma...",man-in-the-middle attack security threat ssh a...
3,3,managing data access winforms app,"['c#', 'winforms', 'sqlite', 'datatable', 'sql...",managing data access winforms app winforms ent...
4,4,render basic html view,"['javascript', 'html', 'node.js', 'mongodb', '...",render basic html view basic node.js app get g...
...,...,...,...,...
49838,49995,bypass vertica error execution time exceeded r...,"['sql-server', 'ssas', 'oledb', 'sql-server-da...",bypass vertica error execution time exceeded r...
49839,49996,conflicting conditional operation progress bucket,"['python', 'amazon-web-services', 'amazon-s', ...",conflicting conditional operation progress buc...
49840,49997,problem lr_find pytorch fastai course,"['python', 'machine-learning', 'deep-learning'...",problem lr_find pytorch fastai course jupyter ...
49841,49998,jsonpatch escape slash jsonpatch+json,"['java', 'json', 'rest', 'json-patch', 'http-p...",jsonpatch escape slash jsonpatch+json json wan...


## Création du modèle Word2Vec

In [14]:
w2v_size=300
w2v_window=5
w2v_min_count=5
w2v_epochs=100
maxlen = 24 # adapt to length of sentences
sentences = data_T['Text'].to_list()
sentences = [gensim.utils.simple_preprocess(text) for text in sentences]

In [15]:
sentences

[['giving',
  'unix',
  'process',
  'exclusive',
  'rw',
  'access',
  'directory',
  'sandbox',
  'linux',
  'process',
  'certain',
  'directory',
  'process',
  'exclusive',
  'rw',
  'access',
  'dir',
  'temporary',
  'directory',
  'python',
  'scripting',
  'tool',
  'directory',
  'limiting',
  'much',
  'functionality',
  'process',
  'access',
  'directory',
  'except',
  'superusers',
  'course',
  'sandbox',
  'web',
  'service',
  'basically',
  'allows',
  'user',
  'arbitrary',
  'code',
  'authorization',
  'software',
  'process',
  'linux',
  'user',
  'user',
  'harm',
  'system',
  'temporary',
  'private',
  'directory',
  'file',
  'protected',
  'user',
  'webservice'],
 ['automatic',
  'repaint',
  'minimizing',
  'window',
  'jframe',
  'panel',
  'panel',
  'draw',
  'line',
  'minimized',
  'window',
  'java',
  'maximized',
  'line',
  'drew',
  'repainted',
  'place',
  'lock',
  'painting',
  'minimize',
  'screw',
  'drawing',
  'thank',
  'import',
  'j

In [16]:
# Création et entraînement du modèle Word2Vec

print("Build & train Word2Vec model ...")
w2v_model = gensim.models.Word2Vec(min_count=w2v_min_count, window=w2v_window,
                                                vector_size=w2v_size,
                                                seed=42,
                                                workers=1)
#                                                workers=multiprocessing.cpu_count())
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=w2v_epochs)
model_vectors = w2v_model.wv
w2v_words = model_vectors.index_to_key
print("Vocabulary size: %i" % len(w2v_words))
print("Word2Vec trained")

Build & train Word2Vec model ...
Vocabulary size: 223827
Word2Vec trained


In [17]:
# Préparation des sentences (tokenization)

print("Fit Tokenizer ...")
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
x_sentences = pad_sequences(tokenizer.texts_to_sequences(sentences),
                                                     maxlen=maxlen,
                                                     padding='post') 
                                                   
num_words = len(tokenizer.word_index) + 1
print("Number of unique words: %i" % num_words)

Fit Tokenizer ...
Number of unique words: 223828


## Création de la matrice d'embedding

In [18]:
# Création de la matrice d'embedding

print("Create Embedding matrix ...")
w2v_size = 300
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1
embedding_matrix = np.zeros((vocab_size, w2v_size))
i=0
j=0
    
for word, idx in word_index.items():
    i +=1
    if word in w2v_words:
        j +=1
        embedding_vector = model_vectors[word]
        if embedding_vector is not None:
            embedding_matrix[idx] = model_vectors[word]
            
word_rate = np.round(j/i,4)
print("Word embedding rate : ", word_rate)
print("Embedding matrix: %s" % str(embedding_matrix.shape))

Create Embedding matrix ...
Word embedding rate :  1.0
Embedding matrix: (223828, 300)


## Création du modèle d'embedding

In [19]:
# Création du modèle

input=Input(shape=(len(x_sentences),maxlen),dtype='float64')
word_input=Input(shape=(maxlen,),dtype='float64')  
word_embedding=Embedding(input_dim=vocab_size,
                         output_dim=w2v_size,
                         weights = [embedding_matrix],
                         input_length=maxlen)(word_input)
word_vec=GlobalAveragePooling1D()(word_embedding)  
embed_model = Model([word_input],word_vec)

embed_model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 24)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 24, 300)           67148400  
_________________________________________________________________
global_average_pooling1d (Gl (None, 300)               0         
Total params: 67,148,400
Trainable params: 67,148,400
Non-trainable params: 0
_________________________________________________________________


## Exécution du modèle

In [20]:
embeddings = embed_model.predict(x_sentences)
embeddings.shape

(49843, 300)

In [None]:
ARI, X_tsne, labels = ARI_fct(embeddings)

In [1]:
data_T

NameError: name 'data_T' is not defined

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

# BERT

In [111]:
import tensorflow as tf
import keras
from tensorflow import keras
import tensorflow_hub as hub
import tensorflow.keras
from tensorflow.keras import backend as K

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import metrics as kmetrics
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model

import tensorflow_hub as hub

# Bert
import os
import transformers
from transformers import *

os.environ["TF_KERAS"]='1'

In [112]:
print(tf.__version__)
print(tensorflow.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print(tf.test.is_built_with_cuda())

2.0.0
2.0.0
Num GPUs Available:  0
False


## Fonctions communes

In [169]:
# Fonction de préparation des sentences
def bert_inp_fct(sentences, bert_tokenizer, max_length) :
    input_ids=[]
    token_type_ids = []
    attention_mask=[]
    bert_inp_tot = []

    for sent in sentences:
        bert_inp = bert_tokenizer.encode_plus(sent,
                                              return_tensors='tf' ,
                                              add_special_tokens = True,
                                              truncation_strategy='longest_first',
                                              max_length = max_length)
        # print(np.array(bert_inp['special_tokens_mask']))

        input_ids.append(bert_inp['input_ids'][0])
        token_type_ids.append(bert_inp['token_type_ids'][0])
        attention_mask.append(bert_inp['special_tokens_mask'])
        bert_inp_tot.append((bert_inp['input_ids'][0], 
                             bert_inp['token_type_ids'][0], 
                             bert_inp['special_tokens_mask']))

    # print(input_ids)
    # print(token_type_ids)
    # print(attention_mask)

    # input_ids_1 = np.asarray(input_ids, dtype=object)
    # input_ids_2 = np.asarray(input_ids)
    # token_type_ids = np.asarray(token_type_ids, dtype=object)
    # attention_mask = np.array(attention_mask, dtype=object)

    return input_ids, token_type_ids, attention_mask, bert_inp_tot
    

# Fonction de création des features
def feature_BERT_fct(model, model_type, sentences, max_length, b_size, mode='HF') :
    batch_size = b_size
    batch_size_pred = b_size
    bert_tokenizer = AutoTokenizer.from_pretrained(model_type)
    time1 = time.time()

    for step in range(len(sentences)//batch_size) :
        idx = step*batch_size
        input_ids, token_type_ids, attention_mask, bert_inp_tot = bert_inp_fct(sentences[idx:idx+batch_size], 
                                                                      bert_tokenizer, max_length)
        
        if mode=='HF' :    # Bert HuggingFace
            print(input_ids[:1])
            print(token_type_ids[:1])
            print(attention_mask[:1])
            outputs = model.predict([input_ids, attention_mask, token_type_ids], batch_size=batch_size_pred)
            last_hidden_states = outputs.last_hidden_state

        if mode=='TFhub' : # Bert Tensorflow Hub
            text_preprocessed = {"input_word_ids" : input_ids, 
                                 "input_mask" : attention_mask, 
                                 "input_type_ids" : token_type_ids}
            outputs = model(text_preprocessed)
            last_hidden_states = outputs['sequence_output']
             
        if step ==0 :
            last_hidden_states_tot = last_hidden_states
            last_hidden_states_tot_0 = last_hidden_states
        else :
            last_hidden_states_tot = np.concatenate((last_hidden_states_tot,last_hidden_states))
    
    features_bert = np.array(last_hidden_states_tot).mean(axis=1)
    
    time2 = np.round(time.time() - time1,0)
    print("temps traitement : ", time2)
     
    return features_bert, last_hidden_states_tot

## BERT HuggingFace

### 'bert-base-uncased'

In [106]:
bert_preprocess_model = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4')

In [107]:
bert_preprocess_model

<tensorflow_hub.keras_layer.KerasLayer at 0x7f8a61dd0240>

In [108]:
import time

In [116]:
max_length = 300
batch_size = 1000

In [114]:
model_type = 'bert-base-uncased'
model = TFAutoModel.from_pretrained(model_type)
sentences = data_T['Text'].to_list()

In [168]:
batch_size = 1000
batch_size_pred = 1000
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
time1 = time.time()

for step in range(len(sentences)//batch_size) :
    idx = step*batch_size
    input_ids, token_type_ids, attention_mask, bert_inp_tot = bert_inp_fct(sentences[idx:idx+batch_size],bert_tokenizer,max_length)

print(input_ids,token_type_ids,attention_mask,bert_inp_tot)

KeyboardInterrupt: 

In [125]:
bert_model = TFAutoModel.from_pretrained('bert-base-uncased')

In [117]:
sentences

['giving unix process exclusive rw access directory sandbox linux process certain directory process exclusive rw access dir temporary directory e.g python scripting tool directory limiting much functionality process access directory except superusers course sandbox web service basically allows user arbitrary code authorization software process linux user user harm system temporary private directory file protected user webservice',
 'automatic repaint minimizing window jframe panel panel draw line minimized window java maximized line drew repainted place lock painting minimize screw drawing thank import java.awt import javax.swing import java.awt .event import java.io import java.util import javax.imageio .imageio import java.awt .graphics import java.awt .image .bufferedimage import java.io .file class jframepaint extends jframe actionlistener actionlistener public static int activa public static jbutton drawing jbutton drawing public static jbutton erase jbutton erase public static in

In [170]:
# Création des features

features_bert, last_hidden_states_tot = feature_BERT_fct(model, model_type, sentences, max_length=300 ,b_size=batch_size, mode='HF')

[<tf.Tensor: id=3238908, shape=(70,), dtype=int32, numpy=
array([  101,  3228, 19998,  2832,  7262,  1054,  2860,  3229, 14176,
        5472,  8758, 11603,  2832,  3056, 14176,  2832,  7262,  1054,
        2860,  3229, 16101,  5741, 14176,  1041,  1012,  1043, 18750,
        5896,  2075,  6994, 14176, 14879,  2172, 15380,  2832,  3229,
       14176,  3272,  3565, 20330,  2015,  2607,  5472,  8758,  4773,
        2326, 10468,  4473,  5310, 15275,  3642, 20104,  4007,  2832,
       11603,  5310,  5310,  7386,  2291,  5741,  2797, 14176,  5371,
        5123,  5310,  4773,  8043,  7903,  2063,   102], dtype=int32)>]
[<tf.Tensor: id=3238912, shape=(70,), dtype=int32, numpy=
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0], dtype=int32)>]
[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

IndexError: list index out of range

In [None]:
ARI, X_tsne, labels = ARI_fct(features_bert)

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

## BERT hub Tensorflow

In [96]:
import tensorflow_hub as hub
import tensorflow_text
preprocess = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')
encoder = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4')


# Guide sur le Tensorflow hub : https://www.tensorflow.org/text/tutorials/classify_text_with_bert
model_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4'
bert_layer = hub.KerasLayer(model_url, trainable=True)

NotFoundError: dlopen(/Users/maurelco/opt/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/tensorflow_text/python/metrics/_text_similarity_metric_ops.dylib, 0x0006): Symbol not found: __ZN10tensorflow15TensorShapeBaseINS_11TensorShapeEEC2EN4absl12lts_202103244SpanIKxEE
  Referenced from: /Users/maurelco/opt/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/tensorflow_text/python/metrics/_text_similarity_metric_ops.dylib
  Expected in: /Users/maurelco/opt/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/tensorflow_core/libtensorflow_framework.2.dylib

In [None]:
sentences = data_T['sentence_dl'].to_list()

In [None]:
max_length = 64
batch_size = 10
model_type = 'bert-base-uncased'
model = bert_layer

features_bert, last_hidden_states_tot = feature_BERT_fct(model, model_type, sentences, 
                                                         max_length, batch_size, mode='TFhub')

In [None]:
ARI, X_tsne, labels = ARI_fct(features_bert)

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

# USE - Universal Sentence Encoder

In [None]:
import tensorflow as tf
# import tensorflow_hub as hub
import tensorflow.keras
from tensorflow.keras import backend as K

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import metrics as kmetrics
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model


# Bert
import transformers
from transformers import *

os.environ["TF_KERAS"]='1'

In [None]:
print(tf.__version__)
print(tensorflow.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print(tf.test.is_built_with_cuda())

In [None]:
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [None]:
def feature_USE_fct(sentences, b_size) :
    batch_size = b_size
    time1 = time.time()

    for step in range(len(sentences)//batch_size) :
        idx = step*batch_size
        feat = embed(sentences[idx:idx+batch_size])

        if step ==0 :
            features = feat
        else :
            features = np.concatenate((features,feat))

    time2 = np.round(time.time() - time1,0)
    return features

In [None]:
batch_size = 10
sentences = data_T['sentence_dl'].to_list()

In [None]:
features_USE = feature_USE_fct(sentences, batch_size)

In [None]:
ARI, X_tsne, labels = ARI_fct(features_USE)

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)