# Table of Contents


### <a href="#C1"> **Part 1: Context and Objectives**</a>

 - Context
 - Objectives

<b><hr></b>

### <a href="#C2"> **Part 2: Setting up the workspace**</a>
 - <a href="#C21"> Package imports</a>
 - <a href="#C22"> Working directory</a>
 - <a href="#C23"> Changing the current directory</a>
 - <a href="#C24"> Importing the DataFrames</a>

<b><hr></b>

### <a href="#C3"> **Part 3: Importing the DataFrames**</a>
 - <a href="#C31"> 3.1 Display</a>
 - <a href="#C32"> 3.2 DataFrame structure</a>

<b><hr></b>
 
### <a href="#C4"> **Part 4: Feature Extraction**</a>
 - <a href="#C41"> 4.1 Bag of Words</a>
     - <a href="#C411"> 4.1.1 CountVectorizer</a>
     - <a href="#C412"> 4.1.2 TF-IDF</a>
<br><br>
 - <a href="#C42"> 4.2 Classical Word/Sentence Embeddings</a>
      - <a href="#C421"> 4.2.1 Word2Vec</a>
      - <a href="#C422"> 4.2.2 GloVe</a>
      - <a href="#C423"> 4.2.3 FastText</a>
      - <a href="#C424"> 4.2.4 Other topic models</a>
<br><br>
 - <a href="#C43"> 4.3 Word/Sentence Embeddings</a>
      - <a href="#C431"> 4.3.1 BERT</a>
      - <a href="#C432"> 4.3.2 Universal Sentence Encoder</a>
<br>

 
<b><hr></b>

### <a href="#C5"> **Part 5: Conclusion**</a>


# <a name="C1">**Part 1: Context and Objectives**</a> 

Context

Objectives

In [1]:
from platform import python_version

print(python_version())

3.6.13


# <a name="C2"><font color='blue'>**Part 2: Setting up the workspace**</font></a> 

### <a name="C21"><font color='blue'>2.1 Package imports</font></a> 

###### <b><font color='blue'>2.1.0 Requirements</font></b>
- <b>Built-in</b>       : os, warnings, re
- <b>Data</b>           : pandas, numpy
- <b>Visualisations</b> : matplotlib, seaborn, wordcloud, PIL
- <b>NLP</b>            : nltk, spacy
- <b>Preprocessing</b>  : sklearn, scipy

###### <b><font color='blue'>2.1.1 Imports</font></b>

In [2]:
# Built-in
import os, warnings 

# Data
import numpy as np
import pandas as pd

#Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from PIL import Image
#%matplotlib inline

# NLP
import nltk  # note: run nltk.download('punkt') once before using word_tokenize
from nltk.stem.snowball import EnglishStemmer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize, wordpunct_tokenize, RegexpTokenizer
from nltk.corpus import words, stopwords

import spacy
import re

#cluster
from sklearn.cluster import KMeans,MiniBatchKMeans, DBSCAN

#metrics
from sklearn.metrics import (silhouette_samples,silhouette_score, adjusted_rand_score,
                             adjusted_mutual_info_score,confusion_matrix, pair_confusion_matrix,
                            ConfusionMatrixDisplay)

###### <b><font color='blue'>2.1.2 Downloads and Options</font></b>

In [3]:
# psutil.cpu_count returns the number of CPUs in the system: physical cores with logical=False, logical cores with logical=True.
import psutil
print("The number of physical cores in the system is %s" % (psutil.cpu_count(logical=False),))
print("The number of logical cores in the system is %s" % (psutil.cpu_count(logical=True),))

The number of physical cores in the system is 4
The number of logical cores in the system is 8


In [4]:
# init sns
sns.set()

### <a name="C22"><font color='blue'>2.2 Working directory</font></a> 

In [5]:
os.listdir()

['.git',
 '.gitignore',
 '.ipynb_checkpoints',
 'ancien',
 'API rapidPI.ipynb',
 'cc.en.100.bin',
 'cc.en.300.bin',
 'cc.en.300.bin.gz',
 'data',
 'EDA.ipynb',
 'Feature_extraction_faisaibilité-Image.ipynb',
 'Feature_extraction_faisaibilité-Text.ipynb',
 'Feature_extraction_faisaibilité-Transfert learning.ipynb',
 'model1_best_weights.h5',
 'README.md',
 "Étude de la faisabilité d'un moteur de classification.pptx"]

# <a name="C3"><font color='teal'>**Part 3: Importing the DataFrames**</font></a> 

In [6]:
import pickle
try:
    with open('data/cleaned/description_cleaned_spacy.pkl', 'rb') as f1:
        df = pickle.load(f1)
except Exception:
    # fall back to the CSV export if the pickle is missing or cannot be read
    df = pd.read_csv('data/cleaned/description_cleaned_spacy.csv')


In [7]:
df.head()

Unnamed: 0,uniq_id,image,description,description_clean,cat_1
0,55b85ea15a1536d46b7190ad6fff8ce7,55b85ea15a1536d46b7190ad6fff8ce7.jpg,Key Features of Elegance Polyester Multicolor ...,"['key', 'feature', 'elegance', 'polyester', 'm...",home furnishing
1,7b72c92c2f6c40268628ec5f14c6d590,7b72c92c2f6c40268628ec5f14c6d590.jpg,Specifications of Sathiyas Cotton Bath Towel (...,"['specification', 'sathiyas', 'cotton', 'bath'...",baby care
2,64d5d4a258243731dc7bbb1eef49ad74,64d5d4a258243731dc7bbb1eef49ad74.jpg,Key Features of Eurospa Cotton Terry Face Towe...,"['key', 'feature', 'eurospa', 'cotton', 'terry...",baby care
3,d4684dcdc759dd9cdf41504698d737d8,d4684dcdc759dd9cdf41504698d737d8.jpg,Key Features of SANTOSH ROYAL FASHION Cotton P...,"['key', 'feature', 'santosh', 'royal', 'fashio...",home furnishing
4,6325b6870c54cd47be6ebfbffa620ec7,6325b6870c54cd47be6ebfbffa620ec7.jpg,Key Features of Jaipur Print Cotton Floral Kin...,"['key', 'feature', 'jaipur', 'print', 'cotton'...",home furnishing


In [8]:
print("Imported dataframe dimensions: {0} rows, {1} columns".format(df.shape[0],df.shape[1]) )

Imported dataframe dimensions: 1050 rows, 5 columns


### <a name="C31"><font color='teal'>3.1 Display</font></a>

In [9]:
df.head(2)

Unnamed: 0,uniq_id,image,description,description_clean,cat_1
0,55b85ea15a1536d46b7190ad6fff8ce7,55b85ea15a1536d46b7190ad6fff8ce7.jpg,Key Features of Elegance Polyester Multicolor ...,"['key', 'feature', 'elegance', 'polyester', 'm...",home furnishing
1,7b72c92c2f6c40268628ec5f14c6d590,7b72c92c2f6c40268628ec5f14c6d590.jpg,Specifications of Sathiyas Cotton Bath Towel (...,"['specification', 'sathiyas', 'cotton', 'bath'...",baby care


In [10]:
df.sample(2)

Unnamed: 0,uniq_id,image,description,description_clean,cat_1
714,6acca991d2353781779b866e4f96edd9,6acca991d2353781779b866e4f96edd9.jpg,"Buy Home Originals Abstract, Checkered Double ...","['home', 'original', 'abstract', 'checker', 'd...",home furnishing
226,f39a2cce8929f5b44087d688995994e4,f39a2cce8929f5b44087d688995994e4.jpg,Buy Tiedribbons We Love Mom With Green Backgro...,"['tiedribbon', 'love', 'mom', 'green', 'backgr...",home decor & festive needs


### <a name="C32"><font color='teal'>3.2 DataFrame structure</font></a> 

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1050 entries, 0 to 1049
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   uniq_id            1050 non-null   object
 1   image              1050 non-null   object
 2   description        1050 non-null   object
 3   description_clean  1050 non-null   object
 4   cat_1              1050 non-null   object
dtypes: object(5)
memory usage: 41.1+ KB


In [12]:
# Inspect the Python type of the first value in each column
df.apply(lambda x:type(x[0]))

uniq_id              <class 'str'>
image                <class 'str'>
description          <class 'str'>
description_clean    <class 'str'>
cat_1                <class 'str'>
dtype: object
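
The type check above reports `description_clean` as `str`, which suggests that each cell holds the text representation of a token list (as in the `df.head()` preview) rather than a real list. Code that later iterates over the tokens of such a cell would then walk over single characters. Below is a minimal sketch of converting these strings back to lists with the standard library, shown on made-up rows; the names `parse_token_list` and `raw_rows` are illustrative and not part of the notebook.

```python
import ast

def parse_token_list(cell):
    """Turn "['a', 'b']" into ['a', 'b']; real lists are returned as lists."""
    if isinstance(cell, str):
        # literal_eval only evaluates Python literals, unlike eval()
        return ast.literal_eval(cell)
    return list(cell)

# Made-up rows mimicking the stringified column
raw_rows = ["['key', 'feature', 'cotton']", "['bath', 'towel']"]
parsed_rows = [parse_token_list(cell) for cell in raw_rows]
print(parsed_rows)
```

On the real dataframe this would presumably be `df['description_clean'] = df['description_clean'].apply(parse_token_list)`, assuming every cell is a valid Python list literal.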

In [13]:
%whos

Variable                     Type                Data/Info
----------------------------------------------------------
ConfusionMatrixDisplay       type                <class 'sklearn.metrics._<...>.ConfusionMatrixDisplay'>
DBSCAN                       type                <class 'sklearn.cluster._dbscan.DBSCAN'>
EnglishStemmer               ABCMeta             <class 'nltk.stem.snowball.EnglishStemmer'>
Image                        module              <module 'PIL.Image' from <...>packages\\PIL\\Image.py'>
KMeans                       type                <class 'sklearn.cluster._kmeans.KMeans'>
MiniBatchKMeans              type                <class 'sklearn.cluster._kmeans.MiniBatchKMeans'>
PorterStemmer                ABCMeta             <class 'nltk.stem.porter.PorterStemmer'>
RegexpTokenizer              ABCMeta             <class 'nltk.tokenize.regexp.RegexpTokenizer'>
WordCloud                    type                <class 'wordcloud.wordcloud.WordCloud'>
WordNetLemmatizer        

<br>

In [14]:
df.description[843]

'Buy Epresent Mfan 1 Fan USB USB Fan for Rs.219 online. Epresent Mfan 1 Fan USB USB Fan at best prices with FREE shipping & cash on delivery. Only Genuine Products. 30 Day Replacement Guarantee.'

# <a name="C4"><font color='green'>**Part 4: Feature Extraction**</font></a> 

## Utility functions

In [15]:
def vectorize(list_of_docs, model):
    """Generate vectors for list of documents using a Word Embedding

    Args:
        list_of_docs: List of documents
        model: Gensim's Word Embedding

    Returns:
        List of document vectors
    """
    features = []

    for tokens in list_of_docs:
        zero_vector = np.zeros(model.vector_size)
        vectors = []
        for token in tokens:
            if token in model:
                try:
                    vectors.append(model[token])
                except KeyError:
                    continue
        if vectors:
            vectors = np.asarray(vectors)
            avg_vec = vectors.mean(axis=0)
            features.append(avg_vec)
        else:
            features.append(zero_vector)
    return features
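
To make the averaging done by `vectorize` concrete, here is a small self-contained sketch with a toy embedding. `ToyEmbedding` is a hypothetical stand-in for a gensim word-embedding object (it only mimics `vector_size`, `in` and `[]`), `mean_pool` is an illustrative name, and the vectors are invented.

```python
import numpy as np

class ToyEmbedding:
    """Hypothetical stand-in for a word-embedding lookup (not gensim)."""
    def __init__(self, table):
        self.table = {w: np.asarray(v, dtype=float) for w, v in table.items()}
        self.vector_size = len(next(iter(self.table.values())))

    def __contains__(self, token):
        return token in self.table

    def __getitem__(self, token):
        return self.table[token]

def mean_pool(tokens, model):
    """Average the vectors of known tokens; zero vector if none is known."""
    vectors = [model[t] for t in tokens if t in model]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

toy = ToyEmbedding({"cotton": [1.0, 0.0], "towel": [0.0, 1.0]})
doc_vector = mean_pool(["cotton", "towel", "unknownword"], toy)
empty_vector = mean_pool(["unknownword"], toy)
print(doc_vector)    # average of the two known vectors
print(empty_vector)  # fallback zero vector
```

Unknown tokens are simply skipped, so a document made only of out-of-vocabulary words collapses to the zero vector.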

In [16]:
def mbkmeans_clusters(
    X, 
    k, 
    # mb, 
    print_silhouette_values, 
):
    """Generate clusters and print Silhouette metrics using KMeans

    Despite its name, this function fits a plain KMeans, not MiniBatchKMeans.

    Args:
        X: Matrix of features.
        k: Number of clusters.
        print_silhouette_values: Print silhouette values per cluster.

    Returns:
        Trained clustering model and labels based on X.
    """
    km = KMeans(n_clusters=k, init = 'k-means++',n_init=10).fit(X)
    print(f"For n_clusters = {k}")
    print(f"Silhouette coefficient: {silhouette_score(X, km.labels_):0.2f}")
    print(f"Inertia:{km.inertia_}")

    if print_silhouette_values:
        sample_silhouette_values = silhouette_samples(X, km.labels_)
        print(f"Silhouette values:")
        silhouette_values = []
        for i in range(k):
            cluster_silhouette_values = sample_silhouette_values[km.labels_ == i]
            silhouette_values.append(
                (
                    i,
                    cluster_silhouette_values.shape[0],
                    cluster_silhouette_values.mean(),
                    cluster_silhouette_values.min(),
                    cluster_silhouette_values.max(),
                )
            )
        silhouette_values = sorted(
            silhouette_values, key=lambda tup: tup[2], reverse=True
        )
        for s in silhouette_values:
            print(
                f"    Cluster {s[0]}: Size:{s[1]} | Avg:{s[2]:.2f} | Min:{s[3]:.2f} | Max: {s[4]:.2f}"
            )
    return km, km.labels_

In [17]:
def ARI_fct_raw(x_label_clust,x_label_true):
    ARI = np.round(adjusted_rand_score(x_label_clust, x_label_true),4)
    print("ARI : ", ARI)
    return(ARI)
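
The adjusted Rand index compares two partitions of the same samples, not the label names themselves, which is why cluster ids can be scored directly against category strings. The notebook relies on `adjusted_rand_score` from scikit-learn; the pure-Python sketch below only illustrates the usual contingency-table formula, and the name `adjusted_rand_index` is illustrative.

```python
from collections import Counter

def n_pairs(count):
    """Number of unordered pairs that can be drawn from `count` items."""
    return count * (count - 1) // 2

def adjusted_rand_index(labels_a, labels_b):
    """ARI computed from the contingency table of two labelings."""
    total_pairs = n_pairs(len(labels_a))
    sum_cells = sum(n_pairs(c) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(n_pairs(c) for c in Counter(labels_a).values())
    sum_b = sum(n_pairs(c) for c in Counter(labels_b).values())
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to compare
    expected = sum_a * sum_b / total_pairs
    maximum = 0.5 * (sum_a + sum_b)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (maximum - expected)

# Same partition with different label names: perfect agreement
same_partition = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
# Partial agreement
partial = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
print(same_partition, round(partial, 4))
```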

In [18]:
import time
from sklearn import cluster, metrics
from sklearn import manifold, decomposition

# Compute the t-SNE projection, fit the clusters and compute the ARI between true categories and cluster ids
def ARI_fct(features) :
    time1 = time.time()
    num_labels=len(l_cat)
    tsne = manifold.TSNE(n_components=2, perplexity=30, n_iter=2000, 
                                 init='random', learning_rate=200, random_state=42)
    X_tsne = tsne.fit_transform(features)
    
    # Fit the clusters on the data after the t-SNE projection
    cls = cluster.KMeans(n_clusters=num_labels, n_init=100, random_state=42)
    cls.fit(X_tsne)
    ARI = np.round(metrics.adjusted_rand_score(y_cat_num, cls.labels_),4)
    time2 = np.round(time.time() - time1,0)
    print("ARI : ", ARI, "time : ", time2)
    
    return ARI, X_tsne, cls.labels_


# Visualize the t-SNE projection coloured by true category and by cluster
def TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI) :
    fig = plt.figure(figsize=(20,12))
    
    ax = fig.add_subplot(121)
    scatter = ax.scatter(X_tsne[:,0],X_tsne[:,1], c=y_cat_num, cmap='Set1')
    ax.legend(handles=scatter.legend_elements()[0], labels=l_cat, loc="best", title="Category")
    plt.title('Products by true category')
    
    ax = fig.add_subplot(122)
    scatter = ax.scatter(X_tsne[:,0],X_tsne[:,1], c=labels, cmap='Set1')
    # sorted() keeps the legend entries in the same ascending order as the handles
    ax.legend(handles=scatter.legend_elements()[0], labels=sorted(set(labels)), loc="best", title="Clusters")
    plt.title('Products by cluster')
    
    plt.show()
    print("ARI : ", ARI)


In [19]:
def plot_confusion_matrix( y_true, y_pred, class_labels=None, display_labels=None, sample_weight=None,
                            normalize_f='true',        # --> option {'true', 'pred', 'all', None}
                            cmap='Blues',
                            ax=None,
                            title=None,
                            show_values='all',      # --> option {'all', 'true', 'pred'}
                            show_colorbar=True,     # --> option {True, False}
                            show_subtotals=True):   # --> option {True, False}
    # note: ConfusionMatrixDisplay.from_predictions requires scikit-learn >= 1.0
    fig = plt.figure(figsize=(15,6))
    ax1 = fig.add_subplot(111)
    
    disp = ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                                                   xticks_rotation='vertical',
                                                   normalize=normalize_f, 
                                                   display_labels=class_labels, cmap=cmap,
                                                  ax=ax1)
    # the default title reflects whether the counts are normalized
    if title is None:
        title = 'Confusion matrix, without normalization' if normalize_f is None else f'Confusion matrix, normalize={normalize_f}'
    disp.ax_.set_title(title)
    # predictions are cluster ids rather than categories, so the x axis shows the 7 ids
    disp.ax_.set_xticklabels(range(7),rotation=0)

### Useful parameters

In [20]:
# note: set() gives no guaranteed order, so the category -> index mapping may differ between sessions
l_cat = list(set(df['cat_1']))
print("categories: ", l_cat)
y_cat_num = [(l_cat.index(df.iloc[i]['cat_1'])) for i in range(len(df))]

categories:  ['computers', 'home furnishing', 'home decor & festive needs', 'beauty and personal care', 'baby care', 'watches', 'kitchen & dining']


In [21]:
set(y_cat_num)

{0, 1, 2, 3, 4, 5, 6}
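
Because `list(set(...))` does not guarantee an order for strings, the integer assigned to each category can change from one Python session to the next. The sketch below shows one way to get a stable mapping by sorting the unique values first; the sample `categories` list is made up and the variable names are illustrative.

```python
categories = ["watches", "baby care", "computers", "baby care", "watches"]  # made-up sample

# Sorting the unique values fixes the category -> integer mapping across runs
label_names = sorted(set(categories))
label_index = {name: i for i, name in enumerate(label_names)}
encoded = [label_index[c] for c in categories]

print(label_names)
print(encoded)
```

A dictionary lookup also avoids calling `list.index` once per row, which matters little at 1050 rows but scales better.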

### <a name="C41"><font color='green'>4.1 Bag Of Words </font></a> 

### - <a name="C411"><font color='green'>4.1.a Unigram word counts </font></a> 

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

#class LemmaTokenizer(object):
#    def __init__(self):
#        self.wnl = WordNetLemmatizer()
#    def __call__(self, articles):
#        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

def LemmaTokenizer(articles):
    wnl = WordNetLemmatizer()
    return [wnl.lemmatize(t.lower()) for t in word_tokenize(articles)]

nlp = spacy.load('en_core_web_sm')

vectorizer = CountVectorizer(input='content', # the input is expected to be a sequence of items that can be of type string or byte
                             encoding='utf-8', 
                             decode_error='replace', # what to do when a byte sequence to analyse contains 
                             # characters that do not belong to the given "encoding"
                             strip_accents='unicode', # Remove accents and perform other character normalization during the preprocessing step.
                             lowercase=True, # Convert all characters to lowercase before tokenizing
                             preprocessor=None, 
                             tokenizer=None, 
                             #stop_words='english',            
                             token_pattern = r"\b[a-zA-Z]{3,}\b", #ascii only alpha
                             #token_pattern = r"(?u)\b[a-zA-Z]+\b", #ascii only alpha
                             ngram_range=(1, 1), 
                             analyzer='word', 
                             max_df=351, 
                             min_df=3, #0.0013, # < 1.5/1050
                             max_features=None,
                             vocabulary=None, 
                             binary=False )

vecX = vectorizer.fit_transform( df.description.apply(lambda x: " ".join([token.lemma_ for token in nlp(x.lower())]) ) )

# get_feature_names_out() was added in scikit-learn 1.0; older versions only provide get_feature_names()
if hasattr(vectorizer, 'get_feature_names_out'):
    feature_names = vectorizer.get_feature_names_out()
else:
    feature_names = vectorizer.get_feature_names()
CountWord = pd.DataFrame( vecX.toarray(), columns=feature_names )

print("The dataset has {} rows and {} columns".format(CountWord.shape[0],CountWord.shape[1]) )
CountWord.head()

In [None]:
# MiniBatchKMeans is not useful when the data has fewer than 2000 samples
clusteringCountVec, CountVeccluster_labels = mbkmeans_clusters(
    X=CountWord,
    k=7,
    # mb=1024, # power of 2
    print_silhouette_values=True,
)
print('\n')
print('ARI score before dimensionality reduction')
_ = ARI_fct_raw(CountVeccluster_labels,df.cat_1)

In [None]:
print("Most representative terms per cluster (based on centroids):")
for i in range(7):
    tokens_per_cluster = ""
    # argsort is ascending, so reverse it to keep the 5 terms with the largest centroid weights
    most_representative = CountWord.columns[np.argsort( clusteringCountVec.cluster_centers_[i] )[::-1][:5]]
    for t in most_representative:
        tokens_per_cluster += f"{t} "
    print(f"Cluster {i}: {tokens_per_cluster}")
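
When ranking terms by centroid weight, note that `np.argsort` sorts in ascending order, so the direction of the slice decides whether the strongest or the weakest terms are listed. A small sketch with made-up terms and weights:

```python
import numpy as np

vocabulary = np.array(["bath", "cotton", "mug", "towel", "watch"])  # made-up terms
centroid = np.array([0.10, 0.70, 0.05, 0.90, 0.00])                 # made-up weights

ascending = np.argsort(centroid)         # indices from smallest to largest weight
top3 = vocabulary[ascending[::-1][:3]]   # reverse first, then keep the 3 largest
bottom3 = vocabulary[ascending[:3]]      # slicing without reversing keeps the 3 smallest
print(top3, bottom3)
```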

In [None]:
print('ARI score after 2D t-SNE + clustering')
ARI, X_tsne, labels = ARI_fct(CountWord)

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

<br>

### - <a name="C412"><font color='green'>4.1.b TF-IDF </font></a> 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(
    input='content', # the input is expected to be a sequence of items that can be of type string or byte
    encoding='utf-8', 
    decode_error='replace', 
    strip_accents='ascii', # Remove accents and perform other character normalization during the preprocessing step.
    lowercase=True, # Convert all characters to lowercase before tokenizing
    preprocessor=None, 
    stop_words="english", 
    token_pattern = r"\b[a-zA-Z]{3,}\b", #ascii only alpha
    #tokenizer = RegexpTokenizer(r"[a-zA-Z]{3,}").tokenize,
    tokenizer=None,
    ngram_range=(1, 1), 
    analyzer='word', 
    max_df=351, 
    min_df=3, # < 1.5/1050
    max_features=None,
    vocabulary=None, binary=False ,
    smooth_idf = True
    )
tfidf_values = tfidf.fit_transform( df.description.apply(lambda x: " ".join([token.lemma_ for token in nlp(x.lower())]) ) )

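# get_feature_names_out() requires scikit-learn >= 1.0; fall back to get_feature_names() on older versions
tfidf_feature_names = tfidf.get_feature_names_out() if hasattr(tfidf, 'get_feature_names_out') else tfidf.get_feature_names()
TFIDF_df = pd.DataFrame(tfidf_values.toarray(),columns=tfidf_feature_names)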

TFIDF_df.head()
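
To show what the weights in this matrix represent, here is a worked sketch of the weighting as I understand scikit-learn's documented behaviour with `smooth_idf=True` and the default L2 normalization: raw term counts are multiplied by `ln((1 + n) / (1 + df)) + 1`, then each row is scaled to unit length. The documents below are made up and the helper name `tfidf_row` is illustrative.

```python
import math

docs = [["cotton", "towel", "cotton"], ["cotton", "mug"], ["watch"]]  # made-up tokenized documents
vocab = sorted({t for d in docs for t in d})
n_docs = len(docs)

# document frequency: number of documents containing each term
df_counts = {t: sum(t in d for d in docs) for t in vocab}
# smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Raw count times idf for every vocabulary term, then L2 normalization."""
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(d) for d in docs]
for row in matrix:
    print([round(v, 3) for v in row])
```

A term that appears in many documents, such as `cotton` here, gets a smaller idf than a term found in a single document.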

In [None]:
# MiniBatchKMeans is not useful when the data has fewer than 2000 samples
clusteringTFIDF, TFIDFcluster_labels = mbkmeans_clusters(
    X=TFIDF_df,
    k=7,
    # mb=1024, # power of 2
    print_silhouette_values=True,
)
print('\n')
print('ARI score before dimensionality reduction')
_ = ARI_fct_raw(TFIDFcluster_labels,df.cat_1)

In [None]:
print("Most representative terms per cluster (based on centroids):")
for i in range(7):
    tokens_per_cluster = ""
    # use the TF-IDF vocabulary (not CountWord's) and reverse argsort to keep the 5 largest weights
    most_representative = TFIDF_df.columns[np.argsort( clusteringTFIDF.cluster_centers_[i] )[::-1][:5]]
    for t in most_representative:
        tokens_per_cluster += f"{t} "
    print(f"Cluster {i}: {tokens_per_cluster}")

In [None]:
print('ARI score after 2D t-SNE + clustering')
ARI, X_tsne, labels = ARI_fct(TFIDF_df)

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

<br>

### <a name="C42"><font color='green'>4.2 Word/Sentence embedding </font></a> 

### - <a name="C421"><font color='green'>4.2.a Word2Vec </font></a> 

### <font color='green'> Locally trained model </font>

In [None]:
from gensim.models import Word2Vec

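print("Training the Word2Vec model...")
# Passing `sentences` to the constructor already builds the vocabulary and trains for the default number of epochs;
# the train() call below then continues training for 50 more epochs.
model_W2V = Word2Vec(sentences=df.description_clean, vector_size =100, window=5, min_count=1, workers=4)
model_W2V.train(df.description_clean, total_examples=len(df.description_clean), epochs=50)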

In [None]:
help(Word2Vec)

In [None]:
vectorized_docs = vectorize(df.description_clean, model=model_W2V.wv)

# MiniBatchKMeans is not useful when the data has fewer than 2000 samples
clusteringW2V, W2Vcluster_labels = mbkmeans_clusters(
    X=vectorized_docs,
    k=7,
    # mb=1024, # power of 2
    print_silhouette_values=True,
)
print('\n')
print('ARI score before dimensionality reduction')
_ = ARI_fct_raw(W2Vcluster_labels,df.cat_1)

In [None]:
print("Most representative terms per cluster (based on centroids):")
for i in range(7):
    tokens_per_cluster = ""
    most_representative = model_W2V.wv.most_similar(positive=[clusteringW2V.cluster_centers_[i]], topn=10)
    for t in most_representative:
        tokens_per_cluster += f"{t[0]} "
    print(f"Cluster {i}: {tokens_per_cluster}")

In [None]:
len(TFIDF_df)

In [None]:
print('ARI score after 2D t-SNE + clustering')
ARI, X_tsne, labels = ARI_fct(pd.DataFrame(vectorized_docs))

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

<br>

### <font color='green'> Google's pre-trained W2V model </font>

In [None]:
%%time
from gensim.models import KeyedVectors
modelpw2wgoogle = KeyedVectors.load_word2vec_format('data/support/Word2Vec/GoogleNews-vectors-negative300.bin', binary=True)

In [None]:
vectorized_docs = vectorize(df.description_clean, model=modelpw2wgoogle)

# MiniBatchKMeans is not useful when the data has fewer than 2000 samples
clusteringW2Vpre, W2Vprecluster_labels = mbkmeans_clusters(
    X=vectorized_docs,
    k=7,
    # mb=1024, # power of 2
    print_silhouette_values=True,
)
print('\n')
print('ARI score before dimensionality reduction')
_ = ARI_fct_raw(W2Vprecluster_labels,df.cat_1)

In [None]:
print('ARI score after 2D t-SNE + clustering')
ARI, X_tsne, labels = ARI_fct(pd.DataFrame(vectorized_docs))

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
del modelpw2wgoogle

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

<br>

### - <a name="C422"><font color='green'>4.2.b GloVe </font></a> 

In [None]:
print('Indexing word vectors.')

emmbed_dict = {}
# the context manager closes the file even if parsing fails
with open('data/support/glove.6B/glove.6B.200d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        emmbed_dict[word] = coefs

print('Found %s word vectors.' % len(emmbed_dict))

In [None]:
from scipy import spatial
def find_similar_word(emmbedes):
    nearest = sorted(emmbed_dict.keys(), key=lambda word: spatial.distance.euclidean(emmbed_dict[word], emmbedes))
    return nearest

find_similar_word(emmbed_dict['river'])[0:10]

In [None]:
def find_closest_embeddings(embedding):
    return sorted(emmbed_dict.keys(), key=lambda word: spatial.distance.euclidean(emmbed_dict[word], embedding))

In [None]:
print(find_closest_embeddings(emmbed_dict["king"])[1:6])

In [None]:
print(find_closest_embeddings(
    emmbed_dict["king"] - emmbed_dict["man"] + emmbed_dict["woman"]
)[:5])
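
The `king - man + woman` query above can be illustrated on a tiny hand-made vocabulary where the expected answer is easy to check. The 2-d vectors below are invented for illustration (they are not GloVe vectors), and `closest_words` is an illustrative re-statement of the nearest-neighbour search by Euclidean distance.

```python
import numpy as np

toy_vectors = {   # invented 2-d vectors: axis 0 ~ royalty, axis 1 ~ gender
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "mug":   np.array([-1.0, 0.0]),
}

def closest_words(query, vectors):
    """Vocabulary sorted by Euclidean distance to `query` (nearest first)."""
    return sorted(vectors, key=lambda w: np.linalg.norm(vectors[w] - query))

query = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
ranking = closest_words(query, toy_vectors)
print(ranking[:2])
```

Sorting the whole vocabulary for every query is fine for a toy dictionary, but it is likely to be slow on a full embedding file with hundreds of thousands of words.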

In [None]:
# Convert the GloVe text-format vectors to the word2vec text format:
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="data/support/glove.6B/glove.6B.300d.txt", 
               word2vec_output_file="data/support/glove.6B/gensim_glove_vectors.txt")

# Load the GloVe model
from gensim.models import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format('data/support/glove.6B/gensim_glove_vectors.txt', binary=False)

In [None]:
vectorized_docs = vectorize(df.description_clean, model=glove_model)

# MiniBatchKMeans is not useful when the data has fewer than 2000 samples
clusteringGlove, Glovecluster_labels = mbkmeans_clusters(
    X=vectorized_docs,
    k=7,
    # mb=1024, # power of 2
    print_silhouette_values=True,
)
print('\n')
print('ARI score before dimensionality reduction')
_ = ARI_fct_raw(Glovecluster_labels,df.cat_1)

In [None]:
print('ARI score after 2D t-SNE + clustering')
ARI, X_tsne, labels = ARI_fct(pd.DataFrame(vectorized_docs))

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

<br>

### - <a name="C423"><font color='green'>4.2.c FastText </font></a> 

#####  <font color='green'>4.2.c.1 Pre-trained FastText model </font>

In [None]:
import gensim
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Load pre-trained FastText model
model_path = 'cc.en.100.bin'
model = gensim.models.fasttext.load_facebook_vectors(model_path)

In [None]:
# Define Document Vectors list from Word Embedding
vectorized_docs = vectorize(df.description_clean, model=model)
len(vectorized_docs), len(vectorized_docs[0])

In [None]:
vectorized_docs = vectorize(df.description_clean, model=model)

# MiniBatchKMeans is not useful when the data has fewer than 2000 samples
clusteringFastText, FastTextcluster_labels = mbkmeans_clusters(
    X=vectorized_docs,
    k=7,
    # mb=1024, # power of 2
    print_silhouette_values=True,
)
print('\n')
print('ARI score before dimensionality reduction')
_ = ARI_fct_raw(FastTextcluster_labels,df.cat_1)

In [None]:
print('ARI score after 2D t-SNE + clustering')
ARI, X_tsne, labels = ARI_fct(pd.DataFrame(vectorized_docs))

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
del model

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

#####  <font color='green'>4.2.c.2 Locally trained FastText model </font>

In [None]:
len(df.description_clean)

In [None]:
from gensim.models.fasttext import FastText
model = FastText(vector_size=100, window=3)
model.build_vocab(corpus_iterable=df.description_clean)  # scan over corpus to build the vocabulary

total_words = model.corpus_total_words  # number of words in the corpus
model.train(corpus_iterable=df.description_clean, total_words=total_words, epochs=25)

In [None]:
help(FastText.train)

In [None]:
vectorized_docs = vectorize(df.description_clean, model=model.wv)

# MiniBatchKMeans is not useful when the data has fewer than 2000 samples
clusteringFastText, FastTextcluster_labels = mbkmeans_clusters(
    X=vectorized_docs,
    k=7,
    # mb=1024, # power of 2
    print_silhouette_values=True,
)
print('\n')
print('ARI score before dimensionality reduction')
_ = ARI_fct_raw(FastTextcluster_labels,df.cat_1)

In [None]:
print('ARI score after 2D t-SNE + clustering')
ARI, X_tsne, labels = ARI_fct(pd.DataFrame(vectorized_docs))

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

<br>

### <a name="C424"><font color='green'>4.2.d Topic modelling with unsupervised methods </font></a> 

#### <font color='green'> Latent Dirichlet Allocation (LDA) </font>

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
n_topics = 100

# Create the LDA model
lda = LatentDirichletAllocation(
        n_components=n_topics, 
        max_iter=5, 
        learning_method='online', 
        learning_offset=50.,
        random_state=0)

# Fit on the data
lda.fit(CountWord)

In [None]:
print('ARI score after 2D t-SNE + clustering')
ARI, X_tsne, labels = ARI_fct(pd.DataFrame(lda.transform(CountWord)))
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

#### <font color='green'> NMF (Non-negative Matrix Factorization) </font>

In [None]:
from sklearn.decomposition import NMF

# NMF is able to use tf-idf
no_topics = 100

# Run NMF
nmf = NMF(n_components=no_topics, random_state=1,  l1_ratio=.5, init='nndsvd')
nmf.fit(tfidf_values)

no_top_words = 10
#display_topics(nmf, TFIDF_df.columns, no_top_words)
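
The commented-out call above refers to a `display_topics` helper that is not defined in this notebook. A hypothetical version might look like the sketch below, assuming the model exposes a `components_` matrix of shape (topics, terms), as scikit-learn's NMF and LDA estimators do. It is demonstrated on a stand-in object with made-up weights rather than on the fitted `nmf`.

```python
import numpy as np
from types import SimpleNamespace

def display_topics(model, feature_names, no_top_words):
    """Print and return, for each topic, the terms with the largest weights."""
    topics = []
    for topic_idx, weights in enumerate(model.components_):
        # argsort is ascending, so reverse it before keeping the top terms
        top_idx = np.argsort(weights)[::-1][:no_top_words]
        top_terms = [feature_names[i] for i in top_idx]
        print(f"Topic {topic_idx}: {' '.join(top_terms)}")
        topics.append(top_terms)
    return topics

# Stand-in object with a made-up components_ matrix (2 topics x 4 terms)
fake_model = SimpleNamespace(components_=np.array([[0.1, 0.9, 0.0, 0.5],
                                                   [0.8, 0.0, 0.3, 0.1]]))
topics = display_topics(fake_model, ["bath", "cotton", "mug", "towel"], 2)
```

With the notebook's objects the call would presumably be `display_topics(nmf, TFIDF_df.columns, no_top_words)`, as in the commented line.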

In [None]:
print('ARI score after 2D t-SNE + clustering')
ARI, X_tsne, labels = ARI_fct(pd.DataFrame(nmf.transform(TFIDF_df)))
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

### <a name="C43"><font color='green'>4.3 Word embeddings </font></a> 

### Common functions

In [None]:
def tokenizer_fct(sentence) :
    # print(sentence)
    sentence_clean = sentence.replace('-', ' ').replace('+', ' ').replace('/', ' ').replace('#', ' ')
    word_tokens = word_tokenize(sentence_clean)
    return word_tokens

# Stop words
from nltk.corpus import stopwords
stop_w = list(set(stopwords.words('english'))) + ['[', ']', ',', '.', ':', '?', '(', ')']

def stop_word_filter_fct(list_words) :
    filtered_w = [w for w in list_words if w not in stop_w]
    filtered_w2 = [w for w in filtered_w if len(w) > 2]
    return filtered_w2

# Lowercase the tokens and drop mentions (@...) and URLs (http...)
def lower_start_fct(list_words) :
    lw = [w.lower() for w in list_words if (not w.startswith("@")) 
    #                                   and (not w.startswith("#"))
                                       and (not w.startswith("http"))]
    return lw

# Lemmatizer (reduces a word to its base form)
from nltk.stem import WordNetLemmatizer

def lemma_fct(list_words) :
    lemmatizer = WordNetLemmatizer()
    lem_w = [lemmatizer.lemmatize(w) for w in list_words]
    return lem_w

# Text preparation for bag of words, with lemmatization
def transform_bow_lem_fct(desc_text) :
    word_tokens = tokenizer_fct(desc_text)
    sw = stop_word_filter_fct(word_tokens)
    lw = lower_start_fct(sw)
    lem_w = lemma_fct(lw)    
    transf_desc_text = ' '.join(lem_w)
    return transf_desc_text

# Text preparation for deep learning models (USE and BERT)
def transform_dl_fct(desc_text) :
    word_tokens = tokenizer_fct(desc_text)
#    sw = stop_word_filter_fct(word_tokens)
    lw = lower_start_fct(word_tokens)
    # lem_w = lemma_fct(lw)    
    transf_desc_text = ' '.join(lw)
    return transf_desc_text


df['sentence_bow_lem'] = df['description'].apply(lambda x : transform_bow_lem_fct(x))
df['sentence_dl'] = df['description'].apply(lambda x : transform_dl_fct(x))

<br>

#### <a name="C431"><font color='green'>4.3.a BERT </font></a> 

In [None]:
import tensorflow as tf
# import tensorflow_hub as hub
import tensorflow.keras
from tensorflow.keras import backend as K

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import metrics as kmetrics
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model

# Bert
import os
import transformers
#from transformers import *

os.environ["TF_KERAS"]='1'

In [None]:
print(tf.__version__)
print(tensorflow.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print(tf.test.is_built_with_cuda())

### Common functions

In [None]:
# Prepare the sentences for BERT (token ids, token type ids and attention masks)
def bert_inp_fct(sentences, bert_tokenizer, max_length) :
    input_ids=[]
    token_type_ids = []
    attention_mask=[]
    bert_inp_tot = []

    for sent in sentences:
        bert_inp = bert_tokenizer.encode_plus(sent,
                                              add_special_tokens = True,
                                              max_length = max_length,
                                              padding='max_length',
                                              return_attention_mask = True, 
                                              return_token_type_ids=True,
                                              truncation=True,
                                              return_tensors="tf")
    
        input_ids.append(bert_inp['input_ids'][0])
        token_type_ids.append(bert_inp['token_type_ids'][0])
        attention_mask.append(bert_inp['attention_mask'][0])
        bert_inp_tot.append((bert_inp['input_ids'][0], 
                             bert_inp['token_type_ids'][0], 
                             bert_inp['attention_mask'][0]))

    input_ids = np.asarray(input_ids)
    token_type_ids = np.asarray(token_type_ids)
    attention_mask = np.array(attention_mask)
    
    return input_ids, token_type_ids, attention_mask, bert_inp_tot
    

# Build sentence features by mean-pooling the last hidden states of the model
def feature_BERT_fct(model, model_type, sentences, max_length, b_size, mode='HF') :
    batch_size = b_size
    batch_size_pred = b_size
    bert_tokenizer = transformers.AutoTokenizer.from_pretrained(model_type)
    time1 = time.time()

    # Ceiling division so that a trailing partial batch is processed too
    for step in range(int(np.ceil(len(sentences)/batch_size))) :
        idx = step*batch_size
        input_ids, token_type_ids, attention_mask, bert_inp_tot = bert_inp_fct(sentences[idx:idx+batch_size], 
                                                                      bert_tokenizer, max_length)
        
        if mode=='HF' :    # Bert HuggingFace
            outputs = model.predict([input_ids, attention_mask, token_type_ids], batch_size=batch_size_pred)
            last_hidden_states = outputs.last_hidden_state

        if mode=='TFhub' : # Bert Tensorflow Hub
            text_preprocessed = {"input_word_ids" : input_ids, 
                                 "input_mask" : attention_mask, 
                                 "input_type_ids" : token_type_ids}
            outputs = model(text_preprocessed)
            last_hidden_states = outputs['sequence_output']
             
        if step ==0 :
            last_hidden_states_tot = last_hidden_states
        else :
            last_hidden_states_tot = np.concatenate((last_hidden_states_tot,last_hidden_states))
    
    features_bert = np.array(last_hidden_states_tot).mean(axis=1)
    
    time2 = np.round(time.time() - time1,0)
    print("processing time : ", time2)
     
    return features_bert, last_hidden_states_tot
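
The function above averages the hidden states over all `max_length` positions, so padding tokens also contribute to each sentence vector. A common alternative, shown here only as a sketch and not as the method used in this notebook, is to weight the average by the attention mask. The helper name `masked_mean_pool` and the toy arrays are assumptions made for illustration; in practice the attention masks would have to be collected batch by batch alongside the hidden states.

```python
import numpy as np

def masked_mean_pool(last_hidden_states, attention_mask):
    """Average token vectors over real tokens only (mask == 1)."""
    hidden = np.asarray(last_hidden_states, dtype=np.float32)             # (batch, seq_len, dim)
    mask = np.asarray(attention_mask, dtype=np.float32)[..., np.newaxis]  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)           # padded positions contribute 0
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against an all-zero mask
    return summed / counts

# Toy batch: 2 sentences, 4 token positions, 3-dimensional hidden states
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0],    # last two positions are padding
                 [1, 1, 1, 1]])   # no padding

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
```

When a sentence has no padding, the result matches the plain `.mean(axis=1)`; the two differ only for sentences shorter than `max_length`.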

</Br>

## BERT HuggingFace

### 'bert-base-uncased'

In [None]:
max_length = 64
batch_size = 10
model_type = 'bert-base-uncased'
model = transformers.TFAutoModel.from_pretrained(model_type)
sentences = df['sentence_dl'].to_list()

In [None]:
# Create the features

features_bert, last_hidden_states_tot = feature_BERT_fct(model, model_type, sentences, 
                                                         max_length, batch_size, mode='HF')

In [None]:
ARI, X_tsne, labels = ARI_fct(features_bert)

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

### 'cardiffnlp/twitter-roberta-base-sentiment'
* Model pre-trained on tweets for sentiment analysis, which makes it particularly well suited to the context

In [None]:
max_length = 64
batch_size = 10
model_type = 'cardiffnlp/twitter-roberta-base-sentiment'
model = transformers.TFAutoModel.from_pretrained(model_type)
sentences = df['sentence_dl'].to_list()

In [None]:
features_bert, last_hidden_states_tot = feature_BERT_fct(model, model_type, sentences, 
                                                         max_length, batch_size, mode='HF')

In [None]:
ARI, X_tsne, labels = ARI_fct(features_bert)

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

</Br>

#### <a name="C432"><font color='green'>4.3.b USE - Universal Sentence Encoder </font></a> 

In [None]:
import tensorflow as tf
# import tensorflow_hub as hub
import tensorflow.keras
from tensorflow.keras import backend as K

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import metrics as kmetrics
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model

# Hugging Face transformers (not used by USE itself)
import os
import time
import transformers
#from transformers import *

os.environ["TF_KERAS"]='1'

In [None]:
print(tf.__version__)
print(tensorflow.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print(tf.test.is_built_with_cuda())

In [None]:
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [None]:
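# Build sentence embeddings with the Universal Sentence Encoder, batch by batch
def feature_USE_fct(sentences, b_size) :
    batch_size = b_size
    time1 = time.time()

    # Ceiling division so that a trailing partial batch is embedded too
    for step in range(int(np.ceil(len(sentences)/batch_size))) :
        idx = step*batch_size
        feat = embed(sentences[idx:idx+batch_size])

        if step ==0 :
            features = feat
        else :
            features = np.concatenate((features,feat))

    time2 = np.round(time.time() - time1,0)
    print("processing time : ", time2)
    return features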
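
Features built batch by batch need one row per sentence, otherwise they no longer line up with the labels passed to the evaluation helpers. The sketch below shows one way to iterate over batches so that a shorter final batch is included. `iter_batches` is a hypothetical helper introduced for illustration, not part of the notebook.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, including a shorter final one."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 23 sentences with a batch size of 10: two full batches plus a remainder of 3
sentences = [f"sentence {i}" for i in range(23)]
batches = list(iter_batches(sentences, 10))
sizes = [len(b) for b in batches]
print(sizes)  # [10, 10, 3]
```

A loop bounded by `len(sentences)//batch_size` would stop after the two full batches here and leave the last three sentences without features.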

In [None]:
df.head()

In [None]:
batch_size = 10
sentences = df['sentence_dl'].to_list()

In [None]:
features_USE = feature_USE_fct(sentences, batch_size)

In [None]:
ARI, X_tsne, labels = ARI_fct(features_USE)

In [None]:
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

In [None]:
plot_confusion_matrix(y_cat_num, labels,class_labels=l_cat,normalize_f=None)

</Br>

# <a name="C5"><font color='pink'>**Part 5: Conclusion**</font></a> 

</br>

</br>

</br>