# Project 6: Automatically categorize questions
# <u>B. Topic Modeling</u> <br/>

# Context

To help Stack Overflow users when they submit a question, we need to build a tag suggestion system. To do so, we rely on machine learning techniques that can determine relevant tags from the text entered by the user.

In this notebook we try unsupervised approaches.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from collections import Counter
from time import time

from bs4 import BeautifulSoup
import unicodedata
import re
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize.toktok import ToktokTokenizer
from contractions import CONTRACTION_MAP


from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

%matplotlib inline

# 1. Loading the preprocessed data

We load the data that was previously cleaned and processed.

In [2]:
df = pd.read_csv('cleaned_data.csv')
#replace NaN by empty string
df = df.replace(np.nan, '', regex=True)

In [3]:
df.shape

(64432, 7)

In [4]:
df.head()

Unnamed: 0,TITLE,BODY,SCORE,TAGS,TITLE_P,BODY_P,TAGS_P
0,Java generics variable <T> value,<p>At the moment I am using the following code...,6,<java><generics>,java gener variabl valu,moment use follow code filter jpa reduc block ...,"['java', 'generics']"
1,How a value typed variable is copied when it i...,<blockquote>\n <p>Swift's string type is a va...,6,<swift><function><value-type>,valu type variabl copi pass function hold copi,swift string type valu type creat new string v...,"['swift', 'function', 'value-type']"
2,Error while waiting for device: The emulator p...,<p>I am a freshman for the development of the ...,6,<android><android-studio><android-emulator><avd>,error wait devic emul process avd kill,freshman develop andriod suffer odd question r...,"['android', 'android-studio', 'android-emulato..."
3,gulp-inject not working with gulp-watch,<p>I am using gulp-inject to auto add SASS imp...,10,<javascript><node.js><npm><gulp><gulp-watch>,gulp inject work gulp watch,use gulp inject auto add sass import newli cre...,"['javascript', 'node.js', 'npm', 'gulp', 'gulp..."
4,React - Call function on props change,<p>My TranslationDetail component is passed an...,12,<reactjs><react-router>,react call function prop chang,translationdetail compon pass id upon open bas...,"['reactjs', 'react-router']"


In [5]:
df.dtypes

TITLE      object
BODY       object
SCORE       int64
TAGS       object
TITLE_P    object
BODY_P     object
TAGS_P     object
dtype: object

The TAGS_P column is read as a string. We convert it to a list, which makes later processing simpler.

In [6]:
from ast import literal_eval
df['TAGS_P'] = df['TAGS_P'].apply(literal_eval)
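
As a small illustration of the conversion above, here is a sketch of what `literal_eval` does to a single cell. The sample string is hypothetical and simply mimics how a list of tags looks after a CSV round trip.

```python
from ast import literal_eval

# A list of tags as stored in the CSV file: one plain string.
raw_tags = "['java', 'generics']"

# literal_eval only parses Python literals; it does not run arbitrary code.
parsed_tags = literal_eval(raw_tags)

print(type(raw_tags).__name__, "->", type(parsed_tags).__name__)
print(parsed_tags[0])
```

Because `literal_eval` accepts literals only, it is a safer choice than `eval` for data read from a file.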

# 2. Data preparation

## 2.1 Sampling

We have more than 64,000 posts. We focus on a sample of 20,000 posts.

In [7]:
df_sample = df.sample(20000)

In [8]:
df_sample.shape

(20000, 7)

Let us set aside 5,000 posts to validate our models. <br/>
We use the remaining 15,000 rows for training.

In [9]:
df_learn = df_sample.iloc[5000:, :].copy()
df_validation = df_sample.iloc[:5000, :].copy()

In [10]:
display(df_learn.shape)
display(df_validation.shape)

(15000, 7)

(5000, 7)

**Let us count the number of words in this (training) corpus.**

In [11]:
my_counter = Counter()
for sentence in df_learn['TITLE_P']:
    my_counter.update(sentence.split())
for sentence in df_learn['BODY_P']:
    my_counter.update(sentence.split())
len(my_counter)

89372

=> Our corpus contains nearly **90,000 distinct words.**

## 2.2 Filtering on the most frequent tags

Let us first look at the number of occurrences of each tag in the full dataset.

In [12]:
counts = Counter()
for tags_list in df['TAGS_P']:
    counts.update(tags_list)
tags_df = pd.DataFrame.from_dict(counts, orient='index')
tags_df.reset_index(drop = False, inplace = True)
tags_df= tags_df.rename(columns={'index':'tag', 0:'count'})

The **tags_df** structure holds the number of occurrences of each tag. <br/>
We keep only the tags that appear in more than 10 documents.

In [13]:
len(tags_df[tags_df['count'] > 10])

2114

We end up with a little over 2,100 tags (14,000 tags in the original dataset).

- Let us now filter our sample, keeping only the posts that contain the most frequent tags.

In [14]:
frequent_tags = tags_df[tags_df['count'] > 10]['tag'].tolist()

**frequent_tags** is the structure containing the most frequent tags.

In [15]:
df_learn['TAGS_P'] = df_learn['TAGS_P'].apply(lambda x: [w for w in x if w in frequent_tags] )
# Drop the rows left without any tag (none of their tags is in frequent_tags)
df_learn = df_learn[df_learn.astype(str)['TAGS_P'] != '[]']

In [16]:
df_learn.shape

(14861, 7)

Our dataset now contains only the posts with frequent tags.
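
The filtering steps of this section can be sketched on toy data in plain Python. The posts and the threshold below are hypothetical; the notebook itself keeps tags with a count above 10 and works on a DataFrame.

```python
from collections import Counter

# Hypothetical toy posts; each post is a list of tags.
posts = [
    ["python", "pandas"],
    ["python", "numpy"],
    ["rust"],
    ["python", "rust"],
    ["matplotlib"],
]
min_count = 1  # toy threshold; the notebook uses count > 10

# Count how many posts carry each tag.
counts = Counter(tag for tags in posts for tag in tags)

# A set gives constant-time membership tests during filtering.
frequent = {tag for tag, c in counts.items() if c > min_count}

# Keep only frequent tags, then drop posts left without any tag.
filtered = [[t for t in tags if t in frequent] for tags in posts]
filtered = [tags for tags in filtered if tags]

print(sorted(frequent))
print(filtered)
```

Storing the frequent tags in a set rather than a list is a design choice of this sketch: membership tests stay fast when there are a couple of thousand tags.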

## 2.3 Train/test split

Let us split our data into a training set and a test set for our learning algorithms.<br/>
We keep the title and body columns.

In [17]:
X = df_learn[['TITLE', 'BODY','TITLE_P', 'BODY_P']]
Y = df_learn[['TAGS_P']]

- 70% of the data for training and 30% for testing.

In [18]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(X,Y,test_size = 0.3,random_state = 0, shuffle = True)

In [19]:
print("train", x_train.shape)
print("test ",x_test.shape)

train (10402, 4)
test  (4459, 4)


# 3. Preparing for topic modeling

## 3.1 Topics / Words matrix

Let us write a reusable method that returns, for a given topic model, the Topics/Words matrix.

In [20]:
def getTopicsWordsMatrix(model, features_name) :
    # list of topics
    topicnames = ["Topic" + str(i) for i in range(model.n_components)]

    # Topic-keyword matrix
    df_topic_keywords = pd.DataFrame(model.components_)
    df_topic_keywords.columns = features_name
    df_topic_keywords.index = topicnames

    return df_topic_keywords

In [21]:
'''
Displays the top "n_words" words for each topic.
@param feature_names : the features of our data
@param tp_model : the model used for topic modeling
@param n_words : number of top words to show
'''
def showTopicsTopWords(tp_model, feature_names, n_words=10):
    keywords = np.array(feature_names)
    topic_keywords = []
    for topic_weights in tp_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    df_topic_keywords = pd.DataFrame(topic_keywords)
    df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
    df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
    display(df_topic_keywords)
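
The core of `showTopicsTopWords` is the `argsort` trick on negated weights. Here is a minimal sketch on a hypothetical 2-topic, 5-word matrix; the values are made up for illustration.

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics over a 5-word vocabulary.
vocabulary = np.array(["css", "div", "java", "class", "width"])
components = np.array([
    [5.0, 3.0, 0.1, 0.2, 4.0],   # topic 0
    [0.1, 0.3, 6.0, 2.0, 0.2],   # topic 1
])

n_words = 3
# Negating the weights makes argsort return indices in descending order.
top_idx = np.argsort(-components, axis=1)[:, :n_words]
top_words = vocabulary[top_idx]

print(top_words)
```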

## 3.2 Docs / Topics matrix

Method that returns the documents / topics matrix after fitting a topic model.

In [22]:
'''
    Returns the matrix giving the relevance of each topic for each document in our dataset.
'''
def getDocsTopicsMatrix(df_docs, model, model_output):
    # post ids
    docnames = ["Doc" + str(i) for i in range(len(df_docs))]
    # topic ids
    topicnames = ["Topic" + str(i) for i in range(model.n_components)]
    # Build a dataframe
    df_document_topic = pd.DataFrame(np.round(model_output, 4), columns=topicnames, index=docnames)
    # Add a column holding the index of the dominant topic
    dominant_topic = np.argmax(df_document_topic.values, axis=1)
    df_document_topic['Dominant_Topic'] = dominant_topic
    return df_document_topic

In [23]:
'''
Prints, for each topic, the top n documents that are most relevant to it.
For each document we print the title as well as the associated tags.
@topic_word_dist : topic-word distribution (the model components)
@model_output : document-topic matrix returned by the model
'''
def showTopicsTopDocs(topic_word_dist, model_output, feature_names, documents, targets, no_top_words, no_top_documents):
    for topic_idx, topic in enumerate(topic_word_dist):
        print ("Topic %d : " % (topic_idx) + " ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        top_doc_indices = np.argsort( model_output[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print ("Doc",doc_index," Title:", documents.iloc[doc_index].TITLE[0:60],
                   "- Tags:", targets.iloc[doc_index].TAGS_P)
        print("")

## 3.3 Tag prediction

To help with our predictions, we build a tags-by-topic matrix.<br/>
To build it, we:
- For each tag_i and each topic_j
    - Sum the probability of belonging to topic_j over the documents that carry tag_i.
    - Normalize this sum by averaging over the number of documents that carry tag_i. Note that `buildTagByTopicsMatrix` below keeps the raw sum: the division is commented out in the code.

|         | tag_1 | tag_2 | ... |tag_m | 
|---------|-------|-------|-----|------|
| **topic_1**  |  xxx  | xxx   | ... | xxx  | 
|**topic_2** |  xxx  | xxx   | ... | xxx  | 
|...      |  xxx  | xxx   | ... | xxx  |
|**topic_n**      |  xxx  | xxx   | ... | xxx  |
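
The construction described above can also be written as a single matrix product. The following is a sketch on hypothetical toy data, not the notebook's implementation: `doc_topic`, `tags` and `doc_tags` are made-up names and values.

```python
import numpy as np

# Hypothetical document-topic matrix (3 documents, 2 topics).
doc_topic = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
])
tags = ["python", "css"]
doc_tags = [["python"], ["css"], ["python", "css"]]

# Binary document-tag indicator: 1 if the document carries the tag.
indicator = np.array([[1.0 if t in dt else 0.0 for t in tags] for dt in doc_tags])

# Sum of topic probabilities over the documents carrying each tag.
topic_tag_sum = doc_topic.T @ indicator          # shape (n_topics, n_tags)

# Optional normalization: divide by the number of documents per tag.
docs_per_tag = indicator.sum(axis=0)
topic_tag_mean = topic_tag_sum / np.maximum(docs_per_tag, 1)

print(topic_tag_sum)
print(topic_tag_mean)
```

The raw sum favors tags that appear in many documents, while the mean removes that popularity effect. Which one works better for suggestion is an empirical question.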



- Let us start with a method that returns the positions of the documents carrying a given tag.

In [24]:
'''
Returns a dictionary mapping each tag (among the most frequent ones)
to the positions of the documents carrying that tag.
Note: these are positions, not index labels.
'''
def getDocumentsByTags(df_docs_tags) :
    documents_by_tags = {}
    for tag in frequent_tags :
        # Boolean mask over the rows, then the positions where it is True
        has_tag = df_docs_tags['TAGS_P'].apply(lambda tags: tag in tags).values
        documents_by_tags[tag] = np.flatnonzero(has_tag).tolist()
    return documents_by_tags

- Let us now build our Tags / Topics matrix

In [25]:
'''
@tm_algo : the fitted topic model.
@tm_output : the document-topic matrix produced by the model.
@df_docs_tags : the data holding the tags associated with each document.
'''
def buildTagByTopicsMatrix(tm_algo, tm_output, df_docs_tags) :
    documents_by_tags = getDocumentsByTags(df_docs_tags)
    tag_by_topics_ = np.zeros([tm_algo.n_components,len(frequent_tags)])
    for topic_idx in range(tm_algo.n_components):
        for tag_idx, tag in enumerate(frequent_tags):
            docs_pos = documents_by_tags[tag]
            topic_score = 0
            for d in docs_pos :
                topic_score += tm_output[d][topic_idx]
            # The raw sum is stored; the mean over documents is left commented out
            if topic_score > 0 :
                #tag_by_topics_[topic_idx][tag_idx]=topic_score/len(docs_pos)
                tag_by_topics_[topic_idx][tag_idx]=topic_score
    return tag_by_topics_

### Tag prediction

- Method that looks up, in the tags/topics matrix, the most relevant tags for a post.<br/>
It first computes the probability of each topic for the post, then looks for the most relevant tags.

In [26]:
'''
Returns the n most relevant tags for the given post.
Relies on the tags-by-topics matrix.
'''
def getTagsPrediction(post_text, num_tags, tag_topics_matrix, vectorizer, tm_model ) :
    # Vectorize the text
    text_vectorized = vectorizer.transform([post_text])
    topic_probability_scores = tm_model.transform(text_vectorized)

    # Multiply the topics predicted by the model by our tags/topics matrix.
    # This yields a score for each tag.
    tags_result = topic_probability_scores.dot(tag_topics_matrix)
    # Now pick the tags with the best scores.
    best_tags_indices = np.argsort(-tags_result[0])[:num_tags]
    best_tags = [frequent_tags[index] for index in best_tags_indices]
    return best_tags
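
To make the scoring step concrete, here is a sketch with hypothetical numbers: one post, two topics and three tags. It skips the vectorizer and the model and starts directly from a made-up topic distribution.

```python
import numpy as np

tags = ["python", "css", "java"]

# Hypothetical tags-by-topics matrix, shape (n_topics, n_tags).
topic_tag = np.array([
    [0.7, 0.1, 0.2],
    [0.1, 0.8, 0.1],
])

# Hypothetical topic distribution for one post, shape (1, n_topics).
post_topics = np.array([[0.25, 0.75]])

# One score per tag: topic weights of the post times tag weights per topic.
tag_scores = post_topics @ topic_tag             # shape (1, n_tags)

# Indices of the two best scores, highest first.
best = np.argsort(-tag_scores[0])[:2]
predicted = [tags[i] for i in best]

print(predicted)
```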

- Prediction for a series of posts.

In [27]:
'''
Returns a dataframe with the tag predictions for the posts in posts_df
@ posts_df : the dataframe containing the posts (title_p / body_p)
@ num_tag : the number of tags to predict
@ tags_topics_matrix : the tags/topics matrix used for prediction
@ vectorizer : converter used to turn text into a matrix of token counts.
@ tm_model : the topic model to use
@ tm_output : not used in the function body
'''
def getPostsTagsPrediction(posts_df, num_tag, tags_topics_matrix, vectorizer, tm_model, tm_output):
    y_predicted=posts_df.apply(lambda row: getTagsPrediction(row['TITLE_P'] + ' ' + row['BODY_P'], 
                                                             num_tag, tags_topics_matrix, vectorizer, tm_model),axis=1)
    y_predicted_df = y_predicted.to_frame()
    y_predicted_df.columns = ['TAGS_P']
    return y_predicted_df

### Prediction accuracy

In [28]:
'''
Evaluates prediction quality by comparing the predicted tags with the true tags.
For each post, computes the ratio of correctly predicted tags to the number of true tags
(a per-post recall), then prints the mean of these ratios.
'''
def predictionAccuracy(y_predicted, y_true) :
    tags_found=[]
    for index, row in y_predicted.iterrows():
        number_tags_found = 0
        for t in row['TAGS_P'] :
            if t in y_true.loc[index]['TAGS_P'] :
                number_tags_found +=1
        tags_found.append(number_tags_found/len(y_true.loc[index]['TAGS_P']))
    print("Prediction accuracy: {:.2f} % ".format(100*np.mean(tags_found)))
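
The metric can be checked by hand on a toy case. The sketch below uses plain lists instead of DataFrames and a hypothetical helper name, `mean_tag_recall`; it assumes that the predicted tags of a post contain no duplicates.

```python
def mean_tag_recall(predicted, actual):
    """Average, over posts, of (correctly predicted tags) / (true tags)."""
    ratios = []
    for pred, true in zip(predicted, actual):
        hits = len(set(pred) & set(true))
        ratios.append(hits / len(true))
    return sum(ratios) / len(ratios)

# Hypothetical predictions and ground truth for two posts.
predicted = [["python", "css"], ["java", "css"]]
actual = [["python", "pandas"], ["css"]]

score = mean_tag_recall(predicted, actual)
print(f"Prediction accuracy: {100 * score:.2f} %")
```

The first post recovers one of its two true tags (0.5) and the second recovers its only tag (1.0), so the mean is 0.75.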

# 4. Topic Modeling LDA

Let us try LDA (Latent Dirichlet Allocation) to model the topics.

To apply machine learning algorithms to our text data, we need to extract features and represent the text in a "mathematical" model.
To do so, we use the **Bag of Words** representation, which turns our data into a matrix.

## 4.1 Model tuning with grid search

- min_df: the word must appear in at least min_df documents
- max_df: a word that appears in more than a max_df fraction of the documents does not help tell documents apart, so we ignore it
- num_topics: number of topics in the model.
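
The effect of `min_df` and `max_df` can be sketched in plain Python. This is not scikit-learn's code: `filter_vocabulary` is a hypothetical helper that mirrors, to our understanding, how an integer `min_df` and a float `max_df` are interpreted.

```python
from collections import Counter

def filter_vocabulary(docs, min_df, max_df):
    """Keep terms whose document frequency is >= min_df (absolute count)
    and <= max_df (proportion of the documents)."""
    n_docs = len(docs)
    # Count each term once per document.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

# Hypothetical toy corpus of four preprocessed posts.
docs = [
    "use python list",
    "use css width",
    "use python dict",
    "use java class",
]
kept = filter_vocabulary(docs, min_df=2, max_df=0.75)
print(kept)
```

Here "use" appears in every document and is dropped by `max_df`, while the words seen only once are dropped by `min_df`.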

### Training

In [29]:
t0 = time()
Kfold = 5
# CHANGE
search_params = { 
              'vect__min_df': [5, 10, 15],
              'vect__max_df': [0.75, 0.85, 0.95],
              'model__n_components' : [10, 20, 30],
              'model__learning_method':['batch'], 
              'model__max_iter':[5]
             }
# Init the Model
lda = LatentDirichletAllocation()
pipeline = Pipeline([('vect', CountVectorizer()), ('model', lda)])

# Init Grid Search Class
lda_model_grid = GridSearchCV(pipeline, param_grid=search_params)

# Concatenate title and body
x_train_text = x_train['TITLE_P'] + ' ' + x_train['BODY_P']
# Do the Grid Search
lda_model_grid.fit(x_train_text)

print("Best params :", lda_model_grid.best_params_)
print("done in %0.3fs." % (time() - t0))

Best params : {'model__learning_method': 'batch', 'model__max_iter': 5, 'model__n_components': 20, 'vect__max_df': 0.95, 'vect__min_df': 15}
done in 3467.872s.


- The best-performing LDA model

In [30]:
best_lda_model = lda_model_grid.best_estimator_.steps[1][1]

- Retrieve the vectorizer

In [31]:
tf_vectorizer = lda_model_grid.best_estimator_.steps[0][1]

We apply the model to our data

In [32]:
data_vectorized = tf_vectorizer.transform(x_train_text)
lda_output = best_lda_model.transform(data_vectorized)

In [33]:
tags_by_topics_lda = buildTagByTopicsMatrix(best_lda_model, lda_output, y_train)

### Evaluation

In [34]:
# Log likelihood: higher is better
print("Log Likelihood: ", best_lda_model.score(data_vectorized))
# A lower perplexity score indicates better generalization performance.
print("Perplexity : ",best_lda_model.perplexity(data_vectorized))

Log Likelihood:  -9046417.68516
Perplexity :  855.781724869
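
The two figures above are linked. To our understanding, perplexity is the exponential of the negative log-likelihood per token, which is why a higher log-likelihood gives a lower perplexity. The sketch below illustrates that relation with hypothetical numbers and a hypothetical helper; it is not a call into scikit-learn.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Per-token perplexity: exp(-log_likelihood / n_tokens)."""
    return math.exp(-log_likelihood / n_tokens)

# Hypothetical corpus of 1,000 tokens with two candidate models.
worse = perplexity_from_log_likelihood(-7000.0, 1000)
better = perplexity_from_log_likelihood(-6000.0, 1000)

print(worse, better)
```

Because the token count depends on the vocabulary kept by the vectorizer, perplexities obtained with different `min_df` or `max_df` settings are not directly comparable.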


In [35]:
y_predicted_lda = getPostsTagsPrediction(x_test, 5, tags_by_topics_lda, tf_vectorizer, best_lda_model, lda_output)
predictionAccuracy(y_predicted_lda,y_test)

Prediction accuracy: 28.89 % 


### Evaluating predictions on our validation sample

In [36]:
x_validation = df_validation[['TITLE_P','BODY_P']]
y_validation = df_validation['TAGS_P']
y_validation = y_validation.to_frame()

In [37]:
y_predicted = getPostsTagsPrediction(x_validation, 5,tags_by_topics_lda, tf_vectorizer, best_lda_model, lda_output )
predictionAccuracy(y_predicted, y_validation)

Prediction accuracy: 24.65 % 


## 4.2 Manual tuning

### Training

In [38]:
def fit_transform_LDA(train_data, target_data, min_df, max_df, num_topics) :
    print("[min_df:{:d} max_df:{:.2f} num_topics:{}]"
          .format(min_df, max_df, num_topics), end=" ", flush=True)
    tf_vectorizer = CountVectorizer(min_df=min_df, max_df=max_df)
    data_vectorized = tf_vectorizer.fit_transform(train_data['TITLE_P'] + ' ' + train_data['BODY_P'])
    
    lda_model = LatentDirichletAllocation(n_components=num_topics, max_iter=5, learning_method='online')
    lda_output = lda_model.fit_transform(data_vectorized)
    print("Log Likelihood: {:.2f} - Perplexity: {:.2f}".format(lda_model.score(data_vectorized), 
                                                               lda_model.perplexity(data_vectorized)), 
          end=" ", flush=True)
    tags_by_topics_matrix = buildTagByTopicsMatrix(lda_model, lda_output, target_data)
    return lda_output, lda_model, tf_vectorizer, tags_by_topics_matrix

We vary min_df, max_df and num_topics.

### Evaluation

In [39]:
min_df_values = [5, 10, 20]
max_df_values = [0.75, 0.85, 0.95]
num_topics = [10, 20, 30]

In [40]:
for n_topics in num_topics :
    for min_df in min_df_values :
        for max_df in max_df_values :
            lda_output_m, lda_model_m, tf_vectorizer_m, tags_by_topics_m = fit_transform_LDA(x_train, y_train, min_df, max_df, n_topics)
            y_predicted_lda = getPostsTagsPrediction(x_test, 5, tags_by_topics_m, tf_vectorizer_m, lda_model_m, lda_output_m)
            predictionAccuracy(y_predicted_lda,y_test)

[min_df:5 max_df:0.75 num_topics:10] Log Likelihood: -10238550.67 - Perplexity: 1301.86 Prediction accuracy: 24.04 % 
[min_df:5 max_df:0.85 num_topics:10] Log Likelihood: -10205791.30 - Perplexity: 1272.32 Prediction accuracy: 22.29 % 
[min_df:5 max_df:0.95 num_topics:10] Log Likelihood: -10184437.46 - Perplexity: 1253.43 Prediction accuracy: 22.58 % 
[min_df:10 max_df:0.75 num_topics:10] Log Likelihood: -9552865.84 - Perplexity: 1046.12 Prediction accuracy: 23.20 % 
[min_df:10 max_df:0.85 num_topics:10] Log Likelihood: -9560962.01 - Perplexity: 1052.30 Prediction accuracy: 22.53 % 
[min_df:10 max_df:0.95 num_topics:10] Log Likelihood: -9534272.93 - Perplexity: 1032.06 Prediction accuracy: 23.76 % 
[min_df:20 max_df:0.75 num_topics:10] Log Likelihood: -8862523.85 - Perplexity: 859.02 Prediction accuracy: 23.56 % 
[min_df:20 max_df:0.85 num_topics:10] Log Likelihood: -8869653.93 - Perplexity: 863.71 Prediction accuracy: 24.45 % 
[min_df:20 max_df:0.95 num_topics:10] Log Likelihood: -891

Manually, we get the best score with 30 topics, a min_df of 20 and a max_df of 0.75.<br/>
However, the accuracy remains lower than with GridSearchCV, so we select the model tuned by the grid search.

### Evaluating predictions on our validation sample

In [66]:
lda_output_m, lda_model_m, tf_vectorizer_m, tags_by_topics_m = fit_transform_LDA(x_train, y_train, 
                                                                                 20, 0.75, 30)

[min_df:20 max_df:0.75 num_topics:30] Log Likelihood: -8757486.13 - Perplexity: 792.92 

In [67]:
y_predicted = getPostsTagsPrediction(x_validation, 5,tags_by_topics_m, tf_vectorizer_m, lda_model_m, lda_output_m )
predictionAccuracy(y_predicted, y_validation)

Prediction accuracy: 20.79 % 


## 4.3 Adjusting the number of tags

In [43]:
num_tags = [3, 4, 5, 6, 7, 8]
for n in num_tags :
    print("{} tags :".format(n), end=" ", flush=True)
    y_predicted = getPostsTagsPrediction(x_validation, n,tags_by_topics_lda, tf_vectorizer, best_lda_model, lda_output )
    predictionAccuracy(y_predicted, y_validation)

3 tags : Prediction accuracy: 19.33 % 
4 tags : Prediction accuracy: 22.29 % 
5 tags : Prediction accuracy: 24.65 % 
6 tags : Prediction accuracy: 26.55 % 
7 tags : Prediction accuracy: 28.16 % 
8 tags : Prediction accuracy: 29.69 % 


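=> We observe that the more tags we predict, the higher the score. This is expected: the metric divides the number of correct tags by the number of true tags, so adding predictions can only keep it equal or raise it. <br/>
=> However, the number of tags must stay reasonable, and we recommend using 6 tags.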
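
To see why the number of tags should stay reasonable, the sketch below compares the ratio used in this notebook with a precision-style ratio. Precision is not computed anywhere in the notebook and is added here only for illustration; the ranking and the true tags are hypothetical.

```python
# Hypothetical ranking produced by a model, best tag first.
ranked_tags = ["python", "pandas", "numpy", "css", "java", "html"]
true_tags = {"python", "numpy"}

results = {}
for n in (1, 3, 6):
    hits = len(true_tags & set(ranked_tags[:n]))
    recall = hits / len(true_tags)   # the ratio used in this notebook
    precision = hits / n             # share of the suggestions that are correct
    results[n] = (recall, precision)
    print(f"{n} tags: recall={recall:.2f} precision={precision:.2f}")
```

The recall-style score never decreases as n grows, while the share of correct suggestions shrinks once the true tags have been found.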

## 4.4 Analysis of the model

### Topics / Words

Display of the topics / words matrix.

In [44]:
getTopicsWordsMatrix(best_lda_model, tf_vectorizer.get_feature_names())

Unnamed: 0,aa,aaa,aar,ab,abc,abi,abil,abl,abort,about,...,youtub,yyyi,yyyy,zero,zeros,zip,zlib,zone,zoom,zygoteinit
Topic0,75.374851,33.168353,65.733151,113.181787,0.05,12.176349,1.238935,7.726519,5.881364,1.362979,...,0.357279,0.05,2.994896,11.849024,0.05,6.099444,15.58826,0.05,0.05,0.05
Topic1,0.05,0.05,63.970777,0.268828,7.763422,1.795093,9.94227,175.872095,11.702859,0.453102,...,0.05,0.05,0.05,0.649349,0.05,20.912222,0.05,0.05,0.05,0.05
Topic2,5.793988,3.612217,0.05,18.873082,0.309743,0.191999,0.05,9.699893,0.05,7.257495,...,0.05,0.05,0.05,0.05,0.05,0.127806,0.05,0.05,47.625976,0.05
Topic3,0.05,0.05,0.05,0.17623,3.485663,0.05,1.667105,76.871154,0.093133,0.05,...,0.05,0.05,4.815695,30.294729,2.332194,0.05,0.05,0.05,46.29573,0.05
Topic4,13.221767,0.05,0.05,6.767516,10.796418,0.05,5.310617,72.37086,9.445042,6.092735,...,0.05,0.05,0.05,0.055514,0.05,134.057642,1.951765,0.179205,0.05,0.05
Topic5,0.05,0.05,0.05,0.05,0.05,0.05,0.052424,25.351481,0.05,27.678329,...,1.15176,0.05,1.5772,0.088743,0.05,0.057403,0.083377,168.54348,0.162221,0.05
Topic6,22.483101,0.05,0.05,0.81488,7.255762,0.050294,7.628345,99.679828,3.52559,1.829298,...,0.05,0.183121,0.05,13.215018,0.08355,0.666222,0.05,0.064114,0.05,0.05
Topic7,0.787063,4.962179,0.05,1.272468,0.505099,0.059567,4.085678,29.02053,0.05,9.38504,...,0.05,1.882679,9.357501,0.448728,0.05,3.093201,0.05,0.05,3.110095,0.05
Topic8,1.456314,35.319714,0.05,14.011873,34.024728,0.05,1.314749,42.708503,0.05,5.659438,...,0.050276,0.05,0.05,19.416792,0.053552,67.982104,0.05,0.050149,0.05,0.05
Topic9,0.05,0.05,0.05,7.050773,11.416369,0.05,0.050205,34.28879,0.05108,13.610586,...,0.051615,2.489114,0.05,59.512029,44.730704,7.496405,0.05,0.050174,29.548764,0.05


In [45]:
showTopicsTopWords(best_lda_model, tf_vectorizer.get_feature_names(), 12)

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11
Topic 0,php,file,size,long,cs,cach,ns,clang,laravel,main,std,error
Topic 1,project,use,build,version,error,work,file,tri,app,run,studio,xcode
Topic 2,px,width,height,color,style,background,div,left,top,font,var,text
Topic 3,input,use,imag,tf,data,size,group,would,like,get,set,tri
Topic 4,system,net,file,use,microsoft,run,api,server,web,instal,error,version
Topic 5,id,from,angular,this,component,import,select,compon,router,export,users,error
Topic 6,array,data,use,key,like,would,function,string,need,way,const,read
Topic 7,div,class,item,button,li,menu,span,text,click,element,display,option
Topic 8,int,test,list,function,import,char,use,type,for,return,main,std
Topic 9,in,self,python,data,the,line,for,model,py,import,file,to


We can interpret some of the topics: <br/>
- Topic 1: HTML table
- Topic 7: Android
- Topic 9: HTTP requests
- Topic 15: Microsoft / .NET
- Topic 16: Java / Apache
- Topic 17: Python
- Topic 19: System / Installation
...

###  Docs / Topics

Let us now look at the topic associated with a few documents.

In [46]:
showTopicsTopDocs(best_lda_model.components_, lda_output, tf_vectorizer.get_feature_names(), x_train, y_train, 10, 2)

Topic 0 : php file size long cs cach ns clang laravel main
Doc 8419  Title: Why is the same function with a different integer parameter  - Tags: ['performance', 'rust']
Doc 7667  Title: Is it common to have just a return statement in a php file - Tags: ['php']

Topic 1 : project use build version error work file tri app run
Doc 2894  Title: How to create .NET Standard NuGet package with minimal depen - Tags: ['c#', '.net', 'visual-studio-2017', '.net-standard']
Doc 8900  Title: I get conflicting provisioning settings error when I try to  - Tags: ['ios', 'xcode']

Topic 2 : px width height color style background div left top font
Doc 1109  Title: HTML element not printing - Tags: ['html', 'css', 'html5', 'printing']
Doc 5829  Title: Efficient CSS3 Long Shadows on div with rounded corners - Tags: ['html', 'css', 'html5', 'css3']

Topic 3 : input use imag tf data size group would like get
Doc 6535  Title: Understanding TensorBoard (weight) histograms - Tags: ['tensorflow', 'tensorboard']


In [47]:
getDocsTopicsMatrix(x_train, best_lda_model, lda_output).head(10)

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,...,Topic11,Topic12,Topic13,Topic14,Topic15,Topic16,Topic17,Topic18,Topic19,Dominant_Topic
Doc0,0.0025,0.9525,0.0025,0.0025,0.0025,0.0025,0.0025,0.0025,0.0025,0.0025,...,0.0025,0.0025,0.0025,0.0025,0.0025,0.0025,0.0025,0.0025,0.0025,1
Doc1,0.0016,0.3394,0.0016,0.0016,0.0016,0.0016,0.0016,0.0016,0.0016,0.0016,...,0.0016,0.0016,0.0016,0.4355,0.0016,0.1976,0.0016,0.0016,0.0016,14
Doc2,0.0002,0.0002,0.0538,0.0002,0.0002,0.0002,0.2236,0.0002,0.3213,0.0002,...,0.0002,0.0002,0.0002,0.0002,0.0269,0.3709,0.0002,0.0002,0.0002,16
Doc3,0.0001,0.0882,0.0001,0.0001,0.0001,0.0001,0.7774,0.0001,0.0001,0.0312,...,0.0283,0.0001,0.0001,0.056,0.0001,0.0001,0.0173,0.0001,0.0001,6
Doc4,0.0015,0.4211,0.0015,0.0015,0.0015,0.0015,0.5524,0.0015,0.0015,0.0015,...,0.0015,0.0015,0.0015,0.0015,0.0015,0.0015,0.0015,0.0015,0.0015,6
Doc5,0.001,0.001,0.001,0.001,0.2195,0.001,0.001,0.001,0.001,0.001,...,0.001,0.001,0.001,0.001,0.001,0.001,0.0189,0.001,0.001,10
Doc6,0.001,0.263,0.2252,0.123,0.001,0.001,0.001,0.001,0.001,0.001,...,0.001,0.001,0.001,0.001,0.3721,0.001,0.001,0.001,0.001,15
Doc7,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,...,0.0001,0.0001,0.091,0.0001,0.0001,0.0001,0.8874,0.0206,0.0001,17
Doc8,0.0008,0.0008,0.0008,0.0008,0.1964,0.0008,0.0008,0.0008,0.0008,0.0008,...,0.0008,0.0008,0.0008,0.0008,0.4567,0.0008,0.1604,0.1743,0.0008,15
Doc9,0.0004,0.0004,0.0004,0.0004,0.0004,0.0004,0.4192,0.0312,0.0004,0.0004,...,0.159,0.0004,0.0004,0.0004,0.2451,0.1397,0.0004,0.0004,0.0004,6


# 5. Topic modeling with NMF

Let us now look at NMF (Non-negative Matrix Factorization) to predict our tags.
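
NMF approximates a non-negative document-term matrix V by the product of a document-topic matrix W and a topic-term matrix H. As an illustration only, here is a sketch of the classic multiplicative update rules in NumPy. This is a swapped-in textbook method, not the solver scikit-learn uses by default, and all sizes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 5))   # hypothetical non-negative document-term matrix
k = 2                    # number of topics
W = rng.random((6, k))   # document-topic weights
H = rng.random((k, 5))   # topic-term weights
eps = 1e-10              # avoids division by zero

initial_error = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Multiplicative updates keep W and H non-negative.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(V - W @ H)

print(initial_error, final_error)
```

The rows of H play the role of `components_` in the functions above, and W plays the role of the document-topic output.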

## 5.1 Manual tuning

### Training

In [48]:
def fit_transform_NMF(train_data, target_data, min_df, max_df, num_topics) :
    print("[min_df:{:d} max_df:{:.2f} num_topics:{}]"
          .format(min_df, max_df, num_topics), end=" ", flush=True)
    tfidf_vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df)
    tfidf_vectorized = tfidf_vectorizer.fit_transform(train_data['TITLE_P'] + ' ' + train_data['BODY_P'])
    nmf_model = NMF(n_components=num_topics, random_state=1, alpha=.1, l1_ratio=.5)
    nmf_output = nmf_model.fit_transform(tfidf_vectorized)
    tags_by_topics_matrix = buildTagByTopicsMatrix(nmf_model, nmf_output, target_data)
    return nmf_output, nmf_model, tfidf_vectorizer, tags_by_topics_matrix

### Evaluation

In [49]:
for n_topics in num_topics :
    for min_df in min_df_values :
        for max_df in max_df_values :
            nmf_output_m, nmf_model_m, tfidf_vectorizer_m, tags_by_topics_m = fit_transform_NMF(x_train, y_train, min_df, max_df, n_topics)
            y_predicted_nmf = getPostsTagsPrediction(x_test, 5, tags_by_topics_m, tfidf_vectorizer_m, nmf_model_m, nmf_output_m)
            predictionAccuracy(y_predicted_nmf,y_test)

[min_df:5 max_df:0.75 num_topics:10] Prediction accuracy: 26.89 % 
[min_df:5 max_df:0.85 num_topics:10] Prediction accuracy: 26.89 % 
[min_df:5 max_df:0.95 num_topics:10] Prediction accuracy: 26.89 % 
[min_df:10 max_df:0.75 num_topics:10] Prediction accuracy: 27.06 % 
[min_df:10 max_df:0.85 num_topics:10] Prediction accuracy: 27.06 % 
[min_df:10 max_df:0.95 num_topics:10] Prediction accuracy: 27.06 % 
[min_df:20 max_df:0.75 num_topics:10] Prediction accuracy: 27.75 % 
[min_df:20 max_df:0.85 num_topics:10] Prediction accuracy: 27.75 % 
[min_df:20 max_df:0.95 num_topics:10] Prediction accuracy: 27.75 % 
[min_df:5 max_df:0.75 num_topics:20] Prediction accuracy: 30.59 % 
[min_df:5 max_df:0.85 num_topics:20] Prediction accuracy: 30.59 % 
[min_df:5 max_df:0.95 num_topics:20] Prediction accuracy: 30.59 % 
[min_df:10 max_df:0.75 num_topics:20] Prediction accuracy: 31.18 % 
[min_df:10 max_df:0.85 num_topics:20] Prediction accuracy: 31.18 % 
[min_df:10 max_df:0.95 num_topics:20] Prediction accur

### Evaluating predictions on our validation sample

In [54]:
nmf_output, best_nmf_model, tfidf_vectorizer, tags_by_topics_nmf = fit_transform_NMF(x_train, y_train, 10, 0.95, 30)

[min_df:10 max_df:0.95 num_topics:30] 

In [55]:
y_predicted = getPostsTagsPrediction(x_validation, 5,tags_by_topics_nmf, tfidf_vectorizer, best_nmf_model, nmf_output)
predictionAccuracy(y_predicted, y_validation)

Prediction accuracy: 26.83 % 


## 5.2 Adjusting the number of tags

In [56]:
num_tags = [3, 4, 5, 6, 7, 8]
for n in num_tags :
    print("{} tags :".format(n), end=" ", flush=True)
    y_predicted = getPostsTagsPrediction(x_validation, n, tags_by_topics_nmf, tfidf_vectorizer, best_nmf_model, nmf_output)
    predictionAccuracy(y_predicted, y_validation)

3 tags : Prediction accuracy: 22.70 % 
4 tags : Prediction accuracy: 25.07 % 
5 tags : Prediction accuracy: 26.83 % 
6 tags : Prediction accuracy: 28.12 % 
7 tags : Prediction accuracy: 29.36 % 
8 tags : Prediction accuracy: 30.64 % 


## 5.3 Analysis

### Topics / Words

In [57]:
showTopicsTopWords(best_nmf_model, tfidf_vectorizer.get_feature_names(), 12)

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11
Topic 0,use,user,would,request,like,get,work,api,code,http,way,set
Topic 1,android,layout,parent,view,dp,height,match,id,width,support,studio,widget
Topic 2,int,compil,const,main,void,return,char,include,float,for,gcc,clang
Topic 3,div,class,button,form,input,span,ng,label,html,text,click,col
Topic 4,java,at,org,lang,internal,apache,util,activitythread,springframework,com,run,sun
Topic 5,angular,import,component,compon,router,from,ts,export,this,rout,core,ng
Topic 6,px,width,color,height,background,text,border,style,css,left,top,font
Topic 7,public,class,void,static,new,private,method,this,override,return,,final
Topic 8,file,python,py,line,import,lib,in,path,instal,usr,pip,tensorflow
Topic 9,foo,bar,baz,const,struct,class,str,template,compil,templat,argument,void
