## GENERAL INFORMATION

- In average, a tag is 14 times frequent in the total corpus with a standard deviation of 127 tags.
- There are 28 tags which occurs more than 1000 times in the whole corpus.
- 41% of tags occurs only once in the corpus
- TF-IDF: our X_data has been vectorized to 1130 words from the whole Text + Title corpus (3M+ corpus words).
- TF-IDF: our y_data has been vectorized to 231 most important tags from the 13k+ original tags.
- In average, the MultiOutput(RegressionLogistic) model makes 2 tags identification errors/ mismatch per document (one_zero_loss_score)
- In average, the jaccard score of the model is 0.64 if trained on Text + Title and 0.58 if trained only on Titles - which is a good score. The Text + Title scenario is the most efficient approach.

- df_clean['Text'] correspond to Body_3 and Title_2
- df_bert_2_index correspond to Body + Title - few preprocessing (html)


Github: https://github.com/maurlco/stackoverflow-classification-tag

In [85]:
import pandas as pd
import numpy as np
import nltk
import torch
import matplotlib.pyplot as plt
from statistics import mean
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import log_loss
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_score, zero_one_loss,balanced_accuracy_score,precision_score, recall_score, hamming_loss

## Chargement du preprocessed DataSet

In [2]:
df = pd.read_csv('/Users/maurelco/Developer/Python/Projet 4/data/Cleaned/df_process_text_3.csv')

In [3]:
df

Unnamed: 0,Title,Body,Tags,_clean_tags,_len_body,Body_2,_len_body_2,Body_3,Title_2
0,giving unix process exclusive rw access directory,way sandbox linux process certain directory gi...,linux ubuntu process sandbox selinux,"['linux', 'ubuntu', 'process', 'sandbox', 'sel...",526,sandbox linux process certain directory give p...,462,sandbox linux process certain directory proces...,giving unix process exclusive rw access directory
1,automatic repaint minimizing window,jframe two panel one panel draw line working m...,java graphics jframe jpanel paint,"['java', 'graphics', 'jframe', 'jpanel', 'paint']",2969,jframe two panel panel draw line working minim...,2855,jframe panel panel draw line minimized window ...,automatic repaint minimizing window
2,man-in-the-middle attack security threat ssh a...,expert network security pardon question smart ...,security ssh ssh-keys openssh man-in-the-middle,"['security', 'ssh', 'ssh-keys', 'openssh', 'ma...",447,expert network security pardon smart automatin...,414,expert network security pardon smart automatin...,man-in-the-middle attack security threat ssh a...
3,managing data access simple winforms app,simple winforms data entry app us sqlite alway...,c# winforms sqlite datatable sqlconnection,"['c#', 'winforms', 'sqlite', 'datatable', 'sql...",2537,simple winforms entry app u sqlite always sing...,2382,winforms entry app sqlite always single-user a...,managing data access winforms app
4,render basic html view,basic node.js app trying get ground using expr...,javascript html node.js mongodb express,"['javascript', 'html', 'node.js', 'mongodb', '...",335,basic node.js app get ground express framework...,292,basic node.js app get ground express framework...,render basic html view
...,...,...,...,...,...,...,...,...,...
49995,bypass vertica error execution time exceeded r...,using ssis tool ole db downloading data vertic...,sql-server ssas oledb sql-server-data-tools ve...,"['sql-server', 'ssas', 'oledb', 'sql-server-da...",206,ssis tool ole db downloading vertica database ...,186,ssis tool ole db downloading vertica database ...,bypass vertica error execution time exceeded r...
49996,conflicting conditional operation currently pr...,using f uploading file already created bucket ...,python amazon-web-services amazon-s boto pytho...,"['python', 'amazon-web-services', 'amazon-s', ...",301,f uploading file already created bucket deleti...,268,uploading file bucket deleting bucket executio...,conflicting conditional operation progress bucket
49997,problem lr_find pytorch fastai course,following jupyter notebook course hit upon err...,python machine-learning deep-learning pytorch ...,"['python', 'machine-learning', 'deep-learning'...",908,jupyter notebook course hit upon line cnn_lear...,781,jupyter notebook course hit upon line cnn_lear...,problem lr_find pytorch fastai course
49998,jsonpatch escape slash jsonpatch+json,json wanted update field process my-process po...,java json rest json-patch http-patch,"['java', 'json', 'rest', 'json-patch', 'http-p...",645,json wanted update field process my-process po...,566,json wanted field process my-process pod some-...,jsonpatch escape slash jsonpatch+json


## Mise en Forme du DATASET

In [3]:
df['Text'] = df['Title_2'] + ' ' + df['Body_3']
df_clean = df.drop(['_len_body','Body_2','_len_body_2','Body','Title','Body_3'], axis=1).dropna()
print('Shape of dataset :',df_clean.shape)
df_clean.head(5)

Shape of dataset : (49997, 4)


Unnamed: 0,Tags,_clean_tags,Title_2,Text
0,linux ubuntu process sandbox selinux,"['linux', 'ubuntu', 'process', 'sandbox', 'sel...",giving unix process exclusive rw access directory,giving unix process exclusive rw access direct...
1,java graphics jframe jpanel paint,"['java', 'graphics', 'jframe', 'jpanel', 'paint']",automatic repaint minimizing window,automatic repaint minimizing window jframe pan...
2,security ssh ssh-keys openssh man-in-the-middle,"['security', 'ssh', 'ssh-keys', 'openssh', 'ma...",man-in-the-middle attack security threat ssh a...,man-in-the-middle attack security threat ssh a...
3,c# winforms sqlite datatable sqlconnection,"['c#', 'winforms', 'sqlite', 'datatable', 'sql...",managing data access winforms app,managing data access winforms app winforms ent...
4,javascript html node.js mongodb express,"['javascript', 'html', 'node.js', 'mongodb', '...",render basic html view,render basic html view basic node.js app get g...


In [5]:
df_clean.isna().sum()

Tags           0
_clean_tags    0
Title_2        0
Text           0
dtype: int64

In [4]:
def delete_row_with_more_than_x_tags(nbr_max_tags,df):
    print('Deleting rows (sentences) with more than 7 tags ... ')
    tokenizer = nltk.RegexpTokenizer(r'[a-zA_\-+#]*\.?[a-zA_\+#]+')
    doc_tokens = []
    for list in df.Tags.values:
        doc_tokens.append(tokenizer.tokenize(list))
    doc_tokens = pd.Series(doc_tokens)
    tokens_length = doc_tokens.apply(len)
    index_to_drop = tokens_length[tokens_length > nbr_max_tags].index.values
    df_clean = df.drop(labels=index_to_drop, axis=0)
    return df_clean

In [5]:
df_clean = delete_row_with_more_than_x_tags(7,df_clean)
df_clean.shape

Deleting rows (sentences) with more than 7 tags ... 


(49843, 4)

In [8]:
df_clean.to_csv('/Users/maurelco/Developer/Python/Projet 4/data/Cleaned/df_clean_bis.csv',index= False)

Tag Frequency Analysis

In [9]:
raw_corpus_tags = " ".join(df.Tags.values)
tokenizer = nltk.RegexpTokenizer(r'[a-zA_\-+#]*\.?[a-zA_\+#]+')
raw_tokens_tags= tokenizer.tokenize(raw_corpus_tags)

In [10]:
tmp_tags = pd.Series(raw_tokens_tags).value_counts()
print('Total number of tags in corpus : ',len(tmp_tags))
print('Average tags fequency within corpus :  ',round(np.average(tmp_tags),0))
print('Median tags fequency within corpus :  ',np.median(tmp_tags))
print('Standard deviation tags frequency within corpus : ',round(np.std(tmp_tags),0))
print('Number of tags with more than 128 occurences within corpus : ',len(tmp_tags[tmp_tags > 128]) )
print('Top 3 tags (freq) :\n',tmp_tags[tmp_tags > 150][:3])

Total number of tags in corpus :  18044
Average tags fequency within corpus :   14.0
Median tags fequency within corpus :   2.0
Standard deviation tags frequency within corpus :  128.0
Number of tags with more than 128 occurences within corpus :  267
Top 3 tags (freq) :
 c#            6699
java          6278
javascript    6236
dtype: int64


## MultiLabelBinarizer

The MultiLabelBinarizer is not conclusive for creating a binary multi-label matrix, see example below:

In [11]:
tokenizer = nltk.RegexpTokenizer(r'[a-zA_\-+#]*\.?[a-zA_\+#]+')
df_clean['Tags_tokens'] = df_clean['Tags'].apply(lambda x : tokenizer.tokenize(x))
df_clean['Tags_tokens'].iloc[0:2]

0    [linux, ubuntu, process, sandbox, selinux]
1       [java, graphics, jframe, jpanel, paint]
Name: Tags_tokens, dtype: object

In [12]:
# Initialize the MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Fit the MultiLabelBinarizer to your data
mlb.fit(df_clean['Tags_tokens'])

# Transform your data into a binary encoding
binary_matrix = mlb.transform(df_clean['Tags_tokens'])
total_words_text = mlb.classes_
binary_matrix_df = pd.DataFrame(binary_matrix)
binary_matrix_df.columns = total_words_text
binary_matrix_df[:3]

Unnamed: 0,+,-addin,-addon,-advanced-app,-ajax,-ami,-aot,-api,-api-manager,-api-tools,...,zsh,zshrc,zstd,zsync,zul,zurb-foundation,zurb-ink,zurb-joyride,zxing,zynq
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
binary_matrix_df[binary_matrix_df['c++'] > 1]

Unnamed: 0,+,-addin,-addon,-advanced-app,-ajax,-ami,-aot,-api,-api-manager,-api-tools,...,zsh,zshrc,zstd,zsync,zul,zurb-foundation,zurb-ink,zurb-joyride,zxing,zynq


### Data formating for TF-IDF

In [32]:
# step_1 = df_clean.join(df_clean["Tags"].str.split(' ',6, expand=True))
# step_2 = step_1.drop(['Title_2','Tags','Text'],axis=1).stack()
# freq_step_2 = step_2.value_counts()
# least_freq_tags = freq_step_2[freq_step_2 <= 150].index
# step_2_df = pd.DataFrame(step_2)
# step_2_df = step_2_df[~step_2_df.isin(least_freq_tags.values).any(axis=1)]
# step_3 = pd.get_dummies(step_2_df)
# step_3.columns = step_3.columns.str.strip(['0_','-','_'])
# tag_columns = step_3.groupby(level=0).sum()

In [11]:
def create_multi_label_matrix(df, column_to_split_str,columns_to_drop_arr,freq_threshold_int, characters_to_strip_arr):
    print('step 1 : splitting the Tags values up to 6 maximum into columns and join it to the dataset ... ')
    step_1 = df.join(df[column_to_split_str].str.split(' ',n=6, expand=True))
    print('Step 2 : deleting the columns Text/Title/Tags and stack the tags columns dataframe ... ')
    step_2 = step_1.drop(columns_to_drop_arr,axis=1).stack()
    print('Step 3 : deleting columns of least frequent tags ...')
    freq_step_2 = step_2.value_counts()
    least_freq_tags = freq_step_2[freq_step_2 <= freq_threshold_int].index
    step_2_df = pd.DataFrame(step_2)
    step_2_df = step_2_df[~step_2_df.isin(least_freq_tags.values).any(axis=1)]
    print('Step 4 : creating a dummy dataframe ...')
    step_3 = pd.get_dummies(step_2_df)
    print('Step 5 : cleaning the name of the tags columns ...')
    step_3.columns = step_3.columns.str.strip(characters_to_strip_arr[0])
    step_3.columns = step_3.columns.str.strip(characters_to_strip_arr[1])
    step_3.columns = step_3.columns.str.strip(characters_to_strip_arr[2])
    print('Step 6 : GroupBy sentences and sum values per Tags columns ...')
    matrix_tags = step_3.groupby(level=0).sum()
    return matrix_tags

In [12]:
matrix_labels = create_multi_label_matrix(df_clean,'Tags',['Title_2','Tags','Text'],150,['0_','-','_'])
print('Multi-label matrix shape : ', matrix_labels.shape)
matrix_labels.head(5)

step 1 : splitting the Tags values up to 6 maximum into columns and join it to the dataset ... 
Step 2 : deleting the columns Text/Title/Tags and stack the tags columns dataframe ... 
Step 3 : deleting columns of least frequent tags ...
Step 4 : creating a dummy dataframe ...
Step 5 : cleaning the name of the tags columns ...
Step 6 : GroupBy sentences and sum values per Tags columns ...
Multi-label matrix shape :  (48434, 229)


Unnamed: 0,core,mvc,web-api,.htaccess,.net,.x,actionscript,ajax,algorithm,amazon-s,...,windows-phone,winforms,woocommerce,wordpress,wpf,x,xamarin,xaml,xcode,xml
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
final = pd.concat([df_clean, matrix_labels], axis=1).drop('Tags', axis=1)
final = final.dropna()
final.shape

(48434, 232)

## Early data


In [14]:
final.to_csv('/Users/maurelco/Developer/Python/Projet 4/data/Cleaned/final.csv', index=False)

In [15]:
train_data, test_data = train_test_split(final, test_size=0.3, random_state=42)

In [16]:
train_data.shape

(33903, 232)

## Preparing vocabulary for TF-IDF

Bag of word - Tf-idf

## Préparation sentences

In [19]:
print('Creating a List of the whole corpus words ...')
combined_vocab_title = train_data['Title_2'].tolist()
combined_vocab_text = train_data['Text'].tolist()

Creating a List of the whole corpus words ...


In [286]:
print('Saving the multi-labels value within an array ...')
target_cols = matrix_labels.columns.values
print('Multi-Labels size : ',len(target_cols))

Saving the multi-labels value within an array ...
Multi-Labels size :  229


In [287]:
type(target_cols)

numpy.ndarray

In [288]:
target_cols

array(['core', 'mvc', 'web-api', '.htaccess', '.net', '.x',
       'actionscript', 'ajax', 'algorithm', 'amazon-s',
       'amazon-web-services', 'android', 'android-fragments',
       'android-layout', 'android-studio', 'angular', 'angularjs',
       'animation', 'apache', 'apache-flex', 'apache-spark', 'api',
       'arrays', 'asp.net', 'assembly', 'asynchronous', 'audio',
       'authentication', 'azure', 'bash', 'button', 'c', 'c#', 'c++',
       'caching', 'canvas', 'class', 'cocoa', 'cocoa-touch', 'cookies',
       'cordova', 'css', 'csv', 'curl', 'd', 'data-binding', 'database',
       'dataframe', 'date', 'datetime', 'debugging', 'delphi',
       'deployment', 'dictionary', 'django', 'dll', 'docker', 'dom',
       'dynamic', 'eclipse', 'email', 'encoding', 'encryption',
       'entity-framework', 'events', 'excel', 'exception', 'express',
       'facebook', 'file', 'firebase', 'firefox', 'flash', 'flask',
       'flutter', 'forms', 'function', 'gcc', 'generics', 'git',
       '

In [289]:
dataframe_tags_tdidf = pd.DataFrame(target_cols, columns=[0])
dataframe_tags_tdidf

Unnamed: 0,0
0,core
1,mvc
2,web-api
3,.htaccess
4,.net
...,...
224,x
225,xamarin
226,xaml
227,xcode


In [292]:
empty_array = []
for row in dataframe_tags_tdidf.values:
    print(row[0])
    # empty_array.append(row)

# empty_array

core
mvc
web-api
.htaccess
.net
.x
actionscript
ajax
algorithm
amazon-s
amazon-web-services
android
android-fragments
android-layout
android-studio
angular
angularjs
animation
apache
apache-flex
apache-spark
api
arrays
asp.net
assembly
asynchronous
audio
authentication
azure
bash
button
c
c#
c++
caching
canvas
class
cocoa
cocoa-touch
cookies
cordova
css
csv
curl
d
data-binding
database
dataframe
date
datetime
debugging
delphi
deployment
dictionary
django
dll
docker
dom
dynamic
eclipse
email
encoding
encryption
entity-framework
events
excel
exception
express
facebook
file
firebase
firefox
flash
flask
flutter
forms
function
gcc
generics
git
google-chrome
google-maps
gradle
gridview
hadoop
hibernate
html
http
iis
image
image-processing
inheritance
installation
internet-explorer
ionic-framework
ios
iphone
jakarta-ee
java
javascript
jdbc
jenkins
jpa
jquery
jquery-ui
jsf
json
jsp
junit
keras
kotlin
laravel
linq
linux
list
listview
logging
loops
machine-learning
macos
math
matlab
matplotlib
m

In [282]:
empty_array_df = pd.DataFrame(empty_array)
for row in empty_array_df.values:
    print(row)

['core']
['mvc']
['web-api']
['.htaccess']
['.net']
['.x']
['actionscript']
['ajax']
['algorithm']
['amazon-s']
['amazon-web-services']
['android']
['android-fragments']
['android-layout']
['android-studio']
['angular']
['angularjs']
['animation']
['apache']
['apache-flex']
['apache-spark']
['api']
['arrays']
['asp.net']
['assembly']
['asynchronous']
['audio']
['authentication']
['azure']
['bash']
['button']
['c']
['c#']
['c++']
['caching']
['canvas']
['class']
['cocoa']
['cocoa-touch']
['cookies']
['cordova']
['css']
['csv']
['curl']
['d']
['data-binding']
['database']
['dataframe']
['date']
['datetime']
['debugging']
['delphi']
['deployment']
['dictionary']
['django']
['dll']
['docker']
['dom']
['dynamic']
['eclipse']
['email']
['encoding']
['encryption']
['entity-framework']
['events']
['excel']
['exception']
['express']
['facebook']
['file']
['firebase']
['firefox']
['flash']
['flask']
['flutter']
['forms']
['function']
['gcc']
['generics']
['git']
['google-chrome']
['google-maps']

In [293]:
dataframe_tags_tdidf.to_csv('/Users/maurelco/Developer/Python/Projet4/data/Cleaned/tfidf_labels_tags.csv', index=False)

In [21]:
print('Splitting dataset into train data and test data for TF-IDF...',)
X = train_data['Text']
y = train_data[target_cols]

X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X, y, test_size = 0.25, random_state = 42)
print('Shape of X_train and y train : ',X_train_tfidf.shape, y_train_tfidf.shape)
print('Shape of X test and y test : ',X_test_tfidf.shape, y_test_tfidf.shape)

Splitting dataset into train data and test data for TF-IDF...
Shape of X_train and y train :  (25427,) (25427, 229)
Shape of X test and y test :  (8476,) (8476, 229)


## TF-IDF

2 approaches :
- Tf-IDF fitted on the whole Text&Title corpus
- Tf-IDF fitted on the whole Title only corpus

In [210]:
print('Instanciate the TF-IDF Vectorizer for text&titles and Titles only ... ')
vectorizer_text = TfidfVectorizer(analyzer= 'word', min_df = 0.009, sublinear_tf = True,token_pattern='[a-zA_][a-zA_\#+-]*')
vectorizer_title = TfidfVectorizer(analyzer= 'word', min_df = 0.009, sublinear_tf = True,token_pattern='[a-zA_][a-zA_\#+-]*')

Instanciate the TF-IDF Vectorizer for text&titles and Titles only ... 


In [206]:
torch.save(vectorizer_text, 'vectorizer_text.pt')

In [207]:
torch.save(vectorizer_text, 'vectorizer_text.pkl')

In [51]:
def create_tfidf_embedded_matrix(vectorizer_text, vectorizer_title, X_train, X_test, y_train, y_test):
    print('fitting on texts & titles and fitting on titles only ...')
    vectorizer_text = vectorizer_text.fit(combined_vocab_text)
    vectorizer_title = vectorizer_title.fit(combined_vocab_title)

    print('transforming on X_train & X_test using the vectorizer fitted on texts & titles...')
    X_train_tfidf_text = vectorizer_text.transform(X_train)
    X_test_tfidf_text = vectorizer_text.transform(X_test)

    print('transforming on X_train & X_test using the vectorizer fitted only on titles...')
    X_train_tfidf_title = vectorizer_title.transform(X_train)
    X_test_tfidf_title = vectorizer_title.transform(X_test)
    print(' ')
    print('Shape of tf-idf train matrix and y train : ',X_train_tfidf_text.shape, y_train.shape)
    print('Shape of tf-idf test matrix on text&titles and y test : ',X_test_tfidf_text.shape, y_test.shape)
    print(' ')
    print('Shape of tf-idf train matrix on titles(only) and y train : ',X_train_tfidf_title.shape, y_train.shape)
    print('Shape of tf-idf test matrix and y test : ',X_test_tfidf_title.shape, y_test.shape)
    return X_train_tfidf_text,X_train_tfidf_title, X_test_tfidf_text, X_test_tfidf_title

In [None]:
loaded_vec = TfidfVectorizer(decode_error="replace",vocabulary=pickle.load(open("feature.pkl","rb")))
loaded_vec.transform(text)

In [214]:
vectorizer_text

In [211]:
X_train_tfidf_text, X_train_tfidf_title, X_test_tfidf_text, X_test_tfidf_title = create_tfidf_embedded_matrix(vectorizer_text, vectorizer_title, X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf)

fitting on texts & titles and fitting on titles only ...
transforming on X_train & X_test using the vectorizer fitted on texts & titles...
transforming on X_train & X_test using the vectorizer fitted only on titles...
 
Shape of tf-idf train matrix and y train :  (25427, 1127) (25427, 229)
Shape of tf-idf test matrix on text&titles and y test :  (8476, 1127) (8476, 229)
 
Shape of tf-idf train matrix on titles(only) and y train :  (25427, 94) (25427, 229)
Shape of tf-idf test matrix and y test :  (8476, 94) (8476, 229)


In [223]:
X_train_tfidf_text

<25427x1127 sparse matrix of type '<class 'numpy.float64'>'
	with 800699 stored elements in Compressed Sparse Row format>

In [221]:
combined_vocab_text

['sum value html table using loop javascript html get updated ajax php query sum calorie time get entry total row time click add button total although calorie add line entry add third line sum line always line missing get html code input id food food choose food input id amount choose amount gram amount input type button add id submit submit id update_div id mytable tr food calorie tr tr total id total tr javascript code document .ready function #submit .click function far var food #food .val var amount #amount .val .ajax url search_value.php food food amount +amount type get datatype html success function #update_div .append correctly var sum var document.getelementbyid mytable var table.getelementsbytagname var tds table.getelementsbytagname var tds.length i++ sum isnan tds .innertext parseint tds .innertext document.getelementbyid total .innerhtml sum php code php database configuration pdo pdo mysql host localhost dbname calotools root food _get food amount _get amount query select

In [212]:
import pickle
pickle.dump(vectorizer_text.vocabulary_, open("feature.pkl","wb"))

In [27]:
print('Saving the features names for both tf-idf vector matrices ...')
total_words_text = vectorizer_text.get_feature_names_out()
total_words_title= vectorizer_title.get_feature_names_out()

Saving the features names for both tf-idf vector matrices ...


In [228]:
total_words_text

array(['__init__', 'a', 'absolute', ..., 'yes', 'yet', 'z'], dtype=object)

In [302]:
total_words_text_df = pd.DataFrame(total_words_text)
total_words_text_df

Unnamed: 0,0
0,__init__
1,a
2,absolute
3,accept
4,access
...,...
1122,y
1123,year
1124,yes
1125,yet


In [303]:
total_words_text_df.to_csv('/Users/maurelco/Developer/Python/Projet4/data/Cleaned/tfidf_class_tags.csv', index=False)

In [304]:
total_words_text_df.to_csv('/Users/maurelco/Developer/Python/Projet4/API/tfidf_class_tags.csv', index=False)

## TF-IDF word analysis

In [224]:
def print_corpus_word(x_tfidf, index_number, features_name_list, most_important_words=False):

    ''' fonction qui affiche les mots avec le plus d'importance dans un document'''

    df = pd.DataFrame(x_tfidf[index_number].toarray())
    df.columns = features_name_list

    print('----------- TOTAL CORPUS WORDS ------------')
    for i in df.values[0]:
        if i > 0:
            localisation = np.where(df.values[0] == i)
            print(localisation)
            print(f'{df.columns.values[localisation[0]]} : {i} ')

    print('----------- MOST IMPORTANT CORPUS WORDS ------------')
    if most_important_words:
        np.max(df.values[0])
        for i in df.values[0]:
            if i > np.max(df.values[0]) * 0.7:
                localisation = np.where(df.values[0] == i)
                print( f'most important words - {df.columns.values[localisation[0]]} : {i} ')

In [54]:
print_corpus_word(X_test_tfidf_title, 200, total_words_title, most_important_words=True)

----------- TOTAL CORPUS WORDS ------------
['app'] : 0.5238335609629335 
['google'] : 0.5835914048051537 
['http'] : 0.4882333727314678 
['io'] : 0.2555907766895767 
['run'] : 0.28516837354256963 
----------- MOST IMPORTANT CORPUS WORDS ------------
most important words - ['app'] : 0.5238335609629335 
most important words - ['google'] : 0.5835914048051537 
most important words - ['http'] : 0.4882333727314678 


In [225]:
print_corpus_word(X_test_tfidf_title, 200, total_words_title, most_important_words=True)

----------- TOTAL CORPUS WORDS ------------
(array([6]),)
['app'] : 0.5238335609629335 
(array([30]),)
['google'] : 0.5835914048051537 
(array([32]),)
['http'] : 0.4882333727314678 
(array([35]),)
['io'] : 0.2555907766895767 
(array([66]),)
['run'] : 0.28516837354256963 
----------- MOST IMPORTANT CORPUS WORDS ------------
most important words - ['app'] : 0.5238335609629335 
most important words - ['google'] : 0.5835914048051537 
most important words - ['http'] : 0.4882333727314678 


In [56]:
print_corpus_word(X_test_tfidf_text, 200, total_words_text, most_important_words=True)

----------- TOTAL CORPUS WORDS ------------
['app'] : 0.22651945683423566 
['apple'] : 0.42529300957607535 
['apps'] : 0.329483887393112 
['developer'] : 0.18544271447796648 
['doc'] : 0.1736012998228027 
['effect'] : 0.19430958435852355 
['google'] : 0.31793498831837086 
['host'] : 0.16742290946713867 
['http'] : 0.1627340642453252 
['id'] : 0.10273096689438227 
['io'] : 0.14354280481076065 
['must'] : 0.1571494523020039 
['pa'] : 0.15129117447539567 
['platform'] : 0.18555652594731475 
['process'] : 0.14112703016937192 
['run'] : 0.167882878315492 
['store'] : 0.2592550190911551 
['support'] : 0.3183708557991503 
['taken'] : 0.21063299989684203 
['though'] : 0.16480825504603894 
----------- MOST IMPORTANT CORPUS WORDS ------------
most important words - ['apple'] : 0.42529300957607535 
most important words - ['apps'] : 0.329483887393112 
most important words - ['google'] : 0.31793498831837086 
most important words - ['support'] : 0.3183708557991503 


## MultiOutputClassifier using TFIDF

In [239]:
print('Instantiating the MultiOutputClassifier for the TF-IDF embedding (2 approaches) ... ')
clf_text = MultiOutputClassifier(LogisticRegression(n_jobs=-1, max_iter= 200),n_jobs=-1)
# clf_title = MultiOutputClassifier(LogisticRegression(n_jobs=-1, max_iter= 200),n_jobs=-1)

Instantiating the MultiOutputClassifier for the TF-IDF embedding (2 approaches) ... 


In [305]:
clf_text_sgdc = MultiOutputClassifier(SGDClassifier(n_jobs=-1), n_jobs=-1)

In [240]:
X_train_tfidf_text.shape

(25427, 1127)

In [241]:
# print('1st approach : fitting on titles only')
# clf_title = clf_title.fit(X_train_tfidf_title, y_train_tfidf)
# clf_title

In [306]:
clf_text_sgdc = clf_text_sgdc.fit(X_train_tfidf_text, y_train_tfidf)
clf_text_sgdc

In [242]:
print('2nd approach : fitting on texts & titles')
clf_text = clf_text.fit(X_train_tfidf_text, y_train_tfidf)
clf_text

2nd approach : fitting on texts & titles


In [243]:
clf_text.n_features_in_

1127

In [307]:
clf_text_sgdc.n_features_in_

1127

In [258]:
tfidf_tags = pd.DataFrame(y_train_tfidf.columns)
tfidf_tags

Unnamed: 0,0
0,core
1,mvc
2,web-api
3,.htaccess
4,.net
...,...
224,x
225,xamarin
226,xaml
227,xcode


In [260]:
tfidf_tags[0]

0           core
1            mvc
2        web-api
3      .htaccess
4           .net
         ...    
224            x
225      xamarin
226         xaml
227        xcode
228          xml
Name: 0, Length: 229, dtype: object

In [263]:
for tag in tfidf_tags[0]:
    print(type(tag))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class

In [253]:
tfidf_tags.to_csv('/Users/maurelco/Developer/Python/Projet4/data/Cleaned/tfidf_labels_tags.csv', index=False)

In [201]:
torch.save(clf_text, 'tfidf_model.pkl')

In [244]:
torch.save(clf_text, 'tfidf_model_2.pt')

In [61]:
print('predicting y_train and y_test values based on texts & titles...')
y_pred_train_text = clf_text.predict(X_train_tfidf_text)
y_pred_test_text = clf_text.predict(X_test_tfidf_text)
print('Done')

print('predicting y_train and y_test values based on titles only ...')
y_pred_train_title = clf_title.predict(X_train_tfidf_title)
y_pred_test_title = clf_title.predict(X_test_tfidf_title)
print('Done')

predicting y_train and y_test values based on texts & titles...
Done
predicting y_train and y_test values based on titles only ...
Done


In [308]:
print('predicting y_train and y_test values based on texts & titles...')
y_pred_sgdc = clf_text_sgdc.predict(X_train_tfidf_text)
y_pred_sgdc = clf_text_sgdc.predict(X_test_tfidf_text)
print('Done')

predicting y_train and y_test values based on texts & titles...
Done


In [309]:
y_pred_sgdc_df = pd.DataFrame(y_pred_sgdc)
y_pred_sgdc_df.columns = y_test_tfidf.columns
y_pred_sgdc_df.shape

(8476, 229)

In [62]:
y_pred_test_text_df = pd.DataFrame(y_pred_test_text)
y_pred_test_text_df.columns = y_test_tfidf.columns
y_pred_test_text_df.shape

(8476, 229)

In [63]:
y_pred_test_title_df = pd.DataFrame(y_pred_test_title)
y_pred_test_title_df.columns = y_test_tfidf.columns
y_pred_test_title_df.shape

(8476, 229)

In [64]:
print('Reset index of y test')
y_test = y_test_tfidf.reset_index(drop=True)

## MultiOutputClassifier using BERT embedding

In [65]:
print('Charging the dataset specific to BERT ...',)
df_bert = pd.read_csv('/Users/maurelco/Developer/Python/Projet4/data/Cleaned/df_bert_2_index.csv')
df_bert.shape

Charging the dataset specific to BERT ...


(24180, 11)

In [66]:
df_bert = df_bert.drop(['Body','_len_body','Body_2','_len_body_2','Body_3','Title_2'], axis=1)
df_bert.head(5)

Unnamed: 0,Title,Tags,_clean_tags,Text,bert_features
0,aws elastic beanstalk unable access aws msk,amazon-web-services apache-kafka aws-lambda am...,"['amazon-web-services', 'apache-kafka', 'aws-l...",aws elastic beanstalk unable access aws msk aw...,[[-2.15726882e-01 4.35024351e-02 1.79541588e...
1,soap message expiration,c# .net asp.net web-services soap,"['c#', '.net', 'asp.net', 'web-services', 'soap']",soap message expiration use time-stamp soap he...,[[ 1.21477060e-01 -4.00442690e-01 1.87169895e...
2,python panda equivalent sql case statement usi...,python sql pandas window-functions case-statement,"['python', 'sql', 'pandas', 'window-functions'...",python panda equivalent sql case statement usi...,[[-2.34946415e-01 -6.18148595e-02 7.88969159e...
3,java stack overflow error increase stack size ...,java eclipse jvm stack-overflow jvm-arguments,"['java', 'eclipse', 'jvm', 'stack-overflow', '...",java stack overflow error increase stack size ...,[[-3.46309878e-02 -6.87190294e-02 2.64229685e...
4,creating bitmask large number option,java android serialization bit-manipulation bi...,"['java', 'android', 'serialization', 'bit-mani...",creating bitmask large number option android a...,[[ 1.35769474e-03 -2.04592481e-01 2.64054954e...


In [67]:
print('Charging the BERT embedded matrix ...')
df_bert_matrix_embedding = pd.read_csv('/Users/maurelco/Developer/Python/Projet4/data/Cleaned/df_bert_embedding_matrix.csv')
df_bert_matrix_embedding.shape

Charging the BERT embedded matrix ...


(24180, 768)

> ##### Creating the multi labels (tags) matrix of shape (nbr_sentences, nbr_target_tags):

In [74]:
def create_multi_label_matrix(df, column_to_split_str,columns_to_drop_arr,freq_threshold_int, characters_to_strip_arr):
    print('step 1 : splitting the Tags values up to 6 maximum into columns and join it to the dataset ... ')
    step_1 = df.join(df[column_to_split_str].str.split(' ',n=6, expand=True))
    print('Step 2 : deleting the columns Text/Title/Tags and stack the tags columns dataframe ... ')
    step_2 = step_1.drop(columns_to_drop_arr,axis=1).stack()
    print('Step 3 : deleting columns of least frequent tags ...')
    freq_step_2 = step_2.value_counts()
    least_freq_tags = freq_step_2[freq_step_2 <= freq_threshold_int].index
    step_2_df = pd.DataFrame(step_2)
    step_2_df = step_2_df[~step_2_df.isin(least_freq_tags.values).any(axis=1)]
    print('Step 4 : creating a dummy dataframe ...')
    step_3 = pd.get_dummies(step_2_df)
    print('Step 5 : cleaning the name of the tags columns ...')
    step_3.columns = step_3.columns.str.strip(characters_to_strip_arr[0])
    step_3.columns = step_3.columns.str.strip(characters_to_strip_arr[1])
    step_3.columns = step_3.columns.str.strip(characters_to_strip_arr[2])
    print('Step 6 : GroupBy sentences and sum values per Tags columns ...')
    matrix_tags = step_3.groupby(level=0).sum()
    return matrix_tags

In [75]:
matrix_labels_bert = create_multi_label_matrix(df_bert,"Tags",['Title','Tags','_clean_tags','bert_features','Text'],70,['0_','-','_'])
print('Multi-label matrix shape for bert: ',matrix_labels.shape)
matrix_labels.head(5)

step 1 : splitting the Tags values up to 6 maximum into columns and join it to the dataset ... 
Step 2 : deleting the columns Text/Title/Tags and stack the tags columns dataframe ... 
Step 3 : deleting columns of least frequent tags ...
Step 4 : creating a dummy dataframe ...
Step 5 : cleaning the name of the tags columns ...
Step 6 : GroupBy sentences and sum values per Tags columns ...
Multi-label matrix shape for bert:  (48434, 229)


Unnamed: 0,core,mvc,web-api,.htaccess,.net,.x,actionscript,ajax,algorithm,amazon-s,...,windows-phone,winforms,woocommerce,wordpress,wpf,x,xamarin,xaml,xcode,xml
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [76]:
final = pd.concat([df_bert, matrix_labels_bert], axis=1).drop('Tags', axis=1)
final.head(5)

Unnamed: 0,Title,_clean_tags,Text,bert_features,core,mvc,web-api,.htaccess,.net,.x,...,winforms,woocommerce,wordpress,wpf,x,xamarin,xaml,xcode,xml,xpath
0,aws elastic beanstalk unable access aws msk,"['amazon-web-services', 'apache-kafka', 'aws-l...",aws elastic beanstalk unable access aws msk aw...,[[-2.15726882e-01 4.35024351e-02 1.79541588e...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,soap message expiration,"['c#', '.net', 'asp.net', 'web-services', 'soap']",soap message expiration use time-stamp soap he...,[[ 1.21477060e-01 -4.00442690e-01 1.87169895e...,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,python panda equivalent sql case statement usi...,"['python', 'sql', 'pandas', 'window-functions'...",python panda equivalent sql case statement usi...,[[-2.34946415e-01 -6.18148595e-02 7.88969159e...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,java stack overflow error increase stack size ...,"['java', 'eclipse', 'jvm', 'stack-overflow', '...",java stack overflow error increase stack size ...,[[-3.46309878e-02 -6.87190294e-02 2.64229685e...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,creating bitmask large number option,"['java', 'android', 'serialization', 'bit-mani...",creating bitmask large number option android a...,[[ 1.35769474e-03 -2.04592481e-01 2.64054954e...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [77]:
final.isna().sum()

Title              0
_clean_tags        0
Text               0
bert_features      0
core             676
                ... 
xamarin          676
xaml             676
xcode            676
xml              676
xpath            676
Length: 239, dtype: int64

In [78]:
index_nan = []
for i, row in final.isna().iterrows():
    if row['core'] == True:
        index_nan.append(i)
print('Number of NaNs to delete in the BERT matrix : ',len(index_nan))

Number of NaNs to delete in the BERT matrix :  676


In [79]:
print('Deleting the NaNs in the BERT matrix ... ')
for i, row in df_bert_matrix_embedding.iterrows():
    if i in index_nan:
        df_bert_matrix_embedding.drop(i, axis=0, inplace=True)
df_bert_matrix_embedding.shape

Deleting the NaNs in the BERT matrix ... 


(23504, 768)

In [80]:
print('Shape before deleting NaNs : ',final.shape )
final = final.dropna()
print('Shape after deleting NaNs : ', final.shape)

Shape before deleting NaNs :  (24180, 239)
Shape after deleting NaNs :  (23504, 239)


In [81]:
print('creating the array of the multi-labels target tags')
target_cols = matrix_labels_bert.columns.values
target_cols.shape

creating the array of the multi-labels target tags


(235,)

In [82]:
print('Splitting the dataset between Train data and Test data ...')
X = df_bert_matrix_embedding
y = final[target_cols]

X_train_bert, X_test_bert, y_train_bert, y_test_bert = train_test_split(X, y, test_size = 0.25, random_state = 42)
print(X_train_bert.shape, y_train_bert.shape)
print(X_test_bert.shape, y_test_bert.shape)

Splitting the dataset between Train data and Test data ...
(17628, 768) (17628, 235)
(5876, 768) (5876, 235)


#### Classification Model

In [83]:
print('Instantiate the MultiOutputClassifier for BERT ... ')
clf_bert = MultiOutputClassifier(LogisticRegression(n_jobs=-1, max_iter= 400,solver='sag', multi_class='multinomial'),n_jobs=-1)

Instantiate the MultiOutputClassifier for BERT ... 


In [84]:
clf_bert = clf_text.fit(X_train_bert, y_train_bert)
clf_bert

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

In [203]:
torch.save(clf_bert, 'bert_model.pkl')

In [86]:
torch.save(clf_bert, 'bert_model.pt')
# loaded_model = torch.load('bert_model.pt')

# Save the model
# torch.save(clf_bert.state_dict(), 'bert_model.pt')
# # Load the model
# loaded_model = BertModel.from_pretrained('bert-base-uncased')
# loaded_model.load_state_dict(torch.load('bert_model.pt'))
# torch.save(bert_tokenizer.state_dict(), 'bert_tokenizer.pt')
# loaded_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# loaded_tokenizer.load_state_dict(torch.load('bert_tokenizer.pt'))

In [87]:
y_pred_train_bert = clf_text.predict(X_train_bert)
y_pred_test_bert = clf_text.predict(X_test_bert)

In [88]:
y_pred_test_bert_df = pd.DataFrame(y_pred_test_bert)
y_pred_test_bert_df.columns = y_test_bert.columns
print('shape of prediction matrix :',y_pred_test_bert_df.shape)
y_pred_test_bert_df[:5]

shape of prediction matrix : (5876, 235)


Unnamed: 0,core,mvc,web-api,.htaccess,.net,.x,actionscript,ajax,algorithm,amazon-s,...,winforms,woocommerce,wordpress,wpf,x,xamarin,xaml,xcode,xml,xpath
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [89]:
y_test_bert = y_test_bert.reset_index(drop=True)

## MultiOutputClassifier using Word2Vec embedding

In [90]:
print('Loading the dataset specific for Word2Vec ... ')
df_word2vec = pd.read_csv('/Users/maurelco/Developer/Python/Projet 4/data/Cleaned/df_clean_bis.csv')
print('Shape : ',df_word2vec.shape)
df_word2vec.head(5)

Loading the dataset specific for Word2Vec ... 
Shape :  (49843, 4)


Unnamed: 0,Tags,_clean_tags,Title_2,Text
0,linux ubuntu process sandbox selinux,"['linux', 'ubuntu', 'process', 'sandbox', 'sel...",giving unix process exclusive rw access directory,giving unix process exclusive rw access direct...
1,java graphics jframe jpanel paint,"['java', 'graphics', 'jframe', 'jpanel', 'paint']",automatic repaint minimizing window,automatic repaint minimizing window jframe pan...
2,security ssh ssh-keys openssh man-in-the-middle,"['security', 'ssh', 'ssh-keys', 'openssh', 'ma...",man-in-the-middle attack security threat ssh a...,man-in-the-middle attack security threat ssh a...
3,c# winforms sqlite datatable sqlconnection,"['c#', 'winforms', 'sqlite', 'datatable', 'sql...",managing data access winforms app,managing data access winforms app winforms ent...
4,javascript html node.js mongodb express,"['javascript', 'html', 'node.js', 'mongodb', '...",render basic html view,render basic html view basic node.js app get g...


In [91]:
print('Loading the Word2Vec embedded matrix ... ')
df_matrix_word2vec_embedding = pd.read_csv('/Users/maurelco/Developer/Python/Projet4/data/Cleaned/df_word2vec_matrix_embedding.csv')
print('Shape : ', df_matrix_word2vec_embedding.shape)
df_matrix_word2vec_embedding[:5]

Loading the Word2Vec embedded matrix ... 
Shape :  (49843, 300)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,-0.017392,-0.00516,0.050944,-0.038304,-0.007477,0.005038,-0.027893,0.020133,0.007672,-0.011984,...,-0.032344,0.00175,-0.012358,-0.009956,0.010883,-0.031312,-0.063954,0.045778,0.021745,0.01154
1,-0.027313,-0.173435,-0.047737,-0.116093,0.104681,0.098169,0.033931,-0.12227,-0.287956,0.062984,...,-0.275156,-0.057395,0.008508,0.258975,0.151782,-0.020713,0.038451,-0.17754,0.000615,0.067953
2,0.031391,-0.018802,0.024694,-0.008162,-0.026555,-0.000861,-0.01998,0.015817,0.024177,0.01087,...,0.003158,-0.016125,-0.01117,0.012918,0.007613,-0.024196,-0.032643,0.024943,0.012432,0.022935
3,-0.04332,-0.10956,-0.015442,-0.077797,0.061917,0.032022,0.005346,-0.054202,-0.077683,-0.094736,...,-0.265825,-0.021202,0.047354,0.025186,0.04169,0.019355,0.027115,-0.041112,0.037517,-0.034422
4,-0.051266,-0.035526,-0.010039,-0.023131,0.04767,-0.012697,0.066157,0.016522,0.013432,-0.006722,...,-0.031288,-0.051745,0.004116,-0.006013,-0.014885,-0.040796,0.000389,-0.029573,0.042336,0.037725


In [92]:
matrix_labels_word2vec = create_multi_label_matrix(df_word2vec,"Tags",['Title_2','Tags','Text'],70,['0_','-','_'])
print('Multi-label matrix shape for Word2Vec: ',matrix_labels.shape)
matrix_labels_word2vec.head(5)

step 1 : splitting the Tags values up to 6 maximum into columns and join it to the dataset ... 
Step 2 : deleting the columns Text/Title/Tags and stack the tags columns dataframe ... 
Step 3 : deleting columns of least frequent tags ...
Step 4 : creating a dummy dataframe ...
Step 5 : cleaning the name of the tags columns ...
Step 6 : GroupBy sentences and sum values per Tags columns ...
Multi-label matrix shape for Word2Vec:  (48434, 229)


Unnamed: 0,bit,core,mvc,web-api,.htaccess,.js,.net,.x,actionscript,active-directory,...,xamarin.forms,xamarin.ios,xaml,xcode,xml,xpath,xsd,xslt,yii,zend-framework
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [93]:
final = pd.concat([df_word2vec, matrix_labels_word2vec], axis=1).drop('Tags', axis=1)
final.shape

(49843, 509)

In [94]:
final.isna().sum()

_clean_tags         0
Title_2             0
Text                0
bit               577
core              577
                 ... 
xpath             577
xsd               577
xslt              577
yii               577
zend-framework    577
Length: 509, dtype: int64

In [95]:
index_nan = []
for i, row in final.isna().iterrows():
    if row['core'] == True:
        index_nan.append(i)
print('Number of NaNs to delete in the Word2Ve matrix : ',len(index_nan))

Number of NaNs to delete in the Word2Ve matrix :  577


In [96]:
print('Shape before deleting NaNs : ',final.shape )
final = final.dropna()
print('Shape after deleting NaNs : ', final.shape)

Shape before deleting NaNs :  (49843, 509)
Shape after deleting NaNs :  (49266, 509)


In [97]:
for i, row in df_matrix_word2vec_embedding.iterrows():
    if i in index_nan:
        df_matrix_word2vec_embedding.drop(i, axis=0, inplace=True)

print('Checking that the shape of the embedded matrix corresponds of the shape of the dataset : ',df_matrix_word2vec_embedding.shape)

Checking that the shape of the embedded matrix corresponds of the shape of the dataset :  (49266, 300)


In [98]:
print('Saving the multi-labels tags ...',)
target_cols = matrix_labels_word2vec.columns.values
target_cols.shape

Saving the multi-labels tags ...


(506,)

> ##### Splitting the Train set from the Test set:

In [100]:
print('Splitting the dataset into Train data and Test data ...')
X = df_matrix_word2vec_embedding
y = final[target_cols]

X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec = train_test_split(X, y, test_size = 0.25, random_state = 42)
print(X_train_word2vec.shape, y_train_word2vec.shape)
print(X_test_word2vec.shape, y_test_word2vec.shape)

Splitting the dataset into Train data and Test data ...
(36949, 300) (36949, 506)
(12317, 300) (12317, 506)


In [101]:
print('Instantiate the MultiOutputClassifier for the Word2Vec embedding approach ... ')
clf_word2vec = MultiOutputClassifier(LogisticRegression(n_jobs=-1, max_iter= 200,solver='sag', multi_class='multinomial'),n_jobs=-1)

Instantiate the MultiOutputClassifier for the Word2Vec embedding approach ... 


In [102]:
clf_word2vec = clf_word2vec.fit(X_train_word2vec, y_train_word2vec)
clf_word2vec

In [204]:
torch.save(clf_word2vec, 'word2vec_model.pkl')

In [205]:
torch.save(clf_word2vec, 'word2vec_model.pt')

In [103]:
print('Predicting the Y_train and Y_test ...')
y_pred_train_word2vec = clf_word2vec.predict(X_train_word2vec)
y_pred_test_word2vec = clf_word2vec.predict(X_test_word2vec)
print('Done')

Predicting the Y_train and Y_test ...
Done


In [104]:
print('Transforming the predicted array into a Dataframe for future analysis ...')
y_pred_test_word2vec_df = pd.DataFrame(y_pred_test_word2vec)
y_pred_test_word2vec_df.columns = y_test_word2vec.columns
print('Done')

Transforming the predicted array into a Dataframe for future analysis ...
Done


In [105]:
y_test_word2vec = y_test_word2vec.reset_index(drop=True)

## MultiOutputClassifier using USE embedding

In [162]:
print('Loading the dataset specific to Universal Sentence Encoding ... ')
df_use = pd.read_csv('/Users/maurelco/Developer/Python/Projet4/data/Cleaned/df_sample_USE.csv')
print('Shape : ',df_use.shape)
df_use.head(5)

Loading the dataset specific to Universal Sentence Encoding ... 
Shape :  (19937, 4)


Unnamed: 0,Tags,_clean_tags,Title_2,Text
0,python pycharm tuples type-hinting numpydoc,"['python', 'pycharm', 'tuples', 'type-hinting'...",document multiple value numpydoc format,document multiple value numpydoc format docume...
1,javascript jquery ajax client jqgrid,"['javascript', 'jquery', 'ajax', 'client', 'jq...",add script button row jqgrid,add script button row jqgrid handle click butt...
2,angularjs d .js angularjs-directive jasmine an...,"['angularjs', 'd', '.js', 'angularjs-directive...",testing angular service,testing angular service visualization directiv...
3,python html templates dictionary jinja,"['python', 'html', 'templates', 'dictionary', ...",ordering dictionary value jinja template,ordering dictionary value jinja template jinja...
4,iphone uiview input uinavigationcontroller uip...,"['iphone', 'uiview', 'input', 'uinavigationcon...",iphone set clear window-size blocker view,iphone set clear window-size blocker view feel...


In [163]:
print('Loading the USE embedded matrix ... ')
df_matrix_use_embedding = pd.read_csv('/Users/maurelco/Developer/Python/Projet4/data/Cleaned/df_USE_matrix_embedding.csv')
print('Shape : ',df_matrix_use_embedding.shape)
df_matrix_use_embedding[:5]

Loading the USE embedded matrix ... 
Shape :  (19937, 512)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,502,503,504,505,506,507,508,509,510,511
0,0.046964,-0.057052,0.054507,0.053334,-0.01452,0.050335,-0.007187,-0.05146,-0.049502,0.055937,...,0.039558,-0.057088,0.041486,-0.05695,-0.00568,-0.047099,0.054239,0.057088,-0.0147,-0.002092
1,-0.046036,-0.052987,0.053248,-0.048955,0.008427,0.05294,0.038515,-0.053076,-0.053524,0.050871,...,0.052285,-0.053531,-0.030481,0.046797,-0.046828,-0.023913,0.033414,0.053531,-0.050665,-0.050646
2,-0.04854,-0.051164,0.022097,0.050482,-0.036864,-0.051216,-0.043904,-0.031899,-0.051167,0.018467,...,-0.048398,-0.051216,-0.042921,-0.05117,-0.050854,0.034542,0.046458,0.051216,-0.042362,0.044671
3,-0.053333,-0.055284,0.003104,0.047137,0.050313,-0.035289,0.041185,-0.048864,-0.01225,0.049218,...,0.049453,-0.055307,-0.052105,0.021132,0.011582,-0.055035,-0.038493,0.055303,-0.008562,-0.055214
4,-0.01337,-0.032238,-0.033093,0.006536,-0.036437,-0.014039,-0.018035,0.03377,0.051799,0.060404,...,0.028044,-0.063751,-0.018454,0.032906,-0.017325,0.043295,0.037858,0.063871,-0.062935,-0.001134


In [195]:
type(df_matrix_use_embedding)

pandas.core.frame.DataFrame

In [164]:
matrix_labels_use = create_multi_label_matrix(df_use,"Tags",['Title_2','_clean_tags','Tags','Text'],100,['0_','-','_'])
print('Multi-label matrix shape for Word2Vec: ',matrix_labels_use.shape)
matrix_labels_use.head(5)

step 1 : splitting the Tags values up to 6 maximum into columns and join it to the dataset ... 
Step 2 : deleting the columns Text/Title/Tags and stack the tags columns dataframe ... 
Step 3 : deleting columns of least frequent tags ...
Step 4 : creating a dummy dataframe ...
Step 5 : cleaning the name of the tags columns ...
Step 6 : GroupBy sentences and sum values per Tags columns ...
Multi-label matrix shape for Word2Vec:  (18886, 126)


Unnamed: 0,core,mvc,.net,.x,actionscript,ajax,algorithm,amazon-web-services,android,android-studio,...,wcf,web-services,windows,winforms,wordpress,wpf,x,xaml,xcode,xml
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [165]:
final = pd.concat([df_use, matrix_labels_use], axis=1).drop('Tags', axis=1)
final.shape

(19937, 129)

In [166]:
final.isna().sum()

_clean_tags       0
Title_2           0
Text              0
core           1051
mvc            1051
               ... 
wpf            1051
x              1051
xaml           1051
xcode          1051
xml            1051
Length: 129, dtype: int64

In [167]:
index_nan = []
for i, row in final.isna().iterrows():
    if row['core'] == True:
        index_nan.append(i)
print('Number of NaNs to delete in the USE matrix : ',len(index_nan))

Number of NaNs to delete in the USE matrix :  1051


In [168]:
print('Shape before deleting NaNs : ',final.shape )
final = final.dropna()
print('Shape after deleting NaNs : ', final.shape)

Shape before deleting NaNs :  (19937, 129)
Shape after deleting NaNs :  (18886, 129)


In [169]:
for i, row in df_matrix_use_embedding.iterrows():
    if i in index_nan:
        df_matrix_use_embedding.drop(i, axis=0, inplace=True)

df_matrix_use_embedding.shape

(18886, 512)

In [170]:
print('Saving the multi-labels outputs tags ... ')
target_cols = matrix_labels_use.columns.values
target_cols[:5]

Saving the multi-labels outputs tags ... 


array(['core', 'mvc', '.net', '.x', 'actionscript'], dtype=object)

In [171]:
print('Splitting data into train data and validation data : ')
X = df_matrix_use_embedding
y = final[target_cols]

X_train_use, X_test_use, y_train_use, y_test_use = train_test_split(X, y, test_size = 0.2, random_state = 42)
print('Shape of X_train and Y-train : ',X_train_use.shape, y_train_use.shape)
print('Shape of X_test and y_test : ',X_test_use.shape, y_test_use.shape)

Splitting data into train data and validation data : 
Shape of X_train and Y-train :  (15108, 512) (15108, 126)
Shape of X_test and y_test :  (3778, 512) (3778, 126)


In [172]:
print('Instanciate a MultiOutputClassifier for the USE embedding approach ...')
clf_use = MultiOutputClassifier(LogisticRegression(n_jobs=-1, max_iter= 200,solver='sag', multi_class='multinomial'),n_jobs=-1)

Instanciate a MultiOutputClassifier for the USE embedding approach ...


In [183]:
X_train_use

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,502,503,504,505,506,507,508,509,510,511
18094,-0.051499,-0.025185,0.009498,-0.042865,-0.031793,0.056037,-0.062414,-0.053468,-0.015504,0.045969,...,0.043504,-0.064834,-0.016211,-0.029759,0.046647,0.011567,-0.035060,-0.004782,-0.056792,0.000935
11838,-0.067415,-0.064604,-0.010247,0.053043,-0.025951,0.002847,0.034059,0.001695,-0.045038,-0.019465,...,0.053251,-0.067704,-0.010557,0.004605,0.005079,-0.064375,0.015156,0.035226,-0.067358,-0.046550
9403,-0.025513,-0.052453,0.040335,0.052401,-0.044514,-0.040048,0.016213,-0.043764,-0.053199,-0.015527,...,0.042718,-0.053321,0.052588,0.041495,-0.051388,-0.050355,-0.004409,0.053321,-0.047316,-0.052841
13943,0.046005,-0.040428,-0.030146,0.058840,-0.054847,0.023528,-0.059034,-0.058609,0.050264,0.042895,...,0.027283,-0.058473,-0.058172,-0.032009,0.041688,0.003648,0.052627,-0.047537,-0.051495,-0.051362
11765,-0.052901,-0.055702,0.053764,0.029436,0.014987,0.050164,0.019628,-0.051084,-0.055510,0.052658,...,-0.049774,-0.055753,0.016679,0.017896,0.026510,0.018545,0.053133,0.055752,-0.035043,-0.055719
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11912,-0.062369,0.047923,0.037698,-0.008464,0.027103,0.004717,-0.065801,0.019506,-0.022582,-0.059834,...,0.015399,-0.065370,-0.048514,-0.008283,-0.032224,-0.048100,0.012479,0.061299,-0.015394,-0.062593
12624,-0.055638,-0.009793,-0.039935,0.047095,0.004741,0.021238,-0.044322,-0.034090,0.052039,-0.042954,...,-0.003985,-0.057782,-0.021293,-0.050831,-0.036694,0.027639,-0.044462,0.059620,0.056963,0.058094
5678,-0.013344,-0.032707,0.055443,-0.060968,-0.045342,0.062095,0.017057,-0.041836,0.009390,0.058670,...,0.033060,0.038311,0.040305,0.047507,0.019211,-0.016109,0.057275,0.049471,-0.047174,0.061954
900,0.048967,0.051173,-0.051215,0.051038,-0.027396,-0.048674,-0.042941,-0.050412,0.014892,0.049187,...,0.051116,-0.051229,-0.039541,0.005828,0.029829,-0.049382,0.048698,0.051059,-0.051128,-0.038407


In [184]:
y_train_use

Unnamed: 0,core,mvc,.net,.x,actionscript,ajax,algorithm,amazon-web-services,android,android-studio,...,wcf,web-services,windows,winforms,wordpress,wpf,x,xaml,xcode,xml
18094,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11838,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9403,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
13943,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12624,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5678,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
900,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [173]:
print('Fitting the Classifier to the multi-class inputs and multi-labels outputs ...')
clf_use = clf_use.fit(X_train_use, y_train_use)
clf_use

Fitting the Classifier to the multi-class inputs and multi-labels outputs


In [180]:
from pickle import dump
dump(clf_use, open('use_model.pkl', 'wt'))

TypeError: write() argument must be str, not bytes

In [181]:
torch.save(clf_use, 'use_model.pkl')

In [198]:
type(X_train_use)

pandas.core.frame.DataFrame

In [199]:
type(y_train_use)

pandas.core.frame.DataFrame

In [200]:
type(X_test_use)

pandas.core.frame.DataFrame

In [174]:
print('Predicting the multi-label tags matrix for the test sentences ... ')
y_pred_train_use = clf_use.predict(X_train_use)
y_pred_test_use = clf_use.predict(X_test_use)

Predicting the multi-label tags matrix for the test sentences ... 


In [185]:
y_pred_test_use

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [188]:
test = pd.DataFrame(y_test_use.columns)

In [189]:
test.to_csv('/Users/maurelco/Developer/Python/Projet4/data/Cleaned/USE_labels_tags.csv', index=False)

In [175]:
y_pred_test_use_df = pd.DataFrame(y_pred_test_use)
y_pred_test_use_df.columns = y_test_use.columns
y_pred_test_use_df.shape

Unnamed: 0,core,mvc,.net,.x,actionscript,ajax,algorithm,amazon-web-services,android,android-studio,...,wcf,web-services,windows,winforms,wordpress,wpf,x,xaml,xcode,xml
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3773,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3775,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3776,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [176]:
y_pred_train_use_df = pd.DataFrame(y_pred_train_use)
y_pred_train_use_df.columns = y_train_use.columns
y_pred_train_use_df.shape

(15108, 126)

In [177]:
y_test_use = y_test_use.reset_index(drop=True)

## EVALUATION de la prediction du model

In [335]:
def compare_predicted_tags(y_test,y_pred, nbr_samples):
    y_test = y_test.reset_index()
    y_pred = y_pred.reset_index()
    sample_test = y_test.sample(nbr_samples)
    sample_pred = y_pred.iloc[sample_test.index]
    for i in range(0,len(sample_test)):
        print("\033[1m" + 'Sentence ' + str(i) +"\033[0m")
        doc = sample_test.iloc[i]
        # print('Doc : ', doc)
        # print('Doc type : ',type(doc))
        print("\033[1m" +'Correct Tags :\n' + "\033[0m" +str(doc[doc == 1]) )
        tags = doc[doc == 1]
        # print(tags)
        # print(np.array(tags.index))
        doc = sample_pred.iloc[i]
        print("\033[1m" +'Predicted Tags :\n' + "\033[0m" +str(doc[doc == 1]) )

In [336]:
compare_predicted_tags(y_test_tfidf, y_pred_sgdc_df, 5)

[1mSentence 0[0m
[1mCorrect Tags :
[0mc++         1.0
opencv      1.0
pointers    1.0
Name: 8311, dtype: float64
[1mPredicted Tags :
[0mc++    1.0
Name: 8311, dtype: float64
[1mSentence 1[0m
[1mCorrect Tags :
[0mwindows    1.0
Name: 1010, dtype: float64
[1mPredicted Tags :
[0mc++    1.0
Name: 1010, dtype: float64
[1mSentence 2[0m
[1mCorrect Tags :
[0mjavascript    1.0
jquery        1.0
pdf           1.0
Name: 6994, dtype: float64
[1mPredicted Tags :
[0mjavascript    1.0
pdf           1.0
Name: 6994, dtype: float64
[1mSentence 3[0m
[1mCorrect Tags :
[0mpython    1.0
Name: 4497, dtype: float64
[1mPredicted Tags :
[0mpython    1.0
Name: 4497, dtype: float64
[1mSentence 4[0m
[1mCorrect Tags :
[0mc       1.0
c#      1.0
c++     1.0
java    1.0
Name: 6099, dtype: float64
[1mPredicted Tags :
[0mc#      1.0
c++     1.0
java    1.0
Name: 6099, dtype: float64


In [337]:
compare_predicted_tags(y_test_tfidf, y_pred_test_text_df, 5)

[1mSentence 0[0m
[1mCorrect Tags :
[0meclipse    1.0
java       1.0
Name: 1128, dtype: float64
[1mPredicted Tags :
[0meclipse    1.0
Name: 1128, dtype: float64
[1mSentence 1[0m
[1mCorrect Tags :
[0msql                  1.0
sql-server           1.0
stored-procedures    1.0
tsql                 1.0
Name: 2651, dtype: float64
[1mPredicted Tags :
[0msql           1.0
sql-server    1.0
Name: 2651, dtype: float64
[1mSentence 2[0m
[1mCorrect Tags :
[0mforms         1.0
javascript    1.0
jquery        1.0
validation    1.0
Name: 691, dtype: float64
[1mPredicted Tags :
[0mjavascript    1.0
jquery        1.0
Name: 691, dtype: float64
[1mSentence 3[0m
[1mCorrect Tags :
[0mjava       1.0
regex      1.0
unicode    1.0
Name: 1098, dtype: float64
[1mPredicted Tags :
[0mjava    1.0
Name: 1098, dtype: float64
[1mSentence 4[0m
[1mCorrect Tags :
[0m.x     1.0
csv    1.0
Name: 1043, dtype: float64
[1mPredicted Tags :
[0mcsv       1.0
python    1.0
Name: 1043, dtype: float64


In [325]:
compare_predicted_tags(y_test_use,y_pred_test_use_df, 5)

[1mSentence 0[0m
[1mCorrect Tags :
[0mspring-boot    1.0
tomcat         1.0
Name: 2298, dtype: float64
[1mPredicted Tags :
[0mamazon-web-services    1.0
java                   1.0
spring                 1.0
Name: 2298, dtype: float64
[1mSentence 1[0m
[1mCorrect Tags :
[0mc#          1.0
linq        1.0
winforms    1.0
Name: 653, dtype: float64
[1mPredicted Tags :
[0mc#      1.0
linq    1.0
Name: 653, dtype: float64
[1mSentence 2[0m
[1mCorrect Tags :
[0mios            1.0
objective-c    1.0
Name: 594, dtype: float64
[1mPredicted Tags :
[0mobjective-c    1.0
Name: 594, dtype: float64
[1mSentence 3[0m
[1mCorrect Tags :
[0masp.net    1.0
Name: 2957, dtype: float64
[1mPredicted Tags :
[0mc#    1.0
Name: 2957, dtype: float64
[1mSentence 4[0m
[1mCorrect Tags :
[0mc#        1.0
string    1.0
Name: 2776, dtype: float64
[1mPredicted Tags :
[0m.net    1.0
c#      1.0
Name: 2776, dtype: float64


In [301]:
compare_predicted_tags(y_test_use,y_pred_test_use_df, 1)

[1mSentence 0[0m
Doc :  core            0.0
mvc             0.0
.net            0.0
.x              0.0
actionscript    0.0
               ... 
wpf             0.0
x               0.0
xaml            0.0
xcode           0.0
xml             0.0
Name: 1065, Length: 126, dtype: float64
Doc type :  <class 'pandas.core.series.Series'>
[1mCorrect Tags :
[0mvisual-studio    1.0
Name: 1065, dtype: float64
visual-studio    1.0
Name: 1065, dtype: float64
['visual-studio']
[1mPredicted Tags :
[0mc++    1.0
Name: 1065, dtype: float64


In [194]:
compare_predicted_tags(y_test_use,y_pred_test_use_df, 1)

[1mSentence 0[0m
[1mCorrect Tags :
[0mc#      1.0
wpf     1.0
xaml    1.0
Name: 2560, dtype: float64
['c#' 'wpf' 'xaml']
[1mPredicted Tags :
[0mc#      1.0
wpf     1.0
xaml    1.0
Name: 2560, dtype: float64


In [178]:
compare_predicted_tags(y_test_use,y_pred_test_use_df)

[1mSentence 0[0m
[1mCorrect Tags :
[0m.net    1.0
c#      1.0
wpf     1.0
Name: 3724, dtype: float64
[1mPredicted Tags :
[0mc#          1.0
winforms    1.0
Name: 3724, dtype: float64
[1mSentence 1[0m
[1mCorrect Tags :
[0mjava           1.0
spring-boot    1.0
xml            1.0
Name: 508, dtype: float64
[1mPredicted Tags :
[0mjava    1.0
Name: 508, dtype: float64
[1mSentence 2[0m
[1mCorrect Tags :
[0msql           1.0
sql-server    1.0
xml           1.0
Name: 1349, dtype: float64
[1mPredicted Tags :
[0mxml    1.0
Name: 1349, dtype: float64
[1mSentence 3[0m
[1mCorrect Tags :
[0malgorithm    1.0
python       1.0
Name: 1546, dtype: float64
[1mPredicted Tags :
[0malgorithm    1.0
python       1.0
Name: 1546, dtype: float64
[1mSentence 4[0m
[1mCorrect Tags :
[0mandroid    1.0
Name: 3281, dtype: float64
[1mPredicted Tags :
[0mandroid    1.0
Name: 3281, dtype: float64
[1mSentence 5[0m
[1mCorrect Tags :
[0mauthentication    1.0
python            1.0
Name: 73, d

In [148]:
compare_predicted_tags(y_test_word2vec,y_pred_test_word2vec_df)

[1mSentence 0[0m
[1mCorrect Tags :
[0mjava           1.0
permissions    1.0
security       1.0
sockets        1.0
Name: 1324, dtype: float64
[1mPredicted Tags :
[0mSeries([], Name: 1324, dtype: float64)
[1mSentence 1[0m
[1mCorrect Tags :
[0mandroid    1.0
Name: 6268, dtype: float64
[1mPredicted Tags :
[0mandroid    1.0
java       1.0
Name: 6268, dtype: float64
[1mSentence 2[0m
[1mCorrect Tags :
[0majax          1.0
javascript    1.0
jquery        1.0
Name: 11543, dtype: float64
[1mPredicted Tags :
[0majax            1.0
asynchronous    1.0
javascript      1.0
promise         1.0
Name: 11543, dtype: float64
[1mSentence 3[0m
[1mCorrect Tags :
[0mdownload             1.0
html                 1.0
internet-explorer    1.0
vb.net               1.0
Name: 4193, dtype: float64
[1mPredicted Tags :
[0mc#    1.0
Name: 4193, dtype: float64
[1mSentence 4[0m
[1mCorrect Tags :
[0mapi             1.0
concurrency     1.0
java            1.0
rest            1.0
web-services   

In [149]:
compare_predicted_tags(y_test_bert,y_pred_test_bert_df)

[1mSentence 0[0m
[1mCorrect Tags :
[0mnumpy     1.0
python    1.0
Name: 4620, dtype: float64
[1mPredicted Tags :
[0m.x        1.0
arrays    1.0
numpy     1.0
Name: 4620, dtype: float64
[1mSentence 1[0m
[1mCorrect Tags :
[0mazure           1.0
powershell      1.0
unit-testing    1.0
Name: 3894, dtype: float64
[1mPredicted Tags :
[0mSeries([], Name: 3894, dtype: float64)
[1mSentence 2[0m
[1mCorrect Tags :
[0mauthentication    1.0
Name: 2213, dtype: float64
[1mPredicted Tags :
[0mSeries([], Name: 2213, dtype: float64)
[1mSentence 3[0m
[1mCorrect Tags :
[0mparsing    1.0
sorting    1.0
Name: 3055, dtype: float64
[1mPredicted Tags :
[0mjava       1.0
sorting    1.0
Name: 3055, dtype: float64
[1mSentence 4[0m
[1mCorrect Tags :
[0mcookies              1.0
firefox              1.0
google-chrome        1.0
internet-explorer    1.0
Name: 5381, dtype: float64
[1mPredicted Tags :
[0mSeries([], Name: 5381, dtype: float64)
[1mSentence 5[0m
[1mCorrect Tags :
[0masp.n

In [150]:
compare_predicted_tags(y_test, y_pred_test_text_df)

[1mSentence 0[0m
[1mCorrect Tags :
[0mdom           1.0
html          1.0
javascript    1.0
reactjs       1.0
Name: 187, dtype: float64
[1mPredicted Tags :
[0mjavascript    1.0
reactjs       1.0
Name: 187, dtype: float64
[1mSentence 1[0m
[1mCorrect Tags :
[0mdataframe    1.0
datetime     1.0
pandas       1.0
python       1.0
Name: 5434, dtype: float64
[1mPredicted Tags :
[0mdataframe    1.0
pandas       1.0
python       1.0
Name: 5434, dtype: float64
[1mSentence 2[0m
[1mCorrect Tags :
[0mcurl          1.0
javascript    1.0
php           1.0
Name: 5308, dtype: float64
[1mPredicted Tags :
[0mjavascript    1.0
php           1.0
Name: 5308, dtype: float64
[1mSentence 3[0m
[1mCorrect Tags :
[0mjava    1.0
ssl     1.0
Name: 2276, dtype: float64
[1mPredicted Tags :
[0mSeries([], Name: 2276, dtype: float64)
[1mSentence 4[0m
[1mCorrect Tags :
[0mphp     1.0
soap    1.0
xml     1.0
Name: 5262, dtype: float64
[1mPredicted Tags :
[0mphp    1.0
Name: 5262, dtype: float

## Model Metrics

In [157]:
def calculate_scores(y_true_df,y_pred_df, number_of_score_to_print=None,jaccard=True, f1=False, accuracy=False, loss=False, ham_loss=False, precision=False, recall=False):
    if jaccard:
        print('-------------- JACCARD SCORES -------------')
        print('-------------------------------------------')
        jaccard_scores = {}
        for i in range(len(y_true_df)):
            jaccard_scores[i] = jaccard_score(y_true_df.iloc[i], y_pred_df.iloc[i], average='macro',zero_division=0)
        print({A:N for (A,N) in [x for x in jaccard_scores.items()][:10]})
        print('---------- MIN JACCARD SCORE ----------')
        print(jaccard_scores[min(jaccard_scores, key=jaccard_scores.get)])
        print('---------- MAX JACCARD SCORE -----------')
        print(jaccard_scores[max(jaccard_scores, key=jaccard_scores.get)])
        print('---------- AVERAGE JACCARD SCORE ---------')
        print(mean([jaccard_scores[key] for key in jaccard_scores]))

    if f1:
        print()
        print('---------------- F1 SCORES ----------------')
        print('-------------------------------------------')
        f1_scores = {}
        for i in range(len(y_true_df)):
            f1_scores[i] = f1_score(y_true_df.iloc[i], y_pred_df.iloc[i], average='macro',zero_division=0)
        print({A:N for (A,N) in [x for x in f1_scores.items()][:10]})
        print('---------- MIN F1 SCORE ----------')
        print(f1_scores[min(f1_scores, key=f1_scores.get)])
        print('---------- MAX F1 SCORE ----------')
        print(f1_scores[max(f1_scores, key=f1_scores.get)])
        print('---------- AVERAGE F1 SCORE ----------')
        print(mean([f1_scores[key] for key in f1_scores]))

    if accuracy:
        print()
        print('------------- ACCURACY SCORES -------------')
        print('-------------------------------------------')
        accuracy_scores = {}
        for i in range(len(y_true_df)):
            accuracy_scores[i] = balanced_accuracy_score(y_true_df.iloc[i], y_pred_df.iloc[i])
        print({A:N for (A,N) in [x for x in accuracy_scores.items()][:10]})
        print('---------- MIN ACCURACY SCORE ----------')
        print(accuracy_scores[min(accuracy_scores, key=accuracy_scores.get)])
        print('---------- MAX ACCURACY SCORE ----------')
        print(accuracy_scores[max(accuracy_scores, key=accuracy_scores.get)])
        print('---------- AVERAGE ACCURACY SCORE ----------')
        print(mean([accuracy_scores[key] for key in accuracy_scores]))

    if loss:
        print()
        print('--------------- LOSS SCORES ---------------')
        print('-------------------------------------------')
        loss_scores = {}
        for i in range(len(y_true_df)):
            loss_scores[i] = zero_one_loss(y_true_df.iloc[i], y_pred_df.iloc[i], normalize=False)
        print({A:N for (A,N) in [x for x in loss_scores.items()][:10]})
        print('---------- MIN LOSS SCORE -----------')
        print(loss_scores[min(loss_scores, key=loss_scores.get)])
        print('---------- MAX LOSS SCORE -----------')
        print(loss_scores[max(loss_scores, key=loss_scores.get)])
        print('---------- AVERAGE LOSS SCORE ---------')
        print(mean([loss_scores[key] for key in loss_scores]))

    if ham_loss:
        print()
        print('------------- HAMMING LOSS SCORES -----------')
        print('---------------------------------------------')
        hamming_loss_scores = {}
        for i in range(len(y_true_df)):
            hamming_loss_scores[i] = hamming_loss(y_true_df.iloc[i], y_pred_df.iloc[i])
        print({A:N for (A,N) in [x for x in hamming_loss_scores.items()][:10]})
        print('---------- MIN HAMMING LOSS SCORE ----------')
        print(hamming_loss_scores[min(hamming_loss_scores, key=hamming_loss_scores.get)])
        print('---------- MAX HAMMING LOSS SCORE ----------')
        print(hamming_loss_scores[max(hamming_loss_scores, key=hamming_loss_scores.get)])
        print('--------- AVERAGE HAMMING LOSS SCORE ---------')
        print(mean([hamming_loss_scores[key] for key in hamming_loss_scores]))


    if precision:
        print()
        print('---------------- PRECISION SCORES -----------')
        print('---------------------------------------------')
        precision_scores = {}
        for i in range(len(y_true_df)):
            precision_scores[i] = precision_score(y_true_df.iloc[i], y_pred_df.iloc[i], average='macro', zero_division=0)
        print({A:N for (A,N) in [x for x in precision_scores.items()][:10]})
        print('--------------- MIN PRECISION SCORE -----------')
        print(precision_scores[min(precision_scores, key=precision_scores.get)])
        print('--------------- MAX PRECISION SCORE -----------')
        print(precision_scores[max(precision_scores, key=precision_scores.get)])
        print('--------------- AVERAGE PRECISION SCORE ----------')
        print(mean([precision_scores[key] for key in precision_scores]))

        if recall:
            print()
            print('--------------- RECALL SCORES ---------------')
            print('---------------------------------------------')
            recall_scores = {}
            for i in range(len(y_true_df)):
                recall_scores[i] = recall_score(y_true_df.iloc[i], y_pred_df.iloc[i], average='macro',zero_division=0)
            print({A:N for (A,N) in [x for x in recall_scores.items()][:10]})
            print('--------------- MIN RECALL SCORE --------------')
            print(recall_scores[min(recall_scores, key=recall_scores.get)])
            print('--------------- MAX RECALL SCORE --------------')
            print(recall_scores[max(recall_scores, key=recall_scores.get)])
            print('--------------- AVERAGE RECALL SCORE -----------')
            print(mean([recall_scores[key] for key in recall_scores]))

In [None]:
This warning is saying you that the classification_report output is influenced because of one of labels is never predicted for your model (in your case, label "2").

This will generate a problem calculating Precision (dividing by 0), because (true positives + false positives =0). When the function deals with this problem, aoutomatically output 0. Note this is not the real value, it should be "undefined" or something like this, but it's his approach. As you can see, when you are calculating macro avg, you are using this calculated 0. So the error is just reminding you that you macro avg is influenced by a "fake" 0.

In [310]:
calculate_scores(y_test_tfidf, y_pred_sgdc_df, 10, f1=True)

-------------- JACCARD SCORES -------------
-------------------------------------------
{0: 0.7478070175438596, 1: 0.7455947136563876, 2: 0.49344978165938863, 3: 0.5912280701754385, 4: 0.49344978165938863, 5: 1.0, 6: 0.5912280701754385, 7: 0.5912280701754385, 8: 0.7478070175438596, 9: 0.4978165938864629}
---------- MIN JACCARD SCORE ----------
0.2445414847161572
---------- MAX JACCARD SCORE -----------
1.0
---------- AVERAGE JACCARD SCORE ---------
0.6276655666219593

---------------- F1 SCORES ----------------
-------------------------------------------
{0: 0.8322344322344322, 1: 0.8311209439528023, 2: 0.4967032967032967, 3: 0.6622418879056047, 4: 0.4967032967032967, 5: 1.0, 6: 0.6622418879056047, 7: 0.6622418879056047, 8: 0.8322344322344322, 9: 0.49890590809628005}
---------- MIN F1 SCORE ----------
0.24724061810154524
---------- MAX F1 SCORE ----------
1.0
---------- AVERAGE F1 SCORE ----------
0.6682035990932774


In [179]:
calculate_scores(y_test_use, y_pred_test_use_df,10,f1=True,accuracy=False,loss=True, ham_loss=True, precision=True, recall=True)

-------------- JACCARD SCORES -------------
-------------------------------------------
{0: 0.7419354838709677, 1: 0.6586666666666666, 2: 0.4112903225806452, 3: 0.49603174603174605, 4: 0.49206349206349204, 5: 0.49206349206349204, 6: 0.8293010752688172, 7: 0.6586666666666666, 8: 1.0, 9: 0.746}
---------- MIN JACCARD SCORE ----------
0.246
---------- MAX JACCARD SCORE -----------
1.0
---------- AVERAGE JACCARD SCORE ---------
0.6628626243154851

---------------- F1 SCORES ----------------
-------------------------------------------
{0: 0.8292682926829268, 1: 0.7459677419354839, 2: 0.46395663956639566, 3: 0.49800796812749004, 4: 0.49599999999999994, 5: 0.49599999999999994, 6: 0.8979757085020244, 7: 0.7459677419354839, 8: 1.0, 9: 0.8313253012048192}
---------- MIN F1 SCORE ----------
0.24798387096774194
---------- MAX F1 SCORE ----------
1.0
---------- AVERAGE F1 SCORE ----------
0.7028951356885439

--------------- LOSS SCORES ---------------
-------------------------------------------
{0:

In [158]:
calculate_scores(y_test_word2vec, y_pred_test_word2vec_df,10,f1=True,accuracy=False,loss=True, ham_loss=True, precision=True, recall=True)

-------------- JACCARD SCORES -------------
-------------------------------------------
{0: 0.49604743083003955, 1: 0.49703557312252966, 2: 0.899003984063745, 3: 0.49604743083003955, 4: 0.4950592885375494, 5: 0.6970238095238095, 6: 0.49703557312252966, 7: 0.3313570487483531, 8: 0.33069828722002637, 9: 0.748015873015873}
---------- MIN JACCARD SCORE ----------
0.24802371541501977
---------- MAX JACCARD SCORE -----------
1.0
---------- AVERAGE JACCARD SCORE ---------
0.5576979227928643

---------------- F1 SCORES ----------------
-------------------------------------------
{0: 0.498015873015873, 1: 0.4985133795837463, 2: 0.9439459399579041, 3: 0.498015873015873, 4: 0.49751737835153925, 5: 0.7842217484008529, 6: 0.4985133795837463, 7: 0.33234225305583087, 8: 0.33201058201058203, 9: 0.832339297548045}
---------- MIN F1 SCORE ----------
0.24900793650793648
---------- MAX F1 SCORE ----------
1.0
---------- AVERAGE F1 SCORE ----------
0.5851300922102908

--------------- LOSS SCORES ----------

In [159]:
calculate_scores(y_test_bert, y_pred_test_bert_df,10,f1=True,accuracy=False,loss=True, ham_loss=True, precision=True, recall=True)

-------------- JACCARD SCORES -------------
-------------------------------------------
{0: 1.0, 1: 0.49361702127659574, 2: 0.48936170212765956, 3: 0.49361702127659574, 4: 0.49361702127659574, 5: 0.6623931623931624, 6: 0.7478632478632479, 7: 0.7457081545064378, 8: 0.4957446808510638, 9: 0.4978723404255319}
---------- MIN JACCARD SCORE ----------
0.24574468085106382
---------- MAX JACCARD SCORE -----------
1.0
---------- AVERAGE JACCARD SCORE ---------
0.5877916791963215

---------------- F1 SCORES ----------------
-------------------------------------------
{0: 1.0, 1: 0.49678800856531047, 2: 0.4946236559139785, 3: 0.49678800856531047, 4: 0.49678800856531047, 5: 0.747854077253219, 6: 0.8322626695217701, 7: 0.8311781609195402, 8: 0.49786324786324787, 9: 0.4989339019189766}
---------- MIN F1 SCORE ----------
0.24785407725321887
---------- MAX F1 SCORE ----------
1.0
---------- AVERAGE F1 SCORE ----------
0.6204984573316698

--------------- LOSS SCORES ---------------
--------------------

In [161]:
calculate_scores(y_test_tfidf,y_pred_test_text_df,10,f1=True,accuracy=False,loss=True, ham_loss=True, precision=True, recall=True)

-------------- JACCARD SCORES -------------
-------------------------------------------
{0: 0.7478070175438596, 1: 0.7455947136563876, 2: 0.49344978165938863, 3: 0.5912280701754385, 4: 0.49344978165938863, 5: 0.7478070175438596, 6: 0.5912280701754385, 7: 0.6933920704845815, 8: 0.6622807017543859, 9: 0.4978165938864629}
---------- MIN JACCARD SCORE ----------
0.24563318777292575
---------- MAX JACCARD SCORE -----------
1.0
---------- AVERAGE JACCARD SCORE ---------
0.6307328813781168

---------------- F1 SCORES ----------------
-------------------------------------------
{0: 0.8322344322344322, 1: 0.8311209439528023, 2: 0.4967032967032967, 3: 0.6622418879056047, 4: 0.4967032967032967, 5: 0.8322344322344322, 6: 0.6622418879056047, 7: 0.7823883433639531, 8: 0.7477973568281938, 9: 0.49890590809628005}
---------- MIN F1 SCORE ----------
0.24779735682819382
---------- MAX F1 SCORE ----------
1.0
---------- AVERAGE F1 SCORE ----------
0.6713637814654748

--------------- LOSS SCORES ----------

## Fonction API

In [None]:
#1/ fonction_api()

In [None]:
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
def preprocess_fct(title,text):
    #1: Delete html balises and lower text
    title = BeautifulSoup(title).get_text().lower()
    text = BeautifulSoup(text).get_text().lower()
    #2: Deelete english words:
    #tokenizer
    tokenizer = nltk.RegexpTokenizer(r'[a-zA_\-+#]*\.?[a-zA_\+#]+')
    tokens_list_title = tokenizer.tokenize(title)
    tokens_list_text = tokenizer.tokenize(text)
    english_stop_words=nltk.corpus.stopwords.words('English')
    clean_tokens_list_title = [word for word in tokens_list_title if word not in english_stop_words]
    clean_tokens_list_text = [word for word in tokens_list_text if word not in english_stop_words]
    #3: lemmatization:
    trans = WordNetLemmatizer()
    trans_title = [trans.lemmatize(word) for word in clean_tokens_list_title]
    trans_text = [trans.lemmatize(word) for word in clean_tokens_list_text]
    final_text = trans_title + trans_text

    return " ".join(final_text)

In [None]:
import tensorflow_hub as hub
def fonction_api(title, text):
    #call the preprocessing fonction
    text = preprocess_fct(title, text)
    #call the USE
    doc_df = use_embedding(text)
    array = predict_emdedded_matrix(doc_df)
    for tag in array:
        print('tag')

In [None]:
def use_embedding(text):
    #call the USE
    text = text.to_list()
    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    embedding = embed(text)
    return pd.DataFrame(embedding.numpy())


In [None]:
def predict_emdedded_matrix(doc_df):
    loaded_model = torch.load('use_model.pkl')
    doc_pred = pd.DataFrame(loaded_model.predict(doc_df))
    target_labels = pd.read_csv('/Users/maurelco/Developer/Python/Projet4/data/Cleaned/USE_labels_tags.csv', index=False)
    doc_pred.columns = target_labels
    tags = doc_pred[doc_pred == 1]
    return np.array(tags.index)

In [85]:
y_pred_train_proba_text = clf_text.predict_proba(X_train_tfidf_text)
y_pred_test_proba_text = clf_text.predict_proba(X_test_tfidf_text)

In [86]:
y_pred_cv_proba_text

[array([[8.68952390e-01, 1.29141729e-01, 1.75210474e-03, 1.53775471e-04],
        [9.83684562e-01, 1.56477943e-02, 6.08193816e-04, 5.94498040e-05],
        [9.63780999e-01, 3.48925528e-02, 1.23871626e-03, 8.77321710e-05],
        ...,
        [9.19485758e-01, 7.81902717e-02, 2.15963660e-03, 1.64333324e-04],
        [9.46380213e-01, 5.19249202e-02, 1.56922232e-03, 1.25644218e-04],
        [8.67554381e-01, 1.29693078e-01, 2.57198165e-03, 1.80559145e-04]]),
 array([[0.89605627, 0.10394373],
        [0.98019526, 0.01980474],
        [0.94765742, 0.05234258],
        ...,
        [0.91729809, 0.08270191],
        [0.96372151, 0.03627849],
        [0.99080059, 0.00919941]]),
 array([[9.85681887e-01, 1.37743471e-02, 4.84581767e-04, 5.91843088e-05],
        [9.64608022e-01, 3.43794294e-02, 8.93883684e-04, 1.18664624e-04],
        [9.92361014e-01, 7.28650559e-03, 3.06429094e-04, 4.60516488e-05],
        ...,
        [9.13948687e-01, 7.70648656e-02, 8.63138695e-03, 3.55060590e-04],
        [9.63

In [111]:
y_pred_cv_proba_text = np.concatenate(y_pred_cv_proba_text, axis=1)

In [108]:
y_pred_cv_proba_text_df = pd.DataFrame(y_pred_cv_proba_text)
y_pred_cv_proba_text_df.columns = y_cv.columns
y_pred_cv_proba_text_df

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (231, 8476) + inhomogeneous part.

In [878]:
# def get_best_threshold(true, pred):
#     true_df = pd.DataFrame(true)
#     best_thresholds = []
#     f1_scores = []
#     for idx in range(343):
#         print(true_df.iloc[:][idx])
#         print(pred[:][idx])
#         print((pred[:][idx] > 0.55)*1)
#         f1_scores.append(f1_score(true_df.iloc[:][idx], (pred[:][idx] > 0.55) * 1))
#         best_thresh = np.argmax(f1_scores)
#         best_thresholds.append(best_thresh)
#     return best_thresholds

In [972]:
# def get_best_thresholds(true, pred):
#     thresholds = [i/100 for i in range(0,100)]
#     best_thresholds = []
#     for idx in range(231):
#         f1_scores = [f1_score(true[:,idx], (np.argmax(y_pred_cv_proba_text[idx][idx], axis=1) > thresh) * 1, average='macro') for thresh in thresholds]
#         best_thresh = thresholds[np.argmax(f1_scores)]
#         best_thresholds.append(best_thresh)
#     return best_thresholds

In [115]:
def get_best_thresholds(true, pred):
    thresholds = [i/100 for i in range(100)]
    best_thresholds = []
    for idx in range(len(y_cv.columns)):
        print(idx)
        f1_scores = [f1_score(true[:, idx], (pred[:,idx] > thresh) * 1, average='macro') for thresh in thresholds] #f1_score(y_true_df.iloc[i], y_pred_df.iloc[i], average='macro') f1_score(y_true_df.iloc[i], y_pred_df.iloc[i], average='macro')
        best_thresh = thresholds[np.argmax(f1_scores)]
        best_thresholds.append(best_thresh)
    return best_thresholds

In [87]:
#check that the shapes of y_cv.values and y_pred_cv_proba_text match, since f1_score expects the input to have the same shape.
y_cv.values.shape

(8476, 231)

In [91]:
serie_test = pd.Series(y_pred_cv_proba_text)
serie_test.shape

(231,)

In [105]:
y_pred_cv_proba_text

[array([[8.68952390e-01, 1.29141729e-01, 1.75210474e-03, 1.53775471e-04],
        [9.83684562e-01, 1.56477943e-02, 6.08193816e-04, 5.94498040e-05],
        [9.63780999e-01, 3.48925528e-02, 1.23871626e-03, 8.77321710e-05],
        ...,
        [9.19485758e-01, 7.81902717e-02, 2.15963660e-03, 1.64333324e-04],
        [9.46380213e-01, 5.19249202e-02, 1.56922232e-03, 1.25644218e-04],
        [8.67554381e-01, 1.29693078e-01, 2.57198165e-03, 1.80559145e-04]]),
 array([[0.89605627, 0.10394373],
        [0.98019526, 0.01980474],
        [0.94765742, 0.05234258],
        ...,
        [0.91729809, 0.08270191],
        [0.96372151, 0.03627849],
        [0.99080059, 0.00919941]]),
 array([[9.85681887e-01, 1.37743471e-02, 4.84581767e-04, 5.91843088e-05],
        [9.64608022e-01, 3.43794294e-02, 8.93883684e-04, 1.18664624e-04],
        [9.92361014e-01, 7.28650559e-03, 3.06429094e-04, 4.60516488e-05],
        ...,
        [9.13948687e-01, 7.70648656e-02, 8.63138695e-03, 3.55060590e-04],
        [9.63

In [106]:
y_pred_cv_text

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [116]:
get_best_thresholds(y_cv.values, y_pred_cv_text)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230


[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0

In [104]:
f1_score(y_cv[:, 0], (y_pred_cv_proba_text[0] > 0.55) * 1, average='macro')

InvalidIndexError: (slice(None, None, None), 0)

In [996]:
len(y_cv)

8476

In [1001]:
y_cv

Unnamed: 0,c#,javascript,java,python,php,html,jquery,.net,asp.net,android,...,cookies,webpack,dynamic,android-fragments,for-loop,sharepoint,codeigniter,pyqt,jsf,windows-phone
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8471,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8472,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8474,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [1007]:
y_pred_cv_proba_text_df = pd.DataFrame(y_pred_cv_proba_text)
y_pred_cv_proba_text_df.columns = y_cv.columns
y_pred_cv_proba_text_df

  values = np.array([convert(v) for v in values])


ValueError: could not broadcast input array from shape (8476,4) into shape (8476,)

In [981]:
clf_text.classes_

[array([0., 1., 2., 3.]),
 array([0., 1.]),
 array([0., 1., 2., 3.]),
 array([0., 1., 2., 3.]),
 array([0., 1., 2., 4.]),
 array([0., 1., 2., 3.]),
 array([0., 1., 2.]),
 array([0., 1., 2.]),
 array([0., 1., 2., 3.]),
 array([0., 1., 2., 3.]),
 array([0., 1., 2., 3.]),
 array([0., 1., 2., 3., 4.]),
 array([0., 1.]),
 array([0., 1.]),
 array([0., 1., 2.]),
 array([0., 1., 2.]),
 array([0., 1.]),
 array([0., 1., 2., 3.]),
 array([0., 1.]),
 array([0., 1., 2.]),
 array([0., 1., 2., 3., 4., 5.]),
 array([0., 1., 2., 3.]),
 array([0., 1., 2., 3.]),
 array([0., 1., 2.]),
 array([0., 1., 2.]),
 array([0., 1., 2.]),
 array([0., 1., 2.]),
 array([0., 1.]),
 array([0., 1., 2., 3., 4.]),
 array([0., 1., 2., 4.]),
 array([0., 1.]),
 array([0., 1., 2.]),
 array([0., 1.]),
 array([0., 1.]),
 array([0., 1.]),
 array([0., 1., 2., 3., 4.]),
 array([0., 1., 2.]),
 array([0., 1., 2., 3.]),
 array([0., 1., 2., 3.]),
 array([0., 1.]),
 array([0., 1.]),
 array([0., 1.]),
 array([0., 1.]),
 array([0., 1.]),


In [985]:
clf_text.feature_names_in_

AttributeError: 'MultiOutputClassifier' object has no attribute 'feature_names_in_'

In [971]:
np.argmax(y_pred_cv_proba_text[:][100], axis=1)

array([0, 0, 0, ..., 0, 0, 0])

In [963]:
rounded_labels=np.argmax(y_pred_cv_proba_text[40], axis=1)
rounded_labels

array([0, 0, 0, ..., 0, 0, 0])

In [919]:
df_tfidf = pd.DataFrame(X_train_tfidf_text[0].T.todense(), index=vectorizer_text.get_feature_names_out(), columns=["TF-IDF"])
df_tfidf

Unnamed: 0,TF-IDF
__init__,0.0
a,0.0
absolute,0.0
accept,0.0
access,0.0
...,...
y,0.0
year,0.0
yes,0.0
yet,0.0


In [944]:
best_threshold = get_best_thresholds(y_cv.values, y_pred_cv_proba_text)

AttributeError: 'list' object has no attribute 'iloc'

## ANNEXE - RECHERCHES INTERNET ANCIENNE

In [None]:
nb_clf = MultinomialNB()
sgd = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=6, tol=None)
lr = LogisticRegression()
mn = MultinomialNB()

for classifier in [nb_clf, sgd, lr, mn]:
    clf = OneVsRestClassifier(classifier)
    clf.fit(x_train_tfidf, y_train_tfidf)
    y_pred = clf.predict(x_test_tfidf)
    print_score(y_pred, classifier)

In [None]:
num_classes = 100
grouped_tags = df.groupby("Tags").size().reset_index(name='count')
most_common_tags = grouped_tags.nlargest(num_classes, columns="count")
df.Tags_clean = df.Tags.apply(lambda tag : tag if tag in most_common_tags.Tags.values else None)
df.Tags_clean