# Multi-Label Text Classification for Stack Overflow Tag Prediction (On Reduced Data)

My text cleaning package: `https://github.com/mikelakoju/preprocess_V2_NLP_mikelakoju`

The original data can be found on Kaggle. The text cleaning package can be used for preprocessing.


In [1]:
import pandas as pd
import numpy as np


In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

In [52]:
df = pd.read_csv("stackoverflow.csv", index_col=0)

In [53]:
df.head()

Unnamed: 0,Text,Tags
2,aspnet site maps has anyone got experience cre...,"['sql', 'asp.net']"
4,adding scripting functionality to net applicat...,"['c#', '.net']"
5,should i use nested classes in this case i am ...,['c++']
6,homegrown consumption of web services i have b...,['.net']
8,automatically update version number i would li...,['c#']


In [8]:
type(df.iloc[0]['Tags'])

str

In [7]:
df.iloc[0]['Tags']

"['sql', 'asp.net']"

### **NOTE:** The tags are stored as a `string`, so we need to convert them into a `list`. We will use `ast.literal_eval` from the standard-library `ast` module for this.

In [9]:
import ast

In [10]:
ast.literal_eval(df.iloc[0]['Tags'])

['sql', 'asp.net']

In [12]:
# Apply the conversion to the entire Tags column
df['Tags'] = df['Tags'].apply(ast.literal_eval)

In [13]:
type(df.iloc[0]['Tags'])

list
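
The conversion above assumes every row holds a well-formed list literal. As a minimal sketch, assuming some rows could be malformed, a small wrapper can fall back to an empty list instead of raising. The helper name `parse_tags` and the broken example string are hypothetical and not part of the original notebook.

```python
import ast

def parse_tags(raw):
    """Parse a stringified list of tags; return [] if the string is malformed."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # literal_eval only accepts Python literals; anything else ends up here
        return []
    # Keep only list/tuple results, normalised to a list
    return list(value) if isinstance(value, (list, tuple)) else []

parsed = parse_tags("['sql', 'asp.net']")   # well-formed string
broken = parse_tags("['sql', 'asp.net'")    # missing closing bracket (invented example)
print(parsed, broken)
```

With such a helper, the column conversion would read `df['Tags'].apply(parse_tags)`. Whether silently dropping bad rows is acceptable depends on the data.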

In [15]:
df.iloc[0]['Tags']

['sql', 'asp.net']

In [16]:
df.head()

Unnamed: 0,Text,Tags
2,aspnet site maps has anyone got experience cre...,"[sql, asp.net]"
4,adding scripting functionality to net applicat...,"[c#, .net]"
5,should i use nested classes in this case i am ...,[c++]
6,homegrown consumption of web services i have b...,[.net]
8,automatically update version number i would li...,[c#]


### Multilabel

In [17]:
multilabel = MultiLabelBinarizer()

# This converts each list of tags into a multi-hot (binary indicator) row, one column per tag
y = multilabel.fit_transform(df['Tags'])

In [18]:
y

array([[0, 0, 1, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
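
To make the encoding concrete, here is a small sketch on three toy tag lists (taken from the first rows shown earlier). The variable names `toy_tags`, `toy_mlb` and `toy_y` are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three tag lists, mirroring the first rows of the dataframe
toy_tags = [['sql', 'asp.net'], ['c#', '.net'], ['c++']]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_tags)

# Classes are sorted, and each row has a 1 in the column of every tag it carries
print(list(toy_mlb.classes_))  # ['.net', 'asp.net', 'c#', 'c++', 'sql']
print(toy_y.tolist())          # [[0, 1, 0, 0, 1], [1, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
```

Unlike one-hot encoding, a row may contain several 1s, which is what makes the target multi-label.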

#### To get the classes.


In [19]:
classes = multilabel.classes_

In [20]:
classes

array(['.net', 'android', 'asp.net', 'c', 'c#', 'c++', 'css', 'html',
       'ios', 'iphone', 'java', 'javascript', 'jquery', 'mysql',
       'objective-c', 'php', 'python', 'ruby', 'ruby-on-rails', 'sql'],
      dtype=object)

In [21]:
len(classes)

20

In [22]:
pd.DataFrame(y, columns=classes)

Unnamed: 0,.net,android,asp.net,c,c#,c++,css,html,ios,iphone,java,javascript,jquery,mysql,objective-c,php,python,ruby,ruby-on-rails,sql
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48971,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
48972,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
48973,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
48974,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


In [23]:
tfidf = TfidfVectorizer(analyzer='word', max_features=1000, ngram_range=(1, 3), stop_words='english')

In [24]:
X = tfidf.fit_transform(df['Text'])

In [25]:
X

<48976x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 1239765 stored elements in Compressed Sparse Row format>
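
As a rough illustration of what these vectorizer settings do, the sketch below fits the same configuration on two invented questions. The documents and the names `docs`, `toy_tfidf` and `X_toy` are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy questions, invented for illustration
docs = [
    "how to parse json in python",
    "python list comprehension example",
]

toy_tfidf = TfidfVectorizer(analyzer='word', max_features=1000,
                            ngram_range=(1, 3), stop_words='english')
X_toy = toy_tfidf.fit_transform(docs)

# Stop words such as "how", "to" and "in" are removed before n-grams are built,
# so "parse json" appears as a bigram feature.
print(sorted(toy_tfidf.vocabulary_))
print(X_toy.shape)
```

`max_features=1000` only caps the vocabulary; on a corpus this small the cap is never reached, whereas on the full dataset it keeps the 1000 most frequent n-grams.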

In [27]:
# tfidf.vocabulary_

In [28]:
X.shape, y.shape

((48976, 1000), (48976, 20))

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [30]:
X_train.shape, X_test.shape

((39180, 1000), (9796, 1000))

# Model Building
### Logistic Regression Model

> ### To evaluate a multi-label classification problem, we will define a `Jaccard score` function:
```
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. Source: Wikipedia
```
**How to Calculate the Jaccard Index**

The formula to find the Index is:
Jaccard Index = (the number in both sets) / (the number in either set) * 100

The same formula in notation is:
J(X,Y) = |X∩Y| / |X∪Y|

In steps:

1. Count the number of members shared by both sets.
2. Count the total number of distinct members across both sets (shared and unshared).
3. Divide the number of shared members (1) by the total number of members (2).
4. Multiply the number you found in (3) by 100.

https://www.statisticshowto.com/jaccard-index/
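
The steps above can be sketched on a tiny example, first with Python sets and then with binary label rows. The toy values are invented. The comparison with `sklearn.metrics.jaccard_score(average="samples")` is included only as a cross-check, not as the method used in this notebook.

```python
import numpy as np
from sklearn.metrics import jaccard_score

# Set view: true tags vs. predicted tags for one question
true_tags = {"java", "python"}
pred_tags = {"python", "sql"}
set_jaccard = len(true_tags & pred_tags) / len(true_tags | pred_tags) * 100

# Vector view: each row is one sample, each column one tag
y_true_toy = np.array([[1, 1, 0], [1, 0, 0]])
y_pred_toy = np.array([[0, 1, 1], [1, 0, 0]])

# Element-wise minimum counts the intersection, maximum counts the union
per_row = (np.minimum(y_true_toy, y_pred_toy).sum(axis=1)
           / np.maximum(y_true_toy, y_pred_toy).sum(axis=1))
manual = per_row.mean() * 100

# scikit-learn's sample-averaged Jaccard score for the same rows
library = jaccard_score(y_true_toy, y_pred_toy, average="samples") * 100

print(round(set_jaccard, 2), round(manual, 2), round(library, 2))
```

The first row shares one of three distinct tags (1/3) and the second row matches exactly (1), so the mean is 2/3, or about 66.67%.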

In [32]:
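def j_score(y_true, y_pred):
    # Per-sample Jaccard index: |intersection| / |union| of the binary label rows
    jaccard = np.minimum(y_true, y_pred).sum(axis=1) / np.maximum(y_true, y_pred).sum(axis=1)
    # Average over all samples and express the result as a percentage
    return jaccard.mean() * 100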
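
One edge case worth noting: if a row has neither true nor predicted tags, the union is zero and the division produces NaN. The variant below is a hedged sketch that treats such rows as a perfect match. Both the name `j_score_safe` and that convention are assumptions, not part of the original notebook.

```python
import numpy as np

def j_score_safe(y_true, y_pred):
    """Mean per-sample Jaccard index in percent; empty-vs-empty rows count as 1.0 (an assumption)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    inter = np.minimum(y_true, y_pred).sum(axis=1).astype(float)
    union = np.maximum(y_true, y_pred).sum(axis=1).astype(float)
    # Start from 1.0 everywhere and only divide where the union is non-empty
    scores = np.ones_like(union)
    np.divide(inter, union, out=scores, where=union > 0)
    return scores.mean() * 100

# First row: nothing true, nothing predicted; second row: one of two tags matches
safe_score = j_score_safe([[0, 0], [1, 0]], [[0, 0], [1, 1]])
print(safe_score)
```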

In [33]:
from sklearn.multiclass import OneVsRestClassifier

In [31]:
lr = LogisticRegression(solver='lbfgs')

In [34]:
clf = OneVsRestClassifier(lr)
clf.fit(X_train, y_train)

OneVsRestClassifier(estimator=LogisticRegression())
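
To illustrate what `OneVsRestClassifier` does with a multi-label target, the sketch below fits it on synthetic data from `make_multilabel_classification`. The dataset, its sizes, and `max_iter=1000` are assumptions for the example. The point is only that one binary classifier is fitted per label column.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: 200 samples, 20 features, 4 labels
X_demo, y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=1000))
ovr.fit(X_demo, y_demo)

# One fitted binary estimator per label column
n_estimators = len(ovr.estimators_)
# Predictions come back as an indicator matrix with one column per label
pred_shape = ovr.predict(X_demo).shape
print(n_estimators, pred_shape)
```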

In [35]:
y_pred = clf.predict(X_test)

In [36]:
j_score(y_test, y_pred)

49.12668436096366

# SVM

In [37]:
from sklearn.svm import LinearSVC

In [38]:
svm = LinearSVC(C=1.5, penalty='l1', dual=False)
clf = OneVsRestClassifier(svm)

clf.fit(X_train, y_train)

OneVsRestClassifier(estimator=LinearSVC(C=1.5, dual=False, penalty='l1'))

In [39]:
y_pred = clf.predict(X_test)

In [40]:
j_score(y_test, y_pred)

53.3787600381108

#### Predicting on data

In [41]:
x = ['how to write ml code in python and java i have data but do not know how to do it']

In [42]:
xt = tfidf.transform(x)

In [43]:
clf.predict(xt)

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]])

#### To get the predicted tags for this input

In [45]:
multilabel.inverse_transform(clf.predict(xt))

[('java', 'python')]
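
When the classifier predicts no tag at all for a question, ranking the tags by score can still give a usable suggestion. The helper `top_k_tags` below is a hypothetical sketch. It assumes a score matrix such as the one returned by `clf.decision_function(xt)`, which `OneVsRestClassifier` exposes when the base estimator implements it. The class names and scores shown are invented.

```python
import numpy as np

def top_k_tags(scores, classes, k=2):
    """Return the k highest-scoring class names for each row of `scores`."""
    # Sort column indices by descending score and keep the first k per row
    order = np.argsort(scores, axis=1)[:, ::-1][:, :k]
    return [tuple(str(classes[i]) for i in row) for row in order]

# Invented scores for one sample over four classes
toy_classes = np.array(['c#', 'java', 'python', 'sql'])
toy_scores = np.array([[-1.2, 0.3, 0.8, -0.1]])

ranked = top_k_tags(toy_scores, toy_classes, k=2)
print(ranked)  # [('python', 'java')]
```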

# Save model

In [46]:
import pickle

In [49]:
with open('svm_multilabel_v1.pkl', 'wb') as f:
    pickle.dump(clf, f)
with open('tfidf-multilabel_v1.pkl', 'wb') as f:
    pickle.dump(tfidf, f)  # save the fitted vectorizer, not the classifier
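
To use a saved model later, the classifier, the fitted vectorizer, and the label binarizer all need to be available. The sketch below shows a full save-and-load round trip on toy data in a temporary directory. The texts, tags, and file names are invented, and saving the `MultiLabelBinarizer` is an addition that the original notebook does not do.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only
texts = ["python pandas dataframe", "java spring boot", "python flask api",
         "java maven build", "python java interop"]
tags = [['python'], ['java'], ['python'], ['java'], ['java', 'python']]

toy_tfidf = TfidfVectorizer()
toy_mlb = MultiLabelBinarizer()
toy_clf = OneVsRestClassifier(LinearSVC())
toy_clf.fit(toy_tfidf.fit_transform(texts), toy_mlb.fit_transform(tags))

# Save each fitted object to its own file (hypothetical file names)
tmp_dir = tempfile.mkdtemp()
for name, obj in [('clf.pkl', toy_clf), ('tfidf.pkl', toy_tfidf), ('mlb.pkl', toy_mlb)]:
    with open(os.path.join(tmp_dir, name), 'wb') as f:
        pickle.dump(obj, f)

def load(name):
    """Load one pickled object from the temporary directory."""
    with open(os.path.join(tmp_dir, name), 'rb') as f:
        return pickle.load(f)

loaded_clf, loaded_tfidf, loaded_mlb = load('clf.pkl'), load('tfidf.pkl'), load('mlb.pkl')

# The reloaded pipeline should reproduce the original predictions
question = ['python flask api question']
before = toy_clf.predict(toy_tfidf.transform(question))
after = loaded_clf.predict(loaded_tfidf.transform(question))
restored_tags = loaded_mlb.inverse_transform(after)
print(restored_tags)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.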