### Multi-Label Text Classification for Stack Overflow Tag Prediction

Dataset and library sources:

Text cleaning package: https://github.com/laxmimerit/preprocess_kgptalkie

Dataset: https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/stackoverflow.csv

TIP:
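Multi-class classification is a task with more than two classes in which the labels are mutually exclusive: each sample is assigned exactly one label. Tag prediction is different. A Stack Overflow question can carry several tags at once, so this notebook treats it as a multi-label problem, in which each sample may be assigned any subset of the labels.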
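To make the difference between multi-class and multi-label targets concrete, here is a minimal sketch. The toy labels and the variable names (`multiclass_labels`, `multilabel_labels`, `mlb`) are hypothetical and are not taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Hypothetical toy labels, not taken from the dataset.
multiclass_labels = ["python", "java", "python"]                   # exactly one label per sample
multilabel_labels = [["sql", "asp.net"], ["c#"], ["c#", ".net"]]   # any number of labels per sample

# Multi-class target: a single class id per sample.
y_multiclass = LabelEncoder().fit_transform(multiclass_labels)

# Multi-label target: one 0/1 column per tag; several columns can be 1 in the same row.
mlb = MultiLabelBinarizer()
y_multilabel = mlb.fit_transform(multilabel_labels)

print(y_multiclass)    # 1-D array, one entry per sample
print(mlb.classes_)    # column order of the indicator matrix
print(y_multilabel)    # 2-D indicator matrix, one row per sample
```

The multi-label form is the one used for the `Tags` column later in this notebook.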

Libraries

In [170]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer


In [171]:
df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/stackoverflow.csv', index_col = 0)
df.head()

Unnamed: 0,Text,Tags
2,aspnet site maps has anyone got experience cre...,"['sql', 'asp.net']"
4,adding scripting functionality to net applicat...,"['c#', '.net']"
5,should i use nested classes in this case i am ...,['c++']
6,homegrown consumption of web services i have b...,['.net']
8,automatically update version number i would li...,['c#']


In [172]:
type(df.iloc[0]['Text'])

str

In [173]:
type(df.iloc[0]['Tags'])

str

In [174]:
df.iloc[0]['Text']

'aspnet site maps has anyone got experience creating sqlbased aspnet sitemap providersi have got the default xml file websitemap working properly with my menu and sitemappath controls but i will need a way for the users of my site to create and modify pages dynamicallyi need to tie page viewing permissions into the standard aspnet membership system as well'

In [175]:
df.iloc[0]['Tags']

"['sql', 'asp.net']"

In [176]:
import ast  # abstract syntax tree module; its literal_eval function safely parses Python literals from strings

In [177]:
ast.literal_eval(df.iloc[0]['Tags'])  # safely evaluates a string containing a Python literal and returns the resulting value

['sql', 'asp.net']
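As a small illustration of why `ast.literal_eval` is a reasonable choice here rather than `eval`: it accepts only literals (strings, numbers, lists, tuples, dicts and similar) and refuses anything that would execute code. The `raw` string below mirrors one cell of the `Tags` column; the rejected expression is a made-up example.

```python
import ast

raw = "['sql', 'asp.net']"        # same shape as one cell of the Tags column
tags = ast.literal_eval(raw)      # parsed into a real Python list
print(type(tags), tags)

# A function call is not a literal, so it is rejected instead of being executed.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print("expression rejected:", rejected)
```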

Cleaning Text

In [178]:
df['Tags'] = df['Tags'].apply(ast.literal_eval)  # convert each string such as "['sql', 'asp.net']" into a list
df.head()

Unnamed: 0,Text,Tags
2,aspnet site maps has anyone got experience cre...,"[sql, asp.net]"
4,adding scripting functionality to net applicat...,"[c#, .net]"
5,should i use nested classes in this case i am ...,[c++]
6,homegrown consumption of web services i have b...,[.net]
8,automatically update version number i would li...,[c#]


In [179]:
multilabel=MultiLabelBinarizer()
y=multilabel.fit_transform(df['Tags'])


In [180]:
classes=multilabel.classes_
classes

array(['.net', 'android', 'asp.net', 'c', 'c#', 'c++', 'css', 'html',
       'ios', 'iphone', 'java', 'javascript', 'jquery', 'mysql',
       'objective-c', 'php', 'python', 'ruby', 'ruby-on-rails', 'sql'],
      dtype=object)

In [181]:
pd.DataFrame(y, columns=classes)

Unnamed: 0,.net,android,asp.net,c,c#,c++,css,html,ios,iphone,java,javascript,jquery,mysql,objective-c,php,python,ruby,ruby-on-rails,sql
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48971,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
48972,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
48973,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
48974,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


In [182]:
tfidf=TfidfVectorizer(analyzer='word', max_features=1500, ngram_range=(1,3), stop_words='english')

In [183]:
X=tfidf.fit_transform(df['Text'])

In [184]:
X.shape, y.shape

((48976, 1500), (48976, 20))

In [185]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 21)

In [186]:
X_train.shape, X_test.shape

((39180, 1500), (9796, 1500))

Model Building

In [187]:
from sklearn.multiclass import OneVsRestClassifier

TIP: The j_score function defined below computes the Jaccard score between the true and predicted label sets. The Jaccard score measures the similarity of two sets: the size of their intersection divided by the size of their union.

j_score takes two arguments, y_true and y_pred, both binary indicator matrices with one row per sample and one column per label.

For 0/1 rows, the element-wise minimum marks the labels in the intersection and the element-wise maximum marks the labels in the union. The function sums each across the label axis, divides the intersection count by the union count for each sample, and returns the mean over all samples as a percentage.

In [188]:
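def j_score(y_true, y_pred):
    # Per-sample Jaccard: |intersection| / |union| of the 0/1 label rows
    jaccard = np.minimum(y_true, y_pred).sum(axis = 1)/np.maximum(y_true, y_pred).sum(axis = 1)
    # Mean over samples, expressed as a percentage
    return jaccard.mean()*100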
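A small worked example of the score on toy data may help. The arrays `y_true_toy` and `y_pred_toy` are hypothetical. The sketch also compares the result with scikit-learn's `jaccard_score` using `average="samples"`, which computes the same per-sample average for indicator matrices when every row has a non-empty union.

```python
import numpy as np
from sklearn.metrics import jaccard_score

def j_score(y_true, y_pred):
    # Per-sample |intersection| / |union| on 0/1 rows, averaged, as a percentage.
    jaccard = np.minimum(y_true, y_pred).sum(axis=1) / np.maximum(y_true, y_pred).sum(axis=1)
    return jaccard.mean() * 100

# Hypothetical toy labels: 2 samples, 3 possible tags.
y_true_toy = np.array([[1, 0, 1],
                       [0, 1, 0]])
y_pred_toy = np.array([[1, 0, 1],    # exact match: 2 / 2 = 1.0
                       [0, 1, 1]])   # one extra tag: 1 / 2 = 0.5

manual = j_score(y_true_toy, y_pred_toy)
reference = jaccard_score(y_true_toy, y_pred_toy, average="samples") * 100

print(manual)      # 75.0, the mean of 1.0 and 0.5 as a percentage
print(reference)
```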

In [189]:
lr = LogisticRegression(solver='lbfgs')


In [190]:
clf = OneVsRestClassifier(lr)
clf.fit(X_train, y_train)
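When `OneVsRestClassifier` is given an indicator matrix as the target, it fits one independent binary classifier per label column. The sketch below shows this on hypothetical toy data (`X_toy`, `y_toy` and `ovr` are illustrative names, not part of the notebook).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy data: 6 samples, 2 features, 3 label columns.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                  [0.0, 0.2], [0.9, 0.1], [0.2, 0.9]])
y_toy = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1],
                  [0, 0, 0], [0, 1, 1], [1, 0, 0]])

ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs"))
ovr.fit(X_toy, y_toy)

n_models = len(ovr.estimators_)      # one fitted binary model per label column
pred_shape = ovr.predict(X_toy).shape  # same (n_samples, n_labels) layout as y_toy

print(n_models)
print(pred_shape)
```

Because each label gets its own model, a sample can be predicted to have zero, one or several tags.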

In [191]:
y_pred = clf.predict(X_test)

In [192]:
j_score(y_test, y_pred)

50.68973730774465

SVM Model

In [193]:
from sklearn.svm import LinearSVC

In [194]:
svm = LinearSVC(C = 1.5, penalty='l1', dual = False)
clf = OneVsRestClassifier(svm)
clf.fit(X_train, y_train)

In [195]:
y_pred = clf.predict(X_test)
j_score(y_test, y_pred)

56.889546753777054

Testing Model

In [196]:
sample_text=['aspnet site maps has anyone got experience creating sql based aspnet sitemap providersi have got the default xml file websitemap working properly with my menu and sitemappath controls but i will need a way for the users of my site to create and modify pages dynamicallyi need to tie page viewing permissions into the standard aspnet membership system as well']

In [197]:
xt=tfidf.transform(sample_text)  # reuse the fitted vectorizer; do not refit on new text

In [198]:
clf.predict(xt)

array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [199]:
multilabel.inverse_transform(clf.predict(xt))

[('asp.net',)]

In [200]:
import pickle

In [201]:
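with open('svm_multilabel.pkl', 'wb') as f:
    pickle.dump(clf, f)  # fitted one-vs-rest LinearSVC
with open('tfidf-multilabel.pkl', 'wb') as f:
    pickle.dump(tfidf, f)  # fitted TF-IDF vectorizer, needed to transform new text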
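To reuse a saved model later, the fitted vectorizer and the fitted classifier both have to be loaded, and turning predictions back into tag names also needs the fitted `MultiLabelBinarizer`. The following is a minimal round-trip sketch under stated assumptions: it trains on a tiny hypothetical corpus rather than the Stack Overflow data, and it bundles all three objects into a single file named `toy_tagger.pkl` (an illustrative choice, not the notebook's file layout).

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the Stack Overflow data.
texts = ["select rows from sql table", "python list comprehension",
         "sql join in python script", "python dict lookup"]
tags = [["sql"], ["python"], ["python", "sql"], ["python"]]

toy_binarizer = MultiLabelBinarizer()
toy_y = toy_binarizer.fit_transform(tags)
toy_tfidf = TfidfVectorizer()
toy_clf = OneVsRestClassifier(LinearSVC()).fit(toy_tfidf.fit_transform(texts), toy_y)

# Save and reload everything needed for inference.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_tagger.pkl")
    with open(path, "wb") as f:
        pickle.dump({"tfidf": toy_tfidf, "clf": toy_clf, "binarizer": toy_binarizer}, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

new_text = ["sql query in python"]
original = toy_clf.predict(toy_tfidf.transform(new_text))
restored = loaded["clf"].predict(loaded["tfidf"].transform(new_text))

# The reloaded objects should reproduce the in-memory predictions.
print((original == restored).all())
print(loaded["binarizer"].inverse_transform(restored))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.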