# ***Stack Overflow Tag Prediction***
By Nakshatra Singh 

This notebook illustrates how to predict Stack Overflow tags from the text of a question.

###**1. Retrieve, Inspect and Preprocess Dataset** 

Let's download the dataset, which I have uploaded to Google Drive. It is a preprocessed version of this [dataset](https://www.kaggle.com/stackoverflow/stacksample). I have preprocessed the body column together with its tags and reduced the dataset's size so that it runs on any laptop without taking much time.

In [1]:
!gdown --id 1d_ZMfvXA4thwEJEnd3vttIQX5bekU-9z 

Downloading...
From: https://drive.google.com/uc?id=1d_ZMfvXA4thwEJEnd3vttIQX5bekU-9z
To: /content/stackoverflow.csv
47.0MB [00:00, 177MB/s]


We'll use `pandas` to parse the CSV file.

In [2]:
import pandas as pd
df = pd.read_csv('/content/stackoverflow.csv', index_col=0)

Let's take a look at the first few rows of the table just to see what's in there.   

In [3]:
df.head() 

Unnamed: 0,Text,Tags
2,aspnet site maps has anyone got experience cre...,"['sql', 'asp.net']"
4,adding scripting functionality to net applicat...,"['c#', '.net']"
5,should i use nested classes in this case i am ...,['c++']
6,homegrown consumption of web services i have b...,['.net']
8,automatically update version number i would li...,['c#']


What type do the values in the label column have?

In [4]:
df['Tags'].iloc[0]

"['sql', 'asp.net']"

As the output shows, each entry in the Tags column is stored as a string, but we want it to be a list. We'll use the *literal_eval* function from Python's `ast` (Abstract Syntax Trees) module to parse it.

In [5]:
import ast

ast.literal_eval(df['Tags'].iloc[0]) 

['sql', 'asp.net']
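As a minimal sketch of why *literal_eval* is a reasonable choice here: it only parses Python literals (strings, numbers, lists, dicts and so on) and refuses anything else, so text from a CSV file is never executed. The input strings below are made up for illustration.

```python
import ast

raw = "['sql', 'asp.net']"      # the form a Tags entry has right after read_csv
tags = ast.literal_eval(raw)     # parses Python literals only; never runs code

print(type(raw).__name__, '->', type(tags).__name__)
print(tags)

# Anything that is not a plain literal is rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print('rejected non-literal input:', rejected)
```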

As we can see, the string was converted to a list. We did it for the first data point only, so now we'll do it for the entire column.

In [6]:
df['Tags'] = df['Tags'].apply(ast.literal_eval)

Let's have a look at the Tags column and see if it's converted to a list.

In [7]:
df.head() 

Unnamed: 0,Text,Tags
2,aspnet site maps has anyone got experience cre...,"[sql, asp.net]"
4,adding scripting functionality to net applicat...,"[c#, .net]"
5,should i use nested classes in this case i am ...,[c++]
6,homegrown consumption of web services i have b...,[.net]
8,automatically update version number i would li...,[c#]


Now, we'll set up our training variables.

In [8]:
y = df["Tags"] # Target feature 
print(y)

2          [sql, asp.net]
4              [c#, .net]
5                   [c++]
6                  [.net]
8                    [c#]
                ...      
1262668             [c++]
1262834             [c++]
1262915          [python]
1263065          [python]
1263454             [c++]
Name: Tags, Length: 48976, dtype: object


We'll use scikit-learn's *MultiLabelBinarizer* to encode our multi-label targets as a binary indicator matrix with one column per tag (a multi-hot encoding).

In [9]:
from sklearn.preprocessing import MultiLabelBinarizer

multi_label = MultiLabelBinarizer()    # Calling the Binarizer
y = multi_label.fit_transform(df['Tags'])    # Encode the whole Tags column 
print(y)

[[0 0 1 ... 0 0 1]
 [1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


Let's look at the 20 tags that appear in the target column. `classes_` lists them in sorted order, which is also the column order of the encoded matrix.

In [10]:
multi_label.classes_

array(['.net', 'android', 'asp.net', 'c', 'c#', 'c++', 'css', 'html',
       'ios', 'iphone', 'java', 'javascript', 'jquery', 'mysql',
       'objective-c', 'php', 'python', 'ruby', 'ruby-on-rails', 'sql'],
      dtype=object)

We'll view it as a DataFrame to get a clearer idea of how the encoded rows map to tags.

In [11]:
pd.DataFrame(y, columns=multi_label.classes_).head()

Unnamed: 0,.net,android,asp.net,c,c#,c++,css,html,ios,iphone,java,javascript,jquery,mysql,objective-c,php,python,ruby,ruby-on-rails,sql
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
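To see the encoding on something small enough to check by eye, here is a sketch of a round trip through *MultiLabelBinarizer*. The three tag lists are a made-up mini version of the Tags column, and the names `toy_tags`, `mlb` and `encoded` are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up mini Tags column with three questions.
toy_tags = [['sql', 'asp.net'], ['c#', '.net'], ['c++']]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(toy_tags)
decoded = mlb.inverse_transform(encoded)

print(mlb.classes_)   # column order: the distinct tags, sorted
print(encoded)        # one row per question, one column per tag
print(decoded)        # back to tuples of tag names
```

Each row has a 1 in the columns of the tags attached to that question and 0 everywhere else, and `inverse_transform` recovers the tag names in column order.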


###**2. TFIDF Vectorizer**

We'll build a feature matrix using TF-IDF with *max_features = 50000*.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# With max_features=None (the default), the vocabulary grows to roughly 230K terms, which is
# very large for training a linear classifier like Logistic Regression and sometimes crashes
# the notebook, so we cap it at 50000 to save time.
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1,2), stop_words='english', max_features=50000)
X = tfidf.fit_transform(df['Text'])  # Fit on and transform the entire Text column

Let's take a look at the training variable shapes.

In [13]:
X.shape, y.shape

((48976, 50000), (48976, 20))
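The effect of `ngram_range=(1,2)` and `max_features` is easier to see on a handful of sentences. The sketch below uses three made-up documents as stand-ins for the Text column and a cap of 5 instead of 50000; which terms survive the cap depends on their corpus frequency, so treat the printed vocabulary as illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the Text column.
docs = [
    'how do i sort a list in python',
    'python list comprehension syntax',
    'sort an array in java',
]

full = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
capped = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english',
                         max_features=5)

X_full = full.fit_transform(docs)
X_capped = capped.fit_transform(docs)

# Unigrams and bigrams both become columns; the cap keeps only the most frequent terms.
has_bigrams = any(' ' in term for term in full.vocabulary_)
print(X_full.shape, X_capped.shape, has_bigrams)
print(sorted(capped.vocabulary_))
```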

###**3. Train Test Split**

Let's split our prepared variables into training and test sets using *train_test_split*.

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) 

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(39180, 50000) (9796, 50000) (39180, 20) (9796, 20)


###**4. Modelling**

Multi-label classification problems must be assessed with different performance measures than single-label classification problems. Two of the most common metrics are Hamming loss and Jaccard similarity. Hamming loss is the fraction of individual labels that are predicted incorrectly, averaged over all samples and labels; because it is a loss function, the perfect score is 0. Jaccard similarity, or the Jaccard index, is the size of the intersection of the predicted labels and the true labels divided by the size of their union, so the perfect score is 1. We'll use the Jaccard score here, averaged over samples.
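To make the two definitions concrete, here is a small worked example in plain Python on made-up tag sets for three questions. It computes the quantities by hand rather than through scikit-learn; the label-space size of 20 matches the number of tags in this notebook.

```python
# Made-up true and predicted tag sets for three questions.
y_true_sets = [{'sql', 'asp.net'}, {'c#', '.net'}, {'c++'}]
y_pred_sets = [{'sql'}, {'c#', '.net'}, {'c', 'c++'}]
n_labels = 20   # size of the label space

# Hamming loss: missed or wrongly added labels, as a fraction of all label slots.
wrong = sum(len(t ^ p) for t, p in zip(y_true_sets, y_pred_sets))
hamming = wrong / (len(y_true_sets) * n_labels)

# Samples-averaged Jaccard: |intersection| / |union| per question, then the mean.
per_sample = [len(t & p) / len(t | p) for t, p in zip(y_true_sets, y_pred_sets)]
jaccard = sum(per_sample) / len(per_sample)

print(f'Hamming loss: {hamming:.4f}')
print(f'Jaccard per question: {per_sample}')
print(f'Jaccard (samples average): {jaccard:.4f}')
```

Two of the three predictions are only half right, yet the Hamming loss is about 0.03 because most of the 20 label slots are correctly left empty. The Jaccard score of about 0.67 ignores those easy zeros, which is one reason to prefer it when each question carries only a few tags.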

In [15]:
from sklearn.metrics import jaccard_score 

def j_score(y_true, y_pred):
  ''' Helper function that returns the samples-averaged Jaccard score '''
  return jaccard_score(y_true, y_pred, average='samples')

def print_score(y_pred, clf):
  ''' Helper function to print the classifier name and its Jaccard score on the test set '''
  print('clf:', clf.__class__.__name__)
  print('Jaccard Score: {}'.format(j_score(y_test, y_pred)))
  print('------------------------------')

We'll use Logistic Regression with OneVsRestClassifier here. You can also try other algorithms such as LinearSVC, SGDClassifier or even Naive Bayes.

In [16]:
from sklearn.linear_model import LogisticRegression
# max_iter=100000: a high cap so the lbfgs solver can converge before hitting the iteration limit
# n_jobs=-1: use all available cores
lr = LogisticRegression(solver='lbfgs', max_iter=100000, C=1, n_jobs=-1)

In [17]:
from sklearn.multiclass import OneVsRestClassifier

# Training the classifier with OneVsRest for Multilabel Classification
clf = OneVsRestClassifier(lr)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print_score(y_pred, clf) 

clf: OneVsRestClassifier
Jaccard Score: 0.47421396488362605
------------------------------


###**5. Testing**

Let's test the classifier on some real examples.

In [18]:
# Pass x in a list
x = ['How do I use matplotlib']

# TFIDF vectorize the list
xt = tfidf.transform(x) 

# Predict using classifier predict
clf.predict(xt) 

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]])

For a cleaner output, we'll map the predicted indicator array back to its tag names and print them.

In [19]:
multi_label.inverse_transform(clf.predict(xt)) 

[('python',)]
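`predict` returns a hard yes/no per tag, so its result can come back empty when no tag clears the classifier's decision threshold. One possible alternative, sketched here on a made-up six-question corpus rather than the real dataset, is to rank tags by `predict_proba`. All `toy_*` names are illustrative and the probabilities from such a tiny model are not meaningful in themselves.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up mini corpus standing in for df['Text'] and df['Tags'].
toy_texts = [
    'plot a chart with matplotlib in python',
    'python pandas dataframe groupby',
    'matplotlib axis labels python',
    'java arraylist sort',
    'java null pointer exception',
    'spring boot java bean',
]
toy_tags = [['python'], ['python'], ['python'], ['java'], ['java'], ['java']]

toy_tfidf = TfidfVectorizer()
toy_mlb = MultiLabelBinarizer()
toy_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
toy_clf.fit(toy_tfidf.fit_transform(toy_texts), toy_mlb.fit_transform(toy_tags))

# One probability per tag, in the same order as toy_mlb.classes_.
probs = toy_clf.predict_proba(toy_tfidf.transform(['How do I use matplotlib']))[0]
ranked = sorted(zip(toy_mlb.classes_, probs), key=lambda pair: pair[1], reverse=True)
for tag, p in ranked:
    print(f'{tag}: {p:.3f}')
```

Taking the first few entries of `ranked` gives a top-k suggestion list even when no single tag is confident enough for `predict` to return it.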

###**6. Conclusion**

As you saw, the model predicts a sensible tag for the example above, although a Jaccard score of about 0.47 on the test set leaves room for improvement. You can try *other ML algorithms*, which may improve performance, and use GridSearchCV to find the best hyperparameters for the classifier you are using. In short, we learned how to train a classifier for multi-label datasets using OneVsRestClassifier.
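As a rough sketch of what a GridSearchCV run could look like for this setup, the example below tunes `C` of the wrapped Logistic Regression and scores candidates with the samples-averaged Jaccard score. It runs on synthetic data from `make_multilabel_classification` as a stand-in for the TF-IDF matrix and binarized tags; the grid values and fold count are arbitrary assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for X (features) and y (binary indicator matrix of tags).
X_toy, y_toy = make_multilabel_classification(
    n_samples=200, n_features=30, n_classes=5,
    allow_unlabeled=False, random_state=0)

search = GridSearchCV(
    OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=1000)),
    # The 'estimator__' prefix reaches the LogisticRegression inside OneVsRestClassifier.
    param_grid={'estimator__C': [0.1, 1, 10]},
    scoring='jaccard_samples',
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

On the real data, `X_train` and `y_train` would take the place of the synthetic arrays, and the search will take noticeably longer because every candidate is refit once per fold.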

`NOTE:` The dataset has been reduced in size so that we spend less time on training and more on understanding the fundamentals. To get state-of-the-art performance, we would need to use the whole dataset with optimized hyperparameters.