## Stack Overflow Tag Prediction

<img src="1.png" style="height:280px">

### 1. Real-World Use Case:
We are given information about a question, namely its Title and Body, and our job is to predict suitable tags. Good tags matter because they route the question to the right developer community, which helps it get answered quickly.

### 2. Mapping to Machine Learning Techniques:
In this use case we predict the tags for a given question. There are two input features, “Title” and “Body” (x1, x2), and one or more target labels (y1, y2, y3, …, yn). This is a classification problem. With two categories in the target variable Y it would be binary classification, and with more than two mutually exclusive categories it would be multi-class classification. Here, however, each question can carry several tags at once, so the task is neither binary nor multi-class: it is a multi-label classification problem.
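
To make the distinction concrete, here is a minimal sketch in plain Python. The three sample questions and the helper name `to_indicator_rows` are illustrative assumptions, not part of the dataset. A multi-label target is a binary row with one column per tag, and any number of columns in a row can be 1.

```python
# Hypothetical tag sets for three questions (illustrative only).
question_tags = [
    {"c#", "file"},
    {"java"},
    {"html", "java", "css"},
]

def to_indicator_rows(tag_sets):
    """Turn a list of tag sets into (sorted label list, binary rows)."""
    labels = sorted(set().union(*tag_sets))
    rows = [[1 if label in tags else 0 for label in labels] for tags in tag_sets]
    return labels, rows

labels, y = to_indicator_rows(question_tags)
print(labels)
for row in y:
    print(row)
```

The first and third rows contain more than one 1, which is exactly what a single multi-class label cannot express.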

### 3. Data Collection:
Go to https://data.stackexchange.com/stackoverflow/query/edit/1186275 to open the Stack Exchange Data Explorer for Stack Overflow, where you can query the data you want. The right-side pane shows the database schema (table names such as “Posts” and their attributes).

#### Query: SELECT TOP 5000 Id, Title, Body , Tags from Posts WHERE Title IS NOT NULL AND Body IS NOT NULL ORDER BY RAND()

- Once the query has run, you can download the result as a .csv file.

### a. Load The Dataset And Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter('ignore')

In [2]:
df = pd.read_csv("QueryResults.csv",nrows=5000)
df.head()

Unnamed: 0,Id,Title,Body,Tags
0,51783673,Lock file for application,<p>I working on a service. This service will r...,<c#><file><locking>
1,51783676,C++ Console : Parsing METAR data,<p>I am working on my first web app (weather v...,<c++><console><dev-c++>
2,51783677,How to connect WIX template with 3rd party RES...,<p>I created a simple web site using WIX platf...,<node.js><html><velo>
3,51783681,Need Help implementing FileNameFilter java,<p>I want to create an application which will ...,<java><swing><tree><jtree>
4,51783689,C # Excel Interop - adding the function / modu...,<p>I need to add a some functionality to new w...,<c#><vba><excel-interop><worksheet>


In [3]:
df.shape

(5000, 4)

In [4]:
df = df[['Title','Body','Tags']]

### b. Checking for Nulls and Duplicates

In [5]:
#check for Missing values
df.isnull().sum()

Title    0
Body     0
Tags     0
dtype: int64

In [6]:
df.duplicated().any()

False

In [7]:
df.shape

(5000, 3)

### Text Preprocessing

Title, Body, and Tags contain HTML tags and special characters. We need to be careful here, because tags such as “.net” and “C#” rely on punctuation, so Tags get their own, separate cleaning function.

In [8]:
from bs4 import BeautifulSoup
import re
def strip_html_tags(text):
    """remove html tags from text"""
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text(separator=" ")
    #Replace every run of characters other than letters, digits, "#", "+", "." and "-" with a space (keeps keywords like c#, c++, .net)
    stripped_text = re.sub(r'[^A-Za-z0-9#+.\-]+',' ',stripped_text)
    return stripped_text

def Tags_cleaning(text):
  text = text.replace('><'," ")
  text = text.replace('<', "")
  text = text.replace('>', "")
  return text

In [9]:
df['Title'] = df['Title'].apply(lambda text: strip_html_tags(text))
df['Body'] = df['Body'].apply(lambda text: strip_html_tags(text))
df['Tags'] = df['Tags'].apply(lambda text: Tags_cleaning(text))
df.head()

Unnamed: 0,Title,Body,Tags
0,Lock file for application,I working on a service. This service will run ...,c# file locking
1,C++ Console Parsing METAR data,I am working on my first web app weather visua...,c++ console dev-c++
2,How to connect WIX template with 3rd party RES...,I created a simple web site using WIX platform...,node.js html velo
3,Need Help implementing FileNameFilter java,I want to create an application which will sho...,java swing tree jtree
4,C # Excel Interop - adding the function module...,I need to add a some functionality to new work...,c# vba excel-interop worksheet
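
As a quick sanity check, the two cleaning rules can be tried on single strings taken from the first rows of the dataset. This is a sketch using only the standard library: `tags_cleaning` mirrors `Tags_cleaning` above, and `keep_tag_chars` (a hypothetical name) isolates just the regex step of `strip_html_tags`, without the BeautifulSoup part.

```python
import re

def tags_cleaning(text):
    """Same replacements as Tags_cleaning above: '<a><b>' -> 'a b'."""
    return text.replace("><", " ").replace("<", "").replace(">", "")

def keep_tag_chars(text):
    """Regex step of strip_html_tags: keep letters, digits, '#', '+', '.', '-'."""
    return re.sub(r"[^A-Za-z0-9#+.\-]+", " ", text)

cleaned_tags = tags_cleaning("<c#><file><locking>")
cleaned_title = keep_tag_chars("C++ Console : Parsing METAR data")
print(cleaned_tags)
print(cleaned_title)
```

Both results match the cleaned rows shown in the table: the tag string keeps `c#` intact, and the title loses the colon but keeps `C++`.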


In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = CountVectorizer(tokenizer = lambda x: x.split())
tag_dtm = vectorizer.fit_transform(df['Tags']).toarray()
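# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
tags = list(vectorizer.get_feature_names_out())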

In [11]:
tags[:10]

['.htaccess',
 '.net',
 '.net-2.0',
 '.net-3.5',
 '.net-4.5',
 '.net-assembly',
 '.net-core',
 '.net-standard-2.0',
 '.nettiers',
 '32-bit']

In [12]:
print(tag_dtm.shape)
print("No.of Unique tags : ",len(tags))

(5000, 3956)
No.of Unique tags :  3956
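
The custom `tokenizer = lambda x: x.split()` matters here. The sketch below applies scikit-learn's documented default `token_pattern` with plain `re` and compares it to whitespace splitting. The sample tag string is an illustrative assumption.

```python
import re

# scikit-learn's documented default token_pattern for CountVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

tag_string = "c# .net c++ node.js"

# The default pattern keeps only runs of two or more word characters.
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, tag_string)
# Whitespace splitting keeps every tag exactly as written.
whitespace_tokens = tag_string.split()

print(default_tokens)
print(whitespace_tokens)
```

With the default pattern, `c#` and `c++` disappear entirely and `node.js` is split in two, so the tag vocabulary would be wrong. Splitting on whitespace avoids this.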


In [13]:
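# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
tags = pd.DataFrame(tag_dtm,columns =vectorizer.get_feature_names_out())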
tags.head(10)

Unnamed: 0,.htaccess,.net,.net-2.0,.net-3.5,.net-4.5,.net-assembly,.net-core,.net-standard-2.0,.nettiers,32-bit,...,zend-view,zeroconf,zigbee,zip,zipfile,zipkin,zlib,zooming,zurb-foundation,zxing
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
freqs = tag_dtm.sum(axis=0)
result = dict(zip(tags, freqs))
result = pd.DataFrame(list(result.items()),columns=['Tag','Count'])
result.head()

Unnamed: 0,Tag,Count
0,.htaccess,11
1,.net,123
2,.net-2.0,3
3,.net-3.5,8
4,.net-4.5,3


In [15]:
tag_df_sorted = result.sort_values(['Count'], ascending=False)
#tag_counts = tag_df_sorted['Count'].values

In [16]:
tag_df_sorted.head(10)

Unnamed: 0,Tag,Count
1723,javascript,512
1706,java,435
517,c#,397
2707,python,344
2537,php,336
106,android,289
1781,jquery,252
1514,html,248
520,c++,179
784,css,173


In [17]:
# binary=True makes the vectorizer emit 0/1 indicators instead of counts
vectorizer = CountVectorizer(tokenizer = lambda x: x.split(), binary=True)
multilabel_y = vectorizer.fit_transform(df['Tags'])

In [18]:
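# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
y_list = list(vectorizer.get_feature_names_out())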
print(len(y_list))

3956


In [19]:
#print(multilabel_y[0:12].toarray())
total_size=df.shape[0]
train_size=int(0.80*total_size)

x_train=df[['Title','Body']].head(train_size)
x_test=df[['Title','Body']].tail(total_size - train_size)

y_train = multilabel_y[0:train_size,:]
y_test = multilabel_y[train_size:total_size,:]
print("x_train",x_train.shape)
print("x_test",x_test.shape)
print("y_train",y_train.shape)
print("y_test",y_test.shape)

x_train (4000, 2)
x_test (1000, 2)
y_train (4000, 3956)
y_test (1000, 3956)


### Building the X Matrix

It is time to create the X matrix. In x_train and x_test, “Title” and “Body” are separate features, so we combine them into a single text column and vectorize it. I used TF-IDF here (alternatively we could use CountVectorizer, HashingVectorizer, etc.). Trying different vectorizers is one way to improve the model's scores.
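
For intuition about what TF-IDF computes, here is a minimal standard-library sketch. It follows scikit-learn's documented default weighting: raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization per row. It leaves out stop words, n-grams, and `max_features`. The sample documents and the function name `tfidf` are assumptions for illustration.

```python
import math
from collections import Counter

# Tiny illustrative corpus (not from the dataset).
docs = [
    "lock file for application",
    "parse file in java",
    "java swing tree",
]

def tfidf(docs):
    """Return (vocabulary, L2-normalized TF-IDF rows) for whitespace tokens."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, weights = tfidf(docs)
first = dict(zip(vocab, weights[0]))
print(round(first["lock"], 3), round(first["file"], 3))
```

In the first document, "lock" (which appears in one document) gets a higher weight than "file" (which appears in two). This is the point of the idf term: rarer words are treated as more informative.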

In [20]:

x_train['Text']= x_train['Title'] + " " + x_train['Body']
x_test['Text']= x_test['Title']+" "+ x_test['Body']
print(x_train.shape)
print(x_test.shape)
x_train.head()

(4000, 3)
(1000, 3)


Unnamed: 0,Title,Body,Text
0,Lock file for application,I working on a service. This service will run ...,Lock file for application I working on a servi...
1,C++ Console Parsing METAR data,I am working on my first web app weather visua...,C++ Console Parsing METAR data I am working on...
2,How to connect WIX template with 3rd party RES...,I created a simple web site using WIX platform...,How to connect WIX template with 3rd party RES...
3,Need Help implementing FileNameFilter java,I want to create an application which will sho...,Need Help implementing FileNameFilter java I w...
4,C # Excel Interop - adding the function module...,I need to add a some functionality to new work...,C # Excel Interop - adding the function module...


#### Text is the combination of Title and Body, and we train the model on this Text column only

In [21]:
x_train = x_train['Text']
x_test = x_test['Text']
print(x_train.shape)
print(x_test.shape)

(4000,)
(1000,)


In [22]:
vectorizer = TfidfVectorizer(analyzer='word', max_features=1000, ngram_range=(1,3), stop_words='english')
x_train_multilabel = vectorizer.fit_transform(x_train)
x_test_multilabel = vectorizer.transform(x_test)

In [23]:
print("x_train",x_train_multilabel.shape)
print("x_test",x_test_multilabel.shape)
print("y_train",y_train.shape)
print("y_test",y_test.shape)

x_train (4000, 1000)
x_test (1000, 1000)
y_train (4000, 3956)
y_test (1000, 3956)


In [24]:
import pickle
with open("vectorizer1.pkl","wb") as f:
   pickle.dump(vectorizer,f)

### Model Build

In [25]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.metrics import f1_score,precision_score,recall_score
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

In [26]:
from sklearn.svm import LinearSVC
clf = OneVsRestClassifier(LinearSVC(C=1.5, penalty = 'l1', dual=False))
clf.fit(x_train_multilabel, y_train)
predictions = clf.predict(x_test_multilabel)

In [27]:
print("accuracy :",metrics.accuracy_score(y_test,predictions))
print("macro f1 score :",metrics.f1_score(y_test, predictions, average = 'macro'))
print("micro f1 score :",metrics.f1_score(y_test, predictions, average = 'micro'))
print("hamming loss :",metrics.hamming_loss(y_test,predictions))

accuracy : 0.034
macro f1 score : 0.008854759944227368
micro f1 score : 0.261661606403533
hamming loss : 0.0006761880687563195
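
The accuracy of 0.034 and the hamming loss of about 0.0007 look contradictory, but they measure different things. On multi-label input, `accuracy_score` is subset accuracy: a question counts as correct only if its entire tag vector matches. Hamming loss is the fraction of individual label cells that are wrong, and with 3956 labels almost all cells are 0 in both the truth and the prediction. The sketch below reimplements the three measures in plain Python on toy data. The data and function names are illustrative assumptions, not the scikit-learn implementations.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual label cells that are wrong."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))

def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives pooled over all labels."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        for a, b in zip(t, p):
            tp += int(a == 1 and b == 1)
            fp += int(a == 0 and b == 1)
            fn += int(a == 1 and b == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Two questions, ten possible tags; the first prediction misses one tag.
y_true = [[1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]]
y_pred = [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]]

print(subset_accuracy(y_true, y_pred))
print(hamming_loss(y_true, y_pred))
print(micro_f1(y_true, y_pred))
```

One missed tag halves the subset accuracy while barely moving the hamming loss. For a sparse label matrix like this one, the micro F1 score is usually the more informative number.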


In [28]:
import pickle
with open("tag_predictor_model.pkl","wb") as f:
   pickle.dump(clf,f)

In [31]:
#save all tags
all_Tags = pd.DataFrame({'Tags': y_list})
all_Tags.to_csv("All_tags.csv")

### Model testing

In [34]:
#load the vectorizer and model that were pickled above
with open("vectorizer1.pkl","rb") as f:
   vectorizer_saved = pickle.load(f)
with open("tag_predictor_model.pkl","rb") as f:
   model_saved = pickle.load(f)

In [35]:
user_input = ["Facing problem with javascript"]
user_text_x = vectorizer_saved.transform(user_input)
pred_result = model_saved.predict(user_text_x)
pred_result = pred_result.toarray()
pred_result= pred_result.tolist()
pred_result = pred_result[0]
print(pred_result[0])

0


In [36]:
def vect_to_lable(v):
  l = []
  for i in range(len(v)):
    if v[i] == 1:
      l.append(y_list[i])
  return l

In [37]:
print(vect_to_lable(pred_result))

['javascript']
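
The decoding step can also be written as a single list comprehension that takes the label list as an argument instead of reading the global `y_list`. This is a sketch: the name `vector_to_labels` and the four-label list are hypothetical and stand in for the real 3956-tag vocabulary.

```python
def vector_to_labels(vector, label_names):
    """Return the label names whose position in the binary vector is 1."""
    return [name for flag, name in zip(vector, label_names) if flag == 1]

# Hypothetical four-tag vocabulary and one predicted row.
label_names = ["c#", "java", "javascript", "python"]
decoded = vector_to_labels([0, 0, 1, 0], label_names)
print(decoded)
```

Passing the labels explicitly makes the helper usable with the tag list reloaded from `All_tags.csv` as well as with the in-memory `y_list`.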
