# Stack Overflow question tag predictor

#### Team Name- COT
#### Team Members- Nilesh Tanwar (203050060), Ankit Kumar (203050109), Shivam Dixit (193050012)

The task is to classify questions that have been posted on Stack Overflow and assign each a suitable tag, so that the question reaches the audience that follows that tag.

### Steps involved -
1. Acquire the data. Kaggle hosts a dataset containing 10% of Stack Overflow questions. Link: https://www.kaggle.com/stackoverflow/stacksample
2. Explore the data to identify patterns, relationships and summary statistics.
3. Produce a final cleaned dataset to which different machine learning techniques can be applied.
4. Apply different machine learning techniques and compare them using suitable metrics.

### Import packages

In [1]:
import pandas as pd
import re
import string
import collections
import numpy as np
from collections import Counter
from sklearn import metrics
from bs4 import BeautifulSoup

### Read data from .csv files using read_csv() function

In [2]:
df1=pd.read_csv('Questions.csv', encoding='latin-1')
df2=pd.read_csv('Tags.csv',encoding='latin-1')

In [3]:
df1.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
2,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...
3,180,2089740.0,2008-08-01T18:42:19Z,,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...
4,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...


In [4]:
df2.head()

Unnamed: 0,Id,Tag
0,80,flex
1,80,actionscript-3
2,80,air
3,90,svn
4,90,tortoisesvn


### Exploratory Data Analysis (EDA)

Exploratory data analysis of each column produced some interesting results, and the final dataset is reduced to contain only the useful information.

Some observed data patterns:
1. OwnerUserId, ClosedDate and, to some extent, even Score show no correlation with the assigned tag. These fields tell us something about the question, but not enough to help guess its tag, so it is better to drop these columns.
2. CreationDate does not directly influence the assigned tag either. If trends over the years could be observed, such as the popularity of a particular tag increasing or decreasing with time, they could be added to the prediction model as an additional factor. However, extracting such a trend reliably is difficult, and there is no guarantee that the trend will continue in the future.
3. That leaves the question Title and Body as the fields that will help us predict a suitable tag.
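
The yearly trend mentioned in point 2 could be measured roughly as follows. This is only a sketch on hypothetical `(CreationDate, Tag)` pairs, not part of the notebook's pipeline; the function name `tag_share_by_year` and the toy rows are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (creation_date, tag) pairs standing in for the joined
# Questions/Tags data; the real notebook drops CreationDate instead.
rows = [
    ("2008-08-01T13:57:07Z", "c#"),
    ("2008-09-12T10:00:00Z", "java"),
    ("2008-11-03T08:30:00Z", "c#"),
    ("2012-01-15T09:00:00Z", "android"),
    ("2012-03-20T17:45:00Z", "android"),
    ("2012-06-02T12:10:00Z", "c#"),
]

def tag_share_by_year(rows):
    """Return {year: {tag: fraction of that year's tagged rows}}."""
    counts = defaultdict(Counter)
    for created, tag in rows:
        year = int(created[:4])  # ISO 8601 timestamps start with the year
        counts[year][tag] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {tag: n / total for tag, n in counter.items()}
    return shares

shares = tag_share_by_year(rows)
for year in sorted(shares):
    print(year, shares[year])
```

A share that rises or falls steadily across years would indicate a trend, but as noted above there is no guarantee that it continues.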

### Cleaning data - Removing unnecessary columns
We can now drop the unnecessary columns and form the final data frame, which keeps only the columns relevant to predicting the tag for a question.

In [5]:
df1=df1.drop(columns=['ClosedDate','OwnerUserId','CreationDate','Score'])

In [6]:
df1.head()

Unnamed: 0,Id,Title,Body
0,80,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
2,120,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...
3,180,Function for creating color wheels,<p>This is something I've pseudo-solved many t...
4,260,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...


Next we look at how the question body is stored and how it can be used to predict the right tag.

With the unneeded columns removed, we join the two dataframes, df1 and df2, on their common column Id so that each question row carries its tag. The Id column is kept for now, because it is used further down to remove duplicate questions.

In [7]:
df_join = pd.merge(left=df1, right=df2, left_on='Id', right_on='Id')

In [8]:
df_join.head()

Unnamed: 0,Id,Title,Body,Tag
0,80,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,flex
1,80,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,actionscript-3
2,80,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,air
3,90,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,svn
4,90,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,tortoisesvn


#### Note: The 10 most frequent tags cover a large part of the dataset, so to keep prediction and model analysis simple we limit the data to those 10 tags

In [9]:
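# Count how often each tag occurs and keep the 10 most frequent
top_tag = collections.Counter(df_join['Tag']).most_common(10)
print(top_tag)
tags = [tag for tag, count in top_tag]
# Discard rows whose tag is outside the top 10
df_join['Tag'] = df_join['Tag'].apply(lambda x: x if x in tags else None)
df_join.dropna(inplace=True)
# Keep one row per question (its first remaining tag), making the task single-label
df_join.drop_duplicates(subset=['Id'], inplace=True)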

[('javascript', 124155), ('java', 115212), ('c#', 101186), ('php', 98808), ('android', 90659), ('jquery', 78542), ('python', 64601), ('html', 58976), ('c++', 47591), ('ios', 47009)]
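
How much of the data the most frequent tags cover can be checked directly. The sketch below uses a hypothetical toy tag list and an illustrative helper name, `top_k_coverage`; in the notebook the input would be the `Tag` column of the joined dataframe before filtering.

```python
from collections import Counter

def top_k_coverage(tags, k):
    """Fraction of all tag assignments that belong to the k most frequent tags."""
    counts = Counter(tags)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / sum(counts.values())

# Hypothetical toy tag column, used only to illustrate the calculation.
toy_tags = ["java", "java", "java", "python", "python", "c#", "svn", "air"]
coverage = top_k_coverage(toy_tags, 2)
print(f"top-2 tags cover {coverage:.1%} of tag assignments")
```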


In [10]:
df_join.head()

Unnamed: 0,Id,Title,Body,Tag
14,260,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,c#
18,330,Should I use nested classes in this case?,<p>I am working on a collection of classes use...,c++
28,650,Automatically update version number,<p>I would like the version property of my app...,c#
35,930,How do I connect to a database and loop over a...,<p>What's the simplest way to connect and quer...,c#
39,1010,"How to get the value of built, encoded ViewState?",<p>I need to grab the base64-encoded represent...,c#


### Data cleaning using BeautifulSoup

In [11]:
df_join['Body'][14]

'<p>I have a little game written in C#. It uses a database as back-end. It\'s \na <a href="http://en.wikipedia.org/wiki/Collectible_card_game">trading card game</a>, and I wanted to implement the function of the cards as a script.</p>\n\n<p>What I mean is that I essentially have an interface, <code>ICard</code>, which a card class implements (<code>public class Card056 : ICard</code>) and which contains function that are called by the game.</p>\n\n<p>Now, to make the thing maintainable/moddable, I would like to have the class for each card as source code in the database and essentially compile it on first use. So when I have to add/change a card, I\'ll just add it to the database and tell my application to refresh, without needing any assembly deployment (especially since we would be talking about 1 assembly per card which means hundreds of assemblies).</p>\n\n<p>Is that possible? Register a class from a source file and then instantiate it, etc.</p>\n\n<pre><code>ICard Cards[current] =

In [12]:
df_join['Title'][14]

'Adding scripting functionality to .NET applications'

### Observation
1. The question body is stored as HTML. This markup is not needed to predict the tag.
2. The text also contains repeated spaces and line breaks. Removing this noise gives a cleaner question body.
3. We use Python's "re" library, together with BeautifulSoup, to remove code snippets, markup and such characters.

In [14]:
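def clean(text):
    # Remove code snippets first; DOTALL lets '.' span the newlines inside <pre><code> blocks
    text = re.sub(r'<pre><code>.*?</code></pre>', ' ', text, flags=re.DOTALL)
    # Strip the remaining HTML tags and lowercase the visible text
    text = BeautifulSoup(text, "lxml").get_text().lower()
    # Collapse whitespace, double quotes and commas so each record fits on one unquoted CSV line
    text = re.sub(r'[\s",]+', ' ', text).strip()
    return text

# An explicit UTF-8 encoding avoids depending on the platform's default codec
with open("clean_data.csv", "w", encoding="utf-8") as f:
    f.write("Tag,Text\n")
    i = 0
    for index, row in df_join.iterrows():
        text = clean(row['Title'] + " " + row['Body'])
        if len(text) < 1:
            continue
        f.write(row['Tag'] + "," + text + "\n")
        i = i + 1
        # Print a progress marker every 10,000 rows
        if i % 10000 == 0:
            print(i, end=" ")

Note: without an explicit encoding, open() falls back to the platform's default codec (for example cp1252 on Windows), and this cell failed with the following error. Passing encoding="utf-8" avoids it.

UnicodeEncodeError: 'charmap' codec can't encode characters in position 80-81: character maps to <undefined>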
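
An alternative to stripping commas and quotes by hand is to let Python's `csv` module quote the fields. The sketch below is not what the notebook does; the helper `write_rows` and the sample rows are assumptions, and an in-memory buffer stands in for the real file.

```python
import csv
import io

def write_rows(handle, rows):
    """Write (tag, text) pairs as CSV; csv.writer quotes commas and quotes itself."""
    writer = csv.writer(handle)
    writer.writerow(["Tag", "Text"])
    writer.writerows(rows)

# Hypothetical rows; the text deliberately contains a comma, double quotes
# and a non-ASCII character.
rows = [("c#", 'parse "a, b" into a list'), ("python", "naïve string handling")]

# For a real file one would use:
#   open("clean_data.csv", "w", encoding="utf-8", newline="")
buffer = io.StringIO(newline="")
write_rows(buffer, rows)

# Read the data back to confirm that the fields survive unchanged.
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back)
```

Written this way, the text can keep its punctuation, and `pd.read_csv` should be able to parse the quoted fields.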

#### We finally have the cleaned data

In [None]:
df_final=pd.read_csv('clean_data.csv')

In [None]:
df_final.head()

### Model selection and accuracy analysis

#### Test and train split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_final['Text'], df_final['Tag'], random_state=42, test_size=0.3, shuffle=True)

#### Vectorization of the text data to convert it into numerical TF-IDF features

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
def vectorize(train_in, test_in):
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,stop_words='english')
    train_in = vectorizer.fit_transform(train_in)
    test_in = vectorizer.transform(test_in)
    return train_in, test_in

In [None]:
X_train_1,X_test_1=vectorize(X_train,X_test)
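
To make the weighting concrete, here is a pure-Python sketch of sublinear TF-IDF on a tiny hypothetical corpus. It assumes scikit-learn's default smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2 normalisation, and it leaves out the `max_df` and stop-word filtering used above. The function name `tfidf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Sublinear TF-IDF with smoothed IDF and L2 normalisation, per tokenised document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # sublinear_tf=True replaces the raw count c with 1 + ln(c)
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical pre-tokenised documents.
docs = [
    ["python", "list", "python"],
    ["java", "list"],
    ["java", "thread"],
]
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "python" receives a larger weight than "list" because it occurs more often there and appears in fewer documents overall.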

#### 1. Ridge Classifier

In [27]:
from sklearn.linear_model import RidgeClassifier
rc1=RidgeClassifier()
rc1.fit(X_train_1,y_train)
pred1=rc1.predict(X_test_1)
print(metrics.classification_report(y_test, pred1))

              precision    recall  f1-score   support

     android       0.80      0.88      0.84     21657
          c#       0.81      0.85      0.83     30281
         c++       0.86      0.78      0.82     13678
        html       0.74      0.62      0.67      7145
         ios       0.93      0.92      0.93     13201
        java       0.84      0.81      0.82     34305
  javascript       0.75      0.79      0.77     35644
      jquery       0.59      0.43      0.50      9598
         php       0.85      0.89      0.87     27563
      python       0.93      0.90      0.92     18827

    accuracy                           0.82    211899
   macro avg       0.81      0.79      0.80    211899
weighted avg       0.82      0.82      0.82    211899
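
The two average rows of the report can be reproduced from the per-class rows. The sketch below copies the rounded F1 scores and supports from the Ridge classifier report, so it recovers the averages only to two decimals; the variable names are illustrative.

```python
# Per-class (F1, support) copied from the Ridge classifier report.
report = {
    "android": (0.84, 21657),
    "c#": (0.83, 30281),
    "c++": (0.82, 13678),
    "html": (0.67, 7145),
    "ios": (0.93, 13201),
    "java": (0.82, 34305),
    "javascript": (0.77, 35644),
    "jquery": (0.50, 9598),
    "php": (0.87, 27563),
    "python": (0.92, 18827),
}

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
total = sum(support for _, support in report.values())
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

print(round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average is pulled down by small, hard classes such as jquery, while the weighted average is dominated by the large classes.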



#### 2. Naive Bayes Classifier

In [28]:
from sklearn.naive_bayes import MultinomialNB
rc2=MultinomialNB()
rc2.fit(X_train_1,y_train)
pred2=rc2.predict(X_test_1)
print(metrics.classification_report(y_test, pred2))



              precision    recall  f1-score   support

     android       0.81      0.56      0.66     21657
          c#       0.81      0.70      0.75     30281
         c++       0.97      0.26      0.41     13678
        html       0.83      0.01      0.01      7145
         ios       0.99      0.41      0.58     13201
        java       0.51      0.85      0.64     34305
  javascript       0.46      0.93      0.61     35644
      jquery       0.00      0.00      0.00      9598
         php       0.81      0.71      0.76     27563
      python       0.97      0.44      0.60     18827

    accuracy                           0.62    211899
   macro avg       0.72      0.49      0.50    211899
weighted avg       0.70      0.62      0.60    211899



#### 3. Perceptron

In [29]:
from sklearn.linear_model import Perceptron
rc3=Perceptron()
rc3.fit(X_train_1,y_train)
pred3=rc3.predict(X_test_1)
print(metrics.classification_report(y_test, pred3))

              precision    recall  f1-score   support

     android       0.74      0.76      0.75     21657
          c#       0.82      0.73      0.78     30281
         c++       0.79      0.76      0.78     13678
        html       0.59      0.63      0.61      7145
         ios       0.86      0.90      0.88     13201
        java       0.73      0.80      0.77     34305
  javascript       0.71      0.72      0.72     35644
      jquery       0.47      0.39      0.43      9598
         php       0.83      0.83      0.83     27563
      python       0.87      0.88      0.87     18827

    accuracy                           0.76    211899
   macro avg       0.74      0.74      0.74    211899
weighted avg       0.76      0.76      0.76    211899



#### 4. Linear SVM

In [32]:
from sklearn.linear_model import SGDClassifier
rc4=SGDClassifier(loss='hinge')
rc4.fit(X_train_1,y_train)
pred4=rc4.predict(X_test_1)
print(metrics.classification_report(y_test, pred4))

              precision    recall  f1-score   support

     android       0.79      0.88      0.83     21657
          c#       0.77      0.83      0.80     30281
         c++       0.81      0.76      0.78     13678
        html       0.71      0.64      0.67      7145
         ios       0.88      0.91      0.90     13201
        java       0.86      0.77      0.81     34305
  javascript       0.75      0.79      0.77     35644
      jquery       0.60      0.35      0.44      9598
         php       0.82      0.89      0.86     27563
      python       0.92      0.89      0.90     18827

    accuracy                           0.80    211899
   macro avg       0.79      0.77      0.78    211899
weighted avg       0.80      0.80      0.80    211899



#### 5. Logistic Regression

In [33]:
# 'log_loss' selects logistic regression; scikit-learn versions before 1.1 call it 'log'
rc5=SGDClassifier(loss='log_loss')
rc5.fit(X_train_1,y_train)
pred5=rc5.predict(X_test_1)
print(metrics.classification_report(y_test, pred5))

              precision    recall  f1-score   support

     android       0.80      0.81      0.81     21657
          c#       0.66      0.82      0.73     30281
         c++       0.87      0.60      0.71     13678
        html       0.80      0.48      0.60      7145
         ios       0.93      0.75      0.83     13201
        java       0.73      0.78      0.75     34305
  javascript       0.67      0.82      0.73     35644
      jquery       0.62      0.28      0.39      9598
         php       0.83      0.84      0.84     27563
      python       0.95      0.79      0.86     18827

    accuracy                           0.76    211899
   macro avg       0.79      0.70      0.73    211899
weighted avg       0.77      0.76      0.75    211899



### Accuracy analysis
We now compare the models using the classification reports above, looking at overall accuracy and at macro-averaged F1, to decide which model is best suited to the tag prediction task.
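
The comparison can be laid out as a small ranking. The accuracy and macro-averaged F1 values below are copied from the five classification reports; the dictionary layout and names are only an illustrative sketch.

```python
# (accuracy, macro-averaged F1) copied from the classification reports.
results = {
    "Ridge Classifier": (0.82, 0.80),
    "Naive Bayes (MultinomialNB)": (0.62, 0.50),
    "Perceptron": (0.76, 0.74),
    "Linear SVM (SGD, hinge loss)": (0.80, 0.78),
    "Logistic Regression (SGD, log loss)": (0.76, 0.73),
}

# Sort by accuracy first, using macro F1 to break ties.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
for name, (accuracy, macro_f1) in ranked:
    print(f"{name:<38} accuracy={accuracy:.2f} macro-F1={macro_f1:.2f}")
```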

### Conclusion
Comparing the models above, the Ridge classifier (accuracy 0.82) and the Linear SVM (0.80) achieved the highest accuracy and can be used to predict tags. Naive Bayes scored lowest (0.62), with a recall of 0.00 on jquery. If training time matters, the Naive Bayes classifier is typically the cheapest to train, because it assumes that all features are independent of each other; training times were not measured here. The right model should be chosen after a suitable analysis of the dataset.

#### We can use the Ridge classifier to predict tags for unclassified questions. Other options, such as also using the answer texts, could be explored, but they would make the model more complex.
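
Tagging a new question requires the fitted vectorizer as well as the classifier, and the `vectorize` helper above does not return it. One possible approach, sketched here on a hypothetical toy training set rather than the real data, is to bundle both steps in a scikit-learn pipeline. The texts, tags and variable names below are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set; the notebook would pass X_train and y_train instead.
train_texts = [
    "python list comprehension syntax error",
    "pandas dataframe merge in python",
    "java nullpointerexception in arraylist",
    "java spring boot maven dependency",
    "javascript promise async await",
    "javascript array map filter",
]
train_tags = ["python", "python", "java", "java", "javascript", "javascript"]

# Bundling the vectorizer with the classifier keeps the fitted vocabulary
# available, so raw question text can be passed straight to predict().
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english'),
    RidgeClassifier(),
)
model.fit(train_texts, train_tags)

new_question = "python dataframe merge error"
predicted_tag = model.predict([new_question])[0]
print(predicted_tag)
```

New question text would first need the same cleaning as the training data before being passed to `predict`.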