# Stack Exchange Question Classifier

This is a solution to the HackerRank problem Stack Exchange Question Classifier. The challenge consists of classifying Stack Exchange questions into one of ten possible topics (Electronics, Mathematics, Photography, etc.). This notebook is divided into the following sections:

1. Acquiring Data
2. Data Preparation
3. Algorithms
4. Results
5. Conclusions 

## 1. Acquiring Data

Training data is available at https://s3.amazonaws.com/hr-testcases/845/assets/training.json, where each line is a JSON object. Test data is available as a zip file at https://hr-testcases.s3.amazonaws.com/845/assets/testcase.zip. We decompress it into the testcases folder, which the cells that load the test data read from.
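The decompression step can be sketched with the standard library. This is a minimal sketch, not the procedure actually used for this notebook: the helper name `extract_zip` is hypothetical, and it assumes the archive has already been downloaded to disk.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    # Create the destination folder if it does not exist yet
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

A call such as `extract_zip('testcase.zip', 'testcases')` would then unpack the archive; both the local file name and the destination folder are assumptions here.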

First, we import the libraries needed to load the data.

In [1]:
from urllib.request import urlopen
import json
import pandas as pd

The JSON file is read line by line because each line is a JSON object. The first line is then discarded because it is just an integer indicating the number of records.

In [2]:
data = []
url = 'https://s3.amazonaws.com/hr-testcases/845/assets/training.json'
data_raw = urlopen(url)
for line in data_raw:
    data.append(json.loads(line))
data.pop(0)

20219
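The file layout described above (a count on the first line, then one JSON object per line) can be tried out without downloading anything. The sketch below is a variant of the cell above, not a replacement for it: the helper `read_json_lines` is hypothetical, and the two sample records are made up for illustration.

```python
import io
import json


def read_json_lines(stream):
    """Parse a stream whose first line is a record count and whose other lines are JSON objects."""
    # The header line only holds the number of records that follow
    expected = int(next(stream))
    # Every remaining non-empty line is one JSON object
    records = [json.loads(line) for line in stream if line.strip()]
    return expected, records


# Made-up sample in the same layout as training.json
sample = io.StringIO(
    '2\n'
    '{"topic": "electronics", "question": "Sample question A", "excerpt": "Sample excerpt A"}\n'
    '{"topic": "wordpress", "question": "Sample question B", "excerpt": "Sample excerpt B"}\n'
)
expected, records = read_json_lines(sample)
print(expected, len(records))  # 2 2
```

Comparing the header count with `len(records)` is a cheap sanity check that no line was lost while reading.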

We turn this JSON data into a pandas DataFrame.

In [3]:
df_train = pd.DataFrame.from_dict(data)
df_train

Unnamed: 0,topic,question,excerpt
0,electronics,What is the effective differencial effective o...,"I'm trying to work out, in general terms, the ..."
1,electronics,Heat sensor with fan cooling,Can I know which component senses heat or acts...
2,electronics,Outlet Installation--more wires than my new ou...,I am replacing a wall outlet with a Cooper Wir...
3,electronics,Buck Converter Operation Question,"i have been reading about the buck converter, ..."
4,electronics,"Urgent help in area of ASIC design, verificati...",I need help with deciding on a Master's Projec...
...,...,...,...
20214,wordpress,How to set a Custom Post Type as the parent of...,I have a Custom Post Type called Recipe with p...
20215,wordpress,Tracking last login and last visit,I'm using the code below to track when a user ...
20216,wordpress,How to exclude the particular category from th...,"add_action( 'pre_get_posts', 'custom_pre_get_p..."
20217,wordpress,display sub categories assoccited with each po...,i have wordpress blog with many posts. each po...


The same is done with the test data from the testcases folder.

In [4]:
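data_test = []
# Forward slashes avoid the invalid escape sequence '\i' and work on every OS
with open('testcases/input00.txt', 'r', encoding="utf8") as f:
    for line in f:
        data_test.append(json.loads(line))
    # The first line only holds the number of records, so it is dropped
    data_test.pop(0)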

In [5]:
test_output = []
# Forward slashes avoid the invalid escape sequence '\o' and work on every OS
with open('testcases/output00.txt', 'r', encoding="utf8") as f:
    # One expected topic per line; strip the trailing newline
    for line in f:
        test_output.append(line.rstrip())

In [6]:
df_test = pd.DataFrame.from_dict(data_test)
df_test['topic'] = test_output
df_test

Unnamed: 0,question,excerpt,topic
0,Frequency Inverter LS IS5,I have been working with a IS5 frequency inver...,electronics
1,Why did the designer use this motor?,I was taking apart this thing that I bought fr...,electronics
2,Help with amplifier with feedback,I am starting to learn to use operational ampl...,electronics
3,Single Supply Op Amp to Amplify 0-3.3V to 0-10V,This may be a very basic question but as the u...,electronics
4,How to start with 3d tracking? [on hold],I am new to all of this and I really feel over...,electronics
...,...,...,...
15027,Add exception to WP Mobile Detector,In the function websitez_detect_mobile_device ...,wordpress
15028,Query Problem in getting top viewed posts,"I wanted to display top viewed posts by month,...",wordpress
15029,Translate Woosidebars plugin strings,I'm trying to translate Woosidebars strings i...,wordpress
15030,Fatal error: Allowed memory size of 37748736 b...,I hope this is not a duplicate question.\n\nI ...,wordpress


As we can see, there is text data in the 'question' and 'excerpt' columns, which correspond to the title and the description of a Stack Exchange thread respectively. Since we want to use all available data, the 'question' and 'excerpt' columns are concatenated, separated by a space, into a single 'text' column.

In [7]:
df_train['text'] = df_train[['question','excerpt']].agg(' '.join, axis=1)
df_train

Unnamed: 0,topic,question,excerpt,text
0,electronics,What is the effective differencial effective o...,"I'm trying to work out, in general terms, the ...",What is the effective differencial effective o...
1,electronics,Heat sensor with fan cooling,Can I know which component senses heat or acts...,Heat sensor with fan cooling Can I know which ...
2,electronics,Outlet Installation--more wires than my new ou...,I am replacing a wall outlet with a Cooper Wir...,Outlet Installation--more wires than my new ou...
3,electronics,Buck Converter Operation Question,"i have been reading about the buck converter, ...",Buck Converter Operation Question i have been ...
4,electronics,"Urgent help in area of ASIC design, verificati...",I need help with deciding on a Master's Projec...,"Urgent help in area of ASIC design, verificati..."
...,...,...,...,...
20214,wordpress,How to set a Custom Post Type as the parent of...,I have a Custom Post Type called Recipe with p...,How to set a Custom Post Type as the parent of...
20215,wordpress,Tracking last login and last visit,I'm using the code below to track when a user ...,Tracking last login and last visit I'm using t...
20216,wordpress,How to exclude the particular category from th...,"add_action( 'pre_get_posts', 'custom_pre_get_p...",How to exclude the particular category from th...
20217,wordpress,display sub categories assoccited with each po...,i have wordpress blog with many posts. each po...,display sub categories assoccited with each po...


In [8]:
df_test['text'] = df_test[['question','excerpt']].agg(' '.join, axis=1)
df_test

Unnamed: 0,question,excerpt,topic,text
0,Frequency Inverter LS IS5,I have been working with a IS5 frequency inver...,electronics,Frequency Inverter LS IS5 I have been working ...
1,Why did the designer use this motor?,I was taking apart this thing that I bought fr...,electronics,Why did the designer use this motor? I was tak...
2,Help with amplifier with feedback,I am starting to learn to use operational ampl...,electronics,Help with amplifier with feedback I am startin...
3,Single Supply Op Amp to Amplify 0-3.3V to 0-10V,This may be a very basic question but as the u...,electronics,Single Supply Op Amp to Amplify 0-3.3V to 0-10...
4,How to start with 3d tracking? [on hold],I am new to all of this and I really feel over...,electronics,How to start with 3d tracking? [on hold] I am ...
...,...,...,...,...
15027,Add exception to WP Mobile Detector,In the function websitez_detect_mobile_device ...,wordpress,Add exception to WP Mobile Detector In the fun...
15028,Query Problem in getting top viewed posts,"I wanted to display top viewed posts by month,...",wordpress,Query Problem in getting top viewed posts I wa...
15029,Translate Woosidebars plugin strings,I'm trying to translate Woosidebars strings i...,wordpress,Translate Woosidebars plugin strings I'm tryi...
15030,Fatal error: Allowed memory size of 37748736 b...,I hope this is not a duplicate question.\n\nI ...,wordpress,Fatal error: Allowed memory size of 37748736 b...


## 2. Data Preparation

To prepare our data for the machine learning algorithm, we apply some cleaning and transformations. First, some important packages are imported.

In [9]:
import numpy as np
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\henri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\henri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


We place the raw text data in the X_raw variable and the labels in y_train.

In [10]:
X_raw, y_train = df_train['text'], df_train['topic']

Some transformations are applied to every document in the dataset:
- Replace every non-word character with a space using a regular expression (\W)
- Transform all upper case characters to lower case
- Turn every document into a list of words with document.split()
- Apply lemmatization to every word in every document, which consists of converting a word to its base (dictionary) form. Since lemmatize is called without a part-of-speech tag, WordNetLemmatizer treats every word as a noun.
- Join all transformed words in each document back into a single string
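The effect of these steps on a single document can be sketched as follows. This is a minimal sketch: `clean_document` is a hypothetical helper, and its `lemmatize` parameter is a placeholder that defaults to leaving words unchanged, so the example runs without NLTK. In the notebook, the lemmatization step is performed by WordNetLemmatizer.

```python
import re


def clean_document(text, lemmatize=lambda word: word):
    """Apply the cleaning steps to one document and return the cleaned string."""
    # Replace every non-word character (punctuation, symbols) with a space
    text = re.sub(r'\W', ' ', text)
    # Lower-case the text and split it into words on whitespace
    words = text.lower().split()
    # Map every word to its base form, then join the words again
    return ' '.join(lemmatize(word) for word in words)


print(clean_document("Buck-Converter: what's the duty cycle?"))
# buck converter what s the duty cycle
```

Note that \W keeps underscores and digits, and that an apostrophe splits a word in two ("what's" becomes "what s").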

In [11]:
X_train = []
stemmer = WordNetLemmatizer()

for i in range(0, len(X_raw)):
    document = re.sub(r'\W', ' ', X_raw[i])
    document = document.lower()
    document = document.split()
    document = [stemmer.lemmatize(word) for word in document]
    document = ' '.join(document)
    X_train.append(document)

## 3. Algorithms

Naive-Bayes is a widely used algorithm for text classification, so we'll use it for this problem. First, it is necessary to transform the text into a document-term matrix. In this matrix each row corresponds to a document, each column corresponds to a word in the dataset vocabulary, and each value is the number of times that word occurs in the document. After that, we calculate the TF-IDF, a statistic that weighs how frequent a word is in a document against how common it is across the entire dataset. Finally, we apply Naive-Bayes. The whole process is wrapped in a scikit-learn Pipeline object.
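To make the count and TF-IDF steps concrete, here is a minimal pure-Python sketch on three made-up documents. It assumes the smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row, which to our understanding matches the default settings of TfidfTransformer. It splits on whitespace only, so it does not reproduce CountVectorizer's tokenisation or stop-word removal. The helper name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return the sorted vocabulary and one L2-normalised TF-IDF row per document."""
    # Term counts per document: the rows of the document-term matrix
    counts = [Counter(doc.split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term
    df = {term: sum(term in c for c in counts) for term in vocabulary}
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    rows = []
    for c in counts:
        weights = [c[term] * idf[term] for term in vocabulary]
        # Scale each row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


docs = ["op amp gain", "op amp noise", "lens aperture"]
vocabulary, rows = tfidf_rows(docs)
for term, weight in zip(vocabulary, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "gain" receives a larger weight than "op" or "amp" because it appears in only one of the three documents, while "op" and "amp" appear in two.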

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([('vect', CountVectorizer(stop_words=stopwords.words('english'))),
                  ('tfidf', TfidfTransformer()),
                  ('classifier', MultinomialNB())])

The pipeline is fitted on the training data to obtain the model.

In [13]:
model = pipeline.fit(X_train, y_train)
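What fitting a multinomial Naive-Bayes classifier involves can be sketched in a few lines. This is an illustration under simplifying assumptions, not scikit-learn's implementation: it works on raw word counts instead of TF-IDF weights, uses add-one (Laplace) smoothing via a hypothetical `alpha` parameter, and is trained on four made-up documents. The names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(documents, labels, alpha=1.0):
    """Fit a multinomial Naive-Bayes model on whitespace-tokenised documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n in class_docs.items():
        # Smoothed total number of words seen in this class
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model[label] = (
            math.log(n / len(labels)),  # log prior of the class
            {w: math.log((word_counts[label][w] + alpha) / total) for w in vocabulary},
        )
    return model


def predict_nb(model, document):
    """Return the label with the highest log-posterior; unseen words are ignored."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood.get(w, 0.0) for w in document.split())
    return max(model, key=score)


train_docs = ["op amp gain", "resistor voltage gain", "lens aperture", "camera lens flash"]
train_labels = ["electronics", "electronics", "photography", "photography"]
toy_model = train_nb(train_docs, train_labels)
print(predict_nb(toy_model, "voltage of an op amp"))  # electronics
```

Working with log-probabilities avoids numerical underflow when many word likelihoods are multiplied together.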

## 4. Results

To check how the model performs on unseen data, we evaluate it on the test data. The same transformations are first applied to the test data:

In [14]:
X_test_raw, y_test = df_test['text'], df_test['topic']

In [15]:
X_test = []

for i in range(0, len(X_test_raw)):
    document = re.sub(r'\W', ' ', X_test_raw[i])
    document = document.lower()
    document = document.split()
    document = [stemmer.lemmatize(word) for word in document]
    document = ' '.join(document)
    X_test.append(document)

We then predict a topic for every test document with our model.

In [16]:
predicted = model.predict(X_test)

Our predictions reach 88.52% accuracy on the test data, which is a great result.

In [17]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test,predicted)

0.885244811069718
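Accuracy is simply the fraction of test documents whose predicted topic matches the expected one. The sketch below computes it by hand on made-up labels, together with a per-topic breakdown (the share of each topic's questions that were classified correctly), which can show whether some topics are harder than others. The helper names are hypothetical and the labels are not taken from the actual test set.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the expected labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_topic_recall(y_true, y_pred):
    """For each topic, the share of its documents that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {topic: hits[topic] / totals[topic] for topic in totals}


# Made-up labels for illustration
y_true = ["electronics", "electronics", "wordpress", "wordpress"]
y_pred = ["electronics", "wordpress", "wordpress", "wordpress"]
print(accuracy(y_true, y_pred))          # 0.75
print(per_topic_recall(y_true, y_pred))  # {'electronics': 0.5, 'wordpress': 1.0}
```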

## 5. Conclusions

In this project we used a Naive-Bayes algorithm to predict topics from text data. We cleaned and lemmatized the text and prepared it to train the algorithm. The model performed well, reaching 88.52% accuracy on the test data.