# Stack Overflow Tag Prediction 2: Data analysis

Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers. The goal of this project is to predict as many of a question's tags as possible with high precision and recall, since incorrect tags can degrade the user experience on Stack Overflow.

In this notebook, machine learning algorithms are applied to the pre-processed data from notebook 1.

## Import libraries and load dataset

In [1]:
import pandas as pd
import numpy as np
import nltk, re, pprint
from nltk import word_tokenize, pos_tag
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from pyspark import SparkConf, SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

In [3]:
df_raw = pd.read_csv('/home/marco/Documents/OC_Machine_learning/section_5/tags_stackoverflow/data-output/stackoverflow_processed_sample.csv', encoding='utf-8')
df_raw.head()

Unnamed: 0,ViewCount,CreationDate,Body,Lemma,tags,Score,CommentCount,AnswerCount,FavoriteCount
0,1483128,2012-06-27 13:51:36,<p>Here is a piece of C++ code that shows some...,"['piece', 'c++', 'code', 'show', 'peculiar', '...","['java', 'c++', 'performance', 'optimization',...",24320,22,26,10983
1,8547399,2009-05-29 18:09:14,<p>I accidentally committed the wrong files to...,"['accidentally', 'commit', 'wrong', 'file', 'g...","['git', 'version-control', 'git-commit', 'undo...",20895,13,83,6776
2,8115583,2010-01-05 01:12:15,<p>I want to delete a branch both locally and ...,"['want', 'delete', 'branch', 'locally', 'remot...","['git', 'version-control', 'git-branch', 'git-...",16826,6,40,5357
3,2782271,2008-11-15 09:51:09,<p>What are the differences between <code>git ...,"['difference', 'git', 'pull', 'git', 'fetch']","['git', 'version-control', 'git-pull', 'git-fe...",11833,9,35,2333
4,2783219,2009-01-25 15:25:19,"<p>I've been messing around with <a href=""http...","['mess', 'json', 'time', 'push', 'text', 'hurt...","['json', 'http-headers', 'content-type']",10204,0,36,1446


In [4]:
df_base = df_raw[['Lemma', 'tags']]  # select the columns to process

## 1. Extracting features from text files

To perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.
One common approach is the bag-of-words model. It learns a vocabulary from all of the documents, then represents each document by counting the number of times each vocabulary word appears in it.
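
To make the two steps concrete, here is a minimal from-scratch sketch using only the standard library. The toy corpus and the helper name `bag_of_words` are illustrative assumptions, not part of the dataset or of scikit-learn.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset)
docs = ["git pull git fetch", "delete git branch"]

# Step 1: learn the vocabulary from all documents
# (sorted so that the column order is stable)
vocabulary = sorted({word for doc in docs for word in doc.split()})


def bag_of_words(doc, vocabulary):
    """Return the count of each vocabulary word in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


# Step 2: represent each document as a vector of word counts
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['branch', 'delete', 'fetch', 'git', 'pull']
print(vectors)     # [[0, 0, 1, 2, 1], [1, 1, 0, 1, 0]]
```

Each row is one document and each column one vocabulary word, which is the same layout that `CountVectorizer` produces, only stored there as a sparse matrix.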

In [5]:
# Initialize the CountVectorizer object, which is scikit-learn's bag-of-words tool
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=500)  # keep only the 500 most frequent terms

# The input to fit_transform should be an iterable of strings, like the "Lemma" column of our dataframe.
train_data = df_base.Lemma
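
One caveat worth checking: in the `df_raw.head()` output, the `Lemma` values look like Python lists that were written to CSV, so they are probably read back as plain strings such as `"['piece', 'c++', ...]"`. The sketch below assumes that is the case and uses a hypothetical cell value. It compares scikit-learn's documented default `token_pattern` with parsing the string back into a list.

```python
import ast
import re

# Hypothetical cell value, shaped like the Lemma column shown in df_raw.head()
raw_lemma = "['piece', 'c++', 'code', 'show']"

# CountVectorizer's default token_pattern keeps runs of 2+ word characters,
# so punctuation-based tokens such as 'c++' disappear
default_pattern = re.compile(r"(?u)\b\w\w+\b")
default_tokens = default_pattern.findall(raw_lemma)

# Parsing the string back into a list keeps every lemma exactly as written
parsed_tokens = ast.literal_eval(raw_lemma)

print(default_tokens)  # ['piece', 'code', 'show']
print(parsed_tokens)   # ['piece', 'c++', 'code', 'show']
```

If keeping tokens like `c++` matters, one option would be to parse the column first and build the vectorizer with `analyzer=lambda tokens: tokens`, since `analyzer` also accepts a callable.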

In [6]:
# apply the vectorizer
train_data_features = vectorizer.fit_transform(train_data)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors.

In [7]:
# NumPy arrays are easy to work with, so convert the sparse result to a
# dense array
train_data_featnum = train_data_features.toarray()
print('data array size is : ', train_data_featnum.shape)  # check the shape of the training data array

data array size is :  (37175, 500)


What are the 10 most frequent words?

In [8]:
# Get the vocabulary words
# (on scikit-learn < 1.0, use get_feature_names() instead; it was removed in 1.2)
vocab = vectorizer.get_feature_names_out()


# Sum up the counts of each vocabulary word
dist = np.sum(train_data_featnum, axis=0)

# Collect each vocabulary word and the number of times it
# appears in the training set
words = list(vocab)
counts = list(dist)
    

In [9]:
df_words = pd.DataFrame({'words': words, 'counts':counts})
df_words = df_words.sort_values(by=['counts'], ascending=False).reset_index()
df_words[:10]

Unnamed: 0,index,words,counts
0,166,file,14932
1,253,like,14454
2,468,use,12761
3,481,want,11472
4,66,code,11127
5,483,way,11012
6,425,string,10844
7,61,class,10665
8,492,work,10555
9,146,error,10531
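
The ranking above boils down to a column sum followed by a sort. As a minimal sketch of that logic in plain Python, on a hypothetical three-word count matrix (the names `count_matrix`, `totals` and `ranking` are illustrative):

```python
# Hypothetical count matrix: one row per document, one column per vocabulary word
vocab = ["error", "file", "git"]
count_matrix = [
    [1, 2, 0],
    [0, 3, 1],
    [2, 0, 0],
]

# Column sums, the pure-Python equivalent of np.sum(matrix, axis=0)
totals = [sum(column) for column in zip(*count_matrix)]

# Pair each word with its total and sort by descending count
ranking = sorted(zip(vocab, totals), key=lambda pair: pair[1], reverse=True)
top_2 = ranking[:2]

print(totals)  # [3, 5, 1]
print(top_2)   # [('file', 5), ('error', 3)]
```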


In [10]:
tfidfVectorizer = TfidfVectorizer(norm=None, analyzer='word', max_features=500)
tf = tfidfVectorizer.fit_transform(train_data)
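
For reference, here is a minimal sketch of the weighting that `TfidfVectorizer` computes. It assumes scikit-learn's documented defaults (`smooth_idf=True`, `sublinear_tf=False`) together with the `norm=None` used above, so each weight is the raw term count times `ln((1 + n) / (1 + df)) + 1`. The toy documents and the helper names `idf` and `tfidf` are illustrative assumptions.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical, not taken from the dataset)
docs = [["git", "pull", "git"], ["git", "branch"], ["json", "text"]]
n_docs = len(docs)

# Document frequency: the number of documents containing each word
doc_freq = Counter(word for doc in docs for word in set(doc))


def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1


def tfidf(doc):
    """Unnormalized TF-IDF weights (norm=None): raw count times idf."""
    term_counts = Counter(doc)
    return {word: count * idf(word) for word, count in term_counts.items()}


weights = tfidf(docs[0])
print(weights)
```

In this toy corpus `pull` appears in one document and `git` in two, so a single occurrence of `pull` is weighted more heavily than a single occurrence of `git`. Because no normalization is applied, the weights are not bounded by 1.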

In [11]:
dense = tf.todense()

In [12]:
denselist = dense.tolist()

In [13]:
# On scikit-learn < 1.0, use get_feature_names() instead of get_feature_names_out()
tf_df = pd.DataFrame(denselist, columns=tfidfVectorizer.get_feature_names_out())
#tf_df = pd.DataFrame(tf.toarray(), columns=tfidfVectorizer.get_feature_names_out())
tf_df.head()

Unnamed: 0,able,accept,access,achieve,action,activity,actually,add,address,alert,...,wonder,word,work,world,wrap_content,write,wrong,xcode,xml,yes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.2878,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
