# Stack Overflow Tag Prediction 2: Data analysis

Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers. The goal of this project is to predict as many tags as possible with high precision and recall. Incorrect tags could degrade the user experience on Stack Overflow.

In this notebook, machine learning algorithms are applied to the data pre-processed in notebook 1.

## Import libraries and load dataset

In [65]:
import pandas as pd
import numpy as np
import nltk, re, pprint
from nltk import word_tokenize, pos_tag
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer


In [66]:
df_base = pd.read_csv('/home/marco/Documents/OC_Machine_learning/section_5/tags_stackoverflow/data-output/stackoverflow_processed_sample.csv', encoding='utf-8')
df_base.head()

Unnamed: 0,Lemma,tags
0,"['piece', 'c++', 'code', 'show', 'peculiar', '...","['java', 'c++', 'performance', 'optimization',..."
1,"['accidentally', 'commit', 'wrong', 'file', 'g...","['git', 'version-control', 'git-commit', 'undo..."
2,"['want', 'delete', 'branch', 'locally', 'remot...","['git', 'version-control', 'git-branch', 'git-..."
3,"['difference', 'git', 'pull', 'git', 'fetch']","['git', 'version-control', 'git-pull', 'git-fe..."
4,"['mess', 'json', 'time', 'push', 'text', 'hurt...","['json', 'http-headers', 'content-type']"


## 1. Extracting features from text files

In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.
One common approach is called a Bag of Words. The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. 

In [67]:
# Initialize the "CountVectorizer" object, which is scikit-learn's bag-of-words tool
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=500)  # keep only the 500 most frequent words
# The input to fit_transform should be a list of strings, like the "Lemma" and "tags" columns in our dataframe.

train_data = df_base.Lemma
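
One point worth checking: judging from the output of df_base.head(), each "Lemma" entry appears to be a string that looks like a Python list. If that is the case, the default analyzer re-tokenizes that string and keeps only runs of two or more word characters, so a token such as 'c++' is lost. The sketch below compares the default behaviour with one possible alternative (parsing the strings back into lists and passing the tokens through unchanged). The sample rows and variable names are illustrative assumptions, and this is not necessarily what the original analysis intended.

```python
import ast
from sklearn.feature_extraction.text import CountVectorizer

# Two rows shaped like the "Lemma" column after read_csv (assumed format)
raw = [
    "['piece', 'c++', 'code']",
    "['difference', 'git', 'pull', 'git', 'fetch']",
]

# Default analyzer: re-tokenizes the string, dropping punctuation and 1-character tokens
default_vocab = sorted(CountVectorizer().fit(raw).vocabulary_)

# Alternative: parse each string into a real list and use the tokens as they are
token_lists = [ast.literal_eval(s) for s in raw]
passthrough = CountVectorizer(analyzer=lambda tokens: tokens)
passthrough_vocab = sorted(passthrough.fit(token_lists).vocabulary_)

print(default_vocab)
print(passthrough_vocab)
```

Only the second vocabulary contains 'c++'. Whether that matters depends on how informative such tokens are for the tags being predicted.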

In [68]:
# apply the vectorizer
train_data_features = vectorizer.fit_transform(train_data)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors.

In [69]:
# NumPy arrays are easy to work with, so convert the sparse result
# to a dense array
train_data_featnum = train_data_features.toarray()
# Check the shape of the training data array: (documents, vocabulary words)
print('data array size is : ', train_data_featnum.shape)

data array size is :  (37175, 500)


### 1.1 Term frequency
What are the 10 most frequent words?

In [70]:
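# Get the vocabulary words, in the same order as the feature columns
# (scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2)
vocab = vectorizer.get_feature_names()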


# Sum up the counts of each vocabulary word
dist = np.sum(train_data_featnum, axis=0)

# For each, append to a list the vocabulary word and the number of times it 
# appears in the training set
counts = []
words = []
for word, count in zip(vocab, dist):
    counts.append(count)
    words.append(word)
    

In [71]:
df_tf = pd.DataFrame({'words': words, 'tf':counts})



### 1.2 Inverse document frequency
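Inverse document frequency (IDF) measures how common or rare a word is across the entire document set.
With scikit-learn's smoothing (smooth_idf=True), IDF values are always at least 1 and have no fixed upper bound. The closer a value is to 1, the more common the word is.

The IDF values are computed from the word counts stored earlier in the train_data_features matrix.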

In [72]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(train_data_features)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
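
For reference, with smooth_idf=True scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the word t. The sketch below checks this on a few toy documents; the documents and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy documents (hypothetical)
docs = ["git pull", "git fetch", "git branch", "json file"]

counts = CountVectorizer().fit_transform(docs)
transformer = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts)

n_docs = counts.shape[0]
# Number of documents in which each word appears at least once
doc_freq = (counts.toarray() > 0).sum(axis=0)

# Smoothed IDF, computed by hand
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(manual_idf, transformer.idf_))
print(transformer.idf_.min())
```

The most common word ("git", present in 3 of 4 documents) gets the smallest weight, and no weight falls below 1.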

In [80]:
# Collect the IDF weights in a dataframe, one row per vocabulary word
df_idf = pd.DataFrame(tfidf_transformer.idf_, index= vectorizer.get_feature_names(),columns=["idf_weights"]).reset_index()
df_idf.rename(columns={'index':'words'},inplace=True)
# Merge the IDF weights with the term frequencies
df_words = pd.merge(df_idf, df_tf, on='words', how='inner')

In [81]:
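# Corpus-level score: total count of each word multiplied by its IDF weight
df_words['TF_IDF score'] = df_words["idf_weights"]*df_words['tf']
# Sort in descending order and show the 10 highest-scoring words
df_words = df_words.sort_values(by=['TF_IDF score'], ascending=False).reset_index(drop = True)
df_words[:10]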

Unnamed: 0,words,idf_weights,tf,TF_IDF score
0,file,2.759678,14932,41207.511153
1,android,4.161408,8764,36470.580759
2,class,3.15102,10665,33605.628985
3,string,3.097245,10844,33586.523454
4,like,2.269157,14454,32798.396893
5,error,3.032775,10531,31938.151572
6,use,2.445125,12761,31202.241142
7,function,3.124334,9400,29368.735538
8,code,2.632457,11127,29291.354135
9,new,2.891648,9887,28589.71957
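
The score in this table is a corpus-level aggregate: the total count of a word times its IDF weight. As a sanity check, it should match the column sums of an unnormalized TF-IDF matrix (norm=None). The sketch below verifies this on toy documents; the documents and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy documents (hypothetical)
docs = ["git pull git fetch", "delete git branch", "json content type"]

# Route 1: total counts multiplied by IDF weights, as in the table above
counts = CountVectorizer().fit_transform(docs)
idf = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts).idf_
score_from_counts = np.asarray(counts.sum(axis=0)).ravel() * idf

# Route 2: per-document TF-IDF without normalization, summed over documents
tfidf = TfidfVectorizer(norm=None, use_idf=True).fit_transform(docs)
score_from_tfidf = np.asarray(tfidf.sum(axis=0)).ravel()

print(np.allclose(score_from_counts, score_from_tfidf))
```

With an L2 norm (the TfidfVectorizer default) the two routes would no longer agree, because each document vector is rescaled before summing.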


### 1.3 TF-IDF scores to vector

In [75]:
tfidfVectorizer = TfidfVectorizer(norm=None,analyzer='word', max_features = 500, use_idf=True)
tfidf_vectorizer_vectors = tfidfVectorizer.fit_transform(train_data)

In [76]:
dense = tfidf_vectorizer_vectors.todense()

In [77]:
denselist = dense.tolist()

In [83]:
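# Transpose so that rows are vocabulary words and columns are documents, matching the word index
df_tf_idf = pd.DataFrame(np.asarray(dense).T, index=tfidfVectorizer.get_feature_names()).reset_index()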
print(df_tf_idf.shape)
df_tf_idf.head()

(500, 37176)


Unnamed: 0,index,0,1,2,3,4,5,6,7,8,...,37165,37166,37167,37168,37169,37170,37171,37172,37173,37174
0,able,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,7.923554,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,accept,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,access,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.273493
3,achieve,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,action,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
