# Stack Overflow Tag Prediction 2: Topic Modeling

Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers. The goal of this project is to predict as many tags as possible with high precision and recall, because incorrect tags could degrade the user experience on Stack Overflow.

To assign tags automatically, we first use an unsupervised approach: the text analysis technique called topic modeling. The goal of topic modeling is to find the topics that are present in a corpus. Each document in the corpus is treated as a mixture of one or more topics.

## Import libraries and load dataset

In [22]:
import pandas as pd
import numpy as np
import nltk, re, pprint
from nltk import word_tokenize, pos_tag
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer


In [23]:
df_base = pd.read_csv('/home/marco/Documents/OC_Machine_learning/section_5/tags_stackoverflow/data-output/stackoverflow_processed_sample.csv', encoding='utf-8')
df_base.head()

Unnamed: 0,Lemma,tags
0,"['piece', 'c++', 'code', 'show', 'peculiar', '...","['java', 'c++', 'performance', 'optimization',..."
1,"['accidentally', 'commit', 'wrong', 'file', 'g...","['git', 'version-control', 'git-commit', 'undo..."
2,"['want', 'delete', 'branch', 'locally', 'remot...","['git', 'version-control', 'git-branch', 'git-..."
3,"['difference', 'git', 'pull', 'git', 'fetch']","['git', 'version-control', 'git-pull', 'git-fe..."
4,"['mess', 'json', 'time', 'push', 'text', 'hurt...","['json', 'http-headers', 'content-type']"
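The `Lemma` and `tags` previews above look like Python lists, but after a round trip through CSV each cell is a single string. Here is a minimal sketch of turning one cell back into tokens with the standard library, assuming the cells are valid Python list literals (the sample string is copied from row 3 of the preview):

```python
import ast

# One cell as it would arrive from read_csv: a string, not a list.
raw_cell = "['difference', 'git', 'pull', 'git', 'fetch']"

# literal_eval parses Python literals without executing arbitrary code.
tokens = ast.literal_eval(raw_cell)

# Space-joined text is a convenient input for a text vectorizer.
joined = " ".join(tokens)

print(tokens)
print(joined)
```

Whether this step is needed depends on how the vectorizer tokenizes the raw strings. It is shown here only to make the cell format explicit.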


## 1. Document-term matrix


To use a topic modeling technique, we have to (1) calculate a document-term matrix and (2) choose the number of topics for the algorithm to pick up.
The document-term matrix is calculated using either the "Bag of Words" or the "TF-IDF" approach.
The number of topics is chosen beforehand. In this notebook the same value, `num_topics`, is also passed as `max_features` to cap the vocabulary size of both vectorizers.

In [38]:
num_topics = 500

### 1.1 Bag of Words approach

To perform machine learning on text documents, we first need to turn the text content into numerical feature vectors, i.e. build a document-term matrix.
One common approach is called Bag of Words. The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears.
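To make the idea concrete, here is a small pure-Python sketch of a Bag of Words matrix. The two toy documents are made up for illustration, and the code does not reproduce scikit-learn's tokenization rules:

```python
from collections import Counter

docs = [
    "git pull git fetch",
    "delete branch git",
]

# Learn the vocabulary: every distinct word, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Model each document as a row of word counts over that vocabulary.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary word, which is the layout of the document-term matrix built below.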

In [39]:
# Initialize the CountVectorizer object, scikit-learn's bag-of-words tool.
# max_features keeps only the most frequent terms across the corpus.
c_vectorizer = CountVectorizer(analyzer="word",
                               tokenizer=None,
                               preprocessor=None,
                               stop_words=None,
                               max_features=num_topics)

# The input to fit_transform should be an iterable of strings, one per
# document, such as the "Lemma" column of our dataframe.

train_data = df_base.Lemma

In [40]:
# apply the vectorizer
bag_of_words = c_vectorizer.fit_transform(train_data)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors.
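The two steps can also be looked at separately. The sketch below uses two hypothetical pure-Python helpers, `fit` and `transform`, named after the scikit-learn methods but not taken from the library. It assumes whitespace tokenization and shows that words unseen during fitting are ignored at transform time:

```python
from collections import Counter

def fit(docs):
    """Learn the vocabulary: map each distinct word to a column index."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def transform(docs, vocabulary):
    """Turn documents into count vectors; unseen words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [0] * len(vocabulary)
        for word, count in counts.items():
            if word in vocabulary:
                row[vocabulary[word]] = count
        rows.append(row)
    return rows

train_docs = ["commit wrong file", "delete branch"]
vocabulary = fit(train_docs)
train_vectors = transform(train_docs, vocabulary)

# "new" was never seen during fit, so it gets no column.
new_vectors = transform(["commit new file file"], vocabulary)
print(train_vectors)
print(new_vectors)
```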

In [41]:
# Convert the sparse result to a dense NumPy array, which is easier to inspect
bag_of_words_array = bag_of_words.toarray()

data array size is :  (37175, 500)
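A dense array stores every zero explicitly, so it is worth estimating its size before converting. This rough calculation uses the shape reported above and assumes 8 bytes per cell (64-bit integer counts, which is an assumption about the dtype):

```python
rows, cols = 37175, 500   # shape reported above
bytes_per_cell = 8        # assumption: 64-bit integer counts

cells = rows * cols
dense_bytes = cells * bytes_per_cell
print(f"{cells:,} cells, about {dense_bytes / 1e6:.1f} MB when dense")
```

At this size the dense form is manageable. With a much larger vocabulary, keeping the sparse matrix would likely be the safer choice.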


Document-term matrix calculated with the "Bag of Words" method:

In [42]:
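# define vocabulary words
# (get_feature_names() was removed in scikit-learn 1.2;
# on versions before 1.0, use get_feature_names() instead)
vocab = c_vectorizer.get_feature_names_out()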
# build a dataframe out of the bag of words
df_tf = pd.DataFrame(bag_of_words_array, columns=vocab)
print(df_tf.shape)
df_tf.head() # visualize the matrix

(37175, 500)


Unnamed: 0,able,accept,access,achieve,action,activity,actually,add,address,alert,...,wonder,word,work,world,wrap_content,write,wrong,xcode,xml,yes
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 1.2 TF-IDF approach

Term frequency-inverse document frequency (TF-IDF) is a more refined way to build a document-term matrix than Bag of Words, which uses only word counts (term frequency).
Inverse document frequency (IDF) measures how common or rare a word is across the entire document set.
In scikit-learn's smoothed formulation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term t. The weight is 1 for a word that appears in every document and grows as the word becomes rarer.
Before building the matrix, we check which terms are the most popular, i.e. the words with the highest TF-IDF score. We expect these to correspond to the most popular tags.
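As a quick illustration of how the IDF weight behaves, the sketch below evaluates the smoothed formula ln((1 + n) / (1 + df)) + 1, which is the scikit-learn default when `smooth_idf=True`. The corpus size and document frequencies are hypothetical:

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000  # hypothetical corpus size
for doc_freq in (1000, 100, 1):
    # A word present in every document gets the minimum weight of 1;
    # the rarer the word, the larger the weight.
    print(doc_freq, round(smoothed_idf(n_docs, doc_freq), 3))
```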

#### 1.2.1 Word count and TF-IDF score


Word counts (term frequency)

In [44]:
# Sum up the counts of each vocabulary word
dist = np.sum(bag_of_words_array, axis=0)

# Pair each vocabulary word with the number of times it
# appears in the training set
df_wordcount = pd.DataFrame({'words': vocab, 'count': dist})

IDF values calculated from the bag-of-words matrix

In [46]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(bag_of_words)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [50]:
# get idf values (get_feature_names_out() requires scikit-learn 1.0 or later)
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=c_vectorizer.get_feature_names_out(), columns=["idf_weight"]).reset_index()
df_idf.rename(columns={'index': 'words'}, inplace=True)

# join with word counts data
df_words = pd.merge(df_idf, df_wordcount, on='words', how='inner')

Finally, we get a dataframe of words ordered by a corpus-level TF-IDF score: each word's total count multiplied by its IDF weight.

In [51]:
# calculate the corpus-level TF-IDF score: total count times IDF weight
df_words['TF_IDF score'] = df_words["idf_weight"] * df_words['count']
# sort in descending order so the highest scores come first
df_words = df_words.sort_values(by=['TF_IDF score'], ascending=False).reset_index(drop=True)
df_words[:10]

Unnamed: 0,words,idf_weight,count,TF_IDF score
0,file,2.759678,14932,41207.511153
1,android,4.161408,8764,36470.580759
2,class,3.15102,10665,33605.628985
3,string,3.097245,10844,33586.523454
4,like,2.269157,14454,32798.396893
5,error,3.032775,10531,31938.151572
6,use,2.445125,12761,31202.241142
7,function,3.124334,9400,29368.735538
8,code,2.632457,11127,29291.354135
9,new,2.891648,9887,28589.71957
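The ranking reflects a trade-off between the two factors. As a sanity check, the sketch below recomputes the score for three rows copied from the table above: `like` has the lowest IDF of the three but a high count, while `android` has a lower count but a higher IDF.

```python
# (word, idf_weight, count, reported score) copied from the table above
rows = [
    ("file", 2.759678, 14932, 41207.511153),
    ("android", 4.161408, 8764, 36470.580759),
    ("like", 2.269157, 14454, 32798.396893),
]

checks = []
for word, idf_weight, count, reported in rows:
    recomputed = idf_weight * count
    # idf_weight is rounded to six decimals, so allow a small tolerance
    checks.append(abs(recomputed - reported) < 0.01)
    print(word, round(recomputed, 2))
```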


#### 1.2.2 Build the matrix: TF-IDF scores to vectors

In [52]:
tfidfVectorizer = TfidfVectorizer(norm=None,analyzer='word', max_features = num_topics, use_idf=True)
tfidf_vectorizer_vectors = tfidfVectorizer.fit_transform(train_data)

In [56]:
dense = tfidf_vectorizer_vectors.todense()

In [57]:
denselist = dense.tolist()

In [58]:
# get_feature_names_out() requires scikit-learn 1.0 or later
df_tf_idf = pd.DataFrame(denselist, columns=tfidfVectorizer.get_feature_names_out())
print(df_tf_idf.shape)
df_tf_idf.head()

(37175, 500)


Unnamed: 0,able,accept,access,achieve,action,activity,actually,add,address,alert,...,wonder,word,work,world,wrap_content,write,wrong,xcode,xml,yes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.2878,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
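Because the vectorizer was created with `norm=None`, each cell should simply be the word's count in that document multiplied by the word's IDF. That is consistent with the value 4.2878 for `wrong` in row 1, where the count matrix shows a single occurrence. Here is a pure-Python sketch of the same computation on a made-up corpus, assuming the smoothed IDF formula and whitespace tokenization:

```python
import math
from collections import Counter

docs = ["commit wrong file", "delete file", "read file"]
vocabulary = sorted({w for d in docs for w in d.split()})
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
doc_freq = Counter(w for d in docs for w in set(d.split()))
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

# Unnormalised TF-IDF: raw count times IDF, one row per document.
tfidf = []
for d in docs:
    counts = Counter(d.split())
    tfidf.append([counts[w] * idf[w] for w in vocabulary])

for row in tfidf:
    print([round(value, 3) for value in row])
```

Here `file` appears in every document and gets the minimum weight of 1, while words that occur in only one document are weighted more heavily.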
