## Problem Definition

This notebook applies TF (term frequency) and IDF (inverse document frequency) to posts collected from Stack Overflow in order to determine the topic of each post.

## Problem usage

Using this model, when you see a new question, you can tell which category it is about. So, usage: text classification.

Tutorial followed:
http://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.XBuSavmnGUn.
 

In [26]:
import pandas as pd
from time import time

n_features=10000 #number of words/features to consider from all documents

print("Loading dataset...")
t0 = time()
# read json into a dataframe
df= pd.read_json("data/stackoverflow-data-idf.json",lines=True)
print("Dataset Loaded in %0.3fs." % (time() - t0))

Loading dataset...
Dataset Loaded in 1.680s.


In [3]:
df.count()

id                          20000
title                       20000
body                        20000
answer_count                20000
comment_count               20000
creation_date               20000
last_activity_date          20000
last_editor_display_name    20000
owner_display_name          20000
owner_user_id               19762
post_type_id                20000
score                       20000
tags                        20000
view_count                  20000
accepted_answer_id          10711
favorite_count               4471
last_edit_date              10708
last_editor_user_id         10595
community_owned_date           15
dtype: int64

## Dataset understanding

In this example, we are using a Stack Overflow dataset, which is somewhat noisy and resembles what you could be dealing with in real life. You can find this dataset in the tutorial http://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.XBuSavmnGUn.

Notice that there are two files: the larger file with [20,000 posts](https://github.com/kavgan/data-science-tutorials/tree/master/tf-idf/data) is used to compute the Inverse Document Frequency (IDF), and the smaller file with 500 posts is used as a test set to extract keywords from. This dataset is based on the publicly available Stack Overflow dump on Google BigQuery.

Let's take a peek at our dataset. The first cell reads one JSON string per line from data/stackoverflow-data-idf.json into a pandas data frame, and the cells that follow print its schema and total number of posts. Here, lines=True means each line in the text file is treated as a separate JSON string, so the JSON on line 1 is independent of the JSON on line 2.
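As a minimal sketch of what lines=True implies, the snippet below parses a JSON Lines string with the standard json module instead of pandas. The two records are made up for illustration and are not taken from the dataset.

```python
import json

# two hypothetical posts, one JSON object per line (the JSON Lines layout)
raw = '{"id": 1, "title": "first post"}\n{"id": 2, "title": "second post"}\n'

# each line is parsed on its own; there is no enclosing list and no comma between lines
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

print(len(records))
print(records[1]["title"])
```

pandas does the same kind of per-line parsing and then stacks the records into rows of the data frame.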

In [5]:
## Understanding the data

# None means no truncation (the older value -1 is deprecated in recent pandas)
pd.set_option('display.max_colwidth', None)
df.head(2)


Unnamed: 0,id,title,body,answer_count,comment_count,creation_date,last_activity_date,last_editor_display_name,owner_display_name,owner_user_id,post_type_id,score,tags,view_count,accepted_answer_id,favorite_count,last_edit_date,last_editor_user_id,community_owned_date
0,4821394,Serializing a private struct - Can it be done?,"<p>I have a public class that contains a private struct. The struct contains properties (mostly string) that I want to serialize. When I attempt to serialize the struct and stream it to disk, using XmlSerializer, I get an error saying only public types can be serialized. I don't need, and don't want, this struct to be public. Is there a way I can serialize it and keep it private?</p>",1,0,2011-01-27 20:19:13.563 UTC,2011-01-27 20:21:37.59 UTC,,,163534.0,1,0,c#|serialization|xml-serialization,296,,,,,
1,3367882,How do I prevent floated-right content from overlapping main content?,"<p>I have the following HTML:</p>\n\n<pre><code>&lt;td class='a'&gt;\n &lt;img src='/images/some_icon.png' alt='Some Icon' /&gt;\n &lt;span&gt;Some content that's waaaaaaaaay too long to fit in the allotted space, but which can get cut off.&lt;/span&gt;\n&lt;/td&gt;\n</code></pre>\n\n<p>It should display as follows:</p>\n\n<pre><code>[Some content that's wa [ICON]]\n</code></pre>\n\n<p>I have the following CSS:</p>\n\n<pre><code>td.a span {\n overflow: hidden;\n white-space: nowrap;\n z-index: 1;\n}\n\ntd.a img {\n display: block;\n float: right;\n z-index: 2;\n}\n</code></pre>\n\n<p>When I resize the browser to cut off the text, it cuts off at the edge of the <code>&lt;td&gt;</code> rather than before the <code>&lt;img&gt;</code>, which leaves the <code>&lt;img&gt;</code> overlapping the <code>&lt;span&gt;</code> content. I've tried various <code>padding</code> and <code>margin</code>s, but nothing seemed to work. Is this possible?</p>\n\n<p>NB: It's <em>very</em> difficult to add a <code>&lt;td&gt;</code> that just contains the <code>&lt;img&gt;</code> here. If it were easy, I'd just do that :)</p>",2,2,2010-07-30 00:01:50.9 UTC,2012-05-10 14:16:05.143 UTC,,,1190.0,1,2,css|overflow|css-float|crop,4121,3367943.0,0.0,2012-05-10 14:16:05.143 UTC,44390.0,


In [6]:
print("Column datatypes:\n\n", df.dtypes)

Column datatypes:

 id                          int64  
title                       object 
body                        object 
answer_count                int64  
comment_count               int64  
creation_date               object 
last_activity_date          object 
last_editor_display_name    object 
owner_display_name          object 
owner_user_id               float64
post_type_id                int64  
score                       int64  
tags                        object 
view_count                  int64  
accepted_answer_id          float64
favorite_count              float64
last_edit_date              object 
last_editor_user_id         float64
community_owned_date        object 
dtype: object


In [7]:
print("Total number of posts", df.shape)

Total number of posts (20000, 19)


## Data Preparation and cleaning
- Remove digits and special characters etc in the data
- Remove stop words from the data

In [8]:
import re

# More preprocessing could be done, such as
# removing all code sections or normalizing words to their root form,
# but for simplicity we perform only some mild pre-processing.

def pre_process(text):    
    # lowercase
    text=text.lower()    
    # collapse HTML-escaped tags (&lt;...&gt;) into a placeholder, later reduced to "lt gt";
    # literal tags such as <p> are not matched, so their names survive as tokens
    text=re.sub("&lt;/?.*?&gt;"," &lt;&gt; ",text)    
    # replace special characters (parentheses, dots, etc.) and digits with a space
    text=re.sub("(\\d|\\W)+"," ",text)    
    return text
 
df['text'] = df['title'] + df['body']
#show the second 'text' just for fun
df['text'][1]
 

"How do I prevent floated-right content from overlapping main content?<p>I have the following HTML:</p>\n\n<pre><code>&lt;td class='a'&gt;\n  &lt;img src='/images/some_icon.png' alt='Some Icon' /&gt;\n  &lt;span&gt;Some content that's waaaaaaaaay too long to fit in the allotted space, but which can get cut off.&lt;/span&gt;\n&lt;/td&gt;\n</code></pre>\n\n<p>It should display as follows:</p>\n\n<pre><code>[Some content that's wa [ICON]]\n</code></pre>\n\n<p>I have the following CSS:</p>\n\n<pre><code>td.a span {\n  overflow: hidden;\n  white-space: nowrap;\n  z-index: 1;\n}\n\ntd.a img {\n  display: block;\n  float: right;\n  z-index: 2;\n}\n</code></pre>\n\n<p>When I resize the browser to cut off the text, it cuts off at the edge of the <code>&lt;td&gt;</code> rather than before the <code>&lt;img&gt;</code>, which leaves the <code>&lt;img&gt;</code> overlapping the <code>&lt;span&gt;</code> content. I've tried various <code>padding</code> and <code>margin</code>s, but nothing seemed to

In [9]:
df['text'] = df['text'].apply(lambda x:pre_process(x))
 
#show the second 'text' just for fun
df['text'][1]

'how do i prevent floated right content from overlapping main content p i have the following html p pre code lt gt lt gt lt gt some content that s waaaaaaaaay too long to fit in the allotted space but which can get cut off lt gt lt gt code pre p it should display as follows p pre code some content that s wa icon code pre p i have the following css p pre code td a span overflow hidden white space nowrap z index td a img display block float right z index code pre p when i resize the browser to cut off the text it cuts off at the edge of the code lt gt code rather than before the code lt gt code which leaves the code lt gt code overlapping the code lt gt code content i ve tried various code padding code and code margin code s but nothing seemed to work is this possible p p nb it s em very em difficult to add a code lt gt code that just contains the code lt gt code here if it were easy i d just do that p '
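The cleaned text above still contains tokens such as p, pre, code, lt and gt, because only HTML-escaped tags are matched by the pattern. The sketch below shows one possible stricter variant that strips both literal and escaped tags. The name pre_process_strict and the sample string are hypothetical, and all results in the rest of this notebook come from the original pre_process, not from this variant.

```python
import html
import re

def pre_process_strict(text):
    """Hypothetical variant of pre_process that removes literal and escaped tags."""
    text = text.lower()
    # strip literal tags such as <p> or </code>
    text = re.sub(r"</?[a-z][^>]*>", " ", text)
    # turn escaped markup (&lt;td&gt;) into real markup, then strip it as well
    text = html.unescape(text)
    text = re.sub(r"</?[a-z][^>]*>", " ", text)
    # replace digits and non-word characters with a single space
    text = re.sub(r"(\d|\W)+", " ", text)
    return text.strip()

sample = ("<p>I have the following HTML:</p><pre><code>"
          "&lt;td class='a'&gt;Some content&lt;/td&gt;</code></pre>")
print(pre_process_strict(sample))
```

Using it in place of pre_process would change the vocabulary and therefore the keyword scores shown later.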

In [25]:
# Read common stopwords from a list

def get_stop_words(stop_file_path):
    """load stop words & return as immutable frozen set """
    
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        print("Stopwords:",stopwords)
        stop_set = set(m.strip() for m in stopwords)
        print("Stopset:", stop_set)
        return frozenset(stop_set)
    
#load a set of stop words
stopwords=get_stop_words("data/stopwords.txt")

Stopwords: ['x\n', 'y\n', 'your\n', 'yours\n', 'yourself\n', 'yourselves\n', 'you\n', 'yond\n', 'yonder\n', 'yon\n', 'ye\n', 'yet\n', 'z\n', 'zillion\n', 'j\n', 'u\n', 'umpteen\n', 'usually\n', 'us\n', 'username\n', 'uponed\n', 'upons\n', 'uponing\n', 'upon\n', 'ups\n', 'upping\n', 'upped\n', 'up\n', 'unto\n', 'until\n', 'unless\n', 'unlike\n', 'unliker\n', 'unlikest\n', 'under\n', 'underneath\n', 'use\n', 'used\n', 'usedest\n', 'r\n', 'rath\n', 'rather\n', 'rathest\n', 'rathe\n', 're\n', 'relate\n', 'related\n', 'relatively\n', 'regarding\n', 'really\n', 'res\n', 'respecting\n', 'respectively\n', 'q\n', 'quite\n', 'que\n', 'qua\n', 'n\n', 'neither\n', 'neaths\n', 'neath\n', 'nethe\n', 'nethermost\n', 'necessary\n', 'necessariest\n', 'necessarier\n', 'never\n', 'nevertheless\n', 'nigh\n', 'nighest\n', 'nigher\n', 'nine\n', 'noone\n', 'nobody\n', 'nobodies\n', 'nowhere\n', 'nowheres\n', 'no\n', 'noes\n', 'nor\n', 'nos\n', 'no-one\n', 'none\n', 'not\n', 'notwithstanding\n', 'nothings\n',
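To make the role of the stop set concrete, here is a small sketch that filters tokens by hand, using a few entries from the list printed above. CountVectorizer performs its own filtering internally, so this is only an illustration; the sentence is made up.

```python
# a few raw lines as they come out of f.readlines(), newline included
raw_lines = ["your\n", "you\n", "up\n", "not\n"]

# strip the newline and freeze the set, as get_stop_words does
stop_set = frozenset(line.strip() for line in raw_lines)

tokens = "you can not set up your project".split()
kept = [tok for tok in tokens if tok not in stop_set]
print(kept)
```

A frozenset gives constant-time membership tests and cannot be modified by accident after loading.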

In [11]:
# Treat each question as a document and form a corpus of documents. 
# Then count occurrences of each word in each document of the corpus.
# Learn the vocabulary dictionary and return the document-term matrix.

corpus = df['text'].tolist()
#corpus = ["Trying to understand", "I understand well","Whateber","What","so so"]

## Calculate Term frequency
cv.fit_transform() creates the vocabulary and returns a document-term matrix. Each column in the matrix represents a word in the vocabulary, and each row represents a document in our dataset; the values are word counts. With this representation, the count is 0 whenever a word does not appear in the corresponding document, so most entries are 0.
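The layout of that matrix can be sketched in plain Python. This is not how CountVectorizer is implemented; it only mirrors the rows-are-documents, columns-are-words structure on three made-up documents.

```python
from collections import Counter

docs = ["private struct serialize", "serialize public class", "float right css float"]

# vocabulary: every distinct token, sorted so the column order is stable
vocab = sorted({tok for d in docs for tok in d.split()})

# one row per document, one column per vocabulary word, values are counts
matrix = []
for d in docs:
    counts = Counter(d.split())
    matrix.append([counts[w] for w in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Most cells are 0 even in this tiny case, which is why scikit-learn returns a sparse matrix.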

In [12]:
# create a vocabulary of words from the questions, 
# ignore words that appear in more than 85% of documents
# (words that appear in only one document are kept, since min_df is left at its default), 
# eliminate stop words
# Using scikit-learn's CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
# recent scikit-learn versions expect stop_words as a list, so convert the frozenset
cv= CountVectorizer(max_df=0.85,stop_words=list(stopwords))
word_count_vector=cv.fit_transform(corpus)
# Stop words and words occurring in more than 85% of the documents are removed.





In [13]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(len(cv.get_feature_names_out()))

121131


In [14]:
word_count_vector.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [15]:
## From here, we can see that the vocabulary is quite large, so limit it to the 10,000 most frequent words
# min_df=2 could be set, but that would drop words that occur in only one document; 
# instead, smooth_idf=True is used later in TfidfTransformer, which adds one to every document frequency

cv=CountVectorizer(max_df=0.85,stop_words=list(stopwords), max_features=n_features)
word_count_vector= cv.fit_transform(corpus)
print(word_count_vector.shape)



(20000, 10000)
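The effect of max_df and max_features can be sketched by hand on a toy corpus. The documents are made up, and the alphabetical tie-break is an assumption of this sketch, not necessarily what scikit-learn does internally.

```python
from collections import Counter

docs = [["code", "struct", "private"], ["code", "css", "float", "float"],
        ["code", "maven", "eclipse"], ["code", "eclipse", "plugin"]]
max_df, max_features = 0.85, 3

# document frequency: number of documents that contain each term
doc_freq = Counter(tok for d in docs for tok in set(d))
# total number of occurrences of each term over the whole corpus
term_freq = Counter(tok for d in docs for tok in d)

# drop terms that occur in more than max_df of the documents
kept = [t for t in term_freq if doc_freq[t] / len(docs) <= max_df]
# keep the max_features most frequent remaining terms (ties broken alphabetically here)
kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
print(kept)
```

Here "code" appears in every document and is discarded by max_df, while the cap of three terms keeps only the most frequent survivors.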


In [16]:
## Printing the first 10 words in the vocabulary
list(cv.vocabulary_.keys())[:10]

['serializing',
 'private',
 'struct',
 'public',
 'class',
 'contains',
 'properties',
 'string',
 'serialize',
 'attempt']

In [17]:
# you only need to do this once; this is the list of extracted features
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
feature_names=cv.get_feature_names_out()

## Calculate TF-IDF Frequency

TODO: One improvement could be to divide each term count by the size of the document, in order not to penalize small documents.
Also: instead of CountVectorizer followed by TfidfTransformer, we could have used TfidfVectorizer directly. The two-step version is kept here because it makes each stage easier to follow.
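The sketch below compares the two routes on a tiny made-up corpus with default settings. It assumes a scikit-learn installation that provides all three classes; the corpus and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# tiny made-up corpus, only for illustration
docs = ["eclipse maven plugin", "maven war plugin tomcat", "css float overflow"]

# two-step route used in this notebook: counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer(smooth_idf=True, use_idf=True).fit_transform(counts)

# one-step route
one_step = TfidfVectorizer(smooth_idf=True, use_idf=True).fit_transform(docs)

# largest absolute difference between the two sparse matrices
max_diff = abs(two_step - one_step).max()
print(max_diff)
```

With matching parameters, both routes should give the same weights up to floating-point error.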

Now calculate the TF-IDF scores.
 
TF: Term Frequency, which measures how frequently a term occurs in a document. This was calculated earlier. 

IDF: Inverse Document Frequency, which measures how important a term is in the whole corpus. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. We therefore need to weigh down the frequent terms and scale up the rare ones by computing the following: 

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
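As a worked illustration, the sketch below evaluates this formula next to the smoothed variant that scikit-learn documents for smooth_idf=True, ln((1 + n) / (1 + df(t))) + 1. The function names and the document frequencies are hypothetical; only the corpus size of 20,000 comes from this notebook.

```python
import math

def idf_plain(n_docs, df_t):
    """Textbook IDF: natural log of (documents / documents containing t)."""
    return math.log(n_docs / df_t)

def idf_smooth(n_docs, df_t):
    """Smoothed IDF as documented by scikit-learn for smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + df_t)) + 1

n_docs = 20000  # corpus size used in this notebook
for df_t in (20000, 200, 2):  # hypothetical document frequencies
    print(df_t, round(idf_plain(n_docs, df_t), 3), round(idf_smooth(n_docs, df_t), 3))
```

A term found in every document gets a plain IDF of 0, while the smoothed variant keeps it at 1 so the term is down-weighted rather than erased.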

An extremely important point to note here is that the IDF should always be computed on a large corpus that is representative of the texts you will be extracting keywords from.

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer
 
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)


TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
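The norm='l2' setting shown in the output means each document vector is scaled to unit length after the counts are multiplied by the IDF weights. A hand-computed sketch for one document follows; the counts and IDF values are hypothetical.

```python
import math

counts = {"eclipse": 4, "maven": 3, "project": 3}      # hypothetical term counts
idf = {"eclipse": 5.0, "maven": 6.0, "project": 2.5}   # hypothetical IDF values

# raw weight: term count times IDF
raw = {t: counts[t] * idf[t] for t in counts}

# L2 normalization: divide by the Euclidean length of the vector
length = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {t: v / length for t, v in raw.items()}

for term, score in sorted(tfidf.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(score, 3))
```

This scaling already makes documents of different lengths comparable, which is why the final scores always fall between 0 and 1.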

## Reading the test dataset

In [19]:
def sort_coo(coo_matrix):
    """sort (column index, score) pairs by descending score, ties broken by descending index"""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
    
def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]
 
    score_vals = []
    feature_vals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
 
    #map each feature name to its score
    return dict(zip(feature_vals, score_vals))
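The sorting and top-n selection can be tried without a sparse matrix. In this sketch, plain lists stand in for coo_matrix.col and coo_matrix.data, and the vocabulary and scores are hypothetical.

```python
cols = [3, 0, 2, 1]            # stand-in for coo_matrix.col
data = [0.2, 0.7, 0.2, 0.5]    # stand-in for coo_matrix.data
feature_names = ["eclipse", "maven", "plugin", "war"]  # hypothetical vocabulary

# same key as sort_coo: score first, column index second, both descending
pairs = sorted(zip(cols, data), key=lambda x: (x[1], x[0]), reverse=True)

# same idea as extract_topn_from_vector with topn=2
top2 = {feature_names[i]: round(s, 3) for i, s in pairs[:2]}

print(pairs)
print(top2)
```

Ties in score are resolved by the larger column index, so the order of equally scored terms depends on their position in the vocabulary.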


In [20]:
# read test docs into a dataframe, concatenate title and body and then apply same preprocessing.

df_test=pd.read_json("data/ml1/stackoverflow-test.json",lines=True)
df_test['text'] = df_test['title'] + df_test['body']
df_test['text'] =df_test['text'].apply(lambda x:pre_process(x))
 
# get test docs into a list
corpus_test=df_test['text'].tolist()


## Apply the model to do the topic/keyword extraction

In [23]:
# get the document that we want to extract keywords from:
doc=corpus_test[0]
 
#generate tf-idf for the new test document using the vocabulary and document frequencies (df) learned by fit
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))

In [24]:
 
#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())
 
#extract only the top n; n here is 10
keywords=extract_topn_from_vector(feature_names,sorted_items,10)
 
# now print the results
print("\n=====Doc=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


=====Doc=====
integrate war plugin for m eclipse into eclipse project p i set up a small web project with jsf and maven now i want to deploy on a tomcat server is there a possibility to automate that like a button in eclipse that automatically deploys the project to tomcat p p i read about a the a href http maven apache org plugins maven war plugin rel nofollow noreferrer maven war plugin a but i couldn t find a tutorial how to integrate that into my process eclipse m eclipse p p can you link me to help or try to explain it thanks p 

===Keywords===
eclipse 0.49
maven 0.451
war 0.393
plugin 0.265
integrate 0.232
tomcat 0.223
project 0.197
automate 0.13
jsf 0.125
possibility 0.121


## Results
We can see that the question was about Eclipse and Maven, and the top-scoring keywords (eclipse, maven, war, plugin) reflect that, so the model captures the topic of this post well.