#### > Gozde Orhan

# Project Scope: Keyword Extraction

According to Clayton Stanley and Michael D. Byrne [1], tags and keywords provide a view of a user’s interests and goals. Extracting tags can therefore lead to improved, more user-centred human-computer systems. Stack Overflow allows its users to assign keywords to questions so that they are easier to find in a search. It is thus in the interest of both the original poster and the people interested in the answer that a question is assigned appropriate tags. An automated tagging system that assigns suitable tags and keywords is therefore expected to be useful.

[1] Stanley, Clayton, and Michael Byrne. "Predicting Tags For Stackoverflow Posts". 2013. Accessed 17 April 2019.

In [1]:
import numpy as np
import pandas as pd

#tqdm is used to show a progress bar for long-running for-loops.
from tqdm import tqdm_notebook as tqdm

#os is used to get file sizes in bytes.
import os

# 1. Dataset

The dataset of the competition “Facebook Recruiting III - Keyword Extraction” was obtained from the Kaggle Competitions page [2]. The fact that this dataset was used in a recruiting competition was a strong motivation for me to choose it. The organizers of the competition collected a large sample of Stack Exchange posts together with their associated tags. The posts cover both technical and non-technical topics, so the text consists of natural language as well as programming language. The data is stored as a .CSV file containing text data, with a size of approximately 7.25 GB.

[2] Facebook Recruiting III - Keyword Extraction

Data can be accessed at: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data

In [2]:
#Actual dataset without any pre-processing. This dataset can be accessed at the link above.

big_df = pd.read_csv('BigDataTrain.csv')

print('Number of rows in the dataset: '+ str(len(big_df)))
print('Size of the file: ' + str(os.path.getsize('/Users/gozdeorhan/Desktop/BigData/BigDataTrain.csv')) + ' bytes.')

big_df.head(10)

Number of rows in the dataset: 6034195
Size of the file: 7253917400 bytes.


Unnamed: 0,Id,Title,Body,Tags
0,1,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,2,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,3,R Error Invalid type (list) for variable,<p>I am import matlab file and construct a dat...,r matlab machine-learning
3,4,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,5,How to modify whois contact details?,<pre><code>function modify(.......)\n{\n $mco...,php api file-get-contents
5,6,setting proxy in active directory environment,<p>I am using a machine on which active direct...,proxy active-directory jmeter
6,7,How to draw barplot in this way with Coreplot,<p>My image is cannot post so the link is my ...,core-plot
7,8,How to fetch an XML feed using asp.net,<p>I've decided to convert a Windows Phone 7 a...,c# asp.net windows-phone-7
8,9,.NET library for generating javascript?,<p>Do you know of a .NET library for generatin...,.net javascript code-generation
9,10,"SQL Server : procedure call, inline concatenat...",<p>I'm using SQL Server 2008 R2 and was wonder...,sql variables parameters procedure calls


# 2. Initial Pre-processing

Even though the data is already tidy, some pre-processing is still needed to align the data with the project's scope and to fit the models on a consistent dataset. In the initial pre-processing, duplicate questions and programming-language code embedded in the 'Body' column are removed. NA values are handled, and the text is cleaned so that tokenization and the subsequent analyses give more accurate results.

## 2.1 Removing duplicates

Since the dataset is already huge, duplicate questions need to be removed to obtain cleaner data. At this stage, the 'Title' column is used to detect duplicates. The index is also reset after the duplicates are removed so that it does not cause mistakes in further analyses. Thus, even though the index is consecutive, it should be noted that the 'Id' column no longer is. In total, 1,908,962 questions are removed because they duplicate other questions, which corresponds to 31.6% of the data.
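The figures quoted above can be checked with a few lines of arithmetic. This is only a quick sanity check: the two row counts are copied from the printed outputs of the cells that load the raw file and drop the duplicates, and the variable names are illustrative.

```python
# Row counts as printed by the notebook cells (before and after deduplication)
rows_before = 6034195
rows_after = 4125233

# Number of removed rows and their share of the original data
removed = rows_before - rows_after
share_removed = 100 * removed / rows_before

print(f"Removed {removed} duplicate questions ({share_removed:.1f}% of the data)")
```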

In [3]:
#Drop duplicate, if any
df_d = big_df.drop_duplicates(subset='Title',keep='first', inplace=False)

#Reset index after removing duplicates
df_d = df_d.reset_index(drop=True)

#Save with the .csv extension so that the size check below finds the file
df_d.to_csv(r'train_no_dup.csv')

print('Number of rows after removing duplicate questions: ' + str(len(df_d)))
print('Size of the file: ' + str(os.path.getsize('/Users/gozdeorhan/Desktop/BigData/train_no_dup.csv')) + ' bytes.')

df_d.head(10)

Number of rows after removing duplicate questions: 4125233
Size of the file: 5049883177 bytes.


Unnamed: 0,Id,Title,Body,Tags
0,1,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,2,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,3,R Error Invalid type (list) for variable,<p>I am import matlab file and construct a dat...,r matlab machine-learning
3,4,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,5,How to modify whois contact details?,<pre><code>function modify(.......)\n{\n $mco...,php api file-get-contents
5,6,setting proxy in active directory environment,<p>I am using a machine on which active direct...,proxy active-directory jmeter
6,7,How to draw barplot in this way with Coreplot,<p>My image is cannot post so the link is my ...,core-plot
7,8,How to fetch an XML feed using asp.net,<p>I've decided to convert a Windows Phone 7 a...,c# asp.net windows-phone-7
8,9,.NET library for generating javascript?,<p>Do you know of a .NET library for generatin...,.net javascript code-generation
9,10,"SQL Server : procedure call, inline concatenat...",<p>I'm using SQL Server 2008 R2 and was wonder...,sql variables parameters procedure calls


<font color=red>Please note that from this point on, a sample dataset is used, since neither the local computer nor the cluster could handle 5.05 GB of data. </font>

#### *Sample Dataset*

A dataset containing the first 1,000,000 questions (that is, rows) is extracted from the above dataset. The size of this dataset is **1.2 GB.**

In [4]:
df_1m=df_d[0:1000000]

df_1m.to_csv(r'train_1m.csv')

df_1m_csv = pd.read_csv('train_1m.csv',index_col=0)

print('Number of rows of sample dataset: ' + str(len(df_1m_csv)))
print('Size of the file: ' + str(os.path.getsize('/Users/gozdeorhan/Desktop/BigData/train_1m.csv')) + ' bytes.')

df_1m_csv.head(10)

Number of rows of sample dataset: 1000000
Size of the file: 1202992044 bytes.


Unnamed: 0,Id,Title,Body,Tags
0,1,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,2,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,3,R Error Invalid type (list) for variable,<p>I am import matlab file and construct a dat...,r matlab machine-learning
3,4,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,5,How to modify whois contact details?,<pre><code>function modify(.......)\n{\n $mco...,php api file-get-contents
5,6,setting proxy in active directory environment,<p>I am using a machine on which active direct...,proxy active-directory jmeter
6,7,How to draw barplot in this way with Coreplot,<p>My image is cannot post so the link is my ...,core-plot
7,8,How to fetch an XML feed using asp.net,<p>I've decided to convert a Windows Phone 7 a...,c# asp.net windows-phone-7
8,9,.NET library for generating javascript?,<p>Do you know of a .NET library for generatin...,.net javascript code-generation
9,10,"SQL Server : procedure call, inline concatenat...",<p>I'm using SQL Server 2008 R2 and was wonder...,sql variables parameters procedure calls


## 2.2 Remove programming language

At this stage of pre-processing, the aim is to remove any programming-language code embedded in the 'Body' column. To do so, the dataset is partitioned into 20 smaller datasets of 50,000 rows each. The code below is an example showing how the extraction is done for the first 50,000 rows. In total, 8 subsets were processed on the local computer and 12 on the cluster.
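The partitioning described above can be sketched as follows. The helper names `subset_bounds` and `subset_filename` are hypothetical and not part of the original notebook; the file-name pattern simply imitates the subset files that are read back later in this notebook.

```python
def subset_bounds(n_rows, subset_size):
    """Return (start, stop) row positions that cover n_rows in consecutive chunks."""
    return [(start, min(start + subset_size, n_rows))
            for start in range(0, n_rows, subset_size)]

def subset_filename(start, stop):
    """Build a file name in the style of the subset files used in this notebook."""
    if start == 0:
        return f"reduced{stop // 1000}k.csv"
    return f"reduced{start // 1000}-{stop // 1000}k.csv"

bounds = subset_bounds(1_000_000, 50_000)
filenames = [subset_filename(start, stop) for start, stop in bounds]

print(len(bounds))
print(bounds[0], filenames[0])
print(bounds[-1], filenames[-1])
```

Each `(start, stop)` pair could then be used to slice the sample, for example `df_1m_csv[start:stop]`, before running the extraction on that slice.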

<font color=red>Please note that the cells below were originally code cells and were executed 20 times, once for each of the 20 subsets. They are included here as markdown only to demonstrate an example! </font> 

#Create a copy to manipulate
#Extract first 50.000 rows

df_r=(df_1m_csv.copy())[0:50000]

#Print the length of the sample dataset and the smaller subset we have just created

print('Number of questions: ' + str(len(df_1m_csv)))
print('Number of questions in the sample: ' + str(len(df_r)))

#Preview the subset data
df_r.head(10)

##Based on https://stackoverflow.com/questions/18807333/python-remove-text-that-is-inside-certain-tag

from bs4 import BeautifulSoup

#Function is defined to extract code data

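def extract_code(data):
    
    #Iterate over the rows of the data that is passed in (assumes a default 0..n-1 index)
    for i in tqdm(range(len(data))):
        datacode=data['Body'][i]
        #The parser is named explicitly; 'lxml' matches the <html><body> wrapper seen in the output
        soup = BeautifulSoup(datacode, 'lxml')
        codetags = soup.find_all('code')
        
        for codetag in codetags:
            codetag.extract()
        
        #Only rows that contained code are rewritten; .loc avoids chained assignment
        if codetags:
            data.loc[i,'Body']=str(soup)
            
    return data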

#Function is executed on the subset data
extract_code(df_r)

df_r.to_csv(r'reduced50k.csv')
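To make the effect of the extraction easier to see, here is a minimal sketch that applies the same idea to a single made-up HTML string instead of a DataFrame. The helper name `strip_code` and the sample text are assumptions for illustration, and the built-in `html.parser` is used here so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

def strip_code(html):
    """Remove every <code> element (and its content) from an HTML fragment."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("code"):
        tag.extract()
    return str(soup)

# Made-up question body with one embedded code snippet
sample = "<p>Try this:</p><pre><code>print(1)</code></pre><p>Thanks</p>"
cleaned = strip_code(sample)
print(cleaned)
```

The surrounding `<pre>` element stays in place, which is why empty `<pre></pre>` pairs show up in the processed bodies.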

In [5]:
df_r = pd.read_csv('reduced50k.csv',index_col=0)

print('Number of rows of the first subset of the sample dataset: ' + str(len(df_r)))
print('Size of the file: ' + str(os.path.getsize('/Users/gozdeorhan/Desktop/BigData/reduced50k.csv')) + ' bytes.')

df_r.head(10)

Number of rows of the first subset of the sample dataset: 50000
Size of the file: 35745580 bytes.


Unnamed: 0,Id,Title,Body,Tags
0,1,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,2,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,3,R Error Invalid type (list) for variable,<html><body><p>I am import matlab file and con...,r matlab machine-learning
3,4,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,5,How to modify whois contact details?,<html><body><pre></pre>\n<p>using this modify ...,php api file-get-contents
5,6,setting proxy in active directory environment,<p>I am using a machine on which active direct...,proxy active-directory jmeter
6,7,How to draw barplot in this way with Coreplot,<p>My image is cannot post so the link is my ...,core-plot
7,8,How to fetch an XML feed using asp.net,<html><body><p>I've decided to convert a Windo...,c# asp.net windows-phone-7
8,9,.NET library for generating javascript?,<p>Do you know of a .NET library for generatin...,.net javascript code-generation
9,10,"SQL Server : procedure call, inline concatenat...",<html><body><p>I'm using SQL Server 2008 R2 an...,sql variables parameters procedure calls


In [12]:
#Let's check if the function executed as wanted

print("4th question's body in the dataset: " + df_1m_csv['Body'][4])
print("---------------------------------------------------------------------------------------")
print("---------------------------------------------------------------------------------------")
print('')
print("4th question's body in the subset: " + df_r['Body'][4])

4th question's body in the dataset: <pre><code>function modify(.......)
{
  $mcontact = file_get_contents( "https://test.httpapi.com/api/contacts/modify.json?auth-userid=$uid&amp;auth-password=$pass&amp;contact-id=$cid&amp;name=$name &amp;company=$company&amp;email=$email&amp;address-line-1=$street&amp;city=$city&amp;country=$country&amp;zipcode=$pincode&amp;phone-cc=$countryCodeList[$phc]&amp;phone=$phone" );

  $mdetails = json_decode( $mcontact, true );

  return $mdetails;
}
</code></pre>

</p>

  [function.file-get-contents]: failed to open stream: HTTP request failed!
  HTTP/1.0 400 Bad request in /home/gfdgfd/public_html/new_one/customer/account/class.whois.php
  on line 49
</code></pre>

<p>Please help me, modify contact details..</p>

---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------

4th question's body in the subset: <html><body><pre></pre>
</p>
<pre 

<font color=red>Please note that from this point on, a dataset with 1,000,000 rows, obtained by concatenating the 20 subset datasets, is used. </font> 

#### *Concatenated Dataset*
The size of this dataset is **0.72 GB.**

In [None]:
#Read the 20 subset datasets

df1 = pd.read_csv('reduced50k.csv',index_col=0)
df2 = pd.read_csv('reduced50-100k.csv',index_col=0)
df3 = pd.read_csv('reduced100-150k.csv',index_col=0)
df4 = pd.read_csv('reduced150-200k.csv',index_col=0)
df5 = pd.read_csv('reduced200-250k.csv',index_col=0)
df6 = pd.read_csv('reduced250-300k.csv',index_col=0)
df7 = pd.read_csv('reduced300-350k.csv',index_col=0)
df8 = pd.read_csv('reduced350-400k.csv',index_col=0)
df9 = pd.read_csv('reduced400-450k.csv',index_col=0)
df10 = pd.read_csv('reduced450-500k.csv',index_col=0)
df11 = pd.read_csv('reduced500-550k.csv',index_col=0)
df12 = pd.read_csv('reduced550-600k.csv',index_col=0)
df13 = pd.read_csv('reduced600-650k.csv',index_col=0)
df14 = pd.read_csv('reduced650-700k.csv',index_col=0)
df15 = pd.read_csv('reduced700-750k.csv',index_col=0)
df16 = pd.read_csv('reduced750-800k.csv',index_col=0)
df17 = pd.read_csv('reduced800-850k.csv',index_col=0)
df18 = pd.read_csv('reduced850-900k.csv',index_col=0)
df19 = pd.read_csv('reduced900-950k.csv',index_col=0)
df20 = pd.read_csv('reduced950-1000k.csv',index_col=0)

In [13]:
#Concatenate subsets

frames = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10,
         df11, df12, df13, df14, df15, df16, df17, df18, df19, df20]
df = pd.concat(frames)

df.to_csv(r'cleaned_train1m.csv')

df = pd.read_csv('cleaned_train1m.csv',index_col=0)

print('Number of rows of sample dataset: ' + str(len(df)))
print('Size of the file: ' + str(os.path.getsize('/Users/gozdeorhan/Desktop/BigData/cleaned_train1m.csv')) + ' bytes.')


df.head(10)

Number of rows of sample dataset: 1000000
Size of the file: 715748515 bytes.


Unnamed: 0,Id,Title,Body,Tags
0,1,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,2,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,3,R Error Invalid type (list) for variable,<html><body><p>I am import matlab file and con...,r matlab machine-learning
3,4,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,5,How to modify whois contact details?,<html><body><pre></pre>\n<p>using this modify ...,php api file-get-contents
5,6,setting proxy in active directory environment,<p>I am using a machine on which active direct...,proxy active-directory jmeter
6,7,How to draw barplot in this way with Coreplot,<p>My image is cannot post so the link is my ...,core-plot
7,8,How to fetch an XML feed using asp.net,<html><body><p>I've decided to convert a Windo...,c# asp.net windows-phone-7
8,9,.NET library for generating javascript?,<p>Do you know of a .NET library for generatin...,.net javascript code-generation
9,10,"SQL Server : procedure call, inline concatenat...",<html><body><p>I'm using SQL Server 2008 R2 an...,sql variables parameters procedure calls


## 2.3 Missing values
After concatenation, we are closer to the final dataset. Before going any further, missing values have to be handled.

In [14]:
#Check for missing data
df.isna().sum()

Id       0
Title    0
Body     0
Tags     2
dtype: int64

*From the above output, it can be observed that there are 2 'NaN' values in the 'Tags' column. However, they should be investigated further before taking any action!*

In [15]:
#In order to decide what to do with NaN value, the respective rows are extracted
df_na=df[pd.isnull(df).any(axis=1)]
df_na

Unnamed: 0,Id,Title,Body,Tags
847560,895319,Do we really need NULL?,<blockquote>\n <p><strong>Possible Duplicate:...,
967650,1030864,Page cannot be null. Please ensure that this o...,<p>I get this error when i remove dynamically ...,


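*It can be observed that the 'NaN' values in fact belong to questions about 'NULL' values: the tag of these questions is most likely the literal string 'null', which was read as a missing value when the file was loaded. So these rows shouldn't be removed.*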
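A likely explanation for these 'NaN' values is that pandas treats the string 'null' as a missing-value marker by default when reading a CSV file. The sketch below reproduces this on two made-up rows and shows that `keep_default_na=False` keeps the tag as plain text. It illustrates the mechanism only and is not how the original data was loaded.

```python
import io
import pandas as pd

# Two made-up rows; the first one carries the literal tag "null"
csv_text = "Id,Title,Tags\n1,Do we really need NULL?,null\n2,How to sort a list,python\n"

# Default behaviour: the string "null" is in pandas' list of default NA markers
default_read = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False the tag is kept as the plain string "null"
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["Tags"].isna().sum())
print(literal_read["Tags"].tolist())
```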

In [16]:
#Fill NaN with the 'null' tag because these values are not actually missing!
#Assigning the result back avoids an inplace call on a column selection
df['Tags'] = df['Tags'].fillna('null')

In [17]:
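#Check how many questions have a tag containing 'null', including these 2
#(this is a substring match, so longer tags that contain 'null' are counted too)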
len(df[df['Tags'].str.contains("null")])

1548
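The count above comes from `str.contains("null")`, which is a substring match. The following sketch, on made-up tag strings, shows how that differs from counting questions that carry exactly the tag `null`. The variable names are illustrative.

```python
import pandas as pd

# Made-up tag strings in the same space-separated format as the 'Tags' column
tags = pd.Series(["null", "java nullpointerexception", "c# nullable", "sql null", "php"])

# Substring match: also counts tags that merely contain the letters "null"
substring_hits = int(tags.str.contains("null").sum())

# Exact match: split into individual tags first, then look for the tag "null"
exact_hits = int(tags.str.split().apply(lambda tag_list: "null" in tag_list).sum())

print(substring_hits, exact_hits)
```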

## 2.4 Text pre-processing
In order to gain more insight into the data, an exploratory analysis will be done, which relies on tokenization. Hence, all text is changed to lowercase, and some punctuation marks, HTML tags and contractions are replaced to make tokenization more effective.

In [18]:
#Create a copy to manipulate
df_c=(df.copy())

#Print the number of questions in the copy
print('Number of questions: ' + str(len(df_c)))

#Preview data
df_c.head()

Number of questions: 1000000


Unnamed: 0,Id,Title,Body,Tags
0,1,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,2,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,3,R Error Invalid type (list) for variable,<html><body><p>I am import matlab file and con...,r matlab machine-learning
3,4,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,5,How to modify whois contact details?,<html><body><pre></pre>\n<p>using this modify ...,php api file-get-contents


In [19]:
#Tags are already in lowercase, so Title and Body columns are changed to lowercase
df_c['Title']=df_c['Title'].str.lower()
df_c['Body']=df_c['Body'].str.lower()

In [20]:
#Replacing punctuation marks, html tags and abbreviations
#regex is set explicitly because its default changed to False in pandas 2.0

df_c['Title']=df_c['Title'].str.replace("?","",regex=False)
df_c['Body']=df_c['Body'].str.replace("?","",regex=False)
#Digits are removed with a regular expression
df_c['Body']=df_c['Body'].str.replace(r'\d+', '', regex=True)

#Literal strings that are removed from 'Body', in this order
for pattern in ['<p>', '</p>', '\n', '<pre>', '</pre>', '<html>', '</html>', '<body>', '</body>',
                '<strong>', '</strong>', '<a>', '<a', '</a>', '<li>', '</li>', '"', '``']:
    df_c['Body']=df_c['Body'].str.replace(pattern, '', regex=False)

#'wouldn' is used instead of 'would' since only 'wouldn' is considered a stopword in the nltk stopword list
for old, new in [("would", ' wouldn'), ("'d", ' wouldn'), ("'s", ' is'),
                 ("'m", ' am'), ("'ve", ' have'), ("'ll", ' will')]:
    df_c['Body']=df_c['Body'].str.replace(old, new, regex=False)


df_c.head()

Unnamed: 0,Id,Title,Body,Tags
0,1,how to check if an uploaded file is an image w...,i wouldn like to check if an uploaded file is ...,php image-processing file-upload upload mime-t...
1,2,how can i prevent firefox from closing when i ...,"in my favorite editor (vim), i regularly use c...",firefox
2,3,r error invalid type (list) for variable,i am import matlab file and construct a data f...,r matlab machine-learning
3,4,how do i replace special characters in a url,"this is probably very simple, but i simply can...",c# url encoding
4,5,how to modify whois contact details,"using this modify function, displays warning m...",php api file-get-contents
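To follow what the replacement chain does to a single question body, here is a shortened sketch in plain Python. The function name `clean_body` and the sample sentence are assumptions for illustration, and only a few of the replacements are included.

```python
import re

def clean_body(text):
    """Apply a shortened version of the replacement chain to a single string."""
    text = text.lower()
    text = text.replace("?", "")
    text = re.sub(r"\d+", "", text)          # digits are removed, the spaces around them stay
    for tag in ["<p>", "</p>", "<pre>", "</pre>"]:
        text = text.replace(tag, "")
    text = text.replace("'d", " wouldn")     # same substitution as in the cell above
    return text

cleaned = clean_body("<p>I'd like to check 2 files?</p>")
print(repr(cleaned))
print(cleaned.split(" "))
```

Note the double space left behind where the digit was removed. Splitting on a single space turns such gaps into empty-string tokens.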


# 3. Exploratory Analysis
At this stage of the project, an exploratory analysis of unique and common tokens is done. 
- The Title column has 300,609 unique tokens.
- The Body column has 3,202,712 unique tokens.
- The Tags column has 35,314 unique tokens.

In [21]:
#Create a copy to perform any exploratory analysis on it
df_t = df_c.copy()

#Tokenize the columns

df_t['Title']=df_t.apply(lambda row: (row['Title']).split(" "), axis=1)
df_t['Body']=df_t.apply(lambda row: (row['Body']).split(" "), axis=1)
df_t['Tags']=df_t.apply(lambda row: (row['Tags']).split(" "), axis=1)

#Preview data
df_t.head()

Unnamed: 0,Id,Title,Body,Tags
0,1,"[how, to, check, if, an, uploaded, file, is, a...","[i, wouldn, like, to, check, if, an, uploaded,...","[php, image-processing, file-upload, upload, m..."
1,2,"[how, can, i, prevent, firefox, from, closing,...","[in, my, favorite, editor, (vim),, i, regularl...",[firefox]
2,3,"[r, error, invalid, type, (list), for, variable]","[i, am, import, matlab, file, and, construct, ...","[r, matlab, machine-learning]"
3,4,"[how, do, i, replace, special, characters, in,...","[this, is, probably, very, simple,, but, i, si...","[c#, url, encoding]"
4,5,"[how, to, modify, whois, contact, details]","[using, this, modify, function,, displays, war...","[php, api, file-get-contents]"


In [22]:
from nltk.corpus import stopwords
import string

#Create a set of words to be removed: stopwords and punctuation marks
#(a set keeps the 'not in' check fast on a million rows)
stop = set(stopwords.words('english') + list(string.punctuation))

df_t['Title']=df_t['Title'].apply(lambda x: [item for item in x if item not in stop])
df_t['Body']=df_t['Body'].apply(lambda x: [item for item in x if item not in stop])

#Preview data
df_t.head()

Unnamed: 0,Id,Title,Body,Tags
0,1,"[check, uploaded, file, image, without, mime, ...","[like, check, uploaded, file, image, file, (e....","[php, image-processing, file-upload, upload, m..."
1,2,"[prevent, firefox, closing, press, ctrl-w]","[favorite, editor, (vim),, regularly, use, ctr...",[firefox]
2,3,"[r, error, invalid, type, (list), variable]","[import, matlab, file, construct, data, frame,...","[r, matlab, machine-learning]"
3,4,"[replace, special, characters, url]","[probably, simple,, simply, cannot, find, answ...","[c#, url, encoding]"
4,5,"[modify, whois, contact, details]","[using, modify, function,, displays, warning, ...","[php, api, file-get-contents]"


**In the following section, unique token counts and the 20 most common tokens are reported.**

In [23]:
import nltk

flat_list_title = [item for sublist in df_t['Title'] for item in sublist]
fd_title = nltk.FreqDist(flat_list_title)
display('There are ' + str(len(fd_title)) + ' unique tokens')
display(pd.DataFrame(list(fd_title.items()), 
                     columns = ["Title Token","Frequency"]).sort_values('Frequency',ascending=False)[0:20])

'There are 300609 unique tokens'

Unnamed: 0,Title Token,Frequency
37,using,61866
2,file,37091
99,get,29470
198,jquery,27257
202,data,26947
44,server,26560
117,use,26012
71,php,25842
477,android,25479
13,error,24688


In [24]:
flat_list_body = [item for sublist in df_t['Body'] for item in sublist]
fd_body = nltk.FreqDist(flat_list_body)
display('There are ' + str(len(fd_body)) + ' unique tokens')
display(pd.DataFrame(list(fd_body.items()), 
             columns = ["Body Token","Frequency"]).sort_values('Frequency',ascending=False)[0:20])

'There are 3202712 unique tokens'

Unnamed: 0,Body Token,Frequency
153,,3371936
14,using,372165
0,like,365185
95,want,313325
144,get,286411
36,use,275927
81,code,257131
337,one,224599
27,way,220485
212,need,218704


In [25]:
flat_list_tags = [item for sublist in df_t['Tags'] for item in sublist]
fd_tags = nltk.FreqDist(flat_list_tags)
display('There are ' + str(len(fd_tags)) + ' unique tokens')
df_tags=pd.DataFrame(list(fd_tags.items()), 
             columns = ["Tag","Frequency"]).sort_values('Frequency',ascending=False)[0:20]
display(df_tags)

'There are 35314 unique tokens'

Unnamed: 0,Tag,Frequency
9,c#,77400
91,java,68693
0,php,65386
21,javascript,61279
112,android,53747
102,jquery,51098
68,c++,33662
94,python,30963
96,iphone,30236
18,asp.net,29449


In [26]:
#Since tags will be the targets in this project, the following function is written to gain more insight into them
def token_avg(column):
    count=0
    for i in range(len(df_t)):
        count=count+len(df_t[column][i])
    return count/len(df_t)

In [27]:
#Let's see the average tag usage
token_avg('Tags')

2.886844

*It can be observed that, on average, users assigned approximately 3 tags to their questions.*
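The same average can be computed without an explicit loop. The sketch below is an alternative on made-up tag lists, not the function used above; it maps every list to its length and takes the mean.

```python
import pandas as pd

# Made-up tag lists in the same shape as df_t['Tags']
tag_lists = pd.Series([["php", "image-processing", "file-upload"], ["firefox"], ["c#", "url"]])

# Length of every list, then the mean of those lengths
average_tags = tag_lists.map(len).mean()

print(average_tags)
```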

# 4. Final pre-processing
Even though users assigned about 3 tags on average, there are 35,314 unique tags, so it is decided to build a model that assigns only 1 tag. However, the important question was 'which tag?'. In order to decide how to set the target values, the following steps are taken.

In [28]:
#The first tag of each 'Tags' entry is extracted and the 20 most common ones are displayed
df_n=df_c.drop(['Id'], axis = 1)
df_n =df_n.apply(lambda x: x.str.split().str[0])
df_n['Tags'].value_counts()[0:20]

c#               77280
java             67627
php              63667
javascript       54404
android          45187
c++              31913
python           29136
iphone           28924
jquery           27446
ruby-on-rails    17493
linux            16367
asp.net          15918
sql              14537
mysql            14412
c                12043
html             11782
.net             11736
objective-c      11701
windows          10455
ios               9713
Name: Tags, dtype: int64
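The first-tag extraction can be tried on a handful of made-up tag strings. This sketch applies the same `str.split().str[0]` idea to a single Series rather than to the whole DataFrame.

```python
import pandas as pd

# Made-up space-separated tag strings
tags = pd.Series(["php image-processing file-upload", "firefox", "php api", "c# url encoding"])

# Split on whitespace and keep only the first tag of every question
first_tag = tags.str.split().str[0]

# Frequency of each first tag, most common first
counts = first_tag.value_counts()
print(counts)
```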

In [29]:
#A tags list corresponding to the above object is created
tags = ['c#','java','php','javascript','android','c++','python','iphone','jquery','ruby-on-rails',
     'linux','asp.net','sql','mysql','c','html','.net','objective-c','windows','ios']

In [30]:
#df_tags, which holds the 20 overall most common tags (see section 3), is compared to the above list to detect
#whether the most common tags also appear as first tags in the 'Tags' column
df_tags.Tag.isin(tags).astype(int)

9      1
91     1
0      1
21     1
112    1
102    1
68     1
94     1
96     1
18     1
289    1
65     1
20     1
115    1
372    1
23     1
75     0
99     1
272    1
212    1
Name: Tag, dtype: int64

*It can be observed that, except for the 'css' tag, the overall most common tags also appear among the most common first tags.*

In [31]:
#Hence, the target column is set to contain only the first tag
df_c['Tags']=df_n['Tags']
df_c.head()

Unnamed: 0,Id,Title,Body,Tags
0,1,how to check if an uploaded file is an image w...,i wouldn like to check if an uploaded file is ...,php
1,2,how can i prevent firefox from closing when i ...,"in my favorite editor (vim), i regularly use c...",firefox
2,3,r error invalid type (list) for variable,i am import matlab file and construct a data f...,r
3,4,how do i replace special characters in a url,"this is probably very simple, but i simply can...",c#
4,5,how to modify whois contact details,"using this modify function, displays warning m...",php
5,6,setting proxy in active directory environment,i am using a machine on which active directory...,proxy
6,7,how to draw barplot in this way with coreplot,my image is cannot post so the link is my pic...,core-plot
7,8,how to fetch an xml feed using asp.net,i have decided to convert a windows phone app...,c#
8,9,.net library for generating javascript,do you know of a .net library for generating j...,.net
9,10,"sql server : procedure call, inline concatenat...",i am using sql server r and was wondering if ...,sql


In [33]:
#Observe unique tags
print('Unique tag count: ' + str(len(df_c.Tags.unique())))

Unique tag count: 10016


*However, since there are still thousands of unique values, it is decided to keep only the 20 most common first tags, plus 1 more tag (css) that is frequently used across the overall dataset.*

In [34]:
tags_1 = ['c#','java','php','javascript','android','c++','python','iphone','jquery','ruby-on-rails',
     'linux','asp.net','sql','mysql','c','html','.net','objective-c','windows','ios','css']

In [37]:
df_m=df_c.copy()
df_m=df_m[df_m.Tags.isin(tags_1)]

df_m.to_csv(r'train1m_common.csv')

df_m=df_m.reset_index(drop=True)

print('Number of rows of sample dataset: ' + str(len(df_m)))
print('Size of the file: ' + str(os.path.getsize('/Users/gozdeorhan/Desktop/BigData/train1m_common.csv')) + ' bytes.')


df_m.head(10)

Number of rows of sample dataset: 579127
Size of the file: 347437109 bytes.


Unnamed: 0,Id,Title,Body,Tags
0,1,how to check if an uploaded file is an image w...,i wouldn like to check if an uploaded file is ...,php
1,4,how do i replace special characters in a url,"this is probably very simple, but i simply can...",c#
2,5,how to modify whois contact details,"using this modify function, displays warning m...",php
3,8,how to fetch an xml feed using asp.net,i have decided to convert a windows phone app...,c#
4,9,.net library for generating javascript,do you know of a .net library for generating j...,.net
5,10,"sql server : procedure call, inline concatenat...",i am using sql server r and was wondering if ...,sql
6,11,how do commercial obfuscators achieve to crash...,some commercial obfuscators href=http://www.r...,.net
7,16,php framework url conventions,a lot of frameworks use url conventions like ...,php
8,19,play framework auto javascript and css minifier,does anyone know a good play plugin that autom...,javascript
9,20,creating a repetitive node from a hash array w...,=) i need your kindly help to accomplish a sim...,php


# 5. Classification Models
At this stage, the **df_m** dataset is used and **Multinomial Naive Bayes** is the chosen classifier. This dataset has a size of **0.35 GB**. Two different models are built: the first takes the 'Title' column as the predictor and the 'Tags' column as the target, whereas the second takes the 'Body' column as the predictor and the 'Tags' column as the target.

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

At this stage, in order to feed text documents into the model, they must be turned into numerical feature vectors. There are many methods to convert text to vectors that a model can work with; one of the most popular is TF-IDF. This is an acronym that stands for “Term Frequency - Inverse Document Frequency”, the two components of the score assigned to each word.

- Term Frequency: This summarizes how often a given word appears within a document.
- Inverse Document Frequency: This downscales words that appear in many documents.
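The two components can be made concrete with a small hand-written sketch on three made-up documents. It uses the smoothed IDF variant ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which to my knowledge matches the default settings of scikit-learn's `TfidfVectorizer`; other TF-IDF definitions exist, and the names used here are illustrative.

```python
import math
from collections import Counter

# Three made-up, already tokenized documents
docs = [["php", "file", "upload"], ["php", "image", "file"], ["java", "thread"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one tokenized document."""
    term_freq = Counter(doc)
    # Term frequency times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
               for term, count in term_freq.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_doc = tfidf(docs[0])
for term, weight in sorted(first_doc.items()):
    print(term, round(weight, 3))
```

In the first document, 'upload' receives a higher weight than 'php' or 'file' because it occurs in only one of the three documents.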

In [51]:
#Let's see an example of a TF-IDF transformation
print(Train_X_Tfidf)

  (0, 42962)	0.29965836025490755
  (0, 40789)	0.17171079989742588
  (0, 37278)	0.3992129990996897
  (0, 24347)	0.28406204833255494
  (0, 21704)	0.48127006723717974
  (0, 20582)	0.4075745549767803
  (0, 18953)	0.35998105257851803
  (0, 18555)	0.28283708323193774
  (0, 4633)	0.18258404373105214
  (1, 41543)	0.1326846661743204
  (1, 41199)	0.3223808771466372
  (1, 33133)	0.34704296232510684
  (1, 30356)	0.37820913241142606
  (1, 24023)	0.2937926262613973
  (1, 12636)	0.364790519745673
  (1, 6037)	0.33957678809434
  (1, 4255)	0.5294202095870814
  (2, 49526)	0.33695259943996486
  (2, 45809)	0.27922178736628733
  (2, 40789)	0.17870278020030486
  (2, 37610)	0.27678244765028254
  (2, 28793)	0.29523845686162237
  (2, 27196)	0.3452262032800677
  (2, 24184)	0.18858259701421967
  (2, 19089)	0.241181123485891
  :	:
  (405385, 37199)	0.3520120863361464
  (405385, 22826)	0.22703488113463974
  (405385, 19156)	0.4275593139418577
  (405385, 19089)	0.2007091831330908
  (405385, 16390)	0.34960109525225064
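
Each line of this output has the form `(document index, feature index)` followed by the TF-IDF weight, and only non-zero entries are listed. To see which word a feature index refers to, the fitted vectorizer's `vocabulary_` mapping can be inverted. The sketch below does this on two made-up titles; `toy_titles`, `vect` and `index_to_term` are hypothetical names used only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy titles, not taken from the dataset
toy_titles = [
    "php framework url conventions",
    "play framework javascript minifier",
]

vect = TfidfVectorizer()
matrix = vect.fit_transform(toy_titles)

# vocabulary_ maps term -> column index; invert it to look terms up by index
index_to_term = {int(idx): term for term, idx in vect.vocabulary_.items()}

# Walk over the non-zero entries, as in the printed sparse matrix
coo = matrix.tocoo()
for row, col, weight in zip(coo.row, coo.col, coo.data):
    print(f"({row}, {col}) {index_to_term[int(col)]:12s} {weight:.3f}")
```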

## 5.1 'Title' Model

The dataset is split into a training set and a test set. The training set holds 70% of the data and the test set holds the remaining 30%.

In [47]:
from sklearn import model_selection, naive_bayes

np.random.seed(500)

Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(df_m['Title'],df_m['Tags'],test_size=0.3)

In [48]:
#Based on https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34

trial = [500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000]

# Encode the tags as integers once, with a single encoder fitted on all tags,
# so that the training and test labels share the same mapping
Encoder = LabelEncoder()
Encoder.fit(df_m['Tags'])
Train_Y = Encoder.transform(Train_Y)
Test_Y = Encoder.transform(Test_Y)

for i in tqdm(trial):

    # Keep only the i most frequent terms; the vocabulary and the IDF weights
    # are learned from the whole 'Title' column
    Tfidf_vect = TfidfVectorizer(max_features=i)
    Tfidf_vect.fit(df_m['Title'])

    Train_X_Tfidf = Tfidf_vect.transform(Train_X)
    Test_X_Tfidf = Tfidf_vect.transform(Test_X)

    # Fit the NB classifier on the training set
    Naive = naive_bayes.MultinomialNB()
    Naive.fit(Train_X_Tfidf, Train_Y)

    # Predict the labels of the test set
    predictions_NB = Naive.predict(Test_X_Tfidf)

    # accuracy_score expects the true labels first, then the predictions
    print("Naive Bayes Accuracy Score with max feature of " + str(i) + ' -> ', accuracy_score(Test_Y, predictions_NB)*100)

Naive Bayes Accuracy Score with max feature of 500 ->  45.39913318253242
Naive Bayes Accuracy Score with max feature of 1000 ->  48.59415560121792
Naive Bayes Accuracy Score with max feature of 2000 ->  51.830619492457075
Naive Bayes Accuracy Score with max feature of 3000 ->  53.63850373261041
Naive Bayes Accuracy Score with max feature of 4000 ->  54.75454561152073
Naive Bayes Accuracy Score with max feature of 5000 ->  55.43199857257151
Naive Bayes Accuracy Score with max feature of 6000 ->  55.82396583380819
Naive Bayes Accuracy Score with max feature of 7000 ->  56.10830038160689
Naive Bayes Accuracy Score with max feature of 8000 ->  56.28039760790611
Naive Bayes Accuracy Score with max feature of 9000 ->  56.36155382499036
Naive Bayes Accuracy Score with max feature of 10000 ->  56.39205935339792
Naive Bayes Accuracy Score with max feature of 20000 ->  55.41876032439463
Naive Bayes Accuracy Score with max feature of 30000 ->  54.02644196179327
Naive Bayes Accuracy Score with max
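
Among the scores that are printed in full above, accuracy rises up to a maximum feature count of 10000 and falls again for larger vocabularies. A small sketch for picking the best setting is given below. It assumes the printed scores are collected into a dictionary (the name `title_scores` is hypothetical) and it covers only the lines whose values are fully shown.

```python
# Accuracy (%) by max_features, copied from the complete lines of the output
title_scores = {
    500: 45.39913318253242,
    1000: 48.59415560121792,
    2000: 51.830619492457075,
    3000: 53.63850373261041,
    4000: 54.75454561152073,
    5000: 55.43199857257151,
    6000: 55.82396583380819,
    7000: 56.10830038160689,
    8000: 56.28039760790611,
    9000: 56.36155382499036,
    10000: 56.39205935339792,
    20000: 55.41876032439463,
    30000: 54.02644196179327,
}

# Pick the max_features value with the highest accuracy
best_k = max(title_scores, key=title_scores.get)
print(best_k, round(title_scores[best_k], 2))  # prints: 10000 56.39
```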

## 5.2 'Body' Model

The dataset is split into a training set and a test set. The training set holds 70% of the data and the test set holds the remaining 30%.

In [49]:
np.random.seed(500)

Train_X_b, Test_X_b, Train_Y_b, Test_Y_b = model_selection.train_test_split(df_m['Body'],df_m['Tags'],test_size=0.3)

In [50]:
trial_b = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000,
           20000, 30000, 40000, 50000, 60000, 70000, 80000]

# Encode the tags as integers once, with a single encoder fitted on all tags,
# so that the training and test labels share the same mapping
Encoder_b = LabelEncoder()
Encoder_b.fit(df_m['Tags'])
Train_Y_b = Encoder_b.transform(Train_Y_b)
Test_Y_b = Encoder_b.transform(Test_Y_b)

for i in tqdm(trial_b):

    # Keep only the i most frequent terms; the vocabulary and the IDF weights
    # are learned from the whole 'Body' column
    Tfidf_vect_b = TfidfVectorizer(max_features=i)
    Tfidf_vect_b.fit(df_m['Body'])

    Train_X_Tfidf_b = Tfidf_vect_b.transform(Train_X_b)
    Test_X_Tfidf_b = Tfidf_vect_b.transform(Test_X_b)

    # Fit the NB classifier on the training set
    Naive_b = naive_bayes.MultinomialNB()
    Naive_b.fit(Train_X_Tfidf_b, Train_Y_b)

    # Predict the labels of the test set
    predictions_NB_b = Naive_b.predict(Test_X_Tfidf_b)

    # accuracy_score expects the true labels first, then the predictions
    print("Naive Bayes Accuracy Score with max feature of " + str(i) + ' -> ', accuracy_score(Test_Y_b, predictions_NB_b)*100)

Naive Bayes Accuracy Score with max feature of 1000 ->  44.827586206896555
Naive Bayes Accuracy Score with max feature of 2000 ->  49.29750948261473
Naive Bayes Accuracy Score with max feature of 3000 ->  51.59405775329661
Naive Bayes Accuracy Score with max feature of 4000 ->  52.74751207270676
Naive Bayes Accuracy Score with max feature of 5000 ->  53.567132307656884
Naive Bayes Accuracy Score with max feature of 6000 ->  54.0074479535395
Naive Bayes Accuracy Score with max feature of 7000 ->  54.31768342168425
Naive Bayes Accuracy Score with max feature of 8000 ->  54.57611704913692
Naive Bayes Accuracy Score with max feature of 9000 ->  54.759725795589944
Naive Bayes Accuracy Score with max feature of 10000 ->  54.884050213250916
Naive Bayes Accuracy Score with max feature of 20000 ->  54.13119679519279
Naive Bayes Accuracy Score with max feature of 30000 ->  52.564478902261435
Naive Bayes Accuracy Score with max feature of 40000 ->  51.169282659621615
Naive Bayes Accuracy Score wi