# Lab 2 - Exploring Text Data

### Eric Smith and Jake Carlson

## Introduction
For this lab we will examine questions and answers on the popular programming Q&A website Stack Overflow. Stack Overflow lets programmers who are stuck on a problem ask the community for ways to resolve or circumvent it. This helps people write more accurate code faster. However, the community can be hard on those who do not fully understand their own question or who are disrespectful. It could also be the case that the few developers who have experience with a problem similar to yours are not on the website when you post your question. If too much time passes, your post may be buried forever. This raises some interesting questions. When should you post so that your question has the highest chance of being answered? Are there specific keywords or phrasing that encourage other programmers to answer? Are there specific languages whose questions are answered faster than others? These are the questions we will explore in this lab.

## Business Understanding

### Motivations
Stack Overflow is a go-to resource for developers. Questions are often distilled down to a single block of code that is easily digestible, and can be matched to the question you came to the site for. However, if you're truly stuck on something, it can seem like an eternity of waiting before someone responds to your question, and sometimes, it may not be answered at all. If you're working in industry and spending a lot of time waiting for a question to be answered, you could fall behind schedule and miss deadlines. These delays could cost your company a great deal in lost revenue and tech debt. Therefore, accurately predicting the amount of time to get a question answered can be seen as a valuable project management tool.

### Objectives
It would be useful to predict how quickly your question will be answered if you post it at various times throughout the day. For a prediction tool like this to be useful, we want something that is 80-95% accurate in predicting the time by which a question will be answered. As a project manager, you want your developers to be asking questions that are concise and respectful. Therefore, a tool that could recommend keywords to add to your post to make it more attractive, and provide you with approximate gains in time for reformatting, would also be a valuable project management tool.

## Data Understanding

### Data Attributes
The following is a list of attributes in the data, their data types, and a brief description of the attribute.

#### Questions
- **Id** (nominal): A unique identifier for each question
- **OwnerUserId** (nominal): A unique identifier for the person who posted the question
- **CreationDate** (ordinal): A timestamp of when the question was posted
- **ClosedDate** (ordinal): A timestamp of when the question was closed; if the question wasn't closed, this field is NaN
- **Score** (interval): The net vote score of the post (upvotes minus downvotes); it can be negative
- **Title** (text): A title for the question
- **Body** (text): The question body
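
As a rough illustration of how these attribute types could map onto pandas dtypes, the sketch below builds a two-row frame and casts each column. The first row is copied from the data shown later in this notebook; in the second row the OwnerUserId is deliberately blanked to show a missing value. The choice of the nullable `Int64` dtype for identifiers is an assumption on our part, not something the analysis below relies on.

```python
import pandas as pd

# Tiny stand-in for Questions.csv (second OwnerUserId blanked for illustration)
sample = pd.DataFrame({
    'Id': [80, 90],
    'OwnerUserId': [26.0, None],
    'CreationDate': ['2008-08-01T13:57:07Z', '2008-08-01T14:41:24Z'],
    'ClosedDate': [None, '2012-12-26T03:45:49Z'],
    'Score': [26, 144],
})

# Timestamps: parse the ISO 8601 strings so they can be ordered and subtracted
for col in ['CreationDate', 'ClosedDate']:
    sample[col] = pd.to_datetime(sample[col], utc=True)

# Identifiers are nominal, so a nullable integer avoids the float64 upcast
# that pandas applies when an integer column contains missing values
sample['OwnerUserId'] = sample['OwnerUserId'].astype('Int64')

print(sample.dtypes)
```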


## Data Quality
Our data set is too big to work with in full, so we will take a random subsample of 12,000 questions (roughly 1% of the original data set).

In [2]:
import numpy as np
import pandas as pd

In [3]:
# read data
df = pd.read_csv('./data/Questions.csv', encoding='ISO-8859-1')

In [4]:
df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
2,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...
3,180,2089740.0,2008-08-01T18:42:19Z,,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...
4,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1264216 entries, 0 to 1264215
Data columns (total 7 columns):
Id              1264216 non-null int64
OwnerUserId     1249762 non-null float64
CreationDate    1264216 non-null object
ClosedDate      55959 non-null object
Score           1264216 non-null int64
Title           1264216 non-null object
Body            1264216 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 67.5+ MB


Trial and error has shown us that 1,264,000 elements is too many to work with. Therefore, we will use a random sample to reduce our data set size to 12,000 elements.

In [6]:
# sample without replacement so that no question appears twice
df = df.sample(n=12000, replace=False)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 377436 to 1028120
Data columns (total 7 columns):
Id              12000 non-null int64
OwnerUserId     11851 non-null float64
CreationDate    12000 non-null object
ClosedDate      563 non-null object
Score           12000 non-null int64
Title           12000 non-null object
Body            12000 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 750.0+ KB
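
The subsampling step can also be made reproducible. The sketch below is a minimal illustration on a synthetic frame (the column names mirror the data set, but the rows are made up). It draws without replacement so that no row appears twice; the `random_state` value of 42 is an arbitrary choice.

```python
import pandas as pd

# Synthetic stand-in for the full question table (hypothetical data)
full = pd.DataFrame({'Id': range(0, 1000, 10), 'Score': range(100)})

# Draw 12 rows without replacement; a fixed seed makes the draw repeatable
subset = full.sample(n=12, replace=False, random_state=42)
again = full.sample(n=12, replace=False, random_state=42)

print(subset.index.is_unique)   # True when no row was drawn twice
print(subset.equals(again))     # True when the same seed gives the same sample
```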


The body of each question is formatted in HTML, so we will need to parse through each body and remove the HTML tags. We also need to replace HTML entities with their character representations. Finally, many questions have accompanying code blocks, so we will parse out the code blocks into a new attribute for each entry.

In [10]:
from bs4 import BeautifulSoup
import html
import re

# compile regexes
tag_re = re.compile('<[^<]+?>')
newline_re = re.compile('\n+')

def clean_html(body_text):
    soup = BeautifulSoup(body_text, 'html.parser')
    code_tags = soup.find_all('code')
    code_text = ""
    
    # unescape html entities
    body_text = html.unescape(body_text)
    
    # remove code blocks, saving blocks to new string
    for c in code_tags:
        # c.string is None when the <code> tag has nested children,
        # so there is nothing to match or save in that case
        if c.string:
            body_text = body_text.replace("<code>{}</code>".format(c.string),
                                          '')
            code_text += c.string

    # remove remaining tags and multiple newlines
    body_text = tag_re.sub('', body_text)
    body_text = newline_re.sub('\n', body_text)
    
    return body_text, code_text

body_list = []
code_list = []
for body in df.Body:
    body_text, code_text = clean_html(body)
    body_list.append(body_text)
    code_list.append(code_text)

# update body column
df = df.assign(Body=body_list)
# add code column
df = df.assign(Code=code_list)
df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,Code
377436,13420650,846550.0,2012-11-16T16:34:52Z,,2,Visual Studio Debugger Watch problems,How can I find out the address of a variable o...,streets streets [11790](0x1c66a690 [...] s...
1027704,33858900,5381833.0,2015-11-22T19:03:32Z,,0,Getting an empty string when scanned,So I'm trying to create a Graph from a file in...,"7\nD\n(0, 1)\n(0, 3)\n.\n.\n.\npublic static G..."
572225,19771030,1450675.0,2013-11-04T15:24:50Z,,0,TypeError: e is not defined - when trying to u...,"I'm trying to select all ""img"" tags within ""#g...","<div id=""gallery"">\n <p><img src=""/userfile..."
839303,28227710,3511219.0,2015-01-30T01:16:47Z,,0,Java Triangle Pattern Printing,Trying to write a program that will print a nu...,"Triangle s1 = new Triangle(7, '*');\ns1.displa..."
1036620,34097250,5641006.0,2015-12-04T20:53:06Z,,1,"Using Plotly in shinydashboard, buttons too la...",I'm using Plotly's tutorial for shiny in my sh...,library(ggplot2)\nlibrary(plyr)\nlibrary(rChar...
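
The same split of prose and code can also be done in a single pass over the markup, which avoids string matching on `<code>...</code>`. The following is a minimal sketch using only the standard library's `html.parser`. The class name `BodySplitter` and the helper `split_body` are hypothetical names introduced here, and this is not the approach used in the cell above.

```python
from html.parser import HTMLParser
import re


class BodySplitter(HTMLParser):
    """Collect text inside <code> tags separately from the remaining prose."""

    def __init__(self):
        # convert_charrefs=True makes the parser unescape entities such as &lt;
        super().__init__(convert_charrefs=True)
        self.code_depth = 0      # > 0 while inside one or more <code> tags
        self.body_parts = []
        self.code_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'code':
            self.code_depth += 1

    def handle_endtag(self, tag):
        if tag == 'code' and self.code_depth > 0:
            self.code_depth -= 1

    def handle_data(self, data):
        # route text to the code buffer or the prose buffer
        target = self.code_parts if self.code_depth else self.body_parts
        target.append(data)


def split_body(body_html):
    parser = BodySplitter()
    parser.feed(body_html)
    parser.close()
    # collapse runs of newlines, as the original cleaning step does
    body_text = re.sub(r'\n+', '\n', ''.join(parser.body_parts))
    return body_text, ''.join(parser.code_parts)


body, code = split_body(
    '<p>Why is <code>a &lt; b</code> false?</p>\n\n<pre><code>print(1)</code></pre>'
)
print(repr(body))
print(repr(code))
```

Because every tag is consumed by the parser, no separate tag-stripping regex is needed in this variant.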


With the HTML tags removed and the code isolated, the question body is much cleaner. Now let's add another column for the time in minutes it took to get the question answered, measured from CreationDate to ClosedDate. This attribute will be NaN if ClosedDate is NaN.

In [19]:
# adapted from https://stackoverflow.com/questions/2788871/date-difference-in-minutes-in-python
from datetime import datetime

fmt = '%Y-%m-%dT%H:%M:%SZ'
def get_minutes_diff(x):
    # questions that were never closed have no duration
    if pd.isnull(x.ClosedDate):
        return np.nan
    d1 = datetime.strptime(x.CreationDate, fmt)
    d2 = datetime.strptime(x.ClosedDate, fmt)
    
    # both timestamps are UTC, so subtract them directly
    # (time.mktime would interpret them in the local time zone)
    # and convert the difference to minutes
    return (d2 - d1).total_seconds() / 60

# df['Duration'] = df.apply(func=get_minutes_diff, axis=1)
df[~df.ClosedDate.isnull()]

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,Code
731456,24849700,3857659.0,2014-07-20T10:46:06Z,2016-05-26T07:14:16Z,0,Apache Tomcat error http status 404,To be honest i am a learner and this is my fir...,http://localhost:8080\n- <web-app>\n- <servlet...
1231983,39352830,6801021.0,2016-09-06T15:34:42Z,2016-09-07T12:51:29Z,-4,unsupported operand type(s) for +=: 'int' and ...,"I am new in python, but the code below keep re...",unsupported operand type(s) for +=: 'int' and ...
1261314,40075210,7026835.0,2016-10-16T20:40:40Z,2016-10-16T23:07:54Z,-9,"Str.length ,if..else",I want my code to run a statement if a string ...,"<script>\n var str = ""Hello World!"";\n if(str..."
98539,4138720,24224.0,2010-11-09T21:17:00Z,2012-06-06T12:24:26Z,1,Concise way to Disallow Spidering of all Direc...,Is there anyway to write a robots.txt file tha...,
497257,17331440,1278201.0,2013-06-26T22:16:02Z,2013-06-27T10:47:45Z,0,Order by frequency of occurence within a field,Is there a way in MySQL to order based on the ...,"id, value\n1, 'Bob'\n2, 'Bob Bob Bob'\n3, 'Bob..."
924661,30832870,4441904.0,2015-06-14T18:15:35Z,2015-06-15T00:19:36Z,-1,Spring DI Web-App @Autowired Class is null,I'm trying to autowire a Repository (backend p...,NullPointerException <!--Load Spring Config...
993245,32910280,4331985.0,2015-10-02T15:07:56Z,2015-10-02T15:08:43Z,0,How Java compiler knows which function to use ...,I am making a Linked List of Integers \nI want...,LinkedList<Integer> l1 = new LinkedList<Intege...
1092129,35618030,5583099.0,2016-02-25T04:00:16Z,2016-02-25T13:49:14Z,1,A float that 0.0 equals one transform position...,I have a Healthbar with a mask over it. \nI ne...,
817381,27538900,4148529.0,2014-12-18T03:43:16Z,2014-12-18T07:08:25Z,1,How to manage large data in java web services,I have a problem my java web service. How can ...,
1219916,39027360,6702038.0,2016-08-18T20:55:53Z,2016-08-19T01:52:49Z,0,sending mail from lcoalhost mamp,I'm seriously getting crazy with sending mails...,function contactMail () {\n\n global $conne...
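
The elapsed time can also be computed without a row-wise `apply`. The sketch below is one possible vectorized version, run on three rows whose timestamps are copied from the tables above. The column name `Duration` is our assumption for the new attribute. Rows with no ClosedDate come out as NaN.

```python
import pandas as pd

# CreationDate / ClosedDate pairs copied from rows shown in this notebook
demo = pd.DataFrame({
    'CreationDate': ['2015-10-02T15:07:56Z',
                     '2016-10-16T20:40:40Z',
                     '2012-11-16T16:34:52Z'],
    'ClosedDate': ['2015-10-02T15:08:43Z',
                   '2016-10-16T23:07:54Z',
                   None],
})

# Parse both columns as UTC timestamps; missing values become NaT
created = pd.to_datetime(demo['CreationDate'], utc=True)
closed = pd.to_datetime(demo['ClosedDate'], utc=True)

# Subtracting datetime columns gives timedeltas; NaT turns into NaN minutes
demo['Duration'] = (closed - created).dt.total_seconds() / 60
print(demo)
```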


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words='english')
bag_words = count_vect.fit_transform(df.Title.tolist())

In [7]:
len(count_vect.vocabulary_)

10158

In [8]:
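# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement (available since 1.0)
df_title = pd.DataFrame(data=bag_words.toarray(), 
                        columns=count_vect.get_feature_names_out())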

In [9]:
# print 10 most common words
df_title.sum().sort_values()[-10:]

function    318
value       323
php         356
java        377
jquery      379
error       475
data        476
android     479
file        529
using       890
dtype: int64
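
Converting the full bag-of-words matrix to a dense array can be memory-hungry for a vocabulary of this size. As a hedged alternative, the sketch below sums the sparse matrix column-wise and looks terms up through `vocabulary_`. The three titles are taken from the tables above, and the variable names are ours.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    'Java Triangle Pattern Printing',
    'How to manage large data in java web services',
    'Apache Tomcat error http status 404',
]

vect = CountVectorizer(stop_words='english')
counts = vect.fit_transform(titles)   # sparse matrix, documents x terms

# Sum each column without densifying the matrix, then flatten to 1-D
totals = np.asarray(counts.sum(axis=0)).ravel()

# vocabulary_ maps each term to its column index, so sort terms by that index
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
top = pd.Series(totals, index=terms).sort_values(ascending=False)
print(top.head(3))
```

Only the per-term totals are ever held densely here, so memory use grows with the vocabulary size and not with the number of documents times the vocabulary size.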