# Lab 2 - Exploring Text Data

### Eric Smith and Jake Carlson

## Introduction
For this lab we will examine questions and answers on the popular programming Q&A website Stack Overflow. Stack Overflow lets programmers who are stuck on a problem ask the community how to resolve or work around it. This helps people write more accurate code faster. However, the community can be hard on those who do not fully understand their own question or who are disrespectful. It may also be that the few developers with experience of a problem similar to yours are not on the site when you post your question. If too much time passes, your post may be buried forever. This raises some interesting questions. When should you post so that your question has the highest chance of being answered? Are there specific keywords or phrasing that encourage other programmers to answer? Are questions about some languages answered faster than others? These are the questions we will explore in this lab.

## Business Understanding

### Motivations
Stack Overflow is a go-to resource for developers. Questions are often distilled down to a single block of code that is easily digestible, and can be matched to the question you came to the site for. However, if you're truly stuck on something, it can seem like an eternity of waiting before someone responds to your question, and sometimes, it may not be answered at all. If you're working in industry and spending a lot of time waiting for a question to be answered, you could fall behind schedule and miss deadlines. These delays could cost your company a great deal in lost revenue and tech debt. Therefore, accurately predicting the amount of time to get a question answered can be seen as a valuable project management tool.

### Objectives
It would be useful to predict how quickly a question will be answered if it is posted at various times throughout the day. For a prediction tool like this to be useful, we want it to be 80-95% accurate in predicting the time by which a question will be answered. As a project manager, you want your developers to ask questions that are concise and respectful. Therefore, a tool that recommends keywords to add to a post to make it more attractive, and estimates the time saved by rewording it, would also be a valuable project management tool.

## Data Understanding

### Data Attributes
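The following is a list of attributes in the data, their data types as reported by `df.info()`, and a brief description of each attribute.

- **Id** (int64): identifier of the question.
- **OwnerUserId** (float64): identifier of the user who posted the question. It is missing for 14,454 rows, which is why pandas stores it as a float rather than an integer.
- **CreationDate** (object): timestamp string recording when the question was posted, e.g. `2008-08-01T13:57:07Z`.
- **ClosedDate** (object): timestamp string recording when the question was closed. Only 55,959 questions have a value.
- **Score** (int64): score of the question, ranging from -73 to 5190 in this data.
- **Title** (object): title text of the question.
- **Body** (object): body of the question, stored as HTML.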

#### General Information


In [1]:
import numpy as np
import pandas as pd

In [2]:
# read data
df = pd.read_csv('./data/Questions.csv', encoding='ISO-8859-1')

In [3]:
df.head()

| | Id | OwnerUserId | CreationDate | ClosedDate | Score | Title | Body |
|---|---|---|---|---|---|---|---|
| 0 | 80 | 26.0 | 2008-08-01T13:57:07Z | | 26 | SQLStatement.execute() - multiple queries in o... | `<p>I've written a database generation script i...` |
| 1 | 90 | 58.0 | 2008-08-01T14:41:24Z | 2012-12-26T03:45:49Z | 144 | Good branching and merging tutorials for Torto... | `<p>Are there any really good tutorials explain...` |
| 2 | 120 | 83.0 | 2008-08-01T15:50:08Z | | 21 | ASP.NET Site Maps | `<p>Has anyone got experience creating <strong>...` |
| 3 | 180 | 2089740.0 | 2008-08-01T18:42:19Z | | 53 | Function for creating color wheels | `<p>This is something I've pseudo-solved many t...` |
| 4 | 260 | 91.0 | 2008-08-01T23:22:08Z | | 49 | Adding scripting functionality to .NET applica... | `<p>I have a little game written in C#. It uses...` |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1264216 entries, 0 to 1264215
Data columns (total 7 columns):
Id              1264216 non-null int64
OwnerUserId     1249762 non-null float64
CreationDate    1264216 non-null object
ClosedDate      55959 non-null object
Score           1264216 non-null int64
Title           1264216 non-null object
Body            1264216 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 67.5+ MB
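
`df.info()` reports `CreationDate` and `ClosedDate` as `object` columns, meaning they are held as strings. Because the questions raised in the introduction depend on when a question is posted, the sketch below shows one way to parse them, using three timestamps copied from the `df.head()` output. The `sample` frame and the `hour` column are illustrative names that do not appear in the original notebook.

```python
import pandas as pd

# Three rows copied from the df.head() output (illustrative sample only).
sample = pd.DataFrame({
    "CreationDate": ["2008-08-01T13:57:07Z", "2008-08-01T14:41:24Z", "2008-08-01T15:50:08Z"],
    "ClosedDate": [None, "2012-12-26T03:45:49Z", None],
})

# Parse the ISO 8601 strings into timezone-aware (UTC) timestamps.
# Missing values become NaT.
for col in ["CreationDate", "ClosedDate"]:
    sample[col] = pd.to_datetime(sample[col], utc=True)

# Hour of day (UTC) at which each question was posted.
sample["hour"] = sample["CreationDate"].dt.hour
print(sample)
```

The same two calls could be applied to the full `df`, assuming every timestamp follows the format seen in the first five rows.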


In [5]:
df.describe()

| | Id | OwnerUserId | Score |
|---|---|---|---|
| count | 1264216.0 | 1249762.0 | 1264216.0 |
| mean | 21327450.0 | 2155177.0 | 1.781537 |
| std | 11514450.0 | 1801265.0 | 13.66389 |
| min | 80.0 | 1.0 | -73.0 |
| 25% | 11425980.0 | 658911.0 | 0.0 |
| 50% | 21725420.0 | 1611830.0 | 0.0 |
| 75% | 31545420.0 | 3353792.0 | 1.0 |
| max | 40143380.0 | 7046594.0 | 5190.0 |


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
bag_words = count_vect.fit_transform(df.Title.tolist())

In [7]:
del df

In [8]:
print(bag_words[0])

  (0, 127217)	1
  (0, 93245)	1
  (0, 64893)	1
  (0, 106637)	1
  (0, 85978)	1
  (0, 46576)	1
  (0, 126200)	1
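
Each printed entry has the form `(row, column) count`: the first title contains seven distinct tokens, each occurring once. The sketch below fits a separate vectorizer on two complete titles from the `df.head()` output to show how column indices map back to words. The names `titles`, `vect`, `counts`, and `index_to_word` are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two complete titles taken from the df.head() output.
titles = ["ASP.NET Site Maps", "Function for creating color wheels"]

vect = CountVectorizer()
counts = vect.fit_transform(titles)  # scipy sparse matrix, one row per title

# Invert the word -> column mapping so columns can be read back as words.
index_to_word = {idx: word for word, idx in vect.vocabulary_.items()}

first = counts[0]  # sparse row for the first title
for col, n in zip(first.indices, first.data):
    print(index_to_word[col], n)
```

Only the non-zero entries are stored, which is why a matrix with 151,918 columns can be kept in memory for over a million titles.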


In [9]:
len(count_vect.vocabulary_)

151918

In [10]:
count_vect.vocabulary_

{'sqlstatement': 126200,
 'execute': 46576,
 'multiple': 85978,
 'queries': 106637,
 'in': 64893,
 'one': 93245,
 'statement': 127217,
 'good': 57617,
 'branching': 19702,
 'and': 10573,
 'merging': 81735,
 'tutorials': 137535,
 'for': 51360,
 'tortoisesvn': 135706,
 'asp': 13360,
 'net': 88194,
 'site': 122615,
 'maps': 79856,
 'function': 53063,
 'creating': 31898,
 'color': 27197,
 'wheels': 146828,
 'adding': 8062,
 'scripting': 116766,
 'functionality': 53086,
 'to': 135064,
 'applications': 12063,
 'should': 121360,
 'use': 141876,
 'nested': 88153,
 'classes': 25306,
 'this': 133878,
 'case': 22295,
 'homegrown': 61322,
 'consumption': 29527,
 'of': 92483,
 'web': 145909,
 'services': 118971,
 'deploying': 37155,
 'sql': 125798,
 'server': 118727,
 'databases': 34251,
 'from': 52534,
 'test': 132854,
 'live': 76787,
 'automatically': 15302,
 'update': 141214,
 'version': 143735,
 'number': 91311,
 'visual': 144594,
 'studio': 128709,
 'setup': 120419,
 'project': 103758,
 'per':

In [None]:
# sum columns on the sparse matrix; bag_words.toarray() would not fit in memory
word_counts = pd.Series(np.asarray(bag_words.sum(axis=0)).ravel(),
                        index=count_vect.get_feature_names_out())

In [None]:
# print 10 most common words
word_counts.sort_values()[-10:]