# Scraping Raw Data From Stack Overflow

Systematically extracting business intelligence from data.

In [1]:
# render plots inline in the notebook
%matplotlib inline

# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2 

# What is our question?

Essentially: What are the biggest problem areas for Python programmers?

# Identify the data (Stack Overflow)

Data Source: All the questions on Stack Overflow that have the "Python" tag on them.

Question: Where exactly does all of this data live -- what is the URL structure we can use to 
acquire all of this user data?

Task: Work the URL into a formattable string template you can feed into a scraper.

In [2]:
SO_URL = "https://stackoverflow.com/questions/tagged/python?page={0}&sort=frequent&pagesize=50"

## Acquire the raw data

In [3]:
number_of_pages_to_gather = 50

# create the sequence of page numbers
page_range = range(1, number_of_pages_to_gather + 1)

You may have noticed while formatting the URL that there are thousands of pages of Python questions, but here we are collecting only 50. This is intentional and temporary. Eventually we should collect the entire corpus of data, but right now we are prototyping a workflow, so we temporarily **downsample** to iterate faster.

As a matter of fact, 50 is pretty high.  Let's kick it down to 5 files.  That way, we are still coding for the case where we need to handle multiple files (as opposed to just one), without introducing lots of computing time.

In [4]:
page_range = range(1, 6)  
print(page_range)

[1, 2, 3, 4, 5]


In [5]:
# TODO: MAKE A PAGE URLS GENERATOR (argument: array of numbers, yields URLs )
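
One way to fill in the TODO above is a small generator. This is a minimal sketch, assuming the `SO_URL` template from the earlier cell; the function name `generate_page_urls` is hypothetical.

```python
# URL template copied from the cell above
SO_URL = "https://stackoverflow.com/questions/tagged/python?page={0}&sort=frequent&pagesize=50"


def generate_page_urls(page_numbers, template=SO_URL):
    """Yield one formatted URL per page number (hypothetical helper)."""
    for page_number in page_numbers:
        yield template.format(page_number)


# materialize the generator only to inspect it; a scraper could iterate lazily
page_urls = list(generate_page_urls(range(1, 6)))
print(page_urls[0])
```

Because the function yields URLs one at a time, the same code would work unchanged if the page range grows to the full corpus.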

In [6]:
import requests 

def http_get(url):
    """Fetch a URL and return the requests response object."""
    response = requests.get(url)
    return response

# 
#  Once you have the data, it can be helpful to comment the following loop out.
# 
for i in page_range:
    so_response = http_get(SO_URL.format(i))

    if so_response.status_code == 200:
        # write bytes so the same code runs on Python 2 and Python 3;
        # the 'ignore' handler drops any non-ASCII characters
        with open('FILENAME_00{0}.html'.format(i), 'wb') as html_file:
            html_file.write(so_response.text.encode('ascii', 'ignore'))
    else:
        print("Failed at loop: ", i)

## Mash Until No Good! Data Munging/Wrangling/Transforming

> Bad programmers worry about the code. 
>
> Good programmers worry about data structures and their relationships.
>
> Linus Torvalds, creator of Linux and git

> I be in the kitchen whipping
>
> trying to cook the sauce.
>
>   Yo Gotti, _The Art of the Hustle_

We are now going to begin whipping this data into shape for various levels of analysis. It's hard to do analysis on a bunch of data locked up in an HTML structure.

#### Extracting the maximum number of dimensions from the data

Look at the Stack Overflow page and think about what our granular data points are for the pages we have decided to collect.

< INSERT PICTURE OF SO PAGE HERE >

The granular logical data point is a question.  So what are the dimensions/attributes of a question object?
- question text
- vote score
- views 
- details
- author
- question details 

Beautiful Soup parses the HTML into a Python tree structure (similar to the DOM).  You can then use a variety of BS4 methods to extract specific HTML elements based on their HTML attributes.  Since classes and IDs are HTML attributes, you can use CSS selectors to extract information.

In [7]:
# insert quick BS4 demo before doing the real code in the next block
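
A quick demo of the Beautiful Soup calls described above, run on a tiny hand-written HTML snippet rather than a real Stack Overflow page. The markup is made up for illustration and only mimics the class names used in the next cells.

```python
from bs4 import BeautifulSoup

# hand-written sample markup (an assumption, not real Stack Overflow HTML)
sample_html = """
<div class="question-summary" id="question-summary-111">
  <div class="views" title="1,234 views">1k views</div>
  <a class="question-hyperlink" href="#">How do I slice a list?</a>
  <a class="post-tag">python</a>
  <a class="post-tag">list</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all filters on tag name and HTML attributes such as class
summary = soup.find_all("div", class_="question-summary")[0]
print(summary["id"])  # attributes are read with dict-style access

# select takes a CSS selector and returns the matching elements
tags = [a.text for a in soup.select("div.question-summary a.post-tag")]
print(tags)

# the exact view count sits in the title attribute, e.g. "1,234 views"
views = int(soup.find("div", class_="views")["title"].split(" ")[0].replace(",", ""))
print(views)
```

Both `find_all` and `select` reach the same elements here; which one reads better is mostly a matter of taste.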

OK, let's actually get the question text, vote score, views, etc. out of the data.

In [8]:
from bs4 import BeautifulSoup

In [9]:
for i in page_range:
    file_name = 'FILENAME_00{0}.html'.format(i)
    print(file_name)
    
    with open(file_name,'r') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
        
        # for each file, get all of the question objects
        questions = soup.find_all("div", class_="question-summary")
        
        for question in questions:
            text = question.find('a', class_="question-hyperlink").text
            tags = [tag.text for tag in question.find_all('a', class_="post-tag")]
            views = int(question.find('div', class_="views")['title'].split(" ")[0].replace(",", ""))
#             print(tags)
        
    
        
        

FILENAME_001.html
FILENAME_002.html
FILENAME_003.html
FILENAME_004.html
FILENAME_005.html


OK, so we have figured out how to get at the data with Beautiful Soup above.  Let's wrap all of that logic into a **function** that accepts an HTML file as an argument and returns a sequence of question objects -- each object will contain all of the attributes.  Each of these will become a row in a Pandas DataFrame.

In [63]:
def get_question_info_from_summary(summary_div):
    qid = summary_div['id'].split("-")[2]
    text = summary_div.find('a', class_="question-hyperlink").text
    tags = [tag.text for tag in summary_div.find_all('a', class_="post-tag")]
    views = int(summary_div.find('div', class_="views")['title'].split(" ")[0].replace(",", ""))

    # the date isn't always there
    date = summary_div.find('span', class_='relativetime')
    date_asked = date['title'] if date else None
    
    return [qid, views, text, tags, date_asked]


def extract_question_objects(relative_html_path):
    """
    Parse one saved HTML page and return a list of question rows.

    :relative_html_path: file to read and parse for stack overflow questions
    """
    questions_objects = []
    with open(relative_html_path, 'r') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
        question_divs = soup.find_all("div", class_="question-summary")

        for question in question_divs:
            q_info = get_question_info_from_summary(question)
            questions_objects.append(q_info)

    return questions_objects

In [64]:
dataset = []

for i in page_range:
    filename = "FILENAME_00{}.html".format(i)
    
    qs = extract_question_objects(filename)
    
    dataset.extend(qs)


print(len(dataset))

250


In [65]:
print(dataset[0])

[u'1132941', 103641, u'\u201cLeast Astonishment\u201d and the Mutable Default Argument', [u'python', u'language-design', u'least-astonishment'], u'2009-07-15 18:00:37Z']


OK, this raw data looks pretty clean.  It's time to explore.

## Exploratory Data Analysis

Enter pandas, the lingua franca of data analysis.  It works very well with tabular data.

In [72]:
import pandas as pd

import numpy as np

import json

# dataset2 = {}

# # munge dict of lists into pd dataframe-acceptable format
# for q in dataset:
#     print(q)

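# one row per question, indexed by question id
df = pd.DataFrame(columns=['views', 'text', 'tags', 'date_asked'])

for data in dataset:
    qid, views, text, tags, date_asked = data
    # store the tags as a tuple so each cell holds a single hashable value
    df.loc[qid] = [views, text, tuple(tags), date_asked]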



In [73]:
df.columns

Index([u'views', u'text', u'tags', u'date_asked'], dtype='object')

In [74]:
df.index

Index([u'1132941', u'15112125', u'509211', u'23294658', u'240178', u'2612802',
       u'1373164', u'312443', u'986006', u'1207406',
       ...
       u'4856717', u'2464959', u'6618002', u'33759623', u'237079', u'3768895',
       u'12065885', u'19339', u'6318156', u'5595425'],
      dtype='object', length=250)

In [75]:
df['tags']

1132941         (python, language-design, least-astonishment)
15112125    (python, if-statement, comparison, match, bool...
509211                                  (python, list, slice)
23294658    (python, validation, loops, python-3.x, user-i...
240178                  (python, list, nested-lists, mutable)
2612802                           (python, list, copy, clone)
1373164                          (python, variable-variables)
312443                          (python, list, split, chunks)
986006      (python, reference, parameter-passing, pass-by...
1207406                                   (python, iteration)
952914        (python, list, multidimensional-array, flatten)
20109391                                     (python, pandas)
20449427                (python, python-2.7, python-3.x, int)
231767        (python, iterator, generator, yield, coroutine)
291978                     (python, scope, dynamic-languages)
36901       (python, syntax, parameter-passing, identifier...
89228   

### Exploratory Data Analysis: Profiling

In [76]:
import pandas_profiling

In [77]:
pandas_profiling.ProfileReport(df)

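Dataset info

| Statistic | Value |
|---|---|
| Number of variables | 5 |
| Number of observations | 250 |
| Total Missing (%) | 0.8% |
| Total size in memory | 9.8 KiB |
| Average record size in memory | 40.3 B |

Variable types

| Type | Count |
|---|---|
| Numeric | 1 |
| Categorical | 2 |
| Date | 0 |
| Text (Unique) | 2 |
| Rejected | 0 |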

Variable: date_asked

| Statistic | Value |
|---|---|
| Distinct count | 241 |
| Unique (%) | 100.4% |
| Missing (%) | 4.0% |
| Missing (n) | 10 |

| Value | Count |
|---|---|
| 2012-03-27 06:03:37Z | 1 |
| 2012-03-30 12:06:41Z | 1 |
| 2008-10-23 17:59:39Z | 1 |
| Other values (237) | 237 |
| (Missing) | 10 |

| Value | Count | Frequency (%) |
|---|---|---|
| 2012-03-27 06:03:37Z | 1 | 0.4% |
| 2012-03-30 12:06:41Z | 1 | 0.4% |
| 2008-10-23 17:59:39Z | 1 | 0.4% |
| 2008-09-28 05:34:20Z | 1 | 0.4% |
| 2008-09-02 07:44:30Z | 1 | 0.4% |
| 2011-07-25 21:41:58Z | 1 | 0.4% |
| 2008-11-04 11:57:27Z | 1 | 0.4% |
| 2008-09-17 12:55:00Z | 1 | 0.4% |
| 2010-07-08 19:31:22Z | 1 | 0.4% |
| 2011-06-07 18:14:52Z | 1 | 0.4% |

Variable: index

First 3 values: 3906137, 10434599, 890128

Last 3 values: 67631, 1101750, 2386714

| Value | Count | Frequency (%) |
|---|---|---|
| 100003 | 1 | 0.4% |
| 1009860 | 1 | 0.4% |
| 101268 | 1 | 0.4% |
| 10434599 | 1 | 0.4% |
| 104420 | 1 | 0.4% |

| Value | Count | Frequency (%) |
|---|---|---|
| 972 | 1 | 0.4% |
| 986006 | 1 | 0.4% |
| 988228 | 1 | 0.4% |
| 9884132 | 1 | 0.4% |
| 9942594 | 1 | 0.4% |

Variable: tags

| Statistic | Value |
|---|---|
| Distinct count | 226 |
| Unique (%) | 90.4% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |

| Value | Count |
|---|---|
| (u'python',) | 7 |
| (u'python', u'python-import') | 4 |
| (u'python', u'list') | 3 |
| Other values (223) | 236 |

| Value | Count | Frequency (%) |
|---|---|---|
| (u'python',) | 7 | 2.8% |
| (u'python', u'python-import') | 4 | 1.6% |
| (u'python', u'list') | 3 | 1.2% |
| (u'python', u'dictionary') | 3 | 1.2% |
| (u'python', u'python-3.x') | 3 | 1.2% |
| (u'python', u'tkinter') | 3 | 1.2% |
| (u'python', u'file') | 2 | 0.8% |
| (u'python', u'string') | 2 | 0.8% |
| (u'python', u'subprocess') | 2 | 0.8% |
| (u'python', u'global-variables', u'scope') | 2 | 0.8% |

Variable: text

First 3 values:
- Difference between append vs. extend list meth...
- 'import module' or 'from module import'
- What is the difference between re.search and r...

Last 3 values:
- What is memoization and how can I use it in Py...
- What exactly do “u” and “r” string flags do, a...
- Split Strings with Multiple Delimiters?

| Value | Count | Frequency (%) |
|---|---|---|
| 'import module' or 'from module import' | 1 | 0.4% |
| \*args and \*\*kwargs? [duplicate] | 1 | 0.4% |
| Accessing class variables from a list comprehension in the class definition | 1 | 0.4% |
| Accessing the index in Python 'for' loops | 1 | 0.4% |
| Adding Python Path on Windows 7 | 1 | 0.4% |

| Value | Count | Frequency (%) |
|---|---|---|
| python open built-in function: difference between modes a, a+, w, w+, and r+? | 1 | 0.4% |
| strange result when removing item from a list [duplicate] | 1 | 0.4% |
| “Large data” work flows using pandas | 1 | 0.4% |
| “Least Astonishment” and the Mutable Default Argument | 1 | 0.4% |
| “is” operator behaves unexpectedly with integers | 1 | 0.4% |

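Variable: views

| Statistic | Value |
|---|---|
| Distinct count | 250 |
| Unique (%) | 100.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |

| Statistic | Value |
|---|---|
| Mean | 437300 |
| Minimum | 1581 |
| Maximum | 2656298 |
| Zeros (%) | 0.0% |

| Quantile statistic | Value |
|---|---|
| Minimum | 1581.0 |
| 5-th percentile | 9510.1 |
| Q1 | 92842.0 |
| Median | 219370.0 |
| Q3 | 552690.0 |
| 95-th percentile | 1569000.0 |
| Maximum | 2656298.0 |
| Range | 2654717.0 |
| Interquartile range | 459840.0 |

| Descriptive statistic | Value |
|---|---|
| Standard deviation | 526010 |
| Coef of variation | 1.2028 |
| Kurtosis | 4.3746 |
| Mean | 437300 |
| MAD | 375820 |
| Skewness | 2.0709 |
| Sum | 109326001 |
| Variance | 2.7668e+11 |
| Memory size | 2.0 KiB |

| Value | Count | Frequency (%) |
|---|---|---|
| 188991 | 1 | 0.4% |
| 172690 | 1 | 0.4% |
| 1476783 | 1 | 0.4% |
| 247470 | 1 | 0.4% |
| 1536173 | 1 | 0.4% |
| 192683 | 1 | 0.4% |
| 952490 | 1 | 0.4% |
| 577192 | 1 | 0.4% |
| 294566 | 1 | 0.4% |
| 135845 | 1 | 0.4% |

Smallest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 1581 | 1 | 0.4% |
| 1834 | 1 | 0.4% |
| 2928 | 1 | 0.4% |
| 3432 | 1 | 0.4% |
| 3619 | 1 | 0.4% |

Largest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 2284977 | 1 | 0.4% |
| 2395509 | 1 | 0.4% |
| 2416378 | 1 | 0.4% |
| 2497293 | 1 | 0.4% |
| 2656298 | 1 | 0.4% |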

Sample:

| (index) | views | text | tags | date_asked |
|---|---|---|---|---|
| 1132941 | 103641 | “Least Astonishment” and the Mutable Default A... | (python, language-design, least-astonishment) | 2009-07-15 18:00:37Z |
| 15112125 | 86933 | How do I test multiple variables against a value? | (python, if-statement, comparison, match, bool... | 2013-02-27 12:26:23Z |
| 509211 | 996467 | Understanding Python's slice notation | (python, list, slice) | 2009-02-03 22:31:02Z |
| 23294658 | 204753 | Asking the user for input until they give a va... | (python, validation, loops, python-3.x, user-i... | |
| 240178 | 15192 | List of lists changes reflected across sublist... | (python, list, nested-lists, mutable) | 2008-10-27 14:57:22Z |


## Clean Data More
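
One cleaning step that could fit here is turning the `date_asked` strings into real datetimes while keeping the missing values. This is a minimal sketch using only the standard library; the helper name `parse_date_asked` is hypothetical, and the format string assumes every non-missing value looks like the `2009-07-15 18:00:37Z` sample printed from `dataset[0]`.

```python
from datetime import datetime


def parse_date_asked(raw):
    """Parse a 'YYYY-MM-DD HH:MM:SSZ' string; pass None through (hypothetical helper)."""
    if raw is None:
        return None
    # the trailing Z is matched literally, so the result is a naive datetime
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%SZ")


# one real sample value from the scraped data plus one missing value
sample_dates = ["2009-07-15 18:00:37Z", None]
parsed = [parse_date_asked(d) for d in sample_dates]
print(parsed)
```

Keeping `None` for missing dates leaves the decision about dropping or imputing those rows for later.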



### Exploratory Data Analysis: Word Cloud

Natural language processing is its own special area.  One of the first things people often do is make a word cloud.  To do that, we first need to count how often each word appears in the question titles.

In [None]:
from collections import Counter

In [None]:
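# note: a dict literal keeps only the last value for a repeated key,
# so this is the same as Counter({'k': 15})
c = Counter({'k': 12, 'k': 15})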

c

In [None]:
for i, row in df.iterrows():
    print(i, row['text'].split(" "))
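
The counting step behind a word cloud can be sketched with `Counter` alone. This sketch assumes a plain list of titles (three are copied from the output above) instead of the full `df`, and the lowercase-and-split tokenization is a simplification, not a proper NLP pipeline.

```python
from collections import Counter

# three question titles copied from the scraped data
titles = [
    "Understanding Python's slice notation",
    "Accessing the index in Python 'for' loops",
    "Adding Python Path on Windows 7",
]

word_counts = Counter()
for title in titles:
    # naive tokenization: lowercase, then split on whitespace
    word_counts.update(title.lower().split())

print(word_counts.most_common(3))
```

Note that `python's` and `python` count as different words here, which is the kind of issue a real tokenizer or a stop-word list would need to handle.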

### Exploratory Data Analysis: Tag Networks

In [None]:
# make another data frame
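
One possible starting point for a tag network is an edge list of tag pairs that appear on the same question. This is a minimal sketch, assuming tag tuples shaped like the ones in `df['tags']` (the three below are copied from that output). It drops the `python` tag, since every question in this dataset carries it.

```python
from collections import Counter
from itertools import combinations

# tag tuples copied from the df['tags'] output
tag_rows = [
    ("python", "list", "slice"),
    ("python", "list", "nested-lists", "mutable"),
    ("python", "list", "copy", "clone"),
]

edge_counts = Counter()
for tags in tag_rows:
    # skip the tag shared by every row; sort so (a, b) and (b, a) collapse
    others = sorted(t for t in tags if t != "python")
    edge_counts.update(combinations(others, 2))

print(edge_counts.most_common(3))
```

The resulting counts could later be loaded into a DataFrame or a graph library as weighted edges.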