# Scraping Raw Data From Stack Overflow

In [1]:
# get inline, interactive plots
%matplotlib notebook

# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2 

## Identify the data

Data Source: All the questions on Stack Overflow that have the "Python" tag on them.

--> Question: Where exactly does all of this data live? What URL structure can we use to 
acquire all of this data?  Go look at Stack Overflow.

--> Task: Work the URL into a formattable string you can feed into a scraper.

In [2]:
SO_URL_format = "https://stackoverflow.com/questions/tagged/python?page={0}&sort=frequent&pagesize=50"
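
As a quick sanity check of the template, here is a minimal sketch that expands it into concrete page URLs. The helper name `build_page_urls` is illustrative and not part of the original notebook.

```python
SO_URL_format = "https://stackoverflow.com/questions/tagged/python?page={0}&sort=frequent&pagesize=50"


def build_page_urls(url_format, first_page, last_page):
    """Return one URL per page number in [first_page, last_page]."""
    return [url_format.format(page) for page in range(first_page, last_page + 1)]


# Expand the template for the first three result pages and inspect them.
urls = build_page_urls(SO_URL_format, 1, 3)
for url in urls:
    print(url)
```

Printing the URLs and opening one in a browser is a cheap way to confirm the `page` parameter behaves as expected before any scraping starts.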

## Acquire the raw data

In [11]:
number_of_pages_to_gather = 50

# create sequence of pages
page_range = range(1, number_of_pages_to_gather + 1)

You may have noticed while formatting the URL that there are actually thousands of pages of Python questions, but here we are only collecting 50. This is intentional and temporary. Eventually we should collect the entire corpus, but right now we are prototyping a workflow, so we temporarily "downsample" to work faster.  In fact, 50 is still pretty high.  Let's kick it down to 5 pages.  That way we still write code that has to handle multiple files (as opposed to just one) without adding much computing time.

In [13]:
page_range = range(1, 6)  

In [14]:
import requests 

for i in page_range:
    print("you don't have wifi right now so don't erase ur data")
#     so_response = requests.get(SO_URL_format.format(i))
#     if so_response.status_code == 200:
#         html_file = open('FILENAME_00{0}.html'.format(i),'w')
#         html_file.write(so_response.text)
#         html_file.close()
#     else:
#         raise RuntimeError("page {0} returned HTTP {1}".format(i, so_response.status_code))

# TODO: make this code block more functional

you don't have wifi right now so don't erase ur data
you don't have wifi right now so don't erase ur data
you don't have wifi right now so don't erase ur data
you don't have wifi right now so don't erase ur data
you don't have wifi right now so don't erase ur data
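
One way to address the TODO above is to wrap the download loop in a function. The sketch below is only an assumption about what "more functional" might mean here: `save_pages`, its `fetch` parameter, and `out_dir` are hypothetical names, and the file naming simply mirrors the `FILENAME_00{0}.html` pattern used in this notebook. Passing the fetcher in as an argument makes it possible to exercise the logic without a network connection.

```python
import os
import tempfile


def save_pages(url_format, pages, fetch, out_dir="."):
    """Fetch each page and write it to out_dir; return the paths written.

    fetch is any callable that maps a URL to a (status_code, text) pair.
    """
    written = []
    for page in pages:
        status_code, text = fetch(url_format.format(page))
        if status_code != 200:
            raise RuntimeError("page {0} returned HTTP {1}".format(page, status_code))
        path = os.path.join(out_dir, "FILENAME_00{0}.html".format(page))
        with open(path, "w") as html_file:
            html_file.write(text)
        written.append(path)
    return written


# With a connection, a thin wrapper around requests could serve as fetch:
# def fetch_with_requests(url):
#     response = requests.get(url)
#     return response.status_code, response.text


# Offline stand-in for a real HTTP call, used only for this demonstration.
def fake_fetch(url):
    return 200, "<html><body>{0}</body></html>".format(url)


demo_dir = tempfile.mkdtemp()
paths = save_pages("https://example.com/?page={0}", range(1, 4), fake_fetch, demo_dir)
print(paths)
```

Because the function returns the list of written paths, the later parsing step could iterate over that list instead of rebuilding file names by hand.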


## Mash Until No Good -- Data Munging/Cleaning/Wrangling

> Bad programmers spend their time thinking about the code.  Good programmers
> spend their time thinking about data structures. 

> Linus Torvalds

> I be in the kitchen whipping
>
> trying to cook the sauce.
>
>   Yo Gotti

We are now going to begin whipping this data into shape for various levels of analysis, since it's hard to do 
analysis on a bunch of data locked up in an HTML structure.  

#### Extracting the maximum number of dimensions from the data

Look at the Stack Overflow page and think about what our granular data points are.  For the pages
we have decided to 

< INSERT PICTURE OF SO PAGE HERE >

The granular logical data point is a question.  So what are the dimensions/attributes of a question object?
- question text
- vote score
- views 
- details
- author
- question details 

Beautiful Soup parses the HTML into a Python tree structure (DOM).  You can then use a variety of BS4 methods to extract specific HTML elements based on their HTML attributes.  Since classes and IDs are HTML attributes, you can use CSS selectors to extract information.
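
A minimal sketch of those two lookup styles, run against a hand-written HTML snippet. The snippet only imitates the class names this notebook relies on (`question-summary`, `question-hyperlink`, `post-tag`, `views`); it is not real Stack Overflow markup, and the `question-summary-<id>` id format is an assumption.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that imitates the structure scraped in this notebook.
sample_html = """
<div class="question-summary" id="question-summary-101">
  <div class="views" title="1,234 views">1k views</div>
  <a class="question-hyperlink" href="/q/101">How do I reverse a list?</a>
  <a class="post-tag">python</a>
  <a class="post-tag">list</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find / find_all filter on the tag name and on attributes such as class.
summary = soup.find("div", class_="question-summary")
title = summary.find("a", class_="question-hyperlink").text
tags = [tag.text for tag in summary.find_all("a", class_="post-tag")]

# select_one accepts a CSS selector and returns the first match.
views_title = soup.select_one("div.question-summary div.views")["title"]
views = int(views_title.split(" ")[0].replace(",", ""))

# Attributes of a tag are read with dictionary-style access.
question_id = summary["id"].split("-")[2]

print(question_id, title, tags, views)
```

Both styles reach the same elements; `find_all` tends to read well for single-attribute lookups, while CSS selectors are convenient when the target is nested several levels deep.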

In [None]:
# insert quick BS4 demo before doing the real code in the next block

OK, let's actually get the question text, vote score, views, etc. out of the data.

In [7]:
from bs4 import BeautifulSoup

In [43]:
# for i in page_range:
#     file_name = 'FILENAME_00{0}.html'.format(i)
#     print(file_name)
    
#     with open(file_name,'r') as f:
#         soup = BeautifulSoup(f.read(), 'html.parser')
        
#         # for each file, get all of the question objects
#         questions = soup.find_all("div", class_="question-summary")
        
#         for question in questions:
#             text = question.find('a', class_="question-hyperlink").text
#             tags = [tag.text for tag in question.find_all('a', class_="post-tag")]
#             views = int(question.find('div', class_="views")['title'].split(" ")[0].replace(",", ""))
#             # TODO: also extract date_asked and author
#             print(text)
#             print(tags)

OK, so we have figured out how to get at the data with Beautiful Soup above.  Let's wrap all of that logic into a function that accepts the path to an HTML file as an argument and returns a sequence of question objects, each containing all of the attributes.  Each of these will become a row in a Pandas DataFrame.

In [45]:
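def extract_question_objects(relative_html_path):
    """
    Parse one saved Stack Overflow results page.

    :relative_html_path: relative path to a saved HTML page
    :returns: list of single-key dicts of the form {question_id: [text, tags, views]}
    """
    questions = []

    # the context manager closes the file even if reading or parsing raises
    with open(relative_html_path, 'r') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')

    question_divs = soup.find_all("div", class_="question-summary")

    for question in question_divs:
        # the third dash-separated piece of the div id is the question id
        question_id = question['id'].split("-")[2]
        text = question.find('a', class_="question-hyperlink").text
        tags = [tag.text for tag in question.find_all('a', class_="post-tag")]
        # the first word of the title attribute is the view count; drop commas and cast
        views = int(question.find('div', class_="views")['title'].split(" ")[0].replace(",", ""))

        questions.append({question_id: [text, tags, views]})

    return questions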

In [46]:
dataset = []

for i in page_range:
    filename = "FILENAME_00{}.html".format(i)
    
    qs = extract_question_objects(filename)
    
    for q in qs:
        dataset.append(q)


# print(dataset)

In [51]:
for i in dataset:
    print(i)

{u'1132941': [u'\u201cLeast Astonishment\u201d and the Mutable Default Argument', [u'python', u'language-design', u'least-astonishment'], 103225]}
{u'15112125': [u'How do I test multiple variables against a value?', [u'python', u'if-statement', u'comparison', u'match', u'boolean-logic'], 86153]}
{u'509211': [u"Understanding Python's slice notation", [u'python', u'list', u'slice'], 990799]}
{u'23294658': [u'Asking the user for input until they give a valid response', [u'python', u'validation', u'loops', u'python-3.x', u'user-input'], 202609]}
{u'240178': [u'List of lists changes reflected across sublists unexpectedly', [u'python', u'list', u'nested-lists', u'mutable'], 15103]}
{u'2612802': [u'How to clone or copy a list?', [u'python', u'list', u'copy', u'clone'], 854693]}
{u'1373164': [u'How do I create a variable number of variables?', [u'python', u'variable-variables'], 74766]}
{u'312443': [u'How do you split a list into evenly sized chunks?', [u'python', u'list', u'split', u'chunks']

## Exploratory Data Analysis

In [52]:
import pandas as pd

df = pd.DataFrame(dataset)


In [53]:
df

Unnamed: 0,100003,1009860,101268,10434599,104420,1059559,107705,1101750,110259,11269575,...,9189172,9264763,931092,952914,9535954,972,986006,988228,9884132,9942594
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,"[How do I pass a variable by reference?, [pyth...",,,
9,,,,,,,,,,,...,,,,,,,,,,
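
If `dataset` is handed to `pd.DataFrame` as-is, every question id becomes its own column, because each record is a dict with a single id key; that yields a wide, mostly empty frame like the one above. Here is a minimal sketch of one alternative: flatten each record into a dict with fixed keys so that each question becomes one row. The helper `flatten_records` and the column names are illustrative choices, and the two sample records are copied from the printed output earlier in this notebook.

```python
def flatten_records(records):
    """Turn [{id: [text, tags, views]}, ...] into one flat dict per question."""
    rows = []
    for record in records:
        for question_id, (text, tags, views) in record.items():
            rows.append({"id": question_id, "text": text, "tags": tags, "views": views})
    return rows


# Two records in the same shape that extract_question_objects returns.
sample = [
    {"509211": ["Understanding Python's slice notation", ["python", "list", "slice"], 990799]},
    {"2612802": ["How to clone or copy a list?", ["python", "list", "copy", "clone"], 854693]},
]

rows = flatten_records(sample)
for row in rows:
    print(row)

# With pandas available, pd.DataFrame(rows) would then give one row per
# question with the columns id, text, tags, and views.
```

Another option would be to have the extraction function return flat dicts directly, which would make this reshaping step unnecessary.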
