# Scraping Raw Data From Stack Overflow

Systematically extracting business intelligence from data.

In [1]:
# render plots inline in the notebook
%matplotlib inline

# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2 

# What is our question?

Essentially: What are the biggest problem areas for Python programmers?

# Identify the data (Stack Overflow)

Data Source: All the questions on Stack Overflow that have the "Python" tag on them.

Question: Where exactly does all of this data live -- what is the URL structure we can use to 
acquire all of this user data?

Task: Work the URL into a formattable string template you can feed into a scraper.

In [2]:
SO_URL = "https://stackoverflow.com/questions/tagged/python?page={0}&sort=frequent&pagesize=50"

## Acquire the raw data

In [3]:
number_of_pages_to_gather = 50

# create the sequence of page numbers to request
page_range = range(1, number_of_pages_to_gather + 1)

You may have noticed when we were formatting the URL that there are actually thousands of pages of Python questions, but here we're only collecting 50. This is intentional and temporary. Eventually we should collect the entire corpus, but right now we are prototyping a workflow, so we temporarily **downsample** to iterate more quickly.

As a matter of fact, 50 is still pretty high. Let's kick it down to 5 pages. That way we are still writing code that has to handle multiple files (as opposed to just one), without adding a lot of computing time.

In [4]:
page_range = range(1, 6)  
# wrap in list() so the numbers themselves are printed
print(list(page_range))

[1, 2, 3, 4, 5]


In [5]:
# TODO: MAKE A PAGE URLS GENERATOR (argument: array of numbers, yields URLs )
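
One way to fill in the TODO above is a small generator. This is only a sketch: the name `page_urls` and its `template` parameter are assumptions for illustration and are not part of the original notebook. The URL template is the same `SO_URL` defined earlier.

```python
# Same template as SO_URL above; {0} is replaced by the page number.
SO_URL = "https://stackoverflow.com/questions/tagged/python?page={0}&sort=frequent&pagesize=50"


def page_urls(page_numbers, template=SO_URL):
    """Yield one formatted URL per page number (hypothetical helper)."""
    for page_number in page_numbers:
        yield template.format(page_number)


# Generators are lazy: a URL is only built when the consumer asks for it.
urls = list(page_urls(range(1, 6)))
print(urls[0])
```

A generator keeps the page numbers and the URL format in one place, so switching from 5 pages to the full corpus later only means passing a larger range.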

In [7]:
import requests 

def http_get(url):
    """Fetch a URL and return the requests.Response object."""
    response = requests.get(url)
    return response

#
#  Once you have the data, it can be helpful to comment the following loop out.
#
for i in page_range:
    so_response = http_get(SO_URL.format(i))

    if so_response.status_code == 200:
        # write the page as UTF-8 text; the context manager closes the file
        with open('FILENAME_00{0}.html'.format(i), 'w', encoding='utf-8') as html_file:
            html_file.write(so_response.text)
    else:
        print("Failed at loop: ", i)

## Mash Until No Good! Data Munging/Wrangling/Transforming

> Bad programmers worry about the code. 
>
> Good programmers worry about data structures and their relationships.
>
> Linus Torvalds, creator of Linux and git

> I be in the kitchen whipping
>
> trying to cook the sauce.
>
>   Yo Gotti, _The Art of the Hustle_

We are now going to begin whipping this data into shape for various levels of analysis. It's hard to do
analysis on a bunch of data locked up in an HTML structure.

#### Extracting the maximum number of dimensions from the data

Look at the Stack Overflow page and think about what our granular data points are for the pages
we have decided to scrape.

< INSERT PICTURE OF SO PAGE HERE >

The granular logical data point is a question.  So what are the dimensions/attributes of a question object?
- question text
- vote score
- views 
- author
- question details

Beautiful Soup parses the HTML into a Python tree structure (much like the DOM). You can then use a variety of BS4 methods to extract specific HTML elements based on their HTML attributes. Since classes and IDs are HTML attributes, you can use CSS selectors to extract information.

In [None]:
# insert quick BS4 demo before doing the real code in the next block
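
As a stand-in for that demo, here is a minimal sketch on a hand-written HTML snippet. The markup below is invented for illustration and only mimics the class names used later in this notebook (`question-summary`, `question-hyperlink`, `post-tag`); it is not copied from a real Stack Overflow page.

```python
from bs4 import BeautifulSoup

# Invented sample markup (an assumption, not real Stack Overflow HTML)
sample_html = """
<div class="question-summary" id="question-summary-101">
  <a class="question-hyperlink" href="#">How do I reverse a list?</a>
  <a class="post-tag">python</a>
  <a class="post-tag">list</a>
</div>
<div class="question-summary" id="question-summary-102">
  <a class="question-hyperlink" href="#">What does yield do?</a>
  <a class="post-tag">python</a>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# find_all: match elements by tag name and class attribute
summaries = soup.find_all('div', class_='question-summary')
titles = [div.find('a', class_='question-hyperlink').text for div in summaries]

# select: the same kind of lookup written as a CSS selector
tags = [a.text for a in soup.select('div.question-summary a.post-tag')]

# attributes of an element are read like dictionary keys
ids = [div['id'] for div in summaries]

print(titles)
print(tags)
print(ids)
```

`find_all` and `select` overlap in what they can do; `find_all` reads naturally for a single tag-plus-class lookup, while `select` is handy when the condition spans nested elements.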

OK, let's actually get the question text, vote score, views, etc. out of the data.

In [8]:
from bs4 import BeautifulSoup

In [10]:
for i in page_range:
    file_name = 'FILENAME_00{0}.html'.format(i)
    print(file_name)

    with open(file_name, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')

    # for each file, get all of the question objects
    questions = soup.find_all("div", class_="question-summary")

    for question in questions:
        # the question title is the text of its hyperlink
        text = question.find('a', class_="question-hyperlink").text
        # every tag on the question is its own "post-tag" link
        tags = [tag.text for tag in question.find_all('a', class_="post-tag")]
        # keep the first token of the title attribute and drop the commas
        views = int(question.find('div', class_="views")['title'].split(" ")[0].replace(",", ""))
        # print(tags)

FILENAME_001.html
FILENAME_002.html
FILENAME_003.html
FILENAME_004.html
FILENAME_005.html
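
The view-count expression in the loop above packs several steps into one line. Unpacked into a helper, the steps are: take the first space-separated token, strip the thousands separators, and convert to `int`. The name `parse_view_count` and the sample strings are assumptions for illustration; the real `title` attribute may be worded differently.

```python
def parse_view_count(title_text):
    """Turn a string such as '1,234,567 views' into an integer."""
    first_token = title_text.split(" ")[0]      # e.g. '1,234,567'
    digits_only = first_token.replace(",", "")  # e.g. '1234567'
    return int(digits_only)


print(parse_view_count("1,234,567 views"))
print(parse_view_count("42 views"))
```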


OK, so we have figured out how to get at the data with Beautiful Soup above.  Let's wrap all of that logic into a **function** that accepts an HTML file as an argument and returns a sequence of question objects -- each object will contain all of the attributes.  Each of these will become a row in a Pandas DataFrame.

In [None]:
def get_question_info_from_summary(summary_div):
    """Return [qid, text, tags, views, date_asked] for one question-summary div."""
    # the div id is split on "-" and the third piece is used as the question id
    qid = summary_div['id'].split("-")[2]
    text = summary_div.find('a', class_="question-hyperlink").text
    tags = [tag.text for tag in summary_div.find_all('a', class_="post-tag")]
    views = int(summary_div.find('div', class_="views")['title'].split(" ")[0].replace(",", ""))

    # the date isn't always there
    date = summary_div.find('span', class_='relativetime')
    date_asked = date['title'] if date else None

    return [qid, text, tags, views, date_asked]


def extract_question_objects(relative_html_path):
    """
        :relative_html_path: file to read and parse for stack overflow questions
        :returns: list of question records, one per question in the file
    """
    questions_objects = []
    with open(relative_html_path, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')

    question_divs = soup.find_all("div", class_="question-summary")

    for question in question_divs:
        q_info = get_question_info_from_summary(question)
        questions_objects.append(q_info)

    return questions_objects

In [None]:
dataset = []

for i in page_range:
    filename = "FILENAME_00{}.html".format(i)
    
    qs = extract_question_objects(filename)
    
    # extend with the list of records so each question stays one row
    dataset.extend(qs)


# print(dataset)

In [None]:
print(len(dataset))

OK, this raw data looks pretty clean.  It's time to explore.

## Exploratory Data Analysis

Enter pandas, the lingua franca of data analysis.  It works very well with tabular data.

In [None]:
import pandas as pd

# each record is [qid, text, tags, views, date_asked]
columns = ['qid', 'text', 'tags', 'views', 'date_asked']
all_fields = pd.DataFrame(dataset, columns=columns)

# store tags as tuples so the column holds hashable values
all_fields['tags'] = all_fields['tags'].apply(tuple)

# index by question id and keep the three columns explored below
df = all_fields.set_index('qid')[['views', 'text', 'tags']]



In [None]:
df.columns

In [None]:
df.index

In [None]:
df

### Exploratory Data Analysis: Profiling

In [None]:
import pandas_profiling

In [None]:
# keep a reference to the report so it can be inspected in the next cell
profile = pandas_profiling.ProfileReport(df)
profile

In [None]:
dir(profile)

### Exploratory Data Analysis: Word Cloud

Natural language processing is its own special area. One of the first things people often do is make a word cloud. To do that, we first need to count how often each word appears.

In [None]:
from collections import Counter

In [None]:
# a dict literal keeps only the last value for a repeated key,
# so this builds Counter({'k': 15})
c = Counter({'k': 12, 'k': 15})

c

In [None]:
# split each question title into words
for qid, row in df.iterrows():
    print(qid, row['text'].split(" "))
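
The word counts that a word cloud needs can be sketched with `Counter`. The titles below are made-up stand-ins for the `text` column, and lowercasing is an assumption about how words should be normalized. With the real data you would iterate over `df['text']` instead.

```python
from collections import Counter

# Made-up question titles standing in for df['text']
titles = [
    "How do I sort a list in Python",
    "How do I reverse a list",
    "What does yield do in Python",
]

word_counts = Counter()
for title in titles:
    # lowercase so "How" and "how" are counted together
    word_counts.update(title.lower().split(" "))

# the most frequent words would be drawn largest in a word cloud
print(word_counts.most_common(3))
```

Common filler words such as "do" and "a" dominate raw counts, so a real word cloud would usually filter those out before plotting.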

### Exploratory Data Analysis: Tag Networks

In [None]:
# make another data frame
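
A tag network starts from counting which tags appear together on the same question. The following sketches that first step only, assuming each question's tags are available as a tuple (as in the `tags` column). The sample tag tuples are invented, and pairing tags with `itertools.combinations` is one possible approach, not necessarily the one intended here.

```python
from collections import Counter
from itertools import combinations

# Invented tag tuples standing in for df['tags']
question_tags = [
    ("python", "list", "sorting"),
    ("python", "list"),
    ("python", "generator"),
]

edge_counts = Counter()
for tags in question_tags:
    # sort so ("list", "python") and ("python", "list") count as the same edge
    for pair in combinations(sorted(tags), 2):
        edge_counts[pair] += 1

# each (tag_a, tag_b) pair is an edge; the count is its weight
for (tag_a, tag_b), weight in edge_counts.most_common():
    print(tag_a, tag_b, weight)
```

Each printed row (two tags and a weight) could then become one row of the second data frame.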