# Scraping Raw Data From Stack Overflow

Systematically extracting business intelligence from data.

In [1]:
# get inline, interactive plots
# %matplotlib inline

# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2 

# What is our question?

Essentially: What are the biggest problem areas for Python programmers?

# Identify the data (Stack Overflow)

Data Source: All the questions on Stack Overflow that have the "Python" tag on them.

Question: Where exactly does all of this data live -- what is the URL structure we can use to 
acquire all of this user data?

Task: Work the URL into a formattable string you can feed into a scraper.

In [2]:
SO_URL_format = "https://stackoverflow.com/questions/tagged/python?page={0}&sort=frequent&pagesize=50"

## Acquire the raw data

In [3]:
number_of_pages_to_gather = 50

# create the sequence of page numbers
page_range = range(1, number_of_pages_to_gather + 1)

In [4]:
# override: only work with the first 5 pages for now
page_range = range(1, 6)

In [5]:
# TODO: MAKE A PAGE URLS GENERATOR (argument: array of numbers, yields URLs )
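
One possible way to fill in this TODO is sketched below. The function name `page_urls` and its parameters are placeholders chosen for illustration, not part of the original notebook; the format string is the one defined in cell 2.

```python
def page_urls(url_format, page_numbers):
    """Yield one listing-page URL for each page number."""
    for page_number in page_numbers:
        yield url_format.format(page_number)


# Same format string as SO_URL_format above, repeated so the sketch runs on its own.
SO_URL_format = "https://stackoverflow.com/questions/tagged/python?page={0}&sort=frequent&pagesize=50"

for url in page_urls(SO_URL_format, range(1, 4)):
    print(url)
```

Because this is a generator, URLs are produced lazily, one per iteration, so the same function works for 5 pages or 5,000.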

In [6]:
import requests 

def http_get(URL):
    response = requests.get(URL)
    return response

#
#  Once you have the data, it can be helpful to comment the following loop out.
#
# for i in page_range:
#     print("you don't have wifi right now so don't erase ur data")
#     so_page_url = SO_URL_format.format(i)
#
#     so_response = http_get(so_page_url)
#
#     if so_response.status_code == 200:
#         html_file = open('FILENAME_00{0}.html'.format(i), 'w')
#         html_file.write(so_response.text)
#         html_file.close()
#     else:
#         raise RuntimeError("GET {0} returned {1}".format(so_page_url, so_response.status_code))
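
An alternative to commenting the loop in and out is to skip any page that is already on disk. The sketch below is one way to do that, under some assumptions: `download_pages`, `fetch`, `out_dir`, and `delay_seconds` are hypothetical names, and `fetch` is any callable that returns an object with `.status_code` and `.text` (for example `requests.get` or the `http_get` helper above).

```python
import io
import os
import time


def download_pages(page_numbers, url_format, fetch, out_dir=".", delay_seconds=1.0):
    """Save each listing page to disk, skipping pages that were already saved.

    Returns the list of file paths written during this call.
    """
    saved = []
    for i in page_numbers:
        path = os.path.join(out_dir, "FILENAME_00{0}.html".format(i))
        if os.path.exists(path):
            continue  # already downloaded; do not hit the network again

        response = fetch(url_format.format(i))
        if response.status_code != 200:
            raise RuntimeError(
                "GET page {0} failed with status {1}".format(i, response.status_code)
            )

        with io.open(path, "w", encoding="utf-8") as html_file:
            html_file.write(response.text)
        saved.append(path)

        time.sleep(delay_seconds)  # pause between requests to be polite to the server
    return saved
```

With this in place, a call such as `download_pages(page_range, SO_URL_format, http_get)` can stay in the notebook: re-running the cell only fetches pages that are missing.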

## Mash Until No Good! Data Munging/Wrangling/Transforming

> Bad programmers worry about the code. 
>
> Good programmers worry about data structures and their relationships.
>
> Linus Torvalds, creator of Linux and git

> I be in the kitchen whipping
>
> trying to cook the sauce.
>
>   Yo Gotti, _The Art of the Hustle_

We are now going to begin whipping this data into shape for various levels of analysis. It's hard to do 
analysis on a bunch of data locked up in an HTML structure.

#### Extracting the maximum number of dimensions from the data

Look at the Stack Overflow page and think about what our granular data points are.  For the pages
we have decided to 

< INSERT PICTURE OF SO PAGE HERE >

The granular logical data point is a question.  So what are the dimensions/attributes of a question object?
- question text
- vote score
- views 
- details
- author
- question details 

Beautiful Soup parses the HTML into a Python tree structure (DOM).  You can then use a variety of BS4 methods to extract specific HTML elements based on their HTML attributes.  Since classes and IDs are HTML attributes, you can also use CSS selectors to extract information.

In [7]:
# insert quick BS4 demo before doing the real code in the next block
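
A quick demo along these lines might look like the following. The HTML snippet is a hand-written stand-in that mimics one question summary (the class names match the ones used in the real code below, and the values are borrowed from one question in the dataset); the real page contains far more markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a single question summary.
html = """
<div class="question-summary" id="question-summary-509211">
  <div class="views" title="992,325 views">992k views</div>
  <a class="question-hyperlink" href="#">Understanding Python's slice notation</a>
  <a class="post-tag">python</a>
  <a class="post-tag">list</a>
  <a class="post-tag">slice</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element; find_all() returns every match.
question = soup.find("div", class_="question-summary")
title = question.find("a", class_="question-hyperlink").text
tags = [a.text for a in question.find_all("a", class_="post-tag")]

# select() takes a CSS selector instead of keyword filters.
tags_via_css = [a.text for a in soup.select("div.question-summary a.post-tag")]

# HTML attributes are read with dictionary-style access.
views = int(question.find("div", class_="views")["title"].split(" ")[0].replace(",", ""))
qid = question["id"].split("-")[2]

print(qid, title, tags, views)
```

`find_all` with `class_=` and `select` with a CSS selector reach the same elements here; which one to use is mostly a matter of taste.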

OK, let's actually get the question text, vote score, views, etc. out of the data.

In [8]:
from bs4 import BeautifulSoup

In [9]:
for i in page_range:
    file_name = 'FILENAME_00{0}.html'.format(i)
    print(file_name)

    with open(file_name, 'r') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')

    # for each file, get all of the question objects
    questions = soup.find_all("div", class_="question-summary")

    for question in questions:
        text = question.find('a', class_="question-hyperlink").text
        tags = [tag.text for tag in question.find_all('a', class_="post-tag")]
        # the title attribute starts with the comma-separated view count
        views = int(question.find('div', class_="views")['title'].split(" ")[0].replace(",", ""))
        print(tags)

FILENAME_001.html
[u'python', u'language-design', u'least-astonishment']
[u'python', u'if-statement', u'comparison', u'match', u'boolean-logic']
[u'python', u'list', u'slice']
[u'python', u'validation', u'loops', u'python-3.x', u'user-input']
[u'python', u'list', u'nested-lists', u'mutable']
[u'python', u'list', u'copy', u'clone']
[u'python', u'variable-variables']
[u'python', u'list', u'split', u'chunks']
[u'python', u'reference', u'parameter-passing', u'pass-by-reference']
[u'python', u'iteration']
[u'python', u'list', u'multidimensional-array', u'flatten']
[u'python', u'pandas']
[u'python', u'python-2.7', u'python-3.x', u'int']
[u'python', u'iterator', u'generator', u'yield', u'coroutine']
[u'python', u'scope', u'dynamic-languages']
[u'python', u'syntax', u'parameter-passing', u'identifier', u'kwargs']
[u'python', u'shell', u'command', u'subprocess', u'external']
[u'python', u'module', u'namespaces', u'main', u'idioms']
[u'python', u'global-variables', u'scope']
[u'python', u'sortin

[u'python', u'mongodb', u'pandas', u'hdf5', u'large-data']
[u'python', u'python-3.x', u'python-import']
[u'python', u'generator']
[u'python', u'python-3.x', u'scope', u'list-comprehension', u'python-internals']
[u'python', u'file-read']
[u'python', u'web-scraping', u'urlopen']
[u'python', u'break', u'control-flow']
[u'python', u'text-files', u'line-count']
[u'python', u'memoization']
[u'python', u'dictionary']
[u'python', u'selenium', u'firefox', u'selenium-firefoxdriver', u'geckodriver']
[u'python', u'math', u'syntax', u'operators']
[u'python', u'subprocess']
FILENAME_005.html
[u'python', u'function', u'lambda', u'closures']
[u'c++', u'python', u'c']
[u'python', u'oop']
[u'python', u'object', u'iterator']
[u'python', u'timer']
[u'python']
[u'python', u'python-2.7', u'locale', u'ipython', u'ipython-notebook']
[u'python', u'io']
[u'python', u'multithreading', u'timeout', u'subprocess']
[u'python', u'pip']
[u'python', u'list']
[u'python', u'file', u'copy', u'filesystems', u'copyfile']
[u

OK, so we have figured out how to get at the data with Beautiful Soup above.  Let's wrap all of that logic into a function that accepts the path to an HTML file as an argument and returns a sequence of question objects, each containing the attributes we extracted.  Each of these will become a row in a Pandas DataFrame.

In [10]:
def extract_question_objects(relative_html_path):
    """
    Parse a saved Stack Overflow listing page.

    :relative_html_path: file to read and parse for stack overflow questions
    :returns: list of [qid, views, text, tags] lists, one per question
    """
    questions = []

    with open(relative_html_path, 'r') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')

    question_divs = soup.find_all("div", class_="question-summary")

    for question in question_divs:
        # the div id ends with the question id; split("-")[2] picks it out
        qid = question['id'].split("-")[2]
        text = question.find('a', class_="question-hyperlink").text
        tags = [tag.text for tag in question.find_all('a', class_="post-tag")]
        views = int(question.find('div', class_="views")['title'].split(" ")[0].replace(",", ""))

        # extracted but not stored yet; add it to the row below if it is needed
        date = question.find('span', class_='relativetime')
        date_asked = date['title'] if date else None

        questions.append([qid, views, text, tags])

    return questions

In [11]:
dataset = []

for i in page_range:
    filename = "FILENAME_00{}.html".format(i)
    
    qs = extract_question_objects(filename)
    
    for q in qs:
        dataset.append(q)


# print(dataset)

In [12]:
print(len(dataset))

250


OK, this raw data looks pretty clean.  It's time to explore.

## Exploratory Data Analysis

In [13]:
import pandas as pd

# one row per question, indexed by question id
df = pd.DataFrame(columns=['views', 'text', 'tags'])

for data in dataset:
    qid, views, text, tags = data
    # store tags as a tuple so each cell holds a single hashable value
    df.loc[qid] = [views, text, tuple(tags)]
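
Appending rows one at a time with `df.loc` grows the frame on every iteration. As a sketch of an alternative, the whole list of rows can be handed to the `DataFrame` constructor at once. The two sample rows below are copied from the notebook's own output and stand in for `dataset`; the names `sample_dataset` and `df_sample` are illustrative only.

```python
import pandas as pd

# Two rows in the same [qid, views, text, tags] layout that dataset uses.
sample_dataset = [
    ["509211", 992325, "Understanding Python's slice notation", ["python", "list", "slice"]],
    ["2612802", 856436, "How to clone or copy a list?", ["python", "list", "copy", "clone"]],
]

df_sample = pd.DataFrame(sample_dataset, columns=["qid", "views", "text", "tags"])

# Tuples are hashable, lists are not; hashable cells are needed for value counts.
df_sample["tags"] = df_sample["tags"].apply(tuple)

df_sample = df_sample.set_index("qid")
print(df_sample)
```

One difference to keep in mind: assigning through `df.loc[qid]` silently overwrites a repeated question id, while this version keeps both rows.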






In [14]:
df

|  | views | text | tags |
| --- | --- | --- | --- |
| 1132941 | 103329 | “Least Astonishment” and the Mutable Default A... | (python, language-design, least-astonishment) |
| 15112125 | 86361 | How do I test multiple variables against a value? | (python, if-statement, comparison, match, bool... |
| 509211 | 992325 | Understanding Python's slice notation | (python, list, slice) |
| 23294658 | 203161 | Asking the user for input until they give a va... | (python, validation, loops, python-3.x, user-i... |
| 240178 | 15132 | List of lists changes reflected across sublist... | (python, list, nested-lists, mutable) |
| 2612802 | 856436 | How to clone or copy a list? | (python, list, copy, clone) |
| 1373164 | 74882 | How do I create a variable number of variables? | (python, variable-variables) |
| 312443 | 534924 | How do you split a list into evenly sized chunks? | (python, list, split, chunks) |
| 986006 | 857645 | How do I pass a variable by reference? | (python, reference, parameter-passing, pass-by... |
| 1207406 | 281090 | Remove items from a list while iterating | (python, iteration) |
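
Since the question is about problem areas, one quick exploratory step is to count how often each tag appears across questions. The sketch below uses a handful of tag tuples copied from the rows above; in the notebook the input would be the `tags` column of `df`. The names `sample_tags` and `tag_counts` are illustrative, and dropping the `python` tag is an assumption made because every scraped question carries it.

```python
from collections import Counter

# A few tag tuples copied from the rows above.
sample_tags = [
    ("python", "language-design", "least-astonishment"),
    ("python", "list", "slice"),
    ("python", "list", "nested-lists", "mutable"),
    ("python", "list", "copy", "clone"),
]

# Count every tag except "python", which appears on every question by construction.
tag_counts = Counter(
    tag for tags in sample_tags for tag in tags if tag != "python"
)

print(tag_counts.most_common(3))
```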


Let's run some preliminary summary statistics.

In [15]:
import pandas_profiling

In [19]:
# keep a reference to the report so it can be inspected later
profile = pandas_profiling.ProfileReport(df)
profile

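Dataset info

| Statistic | Value |
| --- | --- |
| Number of variables | 4 |
| Number of observations | 250 |
| Total Missing (%) | 0.0% |
| Total size in memory | 7.9 KiB |
| Average record size in memory | 32.3 B |

Variable types

| Type | Count |
| --- | --- |
| Numeric | 1 |
| Categorical | 1 |
| Date | 0 |
| Text (Unique) | 2 |
| Rejected | 0 |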

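Variable: index (question ids)

First 3 values: 3906137, 10434599, 890128

Last 3 values: 67631, 1101750, 2386714

First 5 values in sort order

| Value | Count | Frequency (%) |
| --- | --- | --- |
| 100003 | 1 | 0.4% |
| 1009860 | 1 | 0.4% |
| 101268 | 1 | 0.4% |
| 10434599 | 1 | 0.4% |
| 104420 | 1 | 0.4% |

Last 5 values in sort order

| Value | Count | Frequency (%) |
| --- | --- | --- |
| 972 | 1 | 0.4% |
| 986006 | 1 | 0.4% |
| 988228 | 1 | 0.4% |
| 9884132 | 1 | 0.4% |
| 9942594 | 1 | 0.4% |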

Variable: tags

| Statistic | Value |
| --- | --- |
| Distinct count | 224 |
| Unique (%) | 89.6% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |

| Value | Count |
| --- | --- |
| (u'python',) | 8 |
| (u'python', u'python-import') | 4 |
| (u'python', u'dictionary') | 3 |
| Other values (221) | 235 |

Most frequent values

| Value | Count | Frequency (%) |
| --- | --- | --- |
| (u'python',) | 8 | 3.2% |
| (u'python', u'python-import') | 4 | 1.6% |
| (u'python', u'dictionary') | 3 | 1.2% |
| (u'python', u'python-3.x') | 3 | 1.2% |
| (u'python', u'tkinter') | 3 | 1.2% |
| (u'python', u'list') | 3 | 1.2% |
| (u'python', u'subprocess') | 2 | 0.8% |
| (u'python', u'global-variables', u'scope') | 2 | 0.8% |
| (u'python', u'sorting', u'dictionary') | 2 | 0.8% |
| (u'python', u'list-comprehension') | 2 | 0.8% |

Variable: text

First 3 values:
- Difference between append vs. extend list meth...
- 'import module' or 'from module import'
- What is the difference between re.search and r...

Last 3 values:
- Immutable vs Mutable types
- What is memoization and how can I use it in Py...
- Split Strings with Multiple Delimiters?

First 5 values in sort order

| Value | Count | Frequency (%) |
| --- | --- | --- |
| 'import module' or 'from module import' | 1 | 0.4% |
| *args and **kwargs? [duplicate] | 1 | 0.4% |
| Accessing class variables from a list comprehension in the class definition | 1 | 0.4% |
| Accessing the index in Python 'for' loops | 1 | 0.4% |
| Adding Python Path on Windows 7 | 1 | 0.4% |

Last 5 values in sort order

| Value | Count | Frequency (%) |
| --- | --- | --- |
| python open built-in function: difference between modes a, a+, w, w+, and r+? | 1 | 0.4% |
| strange result when removing item from a list [duplicate] | 1 | 0.4% |
| “Large data” work flows using pandas | 1 | 0.4% |
| “Least Astonishment” and the Mutable Default Argument | 1 | 0.4% |
| “is” operator behaves unexpectedly with integers | 1 | 0.4% |

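Variable: views

| Statistic | Value |
| --- | --- |
| Distinct count | 250 |
| Unique (%) | 100.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Mean | 437250 |
| Minimum | 1518 |
| Maximum | 2646110 |
| Zeros (%) | 0.0% |

Quantile statistics

| Statistic | Value |
| --- | --- |
| Minimum | 1518.0 |
| 5-th percentile | 9439.7 |
| Q1 | 88864.0 |
| Median | 217650.0 |
| Q3 | 569760.0 |
| 95-th percentile | 1563000.0 |
| Maximum | 2646110.0 |
| Range | 2644592.0 |
| Interquartile range | 480900.0 |

Descriptive statistics

| Statistic | Value |
| --- | --- |
| Standard deviation | 524800 |
| Coef of variation | 1.2002 |
| Kurtosis | 4.2712 |
| Mean | 437250 |
| MAD | 376920 |
| Skewness | 2.0457 |
| Sum | 109311483 |
| Variance | 2.7542e+11 |
| Memory size | 2.0 KiB |

| Value | Count | Frequency (%) |
| --- | --- | --- |
| 2005503 | 1 | 0.4% |
| 248999 | 1 | 0.4% |
| 242368 | 1 | 0.4% |
| 214206 | 1 | 0.4% |
| 91836 | 1 | 0.4% |
| 150202 | 1 | 0.4% |
| 192113 | 1 | 0.4% |
| 870070 | 1 | 0.4% |
| 415922 | 1 | 0.4% |
| 426673 | 1 | 0.4% |

Minimum 5 values

| Value | Count | Frequency (%) |
| --- | --- | --- |
| 1518 | 1 | 0.4% |
| 1814 | 1 | 0.4% |
| 2902 | 1 | 0.4% |
| 3408 | 1 | 0.4% |
| 3604 | 1 | 0.4% |

Maximum 5 values

| Value | Count | Frequency (%) |
| --- | --- | --- |
| 2276430 | 1 | 0.4% |
| 2387267 | 1 | 0.4% |
| 2401091 | 1 | 0.4% |
| 2484466 | 1 | 0.4% |
| 2646110 | 1 | 0.4% |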

Sample

|  | views | text | tags |
| --- | --- | --- | --- |
| 1132941 | 103329 | “Least Astonishment” and the Mutable Default A... | (python, language-design, least-astonishment) |
| 15112125 | 86361 | How do I test multiple variables against a value? | (python, if-statement, comparison, match, bool... |
| 509211 | 992325 | Understanding Python's slice notation | (python, list, slice) |
| 23294658 | 203161 | Asking the user for input until they give a va... | (python, validation, loops, python-3.x, user-i... |
| 240178 | 15132 | List of lists changes reflected across sublist... | (python, list, nested-lists, mutable) |

In [17]:
dir(profile)

['__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_repr_html_',
 'description_set',
 'file',
 'get_description',
 'get_rejected_variables',
 'html',
 'to_file']

