# Mining Data from the Web with Python

## Purpose
Summarize different strategies for mining data from the web and discuss when to use them so that you can become a more powerful and efficient data analyst.

## Agenda
- Data Collection
- Data Available on the Web
- Downloading Datasets
- Overview of Web APIs
- Python API Wrapper Libraries
- Accessing Web APIs from Python
- Web Scraping
- Summary

## Data Collection
<img style="float: right;" src="images/data_analysis_process.png">
When performing data analysis, analysts will typically follow a process similar to the one on the right.

The focus of this talk is on the data collection step of the data analysis process.

Data collection is "gathering and measuring information on targeted variables in an established system, which then enables one to answer relevant questions and evaluate outcomes." [4]

Data collection is important:
- Significant impact to remaining steps in the data analysis process
- Time intensive; as the old adage goes "data scientists spend 80% of their time finding, cleaning and reorganizing data and only 20% conducting analysis." [3]

## Data Collection Considerations
There are many things to consider when determining how data will be collected for analysis. Not only are there many techniques for collecting data from a source, there may also be many sources of that data.

Below are the aspects to consider prior to collecting data [5]:
- Accuracy: Data providing the most accurate results is desired.
- Reliability: Data that can be collected consistently is desired.
- Time/Cost: Minimal implementation time and cost behind collecting the data is desired.
- Utility: Data that can be easily analyzed is desired.  Data that contributes in a significant way to answering the question is desired.

## Mining Data from the Web
One of the most common sources of data is the internet. The web is often the only source of suitable data for analysis and frequently the most convenient.  More and more data exists on the web every day and more and more of it is easily accessible.

The remainder of this talk focuses on collecting the data that we want to use for analysis. A selection of techniques is discussed to help you better understand your options for implementing the data collection step of your analysis process. The techniques are discussed in order of ease of implementation. For example, downloading a dataset is easier than web scraping.

Web Data Extraction Techniques:
- Downloading Datasets
- Python API Wrapper Libraries
- Accessing Web APIs from Python
- Web Scraping

## Downloading Datasets
Often an analyst's first experience with collecting data from the web will be downloading a CSV file locally and then using it from there:

In [1]:
import pandas as pd
sample_data = pd.read_csv('sample.csv')
sample_data.head()

Unnamed: 0,x,y
0,-3,9
1,-2,4
2,-1,1
3,0,0
4,1,1


This is about as easy as it gets, and it is preferable to the more time-intensive techniques covered later. The approach is not specific to CSV files. Other file formats you might find as downloadable assets include JSON files (.json), HDF5 files (.h5), and Excel files (.xls, .xlsx), to name a few. The pandas library has utilities for reading those file types as well.
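As a minimal sketch of reading more than one format, the snippet below loads the same small table from CSV and from JSON. The in-memory `csv_file` and `json_file` objects are hypothetical stand-ins for files you would have downloaded. In practice you would pass a file path instead, and `pd.read_excel` and `pd.read_hdf` follow the same pattern for Excel and HDF5 files.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for downloaded files; replace these with
# real paths such as 'sample.csv' or 'sample.json'.
csv_file = io.StringIO("x,y\n-3,9\n-2,4\n-1,1\n")
json_file = io.StringIO('[{"x": -3, "y": 9}, {"x": -2, "y": 4}, {"x": -1, "y": 1}]')

# Each reader returns a DataFrame, so downstream analysis code does not
# need to care which format the data arrived in.
from_csv = pd.read_csv(csv_file)
from_json = pd.read_json(json_file)

print(from_csv)
print(from_json)
```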

The issue with this technique is that it is only applicable when someone else has done the legwork to consolidate the data and possibly clean it. It is unlikely you will be able to collect all of your data in this manner if you are performing novel analysis.

## Web APIs
Prior to discussing the remaining techniques it will be useful to introduce and give a brief overview of Web APIs.

According to Wikipedia a Web API "is a programmatic interface consisting of one or more publicly exposed endpoints to a defined request–response message system, typically expressed in JSON or XML, which is exposed via the web—most commonly by means of an HTTP-based web server." [6]

Web APIs, specifically how they apply to data collection, are probably best understood through an example of a single endpoint (a Web API consists of one or more endpoints). Let's use GitHub's API as an example. The base URL of GitHub's API is https://api.github.com. From the documentation [7] I know that I can send an HTTP GET request to /users/haleysam93/repos to get summary information about my GitHub repos in JSON format.

Shown below is the JSON received when navigating to the URL https://api.github.com/users/haleysam93/repos in my web browser, which sends the HTTP GET request.

<img src="images/github_api_response.png">

The 'Accessing Web APIs with Python' section shows examples of how to send requests using Python. The purpose of this section is to introduce the concept of getting data from Web APIs.
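To make the endpoint idea concrete, here is a small offline sketch that joins the base URL and the endpoint path, then parses a JSON body into Python objects. The `sample_body` string is a made-up, heavily shortened placeholder rather than a real API response. Only the `full_name` and `description` field names are taken from the GitHub example used later in this talk.

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://api.github.com"

# An endpoint is the base URL plus a path that identifies a resource.
endpoint = urljoin(BASE_URL, "/users/haleysam93/repos")
print(endpoint)

# Made-up placeholder body standing in for what the server would send back.
sample_body = '[{"full_name": "example-user/example-repo", "description": "A placeholder"}]'

# JSON arrays become Python lists and JSON objects become dictionaries.
repos = json.loads(sample_body)
names = [repo["full_name"] for repo in repos]
print(names)
```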

## Python API Wrapper Libraries
Even though it is possible, as will soon be shown, to send requests and receive JSON using Python, it isn't necessarily pleasant. At minimum it requires parsing the JSON object to pull out the data that you care about, and it may involve much more tedious programming.

To alleviate some of this burden many organizations release libraries that provide a higher-level interface to the data. These libraries are typically called API wrappers. Listed below is a small set of common API wrappers I find useful for data collection. For a more complete list check out [Real Python's List of Python API Wrappers](https://github.com/realpython/list-of-python-api-wrappers).

Common Python API Wrapper Libraries:
- python-twitter: wrapper library for accessing tweets
- newsapi-python: wrapper library for News API (news headlines and articles)
- praw (Python Reddit API Wrapper): wrapper library for Reddit data
- google-maps-services-python: wrapper library for accessing various Google Maps services
- nba_api: wrapper library for accessing NBA statistics
- rottentomatoes: wrapper library for accessing Rotten Tomatoes data

## Example: Mining tweets using the python-twitter library
Twitter is a great source of text data. Sentiment analysis, which is the process of using a model to determine the sentiment (positive, negative, neutral) of some text, is commonly performed on tweet data. This example shows how one could pull the most recent tweets from Donald Trump's timeline using the python-twitter library.

In [2]:
import os
import twitter

# Read the credentials from environment variables instead of hardcoding them
# in the notebook; the variable names used here are placeholders.
api = twitter.Api(consumer_key=os.environ['TWITTER_CONSUMER_KEY'],
                  consumer_secret=os.environ['TWITTER_CONSUMER_SECRET'],
                  access_token_key=os.environ['TWITTER_ACCESS_TOKEN_KEY'],
                  access_token_secret=os.environ['TWITTER_ACCESS_TOKEN_SECRET'])
tweets = [x.text for x in api.GetUserTimeline(screen_name='realDonaldTrump')]
tweets

['https://t.co/4FMs202NrW',
 'Robert Mueller is being asked to testify yet again. He said he could only stick to the Report, &amp; that is what he wo… https://t.co/mAQc2kmO3t',
 'Big 4th of July in D.C. “Salute to America.” The Pentagon &amp; our great Military Leaders are thrilled to be doing thi… https://t.co/UEPNdG57A1',
 'The Economy is the BEST IT HAS EVER BEEN! Even much of the Fake News is giving me credit for that!',
 'As most people are aware, according to the Polls, I won EVERY debate, including the three with Crooked Hillary Clin… https://t.co/g5pKI6Px2X',
 'Mark Levin has written a big number one bestselling book called, conspicuously and accurately, “Unfreedom of the Pr… https://t.co/fBdKrF5bFp',
 '...Texas will defend them &amp; indemnify them against political harassment by New York State and Governor Cuomo. So ma… https://t.co/0UZzWARrml',
 'People are fleeing New York like never before. If they own a business, they are twice as likely to flee. And if the… https://t.co/

The above code didn't necessarily provide us with a capability we wouldn't otherwise have, but it did make things a lot easier. Wrapper libraries can use the full power of the Python language and provide sensible Python objects, as opposed to HTTP requests, which return text data that requires parsing.

As with simply reading a file, however, this technique has a limitation: it is only available for organizations with the resources to build a Python library for accessing their data. If you are accessing data from a common source such as Twitter, Facebook, Google or Amazon, there will likely be a library to support you. For novel analysis it is unlikely you will be able to access all of your data in this fashion.
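To illustrate what "sensible Python objects" means, here is a toy sketch of the kind of work a wrapper library does behind the scenes. The `Repo` class and `parse_repos` function are hypothetical names invented for this illustration and are not part of any real library. The sample body is a made-up placeholder.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Repo:
    """Hypothetical object a wrapper might hand back instead of raw JSON."""
    full_name: str
    description: Optional[str]


def parse_repos(body: str) -> List[Repo]:
    """Turn a JSON response body into a list of Repo objects."""
    return [Repo(item["full_name"], item.get("description"))
            for item in json.loads(body)]


# Made-up placeholder response body.
body = '[{"full_name": "example-user/example-repo", "description": null}]'
repos = parse_repos(body)

# Callers now use attribute access rather than digging through dictionaries.
print(repos[0].full_name)
```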

## Accessing Web APIs with Python
We have already discussed what a Web API is, but we did not show how to access the data they provide from Python.  This section will cover how Web APIs can be used from within Python.

To send requests and receive their responses we will use the requests library. I like requests for its very simple user interface. The barrier to entry is low, allowing pretty much anyone to send their first requests in minutes.

## Example: Sending Requests to the GitHub API
For this example we will bring back the GitHub API and show how to access it programmatically using the requests library. Our goal will be to get information on all the repositories that a user (me in this example) has on GitHub. We will then store the returned data in a pandas DataFrame.

In [3]:
import pandas as pd
import requests as r

resp = r.get("https://api.github.com/users/haleysam93/repos")

all_repos = []
for repo in resp.json():
    all_repos.append((repo['full_name'], repo['description'],
                      repo['url'], repo['updated_at']))

all_repos = pd.DataFrame(all_repos, columns=['repo_name', 'description', 'url', 'last_updated'])
all_repos

Unnamed: 0,repo_name,description,url,last_updated
0,haleysam93/beer_description_generator,Simple Web App Generating Fake Beer Descriptions,https://api.github.com/repos/haleysam93/beer_d...,2019-05-09T19:48:52Z
1,haleysam93/bokeh,Interactive Web Plotting for Python,https://api.github.com/repos/haleysam93/bokeh,2018-07-29T21:19:13Z
2,haleysam93/nfl_dashboard,Dashboard displaying nfl data,https://api.github.com/repos/haleysam93/nfl_da...,2018-12-15T02:05:10Z
3,haleysam93/PythonDataScienceHandbook,Python Data Science Handbook: full text in Jup...,https://api.github.com/repos/haleysam93/Python...,2018-07-19T02:21:23Z
4,haleysam93/web_data_mining_talk,Some conference talk ideas targeted primarily ...,https://api.github.com/repos/haleysam93/web_da...,2019-07-02T19:49:23Z


That was quite simple.  We actually made the request using only a single line of code, which returned us a Response object.  The json method of the Response object provided us a list of dictionaries representing the json body of the HTTP response.  All that was left to do was pull out the desired information from each dictionary and consolidate it into a DataFrame.

Often when attempting to collect data you will be making multiple requests. Many APIs limit the rate at which you can make requests. The GitHub API, for example, limits users to 60 requests per hour for unauthenticated requests and 5000 requests per hour for authenticated requests. If you exceed those limits you will receive error responses. I recommend taking a look at each API's documentation prior to using it to understand what the best practices are. It can save a lot of headaches.
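One way to stay within a rate limit is to inspect the rate-limit headers on each response before sending the next request. The sketch below assumes the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that GitHub includes on its responses. The helper name `seconds_until_allowed` is made up for this example, and other APIs may report limits differently, so check their documentation.

```python
import time


def seconds_until_allowed(headers, now=None):
    """Return how many seconds to wait before sending the next request.

    `headers` is any mapping with a .get method, such as the `headers`
    attribute of a requests Response. `now` is the current Unix time and
    can be overridden for testing.
    """
    now = time.time() if now is None else now
    # Assume requests are allowed when the header is missing.
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    # The reset header holds the Unix time at which the quota refills.
    reset_at = int(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)


# Offline demo with hand-written header values.
exhausted = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1000"}
wait = seconds_until_allowed(exhausted, now=940)
print(wait)
```

In a real loop you might call `time.sleep(seconds_until_allowed(resp.headers))` after each request.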

## Private Undocumented APIs
In the last section we showed an example using the GitHub API, which is a public, documented API. GitHub provides that API and publishes documentation on how to use it as an open invitation for anyone to use it. There are many organizations, however, that don't provide a public API but still rely on private APIs for accessing data within their applications.

This is where we get to put on our hacker hats and do a little reverse engineering. The basic idea is to use the developer tools of our web browser (Firefox is used in my examples) and look at the requests that are being sent by the web application we are using. If we can determine the structure of the requests being sent, we can mirror them using the requests library and get the same data that the web application is using.

## Example: Mining ESPN Fantasy Football Data
During the most recent Fantasy Football season I wanted to get an edge on my competition.  I figured the best way to do this was to use some of my data analysis skills.  I figured that if I could acquire the data I would be able to identify my team's weaknesses and then set out to improve upon them.

When I got to my all-important data collection step I realized there was an issue. There was no way for me to download my league's data, and ESPN Fantasy Football does not have a public, documented API. I did some research and found some examples of people accessing data using ESPN Fantasy Football's private API. They were not pulling the same data that I was interested in, but I at least had a starting point to work with.

#### Step One: Determining the Right Endpoint
The data that I was interested in was the number of points that each roster spot scored each week. I figured that the page showing scores would request that information. To determine how the ESPN web app was sending those requests, I used Firefox's developer tools to look at the requests that were being sent.

The general steps for doing this within Firefox are:
- Open developer tools (CTRL+SHIFT+I)
- Click on the 'Network' tab
- Click the tab to filter to just XHR

<img src="images/espn_ff_private_api_capture.png">

Once I had Firefox displaying the requests that were being sent, I simply navigated to the page that accessed the information I was interested in and watched while the list of requests was filled in. There were other requests for things like ads, but using some common sense I was able to isolate the request that was most likely for the data. I now had a URL and a set of parameters to work with.
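A URL copied from the Network tab can be split into its base URL and query parameters with the standard library, which makes the request easier to reproduce and vary. The URL and parameter names below are hypothetical placeholders, not the real ESPN endpoint.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL standing in for one copied from the browser's Network tab.
captured = "https://example.com/api/scoreboard?leagueId=12345&seasonId=2018&matchupPeriodId=1"

parts = urlsplit(captured)

# Everything before the '?' identifies the endpoint.
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qs returns a list of values per key; keep the first value of each.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(base_url)
print(params)
```

With the parameters in a dictionary it becomes straightforward to change one value, such as the week, and send the request again.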

#### Step Two: Handling Authentication
Being slightly overzealous, I immediately tried to replicate the request using Python. I subsequently received a 403 error letting me know that my request was forbidden. I went back, did more research, and realized that because my league was private I needed to authenticate my request. The easiest way to do this was to send cookies along with my request that would authenticate it.

Once again my trusty web browser saved the day. Using the web browser's developer tools I was able to determine the correct values for those cookies. The requests library made it really easy to include those in my request as well.
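The sketch below shows how cookies can be attached to a request with the requests library. The URL, the parameter names, and the `session_token` cookie name are hypothetical placeholders. The real names and values would come from the browser's developer tools. The request is only prepared here, not sent, so the example runs offline.

```python
import requests

# Hypothetical endpoint, parameters, and cookie; substitute the values
# observed in the browser's developer tools.
url = "https://example.com/api/scoreboard"
params = {"leagueId": "12345", "matchupPeriodId": "1"}
cookies = {"session_token": "PASTE-VALUE-FROM-BROWSER"}

# Build the request without sending it so we can inspect what would go out.
request = requests.Request("GET", url, params=params, cookies=cookies)
prepared = request.prepare()

print(prepared.url)
print(prepared.headers["Cookie"])
```

To actually send the request, the same arguments can be passed directly: `requests.get(url, params=params, cookies=cookies)`.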

#### Step Three: Putting it All Together
Once I determined how to get a single request working, the rest of the work involved writing the code to iterate through each week, and then each matchup within each week, to compile the information I was interested in into a pandas DataFrame. I then had the data I needed to perform analysis.
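The shape of that loop might look something like the sketch below. Everything about the data is an assumption: `fake_fetch_week` stands in for the real authenticated request, and the `teams`, `name`, and `points` keys are invented for illustration and do not reflect the actual ESPN response structure.

```python
def collect_scores(fetch_week, weeks):
    """Flatten per-week, per-matchup data into one row per team per week."""
    rows = []
    for week in weeks:
        for matchup in fetch_week(week):
            for team in matchup["teams"]:
                rows.append({"week": week,
                             "team": team["name"],
                             "points": team["points"]})
    return rows


def fake_fetch_week(week):
    """Hypothetical stand-in for the real request; returns made-up data."""
    return [{"teams": [{"name": "Team A", "points": 100.0 + week},
                       {"name": "Team B", "points": 90.0}]}]


rows = collect_scores(fake_fetch_week, range(1, 4))
print(rows[0])
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` to get one column per key.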

## Web Scraping
The final technique we will cover is web scraping. Before diving in, it is useful to define what web scraping is and is not for the purposes of this talk. Web scraping is the process of parsing HTML documents to extract the information you want. The term 'web scraping' often gets used interchangeably with 'web crawling', but I want to make a clear distinction. Web crawling typically involves using "bots" or "spiders" to systematically traverse entire websites. The focus here will be on web scraping. Although Python does have a nice web crawling library in Scrapy, covering web crawling is outside the scope of this talk.

It is important to understand when you need web scraping and when you do not. Web scraping is discussed last for a reason: it can be time intensive and prone to issues. I see it as a last resort for the case when the only way to access the data you want is by parsing values out of the HTML.

The web scraping library that we will cover in this section is BeautifulSoup.  We will also need the requests library to make the HTTP requests to get the HTML text for BeautifulSoup to parse.
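Before the full example, here is a minimal sketch of what parsing with BeautifulSoup looks like. The HTML snippet is made up for illustration, and Python's built-in `html.parser` is used so that no extra parser needs to be installed. In a real scrape the HTML text would come from the `text` attribute of a requests Response.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the text of a downloaded page.
html_text = """
<html><body>
  <div id="product">
    <b>Name:</b> <a href="/items/1">Example Item</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html_text, "html.parser")

# Locate a tag by name and id, then search within it.
product = soup.find("div", id="product")
label = product.find("b").get_text()
name = product.find("a").get_text()
link = product.find("a")["href"]

print(label, name, link)
```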

## Example: Mining Beer Description Data
I recently worked on a project where the goal was to build a text model that would generate fake beer descriptions. To build this model I needed real beer descriptions to train it on. Unfortunately this data is not publicly available as a dataset, and after a lot of research the only place I could find descriptions that suited what I was trying to generate was beeradvocate.com. Beer Advocate is not very forthcoming with their data, and they do not provide any datasets or an API. The only way I could get the beer descriptions was to go to the profile page of each beer.

#### Step One: Get the URLs to Request
I was able to determine that the beer profile pages don't contain the name of the beer within the URL and instead the URL simply contains an integer between 1 and 400000 (seemingly an index).  This made life easier because I didn't have to know the name or producer of the beer, I could simply iterate from 1 to 400000 appending the value to the base URL.  The code to generate the list of URLs for every beer is shown below.

In [4]:
BASE_URL = 'https://www.beeradvocate.com/beer/profile/001/'
beer_idxs = range(1, 400001)  # range() excludes the stop value, so this covers 1 through 400000
urls = [BASE_URL + str(i) for i in beer_idxs]

#### Step Two: Create the Function to Parse the HTML
Every request that gets made will return a requests Response object so it's best to make the request and immediately parse the returned HTML so Python doesn't have to persist all of that data.  I created a function to perform the parsing which is shown below.

In [5]:
import re
from bs4 import BeautifulSoup

def parse_description(html_text):
    """Parse the HTML text provided to extract the beer description text.
    Return None if the beer description is not found or N/A."""
    soup = BeautifulSoup(html_text, "lxml")
    info_box = soup.find('div', id='info_box')
    if not info_box:
        return None
    desc_header = info_box.find('b', string="Notes / Commercial Description:")
    if not desc_header:
        return None
    # Find the first text node after the header that begins with a newline
    # followed by non-whitespace; that node holds the description text.
    elem = desc_header.find_next(string=re.compile(r"\n\S"))
    if elem is None:
        return None
    text = str(elem).strip('\n')
    # Treat a description that starts with "None" as missing
    if text.startswith("None"):
        return None
    return text

This function is nontrivial because of the edge cases that it needs to handle. If the HTML tag we use to locate the data (info_box) doesn't exist, or the Commercial Description header does not exist, the function must handle that. When I was implementing this functionality I essentially had to go through a trial and error phase, solving issues as I came across them. This is why web scraping can be time intensive.

#### Step Three: Putting it All Together
I now have my list of URLs and my function to parse the beer description from the HTML. Now I need to implement the code to iterate over the URLs, get the HTML for each one and then parse that HTML. Previously I mentioned that we don't want Python to persist the requests Response objects. We get around this by making the requests within a generator, so the Python kernel does not have to keep each Response in memory once we are done with it. The code for doing this is shown below.

In [6]:
import requests as r

# Go through only the first 10 urls for demo purposes
resp_generator = (r.get(url) for url in urls[0:10])

beer_descs = [parse_description(resp.text) for resp in resp_generator]
beer_descs

[None,
 None,
 None,
 None,
 'Amber is a Munich style lager brewed with crystal malt and Perle hops. It has a smooth, malty, slightly caramel flavor and a rich amber color. Abita Amber was the first beer offered by the brewery and continues to be our leading seller.',
 'Turbodog is a dark brown ale brewed with Willamette hops and a combination of pale, crystal and chocolate malts. This combination gives Turbodog its rich body and color and a sweet chocolate toffee-like flavor. Turbodog began as a specialty ale but has gained a huge loyal following and has become one of our flagship brews.',
 'Experience the magic of Purple Haze.® Clouds of real raspberries swirl in this tart and tantalizing lager inspired by the good spirits and dark mysteries of New Orleans. Brewed with pilsner and wheat malts along with Vanguard hops, let the scent of berries in the hazy purple brew put a spell on you.',
 'Wheat (May – September) German brewers discovered centuries ago that the addition of wheat prod

While this was a relatively simple example, it hopefully demonstrates what web scraping is and how to use BeautifulSoup to accomplish it with Python. Hopefully it also drives home the point that it is best to avoid web scraping if possible. When going the web scraping route the complexity of the code increases quickly. It is much easier to get data from files or by using a library or API. The time you spend collecting your data takes time away from actually exploring and analyzing it.
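When scaling the generator approach beyond a handful of URLs, it is usually worth pausing between requests and skipping failed responses. The sketch below is one way to do that and is not the code used for the project. The `fetch_pages` name and its `fetch` parameter are invented here, and `fake_fetch` is a stand-in that lets the example run offline.

```python
import time
from types import SimpleNamespace


def fetch_pages(urls, fetch, delay_seconds=1.0):
    """Yield (url, html) pairs one at a time, pausing between requests.

    `fetch` is any callable that takes a URL and returns an object with
    `status_code` and `text` attributes, for example `requests.get`.
    """
    for url in urls:
        resp = fetch(url)
        # Only hand back pages that were retrieved successfully.
        if resp.status_code == 200:
            yield url, resp.text
        # Be polite to the server by spacing out requests.
        time.sleep(delay_seconds)


def fake_fetch(url):
    """Stand-in for requests.get that returns made-up responses."""
    status = 200 if url.endswith("/1") else 404
    return SimpleNamespace(status_code=status, text="<html>page</html>")


pages = list(fetch_pages(["https://example.com/1", "https://example.com/2"],
                         fake_fetch, delay_seconds=0))
print(pages)
```

With the real library this would be called as `fetch_pages(urls[0:10], requests.get)`, and each yielded HTML string could be passed to a parsing function.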

## Summary
In this talk we discussed four techniques for mining data from the web, in order of increasing implementation complexity:
- Downloading Datasets
- Python API Wrapper Libraries
- Accessing Web APIs from Python
- Web Scraping

Hopefully the next time you need to collect data for a data analysis process you can utilize these methodologies to make your data collection step as painless as possible.

# Sources
[1] https://www.tutorialspoint.com/excel_data_analysis/data_analysis_process.htm

[2] https://ori.hhs.gov/education/products/n_illinois_u/datamanagement/dctopic.html

[3] https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html

[4] https://en.wikipedia.org/wiki/Data_collection

[5] https://ebrary.net/1291/education/considerations_for_collecting_data

[6] https://en.wikipedia.org/wiki/Web_API

[7] https://developer.github.com/v3/

[8] https://syntax.fm/show/060/the-undocumented-web-scraping-private-apis-proxies-and-alternative-solutions

[9] https://stmorse.github.io/journal/espn-fantasy-3-python.html

[10] https://www.promptcloud.com/blog/data-scraping-vs-data-crawling