In [None]:
## all imports
from IPython.display import HTML
import numpy as np
import bs4 #this is beautiful soup

from pandas import Series
import pandas as pd
from pandas import DataFrame

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline


# Data Scraping

In this module, we'll focus on a kind of data that has become extremely common on the Internet: text data. In principle, text is just another form of data, and text processing is just another part of "data wrangling". Text is advantageous because so much of it is available, but it is challenging because it is "unstructured": it lacks the usual tabular layout with named fields. It's also relatively "dirty": people misspell wrods and runwordstogether #unpredictably.

Today, we'll talk about data scraping as a way of obtaining data from webpages. In low-level scraping, you parse the data out of the HTML code of the webpage. Alternatively, many websites try to make your life a bit easier by offering an API (Application Programming Interface): a defined set of requests that lets a program ask for specific pieces of a site's data, such as the text on a page, without parsing HTML.


## Scraping:  HTML and APIs with Python


## 4. Web Scraping using Beautiful Soup

Let's scrape some data using a fun library called Beautiful Soup. We'll create a CSV dataset from a table of county housing statistics.

The website we are going to scrape is here.

[County Housing Statistics](http://duspviz.mit.edu/_assets/data/county_housing_stats.html)

Let's get started!

#### Importing Modules

First import modules. **import requests** imports the requests module, and **import bs4** imports the Beautiful Soup library.

FYI: This tutorial is based on material developed by [DUSPviz](http://duspviz.mit.edu/tutorials/python-scraping/).

In [None]:
import bs4
import requests

#### Testing out Requests

Requests allows us to load a webpage into Python so that we can parse and manipulate it. Test this by running the following cell.

This gives us access to all of the content from the source code of the webpage, stored in response.text, which we can now parse to extract data. Uncomment the print line to see the source printed in the notebook. Pretty cool!

In [None]:
response = requests.get('http://duspviz.mit.edu/_assets/data/county_housing_stats.html')
#print(response.text) # Print the output
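
Before parsing, it can help to confirm that the request actually succeeded. The sketch below wraps the download in a small helper. The name `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original tutorial.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its source code as text (hypothetical helper)."""
    # timeout keeps the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text

# Example call (needs network access):
# html = fetch_html('http://duspviz.mit.edu/_assets/data/county_housing_stats.html')
```

If the server answers with an error status, the helper raises an exception rather than returning the error page, so a failed download is noticed before any parsing starts.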

### Testing out Beautiful Soup
Our next big step is to test out Beautiful Soup. Let's talk about what this is...

What is Beautiful Soup?
Beautiful Soup is a Python library for parsing data out of HTML and XML files (such as webpages). It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. The key idea is that it lets you access elements of a page by following the document's structure: you can grab all links, all headers, all elements of a specific class, and more. Once we have the elements, Python makes it easy to write them, or their relevant components, into other files, such as a CSV that can be stored in a database or opened in other software.

The sample webpage we are using contains a table of county housing statistics. Let's use this file to explore the tree and extract some data.

Our first step is to *Make the Soup*

First, we have to turn the website code into a Python object. We have already imported the Beautiful Soup library, so we can start calling some of the methods in the library. Instead of printing response.text, we pass it to the following command, which turns the text into a Python object named soup.

An important note: You should specify the parser that Beautiful Soup uses to parse your text. This is done in the second argument of the BeautifulSoup function. Python's built-in parser is selected by passing "html.parser".

You can also use lxml or html5lib, if they are installed. This is nicely described in the [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). For our purposes, the built-in parser is fine.

Using the Beautiful Soup prettify() function, we can print the page's code in an indented, readable form.

At any point, if you need a reference, visit the Beautiful Soup [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) for the official descriptions of functions.

In [None]:
soup = bs4.BeautifulSoup(response.text, "html.parser")
print(soup.prettify()) # Print the output using the 'prettify' function

## Navigating the Data Structure

With our data from the webpage nicely laid out, Beautiful Soup now allows us to navigate the data structure. We called our Beautiful Soup object soup, so we can run the Beautiful Soup functions on this object. Let's explore some ways to do this by running the following cells.

In [None]:
# Access the title element
soup.title

In [None]:
# Access the content of the title element
soup.title.string

In [None]:
# Access data in the first 'p' tag
soup.p

In [None]:
# Access data in the first 'a' tag
soup.a

In [None]:
# Retrieve all links in the document (note it returns a list)
soup.find_all('a')

In [None]:
# Retrieve elements by class equal to link using the attributes argument
soup.find_all(attrs={'class' : 'link'}) # find_all is the current name; findAll is a legacy alias

In [None]:
# Retrieve a specific link by ID
soup.find(id="link3")

In [None]:
# Access data in the table (note it returns a list)
soup.find_all('td')
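
To see what each of these calls returns without depending on the live page, here is a small sketch that runs the same kinds of lookups on an inline HTML snippet. The snippet, its ids, and the variable names are made up for illustration.

```python
import bs4

# A tiny, made-up page used only to illustrate the navigation calls
html = """
<html><head><title>Demo page</title></head>
<body>
<p class="intro">First paragraph.</p>
<a class="link" id="link1" href="/one">One</a>
<a class="link" id="link2" href="/two">Two</a>
</body></html>
"""

soup_demo = bs4.BeautifulSoup(html, "html.parser")

title_text = soup_demo.title.string                    # text inside <title>
hrefs = [a["href"] for a in soup_demo.find_all("a")]   # one attribute from every link
second = soup_demo.find(id="link2").get_text()         # text of the element with this id
linked = soup_demo.find_all(class_="link")             # every element with class "link"
```

Here find returns a single element (or None when nothing matches), while find_all always returns a list, which may be empty.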

## Working with Arrays
The easiest way to access elements and then either write them to file or manipulate them is to save them as objects themselves. Note that our data is organized into counties and several numbers. Let's save these to lists (Python's array-like sequence type), which are the easiest way to work with the data.

The following gives us a list whose elements we can work with.

In [None]:
data = soup.find_all(attrs={'class':'name'})
data[0]

In [None]:
data = soup.find_all(attrs={'class':'name'})
print(data[0].string)
print(data[1].string)
print(data[2].string)
print(data[3].string)

In [None]:
data = soup.find_all(attrs={'class':'name'})
for i in data:
    print(i.string)

This list only gives us counties, though. Let's get all of the data elements from all classes.



In [None]:
data = soup.find_all(attrs={'class':['name','fips','tot-pop','median-income','no-housing-units','med-home-val','owner-occupied','house-w-debt','house-wo-debt']})
for i in data:
    print(i.string)

We have all of our data that was nested in these tags saved to a Python list. Access the elements of the list with data[x], where x is the position in the list. In Python, indexing starts at 0, so the first element is data[0] and the eighth is data[7].

In [None]:
print(data[0])
print(data[1])
print(data[0].string)
print(data[1].string)
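
Because the elements come back as one flat list, consecutive groups of elements belong to the same table row. The following is a minimal sketch of regrouping a flat list into rows, using plain strings in place of tags and three fields per row for brevity. The helper name `chunk` and the sample values are hypothetical.

```python
def chunk(items, size):
    """Split a flat list into consecutive sub-lists of length `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Stand-in for the flat list of cell strings (made-up values)
flat = ["County A", "001", "100", "County B", "002", "200"]

rows = chunk(flat, 3)                    # one sub-list per table row
first_names = [row[0] for row in rows]   # the first field of every row
```

With the scraped data, the same idea would apply with a row size of nine, one element per requested class.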

In [None]:
import requests
import bs4

# load and get the website
response = requests.get('http://duspviz.mit.edu/_assets/data/county_housing_stats.html')

# create the soup
soup = bs4.BeautifulSoup(response.text, "html.parser")

# find all the tags whose class is one of the data columns
data = soup.find_all(attrs={'class':['name','fips','tot-pop','median-income','no-housing-units','med-home-val','owner-occupied','house-w-debt','house-wo-debt']})

# print 'data' to console
print(data)

You should see a list with our data elements nested within tags. This is what we want!

In [None]:
# column headers for the CSV file
header = "County, State, FIPS Code, Total Pop, Median Income ($), No. of Housing Units, Median Home Value ($), No. of Owner Occupied Housing Units, No. of Owner Occ. Housing Units with Debt, No. of Owner Occ. Housing Units without Debt\n"

fields_per_row = 9 # one element per requested class in every table row

# open new file, make sure path to your data file is correct; 'with' closes it automatically
with open('county_data.csv', 'w') as f:
    f.write(header) # write headers
    # step through the flat list one table row (nine elements) at a time
    for start in range(0, len(data), fields_per_row):
        row = data[start:start + fields_per_row]
        # write the row's values separated by commas, then a line break
        f.write(", ".join(cell.string for cell in row) + "\n")
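
Joining values with commas by hand works only as long as no value contains a comma of its own. As an alternative sketch, Python's standard csv module quotes such values automatically. The column names and row values below are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical sample data; the values are made up for illustration
header = ["name", "fips", "tot-pop"]
rows = [["Example County, Example State", "00000", "1,234"]]

buffer = io.StringIO()        # in-memory stand-in for a file opened for writing
writer = csv.writer(buffer)
writer.writerow(header)       # header line
writer.writerows(rows)        # fields containing commas are quoted automatically
csv_text = buffer.getvalue()

# Reading the text back recovers the original fields, commas included
parsed = list(csv.reader(io.StringIO(csv_text)))
```

When writing to a real file with csv.writer, the csv documentation recommends opening it with newline='' so that line endings are not translated twice.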


## JSON & Working with Web APIs

Simply put, an Application Programming Interface is a standard that facilitates intercommunication between two or more computer programs.

Web APIs are a more convenient way for programs to interact with websites. Many websites now have an API that gives access to their data in JSON format. JSON (JavaScript Object Notation) is a text format that represents data as nested key-value pairs and lists, much like Python dictionaries and lists.

Some well-known API providers are Twitter, Facebook, and Google. Some APIs require special permission (typically an API key), while others are open to the public.

In [25]:
import json

a = {'a': 1, 'b':2}
s = json.dumps(a)
a2 = json.loads(s)
a2['b']

2

In [None]:
print(a) # a dictionary
print(s) # s is a string containing a in JSON encoding
print(a2) # a2 is a dictionary again; in Python 3 its keys are ordinary (Unicode) str objects
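
A round trip through JSON does not always return exactly the object you started with, because JSON has fewer types than Python. The sketch below, built on a made-up record, shows two common cases: tuples come back as lists, and non-string keys come back as strings.

```python
import json

# Made-up record containing a tuple value and an integer key
record = {"name": "example", "values": (1, 2), 1: "one"}

encoded = json.dumps(record)    # a str holding the JSON text
decoded = json.loads(encoded)   # back to Python dictionaries and lists

values_type = type(decoded["values"])   # list, not tuple
keys = sorted(decoded.keys())           # every key is now a str
```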

## World Cup in JSON!

There was an [API created for the World Cup](http://worldcup.sfg.io) that scraped current match results and output match data as JSON. Possible output includes events such as goals, substitutions, and cards. The [actual matches are listed here](http://worldcup.sfg.io/matches) in JSON. 

* Example from [Fernando Masanori](https://gist.github.com/fmasanori/1288160dad16cc473a53)

The first step in getting data using an API is to send a GET request. By default, the response body comes back as text, but calling resp.json() decodes that JSON text into Python lists and dictionaries for us.

In [32]:
import requests

url = "http://worldcup.sfg.io/matches"
resp = requests.get(url)
wc = resp.json()

In [33]:
"Number of matches in 2019 World Cup: %i" % len(wc)

'Number of matches in 2019 World Cup: 52'

In [36]:
# Print keys in first match

gameIndex = 0
wc[gameIndex].keys()

dict_keys(['venue', 'location', 'status', 'time', 'fifa_id', 'weather', 'attendance', 'officials', 'stage_name', 'home_team_country', 'away_team_country', 'datetime', 'winner', 'winner_code', 'home_team', 'away_team', 'home_team_events', 'away_team_events', 'home_team_statistics', 'away_team_statistics', 'last_event_update_at', 'last_score_update_at'])

In [38]:
wc[gameIndex]['winner']

'France'

In [45]:
wc[gameIndex]['home_team']['goals']

4

In [46]:
for elem in wc:
    print(elem['home_team']['country'], elem['home_team']['goals'], elem['away_team']['country'], elem['away_team']['goals'])

France 4 Korea Republic 0
Germany 1 China PR 0
Spain 3 South Africa 1
Norway 3 Nigeria 0
Brazil 3 Jamaica 0
England 2 Scotland 1
Australia 1 Italy 2
Argentina 0 Japan 0
Canada 1 Cameroon 0
New Zealand 0 Netherlands 1
Chile 0 Sweden 2
USA 13 Thailand 0
Nigeria 2 Korea Republic 0
Germany 1 Spain 0
France 2 Norway 1
Australia 3 Brazil 2
South Africa 0 China PR 1
Japan 2 Scotland 1
Jamaica 0 Italy 5
England 1 Argentina 0
Netherlands 3 Cameroon 1
Canada 2 New Zealand 0
Sweden 5 Thailand 1
USA 3 Chile 0
China PR 0 Spain 0
South Africa 0 Germany 4
Nigeria 0 France 1
Korea Republic 1 Norway 2
Italy 0 Brazil 1
Jamaica 1 Australia 4
Japan 0 England 2
Scotland 3 Argentina 3
Cameroon 2 New Zealand 1
Netherlands 2 Canada 1
Thailand 0 Chile 2
Sweden 0 USA 2
Germany 3 Nigeria 0
Norway 1 Australia 1
England 3 Cameroon 0
France 2 Brazil 1
Spain 1 USA 2
Sweden 1 Canada 0
Italy 2 China PR 0
Netherlands 2 Japan 1
Norway 0 England 3
France 1 USA 2
Italy 0 Netherlands 2
Germany 1 Sweden 2
England 1 USA 2
Ne
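
Once the matches are ordinary Python dictionaries, standard tools apply. As a sketch, the following totals the goals per country with collections.Counter. The three sample records copy scores from the output above and keep only the keys used here; the variable names are illustrative.

```python
from collections import Counter

# Three matches in the same nested shape as the API response (subset of keys)
sample = [
    {"home_team": {"country": "France", "goals": 4},
     "away_team": {"country": "Korea Republic", "goals": 0}},
    {"home_team": {"country": "Germany", "goals": 1},
     "away_team": {"country": "China PR", "goals": 0}},
    {"home_team": {"country": "France", "goals": 2},
     "away_team": {"country": "Norway", "goals": 1}},
]

goals = Counter()
for match in sample:
    # each match contributes goals for both the home and the away side
    for side in ("home_team", "away_team"):
        team = match[side]
        goals[team["country"]] += team["goals"]

top_scorer = goals.most_common(1)[0]   # (country, total goals) with the highest total
```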

### Create a pandas DataFrame from JSON

In [49]:

#pd.DataFrame(wc)
data = pd.DataFrame(wc, columns = ['match_number', 'location', 'datetime', 'home_team', 'away_team', 'winner', 'home_team_events', 'away_team_events'])
data.head()

Unnamed: 0,match_number,location,datetime,home_team,away_team,winner,home_team_events,away_team_events
0,,Parc des Princes,2019-06-07T19:00:00Z,"{'country': 'France', 'code': 'FRA', 'goals': ...","{'country': 'Korea Republic', 'code': 'KOR', '...",France,"[{'id': 1, 'type_of_event': 'goal', 'player': ...","[{'id': 6, 'type_of_event': 'substitution-out'..."
1,,Roazhon Park,2019-06-08T13:00:00Z,"{'country': 'Germany', 'code': 'GER', 'goals':...","{'country': 'China PR', 'code': 'CHN', 'goals'...",Germany,"[{'id': 23, 'type_of_event': 'substitution-out...","[{'id': 19, 'type_of_event': 'yellow-card', 'p..."
2,,Stade Océane,2019-06-08T16:00:00Z,"{'country': 'Spain', 'code': 'ESP', 'goals': 3...","{'country': 'South Africa', 'code': 'RSA', 'go...",Spain,"[{'id': 38, 'type_of_event': 'substitution-out...","[{'id': 37, 'type_of_event': 'goal', 'player':..."
3,,Stade Auguste-Delaune,2019-06-08T19:00:00Z,"{'country': 'Norway', 'code': 'NOR', 'goals': ...","{'country': 'Nigeria', 'code': 'NGA', 'goals':...",Norway,"[{'id': 63, 'type_of_event': 'goal', 'player':...","[{'id': 61, 'type_of_event': 'yellow-card', 'p..."
4,,Stade des Alpes,2019-06-09T13:30:00Z,"{'country': 'Brazil', 'code': 'BRA', 'goals': ...","{'country': 'Jamaica', 'code': 'JAM', 'goals':...",Brazil,"[{'id': 98, 'type_of_event': 'goal', 'player':...","[{'id': 99, 'type_of_event': 'yellow-card', 'p..."
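
In the table above, the home_team and away_team columns still hold whole dictionaries. One possible way to spread such nested fields into their own columns is pandas' json_normalize, sketched here on two records copied from the output above (reduced to a few keys).

```python
import pandas as pd

# Two matches in the nested shape returned by the API (subset of keys)
sample = [
    {"location": "Parc des Princes",
     "home_team": {"country": "France", "code": "FRA", "goals": 4},
     "away_team": {"country": "Korea Republic", "code": "KOR", "goals": 0}},
    {"location": "Roazhon Park",
     "home_team": {"country": "Germany", "code": "GER", "goals": 1},
     "away_team": {"country": "China PR", "code": "CHN", "goals": 0}},
]

# Nested keys become dotted column names such as 'home_team.goals'
flat = pd.json_normalize(sample)
columns = list(flat.columns)
```

Each nested key gets its own column, so a value such as the home team's goals can be selected with flat['home_team.goals'].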


In [51]:
data['gameDate'] = pd.DatetimeIndex(data.datetime).date
#data['gameTime'] = pd.DatetimeIndex(data.datetime).time


In [52]:
data.head()

Unnamed: 0,match_number,location,datetime,home_team,away_team,winner,home_team_events,away_team_events,gameDate
0,,Parc des Princes,2019-06-07T19:00:00Z,"{'country': 'France', 'code': 'FRA', 'goals': ...","{'country': 'Korea Republic', 'code': 'KOR', '...",France,"[{'id': 1, 'type_of_event': 'goal', 'player': ...","[{'id': 6, 'type_of_event': 'substitution-out'...",2019-06-07
1,,Roazhon Park,2019-06-08T13:00:00Z,"{'country': 'Germany', 'code': 'GER', 'goals':...","{'country': 'China PR', 'code': 'CHN', 'goals'...",Germany,"[{'id': 23, 'type_of_event': 'substitution-out...","[{'id': 19, 'type_of_event': 'yellow-card', 'p...",2019-06-08
2,,Stade Océane,2019-06-08T16:00:00Z,"{'country': 'Spain', 'code': 'ESP', 'goals': 3...","{'country': 'South Africa', 'code': 'RSA', 'go...",Spain,"[{'id': 38, 'type_of_event': 'substitution-out...","[{'id': 37, 'type_of_event': 'goal', 'player':...",2019-06-08
3,,Stade Auguste-Delaune,2019-06-08T19:00:00Z,"{'country': 'Norway', 'code': 'NOR', 'goals': ...","{'country': 'Nigeria', 'code': 'NGA', 'goals':...",Norway,"[{'id': 63, 'type_of_event': 'goal', 'player':...","[{'id': 61, 'type_of_event': 'yellow-card', 'p...",2019-06-08
4,,Stade des Alpes,2019-06-09T13:30:00Z,"{'country': 'Brazil', 'code': 'BRA', 'goals': ...","{'country': 'Jamaica', 'code': 'JAM', 'goals':...",Brazil,"[{'id': 98, 'type_of_event': 'goal', 'player':...","[{'id': 99, 'type_of_event': 'yellow-card', 'p...",2019-06-09
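
The datetime column holds ISO 8601 strings. As a sketch of an alternative to DatetimeIndex, pd.to_datetime parses such strings into timestamps whose parts are available through the .dt accessor. The two strings are copied from the table above; the variable names are illustrative.

```python
import pandas as pd

stamps = pd.Series(["2019-06-07T19:00:00Z", "2019-06-08T13:00:00Z"])

# utc=True makes the result timezone-aware; the trailing 'Z' already denotes UTC
parsed = pd.to_datetime(stamps, utc=True)

game_dates = parsed.dt.date      # datetime.date objects
kickoff_hours = parsed.dt.hour   # hour of the day, in UTC
```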


In [75]:
import requests

url = "https://developers.zomato.com/api/v2.1/categories"
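# this API expects a personal key in the 'user-key' request header
resp = requests.get(url, headers={'user-key': '1036cfe76c2b8d74c0a12334163d5c61'})
categories = resp.json() # decode the JSON response body
categories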

{'categories': [{'categories': {'id': 1, 'name': 'Delivery'}},
  {'categories': {'id': 2, 'name': 'Dine-out'}},
  {'categories': {'id': 3, 'name': 'Nightlife'}},
  {'categories': {'id': 4, 'name': 'Catching-up'}},
  {'categories': {'id': 5, 'name': 'Takeaway'}},
  {'categories': {'id': 6, 'name': 'Cafes'}},
  {'categories': {'id': 7, 'name': 'Daily Menus'}},
  {'categories': {'id': 8, 'name': 'Breakfast'}},
  {'categories': {'id': 9, 'name': 'Lunch'}},
  {'categories': {'id': 10, 'name': 'Dinner'}},
  {'categories': {'id': 11, 'name': 'Pubs & Bars'}},
  {'categories': {'id': 13, 'name': 'Pocket Friendly Delivery'}},
  {'categories': {'id': 14, 'name': 'Clubs & Lounges'}}]}
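
The response wraps each category in an extra 'categories' dictionary. The sketch below unwraps that nesting into a simple id-to-name mapping, using the first three entries of the output above as a stand-in payload; the variable names are illustrative.

```python
# Stand-in for the decoded response, copied from the first entries shown above
payload = {'categories': [{'categories': {'id': 1, 'name': 'Delivery'}},
                          {'categories': {'id': 2, 'name': 'Dine-out'}},
                          {'categories': {'id': 3, 'name': 'Nightlife'}}]}

# Each list item holds the actual record under its own 'categories' key
names_by_id = {item['categories']['id']: item['categories']['name']
               for item in payload['categories']}

category_names = sorted(names_by_id.values())
```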

In [72]:
resp

<Response [200]>