
# APIs and Webscraping

**Lecture Overview**

1. API review
2. Webscraping review


## APIs

In [None]:
import pandas as pd
import requests  # makes HTTP requests
import json

![link text](https://www.cleveroad.com/images/article-previews/40ca78a7a9db7adfb6bb861fc6b8910ae2ef4bb79f5508007d166f01df5c1038.png)

In [None]:
# An API (Application Programming Interface) is a set of rules and protocols for building
# and interacting with software. It defines how different software components should
# interact with each other and how data should be exchanged between them.

### Overview of APIs

An API communicates with another application:

* Send request (with some info/data)
* Get response
    + data
    + service


It's always a software-to-software interaction

### Parts of an API

* **Access Permissions**
    + User allowed to ask?
* **API Call/Request**
    + Code used to make API call to implement complicated tasks/features
    + *Methods*: what questions can we ask?
    + *Parameters*: more info to be sent
* **Response**
    + Result of request

### The `requests` Library and its `.get()` Method

To use an API, you make a request to a remote web server, and retrieve the data you need.

We'll use the `requests` library to access web locations.

-------

In [None]:
# Import requests to working environment
import requests

In [None]:
# Make a get request to get the latest position of the
# International Space Station (ISS) from the opennotify api.

url = 'http://api.open-notify.org/iss-now.json'
iss_response = requests.get(url)

This creates a `Response` object containing the response that we received.

In [None]:
type(iss_response)

requests.models.Response

The `Response` object contains a bunch of information about the response we got from the server. For example, it includes the status code, which can be helpful for diagnosing request issues. 200 means OK - we'll discuss others later.

In [None]:
iss_response.status_code

200

The `Response` object also contains the data received from our request in the `content` attribute. 

In [None]:
iss_response.content

b'{"timestamp": 1678981613, "message": "success", "iss_position": {"longitude": "-3.0153", "latitude": "-22.5944"}}'

### Parsing JSON Responses

OpenNotify has several API **endpoints**. An endpoint is a server route that is used to retrieve different data from the API. For example, the `/comments` endpoint on the Reddit API might retrieve information about comments, whereas the `/users` endpoint might retrieve data about users. To access them, you would add the endpoint to the base url of the API.
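
As a small sketch of "adding the endpoint to the base URL", the helper below joins the OpenNotify base URL with an endpoint name. The function name `endpoint_url` is an assumption made for this example; it is not part of `requests` or of the OpenNotify API.

```python
from urllib.parse import urljoin

BASE_URL = "http://api.open-notify.org/"

def endpoint_url(endpoint):
    """Join the API's base URL with an endpoint path (hypothetical helper)."""
    return urljoin(BASE_URL, endpoint)

iss_now_url = endpoint_url("iss-now.json")
astros_url = endpoint_url("astros.json")
print(iss_now_url)
print(astros_url)
```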

In [None]:
# Let's check out who is in space right now!

url = 'http://api.open-notify.org/astros.json'
astro_response = requests.get(url)
print(astro_response.status_code)

200


In [None]:
astro_response.content

b'{"message": "success", "number": 10, "people": [{"craft": "ISS", "name": "Sergey Prokopyev"}, {"craft": "ISS", "name": "Dmitry Petelin"}, {"craft": "ISS", "name": "Frank Rubio"}, {"craft": "Shenzhou 15", "name": "Fei Junlong"}, {"craft": "Shenzhou 15", "name": "Deng Qingming"}, {"craft": "Shenzhou 15", "name": "Zhang Lu"}, {"craft": "ISS", "name": "Stephen Bowen"}, {"craft": "ISS", "name": "Warren Hoburg"}, {"craft": "ISS", "name": "Sultan Alneyadi"}, {"craft": "ISS", "name": "Andrey Fedyaev"}]}'

See the `b'` at the beginning? The `content` attribute holds a `bytes` object (raw bytes), not a Python dictionary.

In [None]:
type(astro_response.content)

bytes

We can look at the `text` attribute instead, but this still gives us a string, not a dictionary.

In [None]:
astro_response.text

'{"message": "success", "number": 10, "people": [{"craft": "ISS", "name": "Sergey Prokopyev"}, {"craft": "ISS", "name": "Dmitry Petelin"}, {"craft": "ISS", "name": "Frank Rubio"}, {"craft": "Shenzhou 15", "name": "Fei Junlong"}, {"craft": "Shenzhou 15", "name": "Deng Qingming"}, {"craft": "Shenzhou 15", "name": "Zhang Lu"}, {"craft": "ISS", "name": "Stephen Bowen"}, {"craft": "ISS", "name": "Warren Hoburg"}, {"craft": "ISS", "name": "Sultan Alneyadi"}, {"craft": "ISS", "name": "Andrey Fedyaev"}]}'

In [None]:
print(astro_response.text)
print(type(astro_response.text))

{"message": "success", "number": 10, "people": [{"craft": "ISS", "name": "Sergey Prokopyev"}, {"craft": "ISS", "name": "Dmitry Petelin"}, {"craft": "ISS", "name": "Frank Rubio"}, {"craft": "Shenzhou 15", "name": "Fei Junlong"}, {"craft": "Shenzhou 15", "name": "Deng Qingming"}, {"craft": "Shenzhou 15", "name": "Zhang Lu"}, {"craft": "ISS", "name": "Stephen Bowen"}, {"craft": "ISS", "name": "Warren Hoburg"}, {"craft": "ISS", "name": "Sultan Alneyadi"}, {"craft": "ISS", "name": "Andrey Fedyaev"}]}
<class 'str'>


To address this, we will use the `.json()` method to get a dictionary we can work with.

In [None]:
astro_data = astro_response.json()
astro_data.keys()

dict_keys(['message', 'number', 'people'])
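
Once the body is parsed, it behaves like any other dictionary. The sketch below works offline on a shortened, hand-edited copy of the payload shown earlier and uses the standard-library `json.loads`, which yields the same kind of dictionary that `.json()` returns. The variable names are chosen for this example only.

```python
import json
from collections import Counter

# A shortened, hand-edited copy of the bytes payload from the astros endpoint.
raw = (
    b'{"message": "success", "number": 3, "people": ['
    b'{"craft": "ISS", "name": "Frank Rubio"}, '
    b'{"craft": "Shenzhou 15", "name": "Fei Junlong"}, '
    b'{"craft": "ISS", "name": "Stephen Bowen"}]}'
)

data = json.loads(raw)  # json.loads accepts bytes as well as str

# "people" is a list of dictionaries, one per astronaut.
names = [person["name"] for person in data["people"]]
per_craft = Counter(person["craft"] for person in data["people"])

print(names)
print(per_craft)
```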

### Status Codes

The requests we make may not always be successful. The best way to tell is to check the status code returned with the response: `response.status_code`

In [None]:
astro_response.status_code

200

### Common status codes

* 200: everything went okay, and the result has been returned (if any).
* 301: the server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint name is changed.
* 400: the server thinks you made a bad request. This can happen when you don't send along the right data, among other things.
* 401: the server thinks you're not authenticated. This happens when you don't send the right credentials to access an API.
* 403: the resource you're trying to access is forbidden; you don't have the right permissions to see it.
* 404: the resource you tried to access wasn't found on the server.
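
The codes above can be looked up programmatically. As a rough sketch, the helper below uses the standard-library `http.HTTPStatus` enum to turn a code into a readable label; `describe_status` is a name invented for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a readable label for an HTTP status code (hypothetical helper).

    HTTPStatus(code) raises ValueError for codes it does not know.
    """
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"

for code in (200, 301, 400, 401, 403, 404):
    print(describe_status(code))
```

With `requests`, calling `response.raise_for_status()` is another option: it raises an exception for 4xx and 5xx responses instead of leaving the check to you.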

### Hitting the right endpoint

We’ll now make a GET request to http://api.open-notify.org/iss-pass.json.

In [None]:
iss_pass_url = 'http://api.open-notify.org/iss-pass.json'
response = requests.get(iss_pass_url)
response.status_code

404

We can look at `content` to see if the server told us why there was a problem.

In [None]:
response.content

b'<html>\r\n<head><title>404 Not Found</title></head>\r\n<body bgcolor="white">\r\n<center><h1>404 Not Found</h1></center>\r\n<hr><center>nginx/1.10.3</center>\r\n</body>\r\n</html>\r\n'

### An Example Request with OAuth 

[OAuth](https://en.wikipedia.org/wiki/OAuth) is a common standard used by companies to provide API access. "Auth" refers to two processes:

* Authentication: Verifying your identity
* Authorization: Giving you access to a resource

In [None]:
      
# Shell command; the leading "!" runs it from a notebook cell.
!curl "https://api.twitter.com/2/tweets?ids=1261326399320715264,1278347468690915330" \
  -H "Authorization: Bearer AAAAAAAAAAAAAAAAAAAAAFnz2wAAAAAAxTmQbp%2BIHDtAhTBbyNJon%2BA72K4%3DeIaigY0QBrv6Rp8KZQQLOTpo9ubw5Jt?WRE8avbi"


In [None]:
creds = { "id": "1261326399320715264", "key": "AAAAAAAAAAAAAAAAAAAAAFnz2wAAAAAAxTmQbp%2BIHDtAhTBbyNJon%2BA72K4%3DeIaigY0QBrv6Rp8KZQQLOTpo9ubw5Jt?WRE8avbi" }
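
Hard-coding a token in a notebook, as in the `creds` dictionary above, makes it easy to leak. One common alternative, sketched here under assumptions, is to read the token from an environment variable and build the `Authorization` header from it. The variable name `API_BEARER_TOKEN` and the helper `bearer_headers` are hypothetical.

```python
import os

def bearer_headers(token):
    """Build the Authorization header for a bearer token (hypothetical helper)."""
    return {"Authorization": f"Bearer {token}"}

# API_BEARER_TOKEN is an assumed variable name; a dummy value is used if it is unset.
token = os.environ.get("API_BEARER_TOKEN", "dummy-token")
headers = bearer_headers(token)

print(sorted(headers))  # only the header name is printed, never the token itself
```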

### Making our Request

[Yelp API Documentation](https://www.yelp.com/developers/documentation/v3/get_started)

Let's look at an example request and dissect it into its constituent parts:

In [None]:
url = 'https://api.twitter.com/2/statuses/user_timeline'
term = 'Hamburgers'
SEARCH_LIMIT = 10
headers = {
    'Authorization': 'Bearer ' + creds['key']
}

url_params = {
    'term': term,
    'location': 'Seattle+WA',
    'limit': SEARCH_LIMIT,
    'offset': 0
}
response = requests.get(url, headers=headers, params=url_params)
print(response.status_code)

404


### The Response

As before, our response object has both a status code and the data itself. This request came back with a 404, so the body is empty; with a successful response, `content` would hold the data to explore.

In [None]:
response.content

b''

## Webscraping

### The components of a web page

When we visit a web page, our web browser makes a GET request to a web server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

- HTML: contains the main content of the page.
- CSS: adds styling to make the page look nicer.
- JS: JavaScript files add interactivity to web pages.
- Images: image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping.

### HTML

We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:
~~~html
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>
~~~


Tags have commonly used names that depend on their position in relation to other tags:

- **child**: a child is a tag inside another tag. So the two `p` tags above are both children of the `body` tag.
- **parent**: a parent is the tag another tag is inside. Above, the `html` tag is the parent of the `body` tag.
- **sibling**: a sibling is a tag that is nested inside the same parent as another tag. For example, `head` and `body` are siblings, since they're both inside `html`. Both `p` tags are siblings, since they're both inside `body`.
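
These relationships can be checked directly with BeautifulSoup. The sketch below parses a compact copy of the page above (whitespace removed so that no text nodes sit between the tags) and assumes the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head></head><body>"
    "<p>Here's a paragraph of text!</p>"
    "<p>Here's a second paragraph of text!</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Children: tags directly inside body.
child_names = [tag.name for tag in body.find_all(recursive=False)]

# Parent: the tag that body sits inside.
parent_name = body.parent.name

# Siblings: tags that share the same parent.
head_sibling = soup.head.find_next_sibling().name
second_p = body.find("p").find_next_sibling("p")

print(child_names, parent_name, head_sibling)
print(second_p.get_text())
```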

We can also add properties to HTML tags that change their behavior:

~~~html
<html>
  <head></head>
  <body>
    <p>
      Here's a paragraph of text!
      <a href="https://www.dataquest.io">Learn Data Science Online</a>
    </p>
    <p>
      Here's a second paragraph of text!
      <a href="https://www.python.org">Python</a>        
    </p>
  </body>
</html>
~~~


In the above example, we added two `a` tags. An `a` tag is a link: it tells the browser to render a link to another web page. The `href` property of the tag determines where the link goes.
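
When scraping, the `href` property is usually what we want from a link. As a minimal sketch, the snippet below parses a trimmed copy of the example above and reads each link's text and `href`, assuming the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

html = """
<p>Here's a paragraph of text!
  <a href="https://www.dataquest.io">Learn Data Science Online</a></p>
<p>Here's a second paragraph of text!
  <a href="https://www.python.org">Python</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag attributes are read with dictionary-style indexing.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]
print(links)
```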

`a` and `p` are extremely common HTML tags. Here are a few others:

- *div*: indicates a division, or area, of the page.
- *b*: bolds any text inside.
- *i*: italicizes any text inside.
- *u*: underlines any text inside.
- *table*: creates a table.
- *form*: creates an input form.


For a full list of tags, look [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

## Webscraping with Python

In [None]:
from bs4 import BeautifulSoup

### The requests library

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library (similar to interacting with APIs!).

In [None]:
req = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully:

In [None]:
req.status_code

200

We can print out the HTML content of the page using the content property:

In [None]:
req.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

In [None]:
r = requests.get("https://www.jumia.co.ke/xiaomi-redmi-9a-6.53-2gb32gb-13.0mp-5000mah-4g-lte-dual-sim-grey-30887170.html")
r.content

b'<!DOCTYPE html><html lang="en" dir="ltr"><head><meta charset="utf-8" /><title>XIAOMI Redmi 9A, 6.53&quot;, 2GB+32GB, 13.0MP, 5000mAh, 4G LTE, Dual SIM - Grey @ Best Price Online | Jumia Kenya</title><meta property="og:type" content="product" /><meta property="og:site_name" content="Jumia Kenya" /><meta property="og:title" content="Redmi 9A, 6.53&quot;, 2GB+32GB, 13.0MP, 5000mAh, 4G LTE, Dual SIM - Grey" /><meta property="og:description" content="Redmi 9A6.53\\&quot; large display - 5000mAh battery -\xc2\xa013MP AI Rear CameraImmersive 6.53\\&quot; HD+ displayThe large display allows you to fully immerse yourself in the virtual world.Low blue light for a comfortable viewing experienceWith blue light protection certification, your eyes will beat ease even after spending long hours on your phone.*Massive 5000mAh BatteryWith 34 days of standby-battery time, this battery provides power that lastsLong-lasting battery lifeThe battery has a charge cycle count as high as 1000, meaning that th

### Parsing a page with BeautifulSoup

We can use the BeautifulSoup library to parse this document, and extract the text from the `<p>` tag. 

In [None]:
soup = BeautifulSoup(req.content)
list(soup.children)

['html', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [None]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


### Finding all instances of a tag at once

If we want to extract tags by name, we can use the `find_all` method, which finds all the instances of a tag on a page.

In [None]:
soup = BeautifulSoup(req.content)
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

Note that `find_all` returns a list, so we'll have to loop through it, or use list indexing, to extract text:

In [None]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'

If you instead only want to find the first instance of a tag, you can use the `find` method, which returns a single `Tag` object (or `None` if nothing matches):

In [None]:
soup.find('p')

<p>Here is some simple content for this page.</p>
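
One practical difference between the two methods is what happens when nothing matches. The sketch below runs offline on a copy of the simple page and assumes the `html.parser` backend; `text_or_default` is a helper invented for this example.

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Here is some simple content for this page.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")              # a single Tag
missing = soup.find("table")          # None: there is no table on this page
all_missing = soup.find_all("table")  # an empty list

def text_or_default(tag, default=""):
    """Return the tag's text, or a default when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default

print(text_or_default(first_p))
print(repr(text_or_default(missing, default="n/a")))
```

Checking for `None` before calling `.get_text()` avoids an `AttributeError` on pages where the tag is absent.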

### Searching for tags by class and id

In [None]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content)
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Now, we can use the `find_all` method to search for items by class or by id. In the below example, we’ll search for any `p` tag that has the class `outer-text`:

In [None]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In the below example, we’ll look for any tag that has the class `outer-text`:



In [None]:
soup.find_all(class_="outer-text")[0]

<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>

We can also search for elements by `id`:


In [None]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
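
The same searches can also be written as CSS selectors with BeautifulSoup's `select` and `select_one` methods. This is an alternative to `find_all`, sketched here on a compact copy of the page above with the `html.parser` backend assumed.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html, "html.parser")

outer = soup.select("p.outer-text")            # p tags with class outer-text
first = soup.select_one("#first")              # the tag with id "first"
inner_in_div = soup.select("div p.inner-text") # inner-text paragraphs nested in a div

print([p.get_text(strip=True) for p in outer])
print(first.get_text(strip=True))
print(len(inner_in_div))
```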

### More sophisticated webpages

In [None]:
url = 'https://forecast.weather.gov/MapClick.php?lat=41.8843&lon=-87.6324'
request = requests.get(url)
soup = BeautifulSoup(request.content)

In [None]:
times = soup.find_all(class_='period-name')
times

[<p class="period-name">Today<br/><br/></p>,
 <p class="period-name">Tonight<br/><br/></p>,
 <p class="period-name">Friday<br/><br/></p>,
 <p class="period-name">Friday<br/>Night</p>,
 <p class="period-name">Saturday<br/><br/></p>,
 <p class="period-name">Saturday<br/>Night</p>,
 <p class="period-name">Sunday<br/><br/></p>,
 <p class="period-name">Sunday<br/>Night</p>,
 <p class="period-name">Monday<br/><br/></p>]

In [None]:
descs = soup.find_all(class_='short-desc')
descs

[<p class="short-desc">Showers and<br/>Breezy</p>,
 <p class="short-desc">Showers and<br/>Breezy</p>,
 <p class="short-desc">Breezy.<br/>Chance<br/>Rain/Flurries<br/>then Mostly<br/>Sunny</p>,
 <p class="short-desc">Partly Cloudy<br/>and Breezy</p>,
 <p class="short-desc">Breezy.<br/>Chance<br/>Flurries then<br/>Slight Chance<br/>Snow Showers</p>,
 <p class="short-desc">Slight Chance<br/>Snow Showers<br/>and Blustery<br/>then Partly<br/>Cloudy</p>,
 <p class="short-desc">Sunny</p>,
 <p class="short-desc">Mostly Clear</p>,
 <p class="short-desc">Sunny</p>]

In [None]:
together = [(entry[0].text, entry[1].text) for entry in zip(times, descs)]
together

[('Today', 'Showers andBreezy'),
 ('Tonight', 'Showers andBreezy'),
 ('Friday', 'Breezy.ChanceRain/Flurriesthen MostlySunny'),
 ('FridayNight', 'Partly Cloudyand Breezy'),
 ('Saturday', 'Breezy.ChanceFlurries thenSlight ChanceSnow Showers'),
 ('SaturdayNight', 'Slight ChanceSnow Showersand Blusterythen PartlyCloudy'),
 ('Sunday', 'Sunny'),
 ('SundayNight', 'Mostly Clear'),
 ('Monday', 'Sunny')]
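
In the output above, words on either side of a `<br/>` run together ('Showers andBreezy') because `.text` simply concatenates the pieces. One possible fix, sketched offline on a few of the tags shown earlier, is `get_text` with a separator and `strip=True`. The `html.parser` backend is assumed.

```python
from bs4 import BeautifulSoup

# A few of the tags from the forecast page output above.
html = """
<p class="period-name">Today<br/><br/></p>
<p class="period-name">Friday<br/>Night</p>
<p class="short-desc">Showers and<br/>Breezy</p>
<p class="short-desc">Partly Cloudy<br/>and Breezy</p>
"""
soup = BeautifulSoup(html, "html.parser")
times = soup.find_all(class_="period-name")
descs = soup.find_all(class_="short-desc")

# Join the text pieces with a space instead of nothing.
together = [
    (time.get_text(" ", strip=True), desc.get_text(" ", strip=True))
    for time, desc in zip(times, descs)
]
print(together)
```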

### Pulling in a Table

*In general you'll need to examine the html code so that you can tell the BeautifulSoup parser what to look for!*

In [None]:
url = 'https://www.pro-football-reference.com/'

res = requests.get(url)
soup = BeautifulSoup(res.content)

In [None]:
?soup.find

In [None]:
teams = []
table = soup.find('table', {'id': 'AFC'})
#print(table)

for row in table.find('tbody').find_all('tr'):
    try:
        team = {'name': row.find('th', {'data-stat': 'team'}).text,
               'wins': row.find('td', {'data-stat': 'wins'}).text,
               'losses': row.find('td', {'data-stat': 'losses'}).text,
               'ties': row.find('td', {'data-stat': 'ties'}).text}
        teams.append(team)
    except AttributeError:
        # Rows without these cells make find() return None; skip them.
        pass

In [None]:
teams

[{'name': 'BUF*', 'wins': '13', 'losses': '3', 'ties': '0'},
 {'name': 'MIA+', 'wins': '9', 'losses': '8', 'ties': '0'},
 {'name': 'NWE', 'wins': '8', 'losses': '9', 'ties': '0'},
 {'name': 'NYJ', 'wins': '7', 'losses': '10', 'ties': '0'},
 {'name': 'CIN*', 'wins': '12', 'losses': '4', 'ties': '0'},
 {'name': 'BAL+', 'wins': '10', 'losses': '7', 'ties': '0'},
 {'name': 'PIT', 'wins': '9', 'losses': '8', 'ties': '0'},
 {'name': 'CLE', 'wins': '7', 'losses': '10', 'ties': '0'},
 {'name': 'JAX*', 'wins': '9', 'losses': '8', 'ties': '0'},
 {'name': 'TEN', 'wins': '7', 'losses': '10', 'ties': '0'},
 {'name': 'IND', 'wins': '4', 'losses': '12', 'ties': '1'},
 {'name': 'HOU', 'wins': '3', 'losses': '13', 'ties': '1'},
 {'name': 'KAN*', 'wins': '14', 'losses': '3', 'ties': '0'},
 {'name': 'LAC+', 'wins': '10', 'losses': '7', 'ties': '0'},
 {'name': 'LVR', 'wins': '6', 'losses': '11', 'ties': '0'},
 {'name': 'DEN', 'wins': '5', 'losses': '12', 'ties': '0'}]
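
The same row-by-row approach can be tried without a network connection. The markup below is a hypothetical, much-reduced table modelled on the attributes the loop above searches for (`id="AFC"`, `data-stat`); it is not copied from the real site. Instead of `try`/`except`, this sketch checks for `None` explicitly through an invented helper, `cell_text`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, shaped like the table the scraping loop expects.
html = """
<table id="AFC">
  <tbody>
    <tr class="thead"><td>AFC East</td></tr>
    <tr><th data-stat="team">BUF*</th><td data-stat="wins">13</td>
        <td data-stat="losses">3</td><td data-stat="ties">0</td></tr>
    <tr><th data-stat="team">MIA+</th><td data-stat="wins">9</td>
        <td data-stat="losses">8</td><td data-stat="ties">0</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

def cell_text(row, tag, stat):
    """Return the text of the cell with the given data-stat, or None if absent."""
    cell = row.find(tag, {"data-stat": stat})
    return cell.get_text(strip=True) if cell is not None else None

teams = []
for row in soup.find("table", {"id": "AFC"}).find("tbody").find_all("tr"):
    name = cell_text(row, "th", "team")
    if name is None:
        continue  # divider row without a team cell
    teams.append({
        "name": name,
        "wins": cell_text(row, "td", "wins"),
        "losses": cell_text(row, "td", "losses"),
        "ties": cell_text(row, "td", "ties"),
    })

print(teams)
```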

## Combining our data into a Pandas DataFrame

We can now combine the data into a Pandas DataFrame and analyze it.

To do this, we call the `DataFrame` constructor and pass in the data we collected. For the football data we pass the list of dictionaries directly: each dictionary key becomes a column, and each dictionary becomes a row. For the weather data we pass the list of tuples along with a list of column names:

In [None]:
# Football data from the table in dictionary form (very easy!)

football = pd.DataFrame(teams)
football

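| | name | wins | losses | ties |
|---|---|---|---|---|
| 0 | BUF* | 13 | 3 | 0 |
| 1 | MIA+ | 9 | 8 | 0 |
| 2 | NWE | 8 | 9 | 0 |
| 3 | NYJ | 7 | 10 | 0 |
| 4 | CIN* | 12 | 4 | 0 |
| 5 | BAL+ | 10 | 7 | 0 |
| 6 | PIT | 9 | 8 | 0 |
| 7 | CLE | 7 | 10 | 0 |
| 8 | JAX* | 9 | 8 | 0 |
| 9 | TEN | 7 | 10 | 0 |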
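
Scraped values arrive as strings, so the counts need converting before any arithmetic. The sketch below uses three of the rows from `teams` and shows one way to do that in pandas; the `games` column is added purely for illustration.

```python
import pandas as pd

# Three of the scraped rows; every value is still a string at this point.
teams = [
    {'name': 'BUF*', 'wins': '13', 'losses': '3', 'ties': '0'},
    {'name': 'MIA+', 'wins': '9', 'losses': '8', 'ties': '0'},
    {'name': 'NWE', 'wins': '8', 'losses': '9', 'ties': '0'},
]
football = pd.DataFrame(teams)

# Convert the count columns from strings to integers.
count_cols = ['wins', 'losses', 'ties']
football[count_cols] = football[count_cols].astype(int)

# With numeric columns, arithmetic and sorting behave as expected.
football['games'] = football[count_cols].sum(axis=1)
top = football.sort_values('wins', ascending=False).iloc[0]

print(football)
print(top['name'])
```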


In [None]:
# Weather data from the list of (time, description) tuples

weather = pd.DataFrame(together,
                      columns=['time', 'description'])
weather

| | time | description |
|---|---|---|
| 0 | Today | ScatteredShowers thenShowers andBreezy |
| 1 | Tonight | Showers thenScatteredShowers andBreezy |
| 2 | Friday | Breezy.ChanceSprinkles/Flurriesthen MostlySunny |
| 3 | FridayNight | Partly Cloudyand Breezy |
| 4 | Saturday | Slight ChanceSnow Showersand Breezy |
| 5 | SaturdayNight | Mostly Cloudyand Blusterythen PartlyCloudy |
| 6 | Sunday | Sunny |
| 7 | SundayNight | Mostly Clear |
| 8 | Monday | Sunny |
