# Requests and APIs

Here I will demonstrate:
- Downloading an HTML file using __requests__.
- Using __BeautifulSoup__ to extract data from the HTML file.
- Using __Pandas__ to read a table in from HTML.
- Using various __APIs__.


In [3]:
import requests
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

### Using requests
I seldom use anything other than __GET__, so that's what I will demonstrate.

My task: scrape a bunch of dairy cow data from an online database.

For example, what are the stats for the cow with the ID __"29HO18225"__?

In [4]:
url = "https://www.naab-css.org/dairy-cross-reference-results"

r = requests.get(url,params={"naab":"29HO18225"})

r.status_code

200

The status code starts with a 2, so we are good to go!

One issue with this site is that it would have returned a 2xx code regardless. If the ID weren't real, it would have just returned an empty table.
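As a side note on how `params` works: `requests` encodes the dictionary into the URL's query string before sending. Here is a minimal sketch of the equivalent encoding using only the standard library. The `query_url` name is just for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

url = "https://www.naab-css.org/dairy-cross-reference-results"
params = {"naab": "29HO18225"}

# Append the encoded key=value pairs after a "?" to form the full URL.
query_url = url + "?" + urlencode(params)
print(query_url)
```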

Notice that the url actually redirected:

In [5]:
r.url

'https://www.naab-css.org/dairy-cross-reference/HOUSA000074024948'

So another way to do this is to skip `params` entirely if I happen to know the other ID, which appears in the URL above (and this is actually what I ended up doing).

In [6]:
new_url = 'https://www.naab-css.org/dairy-cross-reference/HOUSA000074024948'
r = requests.get(new_url)
r

<Response [200]>

Instead of feeding IDs into the `params`, I would just make new URL strings.
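A rough sketch of what building those URL strings could look like. `build_urls` and `BASE_URL` are names made up for this example, and it assumes every ID follows the same pattern as the redirected URL above.

```python
from urllib.parse import quote

BASE_URL = "https://www.naab-css.org/dairy-cross-reference/"

def build_urls(animal_ids):
    """Return one cross-reference URL per animal ID."""
    # quote() escapes any characters that are not safe in a URL path.
    return [BASE_URL + quote(animal_id) for animal_id in animal_ids]

urls = build_urls(["HOUSA000074024948"])
print(urls)
```

Each URL could then be passed to `requests.get` in a loop, ideally with a short pause between requests to be polite to the server.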

Ok so what is in `r` now?

In [7]:
r.text

'\r\n\r\n<!DOCTYPE html>\r\n<html lang="en">\r\n<head><meta charset="utf-8" /><meta http-equiv="X-UA-Compatible" content="IE=edge" /><meta name="viewport" content="width=device-width, initial-scale=1.0" /><link rel="preconnect" href="https://fonts.gstatic.com" crossorigin="" /><link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@100;200;300;400;500;600;700;800;900&amp;family=Source+Sans+Pro:ital,wght@0,300;0,400;0,600;0,700;0,900;1,400&amp;display=swap" rel="stylesheet" /><link href="../Layout/CSS/styles.css?v=8" rel="stylesheet" type="text/css" /><link href="../Layout/CSS/swiper-bundle.min.css" type="text/css" rel="stylesheet" /><link href="../Layout/CSS/print.css" rel="stylesheet" type="text/css" media="print" /><link rel="apple-touch-icon" sizes="180x180" href="/images/favicon/apple-touch-icon.png" /><link rel="icon" type="image/png" sizes="48x48" href="/images/favicon/favicon-48x48.png" /><link rel="icon" type="image/png" sizes="32x32" href="/images/favicon/favicon-

The `text` attribute just stores the entire HTML of that output page. How do we get data out of it?

__BeautifulSoup__ is a package that gives tools for finding elements in HTML files.

HTML has many tags, such as:
- `<a>`: hyperlinks
- `<p>`: paragraphs
- `<h1>`: headings
- `<table>`: table wrappers
    - `<tr>`: rows of the table
    - `<td>`: table data cells
    - `<th>`: header cells of the table

BeautifulSoup first creates a "soup" object:

In [8]:
soup = BeautifulSoup(r.text,'html.parser')

soup


<!DOCTYPE html>

<html lang="en">
<head><meta charset="utf-8"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/><link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/><link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@100;200;300;400;500;600;700;800;900&amp;family=Source+Sans+Pro:ital,wght@0,300;0,400;0,600;0,700;0,900;1,400&amp;display=swap" rel="stylesheet"/><link href="../Layout/CSS/styles.css?v=8" rel="stylesheet" type="text/css"/><link href="../Layout/CSS/swiper-bundle.min.css" rel="stylesheet" type="text/css"/><link href="../Layout/CSS/print.css" media="print" rel="stylesheet" type="text/css"/><link href="/images/favicon/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/><link href="/images/favicon/favicon-48x48.png" rel="icon" sizes="48x48" type="image/png"/><link href="/images/favicon/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/><link href=

Now we can search for tags:

In [9]:
links = soup.find_all("a")

links[:2]

[<a class="SkipLink" href="#MainContent">Skip to Content</a>,
 <a href="/"><img alt="National Association of Animal Breeders: Return to homepage" src="/images/logo.svg"/></a>]

Each of these tags has attributes you can access.

In [10]:
links[0],\
links[0]['class'],\
links[0]['href'],\
links[0].contents

(<a class="SkipLink" href="#MainContent">Skip to Content</a>,
 ['SkipLink'],
 '#MainContent',
 ['Skip to Content'])

All the titles of the links:

In [11]:
[x.contents[0] for x in links]

['Skip to Content',
 <img alt="National Association of Animal Breeders: Return to homepage" src="/images/logo.svg"/>,
 <img alt="Certified Semen Services: Return to homepage" src="/images/css-logo-blue.svg"/>,
 'NAAB CR Login',
 'CSS Login',
 'About',
 'NAAB',
 'History',
 'Bylaws',
 'Code of Ethics',
 'Awards',
 'AI Careers',
 'NAAB Cross Reference Program',
 'Staffing',
 'CSS',
 'Minimum Requirements',
 'Overview & Audits',
 'Management Guidelines',
 'Bylaws',
 'Participation Agreement',
 'OIE-CSS Health Chart',
 'CSS Participants',
 'Services & Programs',
 'NAAB Marketing Code',
 'NAAB Uniform Coding System',
 'NAAB-ICAR Stud Location Codes Guidelines',
 'NAAB-Forms and Applications',
 'News & Alerts',
 'Calendar of Events',
 'Cross Reference Calendar',
 'Blog',
 'Committees & Directors',
 'Membership',
 'Become a Member',
 'Databases',
 'Dairy Cross Reference',
 'Beef Cross Reference',
 'Active (A) Sire Evaluation Database',
 'Genomic (G, young) Sire Evaluation Database - (G)',
 'S
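Some of the links above wrap an image instead of text, so `contents[0]` returns a tag for those, and it would raise an `IndexError` on a link with no contents at all. Below is a hedged sketch of a more forgiving approach. `link_title` is a hypothetical helper and the HTML is a small stand-in for the real page.

```python
from bs4 import BeautifulSoup

html = """
<a class="SkipLink" href="#MainContent">Skip to Content</a>
<a href="/"><img alt="Return to homepage" src="/images/logo.svg"/></a>
<a name="anchor-only"></a>
"""
soup = BeautifulSoup(html, "html.parser")

def link_title(a_tag):
    """Prefer the link text, fall back to an image's alt text, else ''."""
    text = a_tag.get_text(strip=True)
    if text:
        return text
    img = a_tag.find("img")
    if img is not None:
        return img.get("alt", "")
    return ""

titles = [link_title(a) for a in soup.find_all("a")]
# .get() returns None for a missing attribute instead of raising KeyError.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(titles)
print(hrefs)
```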

What I was looking for specifically was the table, which I can find by searching for the `table` tag.

The `find` function returns the first instance:

In [12]:
naab_table = soup.find("table")

naab_table

<table class="DairyCrossTable">
<tbody>
<tr>
<td><strong>Breed</strong></td>
<td>HO</td>
</tr>
<tr>
<td><strong>Country</strong></td>
<td>USA</td>
</tr>
<tr>
<td><strong>ID Number</strong></td>
<td>000074024948</td>
</tr>
<tr>
<td><strong>Semen Release Date</strong></td>
<td>2016-8</td>
</tr>
<tr>
<td><strong>Status</strong></td>
<td>I</td>
</tr>
<tr>
<td><strong>Sampling Code</strong></td>
<td> </td>
</tr>
<tr>
<td><strong>Original Controller</strong></td>
<td></td>
</tr>
<tr>
<td><strong>Reg. Name</strong></td>
<td>PINE-TREE BURLEY-ET</td>
</tr>
<tr>
<td><strong>Short Name</strong></td>
<td>BURLEY</td>
</tr>
<tr>
<td><strong>Birthdate</strong></td>
<td>5/25/2015</td>
</tr>
<tr>
<td><strong>Sire Breed</strong></td>
<td>HO</td>
</tr>
<tr>
<td><strong>Sire Country</strong></td>
<td>CAN</td>
</tr>
<tr>
<td><strong>Sire ID Number</strong></td>
<td>000011857447</td>
</tr>
<tr>
<td><strong>Dam Breed</strong></td>
<td>HO</td>
</tr>
<tr>
<td><strong>Dam Country</strong></td>
<td>USA</td>
</tr

#### Using Beautiful Soup to Extract Data:

In [13]:
# Label cells are the even-indexed <td> elements (0, 2, 4, ...);
# .string gives the text inside each cell.
row_labels = [x.string for x in naab_table.find_all("td")[0::2]]

row_labels

['Breed',
 'Country',
 'ID Number',
 'Semen Release Date',
 'Status',
 'Sampling Code',
 'Original Controller',
 'Reg. Name',
 'Short Name',
 'Birthdate',
 'Sire Breed',
 'Sire Country',
 'Sire ID Number',
 'Dam Breed',
 'Dam Country',
 'Dam ID Number',
 'MGS Breed',
 'MGS Country',
 'MGS ID Number',
 'Controller Number',
 'Primary NAAB Code',
 'Secondary NAAB Code(s)',
 'Genotype Information']

In [14]:
# Value cells are the odd-indexed <td> elements (1, 3, 5, ...);
# .strings yields every text fragment, since some cells hold several.
row_values = [list(x.strings) for x in naab_table.find_all("td")[1::2]]
row_values

[['HO'],
 ['USA'],
 ['000074024948'],
 ['2016-8'],
 ['I'],
 [' '],
 [],
 ['PINE-TREE BURLEY-ET'],
 ['BURLEY'],
 ['5/25/2015'],
 ['HO'],
 ['CAN'],
 ['000011857447'],
 ['HO'],
 ['USA'],
 ['000071859525'],
 ['HO'],
 ['USA'],
 ['000069169951'],
 ['0029'],
 ['029HO18225'],
 ['629HO18225', '529HO18225', '602HO18225', '604HO18225'],
 ['TC TV TL TY TD']]

In [15]:
# Use the lone string when a cell has exactly one; otherwise join the
# fragments with spaces (an empty cell becomes an empty string).
amended_values = [x[0] if len(x) == 1 else " ".join(x) for x in row_values]
amended_values

['HO',
 'USA',
 '000074024948',
 '2016-8',
 'I',
 ' ',
 '',
 'PINE-TREE BURLEY-ET',
 'BURLEY',
 '5/25/2015',
 'HO',
 'CAN',
 '000011857447',
 'HO',
 'USA',
 '000071859525',
 'HO',
 'USA',
 '000069169951',
 '0029',
 '029HO18225',
 '629HO18225 529HO18225 602HO18225 604HO18225',
 'TC TV TL TY TD']
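The labels and values above can also be paired into a dictionary. The sketch below is one possible way to do it, working row by row rather than by even/odd position. `table_to_record` is a hypothetical helper and the HTML is a small stand-in for the real page. It returns an empty dictionary when no label/value rows are found, which gives a simple check for unknown IDs, since the site answers 200 either way.

```python
from bs4 import BeautifulSoup

def table_to_record(html):
    """Map each label cell to its value cell for a two-column HTML table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return {}
    record = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            continue  # skip rows that are not label/value pairs
        label = cells[0].get_text(" ", strip=True)
        # Join multiple text fragments (e.g. separated by <br>) with spaces.
        value = cells[1].get_text(" ", strip=True)
        record[label] = value
    return record

sample_html = """
<table>
<tr><td><strong>Breed</strong></td><td>HO</td></tr>
<tr><td><strong>Codes</strong></td><td>629HO18225<br/>529HO18225</td></tr>
<tr><td><strong>Original Controller</strong></td><td></td></tr>
</table>
"""
record = table_to_record(sample_html)
print(record)
```

Unlike the list comprehensions above, this version strips whitespace, so a cell holding only a space comes back as an empty string.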

#### Using Pandas to Extract HTML Data:

In [16]:
from io import StringIO  
pd.read_html(StringIO(str(naab_table)))[0].set_index(0).T

Unnamed: 0,Breed,Country,ID Number,Semen Release Date,Status,Sampling Code,Original Controller,Reg. Name,Short Name,Birthdate,...,Dam Breed,Dam Country,Dam ID Number,MGS Breed,MGS Country,MGS ID Number,Controller Number,Primary NAAB Code,Secondary NAAB Code(s),Genotype Information
1,HO,USA,74024948,2016-8,I,,,PINE-TREE BURLEY-ET,BURLEY,5/25/2015,...,HO,USA,71859525,HO,USA,69169951,29,029HO18225,629HO18225 529HO18225 602HO18225 604HO18225,TC TV TL TY TD


## API Examples

### Google Maps "Geocode" API
https://developers.google.com/maps/documentation

This API lets you query geographic information based on a name using the Google Maps engine. Google Maps has several APIs, including ones that give directions, distances, or locations of things.

__Note: this API is not free, unlike the other examples. Google charges per query once you set up an account and billing information.__

Here we are going to find out the address of the famous Morrow Plots, the oldest experiment plot in the United States.

In [17]:
location = "Morrow Plots"

In [18]:
# We're going to use a specific URL, which looks up geocodes and sends back JSON.
url = 'https://maps.googleapis.com/maps/api/geocode/json'

# Collect the query parameters in a dictionary.
# map_key must already hold your Google Maps API key (it is not defined in this notebook).
PARAMS = {'address':location,'key':map_key}

NameError: name 'map_key' is not defined
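The error above appears because `map_key` was never defined in this notebook. One common way to supply a key without pasting it into the code is an environment variable. The sketch below assumes a variable named `GOOGLE_MAPS_API_KEY`, a name chosen only for this example, and `load_api_key` is a hypothetical helper.

```python
import os

def load_api_key(env_var="GOOGLE_MAPS_API_KEY"):
    """Read an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key

# Usage (uncomment once the variable is set in your shell):
# map_key = load_api_key()
```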

In [None]:
# Using GET, fetch data from that URL, passing in those parameters.
r = requests.get(url = url, params = PARAMS)

# Check the status code:
r.status_code

A status code of 200 = Success!

Now let's look at the data in JSON:

In [None]:
data = r.json()
data

So now we have a nested JSON object, which can be read as a dictionary. Let's access the "formatted_address" field to see if it found the right location.

In [None]:
data['results'][0]['formatted_address']

Right in our backyard!
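Indexing with `[0]` raises an `IndexError` when the search finds nothing. Here is a hedged sketch of a safer lookup, relying on the `status` and `results` fields of the Geocoding response. `first_formatted_address` is a hypothetical helper, and both sample payloads are made up for illustration rather than real API output.

```python
def first_formatted_address(payload):
    """Return the first formatted address, or None when nothing matched."""
    if payload.get("status") != "OK":
        return None
    results = payload.get("results", [])
    if not results:
        return None
    return results[0].get("formatted_address")

# Made-up payloads that mimic the overall shape of a response.
sample_ok = {
    "status": "OK",
    "results": [{"formatted_address": "123 Example St, Sampletown"}],
}
sample_empty = {"status": "ZERO_RESULTS", "results": []}

print(first_formatted_address(sample_ok))
print(first_formatted_address(sample_empty))
```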

### Lord of the Rings API

https://the-one-api.dev/

This API contains information about the book and movie series Lord of the Rings. To access different datasets, we need to pass different URLs, and in this case we're going to access the "movie" database.

Note here that instead of authenticating through `params`, we pass the key in a header, which is another common way for APIs to handle authentication.

In [None]:
headers = {"Authorization": "Bearer 9r0RdKMLdSeyl3JulQEV"}
movies = requests.get("https://the-one-api.dev/v2/movie",headers=headers)
movies

Success, so let's keep going.

In [None]:
movies.json()

So this looks messy, but notice that the first layer is "docs", which is the key to a list of dictionaries. Pandas is the most handy way to convert from a dictionary to a DataFrame:

In [None]:
data = pd.DataFrame(movies.json()['docs'])
data

Much nicer to look at. This table contains information about both the budgets of the movies and also their revenue. We might ask ourselves, which of the six movies had the best return on their investment?

__FUN FACT__: a typical rule of thumb is that a film breaks even once its box office revenue exceeds __double the film's budget__ (to take into account advertising, licensing, etc.)

In [None]:
data['roi'] = data['boxOfficeRevenueInMillions']/(data['budgetInMillions'])
# Skip the first two rows, as they are the total for the trilogies.
data = data[2:]
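To make the ratio and the rule of thumb concrete, here is a small self-contained sketch. The column names match the API's, but the three films and all their numbers are invented purely for illustration, and `clears_rule_of_thumb` is a made-up column name.

```python
import pandas as pd

# Hypothetical films with invented budgets and revenues (in millions).
films = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C"],
    "budgetInMillions": [100, 200, 50],
    "boxOfficeRevenueInMillions": [450, 300, 100],
})

# Revenue divided by budget, the same ratio computed above.
films["roi"] = films["boxOfficeRevenueInMillions"] / films["budgetInMillions"]

# Flag films whose revenue exceeds double the budget.
films["clears_rule_of_thumb"] = films["roi"] > 2
print(films)
```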

Let's look at the movies sorted by ROI

In [None]:
data.set_index("name")['roi'].sort_values(ascending=False)

The original franchise was Fellowship, Two Towers, and Return of the King, whereas the follow-up franchise was Unexpected Journey, Desolation of Smaug, and Battle of the Five Armies.

So as the original franchise went on, ROI increased. However, the new Hobbit trilogy had a lower ROI with each successive movie.

__Bonus:__ Let's make a graph!!

Pandas Series objects can be plotted quickly with `.plot()`, which will automatically use the index as the X and the values as the Y.

In [None]:
data.set_index("name")['roi'].sort_values().plot(kind='bar')

Oof, hard to read. A "horizontal bar graph" is a much better option.

In [None]:
# Call the plot, which sets this graph as active in the space.
data.set_index("name")['roi'].sort_values().plot(kind='barh',label="Revenue/Budget")
plt.axvline(color="black")

# Using the "matplotlib.pyplot" package, we can manipulate the parameters for whatever graph is active.
plt.axvline(2,color='black',ls='--',label="Break Even")
plt.xlabel("Revenue/Budget")
plt.ylabel("")
plt.title("LOTR Movies Sorted by Revenue/Budget")
plt.legend()

### NASS API
https://quickstats.nass.usda.gov/api

The USDA's National Agricultural Statistics Service (NASS) serves up all of the USDA's surveys and censuses in one place. Using their GUI, we can download the spreadsheet manually, which could take forever. Using the API, we can read data in directly from their website.

Suppose that we want to know the total dairy cow population by state in 2017 as calculated by the Agricultural Census. The API has a 50k record limit per request, so we need to narrow our query with parameters.

Using their documentation, we can construct a Python dictionary that tells NASS what data we want. We'll call this dictionary "params."

In [None]:
URL = "https://quickstats.nass.usda.gov/api/api_GET/"

params = {"key":nass_key, # Put the API KEY
          "year":"2017", # The year Census we want.
          "domain_desc":"TOTAL", # total across all domains
          "source_desc":"CENSUS",# Specify that we want the Census, not a survey.
          "agg_level_desc":"STATE", # Specify that we want the state level.
          "short_desc":"CATTLE, COWS, MILK - INVENTORY" # The name of the variable, so we don't 
                                                        # have to specify more params
         }

In [None]:
r = requests.get(url = URL, params = params)
r.status_code

A 200 means __success!__

Now let's look at what we just downloaded.

In [None]:
r.json()

It's a Python dictionary again, with the key "data", which maps to a list full of more dictionaries.

Using Pandas:

In [None]:
data = pd.DataFrame(r.json()['data'])

data.head()

In [None]:
len(data)

There are 50 rows, so we indeed have the state level data we were looking for.

So now we have a DataFrame that we can analyze and work with. 

__BONUS: LET'S MAKE ANOTHER GRAPH__

First find the data stored in the "Value" column, and make a Series object indexed by the state abbreviation.

In [None]:
dem_cows = data.set_index("state_alpha")["Value"]
dem_cows

Notice that __dtype: object__, so we have some non-numeric data (the D's).

But that isn't the only problem: we have commas in the values, causing Pandas to think this is a series of strings.

Using the "string accessor" of Series objects (".str") we can call a replace function and take out the commas.

In [None]:
dem_cows = dem_cows.str.replace(",","")

Another useful function here is `pd.to_numeric()`, which will attempt to convert strings to numbers, and we can also tell it to turn anything it can't convert into a missing value:

In [None]:
dem_cows = pd.to_numeric(dem_cows,errors='coerce')

dem_cows

dtype is now float, so we're good to go.
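The two cleaning steps can be tried on a tiny self-contained Series. The values and index labels below are invented for illustration, with "(D)" standing in for the non-numeric entries mentioned above.

```python
import pandas as pd

# Invented sample values: numbers with commas plus one non-numeric marker.
raw = pd.Series(["1,234", "(D)", "56"], index=["AA", "BB", "CC"])

# Step 1: strip the thousands separators (plain string replace, no regex).
no_commas = raw.str.replace(",", "", regex=False)

# Step 2: convert to numbers; anything unparseable becomes NaN.
cleaned = pd.to_numeric(no_commas, errors="coerce")
print(cleaned)
```

The missing value forces the result to a float dtype, which is why the counts show up as floats rather than integers.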

Let's find the top 10 states by dairy cow population and look at a bar graph.

In [None]:
# Find a tiny little series of the top 10 states.
top10 = dem_cows.sort_values(ascending=False).head(10)

# Graph it using the built in matplotlib functionality.
top10.plot(kind='bar')

# Specify some things with the graph
plt.xticks(rotation=0)
plt.xlabel("State")
plt.ylabel("Number of Dairy Cows")
plt.title("Top Ten States by Dairy Cow Population")

## You now have everything you need to do Homework 0