# Course: Intro to Python & R for Data Analysis
## Lecture: New Data City & Access Granted

Data Structures & API

Professor: Mary Kaltenberg

Fall

contact: mkaltenberg@pace.edu

About me: www.mkaltenberg.com

## Objectives:

Understand:
- what is an API
- protocols and HTTP requests
- the format of various data structures
- basics of API
- use Python to make requests to an API
    - make a request!
    - Parse JSON information
    - Turn it into a dataframe


In [1]:
#Another trick for installing pip packages via jupyter notebook
!pip install ratelim tenacity
!pip install requests

Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Collecting tenacity
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, ratelim
Successfully installed ratelim-0.1.6 tenacity-8.0.1


In [None]:
# you only have to install a package once
# going forward you can just import the package
import ratelim
import tenacity
import requests

In [2]:
!pip install --upgrade pip

Collecting pip
  Downloading pip-21.3-py3-none-any.whl (1.7 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.3.1
    Uninstalling pip-20.3.1:
      Successfully uninstalled pip-20.3.1
Successfully installed pip-21.3


An API, or Application Programming Interface, is an interface, usually running on a **server**, that you can use to retrieve and send data using code. APIs are most commonly used to retrieve data - and for this class, I'll only go over this aspect.

**Servers** are basically powerful computers - and we will talk more about working directly with them later. 

Generally, APIs are connectors of systems. They link up websites, desktops, smartphones, etc. - and when they are the link to something else, the two are "integrated." APIs are everywhere, but you rarely see them as the user. An API will talk with a **client**, which is a program that exchanges data with a server through an API. 

APIs are technically a way for two different applications to communicate. They are a set of rules (an interface) that the two sides agree to follow. The company publishing the API implements its side of the rules by writing a program and putting it on a server. 

A great example is a smartphone app that syncs with a website. When you push the refresh button in your app, it talks to a server via an API and fetches the newest info.

When we want to use an API directly, we will need to make a request. When we push refresh, we are requesting updated information. 

You will do the same thing - you will directly request data from an API server - and it will respond to your request. 

<img src = "servers.png">

For example, if you want to build an application which plots stock prices, you would use the API of something like google finance to request the current stock prices.

APIs are useful where:
* Data is changing quickly, e.g. stock prices
* The whole dataset is not required, e.g. the tweets of one user
* Repeated computation is involved, e.g. Spotify API that tells you the genre of a piece of music



<img src = "api-request.svg">

From your end, you are going to request stuff (the method called 'GET' - more on this in a second) 

And the server will respond with code and/or data

When we say we are 'web-scraping' (which you're here to learn) it just means that we are requesting information from an API of a web server and collecting that information.

Sometimes, APIs just store data for us to get. That is, there are APIs whose whole job is to answer the data requests that we send them, and they are designed to do this. This is the ideal situation - that the world openly shares its information with everyone! True creative commons spirit.

Sadly, that is not the world that we live in. Some websites will not share their data - they will do everything to make it VERY difficult to get access to their information (Google Scholar). Other websites were just not made for giving users data directly, but the data is freely available; you just have to learn how to tap into it and collect what you're interested in.



## The rules of the game

Computers follow a system of rules to get along in the world - when we connect two computers, there is some etiquette behind it. To get data from APIs, you're going to have to learn their rules and follow them.

We're going to learn the rules of the web - the Hypertext Transfer Protocol (i.e. HTTP)

When we say, go to http://www.pace.edu, we are saying: follow the web rules called HTTP 

(Your Jupyter notebook plays by the same rules: it is served to your browser by a small web server running on your own computer, typically at an address like `http://localhost:8888`, even when the browser hides the `http://` part.)

HTTP centers around request-response

<img src ="httprequest.png">

To make a valid request, the client needs to include four things:

1. URL (Uniform Resource Locator)
2. Method
3. List of Headers
4. Body


## 1. URL

URLs are specific locations of things. You use them ALL of the time. But URLs have a specific format that is useful to understand when you web-scrape. Each website has its own structure - so do APIs. 

Each website has its own way of organizing data - you will have to figure out how it operates. Some APIs will just tell you what their structure is; others will require you to investigate.

### Example API: Eurostat

[EUROSTAT API](https://ec.europa.eu/eurostat/web/json-and-unicode-web-services).

The documentation linked gives a very useful summary of the structure of the requests:
![](url_example.png)
 
* host_url : fixed part of the request related to our website
* service : fixed part of the request related to the service
* version : fixed part of the request related to the version of the service
* format : data format to be returned (json or unicode)
* lang : language used for metadata (en/fr/de)
* datasetCode : unique code identifier of the queried dataset
* filters : specify the scope of the query (optional). There is a threshold of maximum 50 sub-indicators per query. The filters are specific to a dataset, depending on dataset dimensions.
    * precision : the number of decimals for the values returned by the request
    * unit : filter on the dataset's UNIT dimension 
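
The pieces listed above can be glued together in code. The sketch below uses only Python's standard library and the values from the Eurostat request used later in this lecture. Exactly where `host_url` ends and `service` begins is an assumption made for illustration, so check the documentation for the API you are working with.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Fixed parts of the request. How the address is divided between
# host_url and service is an assumption made for illustration.
host_url = "http://ec.europa.eu/eurostat/wdds"
service = "rest/data"
version = "v2.1"
data_format = "json"
lang = "en"
dataset_code = "nama_10_gdp"

# Filters: a list value means "repeat this key once per item"
filters = {
    "precision": 1,
    "unit": "CLV05_MEUR",
    "geo": ["NL", "DE"],
    "time": [2010, 2011, 2012],
}

# Everything before the ? is joined with slashes
path = "/".join([host_url, service, version, data_format, lang, dataset_code])

# Everything after the ? is key=value pairs joined with &
url = path + "?" + urlencode(filters, doseq=True)
print(url)

# Going the other way: split a URL back into its pieces
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path)
print(parse_qs(parts.query))
```

Building the query string from a dictionary rather than by hand makes it easier to change one filter (say, add a country) without retyping the whole URL.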

## 2. Method

The request method tells the server what kind of action the client wants the server to take. In fact, the method is commonly referred to as the request "verb".

The four methods most commonly seen in APIs are:

**GET - Asks the server to retrieve a resource**

POST - Asks the server to create a new resource

PUT - Asks the server to edit/update an existing resource

DELETE - Asks the server to delete a resource

We are really only going to look at GET requests

## 3. List of Headers

Headers provide meta-information about a request. They are a simple list of items like the time the client sent the request and the size of the request body.

They can also describe the client: this is part of why the same website can present itself differently on a smartphone vs. a desktop.

For now, know they exist; we'll get back to them when we talk about web-scraping in more detail. 

## 4. Body

A unique trait about the body is that the client has complete control over this part of the request. Unlike the method, URL, or headers, where the HTTP protocol requires a rigid structure, the body allows the client to send anything it needs.


This is the flexible part of the protocol - it's the specifics that you are asking the API about. For example, when you ask Spotify about music genres, Spotify will tell you the exact format that you need to use to get the information you want. 
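
To see the four ingredients side by side, here is a small sketch that builds a request object without sending it, using the standard library's `urllib.request`. The URL and the header values are placeholders invented for this example, not a real API.

```python
from urllib.request import Request

# Build (but do not send) a request. The URL and header values are
# made-up placeholders for illustration.
req = Request(
    "http://www.example.com/data?year=2012",       # 1. URL
    method="GET",                                  # 2. Method
    headers={                                      # 3. Headers
        "Accept": "application/json",
        "User-Agent": "intro-class-demo/0.1",
    },
    data=None,                                     # 4. Body (empty for a GET)
)

print(req.full_url)
print(req.get_method())
print(dict(req.header_items()))
print(req.data)
```

A `GET` request normally has no body, which is why `data` is `None` here; the specifics of what we are asking for travel in the URL instead.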

#### REST
Putting it all together:

Most APIs you come across will be RESTful, i.e. they provide a REST (REpresentational State Transfer) interface.

REST uses standard HTTP commands, which means that getting data from an API is similar to accessing a webpage. 

For example, when you type `www.duckduckgo.com` in your browser, your browser is asking the `www.duckduckgo.com` server for a webpage by making a `GET` HTTP (Hypertext Transfer Protocol) request. Making a `GET` request to a RESTful API instead retrieves data (rather than a webpage).

Similarly, while your browser uses `POST` to submit the contents of a form, REST APIs use `POST` to create data.

REST APIs also use other HTTP commands such as `PUT` - for updating data - and `DELETE` - for removing data.

HTTP is a text-based protocol and could return a response in any format - the format is typically described in the API documentation - though data is more often than not returned in JSON format.

As they are used to retrieve data, `GET` requests are the most commonly used type of request, which again is why we restrict ourselves to `GET` in this class.

## Responses

After you send a request to an API (making sure you cover those four ingredients), the server will respond to let you know whether it was successful, an error occurred, or there was some other issue.


These are coded - they let you know the result of your request with a **status code**

Probably the most famous one is status code 404 - Not Found (when you put in the wrong URL)

#### Status codes

These numeric [status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) in response to HTTP requests indicate whether a request has been successfully completed.

Some common ones relating to `GET` requests are:
* `200` - Success
* `301` - The API is redirecting to a different endpoint
* `400` - Bad request
* `401` - Not authenticated
* `403` - Forbidden
* `404` - Not found
* `429` - Too many requests
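
If you forget what a code means, Python's standard library can tell you. The sketch below looks up the official phrase for each code in the list and groups it by its first digit. The `describe` helper is a name made up for this example.

```python
from http import HTTPStatus

# The first digit of a status code tells you the general kind of result
FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe(code):
    """Return the code, its standard phrase, and its family as one string."""
    status = HTTPStatus(code)  # raises ValueError for an unknown code
    family = FAMILIES.get(code // 100, "other")
    return f"{code} {status.phrase} ({family})"

for code in (200, 400, 401, 403, 404, 429):
    print(describe(code))
```

As a rule of thumb, anything starting with 4 means the problem is probably in your request, and anything starting with 5 means the problem is on the server's side.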

## Data Structures

Data is stored in a variety of structures within APIs - usually as text in some structured format. It is usually not stored in the ways that you have seen data presented in the past.

All of you have probably seen the classic file structure of a csv file (whether you realize it or not) and maybe some have seen tsv


### CSV

These are equivalent:
<table>
<tr><td style="text-align:left">
<img src="excel_csv.png" width = 550>
</td><td>
<img src="text_csv.png"width = 500 >
</td></tr></table>
    

It just means that the data is stored comma separated. When Excel opens the file, it typically recognizes the format and automatically puts each value in its own cell whenever it sees a comma.

### TSV

TSV is very similar: instead of being comma separated, it is tab separated (a tab is often written as `\t` when you import it).

This is useful to know when you import data into python/jupyter notebook - especially when we use pandas.  Just be aware of this format. 

However, APIs don't usually store information this way.
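
To make the CSV/TSV difference concrete, the sketch below reads the same tiny table in both formats with the standard library's `csv` module. The table itself is made-up sample data.

```python
import csv
import io

# The same made-up table, once comma separated and once tab separated
csv_text = "name,foodRanking\nPizza,1\nTacos,2\n"
tsv_text = "name\tfoodRanking\nPizza\t1\nTacos\t2\n"

# io.StringIO lets us treat a string as if it were an open file
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

print(csv_rows)
print(csv_rows == tsv_rows)  # only the separator differs, not the data
```

Notice that every value comes back as a string (`'1'`, not `1`): plain CSV and TSV files carry no information about data types.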

### JSON
The formats that you will need to consider when you work with APIs are **JSON** and **XML**

JSON stands for JavaScript Object Notation and grew out of the JavaScript programming language.
So those familiar with JavaScript (not Java, which is a different language) will recognize its format - but you don't need to know JavaScript to use it.

It is meant to be human readable and easy to parse

JSON is a very simple format that has two pieces: keys and values.

...that should sound familiar to you.

It uses attribute-value pairs (e.g. python dictionaries {"name": "Pizza", "foodRanking": 1}) and array data-types (e.g. python lists [1, 2, 3])  (Knowing this is coming in handy now!)

Example JSON representation :

`
{
  "firstName": "Donald",
  "lastName": "Trump",
  "age": 73,
  "isAlive": true,
  "color": "orange",
  "addresses": [
      {
          "streetAddress": "1600 Pennsylvania Avenue NW",
          "city": "Washington, D.C.",
          "state": "null",
          "postalCode": "20500",
          "country": "US"
      },
      {
          "streetAddress": "721 Fifth Avenue",
          "city": "NYC",
          "state": "NY",
          "postalCode": "10022",
          "country": "US"
      }
  ]
}`
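
Here is a short sketch of how JSON text turns into the Python objects you already know, using the standard library's `json` module. The record extends the pizza example above; the `toppings`, `inStock` and `notes` fields are made up for illustration.

```python
import json

# JSON arrives as plain text...
text = """
{
  "name": "Pizza",
  "foodRanking": 1,
  "toppings": ["cheese", "basil"],
  "inStock": true,
  "notes": null
}
"""

# ...and json.loads turns it into dictionaries, lists, numbers, etc.
record = json.loads(text)

print(type(record))            # a Python dictionary
print(record["toppings"][0])   # lists are indexed as usual
print(record["inStock"])       # JSON true becomes Python True
print(record["notes"])         # JSON null becomes Python None

# json.dumps goes the other way: Python objects back to JSON text
print(json.dumps(record, indent=2))
```

Because the result is made of ordinary dictionaries and lists, everything you already know about indexing with `[...]` applies directly.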

### XML

... and I'll be honest here, I am not nearly as familiar with this in practice. Almost all APIs I have worked with somehow ended up being in JSON. 

XML or Extensible Markup Language is older - been around since the days of dial-up internet.

The main block of information is called a node. And it'll look like this:

```
<cheeto>
    <firstName>Donald</firstName>
    <lastName>Trump</lastName>
    <age>73</age>
    <isAlive>true</isAlive>
    <color>orange</color>
    <addresses>
        <address>
            <streetAddress>1600 Pennsylvania Avenue NW</streetAddress>
            <city>Washington, D.C.</city>
            <state>null</state>
            <postalCode>20500</postalCode>
            <country>US</country>
        </address>
        <address>
            <streetAddress>721 Fifth Avenue</streetAddress>
            <city>New York</city>
            <state>NY</state>
            <postalCode>10022</postalCode>
            <country>US</country>
        </address>
    </addresses>
</cheeto>
```

XML always starts with a root node, which in our example is `cheeto`. Inside the root are more "child" nodes. The name of each node tells us the attribute of the person (like the key in JSON) and the data inside is the actual detail (like the value in JSON).

The structure is basically:

<img src='xml_structure.png'>

You can also infer English sentences by reading XML. Looking at the line with `color`, we could read, "the color is orange." Notice how every piece of an address is wrapped in its own opening and closing tag. You can see how the XML format requires a lot more text to communicate than JSON does.
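
For completeness, here is a minimal sketch of reading XML in Python with the standard library's `xml.etree.ElementTree`. The small `food` record is made up for this example (it mirrors the pizza dictionary from the JSON section).

```python
import xml.etree.ElementTree as ET

# A small made-up XML document: one root node with child nodes inside
xml_text = """
<food>
    <name>Pizza</name>
    <foodRanking>1</foodRanking>
    <toppings>
        <topping>cheese</topping>
        <topping>basil</topping>
    </toppings>
</food>
"""

root = ET.fromstring(xml_text)

print(root.tag)                      # name of the root node
print(root.find("name").text)        # text inside the <name> child

# Everything in XML is text, so numbers must be converted by hand
ranking = int(root.find("foodRanking").text)
print(ranking)

# iter() walks through every node with the given name, however deep
toppings = [node.text for node in root.iter("topping")]
print(toppings)
```

Compared to JSON, there is one more step: XML does not distinguish numbers, booleans or lists, so you decide how to convert each piece of text.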

## Connecting HTTP to Data

Now, let's go back to our example:
The documentation linked gives a very useful summary of the structure of the requests:

![](url_example.png)

Here, you see we are asking specifically for JSON format. Some APIs operate only in one data structure; others can operate in both, and you just need to tell them which one you want as part of your request (here it is part of the URL; other APIs expect it in the header information).

## Python and APIs

Now you know what an API is - which means we can use python to start requesting.

We will be using the library requests (yes, as in http requests!)


In [6]:
import requests  # (we already did this, but just so you don't forget the library we are using right now)

# Query URL <- Here we are defining a particular url
#I happen to know the exact information layout here, but on your own you are going to 
# have to figure this out for any particular API

url = ('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en'
       '/nama_10_gdp?precision=1'
       '&unit=CLV05_MEUR'  # Unit: CLV (2005) Million EUR
       '&geo=NL&geo=DE'  # Country: Netherlands, Germany
       '&time=2010&time=2011&time=2012'  # Years: 2010, 2011, 2012
       '&na_item=B1GQ&na_item=D21'  # GDP (market prices) & taxes on products
       )
# Some api's will have nicer syntax like:
# `&time=2010..2012` or `&na_item=B1GQ,D21`
print(url)


# First part - the GET request.  
# I made sure to give it a url that contains the specific information I want.

response = requests.get(url)  # Make a GET request to the URL and store the result in a variable called response.

response  # If we just display response, it shows the status code of the request,
# but we can also ask it to tell us what that status code means.

http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/nama_10_gdp?precision=1&unit=CLV05_MEUR&geo=NL&geo=DE&time=2010&time=2011&time=2012&na_item=B1GQ&na_item=D21


<Response [200]>

In [4]:
# Print status code (and associated text) 
# for the GET request to this url, which I called response.
# I ask Python for the status_code stored on response;
# the period here denotes something that lives inside response.
# Notice how there are no parentheses after status_code - that is because it is not a function,
# it is a named attribute of response (we will see this in pandas, too, when we get there)

print(f"Request returned {response.status_code} : '{response.reason}'")

#YAY! Success.


Request returned 200 : 'OK'


In [None]:
print("Request returned {response.status_code} : '{response.reason}'")

# the f prefix is a neat trick that tells Python to format the string (it's called Literal String Interpolation):
# give me the information that is contained in this variable (i.e. give me what is in response.status_code,
# don't just print the words 'response.status_code')

# Without f you just get the string literal, curly braces and all
# Note that reason is a string, so I wrapped {response.reason} in quotes in the output
# (not necessary per se, but good practice)

In [5]:
response.json()

{'version': '2.0',
 'label': 'GDP and main components (output, expenditure and income)',
 'href': 'http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/nama_10_gdp?precision=1&unit=CLV05_MEUR&geo=NL&geo=DE&time=2010&time=2011&time=2012&na_item=B1GQ&na_item=D21',
 'source': 'Eurostat',
 'updated': '2021-10-20',
 'extension': {'datasetId': 'nama_10_gdp',
  'lang': 'EN',
  'description': None,
  'subTitle': None},
 'class': 'dataset',
 'value': {'0': 2426563.7,
  '1': 2521811.0,
  '2': 2532364.7,
  '3': 589946.6,
  '4': 599097.8,
  '5': 592925.0,
  '6': 228301.5,
  '7': 237271.4,
  '8': 236319.0,
  '9': 60631.6,
  '10': 59852.5,
  '11': 57924.1},
 'dimension': {'unit': {'label': 'unit',
   'category': {'index': {'CLV05_MEUR': 0},
    'label': {'CLV05_MEUR': 'Chain linked volumes (2005), million euro'}}},
  'na_item': {'label': 'na_item',
   'category': {'index': {'B1GQ': 0, 'D21': 1},
    'label': {'B1GQ': 'Gross domestic product at market prices',
     'D21': 'Taxes on products'}}},


In [7]:
#How do we get the data that is stored within the request?

# Print data returned (parsing as JSON)
payload = response.json()  # Parse `response.text` into JSON and call it payload
# it will parse other formats if the information is stored in that way 
# (ie when I asked the website for data, I already told it I want it in Json)

payload

{'version': '2.0',
 'label': 'GDP and main components (output, expenditure and income)',
 'href': 'http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/nama_10_gdp?precision=1&unit=CLV05_MEUR&geo=NL&geo=DE&time=2010&time=2011&time=2012&na_item=B1GQ&na_item=D21',
 'source': 'Eurostat',
 'updated': '2021-10-20',
 'extension': {'datasetId': 'nama_10_gdp',
  'lang': 'EN',
  'description': None,
  'subTitle': None},
 'class': 'dataset',
 'value': {'0': 2426563.7,
  '1': 2521811.0,
  '2': 2532364.7,
  '3': 589946.6,
  '4': 599097.8,
  '5': 592925.0,
  '6': 228301.5,
  '7': 237271.4,
  '8': 236319.0,
  '9': 60631.6,
  '10': 59852.5,
  '11': 57924.1},
 'dimension': {'unit': {'label': 'unit',
   'category': {'index': {'CLV05_MEUR': 0},
    'label': {'CLV05_MEUR': 'Chain linked volumes (2005), million euro'}}},
  'na_item': {'label': 'na_item',
   'category': {'index': {'B1GQ': 0, 'D21': 1},
    'label': {'B1GQ': 'Gross domestic product at market prices',
     'D21': 'Taxes on products'}}},


In [8]:
# we can print it or we can print it nicely with pprint

import pprint
pp = pprint.PrettyPrinter(indent=1)
pp.pprint(payload)

{'class': 'dataset',
 'dimension': {'geo': {'category': {'index': {'DE': 0, 'NL': 1},
                                    'label': {'DE': 'Germany (until 1990 '
                                                    'former territory of the '
                                                    'FRG)',
                                              'NL': 'Netherlands'}},
                       'label': 'geo'},
               'na_item': {'category': {'index': {'B1GQ': 0, 'D21': 1},
                                        'label': {'B1GQ': 'Gross domestic '
                                                          'product at market '
                                                          'prices',
                                                  'D21': 'Taxes on products'}},
                           'label': 'na_item'},
               'time': {'category': {'index': {'2010': 0, '2011': 1, '2012': 2},
                                     'label': {'2010': '2010',
                        

In [9]:
#Another way to look at data is through the package json
#  dumps() function is particularly useful as we can use it to print a formatted string 
# which makes it easier to understand the JSON output

import json

def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

jprint(response.json())  

#though I prefer the prior format

{
    "class": "dataset",
    "dimension": {
        "geo": {
            "category": {
                "index": {
                    "DE": 0,
                    "NL": 1
                },
                "label": {
                    "DE": "Germany (until 1990 former territory of the FRG)",
                    "NL": "Netherlands"
                }
            },
            "label": "geo"
        },
        "na_item": {
            "category": {
                "index": {
                    "B1GQ": 0,
                    "D21": 1
                },
                "label": {
                    "B1GQ": "Gross domestic product at market prices",
                    "D21": "Taxes on products"
                }
            },
            "label": "na_item"
        },
        "time": {
            "category": {
                "index": {
                    "2010": 0,
                    "2011": 1,
                    "2012": 2
                },
                "label": {
           

## You did it! Your first request.

In [11]:
# response contains a TON of information and when you use the dir function
# it will tell you all of the things that live within it
dir(response)

# But, a nicer way to do this in Jupyter is to type `response.` and press Tab


['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

In [12]:
pp.pprint(payload)

{'class': 'dataset',
 'dimension': {'geo': {'category': {'index': {'DE': 0, 'NL': 1},
                                    'label': {'DE': 'Germany (until 1990 '
                                                    'former territory of the '
                                                    'FRG)',
                                              'NL': 'Netherlands'}},
                       'label': 'geo'},
               'na_item': {'category': {'index': {'B1GQ': 0, 'D21': 1},
                                        'label': {'B1GQ': 'Gross domestic '
                                                          'product at market '
                                                          'prices',
                                                  'D21': 'Taxes on products'}},
                           'label': 'na_item'},
               'time': {'category': {'index': {'2010': 0, '2011': 1, '2012': 2},
                                     'label': {'2010': '2010',
                        

In [16]:
# Ok, but what the heck is going on in this data? 
# How do I read what's going on?

payload['value']  #The output of this isn't super informative

{'0': 2426563.7,
 '1': 2521811.0,
 '2': 2532364.7,
 '3': 589946.6,
 '4': 599097.8,
 '5': 592925.0,
 '6': 228301.5,
 '7': 237271.4,
 '8': 236319.0,
 '9': 60631.6,
 '10': 59852.5,
 '11': 57924.1}

In [17]:
# Here we want to see how the data is formatted, so we look at its structure
# Note, zip is a useful function that pairs two lists together so that it outputs tuples of information

# 'id' tells us the name of each dimension of the values we looked at above
# and 'size' tells us how many items each dimension contains

list(zip(payload['id'], payload['size']))  # Dimensions of our data (1 x 2 x 2 x 3)

[('unit', 1), ('na_item', 2), ('geo', 2), ('time', 3)]

We want to extract the indices from each dimension of the data (e.g. 'DE' and 'NL' for 'geo') 
and enumerate all the possible index combinations in order to build the index for the values in `payload['value']`.

Let's first extract the indices...

In [21]:
payload['id']

['unit', 'na_item', 'geo', 'time']

In [18]:
list_of_keys = []  # empty list
for k in payload['id']:  # for each dimension name in the list
    list_of_keys.append(  # look up dimension -> k -> category -> index and append its keys
        payload['dimension'][k]['category']['index'].keys()
        )
print(list_of_keys)
    
# NOTE: Equivalent to: [payload['dimension'][k]['category']['index'].keys() for k in payload['id']]

[dict_keys(['CLV05_MEUR']), dict_keys(['B1GQ', 'D21']), dict_keys(['DE', 'NL']), dict_keys(['2010', '2011', '2012'])]


Now we want to enumerate all the combinations.

Fortunately pandas has a function `pd.MultiIndex.from_product` that will do this for us, 
and let us (optionally) name each of the dimensions by passing a `names=` argument.

In [19]:
import pandas as pd

index = pd.MultiIndex.from_product(
    list_of_keys, names=payload['id']
)
index

#But, where is the data? Where are the values?  
#you'll have to now tell python to pass this information (an index of stuff) into a dataframe

MultiIndex([('CLV05_MEUR', 'B1GQ', 'DE', '2010'),
            ('CLV05_MEUR', 'B1GQ', 'DE', '2011'),
            ('CLV05_MEUR', 'B1GQ', 'DE', '2012'),
            ('CLV05_MEUR', 'B1GQ', 'NL', '2010'),
            ('CLV05_MEUR', 'B1GQ', 'NL', '2011'),
            ('CLV05_MEUR', 'B1GQ', 'NL', '2012'),
            ('CLV05_MEUR',  'D21', 'DE', '2010'),
            ('CLV05_MEUR',  'D21', 'DE', '2011'),
            ('CLV05_MEUR',  'D21', 'DE', '2012'),
            ('CLV05_MEUR',  'D21', 'NL', '2010'),
            ('CLV05_MEUR',  'D21', 'NL', '2011'),
            ('CLV05_MEUR',  'D21', 'NL', '2012')],
           names=['unit', 'na_item', 'geo', 'time'])

Now our index is built we can pass in a list of values, index, and columns to `pd.DataFrame`.

In [25]:
# the values come back in the same order as the index combinations we just built
df = pd.DataFrame(list(payload['value'].values()), index=index, columns=['value'])
df
#voilà, data!

unit,na_item,geo,time,value
CLV05_MEUR,B1GQ,DE,2010,2426563.7
CLV05_MEUR,B1GQ,DE,2011,2521811.0
CLV05_MEUR,B1GQ,DE,2012,2532364.7
CLV05_MEUR,B1GQ,NL,2010,589946.6
CLV05_MEUR,B1GQ,NL,2011,599097.8
CLV05_MEUR,B1GQ,NL,2012,592925.0
CLV05_MEUR,D21,DE,2010,228301.5
CLV05_MEUR,D21,DE,2011,237271.4
CLV05_MEUR,D21,DE,2012,236319.0
CLV05_MEUR,D21,NL,2010,60631.6
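
If `pd.MultiIndex.from_product` feels like magic, the sketch below does the same pairing by hand with `itertools.product` from the standard library. It uses a trimmed-down copy of the payload shown above (same keys and values, nothing else). Looking up each value by its position with `.get()` is a cautious choice on my part: it assumes the position number is what links a value to its combination, and it returns `None` rather than crashing if a position is missing.

```python
from itertools import product

# A trimmed-down copy of the payload printed earlier in this lecture
payload = {
    "id": ["unit", "na_item", "geo", "time"],
    "size": [1, 2, 2, 3],
    "dimension": {
        "unit": {"category": {"index": {"CLV05_MEUR": 0}}},
        "na_item": {"category": {"index": {"B1GQ": 0, "D21": 1}}},
        "geo": {"category": {"index": {"DE": 0, "NL": 1}}},
        "time": {"category": {"index": {"2010": 0, "2011": 1, "2012": 2}}},
    },
    "value": {
        "0": 2426563.7, "1": 2521811.0, "2": 2532364.7,
        "3": 589946.6, "4": 599097.8, "5": 592925.0,
        "6": 228301.5, "7": 237271.4, "8": 236319.0,
        "9": 60631.6, "10": 59852.5, "11": 57924.1,
    },
}

# For each dimension, list its category codes in index order
keys_per_dimension = []
for name in payload["id"]:
    index = payload["dimension"][name]["category"]["index"]
    keys_per_dimension.append(sorted(index, key=index.get))

# product() enumerates every combination, with the last dimension changing fastest
rows = []
for position, combo in enumerate(product(*keys_per_dimension)):
    value = payload["value"].get(str(position))  # None if that position is absent
    rows.append(combo + (value,))

for row in rows[:4]:
    print(row)
```

Each tuple in `rows` is one line of the table: the four dimension codes followed by the value.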


## Real-world considerations for Requests

Things do not always go so nicely, particularly when using APIs at scale.

I'll quickly cover some other common considerations when using APIs, and outline how they can be solved.


### Authentication

Not all APIs are freely available. Sometimes (many times), they will ask that you sign up/register and get a key. Essentially, they will give you a password that allows you to access the data. Sometimes this is just to make sure that you aren't abusing the system: they may want to know how often you use the data, they may limit your access to certain information, or they may only allow certain people to access their server.

For a lot of government data, all you need to do is request an API key - a long string of letters and numbers. This key is personalized: anything you do with it, they will know is related to you. So, if you give it to other people, they might deactivate your key. Generally, don't share your key, or you might be violating their terms.

Authentication is specific to the website or owner. Everyone has their own way of asking for it, and you will have to figure out how to provide it correctly (read the specific API docs).

There are some common approaches:
- Put the key in the Authorization header, in lieu of a username and password. 
- Add the key onto the URL (http://example.com?api_key=my_secret_key) (I have mostly seen this in government data)
- Less common is to bury the key somewhere in the request body next to the data. 

Wherever the key goes, the effect is the same - it lets the server authenticate the client.

Often in python it looks like:

` api_key = 'asodifhafglkkhj'
r = requests.get(url, auth=(api_key, ''))
`
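
To make the first two approaches less abstract, the sketch below builds both by hand with the standard library, without sending anything. The key, the URL and the `api_key` parameter name are placeholders; every API names these differently. The header version shows HTTP Basic authentication, which is what passing `auth=(api_key, '')` to `requests.get` produces: the key stands in for the username and the password is left empty.

```python
import base64
from urllib.parse import urlencode

api_key = "my_secret_key"  # placeholder, not a real key

# Approach 1: key in the Authorization header (HTTP Basic authentication).
# "username:password" is base64-encoded; here the key is the username
# and the password is empty.
token = base64.b64encode(f"{api_key}:".encode()).decode()
headers = {"Authorization": f"Basic {token}"}
print(headers)

# Approach 2: key as a query parameter on the URL.
# The parameter name "api_key" is an assumption; check the API docs.
url_with_key = "http://example.com/data?" + urlencode({"api_key": api_key})
print(url_with_key)
```

Note that base64 is an encoding, not encryption: anyone who sees the header can decode the key, which is one more reason to keep keys out of notebooks you share.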


#### Rate limits

APIs can be costly to host and typically limit the number of requests that can be made (either by an IP or API key).
If you exceed this limit you'll get a `429` status code for any extra requests you make (and may be blocked if you continue making them).

It is therefore important to respect any rate limits given in an API's documentation (annoyingly some are very vague).
The simplest way to do this is to limit how many times the function that makes the request can be called within some time window, for example with the [ratelim](https://pypi.org/project/ratelim/) library, which does this using decorators (the `@` syntax, explained under Retries below).

There is a way around this - and I won't detail it here - but if you use VPNs in combination with respecting the rate limit, you've got an easy workaround. The ethics of this are murky at best, though.
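
To show the idea behind rate limiting without relying on any particular library, here is a minimal hand-written sketch. It is not how `ratelim` is implemented; `min_interval` and `fake_request` are names made up for this example, and the fake function stands in for a real call to an API.

```python
import time
from functools import wraps

def min_interval(seconds):
    """Decorator: make sure at least `seconds` pass between calls."""
    def decorator(func):
        last_call = None  # time of the previous call, None before the first

        @wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal last_call
            if last_call is not None:
                remaining = seconds - (time.monotonic() - last_call)
                if remaining > 0:
                    time.sleep(remaining)  # wait out the rest of the interval
            last_call = time.monotonic()
            return func(*args, **kwargs)

        return wrapper
    return decorator

@min_interval(0.05)  # at most one call every 0.05 seconds
def fake_request(i):
    # Stand-in for something like requests.get(url)
    return i

start = time.monotonic()
results = [fake_request(i) for i in range(3)]
elapsed = time.monotonic() - start

print(results)
print(f"3 calls took {elapsed:.2f} seconds")
```

The first call goes through immediately; the second and third each wait, so the loop cannot hammer the server no matter how fast your code runs.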

#### Retries

Sometimes you can do everything perfectly, and send off a request but something on the web-server (or elsewhere) can go wrong and give a bad status code.

We don't want to silently ignore these errors, nor let them crash our program by raising an exception.

Instead, we retry the request a few times, waiting a moment between attempts, and only give up if it keeps failing.

A convenient way to do this is through the [tenacity](https://github.com/jd/tenacity) library:

``` python
import requests
from tenacity import (retry, stop_after_attempt, wait_fixed,
                      retry_if_exception_type)

@retry(stop=stop_after_attempt(3), wait=wait_fixed(0.1),
      retry=retry_if_exception_type(requests.HTTPError))
def get(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # raise an error on a bad status
        return r
    except requests.HTTPError:
        print(r.status_code, r.reason)
        raise
```

It uses a Python decorator (the `@` symbol) to wrap our function with another function, `retry`, that calls it again if it raises an error.

We can tell it how many attempts to make before giving up, how long to wait between retries, which errors to retry on, etc.
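
If the decorator feels opaque, the sketch below spells out the same retry logic as a plain loop, using only the standard library. It is a simplified stand-in and not how tenacity itself is written; `call_with_retries` and `flaky` are names made up for this example, and `flaky` simulates a server that fails twice before answering.

```python
import time

def call_with_retries(func, attempts=3, wait=0.01, retry_on=(Exception,)):
    """Call func(); on a listed error, wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as err:
            print(f"attempt {attempt} failed: {err}")
            if attempt == attempts:
                raise  # out of attempts: let the error through
            time.sleep(wait)  # pause before the next try

calls = {"count": 0}

def flaky():
    # Simulated request: fails on the first two calls, then succeeds
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("server hiccup")
    return "ok"

result = call_with_retries(flaky, retry_on=(ConnectionError,))
print(result, "after", calls["count"], "calls")
```

Listing the errors to retry on matters: a temporary connection problem is worth retrying, while a typo in your URL will fail the same way every time.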




# OK, Let's Practice

In breakout groups you will work in teams to practice using APIs

Let's stick to APIs that don't ask for authentication (it can be time consuming to figure this out from the documentation)

[Open Notify](http://open-notify.org/) gives access to data about the International Space Station.

As a group:
1. Look at the documentation 
2. Make a GET request and find out:
    - How many people are in space
    - Where the ISS is right now 



# Useful Resources

### Standard URL query structure
https://en.wikipedia.org/wiki/Query_string

### Other Tutorials

https://www.dataquest.io/blog/python-api-tutorial/

https://realpython.com/python-requests/

https://www.dataquest.io/blog/last-fm-api-python/

### List of API's

Massive list [**here**](https://github.com/public-apis/public-apis)

### API wrapper libraries

Massive list [**here**](https://github.com/realpython/list-of-python-api-wrappers).

For example, geo-code location data with [`geopy`](https://github.com/geopy/geopy) :

In [None]:
!pip install geopy
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="ysi_tutorial")

print(geolocator.geocode("Leuvenlaan 4"))  # From address
print(geolocator.reverse("52.086779,5.1674726"))  # From co-ordinates

### Assorted API Snippets

In [3]:
# carbon intensity API
import requests
import pandas as pd

res = requests.get("https://api.carbonintensity.org.uk/intensity/date/2020-10-13") #get request
res_json = res.json() #parsing the information in json

#notice that it's a list called data that contains a bunch of dictionaries
res_json['data'][0]

{'from': '2020-10-12T22:00Z',
 'to': '2020-10-12T22:30Z',
 'intensity': {'forecast': 143, 'actual': 118, 'index': 'low'}}

In [4]:
#we can look at the first item in that list
res_json['data'][0]

{'from': '2020-10-12T22:00Z',
 'to': '2020-10-12T22:30Z',
 'intensity': {'forecast': 143, 'actual': 118, 'index': 'low'}}

In [5]:
#now, we see that within this item there is another dictionary called intensity
res_json['data'][0]['intensity']

{'forecast': 143, 'actual': 118, 'index': 'low'}

In [8]:
len(res_json['data'])

48

In [10]:
#That's the end of the rabbit hole
#now let's collect these dictionaries into a list so that we can export to pandas
carbon = []  # empty list
for item in res_json['data']:  # loop over every half-hour record
    carbon.append(item['intensity'])  # keep only the nested intensity dictionary
carbon_data = pd.DataFrame(carbon)
carbon_data

,forecast,actual,index
0,143,118,low
1,145,123,low
2,129,123,low
3,134,119,low
4,134,127,low
5,132,135,low
6,128,144,low
7,133,146,low
8,142,141,low
9,151,141,low


In [11]:
res_json['data'][0]

{'from': '2020-10-12T22:00Z',
 'to': '2020-10-12T22:30Z',
 'intensity': {'forecast': 143, 'actual': 118, 'index': 'low'}}

In [16]:
#but, what about the time stamps?  Let's get those too.
carbon = []
for item in res_json['data']:  # loop over every half-hour record
    carbon.append(item)
# drop the nested intensity dictionaries; we already have them in carbon_data
carbon_dates = pd.DataFrame(carbon).drop('intensity', axis=1)
carbon_dates

,from,to
0,2020-10-12T22:00Z,2020-10-12T22:30Z
1,2020-10-12T22:30Z,2020-10-12T23:00Z
2,2020-10-12T23:00Z,2020-10-12T23:30Z
3,2020-10-12T23:30Z,2020-10-13T00:00Z
4,2020-10-13T00:00Z,2020-10-13T00:30Z
5,2020-10-13T00:30Z,2020-10-13T01:00Z
6,2020-10-13T01:00Z,2020-10-13T01:30Z
7,2020-10-13T01:30Z,2020-10-13T02:00Z
8,2020-10-13T02:00Z,2020-10-13T02:30Z
9,2020-10-13T02:30Z,2020-10-13T03:00Z


In [17]:
#Let's bring it all together
pd.merge(carbon_dates,carbon_data, left_index=True, right_index=True, how='left')

#voilà, from json to dataframe
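
The two loops and the merge above can also be collapsed into a single step: flatten each record so the nested `intensity` dictionary sits next to `from` and `to`. The sketch below does this in plain Python on the first two records shown above, so it runs without calling the API; the resulting list of flat dictionaries could then be handed to `pd.DataFrame` just like before.

```python
# The first two records from the carbon intensity response shown above
records = [
    {"from": "2020-10-12T22:00Z", "to": "2020-10-12T22:30Z",
     "intensity": {"forecast": 143, "actual": 118, "index": "low"}},
    {"from": "2020-10-12T22:30Z", "to": "2020-10-12T23:00Z",
     "intensity": {"forecast": 145, "actual": 123, "index": "low"}},
]

flat = []
for record in records:
    row = {"from": record["from"], "to": record["to"]}
    row.update(record["intensity"])  # copy forecast, actual, index up one level
    flat.append(row)

for row in flat:
    print(row)
```

Each row is now a flat dictionary with five keys, which is exactly the shape a table wants: one key per column.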

In [None]:
## COMPANIES HOUSE (UK registrar of companies)
# Register for an api key here: https://developer.companieshouse.gov.uk/developer/signin

api_key = 'put your api key here'
url = 'https://api.companieshouse.gov.uk/search?q=consultio consultius'
r = requests.get(url, auth=(api_key, ''))
r.raise_for_status()

r.json()['items'][0]  # get the first item

In [None]:
## REDDIT - https://www.reddit.com/dev/api/#GET_subreddits_search
# Let's find some subreddits to learn python with!

url = 'https://www.reddit.com/subreddits/search.json?q="learn python"&limit=5'
r = requests.get(url, headers={'User-agent': 'your bot 0.1'})
r.raise_for_status()

[result['data']['display_name_prefixed']
 for result in r.json()['data']['children']]