# Intro to API usage with Python

### What is an API (James' summary) 

If you look up what an API (Application Programming Interface) does on the internet, you will be met with lots of technical jargon. That is because APIs are everywhere and are used for all kinds of things to make the engines of websites communicate with each other and with you, the user.

On the most basic level, to use an API you send a web *request* (like loading a URL) aligning with the rules of the API and you get a web *response* which contains your requested data. Easy!


<div>
<img src="attachment:image.png" width="250"/>
</div>

When you hear about APIs around C4ADS, you are most often hearing about a type of public API that data-dense sites advertise in order to make their data easily accessible. This might be because the site is friendly to researchers or because the site knows we could scrape the data from their user interface anyway. The site decides to give out the data **on their terms** so as to not clog up their servers with too many web requests. 

However, "on their terms" means that an API might limit you to X number of searches with a unique key, or only return 50 results per search, etc. These are some of the limitations that make every API different! 
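
To make those limits concrete, here is a small sketch of how you might work out how many requests a job needs. The 50-results-per-search figure comes from the example above. The function name `requests_needed` and the total of 1,234 results are made up for illustration.

```python
import math

# Hypothetical helper: how many API calls does it take to collect everything
# when the API only hands back a fixed number of results per call?
def requests_needed(total_results, results_per_request=50):
    return math.ceil(total_results / results_per_request)

# 1,234 results at 50 per request means 24 full pages plus one partial page
print(requests_needed(1234))  # 25
```

If the API also caps the number of searches per key, this number tells you up front whether the job fits inside that cap.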


In [1]:
# Load packages 

# requests is used for making web requests with python.
# If you don't have it yet, run `pip install requests` (or maybe `pip3 install requests`) in your CLI (command line interface)
import requests  
# pandas is used for structuring the data. It installs the same way: `pip install pandas`
import pandas as pd
# json and os ship with python, so they are already installed
import json
import os 

# Here I am making a directory called "data" in my current working directory. We will put our raw downloads there.
os.makedirs('data', exist_ok=True)


In [2]:
# This is a helper function to download the JSON data. 
# It takes a URL where we will define our API request and returns the structured JSON data package.
def pull_json(url):
    # response is a response object. It contains lots of information about what we asked the URL for, including the JSON data
    response = requests.get(url)
    # stop with an error if the server reports a failure (e.g. 404 or 500) instead of trying to parse an error page
    response.raise_for_status()
    data = response.json()
    return data

# This is a function that takes the data from the site and writes it to a file
def download_data(data, file_name):
    with open(file_name, 'w') as outfile:
        json.dump(data, outfile)

## Exploring the Colombia Contracts API

[The Colombia public procurement](https://www.colombiacompra.gov.co/transparencia/api) site is one such website that advertises an API. I will skip directly to a link where their API documentation exists. This is what we will need for our exploration:

#### [Documentation](https://apiocds.colombiacompra.gov.co:8443/apiCCE2.0/#!/releases/findByFilters)

For my example I will use the `/releases/page/{year}` function to search releases by year, as it seems to be the most straightforward. But as you can see, there are several other options.

Some reading shows me that there are two important parameters for this function, `year` and `page`. See the documentation for more details.

I am now ready to create my own API call. Luckily this site, like most user-friendly API interfaces, can build an API URL call for you. But the true beauty of an API is that you can make your own URLs programmatically and then handle the data you get back programmatically.

Example: to search the first page in 2018, my URL would be `https://apiocds.colombiacompra.gov.co:8443/apiCCE2.0/rest/releases/page/2018?page=1`. If you plug this into your browser search bar to test you'll get back the data as a big block of text.
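
As a minimal sketch of building that same URL in code, the snippet below fills in `year` and `page` for you. The names `BASE_URL` and `build_release_url` are my own and are not part of the API. The URL pattern is copied from the example above.

```python
from urllib.parse import urlencode

# Base of the endpoint, copied from the example URL above
BASE_URL = 'https://apiocds.colombiacompra.gov.co:8443/apiCCE2.0/rest/releases/page'

# Hypothetical helper: build the URL for one page of releases in a given year
def build_release_url(year, page):
    query = urlencode({'page': page})
    return '{}/{}?{}'.format(BASE_URL, year, query)

print(build_release_url(2018, 1))
```

Using `urlencode` for the part after the `?` is a small design choice. It keeps the URL valid if you later add parameters whose values contain spaces or special characters.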

Let's try to get 3 pages of 2018 and combine them into one dataset.

**Warning**: this might take 5-10 minutes because this API is really slow. They may be throttling against bulk collection, but it is hard to say for sure.
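
With a slow or flaky API, it can help to retry a failed request a few times before giving up. The sketch below is one possible way to do that and is not something this API's documentation prescribes. The helper name `with_retries` and its parameters are hypothetical. It wraps any function, so it can be tried out without touching the network.

```python
import time

# Hypothetical helper: call fetch() until it succeeds or we run out of attempts.
# The wait grows with each failed attempt so we don't hammer a struggling server.
def with_retries(fetch, attempts=3, wait_seconds=1.0):
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:
            last_error = error
            print("Attempt {} failed: {}".format(attempt, error))
            if attempt < attempts:
                time.sleep(wait_seconds * attempt)
    # every attempt failed, so surface the last error to the caller
    raise last_error

# Example with a stand-in function that fails twice and then succeeds
calls = {'count': 0}

def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError("server too slow")
    return {'releases': []}

result = with_retries(flaky_fetch, attempts=3, wait_seconds=0)
```

In a real pull you would pass a function that makes the actual request, for example `with_retries(lambda: requests.get(url).json())`.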

In [3]:
# Here we are going to first download the data in JSON format. 
# Theoretically you could structure the data on the fly. But it is much better to download first with an API this slow.

def pull_n_pages(n, year):
    for page in range(1,n+1):
        url ='https://apiocds.colombiacompra.gov.co:8443/apiCCE2.0/rest/releases/page/{}?page={}'.format(year, page)
        print("Pulling page {}".format(page))
        data = pull_json(url)
        file_name = 'data/page_{}.json'.format(page)
        download_data(data, file_name)
    print("Pull complete!")
    return None

# running the function above. See how I get to set the terms of my request. 
# I could of course do this programmatically and generate dozens of requests if I wanted. 
pull_n_pages(n = 3, year = 2018)

Pulling page 1
Pulling page 2
Pulling page 3
Pull complete!


If the above function worked, you should have N files in your data folder (3 in this example). The next step would be structuring those data files to your specific needs, perhaps as a CSV with the necessary columns.

Technically, our usage of the API is done, but hopefully this shows the potential of programmatically feeding in a list of desired inputs and controlling the destination of the outputs. 
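
As a rough sketch of that idea, the snippet below turns a list of years into a list of (URL, file name) pairs without downloading anything. The function name `build_jobs` and the `{year}_page_{page}.json` naming scheme are my own assumptions. Putting the year in the file name keeps pages from different years from overwriting each other.

```python
# Hypothetical helper: plan every request up front as (url, file_name) pairs
def build_jobs(years, pages_per_year, folder='data'):
    jobs = []
    for year in years:
        for page in range(1, pages_per_year + 1):
            url = 'https://apiocds.colombiacompra.gov.co:8443/apiCCE2.0/rest/releases/page/{}?page={}'.format(year, page)
            file_name = '{}/{}_page_{}.json'.format(folder, year, page)
            jobs.append((url, file_name))
    return jobs

# Two years, two pages each, gives four planned downloads
jobs = build_jobs([2017, 2018], pages_per_year=2)
for url, file_name in jobs:
    print(file_name, '<-', url)
```

Each pair could then be handed to a download step one at a time, so the inputs and the output locations are both decided by you.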


### [Bonus] Step 2: Parsing the data

The real value of using this API is that the data is now ready to be easily parsed to our desired specifications. To do this, you should be quite comfortable with python lists and dictionaries. Once loaded, a JSON file is just dictionaries and lists nested inside each other. 
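
As a warm-up, here is a tiny sketch of walking that kind of structure by hand. The sample text is invented for illustration. The `releases`, `contracts`, `items` and `id` keys mirror the ones used in the parsing code later in this section, while `description` is a made-up field.

```python
import json

# Invented sample shaped loosely like one downloaded page
raw_text = '''
{"releases": [
  {"id": "release-1",
   "contracts": [
     {"items": [{"description": "Laptops"}, {"description": "Printers"}]}
   ]}
]}
'''

# json.loads turns the text into python dictionaries and lists
data = json.loads(raw_text)

# dictionaries are indexed by key, lists by position
first_release = data['releases'][0]

# collect the description of every item in every contract of that release
item_descriptions = [
    item['description']
    for contract in first_release['contracts']
    for item in contract['items']
]

print(first_release['id'])
print(item_descriptions)
```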

There's no right or wrong way to parse the data. In this example I am going to get all the contracted items of each release in one data frame. You can see by looking at the raw .json files that there is *lots* of information given to us, with many potential areas to explore. It would be difficult to put all of this information in one dataframe. 

I am going to lean heavily on a pandas function called [pd.json_normalize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html), which can be a little tough to understand. Really, the best way to learn it is to disregard the section below and play with the data yourself. 
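
To get a feel for `pd.json_normalize` before using it on the real files, here is a small sketch on invented data. The field names and values are made up. The idea is that `record_path` says which nested list becomes the rows, and `meta` says which fields from the parent levels get copied onto each row.

```python
import pandas as pd

# Invented sample: two releases, each with one contract holding a list of items
sample = [
    {'id': 'release-1',
     'contracts': [
         {'value': {'amount': 100},
          'items': [{'description': 'Laptops'}, {'description': 'Printers'}]},
     ]},
    {'id': 'release-2',
     'contracts': [
         {'value': {'amount': 50},
          'items': [{'description': 'Paper'}]},
     ]},
]

# One row per item, with the release id and contract amount repeated on each row
items = pd.json_normalize(
    sample,
    record_path=['contracts', 'items'],
    meta=['id', ['contracts', 'value', 'amount']],
)

print(items)
```

The result has three rows (one per item) and the columns `description`, `id` and `contracts.value.amount`.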


In [21]:
# helper function that reads the file at the given path and returns a JSON-esque object made of dictionaries and lists.
def read_data(path):
    with open(path, 'r') as infile:
        data = json.loads(infile.read())
    return data

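# record_path walks down releases -> contracts -> items, giving one row per item.
# meta copies fields from the parent release and contract onto each item row.
# Read about this function in the documentation for more.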
def parse_json(json_data):
    output = pd.json_normalize(json_data, record_path = ['releases', 'contracts', 'items'], 
                  meta = [
                      ['releases', 'date'],
                      ['releases', 'id'],
                      ['releases','contracts', 'value', 'amount'],
                      ['releases','contracts','value', 'currency']
                  ])
    return output

# Let's combine it all in one data frame! 
def combine_all_files():
    # List the JSON files only, which also skips .DS_Store *Shakes fist at .DS_Store*
    files = ['data/{}'.format(file) for file in sorted(os.listdir('data')) if file.endswith('.json')]
    # loop over files and collect the parsed data from each one
    parsed_frames = []
    for file in files:
        new_data = read_data(file)
        parsed_frames.append(parse_json(new_data))
    # DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
    master_df = pd.concat(parsed_frames, ignore_index=True)
    # output data to csv 
    master_df.to_csv('final_output.csv', index = False)
    return None

# Do the deed here. It will deposit final_output.csv in your working directory
combine_all_files()
    