# Putting It Together

This program demonstrates how to programmatically retrieve data from the [Library of Congress API](https://libraryofcongress.github.io/data-exploration/index.html) and load that data into [Pandas](https://pandas.pydata.org/) for processing and analysis. It uses the [Requests]() Python library to make HTTP requests for the JSON representations of the [Selected Digitized Books](https://www.loc.gov/collections/selected-digitized-books/) collection. Using the API's pagination features, the program grabs the first 5 pages of the collection with 50 items each, for a total of 250 items. The program saves these index JSON files to disk and then iterates through each result set, downloading the JSON representation of all 250 items and saving each one to disk. Finally, some data is extracted from the item-level JSON and put into a Pandas DataFrame for analysis.

## Load Libraries

In [None]:
import requests
import json
from pathlib import Path

from time import sleep

import pandas as pd
%matplotlib inline

## Set Parameters

In [None]:

# Directory for saving files
DATA_DIR = "json-data/"
Path(DATA_DIR).mkdir(parents=True, exist_ok=True)

# Depth parameter
PAGE_LIMIT = 5


# HTTP Parameters
BASE_URL = "https://loc.gov"
ENDPOINT = "/collections/selected-digitized-books"
FORMAT = "json"
RESULTS_PER_PAGE = 50

## Fetch Collection Index

* This section fetches the JSON representation of the collection and saves it to disk
* It also shows how to do pagination by looping through a range from 1 to `PAGE_LIMIT`
* Each iteration builds a URL for the page based on the parameters set above and the current `page_num`
* Each index file is saved to disk in `DATA_DIR`

In [None]:
# Fetch the first n index pages with a loop for each page
for page_num in range(1,PAGE_LIMIT+1):
    
    # build a URL of the index page
    URL = BASE_URL + ENDPOINT + "/?fo={FORMAT}&c={RESULTS}&sp={PAGE}".format(FORMAT=FORMAT,
                                                                             RESULTS=RESULTS_PER_PAGE,
                                                                             PAGE=page_num)
    
    # Fetch and parse the index JSON                                                                     
    print("Fetching", URL)
    response = requests.get(URL)
    collection_index = response.json()

    # Save the json to disk
    file_name = DATA_DIR + "index_" + str(page_num) + ".json"
    with open(file_name, 'w') as f:
        json.dump(collection_index, f)
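
The query string can also be assembled with the standard library's `urllib.parse.urlencode`, which keeps the parameter names in one place. This is a minimal sketch, not part of the original program: `build_index_url` is a hypothetical helper name, and the constants simply repeat the values from the parameters cell so the block runs on its own.

```python
from urllib.parse import urlencode

# Same values as the parameters cell
BASE_URL = "https://loc.gov"
ENDPOINT = "/collections/selected-digitized-books"
FORMAT = "json"
RESULTS_PER_PAGE = 50


def build_index_url(page_num):
    """Build the URL for one index page (hypothetical helper)."""
    # fo = response format, c = results per page, sp = page number
    query = urlencode({"fo": FORMAT, "c": RESULTS_PER_PAGE, "sp": page_num})
    return BASE_URL + ENDPOINT + "/?" + query


print(build_index_url(1))
```

Keeping the URL logic in a function makes it easy to check a single page by hand before running the whole loop.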

## Fetching Individual Items

* Open the saved index files and extract just the results JSON
* Using the results we can go fetch the actual item-level JSON data

In [None]:
# Use a glob pattern to load the index files
collection_indexes = list(Path(DATA_DIR).glob("index_*.json"))
collection_indexes

* We can use the `read_text()` method from Pathlib to very quickly suck the JSON data into memory
* This is a shortcut around `with open(index,'r') as f:`

In [None]:
# read the files into the JSON parser
results_pile = [json.loads(index.read_text())['results'] for index in collection_indexes]
# Check the length
len(results_pile)

In [None]:
len(results_pile[0])

* This isn't actually what we need; we want one list, not a list of lists
* A simple list comprehension over the files gives us a list of lists
* So instead we loop over the files and *extend* a single list with each file's results

In [None]:
# Create a list to hold the result data
results_pile = []
# loop over each index file
for file in collection_indexes:
    # use pathlib to quickly read the file and parse as JSON
    index = json.loads(file.read_text())
    # append all results to the results_pile list
    results_pile.extend(index['results'])
        
# Check the length to make sure everything is loaded
len(results_pile)
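
As an alternative to the explicit loop, a comprehension with two `for` clauses, or `itertools.chain.from_iterable`, also flattens the pages into one list. The sketch below uses small made-up pages in place of the saved index files, so the data shown is illustrative only.

```python
from itertools import chain

# Made-up stand-ins for the parsed index files (two pages of results)
pages = [
    {"results": [{"id": "a"}, {"id": "b"}]},
    {"results": [{"id": "c"}]},
]

# Two for-clauses: outer loop over pages, inner loop over each page's results
flat_nested = [result for page in pages for result in page["results"]]

# Equivalent flattening with itertools
flat_chain = list(chain.from_iterable(page["results"] for page in pages))

print(len(flat_nested), flat_nested == flat_chain)
```

The loop with `extend` is arguably easier to read for newcomers; the one-liners are handy once the pattern is familiar.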

* This loop iterates over every item, builds a URL from the item ID, and fetches the JSON representation
* It saves the raw JSON to disk as a backup
* It also sleeps for one second between requests to avoid rate limiting; with 250 items this loop should take at least 250 seconds, or a little over 4 minutes

In [None]:
# A loop for retrieving the item-level JSON
for counter, result in enumerate(results_pile):
    
    if counter % 50 == 0:
        print("Fetching item ", counter)
    
    url = result['id']
    # add JSON format request
    url = url + "?fo=json"
    # fetch the url
    response = requests.get(url)
    
    # try parsing the response data and catch exceptions
    try:
        # parse json
        item_json = response.json()
    except ValueError:
        # if the body is not valid JSON, display why and stop looping
        print(response.status_code)
        print(response.headers)
        break
        
    # get the lccn so we can make a unique filename
    lccn = item_json['item']["library_of_congress_control_number"]
    # generate a filename and write json to disk
    filename = DATA_DIR + "item_" + lccn + ".json"
    with open(filename, "w") as f:
        json.dump(item_json, f)
    
    # Sleep for one second
    sleep(1)
    
# Display a finished message
# counter is zero-based, so add 1 to report the item count when the loop runs to completion
print("Download Complete. Fetched {} items.".format(counter + 1))
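
One way to make the bookkeeping explicit is to wrap the download in a function that collects successes and failures separately. The following is a sketch under stated assumptions, not the program's own method: `fetch_items`, `fake_fetch`, and the `example.org` URLs are hypothetical, and unlike the loop above this version records a failed URL and continues instead of stopping.

```python
from time import sleep


def fetch_items(results, fetch, delay=1.0, pause=sleep):
    """Fetch item JSON for each result; return (items, failed_urls).

    `fetch` is a caller-supplied function that takes a URL and returns
    parsed JSON, raising ValueError when the body is not valid JSON.
    """
    items, failed = [], []
    for result in results:
        url = result["id"] + "?fo=json"
        try:
            items.append(fetch(url))
        except ValueError:
            # record the failure and keep going instead of stopping
            failed.append(url)
        # wait between requests to stay polite to the server
        pause(delay)
    return items, failed


# Demo with a fake fetcher so the sketch runs without network access
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("not JSON")
    return {"item": {"id": url}}


sample = [{"id": "https://example.org/good/"}, {"id": "https://example.org/bad/"}]
items, failed = fetch_items(sample, fake_fetch, delay=0)
print("Fetched {} items, {} failed.".format(len(items), len(failed)))
```

With Requests, `fetch` could be something like `lambda url: requests.get(url).json()`. Counting `len(items)` avoids relying on the loop index for the final message.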

## Load Data Files

* Now that we have saved the JSON files to disk we can open them from disk
    * This is much faster than processing them from the web

In [None]:
# Make a master list of all the item json files
item_files = list(Path(DATA_DIR).glob("item_*.json"))
len(item_files)

In [None]:
# Create a function that opens and parses the json files
def open_item(path):
    with open(path, 'r') as f:
        return json.load(f)
    
# use a list comprehension to open all the JSON files
item_pile = [open_item(path) for path in item_files]
# how many files did we open
len(item_pile)

In [None]:
# make a list of fields we want to keep
keys_of_interest = [
    "library_of_congress_control_number",
    "date",
    "title",
    "medium",
    "created_published",
    "id"]

# Make a function that filters out just the fields we want
def get_fields(item):
    # use a dictionary comprehension to filter out just the fields we want
    return {key : item['item'][key] for key in keys_of_interest}
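
`get_fields` raises a `KeyError` if an item lacks any of the listed keys. Whether that happens in this collection is not established here; if it does, a more tolerant variant could use `dict.get` so missing fields come back as `None`. This is a minimal sketch: `get_fields_safe` is a hypothetical name and the sample item is made up.

```python
keys_of_interest = [
    "library_of_congress_control_number",
    "date",
    "title",
    "medium",
    "created_published",
    "id",
]


def get_fields_safe(item):
    """Like get_fields, but missing keys become None (hypothetical variant)."""
    record = item.get("item", {})
    return {key: record.get(key) for key in keys_of_interest}


# Made-up item with only two of the six fields present
sample_item = {"item": {"title": "A made-up title", "date": "1901"}}
print(get_fields_safe(sample_item))
```

Pandas displays `None` entries as missing values, so the resulting DataFrame still has one row per item.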

In [None]:
# Put the data into a dataframe
# use a list comprehension with our filter function
data = pd.DataFrame([get_fields(item) for item in item_pile])
# display the first 100 items
data.head(100)

* Now we are going to do some fancy stuff
* The `medium` column has some interesting data, but it is currently an array of string values
    * Since that is how the data came from the JSON response

In [None]:
# Look at just the medium column
data['medium']

In [None]:
# Grab the first item to get the string values
data['medium'].str.get(0)

In [None]:
# split the string on spaces
data['medium'].str.get(0).str.split()

In [None]:
# get the first item in the new list
data['medium'].str.get(0).str.split().str.get(0)

* Now we have some numbers, but they are currently strings
* We also have some bad data here as well, so we will have to ignore those

In [None]:
# convert the strings into numbers
pd.to_numeric(data['medium'].str.get(0).str.split().str.get(0))

* Some of these won't convert because the data is messy, so let's just ignore those

In [None]:
# convert the strings into numbers and coerce the ones that fail to NaN
book_lengths = pd.to_numeric(data['medium'].str.get(0).str.split().str.get(0),
                            errors='coerce')
book_lengths
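
To see what the chained `.str` calls do to a single value, the same steps can be written out in plain Python. This is an approximation for illustration only: `first_token_as_number` is a hypothetical helper, the example strings are made up, and `float()` does not parse every edge case the same way `pd.to_numeric` does.

```python
import math


def first_token_as_number(medium):
    """Mimic the chained pandas steps for a single `medium` value."""
    try:
        first_string = medium[0]               # .str.get(0)
        first_token = first_string.split()[0]  # .str.split().str.get(0)
        return float(first_token)              # pd.to_numeric(...)
    except (TypeError, IndexError, ValueError):
        # missing, empty, or non-numeric values become NaN, like errors='coerce'
        return math.nan


# Made-up examples of the kinds of strings a `medium` field might hold
print(first_token_as_number(["188 p. ;"]))
print(first_token_as_number(["xii, 200 p."]))
print(first_token_as_number(None))
```

Only values whose first whitespace-separated token is a number survive; everything else becomes NaN and is skipped by `describe()` and `hist()`.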

* Now that we have some reasonably clean data in a tabular format we can start doing analysis

In [None]:
# Compute summary statistics about book lengths
book_lengths.describe()

In [None]:
# Generate a histogram of the book lengths
book_lengths.hist(bins=100, figsize=(10,6));