# Library Usage in Seattle, 2005-2020

## API Calls

This notebook uses functions from the [api_caller.py](functions/api_caller.py) file and serves as a framework for querying the API for data within a specific date range.

I originally downloaded the data on December 15, 2020, so this notebook walks through collecting the remainder of 2020 (December 15 through December 31).

In [1]:
# standard dataframe libraries
import pandas as pd
pd.set_option('display.max_columns', 50)
import numpy as np

# api libraries
from sodapy import Socrata
import json

# custom functions
from functions.data_cleaning import *
from functions.api_caller import *

# reload functions/libraries when edited
%load_ext autoreload
%autoreload 2

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# parse api credentials
file_path = '/Users/p.szymo/Documents/code_world/projects/library_usage_seattle/data/api_keys.json'

with open(file_path, 'r') as json_file:
    api_dict = json.load(json_file)
    
api_token = api_dict['api_token']

In [3]:
# define several variables for api function

# data-specific url code
url_addon_code = '5src-czff'

# personal api token (`api_token`) was already loaded in the previous cell

# name of date column
date_column = 'checkoutdatetime'

# date to start collecting data
begin_date = '2020-12-15'

# date to stop collecting data (non-inclusive)
end_date = '2021-01-01'

In [4]:
%%time

# call api
results_df = api_date_caller(
    url_addon_code,
    api_token,
    date_column,
    begin_date,
    end_date,
)

# check shape
results_df.shape

CPU times: user 1.01 s, sys: 160 ms, total: 1.17 s
Wall time: 3.83 s


(77882, 10)
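
For readers without access to `api_caller.py`, here is a minimal sketch of how a date-range call like the one above could be structured. It is not the author's `api_date_caller`: the names `build_date_where` and `fetch_date_range`, the page size, and the injected `get_page` callable are all assumptions made for illustration. With sodapy, `get_page` could wrap `Socrata.get`, which accepts `where`, `order`, `limit`, and `offset` keyword arguments.

```python
from typing import Callable, Dict, List


def build_date_where(date_column: str, begin_date: str, end_date: str) -> str:
    """Build a SoQL filter: begin_date inclusive, end_date exclusive."""
    return f"{date_column} >= '{begin_date}' AND {date_column} < '{end_date}'"


def fetch_date_range(
    get_page: Callable[..., List[Dict]],
    date_column: str,
    begin_date: str,
    end_date: str,
    page_size: int = 50_000,  # hypothetical page size
) -> List[Dict]:
    """Collect every row in the date range, one page at a time."""
    where = build_date_where(date_column, begin_date, end_date)
    rows: List[Dict] = []
    offset = 0
    while True:
        page = get_page(where=where, order=date_column, limit=page_size, offset=offset)
        rows.extend(page)
        if len(page) < page_size:  # a short page means no rows remain
            return rows
        offset += page_size


# Possible wiring with sodapy (assumed, not taken from api_caller.py):
#   client = Socrata('data.seattle.gov', api_token)
#   get_page = lambda **kw: client.get(url_addon_code, **kw)
#   records = fetch_date_range(get_page, date_column, begin_date, end_date)
#   results_df = pd.DataFrame.from_records(records)
```

Passing the page-fetching function in as an argument keeps the paging logic testable without network access or credentials.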

In [5]:
# take a look
results_df.head()

|   | id | checkoutyear | bibnumber | itembarcode | itemtype | collection | callnumber | itemtitle | subjects | checkoutdatetime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202012150923000010099923236 | 2020 | 3486549 | 10099923236 | acbk | cafic | FIC GHOSH 2019 | Gun Island | Booksellers and bookselling Fiction, Self real... | 2020-12-15T09:23:00.000 |
| 1 | 202012150923000010088730089 | 2020 | 2163686 | 10088730089 | jcbk | ncfic | J HUNTER | Into the wild | Cats Juvenile fiction, Feral cats Juvenile fic... | 2020-12-15T09:23:00.000 |
| 2 | 202012150925000010090618306 | 2020 | 2800147 | 10090618306 | acbk | cacomic | 741.5973 W678F17 2012 | Fables 17 Inherit the wind | Comic books strips etc United States, Comic bo... | 2020-12-15T09:25:00.000 |
| 3 | 202012150925000010101360443 | 2020 | 3149052 | 10101360443 | acbk | cacomic | 741.5973 M8344N 2016 | Nameless | Adventure and adventurers Comic books strips e... | 2020-12-15T09:25:00.000 |
| 4 | 202012150925000010090494013 | 2020 | 2698178 | 10090494013 | acbk | cacomic | 741.5973 W678F15 2011 | Fables 15 Rose Red | Comic books strips etc, Fairy tales Comic book... | 2020-12-15T09:25:00.000 |


In [6]:
# columns to subset on (to match work in 01_data_cleaning.ipynb notebook)
cols = ['collection', 'itemtitle', 'subjects', 'checkoutdatetime']

# rename columns (to match work in 01_data_cleaning.ipynb notebook)
new_col_names = ['collection', 'title', 'subjects', 'date']

In [7]:
# clean and merge data from data dictionary
results_transformed = data_transformer(
    results_df,
    'data/data_dictionary.csv',
    usecols=cols,
    rename=new_col_names,
    dt_format='%Y-%m-%dT%H:%M:%S.%f'
)

# check shape
results_transformed.shape

(77882, 7)
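
The output above has seven columns: the renamed `title`, `subjects`, and `date`, plus four label columns that come from the data dictionary. As a rough illustration of that kind of transformation (not the author's `data_transformer`), the sketch below subsets and renames columns, parses the timestamp down to a calendar date, and left-joins a lookup table on the collection code. The function name `transform_sketch`, the dictionary key name `code`, and the toy dictionary rows are all assumptions.

```python
import pandas as pd


def transform_sketch(df, dictionary, usecols, rename, dt_format):
    """Subset, rename, parse dates, and attach data-dictionary labels."""
    out = df[usecols].copy()
    out.columns = rename
    # keep only the calendar date, as in the notebook's `date` column
    out['date'] = pd.to_datetime(out['date'], format=dt_format).dt.date
    # 'code' is an assumed name for the dictionary's lookup key
    out = out.merge(dictionary, how='left', left_on='collection', right_on='code')
    return out.drop(columns=['collection', 'code'])


# toy inputs; the dictionary rows are illustrative, not the real data dictionary
raw = pd.DataFrame({
    'collection': ['cafic', 'ncfic'],
    'itemtitle': ['Gun Island', 'Into the wild'],
    'subjects': ['Fiction', 'Juvenile fiction'],
    'checkoutdatetime': ['2020-12-15T09:23:00.000', '2020-12-15T09:23:00.000'],
})
dictionary = pd.DataFrame({
    'code': ['cafic', 'ncfic'],
    'format_group': ['Print', 'Print'],
    'age_group': ['Adult', 'Juvenile'],
})

result = transform_sketch(
    raw,
    dictionary,
    usecols=['collection', 'itemtitle', 'subjects', 'checkoutdatetime'],
    rename=['collection', 'title', 'subjects', 'date'],
    dt_format='%Y-%m-%dT%H:%M:%S.%f',
)
print(result.columns.tolist())
```

A left join keeps every checkout row even when a collection code is absent from the lookup table, which matches the unchanged row count (77882) before and after the transformation.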

In [8]:
# confirm dates
results_transformed.date.unique()

array([datetime.date(2020, 12, 15), datetime.date(2020, 12, 16),
       datetime.date(2020, 12, 17), datetime.date(2020, 12, 18),
       datetime.date(2020, 12, 19), datetime.date(2020, 12, 20),
       datetime.date(2020, 12, 21), datetime.date(2020, 12, 22),
       datetime.date(2020, 12, 23), datetime.date(2020, 12, 26),
       datetime.date(2020, 12, 27), datetime.date(2020, 12, 28),
       datetime.date(2020, 12, 29), datetime.date(2020, 12, 30),
       datetime.date(2020, 12, 31)], dtype=object)
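
The array above skips two calendar days. A small standard-library sketch can list every day in the requested range that has no checkouts. The helper name `missing_dates` is hypothetical; in the notebook, `observed` would come from `results_transformed.date.unique()`.

```python
from datetime import date, timedelta


def missing_dates(observed, begin, end):
    """Return days in [begin, end) that do not appear in `observed`."""
    seen = set(observed)
    n_days = (end - begin).days
    all_days = (begin + timedelta(days=i) for i in range(n_days))
    return [d for d in all_days if d not in seen]


# the dates printed above: December 15-23 and December 26-31, 2020
observed = [date(2020, 12, d) for d in list(range(15, 24)) + list(range(26, 32))]

gaps = missing_dates(observed, date(2020, 12, 15), date(2021, 1, 1))
print(gaps)  # [datetime.date(2020, 12, 24), datetime.date(2020, 12, 25)]
```

December 24 and 25 return no rows, so it is worth confirming against the source whether those days truly had no checkouts before treating them as zeros in a time series.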

In [10]:
# load final part of big dataset
df_final_part = pd.read_pickle('data/seattle_lib_11.pkl', compression='gzip')

# check shape
df_final_part.shape

(6503843, 7)

In [11]:
# combine with new results
df = pd.concat([df_final_part, results_transformed], ignore_index=True)

# check shape
df.shape

(6581725, 7)
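
The combined row count is consistent: 6,503,843 + 77,882 = 6,581,725. Because the save cell further down writes back to the same file that was just loaded, running this notebook twice would append the new rows twice. One possible guard, sketched here with a hypothetical helper `append_once` and toy data, is to drop any existing rows dated on or after the earliest new date before concatenating. This assumes the new batch fully covers every day from its earliest date onward.

```python
import datetime

import pandas as pd


def append_once(existing, new, date_col='date'):
    """Append `new` to `existing`, first dropping rows in `existing`
    dated on or after the earliest date in `new`."""
    cutoff = new[date_col].min()
    kept = existing[existing[date_col] < cutoff]
    return pd.concat([kept, new], ignore_index=True)


# toy frames standing in for df_final_part and results_transformed
existing = pd.DataFrame({
    'date': [datetime.date(2020, 12, 13), datetime.date(2020, 12, 14)],
    'title': ['a', 'b'],
})
new = pd.DataFrame({
    'date': [datetime.date(2020, 12, 15), datetime.date(2020, 12, 16)],
    'title': ['c', 'd'],
})

first = append_once(existing, new)
second = append_once(first, new)  # re-running does not duplicate rows
print(len(first), len(second))  # 4 4
```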

In [13]:
# take a look
df.tail()

|   | title | subjects | date | format_group | format_subgroup | category_group | age_group |
|---|---|---|---|---|---|---|---|
| 6581720 | ILLM Italian diary |  | 2020-12-23 | Other |  | Interlibrary Loan | Adult |
| 6581721 | ILLM Italian diary |  | 2020-12-29 | Other |  | Interlibrary Loan | Adult |
| 6581722 | ILLM Toronto eats 100 signature recipes from t... |  | 2020-12-30 | Other |  | Interlibrary Loan | Adult |
| 6581723 | ILLM Mad about the house how to decorate your ... |  | 2020-12-31 | Other |  | Interlibrary Loan | Adult |
| 6581724 | ILLM Borderline narcissistic and schizoid adap... |  | 2020-12-31 | Other |  | Interlibrary Loan | Adult |


In [14]:
# # uncomment to save
# df.to_pickle('data/seattle_lib_11.pkl', compression='gzip')

### Count checkouts

Run the code below to build the daily time series of checkout counts and append it to the original counts data (which covers everything through December 14, 2020).

In [11]:
# columns to dummy
dummy_cols = ['format_group', 'format_subgroup', 'category_group', 'age_group']

# dummy the columns
dummy_df = pd.get_dummies(results_transformed[dummy_cols], prefix=dummy_cols)

# `1` if title is missing, `0` if not
results_transformed['missing_title'] = np.where(results_transformed.title.isna(), 1, 0)

# `1` if subjects is missing, `0` if not
results_transformed['missing_subjects'] = np.where(results_transformed.subjects.isna(), 1, 0)

# combine with date column
df_counts = pd.concat([results_transformed[['date', 'missing_title', 'missing_subjects']], dummy_df], axis=1)

# group by date and get category total for each column
df_counts = df_counts.groupby('date').agg('sum')

# combine with total checkouts per day
df_counts = pd.concat([results_transformed.groupby('date').size(), df_counts], axis=1)

# rename target column
df_counts.columns = ['total_checkouts'] + list(df_counts.columns[1:])
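
To make the counting steps concrete, here is the same pattern on a three-row toy frame: one-hot encode a categorical column, flag missing titles, sum per day, and put the daily row count first. The toy data and the names `toy`, `flags`, and `daily` are made up for illustration; the notebook itself applies this to `results_transformed` with four categorical columns.

```python
import pandas as pd

# three illustrative checkouts over two days
toy = pd.DataFrame({
    'date': ['2020-12-15', '2020-12-15', '2020-12-16'],
    'title': ['Gun Island', None, 'Nameless'],
    'age_group': ['Adult', 'Juvenile', 'Adult'],
})

# one indicator column per category value
dummies = pd.get_dummies(toy[['age_group']], prefix=['age_group'])

# 1 if the title is missing, 0 otherwise
flags = toy[['date']].assign(missing_title=toy['title'].isna().astype(int))

# summing the indicators per day yields per-category daily counts
daily = pd.concat([flags, dummies], axis=1).groupby('date').sum()

# the number of rows per day is the total checkout count
daily.insert(0, 'total_checkouts', toy.groupby('date').size())
print(daily)
```

Summing indicator columns within each date group is what turns row-level labels into the per-category daily totals used later as time series.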

In [12]:
df_counts.head()

date,total_checkouts,missing_title,missing_subjects,format_group_Equipment,format_group_Media,format_group_Other,format_group_Print,format_subgroup_Audio Disc,format_subgroup_Audiobook Disc,format_subgroup_Book,format_subgroup_Document,format_subgroup_Folder,format_subgroup_Kit,format_subgroup_Music Score,format_subgroup_Periodical,format_subgroup_Video Disc,category_group_Fiction,category_group_Interlibrary Loan,category_group_Language,category_group_Nonfiction,category_group_Other,age_group_Adult,age_group_Juvenile,age_group_Teen
2020-12-15,6871,0,54,12.0,2002.0,22.0,4835.0,405.0,90.0,4817.0,0.0,0.0,22.0,18.0,0.0,1497.0,4061.0,22.0,195.0,2536.0,57.0,4635.0,2022.0,214.0
2020-12-16,5066,0,22,5.0,1087.0,3.0,3971.0,161.0,62.0,3967.0,0.0,1.0,13.0,3.0,0.0,856.0,3355.0,2.0,171.0,1491.0,47.0,2799.0,2079.0,188.0
2020-12-17,6025,0,31,12.0,1864.0,3.0,4146.0,340.0,92.0,4138.0,2.0,0.0,17.0,1.0,0.0,1427.0,3692.0,3.0,147.0,2133.0,48.0,4132.0,1676.0,217.0
2020-12-18,5901,0,29,7.0,1529.0,6.0,4359.0,350.0,81.0,4354.0,0.0,0.0,9.0,5.0,0.0,1096.0,3716.0,6.0,192.0,1951.0,36.0,3519.0,2213.0,169.0
2020-12-19,6986,0,34,16.0,2150.0,10.0,4810.0,434.0,70.0,4788.0,1.0,0.0,21.0,4.0,0.0,1641.0,3961.0,6.0,203.0,2772.0,43.0,5077.0,1715.0,194.0


In [13]:
# load item counts data
df_counts_prior = pd.read_pickle('data/seattle_lib_counts.pkl', compression='gzip')

df_counts = pd.concat([df_counts_prior, df_counts])

In [14]:
df_counts.head()

date,total_checkouts,missing_title,missing_subjects,format_group_Equipment,format_group_Media,format_group_Other,format_group_Print,format_subgroup_Art,format_subgroup_Audio Disc,format_subgroup_Audio Tape,format_subgroup_Audiobook Disc,format_subgroup_Audiobook Tape,format_subgroup_Book,format_subgroup_Data Disc,format_subgroup_Document,format_subgroup_Film,format_subgroup_Folder,format_subgroup_Kit,format_subgroup_Music Score,format_subgroup_Periodical,format_subgroup_Video Disc,format_subgroup_Video Tape,category_group_Fiction,category_group_Interlibrary Loan,category_group_Language,category_group_Nonfiction,category_group_Other,category_group_Reference,age_group_Adult,age_group_Juvenile,age_group_Teen
2005-04-13,16471,212,664,1.0,6397.0,32.0,10041.0,0.0,1874.0,63.0,217.0,308.0,9970.0,10.0,0.0,0.0,8.0,97.0,40.0,0.0,1950.0,1878.0,8189.0,32.0,370.0,6719.0,1143.0,18.0,11257.0,4613.0,601.0
2005-04-14,10358,123,541,1.0,4015.0,75.0,6267.0,0.0,1245.0,31.0,164.0,156.0,6225.0,7.0,0.0,0.0,8.0,85.0,28.0,0.0,1212.0,1115.0,5276.0,73.0,272.0,4104.0,621.0,12.0,6726.0,3381.0,251.0
2005-04-15,12896,179,508,0.0,5351.0,51.0,7494.0,0.0,1462.0,54.0,187.0,239.0,7452.0,12.0,0.0,0.0,4.0,80.0,35.0,0.0,1596.0,1721.0,6357.0,50.0,302.0,5166.0,1014.0,7.0,8795.0,3747.0,354.0
2005-04-16,1358,7,56,0.0,552.0,0.0,806.0,0.0,175.0,8.0,31.0,23.0,802.0,1.0,0.0,0.0,2.0,9.0,1.0,0.0,142.0,163.0,567.0,0.0,29.0,666.0,95.0,1.0,950.0,367.0,41.0
2005-04-17,4555,80,232,0.0,1555.0,8.0,2992.0,0.0,499.0,10.0,47.0,96.0,2946.0,9.0,0.0,0.0,7.0,19.0,27.0,0.0,395.0,480.0,2017.0,8.0,177.0,2145.0,203.0,5.0,3035.0,1349.0,171.0


In [15]:
df_counts.tail()

date,total_checkouts,missing_title,missing_subjects,format_group_Equipment,format_group_Media,format_group_Other,format_group_Print,format_subgroup_Art,format_subgroup_Audio Disc,format_subgroup_Audio Tape,format_subgroup_Audiobook Disc,format_subgroup_Audiobook Tape,format_subgroup_Book,format_subgroup_Data Disc,format_subgroup_Document,format_subgroup_Film,format_subgroup_Folder,format_subgroup_Kit,format_subgroup_Music Score,format_subgroup_Periodical,format_subgroup_Video Disc,format_subgroup_Video Tape,category_group_Fiction,category_group_Interlibrary Loan,category_group_Language,category_group_Nonfiction,category_group_Other,category_group_Reference,age_group_Adult,age_group_Juvenile,age_group_Teen
2020-12-27,3471,0,15,3.0,814.0,5.0,2649.0,,167.0,,33.0,,2634.0,,0.0,,0.0,13.0,14.0,0.0,604.0,,2092.0,4.0,79.0,1252.0,44.0,,2139.0,1204.0,128.0
2020-12-28,1445,0,6,3.0,424.0,3.0,1015.0,,52.0,,28.0,,1014.0,,0.0,,0.0,4.0,1.0,0.0,343.0,,931.0,1.0,32.0,471.0,10.0,,985.0,417.0,43.0
2020-12-29,7260,0,38,16.0,2109.0,14.0,5121.0,,437.0,,89.0,,5086.0,,4.0,,0.0,22.0,22.0,0.0,1577.0,,4204.0,12.0,179.0,2819.0,42.0,,5214.0,1788.0,258.0
2020-12-30,5739,0,40,35.0,1482.0,5.0,4217.0,,283.0,,89.0,,4210.0,,1.0,,0.0,41.0,4.0,1.0,1104.0,,3400.0,5.0,177.0,2106.0,50.0,,3734.0,1851.0,154.0
2020-12-31,5457,0,40,14.0,1836.0,5.0,3602.0,,420.0,,69.0,,3593.0,,0.0,,0.0,20.0,8.0,0.0,1341.0,,3025.0,3.0,177.0,2211.0,41.0,,3991.0,1312.0,154.0
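
The empty (`NaN`) entries in the tail above fall in columns such as `format_subgroup_Art`, `format_subgroup_Audio Tape`, and `category_group_Reference`. These exist in the prior counts but not in the newly built ones, so `pd.concat` has nothing to fill them with. If zeros are preferred over `NaN` for those days, one option is to reindex the new frame onto the prior frame's columns before concatenating. The sketch below uses toy frames; `prior`, `new`, and `aligned` are illustrative names, not objects from the notebook.

```python
import pandas as pd

# toy stand-ins for the prior counts and the newly computed counts
prior = pd.DataFrame(
    {'total_checkouts': [10], 'format_subgroup_Art': [1], 'format_subgroup_Book': [9]},
    index=['2005-04-13'],
)
new = pd.DataFrame(
    {'total_checkouts': [5], 'format_subgroup_Book': [5]},
    index=['2020-12-31'],
)

# plain concat leaves NaN where `new` lacks a column
with_nan = pd.concat([prior, new])

# reindexing first fills the absent category with 0 instead
aligned = new.reindex(columns=prior.columns, fill_value=0)
with_zero = pd.concat([prior, aligned])

print(with_nan['format_subgroup_Art'].isna().sum())  # 1
print(with_zero['format_subgroup_Art'].tolist())     # [1, 0]
```

Note that reindexing onto the prior columns would also drop any column that appears only in the new data, so this approach assumes the prior frame already contains every category.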


In [16]:
# # uncomment to save
# df_counts.to_pickle('data/seattle_lib_counts.pkl', compression='gzip')