# Library Usage in Seattle 2005-2020

## Data Cleaning

### Import required libraries

In [1]:
import pandas as pd
import numpy as np

### Load checkout data

- Since the data set is so large, I'll specify only the columns that I want in the DataFrame. This will effectively drop the following columns:
    - `ID`
    - `CheckoutYear`
    - `BibNumber`
    - `ItemBarcode`
    - `ItemType`
    
- The `ItemType` and `Collection` columns are very similar, but the `Collection` code maps to more informative values in the `category_group` column, which I add to the DataFrame from the `data_dictionary.csv` file (see below). More specifically, the `ItemType` code yields mostly "Miscellaneous" results, whereas the `Collection` code differentiates between "Fiction" and "Nonfiction", among others. This could be useful later on, so I chose to drop the `ItemType` column.
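
One way to double-check which columns a `usecols` list leaves out is to read only the header row. The sketch below is not part of the original notebook: it uses a tiny in-memory CSV as a stand-in for the real file, and both the column order and the sample row are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the checkout file: the column names come from the
# lists above, but their order and the single data row are made up.
sample_csv = io.StringIO(
    "ID,CheckoutYear,BibNumber,ItemBarcode,ItemType,Collection,"
    "CallNumber,ItemTitle,Subjects,CheckoutDateTime\n"
    "1,2008,123,456,acdvd,nadvd,DVD FIREWAL,Firewall,Kidnapping Drama,"
    "02/13/2008 07:38:00 PM\n"
)

usecols = ['Collection', 'CallNumber', 'ItemTitle', 'Subjects', 'CheckoutDateTime']

# nrows=0 parses only the header, so this stays cheap even for a very large file.
all_columns = pd.read_csv(sample_csv, nrows=0).columns.tolist()

# Columns present in the file but excluded by `usecols`.
dropped = [col for col in all_columns if col not in usecols]
print(dropped)
```

Against the real file, the same check would use the file path in place of `sample_csv`.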

In [2]:
%%time

usecols = ['Collection', 'CallNumber', 'ItemTitle', 'Subjects', 'CheckoutDateTime']

df = pd.read_csv('data/Checkouts_By_Title__Physical_Items_.csv', nrows=1000000, usecols=usecols)

df.columns = ['collection', 'call_number', 'title', 'subjects', 'date']

CPU times: user 2.36 s, sys: 165 ms, total: 2.52 s
Wall time: 2.53 s


In [3]:
df.head()

,collection,call_number,title,subjects,date
0,nadvd,DVD FIREWAL,Firewall,"Kidnapping Drama, Video recordings for the hea...",02/13/2008 07:38:00 PM
1,nanf,793.2 C7744B 2001,best baby shower book a complete guide for par...,Showers Parties,07/23/2008 02:53:00 PM
2,nyfic,YA WESTERF,Uglies,"Fantasy, Teenage girls Fiction, Beauty Persona...",12/23/2009 04:20:00 PM
3,napar,618.4 L9511D 2009,doula guide to birth secrets every pregnant wo...,"Doulas, Childbirth",11/16/2010 12:04:00 PM
4,canf,641.692 M8216S 2005,Salmon a cookbook,Cookery Salmon,04/26/2009 01:29:00 PM


In [4]:
df.shape

(1000000, 5)

In [5]:
df.dtypes

collection     object
call_number    object
title          object
subjects       object
date           object
dtype: object

### Convert `date` column to datetime, set as index

In [6]:
# look at an example before conversion
df.loc[0, 'date']

'02/13/2008 07:38:00 PM'

In [7]:
# specify the format
dt_format = '%m/%d/%Y %I:%M:%S %p'

In [8]:
# convert to datetime, dropping the hour-minute-second stamp using the `dt.date` attribute
df['date'] = pd.to_datetime(df.date, format=dt_format).dt.date

# confirm it worked
df.loc[0, 'date']

datetime.date(2008, 2, 13)
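
Note that `.dt.date` returns plain Python `date` objects, so the column ends up with `object` dtype rather than a pandas datetime dtype. If datetime-aware features (for example resampling by month) are wanted later, one possible alternative is `.dt.normalize()`, which sets the time of day to midnight but keeps the datetime dtype. A minimal sketch on two of the timestamps shown above:

```python
import pandas as pd

dt_format = '%m/%d/%Y %I:%M:%S %p'
raw = pd.Series(['02/13/2008 07:38:00 PM', '07/23/2008 02:53:00 PM'])

# .dt.date gives plain Python date objects, so the Series dtype is `object`.
as_date = pd.to_datetime(raw, format=dt_format).dt.date

# .dt.normalize() resets the time to midnight but keeps a datetime64 dtype.
as_midnight = pd.to_datetime(raw, format=dt_format).dt.normalize()

print(as_date.dtype)
print(pd.api.types.is_datetime64_any_dtype(as_midnight))
```

Whether this trade-off matters depends on what the later analysis needs; the notebook itself uses `.dt.date`.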

In [9]:
# set `date` column as index and sort by index
# note: passing the column label ('date') drops the column from the frame by default;
# passing the Series (df.date) would leave a duplicate `date` column behind
df = df.set_index('date').sort_index()

df.head()

date,collection,call_number,title,subjects
2008-01-02,nafic,FIC BERG1998,What we keep,
2008-01-02,nanew,362.82092 W159W 2005,glass castle a memoir,"Problem families United States Case studies, C..."
2008-01-02,nadvd,DVD MIDSOME,Midsomer murders A worm in the bud,"Video recordings for the hearing impaired, Det..."
2008-01-02,nacd,CD 782.42166 St547L,Logic will break your heart,Rock music 2001 2010
2008-01-02,ncpic,E LOVELL,Stand tall Molly Lou Melon,"Bullies Fiction, Grandmothers Fiction, Self ac..."
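
For reference, `set_index` only removes a column from the frame when it is given the column's label. When it is given a Series, it builds the index from the values and leaves the columns untouched. A toy example (the frame below is made up) showing the difference:

```python
import pandas as pd

toy = pd.DataFrame({'date': ['2008-01-02', '2008-01-03'], 'title': ['a', 'b']})

# Column label: 'date' moves into the index and is dropped from the columns.
by_name = toy.set_index('date')

# Series: the index is built from its values, but the column stays in place.
by_series = toy.set_index(toy.date)

print(by_name.columns.tolist())
print(by_series.columns.tolist())
```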


In [10]:
please break code  # intentionally invalid: stops "Run All" before the exploratory cells below

SyntaxError: invalid syntax (<ipython-input-10-b8306b2d38fe>, line 1)

### Load other info from data dictionary

In [None]:
dd = pd.read_csv('data/data_dictionary.csv')

dd.columns = ['code', 'description', 'code_type', 'format_group', 'format_subgroup', 
              'category_group', 'category_subgroup', 'age_group']

dd.head()

In [None]:
dd[dd.code_type == 'ItemType']

In [None]:
dd[dd.code == 'nadvd']

In [None]:
dd[dd.code == 'acdvd']

In [None]:
dd.code_type.unique()

In [None]:
dd_item = dd[dd.code_type == 'ItemType'][['code', 'description', 'format_group', 'format_subgroup', 'category_group', 
             'category_subgroup', 'age_group']]

dd_item.head()

In [None]:
dd_item2 = dd[dd.code_type == 'ItemCollection'][['code', 'description', 'format_group', 'format_subgroup', 'category_group', 
             'category_subgroup', 'age_group']]

dd_item2.head()

In [None]:
# requires 'ItemType' in `usecols` (renamed to `item_type`); it is not loaded in the run above
sorted(df.item_type.unique())

In [None]:
dd_loc = dd[dd.code_type == 'Location'][['code', 'description']]

dd_loc.head()

In [None]:
# requires 'ItemType' in `usecols` (renamed to `item_type`); it is not loaded in the run above
test = df.merge(dd_item, left_on='item_type', right_on='code')
# test = test.merge(dd_loc, left_on='collection', right_on='code')

test.head()

In [None]:
test.isna().sum()

In [None]:
test.format_group.value_counts()

In [None]:
test.collection.unique()

In [None]:
test.shape

In [None]:
# the column was renamed to lowercase `collection` when the data was loaded
test2 = df.merge(dd_item2, left_on='collection', right_on='code')

test2.head()
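
Two details of `merge` are worth keeping in mind here: merging on columns discards the existing (date) index, and the default inner join silently drops rows whose code has no match in the dictionary. The sketch below shows one way to guard against both, using `reset_index`, `how='left'`, and `indicator=True`. The frames, codes, and category labels are toy stand-ins, not values taken from the real data dictionary.

```python
import pandas as pd

# Toy stand-ins; 'zzzzz' is a made-up code with no dictionary entry.
checkouts = pd.DataFrame(
    {'collection': ['nadvd', 'nanf', 'zzzzz'],
     'title': ['Firewall', 'Salmon a cookbook', 'Unknown item']},
    index=pd.Index(['2008-02-13', '2009-04-26', '2010-01-01'], name='date'),
)
lookup = pd.DataFrame(
    {'code': ['nadvd', 'nanf'], 'category_group': ['Fiction', 'Nonfiction']}
)

merged = (
    checkouts.reset_index()  # keep the dates as a column so the merge does not lose them
    .merge(lookup, left_on='collection', right_on='code', how='left', indicator=True)
    .set_index('date')       # restore the date index afterwards
)

# Codes that found no match in the lookup table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'collection'].unique()
print(merged.shape)
print(unmatched)
```

Comparing the merged shape with the original shape, as the cells above do with `test.shape`, is a quick way to see whether any rows were lost.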

In [None]:
test.groupby('format_group').category_group.value_counts()

In [None]:
test2.groupby('format_group').category_group.value_counts()

In [None]:
dd[dd.code == 'nybot']

In [None]:
sorted(df.collection.unique())

In [None]:
dd_loc.code.unique()

In [None]:
[cod for cod in df.collection.unique() if cod in dd_loc.code.unique()]
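
The list comprehension above recomputes `dd_loc.code.unique()` for every code it checks. A possible alternative is `Series.isin`, which does the membership test in one vectorized call. A minimal sketch with made-up code lists:

```python
import pandas as pd

# Made-up stand-ins for df.collection and dd_loc.code.
collection_codes = pd.Series(['nadvd', 'nanf', 'nyfic', 'nadvd'])
location_codes = pd.Series(['cen', 'bal', 'nanf'])

# Deduplicate first, then keep only the codes that also appear in location_codes.
unique_codes = pd.Series(collection_codes.unique())
overlap = unique_codes[unique_codes.isin(location_codes)].tolist()
print(overlap)
```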

In [None]:
%%time

import gzip
import pickle

# save the DataFrame as a gzip-compressed pickle
with gzip.open('data/seattle_lib.pkl', 'wb') as goodbye:
    pickle.dump(df, goodbye, protocol=pickle.HIGHEST_PROTOCOL)
    
# # uncomment to load
# with gzip.open('data/seattle_lib.pkl', 'rb') as hello:
#     df = pickle.load(hello)
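
As an alternative to calling `gzip` and `pickle` directly, pandas can write and read a compressed pickle itself through `DataFrame.to_pickle` and `pd.read_pickle`. The sketch below uses a toy frame and a temporary directory (both assumptions for illustration). Because the file name ends in `.pkl` rather than `.gz`, the compression is passed explicitly instead of being inferred from the extension.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame(
    {'title': ['Uglies', 'Firewall']},
    index=pd.Index(['2009-12-23', '2008-02-13'], name='date'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')
    # Write a gzip-compressed pickle, then read it back.
    toy.to_pickle(path, compression='gzip')
    restored = pd.read_pickle(path, compression='gzip')

print(restored.equals(toy))
```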