# Library Usage in Seattle, 2005-2020

## Data Cleaning

The data is the [Checkouts by Title (Physical Items)](https://data.seattle.gov/Community/Checkouts-By-Title-Physical-Items-/5src-czff) dataset from [Seattle Open Data](https://data.seattle.gov/) and was downloaded on December 15, 2020.

This notebook is designed to load a downloaded CSV file, merge it with item-specific information, convert it to a time-series-ready DataFrame, and save that as a compressed pickle file.

*Note: This dataset is updated weekly; the more data, the longer the load times will be.*

In the future, it may be worth adding [API](https://dev.socrata.com/foundry/data.seattle.gov/5src-czff) calls to the pipeline, so that each week's new data can be appended quickly and easily.
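As a rough sketch of what such an API call might look like, the snippet below only builds a query URL for rows newer than a given timestamp. The resource URL follows the usual Socrata pattern for this dataset ID, and the field name `checkoutdatetime` is an assumption; both should be checked against the API page linked above. `build_update_url` is a hypothetical helper, not part of this project.

```python
from urllib.parse import urlencode

# usual Socrata resource URL pattern for dataset 5src-czff (assumed, not verified here)
BASE_URL = "https://data.seattle.gov/resource/5src-czff.json"

def build_update_url(since, limit=50000, date_field="checkoutdatetime"):
    """Build a SoQL query URL for rows checked out after `since` (ISO 8601 string).

    `date_field` is an assumed API field name; confirm it in the dataset's API docs.
    """
    params = {
        "$where": f"{date_field} > '{since}'",
        "$order": date_field,
        "$limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_update_url("2020-12-15T00:00:00")
print(url)
# new_rows = pd.read_json(url)  # would fetch the rows; intentionally not run here
```

Larger updates would need paging (for example with `$offset`), which this sketch leaves out.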

*Note: Any cell that uses the built-in magic command* `%%time` *takes a significant (or at least not insignificant) time to run.*

### Import required libraries

In [1]:
# standard dataframe packages
import pandas as pd
import numpy as np

# saving packages
import pickle
import gzip

from functions.data_cleaning import status_update

### Load checkout data

[[go back to the top](#Library-Usage-in-Seattle,-2005-2020)]

Since the data set is so large, I'll specify only the columns that I want in the DataFrame. This will effectively drop the following columns:
- `ID`
- `CheckoutYear`
- `CallNumber`
- `BibNumber`
- `ItemBarcode`
- `ItemType`
    
I want to note that the `ItemType` and `Collection` columns are very similar, but the `Collection` code maps to more informative values in the `category_group` column, which I add to the DataFrame using the `data_dictionary.csv` file ([see below](#Load-other-info-from-data-dictionary-and-merge-onto-checkouts-dataset)). More specifically, the `ItemType` code yields mostly "Miscellaneous" results, whereas the `Collection` code differentiates between "Fiction" and "Nonfiction", among others. This could be useful information later on, so I found it best to keep `Collection` and drop the `ItemType` column.

Note: In the future, I may consider using this [dataset](https://data.seattle.gov/Community/Library-Collection-Inventory/6vkj-f5xf) to add on branch information (i.e. which branch an item was checked out from), although this data is rather limited (due to privacy concerns) and incomplete (only appears to be collected beginning in 2017). In order to do that I would need to use the `BibNumber` column.

#### ⏰ Cell below takes ~14.5 minutes to run. ⏰

In [2]:
%%time

# columns to load
usecols = ['Collection', 'ItemTitle', 'Subjects', 'CheckoutDateTime']

# load data
df = pd.read_csv('data/Checkouts_By_Title__Physical_Items_.csv',
#                  nrows=10000000,
                 usecols=usecols)

# rename columns to my preferred format
df.columns = ['collection', 'title', 'subjects', 'date']

CPU times: user 4min 20s, sys: 4min 39s, total: 9min
Wall time: 14min 1s


*Note: While it only took about 25 seconds to load 10 million rows, it takes about 25 minutes to load 106.5 million rows with the same number (5) of columns. In my latest update, I brought it down to 4 columns, which decreased load time to 15.5 minutes.*
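Since load time grows so sharply with row count, one option I have not tried here is reading the file in chunks and summarizing each chunk as it arrives. The sketch below uses a tiny in-memory CSV as a stand-in for the real file, and the chunk size of 2 is purely illustrative (something like 1,000,000 would be more realistic).

```python
import io
import pandas as pd

# tiny stand-in for the real CSV so the sketch runs on its own
csv_text = """Collection,ItemTitle,Subjects,CheckoutDateTime
nadvd,Firewall,Kidnapping Drama,02/13/2008 07:38:00 PM
nanf,Salmon a cookbook,Cookery Salmon,04/26/2009 01:29:00 PM
nadvd,Uglies,Fantasy,12/23/2009 04:20:00 PM
"""

usecols = ['Collection', 'ItemTitle', 'Subjects', 'CheckoutDateTime']

# `chunksize` makes read_csv return an iterator of DataFrames instead of one big frame
chunks = pd.read_csv(io.StringIO(csv_text), usecols=usecols, chunksize=2)

# keep only a running summary, so no more than one chunk is in memory at a time
counts = pd.Series(dtype='int64')
for chunk in chunks:
    counts = counts.add(chunk['Collection'].value_counts(), fill_value=0)

counts = counts.astype('int64')
print(counts)
```

This trades one long load for many short ones; it only helps if the per-chunk work (here, a value count) is all that needs to be kept.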

In [3]:
# check shape
df.shape

(106534901, 4)

In [4]:
# take a look
df.head()

Unnamed: 0,collection,title,subjects,date
0,nadvd,Firewall,"Kidnapping Drama, Video recordings for the hea...",02/13/2008 07:38:00 PM
1,nanf,best baby shower book a complete guide for par...,Showers Parties,07/23/2008 02:53:00 PM
2,nyfic,Uglies,"Fantasy, Teenage girls Fiction, Beauty Persona...",12/23/2009 04:20:00 PM
3,napar,doula guide to birth secrets every pregnant wo...,"Doulas, Childbirth",11/16/2010 12:04:00 PM
4,canf,Salmon a cookbook,Cookery Salmon,04/26/2009 01:29:00 PM


#### ⏰ Cell below takes ~3 minutes to run. ⏰

In [5]:
%%time

# check for nan values
df.isna().sum()

CPU times: user 33.6 s, sys: 1min 9s, total: 1min 43s
Wall time: 3min 19s


collection          0
title          900912
subjects      1649522
date                0
dtype: int64

*NOTE: Even checking for NaN values takes a significant amount of time with this many rows.*

The most important columns (`collection` and `date`) have no NaN values.

In [6]:
# check datatypes
df.dtypes

collection    object
title         object
subjects      object
date          object
dtype: object

### Convert `date` column to datetime

[[go back to the top](#Library-Usage-in-Seattle,-2005-2020)]

In [7]:
# look at an example before conversion
df.loc[0, 'date']

'02/13/2008 07:38:00 PM'

In [8]:
# specify the format
dt_format = '%m/%d/%Y %I:%M:%S %p'

#### ⏰ Cell below takes ~7 minutes to run. ⏰

In [9]:
%%time

# convert to datetime, dropping the hour-minute-second stamp using the `dt.date` attribute
df['date'] = pd.to_datetime(df.date, format=dt_format).dt.date

# confirm it worked
df.loc[0, 'date']

CPU times: user 6min 27s, sys: 15.6 s, total: 6min 43s
Wall time: 6min 46s


datetime.date(2008, 2, 13)
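One side effect of `.dt.date` is that the column ends up as `object` dtype holding Python `datetime.date` values. A possible alternative, sketched below on two made-up rows, is `.dt.normalize()`, which sets the time to midnight but keeps the `datetime64` dtype, so comparisons against date strings work.

```python
import pandas as pd

raw = pd.Series(['02/13/2008 07:38:00 PM', '01/02/2008 09:15:00 AM'])
dt_format = '%m/%d/%Y %I:%M:%S %p'

# normalize() zeroes out the time of day but keeps the datetime64 dtype
dates = pd.to_datetime(raw, format=dt_format).dt.normalize()
print(dates.dtype)

# with a datetime64 column, pandas parses the string before comparing
matches = dates[dates == '2008-01-02']
print(len(matches))
```

I have not timed this on the full dataset, so whether it is faster or slower than `.dt.date` at 106.5 million rows is an open question.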

### Load other info from data dictionary and merge onto checkouts dataset

[[go back to the top](#Library-Usage-in-Seattle,-2005-2020)]

In [10]:
# load data
dd = pd.read_csv('data/data_dictionary.csv')

# rename columns to my preferred format
dd.columns = ['code', 'description', 'code_type', 'format_group', 'format_subgroup', 
              'category_group', 'category_subgroup', 'age_group']

# take a look
dd.head()

Unnamed: 0,code,description,code_type,format_group,format_subgroup,category_group,category_subgroup,age_group
0,cazover,CA7-zine collection oversize,ItemCollection,Print,Book,Periodical,,Adult
1,caziner,CA7-zine collection reference,ItemCollection,Print,Book,Periodical,,Adult
2,cazval,CA7-zine collection valuable mat.,ItemCollection,Print,Book,Periodical,,Adult
3,nga,Northgate Branch,Location,,,,,
4,hip,High Point Branch,Location,,,,,


In [11]:
# check shape
dd.shape

(580, 8)

In [12]:
# check datatypes
dd.dtypes

code                 object
description          object
code_type            object
format_group         object
format_subgroup      object
category_group       object
category_subgroup    object
age_group            object
dtype: object

Since I will only be using information from codes whose type is "ItemCollection", I'll subset the data dictionary down to just those rows.

In [13]:
# subset to only collection codes (copy so later in-place edits don't act on a view)
dd = dd[dd.code_type == 'ItemCollection'].copy()

In [14]:
# check for nan values
dd.isna().sum()

code                   0
description            0
code_type              0
format_group           0
format_subgroup       28
category_group         2
category_subgroup    391
age_group              0
dtype: int64

Again, with the size of the eventual DataFrame in mind, I want to drop any unnecessary columns before merging, so I'll drop the following columns:
- `description`, since that is superfluous information for this project
- `code_type`, since that is superfluous information
- `category_subgroup`, since that is mostly NaN values

In [15]:
# drop columns
dd.drop(columns=['description', 'code_type', 'category_subgroup'], inplace=True)

In [16]:
# list of columns to convert
to_convert = ['format_group', 'format_subgroup', 'category_group', 'age_group']

# convert to category datatype
dd[to_convert] = dd[to_convert].apply(pd.Categorical)

In [17]:
# confirm new datatypes
dd.dtypes

code                 object
format_group       category
format_subgroup    category
category_group     category
age_group          category
dtype: object

#### ⏰ Cell below takes ~4 minutes to run. ⏰

In [18]:
%%time

# merge checkouts dataframe with info from data dictionary
df_merged = df.merge(dd, left_on='collection', right_on='code')

# take a look
df_merged.head()

CPU times: user 1min, sys: 1min 50s, total: 2min 51s
Wall time: 4min 23s


Unnamed: 0,collection,title,subjects,date,code,format_group,format_subgroup,category_group,age_group
0,nadvd,Firewall,"Kidnapping Drama, Video recordings for the hea...",2008-02-13,nadvd,Media,Video Disc,Fiction,Adult
1,nadvd,Marley me,"Comedy films, Married people Drama, Philadelph...",2009-07-03,nadvd,Media,Video Disc,Fiction,Adult
2,nadvd,Six feet under The complete fourth season,"Video recordings for the hearing impaired, Pro...",2008-10-26,nadvd,Media,Video Disc,Fiction,Adult
3,nadvd,Doctor Who The next doctor,"London England Drama, Doctor Who Fictitious ch...",2010-11-10,nadvd,Media,Video Disc,Fiction,Adult
4,nadvd,School ties,"Antisemitism Drama, Video recordings for the h...",2008-12-28,nadvd,Media,Video Disc,Fiction,Adult


### Drop unnecessary columns

I can now drop the `collection` and `code` columns, since those are no longer necessary.

*NOTE: In earlier runs, the pandas `.drop()` method took well over an hour. Selecting only the columns to keep is an alternative I tried; that attempt is preserved, commented out, in the GRAVEYARD section at the end of this notebook.*

#### ⏰ Cell below takes ~45 minutes to run. ⏰

In [19]:
%%time

# drop columns
df_merged.drop(columns=['collection', 'code'], inplace=True)

CPU times: user 2min 43s, sys: 18min 6s, total: 20min 50s
Wall time: 44min 42s
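For reference, dropping columns by selecting the ones to keep can be sketched as follows, on a made-up two-row frame. Whether this is actually faster than `.drop()` on the full dataset is not something this notebook confirms.

```python
import pandas as pd

merged = pd.DataFrame({
    'collection': ['nadvd', 'nanf'],
    'title': ['Firewall', 'Salmon a cookbook'],
    'code': ['nadvd', 'nanf'],
    'format_group': ['Media', 'Print'],
})

# keep everything except the two merge keys
keep_cols = [c for c in merged.columns if c not in ('collection', 'code')]
merged = merged[keep_cols]
print(merged.columns.tolist())
```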


### Set `date` column as index

[[go back to the top](#Library-Usage-in-Seattle,-2005-2020)]

I've commented out the below code because now *this* has begun to consistently crash the kernel or zsh shell.

After thinking more about it, I believe I will end up grouping by the date to get raw numbers for each day (not only total checkouts, but total print checkouts and fiction checkouts, etc.), which can be done before setting the `date` column as the index.
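A minimal sketch of that idea, on a made-up three-row frame: group by date and format first, then convert the (much smaller) index to datetime afterwards. The column names mirror the ones used here; the data is invented.

```python
import datetime as dt
import pandas as pd

toy = pd.DataFrame({
    'date': [dt.date(2008, 1, 2), dt.date(2008, 1, 2), dt.date(2008, 1, 3)],
    'format_group': ['Print', 'Media', 'Print'],
})

# one row per day, one column per format group, plus a daily total
daily = toy.groupby(['date', 'format_group']).size().unstack(fill_value=0)
daily['total'] = daily.sum(axis=1)

# the index now has one entry per day, so converting it to datetime is cheap
daily.index = pd.to_datetime(daily.index)
print(daily)
```

The result is already sorted by date, since `groupby` sorts its keys by default.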

In [20]:
# %%time

# # set `date` column as index and sort by index
# df_merged = df_merged.set_index('date').sort_index()

# # take a look
# df_merged.head()

In [21]:
# check shape
df_merged.shape

(106503843, 7)
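The merged DataFrame has 106,503,843 rows, which is 31,058 fewer than the 106,534,901 rows loaded. Since `merge` defaults to an inner join, these are presumably rows whose `collection` code has no match among the `ItemCollection` codes. One way to check, sketched here on made-up data (the code `zzzz` is invented), is a left merge with `indicator=True`.

```python
import pandas as pd

df_small = pd.DataFrame({'collection': ['nadvd', 'nanf', 'zzzz'],
                         'title': ['Firewall', 'Salmon a cookbook', 'Unknown']})
dd_small = pd.DataFrame({'code': ['nadvd', 'nanf'],
                         'format_group': ['Media', 'Print']})

# a left merge keeps every checkout; `_merge` records whether a match was found
checked = df_small.merge(dd_small, left_on='collection', right_on='code',
                         how='left', indicator=True)

# codes that appear in the checkouts but not in the data dictionary
unmatched = checked.loc[checked['_merge'] == 'left_only', 'collection'].value_counts()
print(unmatched)
```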

In [22]:
# check datatypes
df_merged.dtypes

title                object
subjects             object
date                 object
format_group       category
format_subgroup    category
category_group     category
age_group          category
dtype: object

NOTE: I may be able to drop even more columns (thinking especially of `title` and `subjects`), since I'll mostly be looking at sheer numbers of items checked out each day. I'll keep them in for now in case they end up being useful for EDA.

### 💾 Save

[[go back to the top](#Library-Usage-in-Seattle,-2005-2020)]

Due to several kernel and zsh shell crashes, I'm going to try to save the DataFrame in batches of 10 million rows.

*NOTE: Saving 10 million rows takes about 5 minutes and the file size is 290MB. Increasing to 20 million rows seemed to increase save time considerably, so that attempt was interrupted before completing.*

In [23]:
# %%time

# # loop through index and multiples of 10 million
# for ind, i in enumerate(range(0, 110000000, 10000000), 1):
    
#     # save (via compressed pickle) a dataframe of 10 million rows, use index for unique file names
#     df_merged.iloc[i:i+10000000].to_pickle(f'data/seattle_lib_{ind}.pkl', compression='gzip')
    
#     # print status/time
#     status_update(f'File {ind} out of 11 saved successfully')

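The previous loop appeared to be stuck on the 5th part, so I interrupted the kernel and am attempting to save the remaining parts. Note that the loop below starts at row 50,000,000 but starts its file counter at 5, so its file numbers run one lower than in the loop above (rows 50,000,000 onward are saved as file 5, not file 6).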

In [24]:
%%time

# loop through index and multiples of 10 million
for ind, i in enumerate(range(50000000, 110000000, 10000000), 5):
    
    # save (via compressed pickle) a dataframe of 10 million rows, use index for unique file names
    df_merged.iloc[i:i+10000000].to_pickle(f'data/seattle_lib_{ind}.pkl', compression='gzip')
    
    # print status/time
    status_update(f'File {ind} out of 11 saved successfully')

Current time = 00:53:04
-----------------------
File 5 out of 11 saved successfully

Current time = 01:18:18
-----------------------
File 6 out of 11 saved successfully

Current time = 02:35:13
-----------------------
File 7 out of 11 saved successfully

Current time = 04:42:51
-----------------------
File 8 out of 11 saved successfully

Current time = 06:06:44
-----------------------
File 9 out of 11 saved successfully

Current time = 06:43:35
-----------------------
File 10 out of 11 saved successfully

CPU times: user 28min 7s, sys: 2h 17min 27s, total: 2h 45min 35s
Wall time: 6h 10min 4s
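To avoid hard-coding row offsets and file numbers, the batch save could be wrapped in a small helper that derives both from the chunk size. This is only a sketch: `save_in_chunks` and its `start_file` parameter are hypothetical names, and the toy run uses 25 rows rather than 106.5 million.

```python
import math
import os
import tempfile
import pandas as pd

def save_in_chunks(frame, chunk_size, out_dir, prefix='seattle_lib', start_file=1):
    """Save `frame` as numbered gzip pickles; return the list of paths written.

    File `ind` always holds rows (ind - 1) * chunk_size onward, so resuming
    with `start_file` keeps file numbers and row offsets in step.
    """
    n_files = math.ceil(len(frame) / chunk_size)
    paths = []
    for ind in range(start_file, n_files + 1):
        start = (ind - 1) * chunk_size
        path = os.path.join(out_dir, f'{prefix}_{ind}.pkl')
        frame.iloc[start:start + chunk_size].to_pickle(path, compression='gzip')
        paths.append(path)
    return paths

# toy run: 25 rows in chunks of 10 gives 3 files
toy = pd.DataFrame({'x': range(25)})
out_dir = tempfile.mkdtemp()
paths = save_in_chunks(toy, 10, out_dir)

# reading the pieces back and concatenating restores the original frame
restored = pd.concat([pd.read_pickle(p, compression='gzip') for p in paths])
print(len(paths), len(restored))
```

Resuming after an interruption would then be `save_in_chunks(df_merged, 10000000, 'data', start_file=5)`, with no separate offset to keep track of.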


In [26]:
# save rows 40-50 million separately, since file 5 from the loop above holds rows 50-60 million
df_merged.iloc[40000000:50000000].to_pickle('data/seattle_lib_4b.pkl', compression='gzip')

In [25]:
please break code

SyntaxError: invalid syntax (<ipython-input-25-b8306b2d38fe>, line 1)

In [27]:
test = df_merged.iloc[:10000000]

test.head()

Unnamed: 0,title,subjects,date,format_group,format_subgroup,category_group,age_group
0,Firewall,"Kidnapping Drama, Video recordings for the hea...",2008-02-13,Media,Video Disc,Fiction,Adult
1,Marley me,"Comedy films, Married people Drama, Philadelph...",2009-07-03,Media,Video Disc,Fiction,Adult
2,Six feet under The complete fourth season,"Video recordings for the hearing impaired, Pro...",2008-10-26,Media,Video Disc,Fiction,Adult
3,Doctor Who The next doctor,"London England Drama, Doctor Who Fictitious ch...",2010-11-10,Media,Video Disc,Fiction,Adult
4,School ties,"Antisemitism Drama, Video recordings for the h...",2008-12-28,Media,Video Disc,Fiction,Adult


### Checking for duplicates and assumption of data integrity

The issue with checking for duplicates in this dataset is that duplicates are acceptable! It is very likely that the same title is checked out more than once on a single day, from either the same branch or different branches. Multiple copies of a book, for example, can be stored in one branch or across several branches.

NOTE: More investigation on the uniqueness of an item's call number could potentially solve this and allow me to check for actual duplicate rows. For the time being, I will assume the data is almost entirely, if not entirely, accurate.

Checking for duplicates can be done using the code below, although it may take quite a while to run.

In [32]:
# %%time

# df_merged[df_merged.duplicated(keep=False)]

CPU times: user 9.16 s, sys: 673 ms, total: 9.83 s
Wall time: 9.84 s


Unnamed: 0,title,subjects,date,format_group,format_subgroup,category_group,age_group
5,Pirates of the Caribbean At worlds end,"Adventure films, Fantasy films, Comedy films, ...",2008-06-03,Media,Video Disc,Fiction,Adult
662,Ray,"Video recordings for the hearing impaired, Fea...",2008-06-03,Media,Video Disc,Fiction,Adult
14387,No reservations,"Comedy films, Man woman relationships Drama, V...",2008-06-03,Media,Video Disc,Fiction,Adult
19519,Charlie Wilsons war,"Legislators United States Drama, Video recordi...",2008-06-03,Media,Video Disc,Fiction,Adult
27006,Georgia rule,"Comedy films, Mothers and daughters Drama, Vid...",2008-06-03,Media,Video Disc,Fiction,Adult
42380,Bury my heart at Wounded Knee,"Video recordings for the hearing impaired, Uni...",2008-06-03,Media,Video Disc,Fiction,Adult
42860,Nacho Libre,"Comedy films, Video recordings for the hearing...",2008-06-03,Media,Video Disc,Fiction,Adult
45389,Closer,"Man woman relationships England London Drama, ...",2008-06-03,Media,Video Disc,Fiction,Adult
57255,Georgia rule,"Comedy films, Mothers and daughters Drama, Vid...",2008-06-03,Media,Video Disc,Fiction,Adult
66732,Margot at the wedding,"Comedy films, Mothers and sons Drama, Problem ...",2008-06-03,Media,Video Disc,Fiction,Adult
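Rather than displaying every duplicated row, it may be more useful to count them. The sketch below, on made-up rows, counts how many rows have at least one exact copy and how many times each repeated row occurs. Note that `groupby` skips rows with NaN in any grouping column by default, so on the real data the two numbers would not necessarily line up.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Georgia rule', 'Georgia rule', 'Ray'],
    'date': ['2008-06-03', '2008-06-03', '2008-06-03'],
})

# number of rows that have at least one exact copy
n_dupe_rows = toy.duplicated(keep=False).sum()

# how many times each distinct row occurs, keeping only the repeated ones
copies = toy.groupby(list(toy.columns)).size()
repeated = copies[copies > 1]

print(n_dupe_rows)
print(repeated)
```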


In [34]:
%%time

cols = ['date', 'format_group', 'format_subgroup', 'category_group', 'age_group']

test[cols].groupby(cols).size()

CPU times: user 3min 15s, sys: 1min 56s, total: 5min 12s
Wall time: 5min 42s


date        format_group  format_subgroup  category_group     age_group
2008-01-02  Electronic    Art              Fiction            Adult        0
                                                              Juvenile     0
                                                              Teen         0
                                           Interlibrary Loan  Adult        0
                                                              Juvenile     0
                                                                          ..
2019-07-18  Print         Video Tape       Temporary          Juvenile     0
                                                              Teen         0
                                           WTBBL              Adult        0
                                                              Juvenile     0
                                                              Teen         0
Length: 9139500, dtype: int64
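The zero counts above appear to come from grouping on categorical columns: by default in this pandas version, `groupby` returns every possible combination of categories, whether or not it occurs in the data. Passing `observed=True` restricts the result to combinations that are actually present, as this toy example (with invented rows) shows.

```python
import pandas as pd

toy = pd.DataFrame({
    'format_group': pd.Categorical(['Print', 'Media', 'Print'],
                                   categories=['Electronic', 'Media', 'Print']),
    'age_group': pd.Categorical(['Adult', 'Adult', 'Teen'],
                                categories=['Adult', 'Juvenile', 'Teen']),
})
cols = ['format_group', 'age_group']

# every category combination: 3 x 3 = 9 rows, mostly zeros
all_combos = toy.groupby(cols, observed=False).size()

# only the combinations that occur in the data
seen_only = toy.groupby(cols, observed=True).size()

print(len(all_combos), len(seen_only))
```

On the real data this should shrink the result considerably, and likely the run time as well, though I have not measured it.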

In [49]:
df_merged.format_group.value_counts()

Print         59685137
Media         46618209
Other           200478
Equipment           18
Electronic           1
Name: format_group, dtype: int64

In [51]:
df_merged.format_subgroup.value_counts()

Book              59486648
Video Disc        30287406
Audio Disc        11238813
Audiobook Disc     2695078
Video Tape         1474457
Kit                 626008
Audiobook Tape      240328
Music Score         130486
Audio Tape           45946
Folder               23900
Data Disc             9886
Periodical             623
Document               471
Art                    129
Film                    81
Name: format_subgroup, dtype: int64

In [43]:
df_merged.category_group.unique()

[Fiction, Nonfiction, Language, Miscellaneous, Interlibrary Loan, Reference, On Order, NaN, Temporary, WTBBL]
Categories (9, object): [Fiction, Nonfiction, Language, Miscellaneous, ..., Reference, On Order, Temporary, WTBBL]

In [48]:
df_merged.category_group.value_counts()

Fiction              65292861
Nonfiction           37539002
Miscellaneous         1734282
Language              1679452
Interlibrary Loan      192959
Reference               57420
On Order                 7422
Temporary                  40
WTBBL                      26
Periodical                  0
Name: category_group, dtype: int64

Some of these (`Miscellaneous`, `On Order`, `Temporary`) can be simplified into an `Other` category, since their current category doesn't provide any valuable information about what the actual item is. I was originally considering including `Interlibrary Loan` in the `Other` category as well, but it may be interesting to see the amount of activity between libraries, so I'll leave it in for now.

Based on some research, `WTBBL` stands for "Washington Talking Book & Braille Library" and includes materials for folks with visual impairments. I was interested in looking into this, but since the numbers are so low, I think I'll also group that into the `Other` category, as I assume this `category_group` value deals with *equipment* that can be rented.

I will also convert `Periodical` to `Other`; even though the count for it is 0, I assume this may relate to items that have a `format_subgroup` value but no `category_group` value.

In [None]:
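convert_values = ['Miscellaneous', 'On Order', 'Temporary', 'WTBBL', 'Periodical']

# `isin` builds the element-wise mask (`in` does not work on a Series);
# `np.where` needs both the replacement value and the fallback values
mask = test.category_group.isin(convert_values)
test = test.assign(category_group=np.where(mask, 'Other', test.category_group))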

In [55]:
test.loc[0,'category_group']

'Fiction'

In [62]:
%%time

df_merged.groupby('date').format_group.value_counts()

CPU times: user 55.3 s, sys: 2min 39s, total: 3min 34s
Wall time: 8min


date        format_group
2005-04-13  Print           10041
            Media            6397
            Other              32
            Equipment           1
2005-04-14  Print            6267
                            ...  
2020-12-12  Other               7
2020-12-13  Print            3225
            Media             760
2020-12-14  Print            1342
            Media             282
Name: format_group, Length: 16107, dtype: int64

In [63]:
%%time

df_merged.groupby('date').format_group.value_counts().unstack().fillna(0)

CPU times: user 39.8 s, sys: 3.79 s, total: 43.6 s
Wall time: 43.6 s


format_group,Electronic,Equipment,Media,Other,Print
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2005-04-13,0.0,1.0,6397.0,32.0,10041.0
2005-04-14,0.0,1.0,4015.0,75.0,6267.0
2005-04-15,0.0,0.0,5351.0,51.0,7494.0
2005-04-16,0.0,0.0,552.0,0.0,806.0
2005-04-17,0.0,0.0,1555.0,8.0,2992.0
...,...,...,...,...,...
2020-12-10,0.0,0.0,1800.0,2.0,3913.0
2020-12-11,0.0,0.0,1417.0,3.0,4234.0
2020-12-12,0.0,0.0,2169.0,7.0,4681.0
2020-12-13,0.0,0.0,760.0,0.0,3225.0


In [71]:
%%time

df_merged.groupby('date').size()

CPU times: user 19.3 s, sys: 1.19 s, total: 20.5 s
Wall time: 20.6 s


date
2005-04-13    16471
2005-04-14    10358
2005-04-15    12896
2005-04-16     1358
2005-04-17     4555
              ...  
2020-12-10     5715
2020-12-11     5654
2020-12-12     6857
2020-12-13     3985
2020-12-14     1624
Length: 5470, dtype: int64

In [None]:
%%time

cols = ['date', 'format_group', 'format_subgroup', 'category_group', 'age_group']

df_merged.groupby(cols).size().unstack(fill_value=0)

In [70]:
%%time

cols = ['format_group', 'format_subgroup',
#         'category_group', 'age_group'
       ]

df_merged.groupby('date')[cols].count()

CPU times: user 26 s, sys: 2.84 s, total: 28.8 s
Wall time: 29.5 s


Unnamed: 0_level_0,format_group,format_subgroup
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2005-04-13,16471,16415
2005-04-14,10358,10276
2005-04-15,12896,12842
2005-04-16,1358,1357
2005-04-17,4555,4535
...,...,...
2020-12-10,5715,5708
2020-12-11,5654,5648
2020-12-12,6857,6846
2020-12-13,3985,3978


In [56]:
test[test.category_group == 'Fiction']

Unnamed: 0,title,subjects,date,format_group,format_subgroup,category_group,age_group
0,Firewall,"Kidnapping Drama, Video recordings for the hea...",2008-02-13,Media,Video Disc,Fiction,Adult
1,Marley me,"Comedy films, Married people Drama, Philadelph...",2009-07-03,Media,Video Disc,Fiction,Adult
2,Six feet under The complete fourth season,"Video recordings for the hearing impaired, Pro...",2008-10-26,Media,Video Disc,Fiction,Adult
3,Doctor Who The next doctor,"London England Drama, Doctor Who Fictitious ch...",2010-11-10,Media,Video Disc,Fiction,Adult
4,School ties,"Antisemitism Drama, Video recordings for the h...",2008-12-28,Media,Video Disc,Fiction,Adult
...,...,...,...,...,...,...,...
9999995,Unbreakable,"Heroes Drama, Supernatural Drama, Survival Dra...",2019-07-18,Media,Video Disc,Fiction,Adult
9999996,DCs legends of tomorrow The complete second se...,"Good and evil Drama, Time travel Drama, Superh...",2019-07-18,Media,Video Disc,Fiction,Adult
9999997,Unfriended Dark web,"Stalking victims Drama, Murder Drama, Internet...",2019-07-18,Media,Video Disc,Fiction,Adult
9999998,Oh God Book II,"Presence of God Drama, Comedy films, Feature f...",2019-07-18,Media,Video Disc,Fiction,Adult


In [50]:
df_merged.age_group.value_counts()

Adult       71587854
Juvenile    31006173
Teen         3909816
Name: age_group, dtype: int64

In [46]:
%%time

cols = ['date', 'format_group', 'format_subgroup', 'category_group', 'age_group']

test[cols].groupby(cols).count().head(50).T

CPU times: user 3min 11s, sys: 1min 21s, total: 4min 32s
Wall time: 4min 33s


date,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02,2008-01-02
format_group,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic,Electronic
format_subgroup,Art,Art,Art,Art,Art,Art,Art,Art,Art,Art,...,Audio Disc,Audio Disc,Audio Disc,Audio Disc,Audio Disc,Audio Disc,Audio Disc,Audio Disc,Audio Disc,Audio Disc
category_group,Fiction,Fiction,Fiction,Interlibrary Loan,Interlibrary Loan,Interlibrary Loan,Language,Language,Language,Miscellaneous,...,Miscellaneous,Miscellaneous,Nonfiction,Nonfiction,Nonfiction,On Order,On Order,On Order,Periodical,Periodical
age_group,Adult,Juvenile,Teen,Adult,Juvenile,Teen,Adult,Juvenile,Teen,Adult,...,Juvenile,Teen,Adult,Juvenile,Teen,Adult,Juvenile,Teen,Adult,Juvenile


In [36]:
%%time

test.groupby('date').count()

Unnamed: 0_level_0,title,subjects,format_group,format_subgroup,category_group,age_group
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2008-01-02,514,517,517,517,517,517
2008-01-03,373,378,378,378,378,378
2008-01-04,398,401,404,404,404,404
2008-01-05,392,395,396,396,396,396
2008-01-06,170,172,173,173,173,173
...,...,...,...,...,...,...
2019-07-14,1247,1247,1247,1247,1247,1247
2019-07-15,2364,2364,2364,2364,2364,2364
2019-07-16,2077,2077,2077,2077,2077,2077
2019-07-17,1947,1947,1947,1947,1947,1947


In [35]:
# returns no rows: `date` holds `datetime.date` objects, which never compare equal to a string
test[test.date == '2008-01-02']

Unnamed: 0,title,subjects,date,format_group,format_subgroup,category_group,age_group


In [33]:
%%time

cols = ['date', 'format_group', 'format_subgroup', 'category_group', 'age_group']

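# `DataFrame.value_counts` was added in pandas 1.1, hence the error below on older versions;
# `test.groupby(cols).size()` gives similar counts
test[cols].value_counts(subset=cols)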

AttributeError: 'DataFrame' object has no attribute 'value_counts'


## GRAVEYARD

[[go back to the top](#Library-Usage-in-Seattle,-2005-2020)]

In [None]:
# %%time

# # list of columns to keep
# keep_cols = ['title', 'subjects', 'date', 'format_group', 'format_subgroup',
#              'category_group', 'age_group']

# # drop columns
# df_merged = df_merged[keep_cols]

In [None]:
dd[dd.code_type == 'ItemType']

In [None]:
dd[dd.code == 'nadvd']

In [None]:
dd[dd.code == 'acdvd']

In [None]:
dd.code_type.unique()

In [None]:
dd_item = dd[dd.code_type == 'ItemType'][['code', 'description', 'format_group', 'format_subgroup', 'category_group', 
             'category_subgroup', 'age_group']]

dd_item.head()

In [None]:
dd_item2 = dd[dd.code_type == 'ItemCollection'][['code', 'description', 'format_group', 'format_subgroup', 'category_group', 
             'category_subgroup', 'age_group']]

dd_item2.head()

In [None]:
sorted(df.item_type.unique())

In [None]:
dd_loc = dd[dd.code_type == 'Location'][['code', 'description']]

dd_loc.head()

In [None]:
test = df.merge(dd_item, left_on='item_type', right_on='code')
# test = test.merge(dd_loc, left_on='collection', right_on='code')

test.head()

In [None]:
test.isna().sum()

In [None]:
test.format_group.value_counts()

In [None]:
test.collection.unique()

In [None]:
test.shape

In [None]:
test2 = df.merge(dd_item2, left_on='Collection', right_on='code')

test2.head()

In [None]:
test.groupby('format_group').category_group.value_counts()

In [None]:
test2.groupby('format_group').category_group.value_counts()

In [None]:
dd[dd.code == 'nybot']

In [None]:
sorted(df.collection.unique())

In [None]:
dd_loc.code.unique()

In [None]:
[cod for cod in df.collection.unique() if cod in dd_loc.code.unique()]

#### NOTE: Using the code below on 10 million rows is almost 39% faster and results in a saved file that is *nearly the same size as the original CSV file* (file is 23.22GB!).

In [None]:
# %%time

# df_merged.to_hdf('data/seattle_lib_temp_ten_mil__alt.hdf', 'mydata', format='table', mode='w')

In [None]:
df.title.value_counts().head(10)

In [None]:
df[df.title=='reader'].head()