##  06_Split_CheckoutData_Feature_Target

Author: Daniel Hui

License: MIT

This notebook splits the checkout dataset into a feature set (all 2017 checkouts) and a target set (titles checked out in the first half of 2018).

In [1]:
import pandas as pd

### Load Checkout Data

In [2]:
#Load 2018 Checkouts
checkouts_18df = pd.read_csv('/Users/dhui/Downloads/01_Source_Data/Checkouts_By_Title_Data_Lens_2018.csv',index_col=0)

In [3]:
#Load 2017 Checkouts
checkouts_17df = pd.read_csv('/Users/dhui/Downloads/01_Source_Data/Checkouts_By_Title_Data_Lens_2017.csv',index_col=0)

In [4]:
checkouts_df = pd.concat([checkouts_17df, checkouts_18df])

In [5]:
checkouts_df.info()   #There are 11,125,504 checkout records from Jan 1 2017 to date

<class 'pandas.core.frame.DataFrame'>
Index: 11125504 entries, 201701020813000010063298235 to 201810211805000010087324512
Data columns (total 9 columns):
CheckoutYear        int64
BibNumber           int64
ItemBarcode         int64
ItemType            object
Collection          object
CallNumber          object
ItemTitle           object
Subjects            object
CheckoutDateTime    object
dtypes: int64(3), object(6)
memory usage: 848.8+ MB
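
The combined frame takes roughly 850 MB, yet only four of its columns are used later in this notebook. As a minimal sketch (not what the cells above do), `usecols` can limit what `read_csv` loads. The inline CSV and its row IDs (`r1`, `r2`) are made-up stand-ins for the real files; the column names follow the `head()` output further down.

```python
import io
import pandas as pd

# Toy CSV standing in for one of the yearly checkout files (rows are illustrative).
toy_csv = io.StringIO(
    "ID,CheckoutYear,BibNumber,ItemBarcode,ItemType,Collection,CheckoutDateTime\n"
    "r1,2017,2543647,10063298235,accd,nacd,01/02/2017 08:13:00 AM\n"
    "r2,2017,3172300,10087522552,acbk,namys,01/02/2017 08:13:00 AM\n"
)

# Read only the index column plus the four columns needed downstream.
keep = ["ID", "CheckoutYear", "BibNumber", "Collection", "CheckoutDateTime"]
slim_df = pd.read_csv(toy_csv, usecols=keep, index_col="ID")
print(slim_df.columns.tolist())
```

Whether this is worthwhile depends on how much memory is available; the filtering steps below work the same either way.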


In [6]:
checkouts_df.describe()

Unnamed: 0,CheckoutYear,BibNumber,ItemBarcode
count,11125500.0,11125500.0,11125500.0
mean,2017.441,2931739.0,10063070000.0
std,0.4965638,503642.6,540933700.0
min,2017.0,32.0,100000500.0
25%,2017.0,2767197.0,10080940000.0
50%,2017.0,3104455.0,10087400000.0
75%,2018.0,3230605.0,10090370000.0
max,2018.0,3418802.0,1000033000000.0


In [7]:
checkouts_df.head(2)

ID,CheckoutYear,BibNumber,ItemBarcode,ItemType,Collection,CallNumber,ItemTitle,Subjects,CheckoutDateTime
201701020813000010063298235,2017,2543647,10063298235,accd,nacd,CD 782.42166 C6606So,Songs from a room,Popular music 1961 1970,01/02/2017 08:13:00 AM
201701020813000010087522552,2017,3172300,10087522552,acbk,namys,MYSTERY COTTERI 2016,I shot the Buddha,Paiboun Siri Doctor Fictitious character Ficti...,01/02/2017 08:13:00 AM


### Limit Checkout Data to Circulating Books

In [8]:
#Load the Collection codes that identify circulating (non-reference) books
book_codes = pd.read_csv("../01_Data/03_Cleaned/ItemCollection_Book_Codes.csv",index_col=0,header=None)
book_codes = book_codes[1]

In [9]:
#Keep only the rows whose Collection code is in the book code list
circulating_books_df = checkouts_df[checkouts_df["Collection"].isin(book_codes)]

In [10]:
circulating_books_df.shape   #7,106,862 checkouts 01/01/17 to Date for circulating books

(7106862, 9)
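
The filter above relies on `Series.isin`, which returns a boolean mask that is then used to select rows. A small sketch on toy data shows the mechanics; the three rows and the two-code allow-list are illustrative stand-ins, not the contents of `ItemCollection_Book_Codes.csv`.

```python
import pandas as pd

# Toy stand-ins for the checkout frame and the book code list.
toy_checkouts = pd.DataFrame({
    "BibNumber": [3172300, 2543647, 2393405],
    "Collection": ["namys", "nacd", "camys"],
})
toy_book_codes = pd.Series(["namys", "camys"])

# True where the row's Collection code appears in the allow-list.
mask = toy_checkouts["Collection"].isin(toy_book_codes)

# Boolean indexing keeps only the rows where the mask is True.
toy_books = toy_checkouts[mask]
print(toy_books)
```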

### Keep Necessary Columns

In [11]:
circulating_books_df = circulating_books_df[["CheckoutYear","BibNumber","Collection","CheckoutDateTime"]]
circulating_books_df = circulating_books_df.reset_index(drop = True)

In [12]:
circulating_books_df.head(2)

Unnamed: 0,CheckoutYear,BibNumber,Collection,CheckoutDateTime
0,2017,3172300,namys,01/02/2017 08:13:00 AM
1,2017,2393405,camys,01/02/2017 08:24:00 AM


### Clean DateTime to YearMonth

In [13]:
circulating_books_df["month"] = circulating_books_df["CheckoutDateTime"].apply(lambda x: x.split("/")[0])

In [14]:
circulating_books_df["YearMonth"] = circulating_books_df["CheckoutYear"].astype(str) + circulating_books_df["month"]

In [15]:
circulating_books_df.head(2)

Unnamed: 0,CheckoutYear,BibNumber,Collection,CheckoutDateTime,month,YearMonth
0,2017,3172300,namys,01/02/2017 08:13:00 AM,01,201701
1,2017,2393405,camys,01/02/2017 08:24:00 AM,01,201701
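
The string split above works because every timestamp starts with a two-digit month. An alternative sketch, assuming all values follow the `MM/DD/YYYY hh:mm:ss AM/PM` pattern seen in the sample rows, parses the column with `pd.to_datetime` and formats it directly. The toy frame below is illustrative.

```python
import pandas as pd

# Toy timestamps in the same layout as CheckoutDateTime.
toy = pd.DataFrame({
    "CheckoutDateTime": ["01/02/2017 08:13:00 AM", "10/21/2018 06:05:00 PM"],
})

# Parse with an explicit format so malformed values raise instead of being guessed.
parsed = pd.to_datetime(toy["CheckoutDateTime"], format="%m/%d/%Y %I:%M:%S %p")

# Format as a zero-padded YYYYMM string, matching the YearMonth column above.
toy["YearMonth"] = parsed.dt.strftime("%Y%m")
print(toy)
```

Parsing is slower than splitting strings on millions of rows, but it validates the data and does not depend on the separate `CheckoutYear` column.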


In [16]:
circulating_books_df = circulating_books_df[["BibNumber","YearMonth","CheckoutDateTime"]]

In [17]:
circulating_books_df.head(2)

Unnamed: 0,BibNumber,YearMonth,CheckoutDateTime
0,3172300,201701,01/02/2017 08:13:00 AM
1,2393405,201701,01/02/2017 08:24:00 AM


In [18]:
circulating_books_df.shape

(7106862, 3)

### Export Target Set

#### 1st Half 2018 Target Set

Export the unique titles (BibNumbers) that were checked out in the first half of 2018, January through June.

In [45]:
date_list = ["201801", "201802", "201803", "201804", "201805", "201806"]
target_2018half_df = circulating_books_df[circulating_books_df["YearMonth"].isin(date_list)]
target_2018half_df = target_2018half_df["BibNumber"]
target_2018half_df = target_2018half_df.drop_duplicates()
target_2018half_df = target_2018half_df.reset_index(drop=True)

In [46]:
target_2018half_df.shape #215,873 unique titles checked out Q1-Q2 2018

(215873,)

In [59]:
target_2018half_df.to_csv("../01_Data/05_Target/Checkout_Set/18_Half.csv")

### Export Feature Set

Export every 2017 checkout record for circulating books. Duplicate records are kept on purpose so that we can later count how many times each book was checked out over different time increments.

In [47]:
date_list = ["201701", "201702", "201703", "201704", "201705", "201706", 
             "201707", "201708", "201709", "201710", "201711", "201712"]
feature_2017_df = circulating_books_df[circulating_books_df["YearMonth"].isin(date_list)]

In [50]:
feature_2017_df.head()

Unnamed: 0,BibNumber,YearMonth,CheckoutDateTime
0,3172300,201701,01/02/2017 08:13:00 AM
1,2393405,201701,01/02/2017 08:24:00 AM
2,2743540,201701,01/02/2017 08:33:00 AM
3,3216678,201701,01/02/2017 08:51:00 AM
4,3221781,201701,01/02/2017 08:51:00 AM


In [51]:
feature_2017_df.shape 

(3871872, 3)

In [52]:
feature_2017_df.to_csv("../01_Data/06_Features/2017_Checkouts.csv")
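
Because duplicate records are kept in the feature set, checkout counts per time increment can be derived later with a group-by. The following is a minimal sketch of one way to do that on a monthly increment, not necessarily how the downstream notebooks do it. The toy frame and the name `monthly_counts` are illustrative.

```python
import pandas as pd

# Toy feature rows: one row per checkout, duplicates intentionally kept.
toy_features = pd.DataFrame({
    "BibNumber": [3172300, 3172300, 2393405, 3172300],
    "YearMonth": ["201701", "201701", "201701", "201702"],
})

# Count checkouts per title and month, then pivot months into columns.
# Months in which a title has no checkouts are filled with 0.
monthly_counts = (
    toy_features.groupby(["BibNumber", "YearMonth"])
    .size()
    .unstack(fill_value=0)
)
print(monthly_counts)
```

Coarser increments such as quarters or half-years could be obtained by summing the relevant month columns.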