In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# from langdetect import detect
# from langdetect.lang_detect_exception import LangDetectException
sns.set_style("whitegrid")

First we load the data and convert the 'dates' column to a datetime dtype.

In [None]:
df_raw=pd.read_csv('printbooksafter2017.csv')
df_raw['dates'] = pd.to_datetime(df_raw['dates'])

df_raw.head()

We want to make a new data frame with one row for each title in the library, and then add a column for each month holding the number of checkouts of that title in that month. First we build a new data frame that carries over only the Title, Creator, and Subjects fields. You might also want the publication year, but we'll later write a function to give us the publication month, so I didn't include it.

In [3]:
df= df_raw.filter([ 'Title', 'Creator', 'Subjects'])

In [None]:
df.head()

We check that these two data frames have the same length (5916759 rows).

In [None]:
display(len(df_raw))
        
display(len(df))

Now, this data frame has many repeated rows: a book appears once for each month in which it was checked out. We remove the duplicates.

In [None]:
df=df.drop_duplicates()

len(df)

So, after we remove the duplicates the size has dropped by about a factor of 10.

We'll see that some code we would like to use to add these month columns doesn't run. I suspect that books whose titles either start or end with " or ' are messing things up. I'll make two new sets of data frames to try to deal with this issue. The first will take a sample of our original data frame and strip out all quotation marks, and the second set will just be derived from the first few entries of our original data frame.

For the data set where we strip the quotation marks we take only a sample, because the code that adds the checkouts is too slow on the whole data set. It was even too slow to run on a sample of 5,000 rows (although this was on a very old machine).

In [7]:
def removequotes(string):
    string=string.replace('"', "")
    string=string.replace("'","")
    return string

In [None]:
df_rawwithoutquotes=df_raw.sample(n=1000)

df_rawwithoutquotes['Title']=df_rawwithoutquotes['Title'].apply(removequotes)

dfwithoutquotes=df_rawwithoutquotes.filter([ 'Title', 'Creator', 'Subjects'])
dfwithoutquotes=dfwithoutquotes.drop_duplicates()


display(len(df_rawwithoutquotes))
display(len(dfwithoutquotes))

It is interesting to see that the sample contains very few repeated rows. I noticed the same behavior with a sample as large as 50,000 rows, where only 4,000 were duplicates. I'm not sure how to explain this, and I don't know whether it is surprising or not.
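
One possible explanation, offered as a sketch rather than a checked result: a random sample of n rows only repeats a title when two sampled rows happen to land on the same book, and with hundreds of thousands of distinct titles that is rare for small n (it is the birthday problem, with roughly n²/(2K) repeats expected when n is much smaller than the number of titles K). The simulation below uses a synthetic stand-in for df_raw; the names `toy_raw` and `repeated_rows_in_sample`, the 600,000 titles, and the 10 rows per title are all assumptions chosen to resemble the sizes above, not values taken from the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_raw (assumption): 600,000 hypothetical titles,
# each appearing in 10 monthly rows, for 6,000,000 rows in total.
n_titles = 600_000
rows_per_title = 10
toy_raw = pd.DataFrame({"Title": np.repeat(np.arange(n_titles), rows_per_title)})

def repeated_rows_in_sample(n):
    """Number of rows that drop_duplicates removes from a random sample of n rows."""
    sample = toy_raw.sample(n=n, random_state=0)
    return n - len(sample.drop_duplicates())

small = repeated_rows_in_sample(1_000)
large = repeated_rows_in_sample(50_000)
print(f"repeats in a sample of 1,000: {small}")
print(f"repeats in a sample of 50,000: {large}")
```

If the real data behaves like this toy version, a handful of repeats in a sample of 1,000 and a few thousand in a sample of 50,000 would be expected rather than surprising.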

In [10]:
# .copy() makes df_test an independent data frame, so adding columns to it later is safe
df_test=df.head(5).copy()
df_rawtest=df_raw.head(5)

Now we want to add the checkout information for each month. We'll make a sequence of the months we are interested in, then loop over it, adding one column per month that holds the number of checkouts in that month.

In [11]:
import datetime

months_of_interest=pd.date_range(start='2018-01-01', end='2024-08-01',freq='MS')


In [None]:
print(months_of_interest)

Now we write a function that takes a title and a month and returns the number of checkouts in that month. This code runs but emits a warning that includes:
```
The behavior of 'isin' with dtype=datetime64[ns] and castable values (e.g. strings) is deprecated
```
I think that to fix this you need to give the function the month as a proper datetime value rather than as a string, but I haven't worked out how to do this here.
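
One way this might be done, as a minimal sketch: convert the month with `pd.Timestamp` before comparing, and compare with a boolean mask instead of a query string, so that the 'dates' column is never compared against a string. The data frame `toy_raw` and the function name `checkouts_in_month` are made up for illustration; only the column names come from the data above.

```python
import pandas as pd

# Toy stand-in for df_raw (hypothetical values)
toy_raw = pd.DataFrame({
    "Title": ["Oddity / Sarah Cannon.", "Oddity / Sarah Cannon.", "Another book"],
    "dates": pd.to_datetime(["2022-08-01", "2022-09-01", "2022-08-01"]),
    "Checkouts": [3, 1, 7],
})

def checkouts_in_month(raw, title, month):
    """Checkouts of `title` in `month`, or 0 if there is no matching row."""
    month = pd.Timestamp(month)  # accepts a string or an existing Timestamp
    mask = (raw["dates"] == month) & (raw["Title"] == title)
    matches = raw.loc[mask, "Checkouts"]
    # Like the function below, this takes the first matching row only
    return 0 if matches.empty else int(matches.iloc[0])

print(checkouts_in_month(toy_raw, "Oddity / Sarah Cannon.", "2022-08-01"))
```

Because the title is compared as an ordinary Python value, this version also never builds a query string out of the title.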

In [None]:
def number_of_checkouts_in_month(title,month):
    #this function is really slow; selecting rows with a boolean mask might be faster than query.
    #this emits a warning that the behavior of isin with dates and castable values is deprecated
    if (df_raw.query(f"dates =='{month}' and Title=='{title}' ").shape[0])==0:
        #df_raw.query(f"dates =='{month}' and Title=='{title}' ")['Checkouts'] returns [] if the title wasn't checked out in that month
        return 0
    else:
        return (df_raw.query(f"dates =='{month}' and Title=='{title}' ")['Checkouts'].iloc[0])

number_of_checkouts_in_month(title='Oddity / Sarah Cannon.', month='2022-08-01')

Now we can loop over months_of_interest and use the apply method to add our new columns. Unfortunately, this doesn't work right now! We get the error:
```
('unterminated string literal (detected at line 1)', (1, 82))
```

In [None]:
for month in months_of_interest:
    # apply over df (one row per title) rather than df_raw, so each title is looked up once per month
    df[f'{month}']= df.apply(lambda x: number_of_checkouts_in_month(title=x['Title'], month=month), axis=1)
#This fails to run; it throws the error ('unterminated string literal (detected at line 1)', (1, 82)).
# I suspect this is because some entries in the Title field begin or end with a quotation mark.
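
If quotation marks in titles are indeed the cause, one way around it is to stop pasting the title into the query string. `DataFrame.query` can refer to local Python variables with an `@` prefix, so the title never has to be quoted by hand. The sketch below uses a made-up data frame `toy_raw` and a hypothetical function name `checkouts_in_month_at`; the titles are invented to include both kinds of quotation mark.

```python
import pandas as pd

# Toy stand-in for df_raw with titles that contain quotation marks (hypothetical values)
toy_raw = pd.DataFrame({
    "Title": ['"Quoted" title', "It's a title", "Plain title"],
    "dates": pd.to_datetime(["2022-08-01", "2022-08-01", "2022-08-01"]),
    "Checkouts": [2, 5, 1],
})

def checkouts_in_month_at(raw, title, month):
    """Checkouts of `title` in `month`, or 0 if there is no matching row."""
    month = pd.Timestamp(month)
    # @month and @title refer to the local variables, so no string formatting is needed
    matches = raw.query("dates == @month and Title == @title")
    return 0 if matches.empty else int(matches["Checkouts"].iloc[0])

print(checkouts_in_month_at(toy_raw, "It's a title", "2022-08-01"))
print(checkouts_in_month_at(toy_raw, '"Quoted" title', "2022-08-01"))
```

With this approach the titles can be left untouched, so stripping the quotation marks would only be needed for other reasons.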



We see that this procedure does work on our toy data frames.

In [None]:
def number_of_checkouts_in_monthtest(title,month):
    #this throws a warning, that the behavior of isin with dates and castable values is deprecated
    if (df_rawtest.query(f"dates =='{month}' and Title=='{title}' ").shape[0])==0:
        #df_rawtest.query(f"dates =='{month}' and Title=='{title}' ")['Checkouts'] returns [] if the title wasn't checked out in that month
        return 0
    else:
        return (df_rawtest.query(f"dates =='{month}' and Title=='{title}' ")['Checkouts'].iloc[0])

number_of_checkouts_in_monthtest(title='Oddity / Sarah Cannon.', month='2022-08-01')

In [None]:
for month in months_of_interest:
    df_test[f'{month}']= df_test.apply(lambda x: number_of_checkouts_in_monthtest(title=x['Title'], month=month), axis=1)

We see that we have the columns we want.

In [None]:
df_test.head()

These books were all checked out in August of 2022, and we can see that the column for that month has no zeros.

In [None]:
df_test['2022-08-01 00:00:00']
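
The column is called '2022-08-01 00:00:00' because formatting a Timestamp with an f-string gives its full date and time. If shorter labels are preferred, `strftime` can produce them. This is only a sketch of the labelling step, using a short hypothetical range called `months` instead of months_of_interest.

```python
import pandas as pd

# A short stand-in for months_of_interest (assumption: same frequency, fewer months)
months = pd.date_range(start="2018-01-01", end="2018-03-01", freq="MS")

long_labels = [f"{m}" for m in months]                 # what the loops above use as column names
short_labels = [m.strftime("%Y-%m") for m in months]   # year and month only

print(long_labels)
print(short_labels)
```

Using the short labels would mean selecting the August 2022 column as '2022-08' instead.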

Now, let's see if this works on the sampled data frame with the quotation marks stripped out.

In [None]:
def number_of_checkouts_in_monthnoquotes(title,month):
    #this throws a warning, that the behavior of isin with dates and castable values is deprecated
    if (df_rawwithoutquotes.query(f"dates =='{month}' and Title=='{title}' ").shape[0])==0:
        #df_rawwithoutquotes.query(f"dates =='{month}' and Title=='{title}' ")['Checkouts'] returns [] if the title wasn't checked out in that month
        return 0
    else:
        return (df_rawwithoutquotes.query(f"dates =='{month}' and Title=='{title}' ")['Checkouts'].iloc[0])

number_of_checkouts_in_monthnoquotes(title='Oddity / Sarah Cannon.', month='2022-08-01')
#note: this number of checkouts is zero simply since the sarah cannon book isn't in this corpus.

I ran the following code on the big data set (5 million rows) with the columns stripped out and it ran all night without finishing. It even ran for about an hour on a sample of size 50,000 before I gave up.

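I think that the number of checkouts function might run faster if it selected rows with a boolean mask rather than with query (DataFrame.filter itself only selects by row or column labels, not by values).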
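
A different approach that may avoid the slowness altogether is to build all the month columns in one pass with `pivot_table`, instead of looking up every title and month separately. This is a sketch on a made-up data frame, not something I have run on the full data set. The names `toy_raw`, `months`, `wide`, `titles`, and `result` are hypothetical. Note one difference in behavior: this version sums the checkouts when a title has several rows in the same month, whereas the functions above take the first matching row.

```python
import pandas as pd

# Toy stand-in for df_raw (hypothetical values)
toy_raw = pd.DataFrame({
    "Title": ["A", "A", "B", "C"],
    "Creator": ["a", "a", "b", "c"],
    "Subjects": ["s1", "s1", "s2", "s3"],
    "dates": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-01-01", "2018-03-01"]),
    "Checkouts": [4, 2, 1, 6],
})
months = pd.date_range(start="2018-01-01", end="2018-03-01", freq="MS")

# One row per title, one column per month; months with no checkouts become 0
wide = toy_raw.pivot_table(
    index="Title", columns="dates", values="Checkouts", aggfunc="sum", fill_value=0
)
# Make sure every month of interest has a column, even if nothing was checked out
wide = wide.reindex(columns=months, fill_value=0)
wide.columns = [m.strftime("%Y-%m") for m in wide.columns]

# Attach the monthly counts to the de-duplicated title information
titles = toy_raw[["Title", "Creator", "Subjects"]].drop_duplicates()
result = titles.merge(wide, left_on="Title", right_index=True, how="left")
print(result)
```

Like the functions above, this matches rows on Title alone, so two different books with the same title would be combined.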

In [None]:
for month in months_of_interest:
    dfwithoutquotes[f'{month}']= dfwithoutquotes.apply(lambda x: number_of_checkouts_in_monthnoquotes(title=x['Title'], month=month), axis=1)

In [None]:
dfwithoutquotes.head()

In [26]:
dfwithoutquotes.to_csv('datafilewithmonthsadded.csv')