In [1]:
import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

The four functions below help clean the Title and Creator columns of the dataframe.
    
    check_ascii: takes a value, converts it to a string, and returns a boolean indicating whether the string contains only ASCII characters.

    clean_title: takes a string and removes everything from the first colon onward, then everything from the first forward slash onward. It then removes any text inside parentheses or square brackets, including the parentheses/brackets themselves. Finally, it strips leading and trailing whitespace and makes the string lowercase.

    clean_title_w_colon: an alternative title-cleaning function that does the same processing as clean_title but keeps colons and the text after them.

    clean_creator: takes a string, strips leading and trailing whitespace, makes it lowercase, and removes periods. It then splits the string on hyphens, commas, whitespace, and single quotation marks ('), sorts the resulting list, drops any tokens that are not purely alphabetical, and returns the remaining tokens as a single string separated by spaces.

In [2]:
#Function to check whether a title contains only ASCII characters (used to filter out non-ASCII titles)
def check_ascii(text):
    return str(text).isascii()
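
As a quick check of how check_ascii behaves, the sketch below applies it to a few made-up values (they are illustrative and not taken from the checkout data). Because the input is passed through str(), a missing value such as NaN becomes the string "nan" and counts as ASCII, so this filter does not remove missing titles on its own.

```python
# Illustrative values only; these are not taken from the checkout data.
def check_ascii(text):
    return str(text).isascii()

samples = ["The Collector", "Les Misérables", "東京", float("nan")]
results = [check_ascii(s) for s in samples]
print(results)  # [True, False, False, True]
```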

In [68]:
#Function to clean the titles by removing everything after the first ':' or '/' (e.g. the author) and any text in parentheses or brackets
def clean_title(title):
    # Remove everything after the first colon
    title = re.split(r':', title)[0]
    # Remove everything after the first forward slash
    title = re.split(r'/', title)[0]
    # Remove text inside parentheses ()
    title = re.sub(r'\(.*?\)', '', title)
    # Remove text inside square brackets []
    title = re.sub(r'\[.*?\]', '', title)
    # Strip any extra spaces and make lowercase
    return title.strip().lower()

In [4]:
#Function to clean the titles by removing the author after '/' and any text in parentheses or brackets, keeping colons
def clean_title_w_colon(title):
    # Unlike clean_title, keep the colon and everything after it
    # Remove everything after the first forward slash
    title = re.split(r'/', title)[0]
    # Remove text inside parentheses ()
    title = re.sub(r'\(.*?\)', '', title)
    # Remove text inside square brackets []
    title = re.sub(r'\[.*?\]', '', title)
    # Strip any extra spaces and make lowercase
    return title.strip().lower()
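
Since the two title cleaners differ only in the colon step, one possible way to see the difference side by side is a single helper with a flag. The name clean_title_sketch and the keep_colon parameter are hypothetical and not part of the notebook, and the "/ Author Name" and "[sound recording]" suffixes are made up to exercise the slash and bracket steps.

```python
import re

def clean_title_sketch(title, keep_colon=False):
    """Hypothetical helper combining clean_title and clean_title_w_colon."""
    if not keep_colon:
        # Keep only the text before the first colon
        title = title.split(":", 1)[0]
    # Keep only the text before the first forward slash
    title = title.split("/", 1)[0]
    # Remove text inside parentheses and square brackets
    title = re.sub(r"\(.*?\)", "", title)
    title = re.sub(r"\[.*?\]", "", title)
    return title.strip().lower()

# Illustrative inputs; the suffixes after the titles are invented.
examples = [
    "The Collector: The Novel / Author Name",
    "The Baby-Sitters Club: The Truth About Stacy [sound recording]",
]
for t in examples:
    print(clean_title_sketch(t), "|", clean_title_sketch(t, keep_colon=True))
```

With the colon removed, both titles reduce to their part before the colon ("the collector", "the baby-sitters club"); with keep_colon=True the subtitle survives, so different books in one series stay distinct.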

In [65]:
#Function to clean the Creator column; it makes all words lowercase, removes periods, and splits on commas,
#white space, hyphens, and single quotes. It then sorts the tokens to make sure "First Last" and
#"Last, First" have the same ordering, drops tokens that are not purely alphabetical, and outputs
#the rest as a joined string. (H. G. French => french g h)
def clean_creator(text):
    text = text.strip().lower() #Strip any leading/trailing white space and make lowercase
    text = text.replace('.', '') #Remove any periods
    tokens = re.split(r"[-,'\s]+", text) #Split on hyphens, commas, single quotes, and white space
    tokens = sorted(tokens) #Sort the results
    #Keep only tokens made up entirely of alphabetical characters
    filtered_list = [tok for tok in tokens if tok.isalpha()]
    return " ".join(filtered_list)
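
The sketch below restates clean_creator in compact form and runs it on a few inputs to show the effect of sorting. Apart from "H. G. French", which appears in the comment above, the names are made up for illustration. Note one side effect of sorting: two different names built from the same tokens map to the same string.

```python
import re

def clean_creator(text):
    # Compact restatement of the function above, intended to behave the same way
    text = text.strip().lower().replace(".", "")
    tokens = re.split(r"[-,'\s]+", text)
    return " ".join(tok for tok in sorted(tokens) if tok.isalpha())

print(clean_creator("H. G. French"))        # french g h
print(clean_creator("French, H. G."))       # french g h
print(clean_creator("Smith, John, 1950-"))  # john smith (the year token is dropped)
print(clean_creator("Henry James") == clean_creator("James Henry"))  # True
```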

We first read in the full dataset, drop the ISBN column, and then drop any rows with NA values via dropna(). We drop the ISBN column first because most of its entries are NA, so calling dropna() with it present would discard most of the data. We then remove titles containing non-ASCII characters, apply the functions above to clean the Title and Creator columns, and clean the PublicationYear column using a regular expression. We write the result to "Checkouts_NoISBN_cleaned.csv". 

In [5]:
#Read in full dataset
df = pd.read_csv('data/Checkouts_by_Title.csv')

In [6]:
#Drop ISBN column since most of the entries are NA
df_noISBN = df.drop(columns=['ISBN'])
df_noISBN.shape

(45794625, 11)

In [7]:
#Drop the NA values now that the ISBN column is gone
df_noISBN2 = df_noISBN.dropna()
df_noISBN2.shape

(31391235, 11)
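
To illustrate why the ISBN column is dropped before calling dropna(), here is a small sketch on a toy DataFrame (the values are invented). By default dropna() removes rows that contain any NA value and leaves the set of columns unchanged, so a mostly empty column would take most rows with it.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Title": ["A", "B", None, "D"],
    "Creator": ["x", None, "z", "w"],
    "ISBN": [np.nan, np.nan, np.nan, "123"],
})

# Rows that survive dropna() with and without the sparse ISBN column
rows_if_isbn_kept = len(toy.dropna())
rows_if_isbn_dropped = len(toy.drop(columns=["ISBN"]).dropna())
print(rows_if_isbn_kept, rows_if_isbn_dropped)  # 1 2
```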

In [8]:
#Remove the non-ascii titles 
df_filtered = df_noISBN2[df_noISBN2['Title'].apply(check_ascii)].copy() #.copy() avoids SettingWithCopyWarning on later column assignments

In [66]:
#Read in the previously saved cleaned dataset
df_filtered = pd.read_csv('data/Checkouts_NoISBN_cleaned.csv')

In [70]:
#Create new columns with the cleaned title and creator; see the clean_title and clean_creator functions for details on cleaning
df_filtered['CleanedTitle'] = df_filtered['Title'].apply(clean_title)
df_filtered['CleanedCreator'] = df_filtered['Creator'].apply(clean_creator)

In [12]:
#Clean the publication year column
#Note: some values become NA in the extraction so we dropna and then cast as int
df_filtered['PublicationYear'] = df_filtered['PublicationYear'].str.extract(r'(\d+)')
df_filtered= df_filtered.dropna()
df_filtered['PublicationYear'] = df_filtered['PublicationYear'].astype(int)
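
A minimal sketch of the extraction step on a toy Series follows. The raw formats shown are assumptions about what PublicationYear strings might look like, not values confirmed from the dataset. The pattern \d+ captures the first run of digits, entries with no digits become NA, and those are dropped before casting to int.

```python
import pandas as pd

# Hypothetical raw formats, for illustration only
raw = pd.Series(["[2005]", "c2010.", "2011, 2012", "n.d.", None])

# expand=False returns a Series rather than a one-column DataFrame
year = raw.str.extract(r"(\d+)", expand=False)
year = year.dropna().astype(int)
print(year.tolist())  # [2005, 2010, 2011]
```

Since \d+ accepts any run of digits, a stricter pattern such as \d{4} could be considered if partial years turn out to be present in the data.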


In [24]:
#Add a column CheckoutDate with DateTime type of the checkout month/year. 
# Note that here we set the day to be 1 for all checkouts. 
df_filtered['month'] = df_filtered.CheckoutMonth
df_filtered['year'] = df_filtered.CheckoutYear
df_filtered['CheckoutDate'] = pd.to_datetime(df_filtered[['month', 'year']].assign(DAY=1))
df_filtered = df_filtered.drop(columns=['month', 'year'])
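
pd.to_datetime can assemble dates from a DataFrame whose columns are named year, month, and day, which is what the cell above relies on. The sketch below shows the same idea on toy data (invented values), using rename instead of temporary columns. This is one possible variant, not the notebook's exact code.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({"CheckoutYear": [2005, 2019], "CheckoutMonth": [4, 12]})

# Build year/month/day columns, with the day fixed to 1
parts = toy.rename(columns={"CheckoutYear": "year", "CheckoutMonth": "month"}).assign(day=1)
toy["CheckoutDate"] = pd.to_datetime(parts[["year", "month", "day"]])
dates = toy["CheckoutDate"].dt.strftime("%Y-%m-%d").tolist()
print(dates)  # ['2005-04-01', '2019-12-01']
```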

In [30]:
df_filtered = df_filtered.drop(columns=['CheckoutYear', 'CheckoutMonth', 'CheckoutType'])

In [72]:
df_filtered.shape

(30245393, 11)

In [73]:
df_filtered.to_csv('data/Checkouts_NoISBN_cleaned.csv', index=False)

We only have checkout data starting from 2005, so we filter the data to keep only books with PublicationYear > 2004. The CheckoutMonth and CheckoutYear columns are converted into a single CheckoutDate datetime column with the day set to 1 for each month, and the original CheckoutYear and CheckoutMonth columns are dropped (they can be recovered from CheckoutDate). We save the result to "Checkouts_2005_cleaned.csv".

In [74]:
#Drop books with publication years before 2005 as we only have checkout data from 2005 onward
print(df_filtered.shape)
df_2005 = df_filtered[df_filtered.PublicationYear > 2004]
print(df_2005.shape)

(30245393, 11)
(25179467, 11)


In [78]:
df_2005.to_csv('data/Checkouts_2005_cleaned.csv', index=False)

Note that the clean_title function removes all information after the first colon; this may or may not be the correct thing to do. The text after the colon may be inconsistent between records of the same book (e.g. "The Collector" versus "The Collector: The Novel"), but it may also be what differentiates books in a series (e.g. "The Baby-Sitters Club: The Truth About Stacy"). I think keeping the text after the colon might be the better option, so we also create a CleanedTitle column with clean_title_w_colon and write the result to "Checkouts_2005_cleaned_w_colon.csv" in the data folder. 

In [2]:
df_2005=pd.read_csv('data/Checkouts_2005_cleaned.csv')

In [5]:
#Alternative way to clean titles that leaves colons
df_2005['CleanedTitle'] = df_2005['Title'].apply(clean_title_w_colon)


In [7]:
df_2005.to_csv('data/Checkouts_2005_cleaned_w_colon.csv', index=False)