## Books Data Engineering ETL

Spring 2024, by Oliver Seymour


#### My Data Sources:

Goodreads books and reviews: https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks

Audible books and reviews: https://www.kaggle.com/code/satyanarayanam/cleaning-audible-dataset/input

***add another here

## Installs and imports

In [None]:
# TODO: uncomment these!!
# %pip install pandas

In [None]:
import pandas as pd

# Extract

## Load data

In [None]:
audible_df = pd.read_csv('Data/audible.csv')#, encoding='ISO-8859-1')
display(audible_df.head())

# TODO: fix issues with goodreads import!!
# used the chardet library to find with ~70% confidence that this is the right encoding for this file
# goodreads_df = pd.read_csv(sep=';', encoding='ISO-8859-1')
# display(goodreads_df.head())
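
As a cross-check on the chardet guess mentioned above, here is a minimal sketch that tries a few candidate encodings in order and reports the first one that decodes without error. The helper name `guess_encoding` and the candidate list are assumptions for illustration, not part of the original notebook. Note that ISO-8859-1 can decode any byte sequence, so it only works as a last-resort fallback and a "successful" decode does not prove it is the right encoding.

```python
def guess_encoding(raw: bytes, candidates=("utf-8", "cp1252", "iso-8859-1")) -> str:
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next one
        return name
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample bytes: a Latin-1 encoded row with a ';' separator
sample = "café;4.2\n".encode("iso-8859-1")
print(guess_encoding(sample))
```

In practice the bytes would come from reading the first chunk of the CSV file in binary mode, and the result would be passed to `pd.read_csv` through its `encoding` parameter.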

## Explore data

### Exploring the Goodreads dataset

### Exploring the Audible dataset

In [None]:
with pd.option_context('display.min_rows', 10000):
	# cleaned_audible_df is not defined until the cleaning section, so look at the raw dataframe here
	display(audible_df)

In [None]:
# what is the shape?
audible_df.shape

In [None]:
# how many NAs are there?
audible_df.isna().sum()

In [None]:
# Let's see if the author column always starts with 'Writtenby:'
display(audible_df[~audible_df['author'].str.contains(pat=r'^Writtenby:', regex=True)])
# wow ok, there are 0 rows that don't have that starting, that will be easy to clean

# and if the narrator column always starts with 'Narratedby:'
display(audible_df[~audible_df['narrator'].str.contains(pat=r'^Narratedby:', regex=True)])
# ok nice, once again they all follow that pattern

In [None]:
# Now that we know these columns always start with 'Writtenby:' and 'Narratedby:', how can we split up the first and last names?
# regex: first 'Writtenby:', then one or more of either:
# (a capital letter followed by one or more lowercase letters), i.e. FirstNameOrMaybeMoreName,
# or (one or more of (a capital letter followed by a period), i.e. initials, followed by a capital letter and then one or more lowercase letters),
# i.e. A.B.LastNameMaybeMore


# display(audible_df[~audible_df["author"].str.contains(pat=r"^Writtenby:( (([A-Z][a-z]+)+ | (([A-Z][.])+[A-Z][a-z]+))+[,]* )")])
# display(audible_df[audible_df["author"].str.contains(pat=r"^Writtenby:((([A-Z][.])+[A-Z][a-z]+)+[,]*)")])
display(audible_df[~audible_df["author"].str.contains(pat=r"^Writtenby:([A-Z][a-z]+)+")])
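
One possible way to split the run-together names, sketched with the standard `re` module: put a space at every boundary where a lowercase letter or a period is directly followed by a capital letter. The helper name `split_names` and the sample names are made up for illustration. The sketch assumes that multiple people are separated by commas and that names use ASCII letters only, so a name like `McDonald` would be split wrongly.

```python
import re

# Zero-width boundary: just after a lowercase letter or a period, just before a capital letter
_NAME_BOUNDARY = re.compile(r"(?<=[a-z.])(?=[A-Z])")


def split_names(raw: str, prefix: str = "Writtenby:") -> list[str]:
    """Strip the prefix, split on commas, and insert spaces at the name boundaries."""
    if raw.startswith(prefix):
        raw = raw[len(prefix):]
    return [_NAME_BOUNDARY.sub(" ", part) for part in raw.split(",") if part]


# Made-up sample value in the same shape as the author column
print(split_names("Writtenby:JaneDoe,A.B.Smith"))
```

The same function could be applied to a whole column with `audible_df['author'].map(split_names)`, or to the narrator column by passing `prefix='Narratedby:'`.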

In [None]:
# let's check for inconsistencies in the formatting of the 'time' column
# do they all have 'and'?
display(audible_df[~audible_df['time'].str.contains(pat=r'and', regex=True)].head())
# ok not all do

# do they all follow the <numbers><space><letters> etc. pattern?
display(audible_df[~audible_df['time'].str.contains(pat=r'^\d+\s+[A-Za-z]+')].head())

# let's get the unique values:
audible_df.loc[~audible_df['time'].str.contains(r'^\d+\s+[A-Za-z]+', regex=True), 'time'].unique()
# ok seems like it always follows the <numbers><space><letters> pattern, except if the value is 'Less than 1 minute'

In [None]:
# How often is the stars column 'Not rated yet', or some other string that doesn't contain a digit?
display(audible_df[~audible_df['stars'].str.contains(r'\d')].shape)
# wow that's a lot of books that aren't rated yet!
display(audible_df[~audible_df['stars'].str.contains(r'\d')])

# what unique values are there?
audible_df[~audible_df['stars'].str.contains(r'\d')]['stars'].unique()
# ok seems to just be 'Not rated yet'
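
A possible next step for the stars column, as a rough sketch only: map 'Not rated yet' to a missing value and pull the first number out of everything else. The exact layout of the rated values is not shown above, so the sample string `'4.5 out of 5 stars'` is a hypothetical format, and `parse_stars` is an illustrative helper name.

```python
import re
from typing import Optional

# First integer or decimal number that appears in the string
_FIRST_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_stars(raw: str) -> Optional[float]:
    """Return the first number in `raw` as a float, or None for unrated books."""
    if raw.strip() == "Not rated yet":
        return None
    match = _FIRST_NUMBER.search(raw)
    return float(match.group()) if match else None


print(parse_stars("Not rated yet"))
print(parse_stars("4.5 out of 5 stars"))  # hypothetical format, see the note above
```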

In [None]:
# Does the release date column always follow the pattern <numbers> - <numbers> - <numbers>?
display(audible_df[~audible_df['releasedate'].str.contains(r'^\d+-\d+-\d+', regex=True)])
# Ok nice they all follow that pattern

display(audible_df[~audible_df['releasedate'].str.contains(r'^\d{2}-\d{2}-\d{4}', regex=True)])
# but it appears they don't always follow DD-MM-YYYY; sometimes it is D-M-YYYY, and sometimes the year is just YY

### Exploring the ****third dataset

## Cleaning the data

### Cleaning the Goodreads dataset

### Cleaning the Audible dataset

In [None]:
# make a deep copy
cleaned_audible_df = audible_df.copy()

#### Cleaning the price column:

In [None]:
# let's look for outliers in the price

# first let's convert it to a float
# since there are commas in the numbers, e.g. 1,234.00, we will have to get rid of commas
# display(audible_df['price'].min())

# TODO: more to do here!!!
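
A minimal sketch of the conversion described above, assuming the only obstacle is the thousands separator: strip the commas, then convert to float, and return `None` for anything that still isn't a number. The helper name `parse_price` is made up for illustration, and whether the column actually contains non-numeric values is an assumption that would need checking.

```python
from typing import Optional


def parse_price(raw: str) -> Optional[float]:
    """Convert a price string such as '1,234.00' to a float, or None if it is not numeric."""
    cleaned = raw.replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None  # left as missing so it can be inspected separately


print(parse_price("1,234.00"))
```

Applied with `cleaned_audible_df['price'].map(parse_price)`, the result could then go through `.describe()` to look for outliers.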

In [None]:
# TODO: !!!!! have to deal with weird character encoding issues in the audible dataset as well
# like ç¬¬äºŒåäº”è©±ã‚µãƒ³ãƒ»ãƒŸã‚·ã‚§ãƒ«ã®ã„ã„ã...	Writtenby:æ£®æœ¬å“²éƒŽ	Narratedby:å°é‡Žç”°è‹±ä¸€
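
The garbled text above looks like UTF-8 bytes that were decoded as Windows-1252, though that is a guess from the sample and not something confirmed here. Under that assumption, a sketch of a repair is to reverse the wrong decode and redo it as UTF-8, leaving the string untouched when the round trip fails. The helper name `repair_mojibake` is made up for illustration. Some bytes have no Windows-1252 character at all, so strings that lost those bytes cannot be recovered this way.

```python
def repair_mojibake(text: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Undo a decode done with the `wrong` codec; return the input unchanged if that fails."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Build a garbled string on purpose: UTF-8 bytes read back as cp1252
garbled = "é".encode("utf-8").decode("cp1252")
print(garbled, "->", repair_mojibake(garbled))
```

If the assumption holds, reading the file with `encoding='utf-8'` in the first place would be the simpler fix.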

#### Cleaning the author and narrator columns:

In [None]:
# In my exploration I found that every row in the author and narrator column started with
# 'Writtenby:' and 'Narratedby:', so let's remove that

# replace 'Writtenby:' with an empty string
cleaned_audible_df['author'] = cleaned_audible_df['author'].str.replace(pat=r'^Writtenby:', repl='', regex=True)

# replace 'Narratedby:' with an empty string
cleaned_audible_df['narrator'] = cleaned_audible_df['narrator'].str.replace(pat=r'^Narratedby:', repl='', regex=True)

display(cleaned_audible_df.head())

#### Cleaning the releasedate column (converting to timestamp):

In [None]:
# convert the release date to a datetime object
# **Important note, they seem to be in a DD-MM-YYYY (or sometimes D-M-YY, etc. but day first) format, 
# since there are some dates like 30-10-18, 25-11-14 (so the day must be first), but there is also 1-5-2018
cleaned_audible_df['cleaned_releasedate'] = pd.to_datetime(cleaned_audible_df['releasedate'], dayfirst=True, format='mixed')

# temporarily override display.min_rows (raise the value to see more rows) to check it worked
with pd.option_context("display.min_rows", 10):
	# look at just rows where releasedate had a weird format to especially make sure it worked for those
	display(cleaned_audible_df[~cleaned_audible_df['releasedate'].str.contains(r'\d{2}-\d{2}-\d{4}', regex=True)])
	display(cleaned_audible_df)

# ok seems to have worked!
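
To spot-check a few of the converted values by hand, here is a small sketch using only the standard library. It tries the four-digit-year format first and falls back to the two-digit-year format, always reading the day first. The helper name `parse_day_first` is an assumption for illustration, and it only covers the two layouts noted above.

```python
from datetime import date, datetime

# Four-digit year first, so '1-5-2018' is never read as a two-digit year
_FORMATS = ("%d-%m-%Y", "%d-%m-%y")


def parse_day_first(raw: str) -> date:
    """Parse a day-first date string with either a 4-digit or a 2-digit year."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue  # wrong layout for this format, try the next one
    raise ValueError(f"unrecognised date format: {raw!r}")


for sample in ("1-5-2018", "30-10-18", "25-11-14"):
    print(sample, "->", parse_day_first(sample))
```

Comparing these results with `cleaned_releasedate` for the same rows would show whether `pd.to_datetime` interpreted the short formats the same way.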

In [None]:
# make sure there are no weird values in cleaned_releasedate
print("min release date values:")
display(cleaned_audible_df[cleaned_audible_df['cleaned_releasedate'] == cleaned_audible_df['cleaned_releasedate'].min()])
# ok the min values look reasonable

# check books that were released after today's date
print("release date values that are after today's date:")
display(cleaned_audible_df[cleaned_audible_df['cleaned_releasedate'] > pd.to_datetime('today').normalize()])
# I don't think this is a parsing mistake, since one of the original releasedate values is '9-8-2024'
# Still, these rows have very weird character encoding issues, and I don't think it makes sense to keep books that aren't released yet,
# so I'm going to drop these

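# take an explicit copy so the later column assignments don't trigger SettingWithCopyWarning
cleaned_audible_df = cleaned_audible_df[cleaned_audible_df['cleaned_releasedate'] <= pd.to_datetime('today').normalize()].copy()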
print("new max release date value:")
display(cleaned_audible_df['cleaned_releasedate'].max())
# ok nice, now the max value was released about 2 weeks ago, that seems reasonable

# now let's drop the releasedate column
cleaned_audible_df.drop(columns=['releasedate'], inplace=True)

#### Cleaning the time column and extracting hours and mins:

In [None]:
# attempt at converting the time column by splitting it into hours, mins, etc. components in different columns and then combining again
print("unique text fragments captured from the time column:")
display(cleaned_audible_df['time'].str.extract(pat=r'([A-Za-z]+\s*)+')[0].unique())

# at the start of the string, get one or more digits, then a white space, hr, maybe an s
# but only keep the digits
# i.e. we get '8 hrs' or '1 hr' and we only keep the 8 or 1
# expand=False returns a Series instead of a one-column DataFrame
cleaned_audible_df['hours'] = cleaned_audible_df['time'].str.extract(pat=r'^(\d+)\shr[s]*', expand=False)
# seems to have worked
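# we have an NA if it is 0 hours, so let's fill the NAs with 0's
# the extracted values are strings, so also cast to int to avoid a column that mixes strings and numbers
cleaned_audible_df['hours'] = cleaned_audible_df['hours'].fillna(0).astype(int)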

# with pd.option_context("display.min_rows", 10000):
display(cleaned_audible_df)

# check that it worked for when it is just '1 hr', not hrs
print("rows with time == '1 hr':")
display(cleaned_audible_df[cleaned_audible_df['time'] == '1 hr'])

# check that the fillNA worked correctly
print("rows with 0 in the hours column:")
# with pd.option_context("display.min_rows", 10000):
display(cleaned_audible_df[cleaned_audible_df['hours'] == 0])

In [None]:
# extracting the minutes
# cleaned_audible_df['minutes'] = cleaned_audible_df['time'].str.extract(r'')
display(cleaned_audible_df[cleaned_audible_df['time'].str.contains(r'min$', regex=True)])
# ok it seems we have to be careful with not getting 'Less than 1 minute' and something like '1 hr and 1 min' mixed up

# one or more digits, then 1 whitespace, then 'min', then maybe an s, all this at the end of the string
# only capture the digits though
# it is important that it's at the end of the string so we don't accidentally capture the 1 from 'Less than 1 minute', 
# we only want the number when the string ends in min or mins
print("df with minutes extracted:")
cleaned_audible_df['minutes'] = cleaned_audible_df['time'].str.extract(r'(\d+)\smin[s]*$', expand=False)
# cleaned_audible_df[cleaned_audible_df['time'].str.contains(r'(\d+\smin[s]*$)')]

# with pd.option_context("display.min_rows", 10000):
display(cleaned_audible_df)

# check that it correctly did NA for rows where there are no minutes
print("rows with NA in the minutes column:")
display(cleaned_audible_df[cleaned_audible_df['minutes'].isna()])

# the minutes column seems to have worked nicely, let's fill in the NAs with 0's
# the extracted values are strings, so also cast to int
cleaned_audible_df['minutes'] = cleaned_audible_df['minutes'].fillna(0).astype(int)

print("df with minutes NAs filled in with 0's:")
# with pd.option_context("display.min_rows", 10000):
display(cleaned_audible_df)

In [None]:
# TODO: what to do with rows where it is 'Less than 1 minute'????
# let's see what unique values there are in the time column where we got NA for both hours and mins, and filled them both in with 0's:
cleaned_audible_df[(cleaned_audible_df['hours'] == 0) & (cleaned_audible_df['minutes'] == 0)]['time'].unique()
# Ok, just 'Less than 1 minute'. I will have to figure out how to take care of this
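
One option for the open question above, as a sketch only: combine hours and minutes into a single total-minutes value and treat 'Less than 1 minute' as 0 minutes. Rounding it down to 0 is an assumption; rounding up to 1, or keeping a separate flag column, would be just as defensible. The helper name `duration_minutes` is made up for illustration, and the patterns mirror the ones used for the hours and minutes columns.

```python
import re

_HOURS = re.compile(r"^(\d+)\shrs?")      # '8 hrs' or '1 hr' at the start of the string
_MINUTES = re.compile(r"(\d+)\smins?$")   # '20 mins' or '1 min' at the end of the string


def duration_minutes(raw: str) -> int:
    """Total duration in minutes; strings with no hr/min part (e.g. 'Less than 1 minute') give 0."""
    total = 0
    hours = _HOURS.search(raw)
    minutes = _MINUTES.search(raw)
    if hours:
        total += 60 * int(hours.group(1))
    if minutes:
        total += int(minutes.group(1))
    return total


for sample in ("2 hrs and 20 mins", "1 hr", "45 mins", "Less than 1 minute"):
    print(sample, "->", duration_minutes(sample))
```

This could be applied with `cleaned_audible_df['time'].map(duration_minutes)` to get one numeric column in place of separate hours and minutes.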

### Cleaning the ****third dataset

## Integrate the data

# Transform

## Normalize the data

# Load

## Set up the database

## Load in the data

# Queries