# Data Analysis with Pandas — Day 3
## Text Manipulation, Functions, Time Series

This is the Day 3 notebook for the June 2021 course "Data Analysis with Pandas," part of the [Text Analysis Pedagogy Institute](https://nkelber.github.io/tapi2021/book/intro.html).

In this lesson, we will cover:

* String Methods / Text Manipulation
* Applying Functions
* Converting Between Data Types
* Working with Time Series Data

___

## Dataset
### Seattle Public Library Book Circulation Data

This week, we will be working with [circulation data](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6) made publicly available by the Seattle Public Library. The dataset includes items that were checked out 20+ times in a month between January 2015 and June 2021.

For more information about this dataset, see the Seattle Public Library's [data portal](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).
___

## Import Pandas

To use the Pandas library, we first need to `import` it.

In [None]:
import pandas as pd

By default, Pandas will display 60 rows and 20 columns. I often change [Pandas' default display settings](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) to show more rows or wider columns.

In [None]:
pd.options.display.max_colwidth = 100
pd.options.display.max_rows = 100

## Load Data

To read in a CSV file, we will use the function `pd.read_csv()` and pass in the path to our file.

In [None]:
seattle_df = pd.read_csv('Seattle-Library_2015-2021.csv', delimiter=",", encoding="utf-8")

## Lingering Issues with SPL Data

So we've learned a lot about Pandas so far, and we've learned a lot about our Seattle Public Library data, too. Let's take a moment to appreciate how far we've come:

We've been able to get a broad overview of our data, subset our data in different ways, make simple plots, and even figure out the books and material types that were checked out most between 2015-2021. That's pretty awesome!

<img src="https://images-na.ssl-images-amazon.com/images/I/41txHpdA8QL.jpg" width=250/>


But unfortunately there are still some lingering problems and issues with our data.

For example, the titles in the dataset are sometimes recorded inconsistently and have slightly different versions. If we calculate the total checkouts for each title, we can see that Sue Grafton's *"A" is for Alibi: Kinsey Millhone Series, Book 1* shows up multiple times in different ways.

In [None]:
seattle_df.groupby('Title')[['Checkouts']].sum()

There are other small inconsistencies in the data, too, such as trailing commas in the "Publisher" column, which make it more difficult for us to get a true sense of publishing trends.

In [None]:
seattle_df['Publisher'].value_counts()[:40]

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/16/Penguin_Random_House.svg/1200px-Penguin_Random_House.svg.png" width=250/>


Additionally, when we try to filter for a particular title, like Elena Ferrante's *My Brilliant Friend*, we have to know the full and complete title for our filtering method to work (pssst can you tell that I'm obsessed with Elena Ferrante yet...?) 

Here's what happens when we filter the DataFrame for just "My Brilliant Friend"...

In [None]:
# Boolean vector
title_filter = seattle_df['Title'] == 'My Brilliant Friend'

# Filter
seattle_df[title_filter]

Nothing. No dice. 🎲

Using this filter method, we have to specify the entire title to get what we're looking for, which is challenging and annoying.

In [None]:
# Boolean vector
title_filter = seattle_df['Title'] == 'My Brilliant Friend: Neapolitan Series, Book 1'

# Filter
seattle_df[title_filter].head()

Lastly, our "Subjects" column still includes multiple subjects in many rows, so we don't know very much about overall "Subject" trends.

In [None]:
seattle_df['Subjects']

Plus we still don't have any datetime information.

In [None]:
seattle_df['CheckoutYear'].dtype

But don't fret. We're going to resolve all of these lingering issues in this notebook!

## String Methods

We can clean up many of the inconsistencies in our data by using Pandas string methods.

<div class="admonition pythonreview" name="html-admonition" style="background: lightgreen; padding: 10px">
<p class="title"><b>Python Review 🐍 </b></p>

The Python data type for textual data is a "string." Strings are denoted by single or double quotation marks.

| Data Type       | Explanation          | Example  |
| ------------- |:-------------:| -----:|
| String     | Text | ```"brilliant", '40'``` |
| Integer     | Whole Numbers      |   ```40``` |
| Float | Decimal Numbers      |   ```40.2``` |
| Boolean | True/False     |   ```False``` |

There are a number of convenient, built-in Python methods that allow you to manipulate and work with strings, such as stripping leading and trailing whitespace or replacing certain characters.

| **String Method** | **Explanation**                                                                                   |
|:-------------:|:---------------------------------------------------------------------------------------------------:|
| `string.lower()`         | makes the string lowercase                                                   |
| `string.strip()`         | removes leading and trailing white spaces     |
| `string.replace('old string', 'new string')`      | replaces `old string` with `new string`          |
| `string.split('delim')`          | returns a list of substrings separated by the given delimiter |

                                                            
</div>

Let's assign the string "?My Brilliant Friend?" to the variable `sample_string`.

In [None]:
sample_string = "?My Brilliant Friend?"

When we check its data type, we can see that it is indeed a string.

In [None]:
type(sample_string)

Thus, we can strip leading and trailing characters.

In [None]:
sample_string.strip('?')

And we can make the entire string uppercase.

In [None]:
sample_string.upper()
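
The other methods from the review table work the same way on a single Python string. Here is a quick sketch using a made-up value (`messy_title` is a hypothetical name, not something from the dataset):

```python
# A made-up string for illustration only
messy_title = "  My Brilliant Friend / Elena Ferrante  "

lowered = messy_title.lower()                 # all lowercase
stripped = messy_title.strip()                # no leading/trailing spaces
replaced = stripped.replace("Friend", "Pal")  # swap one substring for another
parts = stripped.split(" / ")                 # list of pieces around the delimiter

print(lowered)
print(stripped)
print(replaced)
print(parts)
```

Note that none of these methods change `messy_title` itself. Each one returns a new string (or list), which is why the results are assigned to new variables.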

Well, Pandas has special [Pandas string methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-methods), too. Many of them are very similar to Python string methods, except they will transform every single string value in a column, and we need to add `.str` to the method chain.

Here's a sample of Pandas string methods (you can see a full account in the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary)).

| **Pandas String Method** | **Explanation**                                                                                   |
|:-------------:|:---------------------------------------------------------------------------------------------------:|
| df['column_name']`.str.lower()`         | makes the string in each row lowercase                                                                                |
| df['column_name']`.str.upper()`         | makes the string in each row uppercase                                                |
| df['column_name']`.str.title()`         | makes the string in each row titlecase                                                |
| df['column_name']`.str.replace('old string', 'new string')`      | replaces `old string` with `new string` for each row |
| df['column_name']`.str.contains('some string')`      | tests whether string in each row contains "some string" |
| df['column_name']`.str.split('delim')`          | returns a list of substrings separated by the given delimiter |
| df['column_name']`.str.join('delim')`         | opposite of split(), joins the elements of the list in each row together using the given delimiter                                                                        |
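
To see the difference from plain Python strings, here is a small sketch on a toy Series (the values are invented for illustration and are not taken from the Seattle data). Each `.str` method runs on every value in the column, and a missing value stays missing instead of raising an error:

```python
import pandas as pd

# Toy column, invented for illustration
toy_titles = pd.Series(["my brilliant friend", "THE STORY OF A NEW NAME", None])

# Title-case every value
print(toy_titles.str.title())

# Test every value for a substring, ignoring case and treating missing as False
print(toy_titles.str.contains("friend", case=False, na=False))

# Split every value into a list of words, then join each list back with underscores
print(toy_titles.str.split(" ").str.join("_"))
```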
                                                            

To transform all the values in the "MaterialType" column to lower case, we can use `.str.lower()` 

In [None]:
seattle_df['MaterialType']

In [None]:
seattle_df['MaterialType'].str.lower()

We can also try to clean up some of the inconsistencies in our data by stripping trailing commas with `.str.strip(',')`.

Here are the top 15 most frequent values in the "Publisher" column before we do any clean up.

In [None]:
seattle_df['Publisher'].value_counts()[:15]

Now let's strip commas.

In [None]:
seattle_df['Publisher'] = seattle_df['Publisher'].str.strip(',')

Here are the top 15 most frequent values after our text manipulation. Looks a little better!

In [None]:
seattle_df['Publisher'].value_counts()[:15]
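
One thing worth keeping in mind: `.str.strip(',')` only removes commas at the very start or end of each value, so commas inside a name are kept, and a comma followed by a space is not touched. A minimal sketch with invented publisher names:

```python
import pandas as pd

# Invented publisher values for illustration
toy_publishers = pd.Series(["Viking,", "Viking", "Random House, Inc.,", "Viking, "])

# Strip commas from the ends only
cleaned = toy_publishers.str.strip(",")
print(cleaned.value_counts())

# Stripping whitespace first also catches values like "Viking, "
cleaned_more = toy_publishers.str.strip().str.strip(",")
print(cleaned_more.value_counts())
```

Whether the extra whitespace step is needed for the real "Publisher" column is an assumption to check against the data, not something this sketch can confirm.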

We can also use `df['column_name'].str.contains()` to search for whether a row contains a string, like *My Brilliant Friend*, even if it doesn't match the title exactly.

If there are `NaN` values in a column, we can also choose to ignore them with `na = False`.

In [None]:
# Boolean vector
title_filter = seattle_df['Title'].str.contains('My Brilliant Friend', na=False)

# Filter
seattle_df[title_filter].sample(10)

We can also test whether a row contains a certain word/phrase OR (`|`) other words/phrases. Additionally, we can choose to ignore case.

In [None]:
# Boolean vector
title_filter = seattle_df['Title'].str.contains('my brilliant friend|story of a new name',
                                                na=False, case=False)

# Filter
seattle_df[title_filter]['Title'].value_counts()
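
A toy version of the same idea, with invented titles, makes it easier to check which rows match. `toy_titles` is a made-up Series and says nothing about the real data:

```python
import pandas as pd

toy_titles = pd.Series([
    "My Brilliant Friend: Neapolitan Series, Book 1",
    "The Story of a New Name",
    "the story of the lost child",
    None,
])

# `|` means OR; case=False ignores capitalization; na=False treats missing as no match
title_filter = toy_titles.str.contains(
    "my brilliant friend|story of a new name", case=False, na=False
)
print(toy_titles[title_filter])
```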


| Regular Expression Pattern       | Matches |
|:---------------------------:|:-----------------------------------------------------------------------------------------------------------:|
| `.` | any character                                         | 
| `\w` | word character (letter, digit, or underscore)                                         | 
| `\W`                      | NOT word character                                           |  
| `\d` | digit                                         | 
| `\D`                      | NOT digit                                           | 
| `\s` | whitespace                                         | 
| `\S`                      | NOT whitespace                                          | 
| `[abc]`                      | Any of abc                                         |
| `[^abc]`                      | Not any of abc                                         | 
| `(abc)`                      | Specific capture of "abc"                                         |
| `+`                      | 1 or more instances                                       | 
| `*`                      | 0 or more instances                                         | 
| `?`                      | 0 or 1 instance                                        | 
| `{number}`                      | any specific number of instances                                        | 

                   


We can also use regular expressions with `.str.contains()`. For example, we could search for anything that has 4 digits in a row with `\d{4}`.

In [None]:
# Boolean vector
title_filter = seattle_df['Title'].str.contains(r'\d{4}',
                                                na=False, case=False, regex=True)

# Filter
seattle_df[title_filter]['Title'].value_counts()
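
To get a feel for the patterns in the table, it can help to try them on a few invented strings first. The sketch below uses raw strings (`r'...'`) so that Python passes the backslash through to the regular expression unchanged:

```python
import pandas as pd

# Invented titles for illustration
toy_titles = pd.Series(["1984", "Catch-22", "2001: A Space Odyssey", "Emma"])

# Four digits in a row
has_year_like = toy_titles.str.contains(r"\d{4}", regex=True)
# At least one digit anywhere
has_digit = toy_titles.str.contains(r"\d", regex=True)
# At least one whitespace character anywhere
has_space = toy_titles.str.contains(r"\s", regex=True)

print(toy_titles[has_year_like])
print(toy_titles[has_digit])
print(toy_titles[has_space])
```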

## Applying Functions

| **Pandas Method** | **Explanation**                                                                                   |
|:-------------:|:---------------------------------------------------------------------------------------------------:|
| df['column_name']`.apply(function_name)`         | Call function on every row in column                                                                                |
| df`.apply(function_name, axis='columns')`         | Call function on every row in DataFrame                                                |

Perhaps we want to create a Python function that will clean up our data in a more nuanced and comprehensive way.

For example, our "PublicationYear" column has a lot of irregularity. Many of the values are surrounded by square brackets `[]`, begin with the letter `c`, end with a period, or include multiple years (presumably a copyright year and a reissue year).

In [None]:
seattle_df['PublicationYear'].value_counts()

Lucky for us, Pandas makes it easy to apply functions to DataFrame and Series objects.

<div class="admonition pythonreview" name="html-admonition" style="background: lightgreen; padding: 10px">
<p class="title"><b>Python Review 🐍 </b></p>

Python functions enable you to bundle up code and call that code whenever you need it. There are a number of built-in Python functions, such as:
- `print()`
- `len()`
- `type()`

To create a Python function of your own, you need to begin with the keyword `def`, short for define. Then you give the function a name followed by parentheses, and often you specify an argument that it will accept inside the parentheses, followed by a colon.

```
def clean_year(year):
    year = year.replace('[', '')
    year = year.replace(']', '')
    return year

clean_year("[2012]")  
"2012"  
```

Then you move to the next line, indent, and include the code that you want to be run when you "call" the function.

Finally, you typically end with a `return` statement that will return a certain value(s).  


</div>

To use a Python function on a DataFrame or Series object, you use `.apply()` and give the name of the desired function as an argument. On a Series, the function is called on every value in the column; on a DataFrame, it is called on every column by default, or on every row if you pass `axis='columns'`.
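
As a minimal sketch of both forms, here is `.apply()` on an invented two-column DataFrame. The function names `shout` and `describe_row` are hypothetical helpers made up for this example:

```python
import pandas as pd

# Invented data for illustration
toy_df = pd.DataFrame({
    "Title": ["Emma", "Persuasion"],
    "Checkouts": [25, 40],
})

def shout(title):
    # Receives one value from the column at a time
    return title.upper() + "!"

def describe_row(row):
    # Receives one whole row (as a Series) at a time
    return f"{row['Title']} was checked out {row['Checkouts']} times"

# Series.apply: the function sees each value
shouted = toy_df["Title"].apply(shout)
# DataFrame.apply with axis='columns': the function sees each row
described = toy_df.apply(describe_row, axis="columns")

print(shouted)
print(described)
```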

When writing a function for a DataFrame or column, it helps to start with a function that works well on a single value, like "c[2012]. " or "c2012 2020.  " For these values, we want to remove characters like `[` or `c`, strip whitespace, and find a way to deal with multiple years. To keep things simple, let's return "Unknown" if there is more than one year listed.

In [None]:
def clean_year(year):
    
    # Convert number to string
    year = str(year)
    # Replace characters
    year = year.replace('[', '')
    year = year.replace(']', '')
    year = year.replace('-', '')
    year = year.replace(',', '')
    year = year.replace('.', '')
    year = year.replace('©', '')
    year = year.replace('c', '')
    year = year.replace('?', '')
    # Strip whitespace
    year = year.strip()
    
    # If there are more than 4 characters, return Unknown 
    if len(year) > 4:
        year = 'Unknown'
        
    return year

Let's test out the function on single values.

In [None]:
clean_year("c[2012].  ")

In [None]:
clean_year("c2012 2020.  ")

Great! It works. Now let's apply the function to the entire column. Note that when we apply the function, we simply give the name and do not call it with parentheses.

In [None]:
seattle_df['PublicationYear'].apply(clean_year)

Once we clean up the "PublicationYear" column, it looks a lot better.

In [None]:
seattle_df['PublicationYear'] = seattle_df['PublicationYear'].apply(clean_year)

seattle_df['PublicationYear'].value_counts()

You may have noticed that there are a lot of different variations of the titles, as well.

Let's make a function that attempts to aggregate titles of the same name. Note that we're using more regular expressions here. Python functions allow you to get as fancy as you want!

In [None]:
import re

def clean_title(title):
    
    # Replace some words
    title = title.lower().replace('(unabridged)', '')
    title = title.lower().replace(': a novel', '')
    
    # Use regex expression to remove everything after / or :
    # Test to see if there is a / character
    if re.search('.+?(?=/)/', title):
        # If so, pull out the text before the / character
        title = re.search('.+?(?=/)/', title).group(0)
    
    # Test to see if there is a : character
    if re.search('.+?(?=:)', title):
        # If so, pull out the text before the : character
        title = re.search('.+?(?=:)', title).group(0)
    
    # Strip character and whitespace
    title = title.strip('/')
    title = title.strip()
    
    title = title.title()
    
    return title

Make sure to run this cell!

In [None]:
seattle_df['Title'] = seattle_df['Title'].apply(clean_title)
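
The pattern `.+?(?=:)` can look cryptic. Read it as: match as few characters as possible (`.+?`) up to the point where a colon comes next (`(?=:)`), without including the colon itself. A small sketch on an invented, already lowercased title:

```python
import re

sample_title = "my brilliant friend: neapolitan series, book 1"

# Lazy match up to (but not including) the first colon
match = re.search(r".+?(?=:)", sample_title)
before_colon = match.group(0) if match else sample_title
print(before_colon.title())

# A title without a colon produces no match, so it would be left unchanged
no_colon = "the days of abandonment"
print(re.search(r".+?(?=:)", no_colon))
```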

<div class="admonition note" name="html-admonition" style="background: lightyellow; padding: 10px">
<p class="Question"><b>❓ Question</b></p>

How dominant are the "Big 5" Publishers — Penguin/Random House, Harper Collins, Hachette, Simon & Schuster, and Macmillan — in the Seattle Public Library system?

To answer this question, we need to `.apply()` a function.
</div>

It's hard to get a sense of publishing trends in this dataset because there are a lot of different values for the same publisher — not only slight differences like "Random House, Inc." and "Random House" but bigger differences, too, like the fact that "Viking" is *owned* by Penguin/Random House.

Can we account for these nuances?

In [None]:
seattle_df['Publisher'].value_counts()[:35]

In [None]:
def big_5_checker(publisher):
    
    # Make lowercase to catch variations
    publisher = publisher.lower()
    
    # Test to see if certain words are in the row, then return corresponding publisher
    if "random house" in publisher or "penguin" in publisher or "knopf" in publisher or "viking" in publisher or "ballantine" in publisher:
        return "Penguin/Random House"
    elif "harper" in publisher:
        return "Harper Collins"
    elif "simon" in publisher:
        return "Simon & Schuster"
    elif "hachette" in publisher:
        return "Hachette"
    elif "macmillan" in publisher:
        return "Macmillan"
    else:
        return "Other"

Let's test out the function on single values.

In [None]:
big_5_checker("Simon & Schuster - Audiobooks")

In [None]:
big_5_checker("Viking,")

Nice! It's working. Let's apply it to the column.

In [None]:
seattle_df['Big 5'] = seattle_df['Publisher'].apply(big_5_checker)

Uh oh! We've gotten an error. The error reports that we can't use a string method, `.lower()`, on a `float` data type, which is likely a `NaN`.

One way that we could handle this error is to simply drop `NaN` values before we apply the function with `.dropna()`.

In [None]:
seattle_df['Big 5'] = seattle_df['Publisher'].dropna().apply(big_5_checker)
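
Because `.dropna()` removes the rows with a missing publisher before the function runs, those rows end up as `NaN` in the new column when the result is assigned back. An alternative, sketched here on invented data, is to handle missing values inside the function itself. The helper `label_publisher` is a hypothetical, cut-down stand-in for `big_5_checker`:

```python
import pandas as pd

# Invented data for illustration
toy_df = pd.DataFrame({"Publisher": ["Viking,", None, "HarperCollins", "Graywolf Press"]})

def label_publisher(publisher):
    # Missing values are not strings, so check before calling string methods
    if pd.isna(publisher):
        return "Unknown"
    publisher = publisher.lower()
    if "viking" in publisher or "penguin" in publisher:
        return "Penguin/Random House"
    elif "harper" in publisher:
        return "Harper Collins"
    return "Other"

# Approach from the notebook: rows dropped by .dropna() become NaN after assignment
toy_df["Dropna Result"] = toy_df["Publisher"].dropna().apply(label_publisher)
# Alternative: every row gets a label, including the missing one
toy_df["Labelled"] = toy_df["Publisher"].apply(label_publisher)

print(toy_df)
```

Which behavior is preferable depends on the question: keeping `NaN` excludes those rows from `.value_counts()` by default, while an explicit "Unknown" label keeps them visible.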

In [None]:
seattle_df[['Title', 'MaterialType', 'Checkouts', 'Publisher','Big 5']].sample(15)

In [None]:
seattle_df['Big 5'].value_counts()

<div class="admonition note" name="html-admonition" style="background: lightyellow; padding: 10px">
<p class="Question"><b>❓ Question</b></p>

What are the most common "Subjects" in the Seattle Public Library circulation data between 2015-2021?
</div>

Another lingering issue with this data is that the "Subjects" column contains a list of different subjects in each row, making it difficult to understand overall subject borrowing patterns.

This is a tricky problem and a case where we might want to call a Python function on the entire column, not just apply a function to each row of the column.

In [None]:
seattle_df['Subjects'].value_counts()[:15]

Let's make a function that will accept an entire Pandas Series as an argument ("Subjects"), consider each row in the Series/column, split the row into a list based on comma separation, add each item to a list, and then make that list into its own Pandas Series object.

In [None]:
def count_upper_subjects(subjects_column):
    
    # Empty list
    list_of_subjects = []
    
    # For each item's subjects in the entire column
    for item_subjects in subjects_column:
        
        # Split on comma
        item_subjects = item_subjects.split(',')
        
        for item_subject in item_subjects:

            item_subject = item_subject.replace(',', '')
            item_subject = item_subject.strip(',')
            item_subject = item_subject.strip()

            # Add to big list
            list_of_subjects.append(item_subject)
    
    return pd.Series(list_of_subjects)

If we call the function and give the "Subjects" column as an argument, we get a Series object of individual subjects.

In [None]:
count_upper_subjects(seattle_df['Subjects'].dropna())

And because it is a Series, we can use `.value_counts()` to count them all up. It seems to look ok, but...

In [None]:
count_upper_subjects(seattle_df['Subjects'].dropna()).value_counts()[:40]

...if we examine the last items in the Series, we can see that this method certainly isn't perfect. 

In [None]:
count_upper_subjects(seattle_df['Subjects'].dropna()).value_counts()[-40:]

It appears that some subjects in this dataset, like "Women immigrants Australia Drama", are NOT separated by commas. If we were really interested in this information, we would need to spend more time creating a function that could capture all the contingencies.
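
For reference, Pandas can also do the split-and-count step without a hand-written loop. This is a sketch of a different technique than the function above, using `.str.split()` followed by `.explode()` (which gives each list item its own row), shown on invented subject strings. It has the same limitation: subjects that are not separated by commas stay glued together.

```python
import pandas as pd

# Invented subject strings for illustration
toy_subjects = pd.Series([
    "Fiction, Literature",
    "Fiction, Mystery, Thriller",
    None,
])

subject_counts = (
    toy_subjects.dropna()
    .str.split(",")   # each row becomes a list of subjects
    .explode()        # each subject gets its own row
    .str.strip()      # remove the space left after each comma
    .value_counts()
)
print(subject_counts)
```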

## Converting Data Types

| **Pandas Data Convert Method** | **Explanation**                                                                                   |
|:-------------:|:---------------------------------------------------------------------------------------------------:|
| `df['column_name'].astype(str/int/float)`         | Convert column to a different data type                                                                                |
| `pd.to_numeric(df['column_name'], errors='coerce')`         | Convert column to numerical data, with option to convert errors to `NaN`                                              |
| `pd.to_datetime(df['column_name'], format='%Y-%m')`         | Convert column to datetime, specify date input format                                               |

In the last section, we cleaned up a lot of our "PublicationYear" column. So now we should be able to filter the data and only look at works that were published in the 1970s...right?

In [None]:
seventies_filter = (seattle_df['PublicationYear'] >= 1970) & (seattle_df['PublicationYear'] < 1980)

seattle_df[seventies_filter]

Drat! We're getting an error message that says we can't use `>=` on a string object. If we remember, we do have string values in this column: "Unknown."

In [None]:
seattle_df['PublicationYear']

One way to convert between Pandas data types is to use `.astype()`. 

For example, the column "CheckoutYear" is an integer.

In [None]:
seattle_df['CheckoutYear']

But if we wanted to make it an object instead, we could use `.astype()`.

In [None]:
seattle_df['CheckoutYear'].astype(str)

So let's try it out on "PublicationYear"...

In [None]:
seattle_df['PublicationYear'].astype(int)

Double drat! `.astype(int)` fails on any value that can't be read as a whole number, such as the "Unknown" strings or missing values in this column.

So instead we'll need to use a different way of converting to numeric data: specifically, `pd.to_numeric()`, which includes an option for converting any problematic data into a `NaN`.

In [None]:
pd.to_numeric(seattle_df['PublicationYear'], errors='coerce')

In [None]:
seattle_df['PublicationYear'] = pd.to_numeric(seattle_df['PublicationYear'], errors='coerce')
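
Here is the same conversion on a tiny invented Series, which makes it easy to see exactly which values are turned into `NaN` and how the comparison behaves afterwards:

```python
import pandas as pd

# Invented values for illustration
toy_years = pd.Series(["2012", "1975", "Unknown", None])

# Anything that cannot be parsed as a number becomes NaN
numeric_years = pd.to_numeric(toy_years, errors="coerce")
print(numeric_years)

# Comparisons now work; NaN rows simply fail the test and are filtered out
seventies = numeric_years[(numeric_years >= 1970) & (numeric_years < 1980)]
print(seventies)
```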

In [None]:
seventies_filter = (seattle_df['PublicationYear'] >= 1970) & (seattle_df['PublicationYear'] < 1980)

seattle_df[seventies_filter].sample(10)

In [None]:
null_filter = seattle_df['PublicationYear'].isna()

seattle_df[null_filter].sample(3)

## Converting to Datetime

| **Pandas Data Convert Method** | **Explanation**                                                                                   |
|:-------------:|:---------------------------------------------------------------------------------------------------:|
| `pd.to_datetime(df['column_name'], format='%Y-%m')`         | Convert column to datetime, specify date input format                                               |

<div class="admonition note" name="html-admonition" style="background: lightyellow; padding: 10px">
<p class="Question"><b>❓ Question</b></p>

How do Seattle Public Library checkouts between 2015-2021 fluctuate *month by month*?
</div>

In the last notebook, we were able to create a pretty basic plot of checkouts between 2015-2021 by year.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

materialtype_checkouts_byyear = seattle_df.groupby(['MaterialType', 'CheckoutYear'])\
                                [['Checkouts']].sum().reset_index()

# Use Seaborn to make a line plot
sns.lineplot(data=materialtype_checkouts_byyear,
             x='CheckoutYear', y='Checkouts', hue='MaterialType')

# Put legend to the right
plt.legend(bbox_to_anchor=(1.05, 1))

But we couldn't get more granular than that, because we don't currently have a column with more granular date information.

However, we do have a column with the check out year and the check out month, so let's try to combine that information together!

In [None]:
seattle_df[['Title', 'Checkouts', 'CheckoutYear', 'CheckoutMonth']]

One option for combining this data together is to simply concatenate the columns together.

In [None]:
seattle_df['CheckoutYear'] + seattle_df['CheckoutMonth']

Oops! Since these are integers, when we try to concatenate them together, we're just adding them together. To concatenate them, we need to convert them to string objects.

In [None]:
seattle_df['CheckoutYear'].astype(str) + '-' + seattle_df['CheckoutMonth'].astype(str)

That's better! Let's make a new column with the check out year and month combined.

In [None]:
seattle_df['CheckoutYearMonth'] = seattle_df['CheckoutYear'].astype(str) + '-' + seattle_df['CheckoutMonth'].astype(str)

To explicitly make this column into datetime data, we need to use `pd.to_datetime()` with `format='%Y-%m'`, which specifies the date format of our inputs (here are the [codes for datetime formatting](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes)).

In [None]:
pd.to_datetime(seattle_df['CheckoutYearMonth'], format='%Y-%m')

If we specified that this information was year and day with `format='%Y-%d'`, rather than year and month, then it would interpret the second numerical value as day information.

In [None]:
pd.to_datetime(seattle_df['CheckoutYearMonth'], format='%Y-%d')

In [None]:
seattle_df['Date'] = pd.to_datetime(seattle_df['CheckoutYearMonth'], format='%Y-%m')

In [None]:
seattle_df['Date']
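
The whole sequence, from integer columns to a datetime column, can be sketched on a small invented DataFrame. The column names mirror the ones in the dataset, but the values are made up:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "CheckoutYear": [2015, 2020, 2021],
    "CheckoutMonth": [1, 11, 6],
})

# Adding integer columns just does arithmetic
print(toy_df["CheckoutYear"] + toy_df["CheckoutMonth"])

# Converting to strings first lets us join them with a hyphen
year_month = toy_df["CheckoutYear"].astype(str) + "-" + toy_df["CheckoutMonth"].astype(str)
print(year_month)

# %Y is a four-digit year and %m is a month number; the day defaults to 1
toy_df["Date"] = pd.to_datetime(year_month, format="%Y-%m")
print(toy_df["Date"])
```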

<div class="admonition warning" name="html-admonition" style="background: pink; padding: 10px">
<p class="title"><b>Note</b></p>


Note that if you have three columns in your DataFrame that have the titles "year," "month," and "day," you can also make them a single datetime object in one fell swoop.
</div>

In [None]:
seattle_df['CheckoutDay'] = 1
seattle_df[['Year', 'Month', 'Day']] = seattle_df[['CheckoutYear','CheckoutMonth', 'CheckoutDay']]

In [None]:
pd.to_datetime(seattle_df[['Year', 'Month', 'Day']], errors='coerce')

In [None]:
seattle_df = seattle_df.drop(['Year', 'Month', 'Day'], axis='columns')
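
A self-contained sketch of that shortcut on invented data. When `pd.to_datetime()` is given a DataFrame whose columns are named for the year, month, and day, it assembles the dates from those columns directly, so no `format` string is needed:

```python
import pandas as pd

# Invented data for illustration
toy_df = pd.DataFrame({
    "Year": [2015, 2020],
    "Month": [1, 11],
    "Day": [1, 1],
})

# The column names tell Pandas which part of the date each column holds
assembled = pd.to_datetime(toy_df[["Year", "Month", "Day"]])
print(assembled)
```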

## Time Series Data

Now that we have actual datetime data, we can do more sophisticated time series analyses. For example, we can group by "Date" and calculate the total checkouts for each month.

In [None]:
seattle_df.groupby('Date')['Checkouts'].sum()

In [None]:
seattle_df.groupby('Date')['Checkouts'].sum().plot()

This is a lot more informative than year alone!

In [None]:
seattle_df.groupby('CheckoutYear')['Checkouts'].sum().plot()

We can do the same thing with our previous plot of material type checkouts over time.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

materialtype_checkouts_byyear = seattle_df.groupby(['MaterialType', 'Date'])\
                                            [['Checkouts']].sum().reset_index()

top_material_filter = materialtype_checkouts_byyear['MaterialType']\
                        .isin(['BOOK', 'EBOOK', 'AUDIOBOOK', 'VIDEODISC'])

# Use Seaborn to make a line plot
sns.lineplot(data=materialtype_checkouts_byyear[top_material_filter],
             x='Date', y='Checkouts', hue='MaterialType')

# Put legend to the right
plt.legend(bbox_to_anchor=(1.05, 1))

## Datetime Index

| **Pandas Datetime Index Methods** | **Explanation**                                                                                   |
|:-------------:|:---------------------------------------------------------------------------------------------------:|
| `df.resample('M')`         | Resample, or essentially group by, different spans of time, e.g., `Y`, `M`, `D`, `17min`                                                |
| `df.loc['2018':'2019']`         | Index by label and slice DataFrame between the years 2018 and 2019                             |

Another common approach to working with time series information is to make the datetime column into our Pandas index. There are some special things we can do with a datetime index, such as slice the data by dates more efficiently.

Let's make the "Date" column our index.

In [None]:
seattle_df = seattle_df.set_index('Date')

In [None]:
seattle_df.index

In [None]:
type(seattle_df.index)

Another thing we can do with a Datetime Index is to `.resample()`, or essentially group by, different time period spans.

In [None]:
seattle_df.resample('M')['Checkouts'].sum().plot()

In [None]:
seattle_df.resample('Y')['Checkouts'].sum().plot()

In [None]:
seattle_df.resample('Q')['Checkouts'].sum().plot()
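
A small sketch of `.resample()` on an invented DataFrame with a datetime index. Note one swap: it uses the `'MS'` (month start) and `'YS'` (year start) aliases, which label each period by its first day, rather than the `'M'` and `'Y'` aliases used in this notebook, because recent Pandas releases have begun renaming the latter to `'ME'` and `'YE'`. Months with no rows still appear in the result, with a sum of 0:

```python
import pandas as pd

# Invented data for illustration
toy_df = pd.DataFrame(
    {"Checkouts": [1, 2, 3, 4]},
    index=pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-10", "2021-03-01"]),
)

# 'MS' = month start: one row per calendar month, labelled by its first day
monthly = toy_df.resample("MS")["Checkouts"].sum()
# 'YS' = year start: one row per calendar year, labelled by January 1
yearly = toy_df.resample("YS")["Checkouts"].sum()

print(monthly.head(3))
print(yearly)
```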

Additionally, because we can use `.loc` to index by date label, we can easily slice the DataFrame between 2019 and 2020.

In [None]:
seattle_df.loc['2019':'2020']

In [None]:
seattle_df.loc['2019':'2020'].resample('M')['Checkouts'].sum().plot()

Or we could slice based on even more granular dates.

In [None]:
seattle_df.loc['2020-04':'2020-10'].resample('M')['Checkouts'].sum().plot()
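
Partial date strings such as `'2019'` or `'2020-04'` stand for the whole year or month, and both ends of the slice are included. A minimal sketch on invented data; sorting the index first is a precaution added here, not a step from the notebook:

```python
import pandas as pd

# Invented data for illustration, deliberately out of order
toy_df = pd.DataFrame(
    {"Checkouts": [5, 7, 9, 11]},
    index=pd.to_datetime(["2021-02-01", "2019-06-01", "2020-04-01", "2020-12-01"]),
)

# Label slicing on dates is most reliable when the index is in order
toy_df = toy_df.sort_index()

print(toy_df.loc["2019":"2020"])        # everything in 2019 and 2020
print(toy_df.loc["2020-04":"2020-10"])  # April through October 2020
```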

## Putting It All Together:
## Plot Checkouts of Specific Titles and Creators Over Time

In [None]:
title_filter = seattle_df['Title'].str.contains('So You Want To Talk About Race', na=False, case=False)

seattle_df[title_filter].resample('M')['Checkouts'].sum().plot()

In [None]:
seattle_df[title_filter].loc['2019-10':'2021-01'].resample('M')['Checkouts'].sum().plot()

In [None]:
checkouts_byyear = seattle_df.groupby(['Title','Creator', 'Date', 'MaterialType'])[['Checkouts']].sum().reset_index()
checkouts_byyear

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15,7))

creator_filter = checkouts_byyear['Creator'].str.contains('Ferrante')

# Use Seaborn to make a line plot
sns.lineplot(data= checkouts_byyear[creator_filter],
             x='Date', y='Checkouts', hue='Title', lw=3, ci=None)

# Put legend to the right
plt.legend(bbox_to_anchor=(.3, .8),  loc='center')

## Your Turn!

Fill in the code with a Title or Creator of your choice.

In [None]:
title_filter = seattle_df['Title'].str.contains('Title Of Your Choice', na=False, case=False)

seattle_df[title_filter].resample('M')['Checkouts'].sum().plot()

In [None]:
plt.figure(figsize=(15,7))

creator_filter = checkouts_byyear['Creator'].str.contains('Creator of Your Choice',
                                                          na=False, case=False)

# Use Seaborn to make a line plot
sns.lineplot(data= checkouts_byyear[creator_filter],
             x='Date', y='Checkouts', hue='Title', lw=3, ci=None)

# Put legend to the right
plt.legend(bbox_to_anchor=(.3, .8),  loc='center')

If you're getting a lot of variations of titles, make sure you've applied the `clean_title()` function...

In [None]:
import re

def clean_title(title):
    
    # Replace some words
    title = title.lower().replace('(unabridged)', '')
    title = title.lower().replace(': a novel', '')
    
    # Use regex expression to remove everything after / or :
    # Test to see if there is a / character
    if re.search('.+?(?=/)/', title):
        # If so, pull out the text before the / character
        title = re.search('.+?(?=/)/', title).group(0)
    
    # Test to see if there is a : character
    if re.search('.+?(?=:)', title):
        # If so, pull out the text before the : character
        title = re.search('.+?(?=:)', title).group(0)
    
    # Strip character and whitespace
    title = title.strip('/')
    title = title.strip()
    
    title = title.title()
    
    return title

In [None]:
seattle_df['Title'] = seattle_df['Title'].apply(clean_title)