# Data Analysis with Pandas — Day 2 Workbook
## Word Counts

This is the Day 2 practice workbook for the June 2021 course "Data Analysis with Pandas," part of the [Text Analysis Pedagogy Institute](https://nkelber.github.io/tapi2021/book/intro.html).

## Dataset

### HathiTrust Extracted Features

The [HathiTrust Digital Library](https://www.hathitrust.org/) has released word frequencies per page for all 17 million books in its catalog. These word frequencies — plus part of speech tags and other information — are known as "extracted features." 

The HTRC team has developed a Python package, the [HathiTrust Feature Reader](https://github.com/htrc/htrc-feature-reader), which allows you to access and work with the extracted features of books.

Guess what: the HathiTrust Feature Reader relies heavily on Pandas! So we're going to practice our Pandas knowledge by applying the concepts to a new form of textual data. We're specifically going to examine Sandra Cisneros's coming-of-age novel *The House on Mango Street*.

## Install HathiTrust Feature Reader

To work with HathiTrust's extracted features, we first need to install and import the [HathiTrust Feature Reader](https://github.com/htrc/htrc-feature-reader).

In [None]:
!pip install htrc-feature-reader

## Import Libraries

In [None]:
from htrc_features import Volume
import pandas as pd
pd.options.display.max_rows = 800

## Make DataFrame of Word Frequencies

To get HathiTrust extracted features for a single volume, we can create a [Volume object](https://github.com/htrc/htrc-feature-reader#volume) with `Volume()` and the unique HathiTrust volume ID, then use the `.tokenlist()` method. 

<div class="admonition note" name="html-admonition" style="background: skyblue; padding: 10px">
<p class="title"><b>How to Find a HathiTrust Volume ID</b></p>

To locate the HathiTrust volume ID for *The House on Mango Street*, we can search the HathiTrust catalog for this book and then click on "Limited (search only)," which will take us to the following web page: https://babel.hathitrust.org/cgi/pt?id=uc1.32106012740764.

The HathiTrust volume ID for *The House on Mango Street* is the string that follows `id=` in this URL: `uc1.32106012740764`.
</div>

In [None]:
Volume('uc1.32106012740764').tokenlist(case=False, drop_section=True).reset_index()

This DataFrame displays each page number, the words/tokens that appear on that page, each token's part-of-speech tag, and the number of times the token appears on the page.
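
For orientation, here is a small hand-built DataFrame with the same general shape. It is only a sketch: the rows and counts are invented for illustration rather than taken from the novel, and the column name `pos` is an assumption about what the Feature Reader returns (`page`, `lowercase`, and `count` are the names used later in this workbook).

```python
import pandas as pd

# Hypothetical rows; the words and counts are invented, not real data
toy_df = pd.DataFrame({
    "page": [1, 1, 1, 2, 2],
    "lowercase": ["house", "the", "street", "house", "lived"],
    "pos": ["NN", "DT", "NN", "NN", "VBD"],
    "count": [3, 7, 1, 2, 1],
})

# One row per (page, token, part-of-speech) combination
print(toy_df)
```

Each row answers the question "how many times does this token, with this tag, appear on this page?"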

In [None]:
mango_df = Volume('uc1.32106012740764').tokenlist(case=False, pos=True, drop_section=True).reset_index()

## Examine 10 Random Rows

In [None]:
# Examine 10 random rows
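
As a reminder of how random sampling works, here is a minimal sketch on a toy DataFrame (the data is invented, and `toy_df` and `sample_df` are hypothetical names). The same `.sample()` method should work on `mango_df`.

```python
import pandas as pd

# Hypothetical data, not taken from the novel
toy_df = pd.DataFrame({
    "page": [1, 2, 3, 4, 5, 6],
    "lowercase": ["house", "street", "mango", "window", "name", "sky"],
    "count": [3, 1, 2, 1, 4, 1],
})

# Draw 3 rows at random; random_state makes the draw repeatable,
# leave it out to get a different sample on every run
sample_df = toy_df.sample(3, random_state=0)
print(sample_df)
```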

## Filter the DataFrame

Filter the DataFrame `mango_df` to show only words tagged as nouns (`NN`).

In [None]:
# Filter the DataFrame to show only words tagged as nouns (`NN`)
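
The filtering pattern can be sketched on toy data. To leave the exercise open, this example filters for a different tag, `VBD` (past-tense verbs); the data and the names `toy_df` and `verb_filter` are made up for illustration.

```python
import pandas as pd

# Hypothetical data, not taken from the novel
toy_df = pd.DataFrame({
    "lowercase": ["house", "lived", "street", "moved"],
    "pos": ["NN", "VBD", "NN", "VBD"],
    "count": [3, 1, 1, 2],
})

# Comparing a column to a value yields a Boolean vector (True/False per row)
verb_filter = toy_df["pos"] == "VBD"

# Indexing with the Boolean vector keeps only the rows marked True
print(toy_df[verb_filter])
```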

## Make a New DataFrame

Make a new DataFrame `pos_df` that includes only words tagged as nouns (`NN`).

In [None]:
# Make a new DataFrame `pos_df` that includes only words tagged as nouns (`NN`)
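
One possible way to save a filtered result, again sketched on invented data with a different tag: adding `.copy()` is optional, but it makes the new DataFrame independent of the original, which tends to avoid warnings if you modify it later. The names `toy_df` and `verbs_df` are hypothetical.

```python
import pandas as pd

# Hypothetical data, not taken from the novel
toy_df = pd.DataFrame({
    "lowercase": ["house", "lived", "street", "moved"],
    "pos": ["NN", "VBD", "NN", "VBD"],
    "count": [3, 1, 1, 2],
})

# .copy() gives verbs_df its own data, separate from toy_df
verbs_df = toy_df[toy_df["pos"] == "VBD"].copy()

# Changing the new DataFrame leaves the original untouched
verbs_df["count"] = verbs_df["count"] * 10
print(verbs_df)
print(toy_df)
```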

## Sort Values

Sort the DataFrame `pos_df` by word count from highest to lowest.

In [None]:
# Sort the DataFrame `pos_df` by word count from highest to lowest

Sort the DataFrame `pos_df` by word count from highest to lowest, then examine the first 30 rows.

In [None]:
# Sort the DataFrame `pos_df` by word count from highest to lowest, then examine first 30 values
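
A minimal sketch of sorting and then slicing, using invented data and hypothetical names (`toy_df`, `sorted_df`, `top3_df`):

```python
import pandas as pd

# Hypothetical data, not taken from the novel
toy_df = pd.DataFrame({
    "lowercase": ["house", "the", "street", "lived", "mango"],
    "count": [3, 7, 1, 2, 5],
})

# ascending=False puts the largest counts first
sorted_df = toy_df.sort_values(by="count", ascending=False)

# .head(n) keeps the first n rows of the sorted result
top3_df = sorted_df.head(3)
print(top3_df)
```

Because `sort_values()` returns a new DataFrame, the two steps can also be chained on one line.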

## Groupby Word

<div class="admonition note" name="html-admonition" style="background: lightyellow; padding: 10px">
<p class="Question"><b>❓ Question</b></p>

What are the most frequent nouns in *The House on Mango Street* overall?

</div>

To find out, group by the word column "lowercase", then sum each word's per-page counts to get one total per word. Finally, sort the totals from highest to lowest.

In [None]:
# Group by the word column "lowercase", then sum all the word counts per page
# Finally sort values from highest to lowest

Now examine just the top 30 nouns overall.

In [None]:
# Group by the word column "lowercase", then sum all the word counts per page
# Finally sort values from highest to lowest
# Examine top 30 values
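
The group-sum-sort pattern can be sketched on toy data as follows. The rows are invented and `toy_df` and `totals` are hypothetical names. Selecting the `count` column before calling `.sum()` is a design choice here: it keeps pandas from also trying to add up the page numbers.

```python
import pandas as pd

# Hypothetical data, not taken from the novel: the same word on several pages
toy_df = pd.DataFrame({
    "page": [1, 1, 2, 2, 3],
    "lowercase": ["house", "street", "house", "mango", "street"],
    "count": [3, 1, 2, 4, 1],
})

# Group rows by word, add up each word's counts, then sort the totals
totals = toy_df.groupby("lowercase")["count"].sum().sort_values(ascending=False)
print(totals)

# Keep only the top 2 words
print(totals.head(2))
```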

## Make a Plot of Top Words

Save the top 15 nouns as a new variable `top15_df`.

In [None]:
# Group by the word column "lowercase", then sum all the word counts per page
# Finally sort values from highest to lowest
# Save top 15 values as top15_df

Then plot a bar chart of this smaller DataFrame `top15_df`.

In [None]:
# Make a bar chart of top15_df

## Bonus: Chart Word(s) Frequency Across the Volume

We can plot word frequency across the book by filtering by word. Try it out with other words!

In [None]:
# Boolean vector
word_filter = mango_df['lowercase'] == 'house'
# Filter, then plot
mango_df[word_filter].plot(x='page', y='count')

In [None]:
# Boolean vector
word_filter = mango_df['lowercase'].isin(['mango', 'street'])
# Filter, then plot
mango_df[word_filter].plot(x='page', y='count')
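
When filtering for more than one word, the plot above appears to draw a single line that mixes the counts of both words. One possible alternative, sketched here on invented data, is to reshape the filtered rows with `pivot_table()` so that each word gets its own column, and therefore its own line when plotted. The names `toy_df` and `per_page` are hypothetical.

```python
import pandas as pd

# Hypothetical data, not taken from the novel
toy_df = pd.DataFrame({
    "page": [1, 1, 2, 3, 3],
    "lowercase": ["mango", "street", "mango", "street", "street"],
    "pos": ["NN", "NN", "NN", "NN", "NNP"],
    "count": [2, 1, 1, 3, 1],
})

# Boolean vector
word_filter = toy_df["lowercase"].isin(["mango", "street"])

# One row per page, one column per word; counts for the same word on the
# same page (for example under different tags) are summed, and pages where
# a word is absent get 0
per_page = toy_df[word_filter].pivot_table(
    index="page", columns="lowercase", values="count",
    aggfunc="sum", fill_value=0,
)
print(per_page)

# per_page.plot() would then draw one line per word (requires matplotlib)
```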