# Louisville Free Public Library

Analysis of Young Adult (YA) genre in the Louisville Free Public Library collection.

## Questions

In this analysis we will look at the following questions:

- How much was spent on the collection for YA? 
- How many books are in the collection for YA?
- How does YA spending compare to other collections?
- Did the spending on YA change over time?
- Is YA more or less popular at any of the locations?

### Load the clean library collection data and show a preview of the data

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

# load the clean books data into a dataframe and show the first few rows
books_data_path = Path('results/books_clean.csv.gz')
books_df = pd.read_csv(books_data_path)
books_df.head()

Unnamed: 0,BibNum,Title,Author,PublicationYear,ItemType,ItemCollection,ItemLocation,ItemPrice,Genre,Audience
0,707409,"Jeff Immelt and the new GE way : innovation, t...","Magee, David, 1965-",2009,Book,Adult Non-Fiction,Main,25.95,Non-Fiction,Adult
1,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",2009,Book,Adult Non-Fiction,Southwest,19.99,Non-Fiction,Adult
2,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",2009,Book,Adult Non-Fiction,Southwest,19.99,Non-Fiction,Adult
3,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",2009,Book,Adult Non-Fiction,Remote Shelving - Main,19.99,Non-Fiction,Adult
4,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",2009,Book,Adult Non-Fiction,Remote Shelving - Main,19.99,Non-Fiction,Adult


### How much was spent on the collection for YA?

In [2]:
# TODO: First figure out which records in the dataframe are YA using a mask
# YA = Genre: Fiction, Audience = Teen. Then slice the DataFrame using the mask 
# and sum the ItemPrice column and format the result.
ya_mask = (books_df['Audience'] == 'Teen') & (books_df['Genre'] == 'Fiction')
ya_mask


0          False
1          False
2          False
3          False
4          False
           ...  
1187198    False
1187199    False
1187200    False
1187201    False
1187202    False
Length: 1187203, dtype: bool

In [5]:
"${:,}".format(books_df[ya_mask]['ItemPrice'].sum())

'$555,691.26'

The YA collection has a total cost of $555,691.26.

### How many books in  the collection are YA?

In [6]:
# TODO: Create a new column in the dataframe called YA_Category, use 
# value_counts() to get the count and percent of YA books, and use the concat()
# function to combibooks_df['YA_Category'] = np.where(ya_mask, 'YA', 'Other')
books_df['YA_Category'] = np.where(ya_mask, 'YA', 'Other')

ya_counts = books_df['YA_Category'].value_counts().apply(lambda x: "{:,}".format(x))
ya_percents = books_df['YA_Category'].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'

pd.concat([ya_counts, ya_percents], axis=1, keys=['books','percentage'])

Unnamed: 0,books,percentage
Other,1145946,96.5%
YA,41257,3.5%


YA accounts for 3.5% percent of the total number of books in the collection.

### How does YA spending compare to other collections?

In [7]:
# TODO: Group the data by Genre and Audience using groupby() and use sum() to 
# get the total cost. Format the totals as currency.
books_df.groupby(['Genre','Audience'])['ItemPrice'].sum().apply(lambda x: "${:,.2f}".format(x))

Genre        Audience
Fiction      Adult       $3,457,835.27
             Children      $687,553.59
             Teen          $555,691.26
             Unknown     $1,731,767.36
Non-Fiction  Adult       $9,209,529.31
             Children    $1,597,204.37
             Teen          $401,104.39
             Unknown       $875,794.32
Unknown      Adult         $281,617.43
             Children    $2,505,961.73
             Teen              $119.09
             Unknown       $533,619.31
Name: ItemPrice, dtype: object

### Did the spending on YA change over time?

In [26]:
# TODO: Calculate the counts (value_counts()), total cost (sum()), and average 
# cost (mean()) for all YA books by publication year using groupby(). 
# Concatenate the counts and costs using concat() into a single dataframe. 
# Format the counts and costs as numbers and currency.
ya_years_count = books_df[['PublicationYear', 'ItemPrice']][books_df['YA_Category']=='YA']\
.groupby('PublicationYear').count()
ya_years_count.columns = ['BookCount']

ya_years_total = books_df[['PublicationYear', 'ItemPrice']][books_df['YA_Category']=='YA']\
.groupby('PublicationYear').sum()
ya_years_total.columns = ['TotalCost']

ya_years_avg = books_df[['PublicationYear', 'ItemPrice']][books_df['YA_Category']=='YA']\
.groupby('PublicationYear').mean()
ya_years_avg.columns = ['AverageCost']

ya_years_summary = pd.concat([ya_years_count, ya_years_total, ya_years_avg], axis=1)

ya_years_summary['BookCount'] = ya_years_summary['BookCount'].apply(lambda x: "{:,}".format(x))
ya_years_summary['TotalCost'] = ya_years_summary['TotalCost'].apply(lambda x: "${:,}".format(x))
ya_years_summary['AverageCost'] = ya_years_summary['AverageCost'].apply(lambda x: "${:,}".format(x))

ya_years_summary

Unnamed: 0_level_0,BookCount,TotalCost,AverageCost
PublicationYear,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1919,1,$20.0,$20.0
1938,3,$42.53,$14.176666666666668
1939,5,$124.94999999999999,$24.99
1966,1,$2.99,$2.99
1967,1,$6.99,$6.99
1968,2,$22.939999999999998,$11.469999999999999
1970,3,$30.89,$10.296666666666667
1971,8,$126.89999999999999,$15.862499999999999
1972,2,$27.95,$13.975
1973,15,$101.13,$6.742


Spending on YA books peaked in 2014 and 2015.

Is there a correlation between ItemPrice and PublicationYear?

In [29]:
# TODO: Use the corr() function to determine if there is a correlation between
# the ItemPrice and PublicationYear.

books_df[['ItemPrice','PublicationYear']].corr()

Unnamed: 0,ItemPrice,PublicationYear
ItemPrice,1.0,-0.256709
PublicationYear,-0.256709,1.0


Price goes down over time, but the correlation is weak.

### Is YA more or less popular at any of the locations?

In [35]:
# TODO: Calculate the total number of books by location and the number of YA
# books by location and concatenate them into a single DataFrame. Add a new column
# to show the % of books by location that are YA. Format the values appropriately.

location_ya = books_df['ItemLocation'][books_df['YA_Category'] == 'YA'].value_counts()
location_ya.rename("YABookCount", inplace=True)

location_all = books_df['ItemLocation'].value_counts()
location_all.rename("TotalBookCount", inplace=True)

location_summary = pd.concat([location_all, location_ya], axis=1)

location_summary['PercentYA'] = (location_summary['YABookCount'] / location_summary['TotalBookCount'])

location_summary['TotalBookCount'] = location_summary['TotalBookCount']\
                                   .apply(lambda x: "{:,}".format(x, axis=1))
location_summary['YABookCount'] = location_summary['YABookCount']\
                                   .apply(lambda x: "{:,.0f}".format(x, axis=1))
location_summary['PercentYA'] = location_summary['PercentYA'].mul(100).round(1)
location_summary.sort_values(by=['PercentYA'], ascending=False)
location_summary

Unnamed: 0,TotalBookCount,YABookCount,PercentYA
Remote Shelving - Main,139969,26.0,0.0
Northeast,124339,4928.0,4.0
Southwest,121914,5413.0,4.4
Main,120742,12.0,0.0
South Central,115614,5238.0,4.5
Bon Air,74551,2072.0,2.8
St Matthews,69417,2636.0,3.8
Jeffersontown,56620,1425.0,2.5
Iroquois,52190,1910.0,3.7
Highlands - Shelby Park,45352,1457.0,3.2


Southwest is the location with the most YA books.