# Classifying Genre with Book Descriptions

By: Johnathon Smith

Date: Dec 28, 2021
***

### Executive Summary
***

__Project Goal__

The goal of this project was to build a model that accurately classifies a book's genre as Horror, Romance, Mystery and Crime, or Sci-Fi and Fantasy based on its cover blurb (its description). The underlying question: does the description actually describe the book, or is it just designed to sell it?

__Overall Findings__

* The descriptions within each genre share consistent patterns that distinguish it from the other genres, which allows them to be classified accurately.
* My best model had a mean cross-validated accuracy of about 88%.

***

### My Process

* Write a README.md file that details my process, my findings, and instructions on how to recreate my project.
* Acquire the data by web scraping the book descriptions and genres from the Barnes & Noble website.
* Clean and prepare the data:
    * Change all characters to lowercase
    * Normalize and encode the characters
    * Remove any character that is not a letter, number, whitespace, or single quote
    * Tokenize the strings
    * Remove stop words
    * Create stemmed and lemmatized versions of each string
    * Create engineered features
* Explore the train data set and look for identifying features for each genre.
* Set a baseline using the Dummy Classifier.
* Create and evaluate models on the train data set using GridSearchCV.
* Choose the best model and evaluate it on the test data set.
* Document conclusions, takeaways, and next steps in the Final Report Notebook.
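
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in the project's `prepare` module: the function names, the whitespace tokenizer, and the tiny stop-word list are all assumptions made for the example, and stemming and lemmatization are left out because they need an NLP library.

```python
import re
import unicodedata
from typing import List

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the project's actual list is not shown in this notebook.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it"}


def basic_clean(text: str) -> str:
    """Lowercase, normalize to ASCII, and drop disallowed characters."""
    text = text.lower()
    # Decompose accented characters, then discard anything that is not ASCII.
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8")
    )
    # Keep only letters, digits, whitespace, and single quotes.
    return re.sub(r"[^a-z0-9'\s]", "", text)


def tokenize(text: str) -> List[str]:
    """Split on whitespace (a stand-in for a real tokenizer)."""
    return text.split()


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


# A made-up sentence, not one of the scraped blurbs.
blurb = "A charming, six-volume series: the café's best ghost stories!"
tokens = remove_stop_words(tokenize(basic_clean(blurb)))
print(" ".join(tokens))
```

Note that removing the hyphen outright joins "six-volume" into "sixvolume", the same kind of joined token that appears in the `clean` column of the prepared data (for example, "limitedrun").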

***

### Necessary Imports

In [1]:
import numpy as np
import pandas as pd

import scipy.stats as stats

#Custom modules
import acquire
import prepare
import explore
import model

### Wrangle

***

| Target | Datatype | Definition |
|:-------|:---------|:-----------|
| genre | String | The overall genre of the book according to the Barnes & Noble website. |


| Feature | Datatype | Definition |
|:--------|:---------|:----------- |
| sub-genre | String | The subject of the book according to the Barnes & Noble website. |
| original | String | The original book description as found on the Barnes & Noble website. |
| clean | String | The cleaned version of the book description. |
| stemmed | String | The cleaned, stemmed version of the book description. |
| lemmatized | String | The cleaned, lemmatized version of the book description. |
| lem_char_count | int | The character count of the lemmatized book description. |
| lem_word_count | int | The word count of the lemmatized book description. |
| lem_unique_word_count | int | The unique word count of the lemmatized book description. |
| sentence_count | int | The sentence count of the original book description. |
| avg_words_per_sentence | int | The average number of words per sentence. |
| sentiment | float | The compound sentiment analysis score. Ranges from -1 to 1. |
| stopword_count | int | The number of stopwords found in the original description. |
| word_stopword_ratio | float | The ratio of stopwords to all other words found in the description. |
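
To make the count-based features concrete, here is a small sketch of how they could be computed. It is not the project's `prepare.prep_for_exploration` code: the function name, the stop-word list, and the sentence-splitting heuristic are assumptions. The two ratio formulas (`lem_word_count / sentence_count` and `stopword_count / lem_word_count`) are also assumptions, chosen because they agree with the sample rows shown by `train.head()` later in this notebook. The sentiment score is omitted because it needs a separate sentiment library.

```python
import re
from typing import Dict, Set

# Hypothetical stop-word list for illustration only.
STOP_WORDS: Set[str] = {"a", "an", "and", "the", "to", "of", "in", "is", "was"}


def count_features(original: str, lemmatized: str) -> Dict[str, float]:
    """Compute the count-based features for a single description."""
    lem_words = lemmatized.split()
    # Treat runs of ., !, or ? as sentence boundaries (a rough heuristic).
    sentences = [s for s in re.split(r"[.!?]+", original) if s.strip()]
    original_words = re.findall(r"[a-z0-9']+", original.lower())
    stopword_count = sum(word in STOP_WORDS for word in original_words)
    return {
        "lem_char_count": len(lemmatized),
        "lem_word_count": len(lem_words),
        "lem_unique_word_count": len(set(lem_words)),
        "sentence_count": len(sentences),
        # Assumed formula: lemmatized words per sentence, rounded to an int.
        "avg_words_per_sentence": (
            round(len(lem_words) / len(sentences)) if sentences else 0
        ),
        "stopword_count": stopword_count,
        # Assumed formula: stop words relative to the remaining words.
        "word_stopword_ratio": (
            round(stopword_count / len(lem_words), 2) if lem_words else 0.0
        ),
    }


# Made-up description and a hand-written lemmatized version of it.
features = count_features(
    original="The house was old. It was also haunted!",
    lemmatized="house old also haunted",
)
print(features)
```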

__Acquire the Data__

The following line of code scrapes the Barnes & Noble website for all book descriptions across the Horror, Romance, Mystery and Crime, and Sci-Fi and Fantasy genres. The original run took about 8 hours to gather all of the data, so I suggest loading the pre-made dataset instead. If you choose to run the function, be aware that it may not run at all if the website's structure has changed since I wrote it.

In [2]:
#Acquire the original data
#book_blurbs = acquire.acquire_data()

__Prepare the Data__

The following line of code prepares the original web-scraped data. Only run it if you chose to run the previous line of code.

In [3]:
#Prepare the data
#book_blurbs = prepare.prepare_articles(book_blurbs, 'blurb')

__Load the Prepared Dataset__

This is the suggested and default action when running this notebook. It will load the prepared data from a saved .csv file.

In [4]:
#Load the prepared data
book_blurbs = pd.read_csv('cleaned_book_blurbs.csv')

__Brief Overview__

Take a quick look at the data before moving on to the explore section.

In [5]:
book_blurbs.head()

|   | genre | sub-genre | original | clean | stemmed | lemmatized |
|:--|:------|:----------|:---------|:------|:--------|:-----------|
| 0 | Horror | ghost-stories | Designed to appeal to the book lover, the Macm... | designed appeal book lover macmillan collector... | design appeal book lover macmillan collector '... | designed appeal book lover macmillan collector... |
| 1 | Horror | ghost-stories | Part of the Penguin Orange Collection, a limit... | part penguin orange collection limitedrun seri... | part penguin orang collect limitedrun seri twe... | part penguin orange collection limitedrun seri... |
| 2 | Horror | ghost-stories | Part of a new six-volume series of the best in... | part new sixvolume series best classic horror ... | part new sixvolum seri best classic horror sel... | part new sixvolume series best classic horror ... |
| 3 | Horror | ghost-stories | A USA TODAY BESTSELLER!An Indie Next Pick!An O... | usa today bestselleran indie next pickan octob... | usa today bestselleran indi next pickan octob ... | usa today bestselleran indie next pickan octob... |
| 4 | Horror | ghost-stories | From the New York Times best-selling author of... | new york times bestselling author southern boo... | new york time bestsel author southern book clu... | new york time bestselling author southern book... |


In [6]:
book_blurbs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21414 entries, 0 to 21413
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   genre       21414 non-null  object
 1   sub-genre   21414 non-null  object
 2   original    21414 non-null  object
 3   clean       21414 non-null  object
 4   stemmed     21414 non-null  object
 5   lemmatized  21414 non-null  object
dtypes: object(6)
memory usage: 1003.9+ KB


__Target Distribution__

What is the distribution of the target variable?

In [7]:
book_blurbs.genre.value_counts(normalize = True)

Sci-Fi and Fantasy    0.317596
Horror                0.300551
Mystery and Crime     0.206734
Romance               0.175119
Name: genre, dtype: float64

Although the Mystery and Crime and Romance genres have lower counts than the other two, I don't think the imbalance is large enough to warrant resampling.
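
As a rough check on what these proportions imply, the short calculation below derives the accuracy that a most-frequent-class baseline would reach and the ratio between the largest and smallest classes. The proportions are copied from the output above; they describe the full data set, so the figure for the train split alone may differ slightly. The variable names are only illustrative.

```python
# Class proportions copied from the value_counts output above.
genre_share = {
    "Sci-Fi and Fantasy": 0.317596,
    "Horror": 0.300551,
    "Mystery and Crime": 0.206734,
    "Romance": 0.175119,
}

# A most-frequent-class baseline always predicts the largest genre,
# so its accuracy equals that genre's share of the data.
majority_genre = max(genre_share, key=genre_share.get)
baseline_accuracy = genre_share[majority_genre]

# Ratio of the largest class to the smallest, a rough imbalance measure.
imbalance_ratio = max(genre_share.values()) / min(genre_share.values())

print(f"Majority class: {majority_genre}")
print(f"Most-frequent baseline accuracy: {baseline_accuracy:.1%}")
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.2f}")
```

Any model worth keeping should therefore beat roughly 32% accuracy by a wide margin.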

__Key Takeaways__

* There are 4 different overall genres.
* There are 21,414 total entries.
* The target variable is not perfectly balanced, but I chose not to resample.
* The data set includes cleaned, stemmed, and lemmatized versions of the original book descriptions.

***
### Explore

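Add engineered features to the data set for use in exploration. Generating them from scratch may take some time, so the cell below loads a saved copy by default.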

In [8]:
#The following function call adds engineered features to the data set.
#book_blurbs = prepare.prep_for_exploration(book_blurbs)

book_blurbs = pd.read_csv('blurbs_for_exploration.csv')

__Splitting the Data__

I will split the data into train and test sets. A validate set will not be necessary because I will be utilizing cross-validation in my modeling section.

In [10]:
train, test = prepare.split(book_blurbs)
train.shape, test.shape

((16060, 14), (5354, 14))
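
The shapes above correspond to roughly a 75/25 split of the 21,414 rows. Since `prepare.split` is not shown in this notebook, the following is only a sketch of a shuffled holdout split that produces the same sizes. The function name, the seed, the 25% test fraction, and rounding the test size up are all assumptions, and the sketch does not stratify by genre.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_rows(
    rows: Sequence, test_fraction: float = 0.25, seed: int = 123
) -> Tuple[List, List]:
    """Shuffle rows and hold out a fraction of them as the test set."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Round the test size up (an assumption that matches the shapes above).
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices stand in for the 21,414 book descriptions.
train_rows, test_rows = split_rows(range(21414))
print(len(train_rows), len(test_rows))  # 16060 5354
```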

In [11]:
#Take a look at the engineered features 
train.head()

|   | genre | sub-genre | original | clean | stemmed | lemmatized | lem_char_count | lem_word_count | lem_unique_word_count | sentence_count | avg_words_per_sentence | sentiment | stopword_count | word_stopword_ratio |
|:--|:------|:----------|:---------|:------|:--------|:-----------|:---------------|:---------------|:----------------------|:---------------|:-----------------------|:----------|:---------------|:--------------------|
| 17405 | Sci-Fi and Fantasy | other-fantasy-fiction-categories | A charmingly witty fantasy adventure starring ... | charmingly witty fantasy adventure starring gr... | charmingli witti fantasi adventur star greta h... | charmingly witty fantasy adventure starring gr... | 1140 | 152 | 128 | 8 | 19 | 0.9734 | 78 | 0.51 |
| 11323 | Mystery and Crime | crime-fiction | A #1 New York Times BestsellerVirgil Flowers i... | 1 new york times bestsellervirgil flowers inve... | 1 new york time bestsellervirgil flower invest... | 1 new york time bestsellervirgil flower invest... | 671 | 89 | 82 | 10 | 9 | 0.7574 | 48 | 0.54 |
| 18975 | Sci-Fi and Fantasy | science-fiction-fantasy-media-tie-in-fiction | An original novel set in the Halo Universe and... | original novel set halo universe based new yor... | origin novel set halo univers base new york ti... | original novel set halo universe based new yor... | 965 | 126 | 119 | 6 | 21 | -0.9531 | 76 | 0.6 |
| 8934 | Romance | other-romance-categories | New York Times bestselling author Michelle Sag... | new york times bestselling author michelle sag... | new york time bestsel author michel sagara swe... | new york time bestselling author michelle saga... | 904 | 130 | 110 | 7 | 19 | -0.9698 | 85 | 0.65 |
| 594 | Horror | ghost-stories-other | When a late night storm drives a young couple ... | late night storm drives young couple take refu... | late night storm drive young coupl take refug ... | late night storm drive young couple take refug... | 975 | 140 | 125 | 8 | 18 | 0.2003 | 114 | 0.81 |
