# Trend Analysis of the New York Times Best Sellers

## Table of contents

0. [Introduction](#intro)
1. [Data collection from the NYT](#nyt_data)
2. [Data collection from Goodreads](#goodreads_data)
3. [Data Cleaning and preparation](#data_preparation)
4. [Data analysis](#data_analysis)
5. [Conclusion](#conclusion)

<a class="anchor" id="intro"></a>
## Introduction

The goal of this project is to determine whether there are identifiable trends and patterns in the books that appear on The New York Times Best Sellers list. I focus mainly on genre: which genres are popular, and whether any trends over time can help us predict what will be popular next year.

In [2]:
# Libraries needed for the project
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime
import time

<a class="anchor" id="nyt_data"></a>
## 1. Data collection from the NYT

"The New York Times Best Seller list is widely considered the preeminent list of best-selling books in the United States. It has been published weekly in The New York Times Book Review since October 12, 1931. In the 21st century, it has evolved into multiple lists, grouped by genre and format, including fiction and non-fiction, hardcover, paperback and electronic." (Wikipedia)

In this project, we are interested in the following two lists, which started in 2011 and cover roughly the last 10 years of best sellers:
- Combined Print & E-Book Fiction
- Combined Print & E-Book Nonfiction

### The API

To obtain the data from the NYT, we need to extract it from their website. Luckily, The New York Times provides an easy-to-use [API](https://developer.nytimes.com/) for pulling data from their website, including the best sellers lists. The main limitation of the API is its rate limit: 4,000 requests per day and 10 requests per minute. This means that we have to sleep 6 seconds between calls to avoid hitting the per-minute limit. Since we want to pull about 10 years of weekly data from two lists, this comes to roughly 1,040 calls (52 weeks × 10 years × 2 lists) at 6 seconds each. The total time to collect the data will therefore be roughly 2 hours.
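
The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch: the 52-weeks-per-year figure is an approximation, and the variable names are mine.

```python
# Back-of-the-envelope estimate of the collection time (approximate figures)
weeks_per_year = 52          # approximation; the actual date range is a little over 10 years
years = 10
lists = 2                    # fiction and nonfiction
requests_per_minute = 10     # NYT per-minute rate limit

seconds_between_calls = 60 / requests_per_minute   # 6.0 seconds
total_calls = weeks_per_year * years * lists       # 1040 calls
total_hours = total_calls * seconds_between_calls / 3600

print(total_calls, round(total_hours, 2))
```

At roughly 1,000 calls, the run also stays well below the 4,000-requests-per-day limit.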

### The data

The data that we will be collecting was published from 2011-02-13 to 2021-05-23. Until 2017-01-29, roughly the first 6 years, each weekly list contained 20 books; later editions contain 15. Each list reflects data collected 15 days before it was published.

The data itself comes with the following information:
- **rank** - The current ranking of the book
- **rank last week** - What rank it was the previous week
- **weeks on list** - How many weeks it appeared on the list
- **primary_isbn10** - The main ISBN10
- **primary_isbn13** - The main ISBN13
- **publisher** - The name of the publisher
- **description** - Brief description of the book
- **title** - The title
- **author** - The author
- **bestsellers date** - When the data was collected
- **published date** - When the list was published

The NYT data contains a few more fields, such as a "dagger" column that flags entries that made the list in suspicious ways, such as bulk purchases. For this project, however, we will ignore this flag and consider every book on the list.

[<img src="images/poweredby_nytimes_150c.PNG">](https://developer.nytimes.com)

### Collecting the data

The first step is to find the encoded names of the lists that we are interested in so we can request them through the API.

In [None]:
my_key = "*****************"

In [None]:
url = "https://api.nytimes.com/svc/books/v3/lists/names.json?api-key={}".format(my_key)
r = requests.get(url).json()
nyt_lists = pd.DataFrame.from_dict(r['results'])
nyt_lists.head()

Unnamed: 0,list_name,display_name,list_name_encoded,oldest_published_date,newest_published_date,updated
0,Combined Print and E-Book Fiction,Combined Print & E-Book Fiction,combined-print-and-e-book-fiction,2011-02-13,2021-05-23,WEEKLY
1,Combined Print and E-Book Nonfiction,Combined Print & E-Book Nonfiction,combined-print-and-e-book-nonfiction,2011-02-13,2021-05-23,WEEKLY
2,Hardcover Fiction,Hardcover Fiction,hardcover-fiction,2008-06-08,2021-05-23,WEEKLY
3,Hardcover Nonfiction,Hardcover Nonfiction,hardcover-nonfiction,2008-06-08,2021-05-23,WEEKLY
4,Trade Fiction Paperback,Paperback Trade Fiction,trade-fiction-paperback,2008-06-08,2021-05-23,WEEKLY


We see that the encoded names of the lists that we are interested in are:
- combined-print-and-e-book-fiction
- combined-print-and-e-book-nonfiction

In [None]:
#list_name = "combined-print-and-e-book-fiction"
list_name = "combined-print-and-e-book-nonfiction"

We can also see that they span from 2011-02-13 to 2021-05-23. Since we are interested in the weekly lists, published every 7 days, we need to create a list of all the dates on which the lists were published.

In [73]:
start_date = datetime.date(2011, 2, 13)
end_date = datetime.date(2021,5,23)

dates = []
dates.append(start_date)

while dates[-1] < end_date:
    dates.append(dates[-1] + datetime.timedelta(days=7))
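
As a quick sanity check on the loop above, the same list can be built with a list comprehension and verified. This sketch assumes, as the table of list names suggests, that both endpoints are publication dates a whole number of weeks apart; `n_weeks` and `check_dates` are names I introduce here.

```python
import datetime

start_date = datetime.date(2011, 2, 13)
end_date = datetime.date(2021, 5, 23)

# Number of whole weeks between the first and last published lists
n_weeks = (end_date - start_date).days // 7

# One date per weekly list, first and last included
check_dates = [start_date + datetime.timedelta(weeks=i) for i in range(n_weeks + 1)]

print(len(check_dates), check_dates[0], check_dates[-1])
```

If the end date did not fall on a publication day, the `while` loop above would overshoot it by up to six days, whereas this version would stop short. Here the two agree because the range is a whole number of weeks.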

With the following code we can then request the whole history of a list. To collect both fiction and nonfiction, we simply change the list name above and rerun the code below.

In [139]:
# Obtaining the lists of best sellers
url = "https://api.nytimes.com/svc/books/v3/lists/{date}/{name}.json?api-key={key}"
weekly_frames = []

for date in dates:
    r = requests.get(url.format(date=date, name=list_name, key=my_key))
    # Suspend execution for 6 seconds after every call, including failed ones,
    # so we don't hit the per-minute rate limit
    time.sleep(6)
    if r.status_code != 200:
        print("ERROR {} at {}".format(r.status_code, date))
        continue

    # Getting the list for that week
    results = r.json()['results']
    df = pd.DataFrame.from_dict(results['books'])

    # Adding the bestsellers date and published date to the weekly list
    df['bestsellers_date'] = results['bestsellers_date']
    df['published_date'] = date
    weekly_frames.append(df)

# Combining the weekly lists together
weekly_bestsellers = pd.concat(weekly_frames, ignore_index=True)
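
To make the shape of a response clearer, here is a minimal sketch of the flattening step on its own, without any network calls. The payload below is a hand-written, heavily trimmed stand-in for `r.json()['results']` (the titles and authors are placeholders, not real API output), and `flatten_week` is a hypothetical helper.

```python
import datetime

def flatten_week(results, published_date):
    """Return one row (dict) per book, tagged with the two list dates."""
    rows = []
    for book in results["books"]:
        row = dict(book)  # copy so the payload is left untouched
        row["bestsellers_date"] = results["bestsellers_date"]
        row["published_date"] = published_date
        rows.append(row)
    return rows

# Hand-written placeholder payload, trimmed to a few fields
sample_results = {
    "bestsellers_date": "2011-01-29",
    "books": [
        {"rank": 1, "title": "PLACEHOLDER TITLE A", "author": "Author A"},
        {"rank": 2, "title": "PLACEHOLDER TITLE B", "author": "Author B"},
    ],
}

rows = flatten_week(sample_results, datetime.date(2011, 2, 13))
print(rows[0])
```

A list of such rows can be passed to `pd.DataFrame(rows)` to get a table with the same layout as the one the loop builds.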

In [120]:
weekly_bestsellers.columns

In [140]:
# Dropping unnecessary columns
weekly_bestsellers.drop(['asterisk', 'dagger', 'price','contributor', 'contributor_note', 'book_image',
                         'book_image_width', 'book_image_height', 'amazon_product_url', 'age_group',
                         'book_review_link', 'first_chapter_link', 'sunday_review_link',
                         'article_chapter_link', 'isbns', 'buy_links', 'book_uri'], axis=1, inplace=True)

In [140]:
# Saving to csv
weekly_bestsellers.to_csv("data/{}.csv".format(list_name), index=False)

<a class="anchor" id="goodreads_data"></a>
## 2. Data collection from Goodreads

What we are interested in is each book's genre, but the New York Times doesn't provide more detail than "Fiction" and "Nonfiction". To obtain the genres, we will use Goodreads, the most popular English-language platform for books, where users can vote for the genres that best match the books they've read. Unfortunately, Goodreads no longer provides API access to new users, so we will have to scrape the genres from its website.

The quickest way to scrape Goodreads is to enter the book's ISBN into its search engine, which takes us to the book's main page, from which we can scrape the genres. However, the ISBN data provided by the NYT is inaccurate and incomplete. A few hundred ISBNs are missing or unusable: for some books the NYT used its internal ID number instead, other ISBNs are not recognized by Goodreads, and some books don't have an ISBN at all (most often because they were self-published on Amazon). Therefore, there is a lot of cleaning work to do.

We can deal with faulty ISBNs by retrieving them from other sources such as Google Books, by searching Goodreads for the title and author instead of the ISBN, or simply by entering the values manually. None of these solutions works perfectly, some are tedious, and all require a lengthy verification process to make sure that we get the right Goodreads page for each book. I used a mix of these methods to fill in the missing or wrong data. I don't show them below because they involve a lot of manual verification, but I was eventually able to obtain the genres for the full list.
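
One inexpensive way to spot some of the faulty entries before querying Goodreads is to test the ISBN-13 check digit. This is not the method used in this project, only a sketch of a possible pre-filter; `is_valid_isbn13` is a hypothetical helper, and a value that passes the check can still be missing from Goodreads.

```python
def is_valid_isbn13(value):
    """Check length, digits and the ISBN-13 check digit (weights 1, 3, 1, 3, ...)."""
    digits = str(value).replace("-", "").strip()
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

print(is_valid_isbn13("9780306406157"))  # a well-formed ISBN-13
print(is_valid_isbn13("9780306406158"))  # same number with a wrong check digit
print(is_valid_isbn13("A00B00004063"))   # made-up stand-in for an internal id
```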

The code below gets the genres from Goodreads using the book's ISBN. I used Beautiful Soup with `requests`, which is easy to use but not the best option: it is very slow, since every request has to be redirected to the book's page. Using a crawling framework like Scrapy should be significantly faster.

In [None]:
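# Creating a function to extract the genres and their vote counts
def get_genre(soup):
    # The code assumes each genre entry sits in a <div class="left"> element
    tags = soup.find_all(lambda tag: tag.name == 'div' and
                                 tag.get('class') == ['left'])
    # Extracting the genre
    genres = []
    for tag in tags:
        genre = tag.find_all("a", {"class": "actionLinkLite bookPageGenreLink"})
        if not genre:
            continue  # no genre link in this element, nothing to extract
        # When several links are present, keep the last one
        genres.append(genre[-1].getText())

    # Extracting user votes
    votes = []
    vote_links = soup.find_all("a", {"class": "actionLinkLite greyText bookPageGenreLink"})
    for vote_link in vote_links:
        # Dropping the last 6 characters, which follow the vote count
        votes.append(vote_link.getText()[:-6])

    # Pairing each genre with its vote count
    return dict(zip(genres, votes))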

Note that we keep the number of votes for each genre. This helps us filter down to the genres that are relevant for the book. For instance, the main genre might have over 100 votes, while the one at the bottom might have only 2; a genre with only 2 votes is probably less relevant than those with more.
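
A vote-based filter of this kind could look like the sketch below. The threshold of 10 votes is an arbitrary assumption for illustration, `filter_genres` is a hypothetical helper, and the example dictionary is made up in the shape returned by `get_genre`.

```python
def filter_genres(genre_votes, min_votes=10):
    """Keep genres whose vote count reaches `min_votes`.

    `genre_votes` maps genre -> vote count as scraped text, e.g. '1,204'.
    """
    kept = {}
    for genre, votes in genre_votes.items():
        count = int(str(votes).replace(",", ""))  # '1,204' -> 1204
        if count >= min_votes:
            kept[genre] = count
    return kept

# Made-up example in the shape returned by get_genre
example = {"Biography": "1,204", "History": "310", "Audiobook": "2"}
print(filter_genres(example))
```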

In [None]:
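# `fiction` and `nonfiction` are assumed to hold the NYT lists collected in section 1
#ISBN_list = fiction['primary_isbn13'].unique()
ISBN_list = nonfiction['primary_isbn13'].unique()

URL = "https://www.goodreads.com/search?q={}"
genres = {}

for ISBN in ISBN_list:
    # Searching for an ISBN leads to the book's main page
    link = URL.format(ISBN)
    page = requests.get(link)
    if page.status_code != 200:
        print("ERROR {} at {}".format(page.status_code, link))
        continue
    soup = BeautifulSoup(page.content, 'html.parser')
    # Keyed by ISBN so the result can be merged back on 'primary_isbn13'
    genres[ISBN] = get_genre(soup)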

To get the genres for both the fiction and nonfiction lists, we can switch the commented lines in the cells above and below, then rerun the code.

In [None]:
genres = pd.DataFrame(list(genres.items()))
genres.columns = ['primary_isbn13','genres_dict']

#fiction = pd.merge(fiction,genres,on='primary_isbn13', how='left')
nonfiction = pd.merge(nonfiction,genres,on='primary_isbn13', how='left')

In [None]:
#Saving the work
fiction.to_csv('data/fiction_1.csv')
nonfiction.to_csv('data/nonfiction_1.csv')

<a class="anchor" id="data_preparation"></a>
## 3. Data Cleaning and Preparation

In [259]:
fiction = pd.read_csv("data/fiction_1.csv",index_col=0)
nonfiction = pd.read_csv("data/nonfiction_1.csv",index_col=0)
fiction.drop(['rank_last_week','primary_isbn10','primary_isbn13','description'], axis=1, inplace=True)
nonfiction.drop(['rank_last_week','primary_isbn10','primary_isbn13','description'], axis=1, inplace=True)

Giving each book a unique ID so we can more easily join the tables that we will create.

In [260]:
# ngroup() assigns the same integer to every row of a (title, author) group
fiction['id'] = fiction.groupby(['title','author']).ngroup()
nonfiction['id'] = nonfiction.groupby(['title','author']).ngroup()

### Publisher Cleaning  
Some publishers' names are written inconsistently: some appear both with and without 'Publishing' at the end, and some are listed under both the sub-publisher and the parent publisher. We want to fix these issues to get a better sense of which publishers are popular.

In [262]:
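# regex=True is required for the pattern to be treated as a regular expression in recent pandas versions
fiction['publisher'] = fiction['publisher'].str.replace(r' Press| Publishers| Publishing', '', regex=True)
nonfiction['publisher'] = nonfiction['publisher'].str.replace(r' Press| Publishers| Publishing', '', regex=True)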

In [263]:
# Keeping only the name before the first '/' (the .str accessor leaves missing values untouched)
fiction['publisher'] = fiction['publisher'].str.split('/').str[0]
nonfiction['publisher'] = nonfiction['publisher'].str.split('/').str[0]

In [264]:
pmap = {'Little, Brown':'Little, Brown & Company',
        'Little Brown':'Little, Brown & Company',
        'Little, Brown and Knopf':'Little, Brown & Company',
        'Little ,Brown':'Little, Brown & Company',
        'Knopf':'Knopf Doubleday',
        'HQN':'Harlequin',
        'Harlequin Mira':'Harlequin',
        'Harlequin HQN': 'Harlequin',
        'Doubleday':'Knopf Doubleday',
        'Penguin':'Penguin Group',
        'Harper':'HarperCollins'}

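# Assigning the result back instead of using inplace=True on a column selection
fiction['publisher'] = fiction['publisher'].replace(pmap)
nonfiction['publisher'] = nonfiction['publisher'].replace(pmap)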
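
The three cleaning steps can also be read as a single function applied to one name at a time. The sketch below mirrors them with the standard library; `normalize_publisher` is a hypothetical helper, `pmap_small` is a two-entry excerpt of the mapping above, and the input strings are illustrative rather than taken from the data.

```python
import re

# Excerpt of the mapping above, for illustration only
pmap_small = {"Little, Brown": "Little, Brown & Company", "Knopf": "Knopf Doubleday"}

def normalize_publisher(name, mapping):
    # 1. Drop the generic suffixes
    name = re.sub(r" Press| Publishers| Publishing", "", name)
    # 2. Keep only the part before the first '/'
    name = name.split("/")[0]
    # 3. Map known variants to one canonical name (exact match on the whole string)
    return mapping.get(name, name)

print(normalize_publisher("Knopf Publishing/Doubleday", pmap_small))
print(normalize_publisher("Little, Brown", pmap_small))
```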

In [265]:
fiction.to_csv('data/fiction.csv',index=False)
nonfiction.to_csv('data/nonfiction.csv',index=False)

### Genre
Since we stored the genres from Goodreads in a dictionary whose keys are genres and whose values are votes, we need to extract the genres and votes from that dictionary. This involves "exploding" the dictionary into columns, then "melting" the columns back down into rows.

In [266]:
import ast  # literal_eval parses the stored dictionary strings without executing arbitrary code

#fiction
fiction_genres = pd.concat([fiction, fiction['genres_dict'].map(ast.literal_eval).apply(pd.Series)], axis=1)
fiction_genres = fiction_genres.melt(id_vars=fiction.columns,
                                     var_name='genre', 
                                     value_name='votes')
fiction_genres = fiction_genres.dropna(subset=['votes']).reset_index(drop=True)
fiction_genres['votes'] = fiction_genres['votes'].astype(str).str.replace(',','').astype(float).astype(int)

#nonfiction
nonfiction_genres = pd.concat([nonfiction, nonfiction['genres_dict'].map(ast.literal_eval).apply(pd.Series)], axis=1)
nonfiction_genres = nonfiction_genres.melt(id_vars=nonfiction.columns,
                                           var_name='genre', 
                                           value_name='votes')
nonfiction_genres = nonfiction_genres.dropna(subset=['votes']).reset_index(drop=True)
nonfiction_genres['votes'] = nonfiction_genres['votes'].astype(str).str.replace(',','').astype(float).astype(int)
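
To see what the explode-then-melt steps do, here is the same pattern on a two-book toy table. The titles and vote counts are made up, and the sketch parses the stored dictionary strings with `ast.literal_eval`.

```python
import ast
import pandas as pd

books = pd.DataFrame({
    "title": ["BOOK A", "BOOK B"],
    "genres_dict": ["{'Mystery': '1,204', 'Thriller': '310'}", "{'Romance': '57'}"],
})

# "Explode": parse each string into a dict, then spread it into one column per genre
wide = books["genres_dict"].map(ast.literal_eval).apply(pd.Series)

# "Melt": turn the genre columns back into (genre, votes) rows and drop the empty pairs
long_df = (pd.concat([books[["title"]], wide], axis=1)
             .melt(id_vars=["title"], var_name="genre", value_name="votes")
             .dropna(subset=["votes"])
             .reset_index(drop=True))

# Vote counts arrive as text with thousands separators
long_df["votes"] = long_df["votes"].str.replace(",", "", regex=False).astype(int)
print(long_df)
```

Each book ends up with one row per genre it was voted into, which is the layout the later `groupby` calls rely on.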

In [267]:
#Removing the dictionary column
fiction_genres.drop('genres_dict', axis=1, inplace=True)
nonfiction_genres.drop('genres_dict', axis=1, inplace=True)

We also remove a few unnecessary or redundant genres that won't help our analysis.

In [268]:
remove_list = ['Fiction','Nonfiction','Audiobook','Ebook','Amazon','Novels','Unfinished','Roman','Book Club',
               'Adult','Adult Fiction','Anthologies','Collections','Mystery Thriller','Biography Memoir','Autobiography','Historical']
fiction_genres = fiction_genres[~fiction_genres['genre'].isin(remove_list)]
nonfiction_genres = nonfiction_genres[~nonfiction_genres['genre'].isin(remove_list)]

We can also extract the most-voted genre for each list entry, which we can safely assume is the primary genre for most books.

In [269]:
fiction_top_genres = fiction_genres.loc[fiction_genres.groupby(['bestsellers_date','rank'])['votes'].idxmax()].reset_index(drop=True)
nonfiction_top_genres = nonfiction_genres.loc[nonfiction_genres.groupby(['bestsellers_date','rank'])['votes'].idxmax()].reset_index(drop=True)
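
The `groupby(...).idxmax()` pattern used above is easier to follow on a small example. The rows below are made up; only the column names match the real tables.

```python
import pandas as pd

toy = pd.DataFrame({
    "bestsellers_date": ["2011-01-29"] * 3 + ["2011-02-05"] * 2,
    "rank": [1, 1, 2, 1, 1],
    "genre": ["Mystery", "Thriller", "Romance", "Fantasy", "Paranormal"],
    "votes": [1204, 310, 57, 15, 80],
})

# For each (date, rank) pair, i.e. each list entry, idxmax returns the index label
# of the row with the most votes
top_idx = toy.groupby(["bestsellers_date", "rank"])["votes"].idxmax()

# Selecting those rows keeps exactly one genre per list entry
top = toy.loc[top_idx].reset_index(drop=True)
print(top[["bestsellers_date", "rank", "genre"]])
```

When two genres are tied, `idxmax` keeps the first one it encounters, so ties are resolved by row order rather than by any notion of relevance.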

In [270]:
#Saving DataFrame
fiction_genres.to_csv('data/fiction_genres.csv', index=False)
nonfiction_genres.to_csv('data/nonfiction_genres.csv', index=False)

fiction_top_genres.to_csv('data/fiction_top_genres.csv', index=False)
nonfiction_top_genres.to_csv('data/nonfiction_top_genres.csv', index=False)

### Author

Since some authors co-author more books than they write alone (e.g. James Patterson), we need to split books with multiple author names into individual rows in order to portray an author's success accurately. For instance, for books authored by "James Patterson and Maxine Paetro", we want a row for "James Patterson" and another for "Maxine Paetro". Below, I create a separate dataframe with one row per author.

In [271]:
from itertools import chain

#fiction
cols = fiction.columns.difference(['author'])
authors = fiction['author'].str.split(' and | with ')

fiction_author = (fiction.loc[fiction.index.repeat(authors.str.len()), cols]
         .assign(author=list(chain.from_iterable(authors.tolist()))))

#nonfiction
nonfiction_author = nonfiction.dropna(subset=['author'])
cols = nonfiction_author.columns.difference(['author'])
authors = nonfiction_author['author'].str.split(' and | with ')

nonfiction_author = (nonfiction_author.loc[nonfiction_author.index.repeat(authors.str.len()), cols]
         .assign(author=list(chain.from_iterable(authors.tolist()))))
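
The splitting rule itself can be tried on a single record with the standard library. In this sketch, `split_authors` is a hypothetical helper and the title is a placeholder.

```python
import re

def split_authors(author):
    """Split a credit line on ' and ' / ' with ', the same separators used above."""
    return re.split(r" and | with ", author)

book = {"title": "PLACEHOLDER TITLE", "author": "James Patterson and Maxine Paetro"}

# One row per credited author, with the other fields repeated
rows = [{**book, "author": name} for name in split_authors(book["author"])]
print(rows)
```

A name that itself contains " and " or " with " would also be split, so the result is worth spot-checking.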

In [272]:
fiction_author.to_csv('data/fiction_author.csv', index=False)
nonfiction_author.to_csv('data/nonfiction_author.csv', index=False)

### Title

It might also be insightful to look at book titles and the popular words that readers might latch onto. To do so, we first remove symbols and fragments that are of no use to us. We then remove stopwords such as 'a' and 'the', which are not informative either.

In [273]:
# Raw string so the backslashes reach the regex engine unchanged
regex_remove = r"\d(th|st|rd|-)|\d|,|&|/|>|robert b. parker's|tom clancy:|[’']s|:|i've|i'm|#|b\.|_|\(|\)|\.\.\.|\.|we're|\?|%"

#fiction
fiction_title = fiction.copy()
fiction_title['title_word'] = fiction_title['title'].str.lower().str.replace(regex_remove, '', regex=True).str.split()
fiction_title = fiction_title.explode('title_word')

#nonfiction
nonfiction_title = nonfiction.copy()
nonfiction_title['title_word'] = nonfiction_title['title'].str.lower().str.replace(regex_remove, '', regex=True).str.split()
nonfiction_title = nonfiction_title.explode('title_word')

In [274]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\micha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [275]:
fiction_title = fiction_title[~fiction_title['title_word'].isin(stop_words)]
nonfiction_title = nonfiction_title[~nonfiction_title['title_word'].isin(stop_words)]
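
The whole title pipeline (lowercase, strip, split, drop stopwords) can be condensed into one function for a single title. The sketch below uses a deliberately simplified pattern rather than the full `regex_remove` above, a tiny stand-in for the NLTK stop-word list, and a made-up title; `title_words` is a hypothetical helper.

```python
import re

# Tiny stand-in for the NLTK stop-word list (assumption: the real list is much longer)
toy_stop_words = {"a", "the", "of", "in", "and", "to"}

def title_words(title, stop_words):
    # Remove possessive 's, digits and punctuation, then split on whitespace
    cleaned = re.sub(r"[’']s|\d+|[^\w\s]", "", title.lower())
    return [w for w in cleaned.split() if w not in stop_words]

print(title_words("The Girl's Guide to 2 Cities: A Novel", toy_stop_words))
```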

In [276]:
fiction_title.to_csv('data/fiction_title.csv')
nonfiction_title.to_csv('data/nonfiction_title.csv')

<a class="anchor" id="data_analysis"></a>
## 4. Data Analysis

### A Brief Overview

We could calculate summary statistics for publishers, authors, genres and title words, but we are mostly interested in genre trends. To avoid cluttering the notebook with unnecessary statistics, I removed most of them and kept only a few genre counts that will help us understand the trends:

In [277]:
fiction_genres[fiction_genres['weeks_on_list'] == 1]['genre'].value_counts()[:10]

Romance                 1274
Mystery                 1220
Contemporary            1174
Suspense                1042
Thriller                1026
Crime                    794
Contemporary Romance     607
Chick Lit                553
Fantasy                  471
Historical Fiction       388
Name: genre, dtype: int64

In [278]:
nonfiction_genres[nonfiction_genres['weeks_on_list'] == 1]['genre'].value_counts()[:10]

Biography           1042
History              902
Memoir               782
Politics             629
American History     339
Science              219
Humor                211
Business             180
War                  172
Psychology           153
Name: genre, dtype: int64

The lists above count every genre attached to a book, with each book counted once, in its first week on the list (`weeks_on_list == 1`). This is a good way to see the general popularity of genres, but it doesn't give an accurate picture because a single book can be categorized under several genres. For instance, a book about Benjamin Franklin might be categorized under both biography and history, even though its main genre is biography.

The counts below are restricted to the main (most-voted) genre of each book.

In [279]:
fiction_top_genres[fiction_top_genres['weeks_on_list'] == 1]['genre'].value_counts()[:10]

Mystery               691
Romance               684
Thriller              207
Historical Fiction    172
Fantasy               106
Paranormal             98
Chick Lit              60
Contemporary           58
Urban Fantasy          47
Science Fiction        42
Name: genre, dtype: int64

In [280]:
nonfiction_top_genres[nonfiction_top_genres['weeks_on_list'] == 1]['genre'].value_counts()[:10]

History       284
Politics      275
Memoir        268
Biography     177
Science        69
Humor          48
Sports         45
Business       43
True Crime     38
Music          37
Name: genre, dtype: int64

### Trend analysis

Trend analysis can be done in Python, but for the kind of analysis we are doing it isn't the most efficient tool. An easier and more practical way is to explore the data in Tableau. Here are the resulting graphs:

*Click the image to redirect to Tableau Public*

[<img src="images/fiction_nonfiction.PNG">](https://public.tableau.com/app/profile/michael2724/viz/NYTBestSellersGenreTrends/NYTBestSellersTrendsofPopularGenresTop5)


**Fiction**  

Looking at the plot for fiction, we can see that mystery and romance are the most popular genres, as we observed previously, but both are slowly declining in favor of other genres. Most striking is the sharp decline of romance, which seems to be losing ground to contemporary, historical fiction and thriller. Moreover, historical fiction is the genre most on the rise, and it looks like it might eventually overtake mystery as the most popular genre.

**Nonfiction**

Looking at the plot for nonfiction, we can see a significant increase in popularity for memoirs and politics, while the other genres seem to be slowly declining. One important caveat about politics is that its rise in popularity seems to coincide with the election of Donald Trump, which brought much more interest in politics than before. This rise also seems to have been fading over the last two years, which suggests that politics may return to levels closer to those before 2016.

We should also note that while history seems to be declining, it is actually growing in popularity. Books classified purely as history are not growing, but books with a historical theme are; they are simply categorized under another main genre, such as biography or sports. Looking at the graph of all the genres attached to each book on the list, we can see that history books are growing in popularity.

<img src="images/Nonfiction Genres.PNG">
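
For readers who prefer to stay in Python, the kind of per-year genre count that charts like these are built on can be approximated with a pandas pivot. This is a sketch on made-up rows, not the data or the exact calculation behind the Tableau charts; the column names follow the `*_top_genres` tables.

```python
import pandas as pd

toy = pd.DataFrame({
    "published_date": ["2011-02-13", "2011-06-05", "2012-03-04", "2012-07-01", "2012-09-02"],
    "genre": ["Romance", "Mystery", "Romance", "Mystery", "Mystery"],
})

toy["year"] = pd.to_datetime(toy["published_date"]).dt.year

# Rows: years, columns: genres, values: number of list entries
counts = toy.groupby(["year", "genre"]).size().unstack(fill_value=0)

# Share of each genre within a year, so years with different list lengths stay comparable
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(2))
```

Working with shares rather than raw counts matters here because the weekly list shrank from 20 to 15 books in 2017, which would otherwise look like a decline in every genre.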

<a class="anchor" id="conclusion"></a>
## 5. Conclusion

To keep the conclusion short, we have observed the following notable trends:
- Romance is declining in popularity
- Memoirs are increasing in popularity
- History is increasing in popularity, for both fiction and nonfiction
