# DSCI 320 Group Project

## Members
* Tien Nguyen
* Edison Le
* Javier De Ezkauriatza
* Gabrielle Wong

The paragraphs below describe the dataset. They are taken from the [source README](https://github.com/shlrley/amazon_bestsellers/blob/main/README.md) in the TA's (Shirley's) repo.

### **Merged Datasets**

1. [Amazon bestsellers](https://www.kaggle.com/datasets/sootersaalu/amazon-top-50-bestselling-books-2009-2019) 
   - "Dataset on Amazon's Top 50 bestselling books from 2009 to 2019. Contains 550 books, data has been categorized into fiction and non-fiction using Goodreads."  
2. [Conlit](https://figshare.com/articles/dataset/CONLIT/21166171/1?file=37535605)
   - "This dataset includes derived data on a collection of ca. 2,700 books in English published between 2001-2021 and spanning twelve different genres."
3. [Goodreads best books ever](https://zenodo.org/record/4265096#.ZAgSxOzMKvA)
   - "The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the larges list on the site)."
4. [New York Times (NYT) bestsellers](https://www.kaggle.com/datasets/dhruvildave/new-york-times-best-sellers)
   - "The data contains Best Sellers List published by The New York Times every Sunday. The temporal range is from 03-Jan-2010 to 29-Dec-2019 which makes it a whole decade of data. Each week, 5 books are named as best sellers for each category."

### **Column Descriptions**

| Column      | Description | Dataset Source | Column Name in Original Dataset |
| ----------- | ----------- |  -----------   |  -----------   |
| 1. `title` | Title of the book (str) | Amazon | `Name` | 
| 2. `amazon_author` | Author (first and last name) of the book (str) | Amazon | `Author` | 
| 3. `amazon_rating` | Average rating of the book given by Amazon users on a scale of 1 to 5 (float) | Amazon | `User Rating` | 
| 4. `amazon_num_reviews` | Number of written reviews of the book given on Amazon (int) | Amazon | `Reviews` | 
| 5. `amazon_price`  | Price of the book as of October 13, 2020 (int) | Amazon | `Price` | 
| 6. `amazon_year` | Year the book was ranked on the bestsellers list | Amazon | `Year` | 
| 7. `amazon_genre` | Whether the book is fiction or non-fiction (str) | Amazon | `Genre` | 
| 8. `conlit_genre` | Genre of the book, lists 1 genre out of 12 categories (str) | Conlit | `Genre` | 
| 9. `conlit_pubdate` | Original publication date of the book | Conlit | `Pubdate` | 
| 10. `conlit_author_gender` | Gender of the author (M/F/O) (str) | Conlit | `Author_Gender` | 
| 11. `conlit_author_nationality` | Nationality of the author (str) | Conlit | `Author_Nationality` | 
| 12. `conlit_total_ratings` | Total number of ratings of the book on Goodreads as of May 23, 2022 (int) | Conlit | `total_ratings` | 
| 13. `goodreads_rating` | Global Goodreads rating (float) | Goodreads | `rating` | 
| 14. `goodreads_series` | Series name (str) | Goodreads | `series` | 
| 15. `goodreads_genres` | Genre(s) of the book (list[str]) | Goodreads | `genres` | 
| 16. `goodreads_edition` | Type of edition (str) | Goodreads | `edition` | 
| 17. `goodreads_publisher` | Publisher of book (str) | Goodreads | `publisher` | 
| 18. `goodreads_publish_date` | Publication date | Goodreads | `publishDate` | 
| 19. `goodreads_first_publish_date` | Publication date of first edition | Goodreads | `firstPublishDate` | 
| 20. `goodreads_awards` | List of awards received by the book (list[str]) | Goodreads | `awards` | 
| 21. `goodreads_num_ratings` | Number of total ratings on Goodreads (int) | Goodreads | `numRatings` | 
| 22. `goodreads_likedPercent` | Percent of ratings over 2 stars on Goodreads (float) | Goodreads | `likedPercent` | 
| 23. `goodreads_price` | Price of the book (extracted from Iberlibro) (float) | Goodreads | `price` | 
| 24. `nyt_published_date` | Date the list was published | NYT | `published_date` | 
| 25. `nyt_list_name_encoded` | Category of the list (str) | NYT | `list_name_encoded` | 
| 26. `nyt_price` | Price of the book (float) | NYT | `price` | 
| 27. `nyt_weeks_on_list` | Number of weeks the book was on the best sellers list (int) | NYT | `weeks_on_list` | 


### **Final Dataset Description**
- Rows: 366 
- Columns: 27 
- Unique books: 222 

### The code and content from this point onward are authored by the students (us)

# Milestone 1

In [1]:
import numpy as np
import pandas as pd
import altair as alt
import ast

In [2]:
url = "https://github.com/kemiolamudzengi/dsci-320-datasets/blob/main/amazon_conlit_goodreads_nyt.csv?raw=true"
books = pd.read_csv(url, parse_dates=['amazon_year', 'conlit_pubdate', 'nyt_published_date',
                                      'goodreads_publish_date', 'goodreads_first_publish_date'
                                     ] )
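
To illustrate what `parse_dates` does in the cell above, here is a minimal sketch on an invented two-row CSV. The rows and the names `csv_text` and `sample` are assumptions for illustration and are not part of the real file. Columns listed in `parse_dates` come back as datetimes, and empty cells become `NaT`.

```python
import io

import pandas as pd

# Invented CSV text standing in for the real file; the values are made up.
csv_text = (
    "title,goodreads_publish_date,nyt_published_date\n"
    "Book A,2011-11-08,2012-02-26\n"
    "Book B,2011-07-12,\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["goodreads_publish_date", "nyt_published_date"],
)

print(sample.dtypes)                         # both date columns have a datetime dtype
print(sample["nyt_published_date"].isna())   # the empty cell was read as NaT
```

This is why `NaT` shows up in the date columns of the outputs further down: it is the datetime counterpart of `NaN`.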

Convert `goodreads_genres` and `goodreads_awards` from string representations of lists into actual Python lists.

For example: `"[1, 2, 3]"` -> `[1, 2, 3]`

In [3]:
# ast.literal_eval safely parses the string form of a list back into a real list
books['goodreads_genres'] = books['goodreads_genres'].apply(ast.literal_eval)
books['goodreads_awards'] = books['goodreads_awards'].apply(ast.literal_eval)
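
As a standalone illustration of the conversion above, the sketch below parses one invented stringified list. The helper `parse_list` is hypothetical and not part of the notebook; it only shows one way to guard against non-string cells such as `NaN`, which `ast.literal_eval` rejects with a `ValueError`. Both columns here have no missing values, so the guard is not needed for this dataset.

```python
import ast

# Invented example of a list stored as a string, as in goodreads_genres.
raw = "['Fiction', 'Historical Fiction']"
parsed = ast.literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[0])


def parse_list(cell):
    """Parse a string cell with literal_eval; return [] for non-string cells such as NaN."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return []


print(parse_list("[1, 2, 3]"))
print(parse_list(float("nan")))
```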

## Take a good look at the data

In [4]:
books.head()

|  | title | amazon_author | amazon_rating | amazon_num_reviews | amazon_price | amazon_year | amazon_genre | conlit_genre | conlit_pubdate | conlit_author_gender | ... | goodreads_publish_date | goodreads_first_publish_date | goodreads_awards | goodreads_num_ratings | goodreads_likedPercent | goodreads_price | nyt_published_date | nyt_list_name_encoded | nyt_price | nyt_weeks_on_list |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11/22/63: A Novel | Stephen King | 4.6 | 2052 | 22 | 2011-01-01 | Fiction | BS | 2011-01-01 | M | ... | 2011-11-08 | NaT | [Locus Award Nominee for Best SF Novel (2012),... | 420225.0 | 96.0 | 6.21 | 2012-02-26 | hardcover-fiction | 35.0 | 14.0 |
| 1 | A Dance with Dragons (A Song of Ice and Fire) | George R. R. Martin | 4.4 | 12643 | 11 | 2011-01-01 | Fiction | BS | 2011-01-01 | M | ... | 2011-07-12 | NaT | [Hugo Award Nominee for Best Novel (2012), Loc... | 555900.0 | 97.0 | NaN | NaT | NaN | NaN | NaN |
| 2 | A Gentleman in Moscow: A Novel | Amor Towles | 4.7 | 19699 | 15 | 2017-01-01 | Fiction | BS | 2016-01-01 | M | ... | 2019-03-26 | 2016-09-06 | [Kirkus Prize Nominee for Fiction (2016), Good... | 293330.0 | 96.0 | 6.29 | 2019-12-29 | trade-fiction-paperback | 0.0 | 34.0 |
| 3 | A Stolen Life: A Memoir | Jaycee Dugard | 4.6 | 4149 | 32 | 2011-01-01 | Non Fiction | MEM | 2011-01-01 | F | ... | 2011-07-12 | 2011-07-11 | [Goodreads Choice Award Nominee for Memoir & A... | 96524.0 | 92.0 | 3.32 | 2011-10-02 | hardcover-nonfiction | 24.99 | 10.0 |
| 4 | All the Light We Cannot See | Anthony Doerr | 4.6 | 36348 | 14 | 2014-01-01 | Fiction | PW | 2014-01-01 | M | ... | 2014-05-06 | 2014-05-28 | [Pulitzer Prize for Fiction (2015), Audie Awar... | 1024442.0 | 96.0 | 4.45 | 2016-04-10 | hardcover-fiction | 0.0 | 99.0 |


In [5]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   title                         366 non-null    object        
 1   amazon_author                 366 non-null    object        
 2   amazon_rating                 366 non-null    float64       
 3   amazon_num_reviews            366 non-null    int64         
 4   amazon_price                  366 non-null    int64         
 5   amazon_year                   366 non-null    datetime64[ns]
 6   amazon_genre                  366 non-null    object        
 7   conlit_genre                  111 non-null    object        
 8   conlit_pubdate                111 non-null    datetime64[ns]
 9   conlit_author_gender          111 non-null    object        
 10  conlit_author_nationality     65 non-null     object        
 11  conlit_total_ratings          11

### Many columns contain NaN values. What is left if we drop them?

In [6]:
books_no_nan = books.dropna(axis=1)
books_no_nan.head()

|  | title | amazon_author | amazon_rating | amazon_num_reviews | amazon_price | amazon_year | amazon_genre | goodreads_rating | goodreads_genres | goodreads_awards |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11/22/63: A Novel | Stephen King | 4.6 | 2052 | 22 | 2011-01-01 | Fiction | 4.31 | [Fiction, Historical Fiction, Science Fiction,... | [Locus Award Nominee for Best SF Novel (2012),... |
| 1 | A Dance with Dragons (A Song of Ice and Fire) | George R. R. Martin | 4.4 | 12643 | 11 | 2011-01-01 | Fiction | 4.32 | [Fantasy, Fiction, Epic Fantasy, Science Ficti... | [Hugo Award Nominee for Best Novel (2012), Loc... |
| 2 | A Gentleman in Moscow: A Novel | Amor Towles | 4.7 | 19699 | 15 | 2017-01-01 | Fiction | 4.34 | [Historical Fiction, Fiction, Russia, Historic... | [Kirkus Prize Nominee for Fiction (2016), Good... |
| 3 | A Stolen Life: A Memoir | Jaycee Dugard | 4.6 | 4149 | 32 | 2011-01-01 | Non Fiction | 3.91 | [Nonfiction, Memoir, True Crime, Biography, Au... | [Goodreads Choice Award Nominee for Memoir & A... |
| 4 | All the Light We Cannot See | Anthony Doerr | 4.6 | 36348 | 14 | 2014-01-01 | Fiction | 4.33 | [Historical Fiction, Fiction, Historical, War,... | [Pulitzer Prize for Fiction (2015), Audie Awar... |


In [7]:
books_no_nan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   title               366 non-null    object        
 1   amazon_author       366 non-null    object        
 2   amazon_rating       366 non-null    float64       
 3   amazon_num_reviews  366 non-null    int64         
 4   amazon_price        366 non-null    int64         
 5   amazon_year         366 non-null    datetime64[ns]
 6   amazon_genre        366 non-null    object        
 7   goodreads_rating    366 non-null    float64       
 8   goodreads_genres    366 non-null    object        
 9   goodreads_awards    366 non-null    object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(5)
memory usage: 28.7+ KB
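
Dropping every column that has any missing value reduces the frame from 27 to 10 columns. A less drastic option, sketched below on an invented toy frame, is to first look at the share of missing values per column and then keep columns that meet a minimum count of non-null values. The frame `toy`, its column names, and the threshold of 3 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented toy frame; column names and values are made up.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "mostly_present": [1.0, 2.0, np.nan, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 9.0],
})

# Fraction of missing values in each column.
missing_share = toy.isna().mean()
print(missing_share)

# Same call as in the notebook: keeps only columns with no missing values.
strict = toy.dropna(axis=1)

# thresh=3 keeps columns with at least 3 non-null values.
lenient = toy.dropna(axis=1, thresh=3)

print(list(strict.columns))
print(list(lenient.columns))
```

On the real data, `books.isna().mean()` would give the same per-column overview before deciding which columns to drop.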


## EDA visualizations using the original dataset

### Distribution of book prices

In [8]:
alt.Chart(books).mark_bar().encode(
    alt.X("amazon_price:Q", 
          axis=alt.Axis(title="Price listed on Amazon"), 
          bin=alt.BinParams(maxbins=55)),
    alt.Y("count()"),
    alt.Color("amazon_genre:N",  # genre is an unordered category, so nominal rather than ordinal
              legend=alt.Legend(title="Amazon Genre"),
              scale=alt.Scale(scheme='viridis'))
)
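
The histogram above can be cross-checked numerically. The sketch below bins the five `amazon_price` values shown by `books.head()` into fixed-width bins with `pd.cut`. The bin edges are an assumption: Altair chooses its own bin boundaries from `maxbins`, so these counts will not necessarily line up with the bars in the chart.

```python
import pandas as pd

# amazon_price values of the five rows shown by books.head() above.
prices = pd.Series([22, 11, 15, 32, 14], name="amazon_price")

# Assumed bin edges: width-10 bins, closed on the left ([0, 10), [10, 20), ...).
edges = [0, 10, 20, 30, 40]
price_bin = pd.cut(prices, bins=edges, right=False)

# Count how many prices fall into each bin, in bin order.
counts = price_bin.value_counts().sort_index()
print(counts)
```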

### When were the books published and how many in each year?

In [9]:
# year() will parse the year value from the datetime of "goodreads_publish_date" variable
alt.Chart(books).mark_line().encode(
    alt.X("year(goodreads_publish_date):O", 
          axis=alt.Axis(title="")),
    alt.Y("count()")
)
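
The same per-year count can be computed in pandas, which is handy for checking the line chart against numbers. This is a minimal sketch on four invented dates; the series `dates` is a stand-in for `books['goodreads_publish_date']`, and missing dates are dropped explicitly before counting.

```python
import pandas as pd

# Invented stand-in for books['goodreads_publish_date'], including one missing date.
dates = pd.Series(
    pd.to_datetime(["2011-11-08", "2011-07-12", "2019-03-26", None]),
    name="goodreads_publish_date",
)

# Extract the year from each non-missing date and count rows per year.
per_year = dates.dropna().dt.year.value_counts().sort_index()
print(per_year)
```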

### Cardinality for quantitative attributes

In [10]:
books['conlit_pubdate'].min()

Timestamp('2005-01-01 00:00:00')

In [11]:
books.agg(
    {
    'amazon_num_reviews':['min','max','median','std'],
    'amazon_price':['min','max','median','std'],
    'conlit_total_ratings':['min','max','median','std'],
    'goodreads_rating':['min','max','median','std'],
    'goodreads_num_ratings':['min','max','median','std'],
    'goodreads_likedPercent':['min','max','median','std'],
    'goodreads_price':['min','max','median','std'],
    'nyt_price':['min','max','median','std'],
    'nyt_weeks_on_list':['min','max','median','std'],
    'amazon_rating':['min','max','median','std'],   
    'amazon_year':['min','max'],
    'conlit_pubdate':['min','max'],
    'goodreads_publish_date':['min','max'],
    'goodreads_first_publish_date':['min','max'],
    'nyt_published_date':['min','max']
    }
)

|  | amazon_num_reviews | amazon_price | conlit_total_ratings | goodreads_rating | goodreads_num_ratings | goodreads_likedPercent | goodreads_price | nyt_price | nyt_weeks_on_list | amazon_rating | amazon_year | conlit_pubdate | goodreads_publish_date | goodreads_first_publish_date | nyt_published_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| min | 548.0 | 0.0 | 39.0 | 3.28 | 608.0 | 74.0 | 0.85 | 0.0 | 0.0 | 3.3 | 2009-01-01 | 2005-01-01 | 1970-01-01 | 1952-01-01 | 2010-01-03 |
| max | 87841.0 | 105.0 | 4322160.0 | 4.73 | 6376780.0 | 98.0 | 86.87 | 40.0 | 561.0 | 4.9 | 2019-01-01 | 2017-01-01 | 2020-03-17 | 2070-10-30 | 2019-12-29 |
| median | 10070.0 | 11.0 | 337498.0 | 4.15 | 279060.0 | 94.0 | 4.59 | 0.0 | 34.0 | 4.7 | NaT | NaT | NaT | NaT | NaT |
| std | 13474.42378 | 9.918357 | 976230.5 | 0.242408 | 1040133.0 | 4.040564 | 7.544212 | 12.171395 | 98.966936 | 0.24216 | NaT | NaT | NaT | NaT | NaT |
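
One value in the summary above stands out: the maximum of `goodreads_first_publish_date` is 2070-10-30, which lies in the future and so cannot be a real first publication date. A possible cause is a two-digit year being resolved to the wrong century during parsing, though nothing in the data confirms this. The sketch below shows one way to flag such values and set them to missing. The series `first_pub` is a small stand-in that reuses dates seen in the outputs above, and the cutoff of 2021-12-31 is an assumption based on the Conlit description (books published up to 2021).

```python
import pandas as pd

# Small stand-in for books['goodreads_first_publish_date'];
# the dates echo values shown in the outputs above.
first_pub = pd.Series(
    pd.to_datetime(["1952-01-01", "2016-09-06", "2070-10-30", None])
)

# Assumed latest plausible publication date.
cutoff = pd.Timestamp("2021-12-31")

# True where a date is later than the cutoff; NaT compares as False.
suspicious = first_pub > cutoff
print(first_pub[suspicious])

# Replace the flagged dates with NaT and leave everything else unchanged.
cleaned = first_pub.mask(suspicious)
print(cleaned.max())
```

Setting the flagged values to missing is the conservative choice here, since the correct date cannot be recovered from the summary alone.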


### Cardinality for categorical attributes

In [12]:
cat_attr = ['title', 'amazon_author', 'amazon_genre', 'conlit_genre', 'conlit_author_gender',
           'conlit_author_nationality', 'goodreads_series', 'goodreads_edition', 'goodreads_publisher',
            'nyt_list_name_encoded']
for n in cat_attr: 
    print(pd.crosstab(index=books[n], columns = 'count'))

col_0                                               count
title                                                    
11/22/63: A Novel                                       1
12 Rules for Life: An Antidote to Chaos                 1
1984 (Signet Classics)                                  1
A Dance with Dragons (A Song of Ice and Fire)           1
A Gentleman in Moscow: A Novel                          1
...                                                   ...
Wild: From Lost to Found on the Pacific Crest T...      1
Winter of the World: Book Two of the Century Tr...      1
Wonder                                                  5
Wrecking Ball (Diary of a Wimpy Kid Book 14)            1
You Are a Badass: How to Stop Doubting Your Gre...      4

[222 rows x 1 columns]
col_0                       count
amazon_author                    
Abraham Verghese                2
Adam Mansbach                   1
Admiral William H. McRaven      1
Alan Moore                      1
Alex Michaelides  