## Goal and motivation

Online reviews have become a powerful source of influence in our buying decision process. Books are no exception. When trying to decide what to read next, our choices get swayed by other readers' experiences and perceptions. 
The goal of this analysis is to examine whether reviews in different genres have distinctive characteristics, such as word choice, review length, and ratings.

## Data source

The datasets were obtained from the UCSD Book Graph project. They were collected from goodreads.com in late 2017 and updated in May 2019. The data reflect users' public shelves (visible to anyone on the web without logging in). User IDs and review IDs have been anonymized.

For the purposes of this project, we decided to use datasets from 3 different genres: children, history & biography, and mystery & thrillers. Note that a book may belong to multiple genres; the genre assigned to a book in this dataset was ultimately decided by how many votes it received from users.

Citation:
* Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in RecSys'18.
* Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", in ACL'19.


## The analysis

We divided our analysis into 3 parts, each guided by a set of questions: 

#### Reviews by genres:
- Does one genre tend to receive higher ratings than other genres? What is the rating distribution of each genre? 
- Is there a relationship between the length of reviews and the rating given in different genres?
- What are the words and themes associated with different review genres?

#### Reviews by users:
- Are there any 'likers' or 'haters' in our sample of book reviews? 'Likers' are those who tend to leave more positive reviews, whereas 'haters' tend to do the opposite. (Not sure we can answer this question with the data that we have)
- Do ratings align well with the polarity scores of reviews across different users? (Not sure we can answer this with our data)


#### Reviews by authors:

- Is there any bias when it comes to the gender of the author? Using https://pypi.org/project/gender-guesser/, we can examine whether male authors are rated higher than female authors or vice versa.

## Data preparation and Manipulation Steps
### To run the notebook for analysis, please skip ahead to section: Reading the dataframe from csv file

Given the size of the data, we ran into memory issues when attempting to read the files using pandas methods. 

To work around this issue, we use PySpark dataframes to read the json files and extract a sample from each genre. Sampling is done with the `.sample()` method, which takes 3 arguments: `withReplacement` (a boolean), `fraction`, and `seed` (a long). One important thing to note is that `fraction` is not the exact fraction of rows to return but the probability that each row is selected for the sample. As a result, the returned sample is only approximately that percentage of the original dataframe.
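
To illustrate why a per-row selection probability does not give an exact sample size, here is a minimal pure-Python sketch. It is a stand-in for the idea only, not Spark's implementation; the helper name `bernoulli_sample` and the row count are assumptions made for illustration.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))  # stand-in for the rows of a dataframe
sample_a = bernoulli_sample(rows, fraction=0.05, seed=40)
sample_b = bernoulli_sample(rows, fraction=0.05, seed=41)

# Both sizes land near 5,000 (5% of 100,000) but are not guaranteed to equal it,
# and they typically differ from one seed to the next.
print(len(sample_a), len(sample_b))
```

Reusing the same seed reproduces the same sample, which is why a fixed seed is passed to `.sample()` in the sampling cell.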



1. Load and read our json files. One thing to note is that each file is in JSON Lines layout: every row is a separate json object rather than the file being a single json document. pandas' `read_json` needs the optional parameter `lines` set to `True` to parse this layout, whereas `spark.read.json` expects one json object per line by default.

2. Merge datasets that belong to the same genre. Book metadata and reviews are merged on `book_id`.

3. Add a genre column to each dataframe to identify it once all the files are merged into 1 dataframe.

4. Take a sample of each dataframe.

5. Concatenate all the dataframes.

6. Convert the spark dataframe into a pandas dataframe.

7. Remove unnecessary columns.

8. Only keep books that have a rating (unrated books have a rating of 0).
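
As a small illustration of steps 1 and 8, the sketch below parses a line-delimited json source with the standard library and drops unrated rows. The two records are made up for the example, and plain Python lists stand in for the Spark dataframes used in the notebook.

```python
import io
import json

# Two hypothetical review records in JSON Lines layout: one json object per line.
raw = io.StringIO(
    '{"book_id": "1", "rating": 4, "review_text": "Loved it"}\n'
    '{"book_id": "2", "rating": 0, "review_text": "Shelved, not rated"}\n'
)

# Parse line by line; json.load() on the whole file would fail because the
# file as a whole is not a single json document.
reviews = [json.loads(line) for line in raw if line.strip()]

# Step 8: drop unrated rows, which carry a rating of 0.
rated = [r for r in reviews if r["rating"] != 0]

print(len(reviews), len(rated))  # → 2 1
```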


In [1]:
#boilerplate code for running spark
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.functions import lit
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from functools import reduce
from pyspark.sql import DataFrame

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName('Goodreads Spark Application') \
    .getOrCreate() 

sc = spark.sparkContext

In [20]:
#display entire output of a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [21]:
# load and parse the json files
children_books = spark.read.json('../data/goodreads_books_children.json')
children_reviews = spark.read.json('../data/goodreads_reviews_children.json')
history_books = spark.read.json('../data/goodreads_books_history_biography.json')
history_reviews = spark.read.json('../data/goodreads_reviews_history_biography.json')
mystery_books = spark.read.json('../data/goodreads_books_mystery_thriller_crime.json')
mystery_reviews = spark.read.json('../data/goodreads_reviews_mystery_thriller_crime.json')

In [22]:
children_books.select("book_id").distinct().count() # check that book id is unique, to check for duplicates
print((children_books.count(), len(children_books.columns))) # check the size of the children books dataframe

124082

(124082, 29)


In [23]:
children_reviews.select("review_id").distinct().count() # check that review id is unique, to check for duplicates
print((children_reviews.count(), len(children_reviews.columns))) # check the size of the children reviews dataframe

734640

(734640, 11)


In [24]:
history_books.select("book_id").distinct().count() # check that book id is unique, to check for duplicates
print((history_books.count(), len(history_books.columns))) # check the size of the history books dataframe

302935

(302935, 29)


In [25]:
history_reviews.select("review_id").distinct().count() # check that review id is unique, to check for duplicates
print((history_reviews.count(), len(history_reviews.columns))) # check the size of the history reviews dataframe

2066193

(2066193, 11)


In [26]:
mystery_books.select("book_id").distinct().count() # check that book id is unique, to check for duplicates
print((mystery_books.count(), len(mystery_books.columns))) # check the size of the mystery books dataframe

219235

(219235, 29)


In [27]:
mystery_reviews.select("review_id").distinct().count() # check that review id is unique, to check for dups
print((mystery_reviews.count(), len(mystery_reviews.columns))) # check the size of the mystery reviews dataframe

1849236

(1849236, 11)


In [28]:
# drop unwanted columns in reviews
columns_to_drop = ['date_added', 'date_updated', 'read_at', 'started_at']
children_reviews = children_reviews.drop(*columns_to_drop)
history_reviews = history_reviews.drop(*columns_to_drop)
mystery_reviews = mystery_reviews.drop(*columns_to_drop)


# drop unwanted columns in books
drop_columns = ['description','format','image_url','is_ebook','asin','kindle_asin','link','popular_shelves','url']
children_books = children_books.drop(*drop_columns)
history_books = history_books.drop(*drop_columns)
mystery_books = mystery_books.drop(*drop_columns)

In [29]:
#join the dataframes of the same genres on book id, to have detailed information about a book

merged_children =children_reviews.join(children_books, on =["book_id"], how = "inner")
merged_history = history_books.join(history_reviews, on = ["book_id"], how = 'inner')
merged_mystery = mystery_books.join(mystery_reviews, on = ["book_id"], how = "inner")

In [30]:
#create a genre column to identify each dataframe
children = merged_children.withColumn('genre', lit('children'))
history = merged_history.withColumn('genre',lit('history'))
mystery = merged_mystery.withColumn('genre', lit('mystery'))

In [31]:
#only keep books and reviews written in English
children= children.filter(f.col('language_code')=='eng')
history = history.filter(f.col('language_code')=='eng') 
mystery = mystery.filter(f.col('language_code')=='eng')

In [32]:
#take a sample of the dataframes
children_sample = children.sample(False,0.06,40)
history_sample = history.sample(False, 0.05,40)
mystery_sample = mystery.sample(False, 0.05,40)

In [33]:
#arrange the order of the columns for each dataframe to allow for union
children_sample.createOrReplaceTempView("children_table")

children = spark.sql("select book_id,title,isbn,isbn13,title_without_series,review_text,review_id,\
rating,user_id,authors,average_rating,country_code,edition_information,n_comments,n_votes,language_code,num_pages,\
publication_day,publication_month,publication_year,publisher,ratings_count,series,similar_books,text_reviews_count,\
work_id, genre from children_table")

history_sample.createOrReplaceTempView("history_table")

history = spark.sql("select book_id,title,isbn,isbn13,title_without_series,review_text,review_id,\
rating,user_id,authors,average_rating,country_code,edition_information,n_comments,n_votes,language_code,num_pages,\
publication_day,publication_month,publication_year,publisher,ratings_count,series,similar_books,text_reviews_count,\
work_id, genre from history_table")

mystery_sample.createOrReplaceTempView("mystery_table")

mystery = spark.sql("select book_id,title,isbn,isbn13,title_without_series,review_text,review_id,\
rating,user_id,authors,average_rating,country_code,edition_information,n_comments,n_votes,language_code,num_pages,\
publication_day,publication_month,publication_year,publisher,ratings_count,series,similar_books,text_reviews_count,\
work_id, genre from mystery_table")


In [34]:
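#concatenate the sample dataframes
def unionAll(*dfs):
    '''Input: spark dataframes with identical column order
       Output: union of the dataframes (columns are matched by position)'''
    # DataFrame.unionAll is a deprecated alias of DataFrame.union since Spark 2.0
    return reduce(DataFrame.union, dfs)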

complete_df = unionAll(children,history,mystery)

In [35]:
# convert spark dataframe into a pandas dataframe
df = complete_df.toPandas()

In [36]:
#save our data in a csv file
df.to_csv('../data/goodreads.csv', index = False)

## Reading the dataframe from csv file

In [37]:
#data manipulation
import pandas as pd
import numpy as np

#display entire output of a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#hide warnings from jupyter notebook
import warnings
warnings.filterwarnings('ignore')

#display all columns
pd.set_option("display.max_columns", None)

#visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# gender detection (imported from the gender_detector package)
from gender_detector import gender_detector as gd
#detector = gd.GenderDetector('us')

# for word cloud
import re
import string

# for stats
import scipy.stats

In [38]:
goodreads = pd.read_csv('../data/goodreads.csv')

In [39]:
# drop authors column. We were able to locate a clean json file that contains authors information
goodreads.drop(columns='authors', inplace=True)

In [41]:
len(goodreads)

129797

#### This section was commented out after the csv files were created

In [8]:
#Creating csv files for each genre
#df_c = goodreads.loc[goodreads['genre'] == 'children']
#df_h = goodreads.loc[goodreads['genre'] == 'history']
#df_m = goodreads.loc[goodreads['genre'] == 'mystery']

#converting them to csv files
#df_c.to_csv('../data/children_books_reviews.csv')
#df_h.to_csv('../data/history_books_reviews.csv')
#df_m.to_csv('../data/mystery_books_reviews.csv')

## Explore dataset

In [45]:
goodreads.head()

Unnamed: 0,book_id,title,isbn,isbn13,title_without_series,review_text,review_id,rating,user_id,average_rating,country_code,edition_information,n_comments,n_votes,language_code,num_pages,publication_day,publication_month,publication_year,publisher,ratings_count,series,similar_books,text_reviews_count,work_id,genre
0,11346143,Strange Case Of Origami Yoda,0810996502,9780810996502.0,Strange Case Of Origami Yoda,Quirky and silly will save it for my son when ...,a95478e83fb0549916181dec2e6c02de,3,ae6c9ceec7a41254191ffdb8852bd031,3.9,US,,0,0,eng,154.0,,,,,143,['257006'],"['7172060', '6330886', '9564947', '7739868', '...",29,7415356,children
1,12354883,Shadow the Sheep-dog,,,Shadow the Sheep-dog,I read this as a kid and enjoyed it and now ha...,d348df0f30853c95f16bf5f292e42bb9,3,1fee0f40606726eb46e30612e6dd8485,4.32,US,,0,0,eng,188.0,,,1948.0,Angus and Robertson,10,[],"['794739', '2740021', '885350', '31964', '2220...",2,1433595,children
2,12734774,Mikolay and Julia Meet the Fairies (Mikolay an...,,,Mikolay and Julia Meet the Fairies (Mikolay an...,What a delightful children's story. I can't wa...,3c550b2365eb869d57d81236f4cfbf27,5,7f778517ad88c4feed6183c54b6403e5,4.65,US,,0,1,eng,38.0,,9.0,2010.0,Mayan Books,24,['305783'],[],10,17872488,children
3,130196,The Trumpet of the Swan,0590406191,9780590406192.0,The Trumpet of the Swan,"Even as an adult, I enjoy some children's lite...",87379d304d5d38d555e9a04b8eafe74a,3,1fa3b5759854065c9d1e1048f38d2507,4.06,US,,0,0,eng,210.0,,,1970.0,,280,[],"['89546', '24384', '827430', '240815', '426206...",43,1835542,children
4,13790759,Sarah Gives Thanks: How Thanksgiving Became a ...,080757239X,9780807572399.0,Sarah Gives Thanks: How Thanksgiving Became a ...,More like a 3.5'er,d2acf777a2748ec4a481de94e17a1666,3,4b3548b067eaea2a7eb23eb45da9d375,4.08,US,,0,0,eng,32.0,1.0,9.0,2012.0,Albert Whitman Company,184,[],"['13330625', '9885866', '13414838', '12763989'...",49,19424630,children


In [46]:
#check dimensionality of our data
goodreads.shape

# check the first few rows
goodreads.head()

#examine data types
goodreads.info()

#basic statistics
goodreads.describe(include=object)

(129797, 26)

Unnamed: 0,book_id,title,isbn,isbn13,title_without_series,review_text,review_id,rating,user_id,average_rating,country_code,edition_information,n_comments,n_votes,language_code,num_pages,publication_day,publication_month,publication_year,publisher,ratings_count,series,similar_books,text_reviews_count,work_id,genre
0,11346143,Strange Case Of Origami Yoda,0810996502,9780810996502.0,Strange Case Of Origami Yoda,Quirky and silly will save it for my son when ...,a95478e83fb0549916181dec2e6c02de,3,ae6c9ceec7a41254191ffdb8852bd031,3.9,US,,0,0,eng,154.0,,,,,143,['257006'],"['7172060', '6330886', '9564947', '7739868', '...",29,7415356,children
1,12354883,Shadow the Sheep-dog,,,Shadow the Sheep-dog,I read this as a kid and enjoyed it and now ha...,d348df0f30853c95f16bf5f292e42bb9,3,1fee0f40606726eb46e30612e6dd8485,4.32,US,,0,0,eng,188.0,,,1948.0,Angus and Robertson,10,[],"['794739', '2740021', '885350', '31964', '2220...",2,1433595,children
2,12734774,Mikolay and Julia Meet the Fairies (Mikolay an...,,,Mikolay and Julia Meet the Fairies (Mikolay an...,What a delightful children's story. I can't wa...,3c550b2365eb869d57d81236f4cfbf27,5,7f778517ad88c4feed6183c54b6403e5,4.65,US,,0,1,eng,38.0,,9.0,2010.0,Mayan Books,24,['305783'],[],10,17872488,children
3,130196,The Trumpet of the Swan,0590406191,9780590406192.0,The Trumpet of the Swan,"Even as an adult, I enjoy some children's lite...",87379d304d5d38d555e9a04b8eafe74a,3,1fa3b5759854065c9d1e1048f38d2507,4.06,US,,0,0,eng,210.0,,,1970.0,,280,[],"['89546', '24384', '827430', '240815', '426206...",43,1835542,children
4,13790759,Sarah Gives Thanks: How Thanksgiving Became a ...,080757239X,9780807572399.0,Sarah Gives Thanks: How Thanksgiving Became a ...,More like a 3.5'er,d2acf777a2748ec4a481de94e17a1666,3,4b3548b067eaea2a7eb23eb45da9d375,4.08,US,,0,0,eng,32.0,1.0,9.0,2012.0,Albert Whitman Company,184,[],"['13330625', '9885866', '13414838', '12763989'...",49,19424630,children


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129797 entries, 0 to 129796
Data columns (total 26 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   book_id               129797 non-null  int64  
 1   title                 129797 non-null  object 
 2   isbn                  110349 non-null  object 
 3   isbn13                113385 non-null  object 
 4   title_without_series  129797 non-null  object 
 5   review_text           129782 non-null  object 
 6   review_id             129797 non-null  object 
 7   rating                129797 non-null  int64  
 8   user_id               129797 non-null  object 
 9   average_rating        129797 non-null  float64
 10  country_code          129797 non-null  object 
 11  edition_information   13595 non-null   object 
 12  n_comments            129797 non-null  int64  
 13  n_votes               129797 non-null  int64  
 14  language_code         129797 non-null  object 
 15  

Unnamed: 0,title,isbn,isbn13,title_without_series,review_text,review_id,user_id,country_code,edition_information,language_code,publisher,series,similar_books,genre
count,129797,110349,113385,129797,129782,129797,129797,129797,13595,129797,110187,129797,129797,129797
unique,37529,31829,33793,37529,56106,56844,31477,1,1242,1,7222,17966,27156,3
top,The Girl on the Train,1594633665,9781594633669,The Girl on the Train,Great pace. Nothing I haven't ready or seen be...,0555e260601f31b99e735da9144324b3,37e01e4d9600745d939c44e4f0823a4b,US,First Edition,eng,Scribner,[],[],mystery
freq,790,685,685,790,685,685,685,129797,1737,129797,2008,69299,14187,55289
