In this project, I try to predict a book's average rating from its author, number of pages, and publisher. The data is the Goodreads books dataset, downloaded from Kaggle. The dataset has many other columns, but I think these three affect the rating most.

In [127]:
# Import necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
import numpy as np
from nltk import word_tokenize

In [73]:
# Read in the Books Dataset

books_dataset = pd.read_csv('books.csv')

In [74]:
# Build a DataFrame from only the columns (features) that I think will affect the average book rating

books_df = books_dataset.filter(['bookID', 'authors', 'num_pages', 'publisher'])

In [75]:
# Check the shape of the resulting DataFrame

books_df.shape

(11126, 4)

At this point I start the data cleaning: getting the data into a form that machine learning techniques can be applied to.

In [76]:
# Get the unique publishers

unique_publishers = books_df['publisher'].unique()

In [77]:
# Check the number of unique publishers

len(unique_publishers)

2293

In [78]:
# Pair each unique publisher with an integer ID

num_publishers = list(enumerate(unique_publishers))

In [79]:
# Make a dictionary of publishers ID and publishers

publishers_dict = dict(num_publishers)

In [80]:
publishers_dict

{0: 'Scholastic Inc.',
 1: 'Scholastic',
 2: 'Nimble Books',
 3: 'Gramercy Books',
 4: 'Del Rey Books',
 5: 'Crown',
 6: 'Random House Audio',
 7: 'Wings Books',
 8: 'Broadway Books',
 9: 'William Morrow Paperbacks',
 10: 'Ballantine Books',
 11: 'Houghton Mifflin Harcourt',
 12: 'Pragmatic Bookshelf',
 13: 'Atheneum Books for Young Readers: Richard Jackson Books',
 14: 'Teacher Created Resources',
 15: 'Delacorte Press',
 16: 'Cherry Lane Music Company',
 17: 'The New Press',
 18: 'Changeling Press',
 19: 'Viking Juvenile',
 20: 'Firebird',
 21: 'iUniverse',
 22: 'Shambhala',
 23: 'Ivy Books',
 24: 'Amistad',
 25: 'HarperAudio',
 26: 'Harper',
 27: 'FT Press',
 28: 'Archaia',
 29: 'Farrar  Straus and Giroux',
 30: 'Farrar Straus Giroux',
 31: 'Dramatists Play Service',
 32: 'Vintage',
 33: 'Routledge',
 34: 'North Light Books',
 35: 'Chosen Books',
 36: 'Association for Supervision & Curriculum Development',
 37: 'Kingfisher',
 38: 'ASCD',
 39: 'Sovereign World',
 40: 'Workman Publish

In [81]:
# The books DataFrame up to this point

books_df

Unnamed: 0,bookID,authors,num_pages,publisher
0,1,J.K. Rowling/Mary GrandPré,652,Scholastic Inc.
1,2,J.K. Rowling/Mary GrandPré,870,Scholastic Inc.
2,4,J.K. Rowling,352,Scholastic
3,5,J.K. Rowling/Mary GrandPré,435,Scholastic Inc.
4,8,J.K. Rowling/Mary GrandPré,2690,Scholastic
5,9,W. Frederick Zimmerman,152,Nimble Books
6,10,J.K. Rowling,3342,Scholastic
7,12,Douglas Adams,815,Gramercy Books
8,13,Douglas Adams,815,Del Rey Books
9,14,Douglas Adams,215,Crown


In [82]:
# Collect the publisher IDs (the dictionary keys) into a DataFrame

unique_publishers_id = pd.DataFrame(list(publishers_dict.keys()))

In [83]:
books_dataset

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,Unnamed: 12
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,439785960,9.78044E+12,eng,652,2095690,27591,9/16/2006,Scholastic Inc.,
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,439358078,9.78044E+12,eng,870,2153167,29221,9/1/2004,Scholastic Inc.,
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,439554896,9.78044E+12,eng,352,6333,244,11/1/2003,Scholastic,
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9.78044E+12,eng,435,2339585,36325,5/1/2004,Scholastic Inc.,
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,439682584,9.78044E+12,eng,2690,41428,164,9/13/2004,Scholastic,
5,9,"Unauthorized Harry Potter Book Seven News: ""Ha...",W. Frederick Zimmerman,3.74,976540606,9.78098E+12,en-US,152,19,1,4/26/2005,Nimble Books,
6,10,Harry Potter Collection (Harry Potter #1-6),J.K. Rowling,4.73,439827604,9.78044E+12,eng,3342,28242,808,9/12/2005,Scholastic,
7,12,The Ultimate Hitchhiker's Guide: Five Complete...,Douglas Adams,4.38,517226952,9.78052E+12,eng,815,3628,254,11/1/2005,Gramercy Books,
8,13,The Ultimate Hitchhiker's Guide to the Galaxy ...,Douglas Adams,4.38,345453743,9.78035E+12,eng,815,249558,4080,4/30/2002,Del Rey Books,
9,14,The Hitchhiker's Guide to the Galaxy (Hitchhik...,Douglas Adams,4.22,1400052920,9.7814E+12,eng,215,4930,460,8/3/2004,Crown,


In [69]:
# Invert the dictionary (publisher name -> ID) and look up the ID for every row

publishers_dict_useful = {v: k for k, v in publishers_dict.items()}
pub_column = [publishers_dict_useful[x] for x in books_df['publisher']]


In [86]:
# Replace the publisher names with their integer IDs

books_df['publisher'] = pub_column
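
The dictionary-and-loop mapping used here can also be written as a single call. The following is a minimal sketch, assuming a small made-up frame (`toy_df` is hypothetical and stands in for the real data). `pd.factorize` assigns integer codes in order of first appearance, which matches the numbering produced by `unique()` followed by `enumerate`.

```python
import pandas as pd

# Hypothetical toy data standing in for books_df
toy_df = pd.DataFrame({
    'publisher': ['Scholastic Inc.', 'Scholastic', 'Scholastic Inc.', 'Nimble Books'],
})

# codes[i] is the integer ID of row i; uniques[k] is the publisher name with ID k
codes, uniques = pd.factorize(toy_df['publisher'])
toy_df['publisher_id'] = codes

print(toy_df)
print(list(uniques))
```

The same call would apply to the `authors` column. `uniques` plays the role of the ID-to-name dictionary.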

In [87]:
books_df['publisher']

0           0
1           0
2           1
3           0
4           1
5           2
6           1
7           3
8           4
9           5
10          6
11          7
12          8
13          8
14          8
15          8
16          8
17          9
18          9
19          9
20          9
21         10
22         11
23         11
24         11
25         11
26         11
27         12
28         13
29         14
         ... 
11096     104
11097    2288
11098    1433
11099      53
11100      88
11101     661
11102     611
11103     386
11104    2289
11105    1253
11106     889
11107     531
11108     328
11109     280
11110    2290
11111     255
11112    2291
11113     418
11114    2292
11115    2292
11116    2292
11117    2292
11118    2292
11119    2292
11120      55
11121     316
11122      55
11123      55
11124     473
11125     801
Name: publisher, Length: 11126, dtype: int64

In [88]:
unique_authors = books_df['authors'].unique()

In [93]:
# Pair each unique author string with an integer ID

authors_list = list(enumerate(unique_authors))

In [94]:
authors_dict = dict(authors_list)
authors_dict

{0: 'J.K. Rowling/Mary GrandPré',
 1: 'J.K. Rowling',
 2: 'W. Frederick Zimmerman',
 3: 'Douglas Adams',
 4: 'Douglas Adams/Stephen Fry',
 5: 'Bill Bryson',
 6: 'J.R.R. Tolkien',
 7: 'J.R.R. Tolkien/Alan  Lee',
 8: 'Chris   Smith/Christopher  Lee/Richard Taylor',
 9: 'Jude Fisher',
 10: 'Dave Thomas/David Heinemeier Hansson/Leon Breedt/Mike Clark/Thomas  Fuchs/Andreas  Schwarz',
 11: 'Gary Paulsen',
 12: 'Donna Ickes/Edward Sciranko/Keith Vasconcelles',
 13: 'Molly Hatchet',
 14: 'Dale Peck',
 15: 'Angela Knight/Sahara Kelly/Judy Mays/Marteeka Karland/Kate Douglas/Shelby Morgen/Lacey Savage/Kate Hill/Willa Okati',
 16: 'Delia Sherman',
 17: 'Patricia A. McKillip',
 18: 'Zilpha Keatley Snyder',
 19: 'Kate Horsley',
 20: 'Philippa Carr',
 21: 'Edward P. Jones',
 22: 'Edward P. Jones/Kevin R. Free',
 23: 'Satyajit Das',
 24: 'Mark Smylie',
 25: 'John McPhee/William Howarth',
 26: 'John McPhee',
 27: 'Wendy Wasserstein',
 28: 'Heidi Hayes Jacobs',
 29: 'Heidi Boyd',
 30: 'Heidi Baker/Rolla

In [95]:
# Invert the dictionary (author string -> ID) and look up the ID for every row

authors_dict_useful = {v: k for k, v in authors_dict.items()}
author_column = [authors_dict_useful[x] for x in books_df['authors']]

In [96]:
# Replace the author strings with their integer IDs

books_df['authors'] = author_column

In [97]:
books_df

Unnamed: 0,bookID,authors,num_pages,publisher
0,1,0,652,0
1,2,0,870,0
2,4,1,352,1
3,5,0,435,0
4,8,0,2690,1
5,9,2,152,2
6,10,1,3342,1
7,12,3,815,3
8,13,3,815,4
9,14,3,215,5


In [100]:
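# Preview the DataFrame without bookID.
# Note: drop() returns a new DataFrame and the result is not assigned here,
# so books_df itself still contains the bookID column.

books_df.drop(columns = ['bookID'])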

Unnamed: 0,authors,num_pages,publisher
0,0,652,0
1,0,870,0
2,1,352,1
3,0,435,0
4,0,2690,1
5,2,152,2
6,1,3342,1
7,3,815,3
8,3,815,4
9,3,215,5
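
The `drop` call above displays a frame without `bookID`, but it does not change `books_df`. A minimal sketch of the difference, using a hypothetical two-row `toy_df` rather than the real data:

```python
import pandas as pd

# Hypothetical toy data standing in for books_df
toy_df = pd.DataFrame({'bookID': [1, 2], 'authors': [0, 1], 'num_pages': [652, 352]})

# drop() returns a new DataFrame; toy_df itself is left unchanged
dropped = toy_df.drop(columns=['bookID'])

print(list(toy_df.columns))   # still includes 'bookID'
print(list(dropped.columns))  # 'bookID' is gone only in the returned copy
```

To make the removal stick, the result would need to be assigned back, for example `toy_df = toy_df.drop(columns=['bookID'])`.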


In [132]:
features_df = books_df

In [133]:
# The target: the average rating of each book

output_df = books_dataset[['average_rating']]

In [134]:
output_df.shape

(11126, 1)

This completes our Data Cleaning Step!


In [184]:
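# One-hot encoder for the categorical ID columns.
# Note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`;
# newer versions need sparse_output=False instead.

enc = OneHotEncoder(categories='auto', drop=None, sparse=False, handle_unknown='error')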

In [185]:
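# Encode every column except num_pages.
# Note: features_df still contains bookID (it was never dropped in place),
# so bookID is one-hot encoded along with authors and publisher.

for_encoding = features_df.drop(columns = ['num_pages'])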

In [186]:
one_hot_encoded_authors = enc.fit_transform(for_encoding, y=None)

In [188]:
one_hot_encoded_authors.shape

(11126, 20061)
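
A likely reason for the 20061 columns is that `for_encoding` appears to still include `bookID`, and a one-hot encoder creates one column per unique value in every input column. The sketch below illustrates this on a hypothetical four-row `toy_df` (not the real data). It uses the encoder's default sparse output so that it does not depend on the `sparse` or `sparse_output` argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: bookID is unique per row, the other columns repeat
toy_df = pd.DataFrame({
    'bookID': [1, 2, 4, 5],
    'authors': [0, 0, 1, 0],
    'publisher': [0, 0, 1, 0],
})

# Encoding with the ID column: 4 + 2 + 2 unique values
enc_with_id = OneHotEncoder(handle_unknown='error')
with_id = enc_with_id.fit_transform(toy_df)

# Encoding without the ID column: 2 + 2 unique values
enc_without_id = OneHotEncoder(handle_unknown='error')
without_id = enc_without_id.fit_transform(toy_df.drop(columns=['bookID']))

print(with_id.shape)
print(without_id.shape)
```

An identifier that is unique per row adds one column per row and carries no information that generalizes to unseen books, so it is usually left out of the features.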

In [126]:
# output_df_useful = pd.cut(output_df, np.linspace(0, 5, 50),labels = np.linspace(0, 9, 20)).fillna(0).astype(float)
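
The commented-out line above sketches the idea of binning the ratings into discrete classes. As written it would likely fail, since `pd.cut` expects a 1-D input rather than a DataFrame, and the number of labels must equal the number of bins. Below is a minimal sketch of the same idea under those constraints. The `ratings` values and the choice of ten bins are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings standing in for output_df['average_rating']
ratings = pd.Series([4.57, 3.74, 0.0, 5.0], name='average_rating')

# 11 edges from 0 to 5 give 10 equal-width bins of width 0.5
edges = np.linspace(0, 5, 11)

# labels=False returns the integer bin index for each rating;
# include_lowest=True keeps a rating of exactly 0 inside the first bin
binned = pd.cut(ratings, bins=edges, labels=False, include_lowest=True)

print(binned.tolist())
```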

In [118]:
# Hold out 20% of the rows for testing

X_train, X_test, Y_train, Y_test = train_test_split(features_df, output_df, test_size = 0.2)
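
Without a fixed seed, each run of the split produces a different partition. A minimal sketch of a reproducible split follows, assuming toy arrays `X` and `y` (hypothetical) in place of the real features and ratings. `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 3 features each
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

# test_size=0.2 holds out 2 of the 10 rows; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```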