# [Predict The Price Of Books](https://machinehack.com/hackathon/predict_the_price_of_books/overview)

***

The aim of this project is to build a machine learning model that predicts the price of a book from a given set of features.

***
* Size of training set: 6237 records
* Size of test set: 1560 records

**FEATURES:**

* Title: The title of the book
* Author: The author(s) of the book
* Edition: The edition of the book, e.g. (Paperback,– Import, 26 Apr 2018)
* Reviews: The average customer star rating of the book, e.g. (4.0 out of 5 stars)
* Ratings: The number of customer reviews of the book, e.g. (8 customer reviews)
* Synopsis: The synopsis of the book
* Genre: The genre the book belongs to
* BookCategory: The department the book is usually available at
***
* **Price: The price of the book (Target variable)**

In [1]:
# Importing packages necessary to preprocess the dataset and create features which can be used for prediction

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
!pip install openpyxl # Opening Excel File

Collecting openpyxl
  Downloading openpyxl-3.0.9-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.9


In [2]:
train = pd.read_excel("../input/PredictThePriceOfBooks-MH/Participants_Data/Data_Train.xlsx", engine="openpyxl")

In [3]:
train.nunique() # Checking the number of unique values in each column of the dataset

Title           5568
Author          3679
Edition         3370
Reviews           36
Ratings          342
Synopsis        5549
Genre            345
BookCategory      11
Price           1614
dtype: int64

In [4]:
train.shape #Shape of the dataset

(6237, 9)

In [5]:
train.isna().sum() #Checking if there are any null values in the dataset

Title           0
Author          0
Edition         0
Reviews         0
Ratings         0
Synopsis        0
Genre           0
BookCategory    0
Price           0
dtype: int64

In [6]:
train.head(10) #Visualizing first 10 rows

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0 out of 5 stars,8 customer reviews,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9 out of 5 stars,14 customer reviews,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8 out of 5 stars,6 customer reviews,"""During the time men live without a common Pow...",International Relations,Humour,299.0
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",4.1 out of 5 stars,13 customer reviews,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006",5.0 out of 5 stars,1 customer review,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62
5,ChiRunning: A Revolutionary Approach to Effort...,Danny Dreyer,"Paperback,– 5 May 2009",4.5 out of 5 stars,8 customer reviews,The revised edition of the bestselling ChiRunn...,Healthy Living & Wellness (Books),Sports,900.0
6,Death on the Nile (Poirot),Agatha Christie,"Paperback,– 5 Oct 2017",4.4 out of 5 stars,72 customer reviews,Agatha Christie’s most exotic murder mystery\n...,"Crime, Thriller & Mystery (Books)","Crime, Thriller & Mystery",224.0
7,Yoga Your Home Practice Companion: A Complete ...,Sivananda Yoga Vedanta Centre,"Hardcover,– Import, 1 Mar 2018",4.7 out of 5 stars,16 customer reviews,"Achieve a healthy body, mental alertness, and ...",Sports Training & Coaching (Books),Sports,836.0
8,Karmayogi: A Biography of E. Sreedharan,M S Ashokan,"Paperback,– 15 Dec 2015",4.2 out of 5 stars,111 customer reviews,Karmayogi is the dramatic and inspiring story ...,Biographies & Autobiographies (Books),"Biographies, Diaries & True Accounts",130.0
9,"The Iron King (The Accursed Kings, Book 1)",Maurice Druon,"Paperback,– 26 Mar 2013",4.0 out of 5 stars,1 customer review,‘This is the original game of thrones’ George ...,Action & Adventure (Books),Action & Adventure,695.0


In [7]:
# Creating new feature "Average_Star_Rating" from "Reviews" (e.g. "4.0 out of 5 stars" -> 4.0)
train["Average_Star_Rating"] = pd.to_numeric(
    train.Reviews.str.split(pat=" out of ", n=1, expand=True)[0],
    downcast="float",
)

In [8]:
# Creating new feature "No_of_Reviews" from "Ratings" (e.g. "8 customer reviews" -> 8)
# Thousands separators are removed before the numeric conversion
train["No_of_Reviews"] = pd.to_numeric(
    train.Ratings.str.split(pat=" ", n=1, expand=True)[0].str.replace(",", "", regex=False),
    downcast="float",
)
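
As an alternative to splitting on a separator, the two numeric features could be pulled out with a regular expression. The following is only a sketch on a small hand-made frame; the sample strings follow the format seen in `train.head(10)`, except `"1,111 customer reviews"`, which is an invented value used to exercise the thousands separator.

```python
import pandas as pd

# Sample strings in the same format as the Reviews and Ratings columns.
# "1,111 customer reviews" is a made-up value to exercise the comma handling.
sample = pd.DataFrame({
    "Reviews": ["4.0 out of 5 stars", "3.9 out of 5 stars", "5.0 out of 5 stars"],
    "Ratings": ["8 customer reviews", "1 customer review", "1,111 customer reviews"],
})

# Capture the leading decimal number of the star rating
sample["Average_Star_Rating"] = (
    sample["Reviews"].str.extract(r"^(\d+(?:\.\d+)?)", expand=False).astype(float)
)

# Capture the leading count, drop thousands separators, then convert to int
sample["No_of_Reviews"] = (
    sample["Ratings"]
    .str.extract(r"^([\d,]+)", expand=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

print(sample[["Average_Star_Rating", "No_of_Reviews"]])
```

Unlike the split-based version, a row that does not start with a number would produce a missing value here and make the `astype` call fail, which can be a useful early warning on unseen test data.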

**Analyzing the "Edition" column and creating new features**

Hardcover books are usually more expensive than softcover and paperback editions, so the cover type is a candidate feature for predicting price.
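
One way to sanity-check this assumption is to compare the median price per cover type. The sketch below uses only the ten rows printed by `train.head(10)`, which is far too small a sample to prove anything; on the real data the same `groupby` would be run on the full `train` frame.

```python
import pandas as pd

# The ten rows shown by train.head(10); illustrative only, not a real test
sample = pd.DataFrame({
    "Edition": [
        "Paperback,– 10 Mar 2016", "Paperback,– 7 Nov 2012",
        "Paperback,– 25 Feb 1982", "Paperback,– 5 Oct 2017",
        "Hardcover,– 10 Oct 2006", "Paperback,– 5 May 2009",
        "Paperback,– 5 Oct 2017", "Hardcover,– Import, 1 Mar 2018",
        "Paperback,– 15 Dec 2015", "Paperback,– 26 Mar 2013",
    ],
    "Price": [220.0, 202.93, 299.0, 180.0, 965.62, 900.0, 224.0, 836.0, 130.0, 695.0],
})

# Everything before ",–" is treated as the cover type
cover = sample["Edition"].str.split(",–", n=1).str[0].rename("cover_type")

# Median is less sensitive to a few very expensive titles than the mean
median_price = sample.groupby(cover)["Price"].median()
print(median_price)
```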

In [9]:
# Checking the unique words that appear in the Edition column

In [10]:
from collections import Counter
results = Counter()
train.Edition.str.casefold().str.replace(",–", "").str.replace(",", "").str.split().apply(results.update)
print(results)

Counter({'paperback': 5349, 'hardcover': 823, '2018': 811, '1': 762, '2017': 757, '2016': 659, 'oct': 639, 'import': 625, 'sep': 543, 'may': 537, '2015': 519, 'jan': 514, 'jun': 501, 'nov': 487, 'apr': 470, 'jul': 457, 'mar': 455, 'aug': 446, 'feb': 410, 'dec': 408, '2014': 402, '2013': 388, '2019': 361, '5': 307, '2012': 304, '2011': 267, '15': 246, '2010': 235, '7': 231, '30': 229, '2': 216, '28': 214, '3': 210, '4': 204, '10': 202, '25': 200, '2009': 182, '6': 180, '20': 179, '26': 172, '14': 167, '2008': 163, '18': 156, 'mass': 155, 'market': 155, '27': 155, '29': 154, '24': 148, '31': 133, '22': 132, '12': 132, '21': 130, '8': 129, '16': 129, '23': 127, '2005': 125, '19': 118, '13': 117, '2006': 110, '17': 110, '2007': 108, '9': 100, '2003': 99, '11': 96, '2004': 85, '2002': 72, '2000': 69, '2001': 66, 'illustrated': 53, 'edition': 44, '1999': 39, '1997': 33, '1994': 33, '1992': 31, '1998': 31, '1995': 30, '1996': 27, 'sheet': 24, 'music': 24, '1993': 22, '1989': 20, '1991': 19, '

In [11]:
# Splitting the string on ",–" to isolate the cover type
cover_type = train.Edition.str.split(pat=",–", n=1, expand=True)[0]
print(cover_type.unique().tolist())  # Unique values of cover type
print(cover_type.nunique())  # Count of unique cover types
train["cover_type"] = cover_type  # Creating new column "cover_type"

['Paperback', 'Hardcover', 'Mass Market Paperback', 'Sheet music', 'Flexibound', 'Plastic Comb', 'Loose Leaf', 'Tankobon Softcover', 'Perfect Paperback', 'Board book', 'Cards', 'Spiral-bound', '(Kannada),Paperback', 'Product Bundle', 'Library Binding', '(German),Paperback', 'Leather Bound', '(French),Paperback', '(Spanish),Paperback']
19


In [12]:
train[train["cover_type"].isin(['Mass Market Paperback', 'Sheet music', 'Flexibound', 'Plastic Comb',\
                                'Loose Leaf', 'Tankobon Softcover', 'Perfect Paperback', 'Board book',\
                                'Cards', 'Spiral-bound', '(Kannada),Paperback', 'Product Bundle', 'Library Binding',\
                                '(German),Paperback', 'Leather Bound', '(French),Paperback', '(Spanish),Paperback'])].shape

(221, 12)
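
Since only 221 rows fall outside plain `Paperback` and `Hardcover`, one option is to pool the rare cover types into a single bucket. This is a sketch of that idea, not something the notebook does: the `"Other"` label and the choice to first strip a leading language prefix such as `(Kannada),` are assumptions.

```python
import pandas as pd

# A few cover types taken from the unique values listed above
cover_type = pd.Series([
    "Paperback", "Hardcover", "Mass Market Paperback",
    "(Kannada),Paperback", "Sheet music", "Paperback",
])

# Drop a leading "(Language)," prefix, e.g. "(Kannada),Paperback" -> "Paperback"
cleaned = cover_type.str.replace(r"^\([^)]*\),", "", regex=True)

# Keep the two dominant bindings and pool everything else as "Other"
main_types = ["Paperback", "Hardcover"]
grouped = cleaned.where(cleaned.isin(main_types), "Other")

print(grouped.tolist())
```

Whether `Mass Market Paperback` should count as `Paperback` rather than `Other` is a modelling decision that would need checking against prices.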

In [13]:
# Checking whether the year of publication can be extracted, since price may depend on it.
# Rows listed here split into fewer than three parts, e.g. the Edition holds only a year.
train.Edition[train.Edition.str.rsplit(pat = " ", n= 2, expand = True)[2].isnull()]

19      Paperback,– 2016
35      Paperback,– 2019
44      Hardcover,– 2019
60      Paperback,– 2016
98      Paperback,– 2011
              ...       
6165    Paperback,– 2017
6176    Paperback,– 2011
6177    Paperback,– 2010
6217    Hardcover,– 2015
6223    Paperback,– 2014
Name: Edition, Length: 338, dtype: object

In [14]:
train.Edition[train.Edition.str.contains("Box set")]

27              Paperback,– Box set, 15 Jun 2014
1605    Paperback,– Abridged, Audiobook, Box set
1769    Hardcover,– Abridged, Audiobook, Box set
2007            Paperback,– Box set, 10 Dec 2012
2359             Paperback,– Box set, 7 Oct 2008
2660    Paperback,– Abridged, Audiobook, Box set
3511    Paperback,– Abridged, Audiobook, Box set
3655             Hardcover,– Box set, 2 Aug 2009
4449            Paperback,– Box set, 13 Sep 2011
4907             Hardcover,– Box set, 7 Nov 2013
4964             Paperback,– Box set, 9 Jul 2003
5117    Paperback,– Abridged, Audiobook, Box set
5221               Paperback,– Box set, Aug 2013
5449             Paperback,– Box set, 5 Mar 2016
5927             Paperback,– Box set, 7 Oct 2008
5974             Paperback,– Box set, 7 Aug 2012
Name: Edition, dtype: object

In [15]:
train.Edition.str.rsplit(pat = " ", n= 2, expand = True)[2].unique()

array(['2016', '2012', '1982', '2017', '2006', '2009', '2018', '2015',
       '2013', '1999', '2002', '2011', '1991', None, '2014', '1989',
       '2000', '2005', '2008', '2019', '2004', '2010', '2007', '2001',
       '1969', '1993', '1992', '2003', '1996', 'Import', '1997', '1995',
       'NTSC', '1987', '1986', '1990', '1988', '1981', '1976', '1994',
       '1998', '1977', '1974', '1983', '1985', '1971', 'Facsimile', 'set',
       'Edition', '1964', '1984', '1980', 'Unabridged', 'Print', '1960',
       '1970', '1905', '1900', 'Audiobook', '1975', '1961', '1925',
       '1979', '1978'], dtype=object)
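
The last token is not always a year (`Import`, `NTSC`, `set`, ...), so taking it blindly would mix words into a numeric feature. A minimal sketch of a more defensive extraction, assuming the year is always a four-digit number at the very end of the string; the Edition strings are copied from outputs in this notebook.

```python
import pandas as pd

edition = pd.Series([
    "Paperback,– 10 Mar 2016",                   # full date
    "Paperback,– 2016",                          # year only
    "Hardcover,– Import, 1 Mar 2018",            # extra descriptor before the date
    "Paperback,– Abridged, Audiobook, Box set",  # no date at all
    "Paperback,– Box set, Aug 2013",             # month and year only
])

# Match a four-digit number at the end of the string; rows without one become NaN
year = pd.to_numeric(
    edition.str.extract(r"(\d{4})\s*$", expand=False),
    errors="coerce",
)

print(year)
```

Rows without a year stay missing and would need to be imputed or flagged before modelling.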

In [16]:
train.Edition[train.Edition.str.rsplit(pat = " ", n= 2, expand = True)[2]\
              .isin(["Import","NTSC","Facsimile","set", "Edition", "Unabridged",\
                    "Print", "Audiobook"])]

169                        Paperback,– Abridged, Import
235                            Plastic Comb,– DVD, NTSC
582                     Paperback,– Illustrated, Import
972                     Paperback,– Illustrated, Import
1233                    Paperback,– Large Print, Import
1558                      Hardcover,– Import, Facsimile
1605           Paperback,– Abridged, Audiobook, Box set
1631                    Paperback,– Large Print, Import
1643       Paperback,– Student Edition, Special Edition
1769           Hardcover,– Abridged, Audiobook, Box set
2101                  Hardcover,– Audiobook, Unabridged
2229       Paperback,– Abridged, Audiobook, Large Print
2660           Paperback,– Abridged, Audiobook, Box set
2779                    Paperback,– Illustrated, Import
3511           Paperback,– Abridged, Audiobook, Box set
3875                      Hardcover,– Facsimile, Import
3960    Paperback,– Illustrated, Large Print, Audiobook
4036                    Paperback,– Illustrated,

In [17]:
train.Edition[train.Edition.str.rsplit(pat = " ", n= 2, expand = True)[2]=="Print"]

2229    Paperback,– Abridged, Audiobook, Large Print
5860            Paperback,– Illustrated, Large Print
Name: Edition, dtype: object

In [18]:
train.Edition[train.Edition.str.rsplit(pat = " ", n= 2, expand = True)[2]=="Audiobook"]

3960    Paperback,– Illustrated, Large Print, Audiobook
Name: Edition, dtype: object

In [19]:
train.Edition[train.Edition.str.rsplit(pat = " ", n= 2, expand = True)[2]=="Edition"]

1643    Paperback,– Student Edition, Special Edition
Name: Edition, dtype: object

In [20]:
train.Edition[train.Edition.str.rsplit(pat = " ", n= 2, expand = True)[2]=="NTSC"]

235    Plastic Comb,– DVD, NTSC
Name: Edition, dtype: object

In [21]:
train.Edition[train.Edition.str.rsplit(pat = " ", n= 2, expand = True)[2]=="Unabridged"]

2101    Hardcover,– Audiobook, Unabridged
Name: Edition, dtype: object

In [22]:
train.Edition[train.Edition.str.rsplit(pat = " ", n= 2, expand = True)[2]=="Import"]

169        Paperback,– Abridged, Import
582     Paperback,– Illustrated, Import
972     Paperback,– Illustrated, Import
1233    Paperback,– Large Print, Import
1631    Paperback,– Large Print, Import
2779    Paperback,– Illustrated, Import
3875      Hardcover,– Facsimile, Import
4036    Paperback,– Illustrated, Import
4403    Hardcover,– Illustrated, Import
Name: Edition, dtype: object
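
Descriptors such as `Import`, `Box set`, `Audiobook` and `Illustrated` recur across editions, so they could be turned into boolean indicator features. The sketch below assumes plain substring matching is good enough; the column names (`is_import`, ...) are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Edition strings copied from outputs in this notebook
edition = pd.Series([
    "Paperback,– 10 Mar 2016",
    "Hardcover,– Import, 1 Mar 2018",
    "Paperback,– Box set, 15 Jun 2014",
    "Paperback,– Abridged, Audiobook, Box set",
    "Paperback,– Illustrated, Large Print",
])

# Hypothetical mapping: keyword to look for -> name of the indicator column
keywords = {
    "Import": "is_import",
    "Box set": "is_box_set",
    "Audiobook": "is_audiobook",
    "Illustrated": "is_illustrated",
}

# One boolean column per keyword, using literal (non-regex) matching
flags = pd.DataFrame({
    column: edition.str.contains(word, regex=False)
    for word, column in keywords.items()
})

print(flags.astype(int))
```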

In [23]:
train.Edition.str.split(pat = ",–", n = 1, expand = True)[0].unique()

array(['Paperback', 'Hardcover', 'Mass Market Paperback', 'Sheet music',
       'Flexibound', 'Plastic Comb', 'Loose Leaf', 'Tankobon Softcover',
       'Perfect Paperback', 'Board book', 'Cards', 'Spiral-bound',
       '(Kannada),Paperback', 'Product Bundle', 'Library Binding',
       '(German),Paperback', 'Leather Bound', '(French),Paperback',
       '(Spanish),Paperback'], dtype=object)

In [24]:
train.Edition[train.Edition.str.contains("Kannada")].unique()

array(['(Kannada),Paperback,– 2014'], dtype=object)

In [25]:
train.Genre[train.Genre.str.contains("International Relations")].unique()

array(['International Relations',
       'International Relations & Globalization (Books)'], dtype=object)
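
Many genre labels differ only by a trailing ` (Books)` suffix, which inflates the number of distinct categories. A small sketch of one possible normalisation, assuming the suffix carries no information; note that it deliberately leaves labels like `Biology Books` alone and does not merge the two International Relations variants.

```python
import pandas as pd

# Genre labels copied from outputs in this notebook
genre = pd.Series([
    "Action & Adventure (Books)",
    "International Relations",
    "International Relations & Globalization (Books)",
    "Biology Books",
])

# Remove only a trailing "(Books)" marker, together with the space before it
genre_clean = genre.str.replace(r"\s*\(Books\)$", "", regex=True).str.strip()

print(genre_clean.tolist())
```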

In [26]:
sorted(train.Genre.unique().tolist()) # Checking Genre Visually

['API & Operating Environments',
 'Action & Adventure (Books)',
 'Active Outdoor Pursuits (Books)',
 'Aeronautical Engineering',
 'Aesthetics',
 'Agriculture & Farming (Books)',
 'Air Sports (Books)',
 'Algebra & Trigonometry',
 'Algorithms',
 'Alphabet Reference',
 'Alternative Medicine (Books)',
 'American Football (Books)',
 'American Literature',
 'Americas',
 'Anatomy & Physiology',
 'Ancient History (Books)',
 'Anthologies (Books)',
 'Anthropology (Books)',
 'Archery (Books)',
 'Architecture (Books)',
 'Art Encyclopedias',
 'Art History',
 'Artificial Intelligence',
 'Arts History, Theory & Criticism (Books)',
 'Arts, Film & Photography (Books)',
 'Asian History',
 'Asian Literature',
 'Astrology',
 'Astronomy & Astrophysics',
 'Astronomy (Books)',
 'Atheism',
 'Banks & Banking',
 'Baseball (Books)',
 'Basketball (Books)',
 'Biographies & Autobiographies (Books)',
 'Biographies, Diaries & True Accounts (Books)',
 'Biology & Life Sciences',
 'Biology Books',
 'Biomedical Engineeri

In [27]:
train.BookCategory.unique() # Checking BookCategory Visually

array(['Action & Adventure', 'Biographies, Diaries & True Accounts',
       'Humour', 'Crime, Thriller & Mystery', 'Arts, Film & Photography',
       'Sports', 'Language, Linguistics & Writing',
       'Computing, Internet & Digital Media', 'Romance',
       'Comics & Mangas', 'Politics'], dtype=object)