# Predict The Price Of Books
> [Machine Hack Hackathon](https://www.machinehack.com/course/predict-the-price-of-books/)

<img src="https://www.machinehack.com/wp-content/uploads/2019/09/gregory-culmer-e8ThqioFqgs-unsplash.jpg" width="50%" height="50%">
<br>
<b>Size of training set:</b> 6237 records
<br>
<b>Size of test set:</b> 1560 records
<br>
<b>FEATURES:</b>
<ul>
    <li>Title: <i>The title of the book</i>
    <li>Author: <i>The author(s) of the book.</i>
    <li>Edition: <i>The edition of the book, e.g. (Paperback,– Import, 26 Apr 2018)</i>
    <li>Reviews: <i>The customer reviews about the book</i>
    <li>Ratings: <i>The customer ratings of the book</i>
    <li>Synopsis: <i>The synopsis of the book</i>
    <li>Genre: <i>The genre the book belongs to</i>
    <li>BookCategory: <i>The department the book is usually available at.</i>
    <li>Price: <i>The price of the book (Target variable)</i>
</ul>

**Importing Libraries**

In [1]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns

from collections import Counter
from matplotlib import pyplot as plt

warnings.filterwarnings('ignore')



**Importing Data**

In [2]:
train = pd.read_excel('Data_Train.xlsx')
test = pd.read_excel('Data_Test.xlsx')

In [3]:
train.drop_duplicates(inplace=True)

In [4]:
train.head()

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0 out of 5 stars,8 customer reviews,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9 out of 5 stars,14 customer reviews,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8 out of 5 stars,6 customer reviews,"""During the time men live without a common Pow...",International Relations,Humour,299.0
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",4.1 out of 5 stars,13 customer reviews,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006",5.0 out of 5 stars,1 customer review,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62


**Combining Datasets (Train + Test)** - _for cleaning and feature engineering_

In [5]:
combined = pd.concat([train, test], sort=False)
combined.reset_index(drop=True, inplace=True)

# Feature Cleaning & Extraction

In [6]:
combined['Title'] = combined['Title'].str.lower()

*Tried extracting the subtitle from Title, but it didn't help, so the cell below is left commented out*

In [None]:
# title_split = combined['Title'].str.split('\(').str.join('--')\
#                                .str.split('\)').str.join('--')\
#                                .str.split('--',expand=True,n=2)

# combined['subtitle'] = title_split[1].fillna('None')
# combined['Title'] = title_split[0]


**Splitting Edition** - *into the binding type and the remaining edition details*

In [7]:
combined[['EditionBinding','EditionType1']] = \
    combined['Edition'].str.split(',– ',expand=True)

In [8]:
combined[['EditionBinding','EditionType1']]

Unnamed: 0,EditionBinding,EditionType1
0,Paperback,10 Mar 2016
1,Paperback,7 Nov 2012
2,Paperback,25 Feb 1982
3,Paperback,5 Oct 2017
4,Hardcover,10 Oct 2006
...,...,...
7792,Paperback,14 Apr 2011
7793,Paperback,8 May 2013
7794,Paperback,6 Sep 2011
7795,Paperback,22 Sep 2009


**Binning Edition Binding** - *edition bindings with 9 or fewer occurrences are grouped as "other"*

In [9]:
edition_binding_dict = combined['EditionBinding'].value_counts().to_dict()

combined['EditionBinding'] = combined['EditionBinding'].apply(lambda x: \
        (x if edition_binding_dict[x] > 9 else 'other'))
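
As a rough illustration of the binning step above, the following stdlib-only sketch counts each label and replaces the rare ones. The sample labels and the threshold `min_count = 1` are made up for the example; the notebook itself keeps bindings that occur more than 9 times.

```python
from collections import Counter

# Hypothetical sample of binding labels (not taken from the dataset)
bindings = ['Paperback', 'Paperback', 'Hardcover', 'Hardcover', 'Board book']

counts = Counter(bindings)
min_count = 1  # illustrative threshold; the notebook uses 9

# Keep a label only if it occurs more than min_count times
binned = [b if counts[b] > min_count else 'other' for b in bindings]
print(binned)
```

Grouping rare labels this way keeps the later label encoding from assigning separate codes to categories that have too few rows to learn from.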

**Splitting Edition remainder part** - *extracting edition date and edition type*

In [12]:
def split_edition_1(x):
    j_arr = []
    date = ''

    for j in x.split(', '):
        if not any(k.isnumeric() for k in j):
            j_arr.append(j.strip())
        else:
            date = j

    if ''.join(j_arr) != '':
        ed = ', '.join(j_arr)
    else:
        ed = 'other'

    # Keep only the known edition types; everything else is grouped as 'other'
    known_types = {'Import', 'Illustrated', 'Special Edition', 'Unabridged',
                   'Student Edition', 'Box set', 'International Edition',
                   'Abridged'}
    ed_ret = ed if ed in known_types else 'other'

    return (ed_ret, date)

In [13]:
combined['EditionType'],combined['EditionDate'] = \
    zip(*combined['EditionType1'].apply(split_edition_1))
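
To make the two splitting steps concrete, here is a small sketch that takes one raw `Edition` string apart in plain Python. The helper name `parse_edition` is hypothetical, and unlike `split_edition_1` it does not restrict the edition type to the known list.

```python
def parse_edition(edition):
    """Split an Edition string into (binding, edition_type, date)."""
    binding, _, rest = edition.partition(',– ')
    kinds, date = [], ''
    for part in rest.split(', '):
        # A part containing any digit is treated as the date
        if any(ch.isnumeric() for ch in part):
            date = part
        else:
            kinds.append(part.strip())
    edition_type = ', '.join(kinds) if ''.join(kinds) else 'other'
    return binding, edition_type, date

print(parse_edition('Paperback,– Import, 26 Apr 2018'))
print(parse_edition('Hardcover,– 10 Oct 2006'))
```

The first call yields the binding, the type `Import` and the date; the second has no type text, so it falls back to `other`.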

**Splitting Edition date** - *extracting Month & Year*

In [15]:
def split_edition_date(x):
    (mon, year) = ('', '')
    if len(x.split()) == 1:
        year = int(x)
    elif len(x.split()) == 2:
        mon = x.split()[0]
        year = int(x.split()[1])
    elif len(x.split()) == 3:
        mon = x.split()[1]
        year = int(x.split()[2])
    return (mon, year)

In [16]:
combined['EditionMon'], combined['EditionYear'] = \
    zip(*combined['EditionDate'].apply(split_edition_date))

**Binning Month** - *combining into quarters*

In [17]:
def bin_edition_mon(x):
    x = x.lower()
    if x == 'jan' or x == 'feb' or x == 'mar':
        return 'first'
    elif x == 'apr' or x == 'may' or x == 'jun':
        return 'second'
    elif x == 'jul' or x == 'aug' or x == 'sep':
        return 'third'
    elif x == '':
        return ''
    else:
        return 'fourth'
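
The same month-to-quarter mapping can also be sketched with a lookup list and integer division. The names `MONTHS`, `QUARTERS` and `month_to_quarter` are assumptions for this example, and there is one deliberate difference: unrecognised text returns an empty string here, whereas `bin_edition_mon` sends it to `'fourth'`.

```python
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
QUARTERS = ['first', 'second', 'third', 'fourth']

def month_to_quarter(mon):
    mon = mon.lower()
    if mon not in MONTHS:
        return ''  # unknown or missing month stays empty in this sketch
    # Months 0-2 -> quarter 0, 3-5 -> quarter 1, and so on
    return QUARTERS[MONTHS.index(mon) // 3]

print(month_to_quarter('Mar'), month_to_quarter('Dec'))
```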

*making columns to mark null values*

In [18]:
combined['EditionMon'] = combined['EditionMon'].apply(bin_edition_mon)

combined['Mon_null'] = combined['EditionMon'].apply(lambda x: \
        ('not_null' if x != '' else 'null'))
combined['Year_null'] = combined['EditionYear'].apply(lambda x: \
        ('not_null' if x != '' else 'null'))

**Imputing Month and Year** - *by most common values*

In [19]:
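# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which may act on a temporary copy in newer pandas versions
combined['EditionMon'] = combined['EditionMon']\
    .replace('', combined['EditionMon'].mode()[0])
combined['EditionYear'] = combined['EditionYear']\
    .replace('', combined['EditionYear'].mode()[0])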
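
A minimal sketch of mode imputation in plain Python is shown below. The sample values are hypothetical, and the sketch ignores blanks when picking the most common value, which is a small assumption on top of the notebook (there the mode is taken over the whole column).

```python
from collections import Counter

# Hypothetical quarter labels; '' marks a missing month
quarters = ['first', '', 'fourth', 'first', '']

# Most frequent non-empty value
most_common = Counter(q for q in quarters if q != '').most_common(1)[0][0]

# Replace every blank with that value
imputed = [q if q != '' else most_common for q in quarters]
print(imputed)
```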

**Extracting Reviews & Ratings** - *converting to numerical data*

In [20]:
combined['Reviews'] = combined['Reviews'].apply(lambda x: \
        float(x.split()[0]))

In [21]:
combined['Ratings'] = combined['Ratings'].apply(lambda x: \
        int(''.join(x.split()[0].split(','))))
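
The two conversions above can be checked on single strings. Note that in this dataset the `Reviews` column holds the star rating and the `Ratings` column holds the review count. The helper names and the `'1,234 customer reviews'` input are made up for illustration.

```python
def parse_stars(text):
    # '4.0 out of 5 stars' -> 4.0
    return float(text.split()[0])

def parse_count(text):
    # '1,234 customer reviews' -> 1234 (thousands separator removed)
    return int(text.split()[0].replace(',', ''))

print(parse_stars('4.0 out of 5 stars'))
print(parse_count('8 customer reviews'))
print(parse_count('1,234 customer reviews'))
```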

# Feature Engineering 
*Engineering new features*

**Ratings and Reviews Ratio**

In [22]:
combined['RatingPerReview'] = round(combined['Ratings']
                                    / combined['Reviews'], 2)

**Impact of Book Age on Reviews**

In [23]:
combined['Review_Year_Impact'] = combined['Reviews'] \
    * combined['EditionYear'].apply(lambda x: 2019 - x)
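
As a worked example, the two engineered features can be computed by hand for the first training row (4.0 stars, 8 customer reviews, edition year 2016). The variable names are only for this sketch.

```python
# Cleaned values of the first training row
reviews = 4.0        # star rating, stored in the 'Reviews' column
ratings = 8          # number of customer reviews, stored in 'Ratings'
edition_year = 2016

rating_per_review = round(ratings / reviews, 2)
review_year_impact = reviews * (2019 - edition_year)  # 2019 is the reference year used above

print(rating_per_review, review_year_impact)  # 2.0 12.0
```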

**Author Name Cleaning**

In [24]:
author_replacements = {
    ' & ':', ', "0":"other", "2":"other",
    'A. P. J. Abdul Kalam':'A.P.J. Abdul Kalam', 'APJ Abdul Kalam':'A.P.J. Abdul Kalam',
    'Agrawal P. K.': 'Agrawal P.K', 'Ajay K Pandey': 'Ajay K. Pandey',
    'Aravinda Anantharaman': 'Aravinda Anatharaman',
    'Arthur Conan Doyle': 'Sir Arthur Conan Doyle', 'B A Paris': 'B. A. Paris',
    'E L James': 'E. L. James', 'E.L. James':'E. L. James',
    'Eliyahu M Goldratt': 'Eliyahu M. Goldratt', 'Ernest Hemingway': 'Ernest Hemmingway',
    'Frank Miler': 'Frank Miller', 'Fyodor Dostoevsky': 'Fyodor Dostoyevsky',
    'George R R Martin': 'George R. R. Martin', 'George R.R. Martin':'George R. R. Martin',
    'H. G. Wells': 'H.G. Wells', 'Johann Wolfgang Von Goethe': 'Johann Wolfgang von Goethe',
    'John Le Carré': 'John le Carré', 'Judith McNaught': 'Judith Mcnaught',
    'Keith Giffen': 'Kieth Giffen', 'Ken Hultgen': 'Ken Hultgren',
    'Kentaro Miura': 'Kenturo Miura', 'Kohei Horikoshi': 'Kouhei Horikoshi',
    'M.K Gandhi': 'M.K. Gandhi', 'Matthew K Manning': 'Matthew Manning',
    'Michael Crichton': 'Micheal Crichton', 'N.K Aggarwala': 'N.K. Aggarwala',
    'Oxford University Press (India)': 'Oxford University Press India',
    'P D James': 'P. D. James', 'Paramahansa Yogananda': 'Paramhansa Yogananda',
    'R K Laxman': 'R. K. Laxman', 'R.K. Laxman': 'R. K. Laxman', 'R. M. Lala': 'R.M. Lala',
    'Raina Telgemaeier': 'Raina Telgemeier', 'Rajaraman': 'Rajaraman V',
    'Rajiv M. Vijayakar': 'Rajiv Vijayakar', 'Ramachandra Guha': 'Ramchandra Guha',
    'Rene Goscinny': 'René Goscinny', 'Richard P Feynman': 'Richard P. Feynman',
    'S Giridhar': 'S. Giridhar', 'S Hussain Zaidi': 'S. Hussain Zaidi',
    'S. A. Chakraborty': 'S. Chakraborty', 'Santosh Kumar K': 'Santosh Kumar K.',
    "S.C. Gupta" : "S. C. Gupta",
    'Shiv Prasad Koirala': 'Shivprasad Koirala', 'Shivaprasad Koirala': 'Shivprasad Koirala',
    'Simone De Beauvoir': 'Simone de Beauvoir', 'Sir Arthur Conan Doyle': 'Arthur Conan Doyle',
    "Terry O' Brien": "Terry O'Brien", 'Thich Nhat Hahn': 'Thich Nhat Hanh',
    'Trinity College Lond': 'Trinity College London',
    "Trinity College London Press" : "Trinity College London",
    'Ursula K. Le Guin': 'Ursula Le Guin', 'Willard A Palmer': 'Willard A. Palmer',
    'Willard Palmer': 'Willard A. Palmer', 'William Strunk Jr': 'William Strunk Jr.',
    'Yashavant Kanetakr': 'Yashavant Kanetkar', 'Yashavant P. Kanetkar': 'Yashavant Kanetkar',
    'Yashwant Kanetkar': 'Yashavant Kanetkar', 'et al': 'et al.', ' et al': 'et al.',
    'Peter Clutterbuck': ' Peter Clutterbuck', 'Scholastic': 'Scholastic ',
    'Ullekh N. P.': 'Ullekh N.P.', 'Shalini Jain': 'Dr. Shalini Jain',
    'Kevin Mitnick': 'Kevin D. Mitnick',
}
combined['Author'] = combined['Author'].replace(author_replacements,regex=True)
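
Because `regex=True` is passed, each dictionary key is used as a regular expression rather than as literal text, so an unescaped `.` matches any character and short keys such as `"0"` match inside longer names. The sketch below uses only the standard `re` module to show the difference on one key; the input `'MxK Gandhi'` is a made-up string, and escaping the keys is a possible adjustment, not something the notebook does.

```python
import re

pattern, replacement = 'M.K Gandhi', 'M.K. Gandhi'

# Unescaped: '.' is a wildcard, so an unrelated string also matches
loose = re.sub(pattern, replacement, 'MxK Gandhi')

# Escaped: only the literal text 'M.K Gandhi' matches
strict = re.sub(re.escape(pattern), replacement, 'MxK Gandhi')

print(loose)   # M.K. Gandhi
print(strict)  # MxK Gandhi
```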

**No. of Authors of a book**

In [25]:
combined['Authors_count'] = combined['Author'].apply(lambda x: \
        len(x.split(',')))

**Average Author reviews**

In [26]:
author_avg_review_dict = round(combined[combined.Authors_count== 1]
                               .groupby('Author',sort=False)['Reviews']
                               .mean(), 2).to_dict()

In [27]:
def check_author(x):
    reviews = []
    for name in x.split(', '):
        # Authors without a single-author book are missing from the lookup
        if name in author_avg_review_dict:
            reviews.append(author_avg_review_dict[name])
    if len(reviews) != 0:
        return sum(reviews) / len(reviews)
    else:
        return ''

In [28]:
combined['AuthorAvgReview'] = combined['Author'].apply(check_author)
combined['AuthorAvgReview'] = combined[['Reviews', 'AuthorAvgReview']]\
        .apply(lambda x: (x[0] if x[1] == '' else x[1]), axis=1)
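
The lookup-with-fallback logic of the two cells above can be sketched in one small function. The dictionary holds two averages that appear in the output further down; the function name `avg_author_review` and the `'Unknown Author'` input are hypothetical.

```python
# Per-author averages, built from single-author books only
author_avg = {'Agatha Christie': 4.23, 'Chris Kuzneski': 3.98}

def avg_author_review(authors, own_review):
    known = [author_avg[a] for a in authors.split(', ') if a in author_avg]
    # Fall back to the book's own value when no author is in the lookup
    return sum(known) / len(known) if known else own_review

print(avg_author_review('Agatha Christie', 4.1))
print(avg_author_review('Unknown Author', 3.5))
```

For a book with several authors the result is the plain mean of the known authors' averages.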

**No. of Books from an Author**

In [29]:
combined['Count_Author_Title'] = combined['Author'].map(combined.groupby('Author',sort=False)['Title'].apply(lambda x: len(x.unique())).to_dict())

**No. of occurrences of a Title**
<br>
**Average:** 
- Book - Author Count
- Title - reviews

In [30]:
combined['MEAN_Title_Authors_count'] = round(combined
                                            .groupby('Title',sort=False)['Authors_count']
                                            .transform('mean'), 2)

combined['MEAN_Ttle_Reviews'] = round(combined
                                      .groupby('Title',sort=False)['Reviews']
                                      .transform('mean'), 2)

combined['Title_count'] = combined.groupby('Title',sort=False)['Title']\
                                  .transform('count')
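
For readers unfamiliar with `groupby(...).transform(...)`, the plain-Python sketch below shows the idea: an aggregate is computed per title and then written back to every row of that title. The rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical (title, review) rows
rows = [('leviathan', 4.8), ('dune', 4.0), ('leviathan', 4.2)]

by_title = defaultdict(list)
for title, review in rows:
    by_title[title].append(review)

# One value per input row, aligned with the original row order
title_count = [len(by_title[t]) for t, _ in rows]
mean_title_reviews = [round(sum(by_title[t]) / len(by_title[t]), 2)
                      for t, _ in rows]

print(title_count)
print(mean_title_reviews)
```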

**Various Categories of a book**

In [31]:
title_cat_dict = combined[combined.Authors_count == 1]\
                 .groupby('Title',sort=False)['BookCategory']\
                 .apply(lambda x: ', '.join(x)).to_dict()
combined['TitleCategories'] = combined['Title'].map(title_cat_dict)
combined['TitleCategories'] = combined[['BookCategory','TitleCategories']]\
                              .apply(lambda x: (x[0] if pd.isna(x[1]) else x[1]),axis=1)

**Various Genres of a book**

In [32]:
title_genre_dict = combined[combined.Authors_count == 1]\
                   .groupby('Title',sort=False)['Genre']\
                   .apply(lambda x: ', '.join(x)).to_dict()
combined['TitleGenres'] = combined['Title'].map(title_genre_dict)
combined['TitleGenres'] = combined[['Genre', 'TitleGenres']]\
                          .apply(lambda x: (x[0] if pd.isna(x[1]) else x[1]), axis=1)

**Categories of books written by an author**

In [33]:
author_cat_dict = combined[combined.Authors_count==1]\
                 .groupby('Author',sort=False)['BookCategory']\
                 .apply(lambda x: ', '.join(x)).to_dict()
combined['AuthorCategories'] = combined['Author'].map(author_cat_dict)
combined['AuthorCategories'] = combined[['BookCategory','AuthorCategories']]\
                               .apply(lambda x: x[0] if pd.isna(x[1]) else x[1],axis=1)

**Genres of books written by an author**

In [34]:
author_genre_dict = combined[combined.Authors_count==1]\
                    .groupby('Author',sort=False)['Genre']\
                    .apply(lambda x: ', '.join(x)).to_dict()
combined['AuthorGenres'] = combined['Author'].map(author_genre_dict)
combined['AuthorGenres'] = combined[['Genre','AuthorGenres']]\
                           .apply(lambda x: x[0] if pd.isna(x[1]) else x[1],axis=1)

In [35]:
combined['TitleGenres'] = combined['TitleGenres'].str.replace(' & ',', ')
combined['AuthorGenres'] = combined['AuthorGenres'].str.replace(' & ',', ')
combined['Genre'] = combined['Genre'].str.replace(' & ',', ')

**Binning Edition Year** - *into 5 quantile bins over the distribution of years*

In [36]:
combined['EditionYearBin'] = pd.qcut(combined['EditionYear'],5,labels=False)
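
Quantile binning puts roughly the same number of rows into each bin. The stdlib sketch below approximates the idea with hypothetical years; it does not reproduce how pandas handles ties or duplicate bin edges.

```python
from bisect import bisect_left
from statistics import quantiles

# Hypothetical edition years
years = [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]

# Four cut points that divide the data into five equally filled bins
cuts = quantiles(years, n=5, method='inclusive')

# Bin index = number of cut points strictly below the value
bins = [bisect_left(cuts, y) for y in years]
print(bins)
```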

In [37]:
combined.head()

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price,EditionBinding,...,AuthorAvgReview,Count_Author_Title,MEAN_Title_Authors_count,MEAN_Ttle_Reviews,Title_count,TitleCategories,TitleGenres,AuthorCategories,AuthorGenres,EditionYearBin
0,the prisoner's gold (the hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0,8,THE HUNTERS return in their third brilliant no...,"Action, Adventure (Books)",Action & Adventure,220.0,Paperback,...,3.98,4,1.0,4.0,1,Action & Adventure,"Action, Adventure (Books)","Action & Adventure, Crime, Thriller & Mystery,...","Action, Adventure (Books), Crime, Thriller, My...",2
1,guru dutt: a tragedy in three acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9,14,A layered portrait of a troubled genius for wh...,"Cinema, Broadcast (Books)","Biographies, Diaries & True Accounts",202.93,Paperback,...,3.9,1,1.0,3.9,2,"Biographies, Diaries & True Accounts, Arts, Fi...","Cinema, Broadcast (Books), Cinema, Broadcast (...","Biographies, Diaries & True Accounts, Arts, Fi...","Cinema, Broadcast (Books), Cinema, Broadcast (...",1
2,leviathan (penguin classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8,6,"""During the time men live without a common Pow...",International Relations,Humour,299.0,Paperback,...,4.8,1,1.0,4.8,3,"Humour, Language, Linguistics & Writing, Politics","International Relations, International Relatio...","Humour, Language, Linguistics & Writing, Politics","International Relations, International Relatio...",0
3,a pocket full of rye (miss marple),Agatha Christie,"Paperback,– 5 Oct 2017",4.1,13,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0,Paperback,...,4.23,78,1.0,4.1,1,"Crime, Thriller & Mystery",Contemporary Fiction (Books),"Crime, Thriller & Mystery, Crime, Thriller & M...","Contemporary Fiction (Books), Crime, Thriller,...",3
4,life 70 years of extraordinary photography,Editors of Life,"Hardcover,– 10 Oct 2006",5.0,1,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62,Hardcover,...,5.0,1,1.0,5.0,1,"Arts, Film & Photography",Photography Textbooks,"Arts, Film & Photography",Photography Textbooks,0


## Label & Count Encoding

In [38]:
from sklearn.preprocessing import LabelEncoder

label_cols = ['BookCategory', 'EditionBinding', 'EditionMon', 'EditionType',
              'EditionYearBin', 'Mon_null', 'Year_null']
# apply() calls fit_transform once per column, so each column is encoded independently
combined[label_cols] = combined[label_cols].apply(LabelEncoder().fit_transform)
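
Label encoding maps each distinct value of a column to an integer, with the classes taken in sorted order. A minimal plain-Python sketch with made-up values:

```python
# Hypothetical binding labels
values = ['Paperback', 'Hardcover', 'other', 'Paperback']

classes = sorted(set(values))   # uppercase letters sort before lowercase
code = {label: i for i, label in enumerate(classes)}
encoded = [code[v] for v in values]

print(classes)
print(encoded)  # [1, 0, 2, 1]
```

The resulting integers carry no meaningful order, which tree-based models such as the ones trained later can usually tolerate.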

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

tc_vectorizer = CountVectorizer(lowercase=True, tokenizer=lambda x: \
                                 x.split(', '))
title_categories_vector = tc_vectorizer.fit_transform(combined['TitleCategories']).toarray()
df_title_categories = pd.DataFrame(data=title_categories_vector,
                      columns=tc_vectorizer.get_feature_names())
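
The count encoding with a `', '` tokenizer amounts to counting how often each category appears in a row. The sketch below does this with the standard library on hypothetical `TitleCategories` values; it leaves out options such as `max_features`, which some of the later cells use to keep only the most frequent tokens.

```python
from collections import Counter

# Hypothetical TitleCategories values
docs = ['Humour, Politics', 'Politics', 'Humour, Humour']

# Lowercase first, then split on ', ' as the custom tokenizer does
tokenized = [Counter(d.lower().split(', ')) for d in docs]

vocab = sorted(set().union(*tokenized))   # column names in alphabetical order
matrix = [[c[token] for token in vocab] for c in tokenized]

print(vocab)
print(matrix)
```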

In [40]:
ac_vectorizer = CountVectorizer(lowercase=True, 
                                 tokenizer=lambda x: x.split(', '))
author_categories_vector = ac_vectorizer.fit_transform(combined['AuthorCategories']).toarray()
df_author_categories = pd.DataFrame(data=author_categories_vector,
                      columns=ac_vectorizer.get_feature_names())

In [41]:
tg_vectorizer = CountVectorizer(max_features=10, lowercase=True,
                                tokenizer=lambda x: x.split(', '))
title_genres_vector = tg_vectorizer.fit_transform(combined['TitleGenres']).toarray()
df_title_genres = pd.DataFrame(data=title_genres_vector,
                     columns=tg_vectorizer.get_feature_names())

In [42]:
ag_vectorizer = CountVectorizer(max_features=10, lowercase=True,
                                    tokenizer=lambda x: x.split(', '))
author_genres_vector = ag_vectorizer.fit_transform(combined['AuthorGenres']).toarray()
df_author_genres = pd.DataFrame(data=author_genres_vector,
                         columns=ag_vectorizer.get_feature_names())

In [43]:
title_vectorizer = CountVectorizer(max_features=10, lowercase=True)
title_vector = title_vectorizer.fit_transform(combined['Title']).toarray()
df_title = pd.DataFrame(data=title_vector,
                        columns=title_vectorizer.get_feature_names())

In [44]:
vectorizer_author = CountVectorizer(max_features=10, lowercase=True,
                                    tokenizer=lambda x: x.split(', '))
vector_author = vectorizer_author.fit_transform(combined['Author']).toarray()
df_author = pd.DataFrame(data=vector_author,
                         columns=vectorizer_author.get_feature_names())

In [45]:
vectorizer_genre = CountVectorizer(max_features=10,
                                   lowercase=True, tokenizer=lambda x: x.split(', '))
vector_genre = vectorizer_genre.fit_transform(combined['Genre']).toarray()
df_genre = pd.DataFrame(data=vector_genre,
                        columns=vectorizer_genre.get_feature_names())

In [46]:
vectorizer_synopsis = CountVectorizer(max_features=10,
                                      stop_words='english', 
                                      strip_accents='ascii', 
                                      lowercase=True)
vector_synopsis = vectorizer_synopsis.fit_transform(combined['Synopsis']).toarray()
df_synopsis = pd.DataFrame(data=vector_synopsis,
                           columns=vectorizer_synopsis.get_feature_names())

In [47]:
combined.drop(columns=[
    'Title',
    'Author',
    'Genre',
    'Synopsis',
    'Edition',
    'EditionDate',
    'EditionType1',
    'AuthorCategories',
    'AuthorGenres',
    'TitleGenres',
    'TitleCategories'
    ], inplace=True)

In [48]:
print('No. of Features:',combined.shape[1])

No. of Features: 19


**Feature correlations**

In [49]:
df = pd.concat([
    combined, # label encoded and numeric features
    df_author, # author count encoded
    df_genre, # genre count encoded
    df_title, # title count encoded
    df_synopsis, # synopsis count encoded
    df_author_genres, # author_genres count encoded
    df_title_genres, # title_genres count encoded
    df_author_categories, # author_categories count encoded
    df_title_categories, # title_categories count encoded
    ], axis=1)
df.reset_index(drop=True, inplace=True)

In [None]:
# feature correlations
# corr = df.corr()
# corr = corr[(corr.Price>-0.005) & (corr.Price< 0.005)]
# corr = pd.concat([corr[corr.index],corr.Price],axis=1)
# plt.figure(figsize=(10,10))
# sns.heatmap(corr, xticklabels=corr.columns,
#                     yticklabels=corr.columns,
#                     vmin=-0.1, vmax=0.1
#             )

In [50]:
print('No. of Features(final):',df.shape[1])

No. of Features(final): 111


# Train / Test Split

In [51]:
train = df[df['Price'].notna()]
# drop() returns a new frame, avoiding an in-place edit on a slice of df
test = df[df['Price'].isna()].drop(columns=['Price'])

In [52]:
# train = train[(train['Price'] <= 12000) 
#             & (train['EditionYear']>= 1980) 
#             & (train['Ratings'] < 680)]
train = train[train['Price'] <= 12000]

In [53]:
X = train.loc[:, train.columns != 'Price'].values
X = X.astype(float)

# Dependent Variable

y = np.log1p(train['Price'].values)
y = y.astype(float)

# Test - (Independent Variables)

test = test.loc[:].values
test = test.astype(float)
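
The target is trained on a log scale, so predictions have to be mapped back to prices afterwards. A tiny round trip with the standard `math` module, using the price of the first training row:

```python
import math

price = 220.0
y = math.log1p(price)        # training target: log(1 + price)
recovered = math.expm1(y)    # inverse applied to predictions: exp(y) - 1

print(y, recovered)
```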

# Model Training

**Importing libraries**

In [54]:
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import make_scorer
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection \
    import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble \
    import RandomForestRegressor, VotingRegressor, AdaBoostRegressor

**RMSLE scoring function** - *returns 1 - RMSLE, with the error computed on base-10 logs of the prices*

In [55]:
def score(y_true, y_pred):
    # Undo the log1p transform applied to the target; clip negative prices to 0
    y_pred = np.clip(np.expm1(y_pred), 0, None)
    y_true = np.expm1(y_true)
    # Root mean squared error between the base-10 logs of the prices
    error = np.sqrt(np.mean(np.square(np.log10(y_pred + 1)
                                      - np.log10(y_true + 1))))
    return 1 - error
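
To get a feel for the metric, here is a stdlib-only sketch that works directly on prices, skipping the `log1p` inversion step. The function name `rmsle_score` and the prices are made up for the example.

```python
import math

def rmsle_score(true_prices, pred_prices):
    # Squared differences of base-10 logs of (price + 1)
    sq = [(math.log10(p + 1) - math.log10(t + 1)) ** 2
          for t, p in zip(true_prices, pred_prices)]
    return 1 - math.sqrt(sum(sq) / len(sq))

perfect = rmsle_score([99, 999], [99, 999])
off = rmsle_score([99, 999], [99, 99])
print(perfect, off)
```

A perfect prediction scores 1. In the second call one of the two predictions is off by a factor of ten, which is one unit on the log10 scale, so the error is sqrt(0.5) and the score drops to about 0.29.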

#### RandomForestRegressor

In [56]:
rf = RandomForestRegressor(random_state=0,bootstrap=False,max_features='sqrt')

cvs = cross_val_score(rf, X, y, cv=5,verbose=2,n_jobs=-1,
                      scoring=make_scorer(score,greater_is_better=True))
mean_score = sum(cvs)/len(cvs)
# print("Average Score:",mean_score)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   16.6s remaining:   25.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   16.7s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   16.7s finished


#### LGBMRegressor

In [57]:
lgbm = lgb.LGBMRegressor(random_state=0,n_jobs=-1)

cvs = cross_val_score(lgbm, X, y, cv=5,verbose=2,n_jobs=-1,
                      scoring=make_scorer(score,greater_is_better=True))
mean_score = sum(cvs)/len(cvs)
# print("Average Score:",mean_score)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    7.4s remaining:   11.2s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    7.5s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    7.5s finished


#### XGBRegressor

In [58]:
xgb = XGBRegressor(n_jobs=-1)

cvs = cross_val_score(xgb, X, y, cv=5,verbose=2,n_jobs=-1,
                        scoring=make_scorer(score,greater_is_better=True))
mean_score = sum(cvs)/len(cvs)
# print("Average Score:",mean_score)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    4.5s remaining:    6.8s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    6.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    6.2s finished


#### VotingRegressor

In [59]:
vr = VotingRegressor([('xgb', xgb), ('rf', rf), ('lgbm', lgbm)])

cvs = cross_val_score(vr, X, y, cv=5,verbose=50,n_jobs=-1,
                        scoring=make_scorer(score,greater_is_better=True))
mean_score = sum(cvs)/len(cvs)
# print("Average Score:",mean_score)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   11.0s
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   11.0s remaining:   16.6s
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:   11.1s remaining:    7.4s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   11.3s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   11.3s finished
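
With no weights given, a voting regressor averages the predictions of its fitted base models. The sketch below shows that averaging on hypothetical log-price predictions for two books.

```python
# Hypothetical log-price predictions from three fitted models
preds = {
    'xgb':  [5.40, 6.10],
    'rf':   [5.20, 6.30],
    'lgbm': [5.30, 6.20],
}

# Unweighted mean across models, one value per book
ensemble = [sum(col) / len(col) for col in zip(*preds.values())]
print(ensemble)
```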


# Final Predictions

In [60]:
vr.fit(X, y)
# Invert the log1p transform and clip negative prices to zero
Y_pred2 = np.clip(np.expm1(vr.predict(test)), 0, None)

pd.DataFrame(Y_pred2, columns = ['Price'])\
    .to_excel("predictions.xlsx", index=False)

