# USED ELECTRONICS PRICE PREDICTION HACKATHON by MachineHack
### Solution by: Pratik Nabriya |[ Github](https://github.com/pratiknabriya) | [LinkedIn](https://www.linkedin.com/in/pratiknabriya/) | 

### Description

We live in a world that is driven by technology and electronic devices as gadgets have become a part of our daily life. It is near impossible to think of a world without smartphones or tablets. Like many kinds of goods or products, used electronic devices have a good demand in our country. In this hackathon, we challenge the data science community to predict the price of used electronic devices based on certain factors.

Given are 6 distinguishing factors that can influence the price of a used device. Your objective as a data scientist is to build a machine learning model that can predict the price of used electronic devices based on the given factors.

Data Description:-

The unzipped folder will have the following files.

Train.csv –  2326 observations.

Test.csv –  997 observations.

Target Variable: Price

Evaluation:-

The leaderboard is evaluated using RMSLE for the participant’s submission.

For more info and data set visit: https://www.machinehack.com/course/used-electronics-price-prediction-weekend-hackathon-7/

## Import Libraries 

In [1]:
# import necessary libraries 

%matplotlib inline
import warnings 
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sqlite3 
import nltk
import string
import re
import pickle

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

from collections import Counter
from tqdm import tqdm
import os

from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

!pip install autocorrect
from autocorrect import Speller

Collecting autocorrect
[?25l  Downloading https://files.pythonhosted.org/packages/aa/5b/6510d8370201fc96cbb773232c2362079389ed3285b0b1c6a297ef6eadc0/autocorrect-2.0.0.tar.gz (1.8MB)
[K     |████████████████████████████████| 1.8MB 1.3MB/s 
[?25hBuilding wheels for collected packages: autocorrect
  Building wheel for autocorrect (setup.py) ... [?25l[?25hdone
  Created wheel for autocorrect: filename=autocorrect-2.0.0-cp36-none-any.whl size=1811641 sha256=2cdb95aad00dace91874207c3c04d2ece9e98513e5858a681755fdd978b2305b
  Stored in directory: /root/.cache/pip/wheels/0b/06/bc/e66f28d72bed29591eadc79cebb2e7964ad0282804ab233da3
Successfully built autocorrect
Installing collected packages: autocorrect
Successfully installed autocorrect-2.0.0


In [None]:
# mount google drive 

from google.colab import drive 
drive.mount('/content/gdrive')

## Load Data into Pandas DataFrame 

In [22]:
# reading data from google drive 

mypath = '/content/gdrive/My Drive/MachineHack/Used Electronics Price Prediction/'
train_data = pd.read_csv(mypath + 'Train.csv')
test_data = pd.read_csv(mypath + 'Test.csv')

In [4]:
train_data.head(10)

Unnamed: 0,Brand,Model_Info,Additional_Description,Locality,City,State,Price
0,1,name0 name234 64gb space grey,1yesr old mobile number 999two905two99 bill c...,878,8,2,15000
1,1,phone 7 name42 name453 new condition box acce...,101004800 1010065900 7000,1081,4,0,18800
2,1,name0 x 256gb leess used good condition,1010010000 seperate screen guard 3 back cover...,495,11,4,50000
3,1,name0 6s plus 64 gb space grey,without 1010020100 id 1010010300 colour 10100...,287,10,7,16500
4,1,phone 7 sealed pack brand new factory outet p...,101008700 10100000 xs max 64 gb made 10100850...,342,4,0,26499
5,1,name0 6 name1694 128gb clean condition,looks 1010035500 101008700 10100000 8 plus 64...,503,15,5,13800
6,1,name87 watch name251 3 38 mm gps name119 name...,one 101009200 3 perfect working condition def...,940,8,2,17000
7,2,name271 name1622 note 3gb ram 32gb inbuilt,10100000 6 101009200 16 gb good condition lig...,651,2,6,5000
8,1,iphone 732gbcondition new,10100000 7 32gb 10100248300 condition unused ...,42,2,6,21000
9,1,name0 7 128 gb,1010011400 101006100 101006200 available acce...,133,1,3,40000


In [5]:
test_data.head(10)

Unnamed: 0,Brand,Model_Info,Additional_Description,Locality,City,State
0,1,name0 55s66s66s778xxsxsmax etc,good condition 11months old single scratch we...,570,11,4
1,1,slightly used excellent condition name0 5 sale,101008700 1010030600 1010034300 10100192200 1...,762,8,2
2,1,name0 sx ios12 top letast model bill call,1010017300 delivery,60,13,5
3,1,name87 name0 x 64gb going lowest 41900,phone 1010023400 64 gb excellent condition sale,640,15,5
4,1,name0 5s proper condition one handedly used,full kit available 10100248300 condition 4gb ...,816,2,6
5,1,name0 7 plus 128 gb name75 gold,101006600 galaxy advance hai ok ram 512 call,552,13,5
6,1,brand new rosegold name87 iphone name234 64gb...,office gurgaon karol bagh new 101008700 10100...,389,8,2
7,1,name0 se 32gb,101008700 iphone 4s new brand refurbished pac...,926,8,2
8,1,name0 6s name753,101006200 bill good battery backup good looking,850,2,6
9,1,apple phone 8 offer price,brand new 101006600 galaxy s10 plus box best ...,404,2,6


In [6]:
# shape of the data tables 

print(train_data.shape)
print(test_data.shape)

(2326, 7)
(997, 6)


## Check for Null values 

In [7]:
train_data.isnull().sum()

Brand                     0
Model_Info                0
Additional_Description    0
Locality                  0
City                      0
State                     0
Price                     0
dtype: int64

In [8]:
test_data.isnull().sum()

Brand                     0
Model_Info                0
Additional_Description    0
Locality                  0
City                      0
State                     0
dtype: int64

## Train-Validation split

In [23]:
y_train = train_data['Price']
X_train = train_data.drop('Price', axis = 1)
X_test = test_data # a seperate test data is already provided 

from sklearn.model_selection import train_test_split
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size = 0.35, random_state = 42)
print(X_train.shape, y_train.shape)
print(X_cv.shape, y_cv.shape)
print(X_test.shape)

(1511, 6) (1511,)
(815, 6) (815,)
(997, 6)


## Data cleaning and Pre-processing

In [10]:
X_train['Brand'].value_counts()

1    1353
2      75
0      58
3      25
Name: Brand, dtype: int64

In [11]:
X_train['Locality'].value_counts()

193    34
640    34
534    24
328    18
132    17
       ..
706     1
704     1
703     1
699     1
568     1
Name: Locality, Length: 753, dtype: int64

In [12]:
print(X_train['City'].value_counts())
print(X_train['City'].value_counts().shape)

15    213
2     213
8     176
11    175
0     169
4     168
13    136
1     133
10    108
17     14
12      4
16      1
7       1
Name: City, dtype: int64
(13,)


In [13]:
X_train['State'].value_counts()

5    349
6    213
2    176
4    175
1    170
0    168
3    133
7    126
8      1
Name: State, dtype: int64

In [24]:
# check some random 'Model_info' rows

for i in [1, 12, 45, 87, 123, 264, 387, 444, 545, 669, 729, 812, 901, 1021, 1103, 1234, 1231, 1422, 1500]:
    print(X_train['Model_Info'].iloc[i])

 name87 name0 8 64gb black black name1588 name239
 name0 6 64gb
 name87 name0 excellent performance latest models name1202 name85 c
 name66 galaxy name108 30 name242 name243 11 months name114
 name87 name0 7 name92
 name0 10 256gb space gray pristine condition
 name87 iphone 6s 16 gb mint condition
 name0 6s 32 gigs storage
 name0 6 32gb pakka condition
 name87 8 64gb gold may 2020 warranty available
 name87 name0 8 name103 64gb
 name54 name120 name66 galaxy name578 1 year name114
 name54 2month pic name0 6s 32gb name1406 name61
 new box packed iphone name234 64gb brand new
 name0 x original new
 name87 iphone 7 plus 128gb
 used xs 64gb bought name734 bill
 phone 7plus 128gb
 name0 7 128gb


Some observations: 

name0 stands for iphone

In [25]:
# function to correct spellings (english)
def spell_check(sentence):
  ''' This function is used to correct the words in the text'''
  word_list = []
  for word in sentence.strip().split():
      word = spell(word) # correct each word in a sentence
      word_list.append(word)
        
  return " ".join(word_list) # return corrected sentence

In [26]:
# applying spell corrector on the Model_Info feature 
spell = Speller(lang = 'en')

preprocessed_model_info = []
for sent in tqdm(X_train['Model_Info'].values):
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = spell_check(sent)
    preprocessed_model_info.append(sent.strip())

X_train['preprocessed_model_info'] = preprocessed_model_info

100%|██████████| 1511/1511 [00:22<00:00, 66.18it/s]


In [27]:
# as per our observation we replace the phone (iphone after spell correction becomes phone) with name 
X_train['preprocessed_model_info'] = X_train['preprocessed_model_info'].str.replace('phone', 'name0')

In [28]:
# after above preprocessing steps - 
for i in [1, 12, 45, 87, 123, 264, 387, 444, 545, 669, 729, 812, 901, 1021, 1103, 1234, 1231, 1422, 1500]:
    print(X_train['preprocessed_model_info'].iloc[i])

name87 name0 8 64gb black black name1588 name239
name0 6 64gb
name87 name0 excellent performance latest models name1202 name85 c
name66 galaxy name108 30 name242 name243 11 months name114
name87 name0 7 name92
name0 10 256gb space gray pristine condition
name87 name0 6s 16 gb mint condition
name0 6s 32 gigs storage
name0 6 32gb pakka condition
name87 8 64gb gold may 2020 warranty available
name87 name0 8 name103 64gb
name54 name120 name66 galaxy name578 1 year name114
name54 2month pic name0 6s 32gb name1406 name61
new box packed name0 name234 64gb brand new
name0 x original new
name87 name0 7 plus 128gb
used xs 64gb bought name734 bill
name0 7plus 128gb
name0 7 128gb


In [29]:
# apply above preprocessing on the validation and test datsets

preprocessed_model_info = []

for sent in tqdm(X_cv['Model_Info'].values):
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = spell_check(sent)
    preprocessed_model_info.append(sent.strip())

X_cv['preprocessed_model_info'] = preprocessed_model_info

100%|██████████| 815/815 [00:11<00:00, 69.31it/s]


In [30]:
X_cv['preprocessed_model_info'] = X_cv['preprocessed_model_info'].str.replace('phone', 'name0')

In [31]:
preprocessed_model_info = []

for sent in tqdm(X_test['Model_Info'].values):
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = spell_check(sent)
    preprocessed_model_info.append(sent.strip())

X_test['preprocessed_model_info'] = preprocessed_model_info

100%|██████████| 997/997 [00:05<00:00, 169.28it/s]


In [32]:
X_test['preprocessed_model_info'] = X_test['preprocessed_model_info'].str.replace('phone', 'name0')

In [33]:
# check some random samples of feature 'Additional_Description'

for i in [1, 12, 45, 87, 123, 264, 387, 444, 545, 669, 729, 812, 901, 1021, 1103, 1234, 1231, 1422, 1500]:
    print(X_train['Additional_Description'].iloc[i])

 looks brand new 101006800 10100179200 101006900 12gb ram 256gb 101001500 instant 1010010200 ur used 10100102300 please click see profile check devices 1010011300 available 1010095300 card accepted pavit 1010023900 17 10100157400 road opp nike showroom near karachi sweets woodland store camp pune 411001
 purchased 101004000 2018 1010099900 phone fantastic fabulous condition doesnt single scratch didnt even peel plastic thin cover phone back side till u observe 10100121400 101008900 101005100 101006100 charger clear view sensored flip cover avilable bargain n 1010011300 plz cause really know condition 101005100 10100171400 1010079100 101006600 sensored flipcover costs 3500 rupees 10100219100 till didnt even play single game note used professionally thank 10100200 10100200 reach 10100219200
 10100000 6 16gb phone charger erd 100 condition 1010045600 sall
 perfect condition scratches
 6gb 101004600 64gb memory 1010074600 101004200 2 months used excellent condition negotiable
 10100000 10 

In [34]:
# apply the preprocessing steps on this feature just like the previous feature 
preprocessed_add_desc = []

for sent in tqdm(X_train['Additional_Description'].values):
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = " ".join(filter(lambda x:x[:5]!='10100', sent.split())) # removes all encoded words starting with '10100'
    sent = spell_check(sent)
    preprocessed_add_desc.append(sent.strip())

X_train['preprocessed_add_desc'] = preprocessed_add_desc

100%|██████████| 1511/1511 [02:02<00:00, 12.29it/s]


In [35]:
# after preprocessing 
for i in [1, 12, 45, 87, 123, 264, 387, 444, 545, 669, 729, 812, 901, 1021, 1103, 1234, 1231, 1422, 1500]:
    print(X_train['preprocessed_add_desc'].iloc[i])

looks brand new 12gb ram 256gb instant ur used please click see profile check devices available card accepted spavit 17 road opp nike showroom near karachi sweets woodland store camp pure 411001
purchased 2018 phone fantastic fabulous condition doesnt single scratch didnt even peel plastic thin cover phone back side till u observe charger clear view censored flip cover available bargain n ply cause really know condition censored slipcover costs 3500 rupees till didnt even play single game note used professionally thank reach
6 16gb phone charger erd 100 condition sall
perfect condition scratches
6gb 64gb memory 2 months used excellent condition negotiable
10 256gb space gray box factory accessories mint condition always used case exchange giveaway marble case along phone worth 10000 rupees serious buyers contact price slightly negotiable
galaxy runs 90 55inch thd 2560x1440p pixel display latest variant mobile galaxy phone powered 23 ghz outscore 8890 processor 4gb ram galaxy 12 primary

In [36]:
# apply above preprocessing steps to additional_description in cv and test dataset as well

preprocessed_add_desc = []

for sent in tqdm(X_cv['Additional_Description'].values):
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = " ".join(filter(lambda x:x[:5]!='10100', sent.split()))
    sent = spell_check(sent)
    preprocessed_add_desc.append(sent.strip())

X_cv['preprocessed_add_desc'] = preprocessed_add_desc

100%|██████████| 815/815 [00:57<00:00, 14.25it/s]


In [37]:
preprocessed_add_desc = []

for sent in tqdm(X_test['Additional_Description'].values):
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = " ".join(filter(lambda x:x[:5]!='10100', sent.split()))
    sent = spell_check(sent)
    preprocessed_add_desc.append(sent.strip())

X_test['preprocessed_add_desc'] = preprocessed_add_desc

100%|██████████| 997/997 [00:58<00:00, 16.92it/s]


## Text Feature encoding 

#### One hot encoding categorical variables

In [38]:
X_train.columns

Index(['Brand', 'Model_Info', 'Additional_Description', 'Locality', 'City',
       'State', 'preprocessed_model_info', 'preprocessed_add_desc'],
      dtype='object')

In [39]:
# one-hot encoding 'Brand'

encoder = OneHotEncoder()
encoder.fit(X_train['Brand'].values.reshape(-1, 1))

X_train_brand = encoder.transform(X_train['Brand'].values.reshape(-1, 1))
X_cv_brand = encoder.transform(X_cv['Brand'].values.reshape(-1, 1))
X_test_brand = encoder.transform(X_test['Brand'].values.reshape(-1, 1))

print("After encoding")
print(X_train_brand.shape, y_train.shape)
print(X_cv_brand.shape, y_cv.shape)
print(X_test_brand.shape)

After encoding
(1511, 4) (1511,)
(815, 4) (815,)
(997, 4)


In [40]:
# One-hot encoding 'Locality'

encoder = OneHotEncoder(handle_unknown = 'ignore')
encoder.fit(X_train['Locality'].values.reshape(-1, 1))

X_train_locality = encoder.transform(X_train['Locality'].values.reshape(-1, 1))
X_cv_locality = encoder.transform(X_cv['Locality'].values.reshape(-1, 1))
X_test_locality = encoder.transform(X_test['Locality'].values.reshape(-1, 1))

print("After encoding")
print(X_train_locality.shape, y_train.shape)
print(X_cv_locality.shape, y_cv.shape)
print(X_test_locality.shape)

After encoding
(1511, 753) (1511,)
(815, 753) (815,)
(997, 753)


In [41]:
# 'City'

encoder = OneHotEncoder(handle_unknown = 'ignore')
encoder.fit(X_train['City'].values.reshape(-1, 1))

X_train_city = encoder.transform(X_train['City'].values.reshape(-1, 1))
X_cv_city = encoder.transform(X_cv['City'].values.reshape(-1, 1))
X_test_city = encoder.transform(X_test['City'].values.reshape(-1, 1))

print('- '*50)
print("After encoding")
print(X_train_city.shape, y_train.shape)
print(X_cv_city.shape, y_cv.shape)
print(X_test_city.shape)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After encoding
(1511, 13) (1511,)
(815, 13) (815,)
(997, 13)


In [42]:
# 'State'

encoder = OneHotEncoder(handle_unknown = 'ignore')
encoder.fit(X_train['State'].values.reshape(-1, 1))

X_train_state = encoder.transform(X_train['State'].values.reshape(-1, 1))
X_cv_state = encoder.transform(X_cv['State'].values.reshape(-1, 1))
X_test_state = encoder.transform(X_test['State'].values.reshape(-1, 1))

print('- '*50)
print("After encoding")
print(X_train_state.shape, y_train.shape)
print(X_cv_state.shape, y_cv.shape)
print(X_test_state.shape)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After encoding
(1511, 9) (1511,)
(815, 9) (815,)
(997, 9)


In [43]:
# 'preprocessed_model_info' BOW encoding

vectorizer = CountVectorizer(min_df = 3, ngram_range = (1,4))
vectorizer.fit(X_train['preprocessed_model_info'].values) # fit has to happen only on train data
print(vectorizer.get_feature_names()[:10]) 
print(vectorizer.get_feature_names()[-10:])

# we use the fitted CountVectorizer to convert the text to vector
X_train_model_info_bow = vectorizer.transform(X_train['preprocessed_model_info'].values)
X_cv_model_info_bow = vectorizer.transform(X_cv['preprocessed_model_info'].values)
X_test_model_info_bow = vectorizer.transform(X_test['preprocessed_model_info'].values)

print('- '*50)
print("After vectorization")
print(X_train_model_info_bow.shape, y_train.shape)
print(X_cv_model_info_bow.shape, y_cv.shape)
print(X_test_model_info_bow.shape)

['10', '100', '11', '11 months', '11 name49', '11 pro', '11 pro max', '12', '128', '128 gb']
['xs name229 256 gb', 'xs name229 256gb', 'xs name229 64gb', 'year', 'year apple', 'year old', 'year warranty', 'years', 'years old', 'zu']
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After vectorization
(1511, 999) (1511,)
(815, 999) (815,)
(997, 999)


In [44]:
# 'preprocessed_model_info' TFIDF encoding

vectorizer = TfidfVectorizer(min_df = 3, ngram_range = (1,4))
vectorizer.fit(X_train['preprocessed_model_info'].values) # fit has to happen only on train data
print(vectorizer.get_feature_names()[:10]) 
print(vectorizer.get_feature_names()[-10:])

# we use the fitted CountVectorizer to convert the text to vector
X_train_model_info_tfidf = vectorizer.transform(X_train['preprocessed_model_info'].values)
X_cv_model_info_tfidf = vectorizer.transform(X_cv['preprocessed_model_info'].values)
X_test_model_info_tfidf = vectorizer.transform(X_test['preprocessed_model_info'].values)

print('- '*50)
print("After vectorization")
print(X_train_model_info_tfidf.shape, y_train.shape)
print(X_cv_model_info_tfidf.shape, y_cv.shape)
print(X_test_model_info_tfidf.shape)

['10', '100', '11', '11 months', '11 name49', '11 pro', '11 pro max', '12', '128', '128 gb']
['xs name229 256 gb', 'xs name229 256gb', 'xs name229 64gb', 'year', 'year apple', 'year old', 'year warranty', 'years', 'years old', 'zu']
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After vectorization
(1511, 999) (1511,)
(815, 999) (815,)
(997, 999)


In [45]:
# 'preprocessed_add_desc' BOW encoding

vectorizer = CountVectorizer(min_df = 5, ngram_range = (1,4))
vectorizer.fit(X_train['preprocessed_add_desc'].values) # fit has to happen only on train data
print(vectorizer.get_feature_names()[:10]) 
print(vectorizer.get_feature_names()[-10:])

# we use the fitted CountVectorizer to convert the text to vector
X_train_add_desc_bow = vectorizer.transform(X_train['preprocessed_add_desc'].values)
X_cv_add_desc_bow = vectorizer.transform(X_cv['preprocessed_add_desc'].values)
X_test_add_desc_bow = vectorizer.transform(X_test['preprocessed_add_desc'].values)

print('- '*50)
print("After vectorization")
print(X_train_add_desc_bow.shape, y_train.shape)
print(X_cv_add_desc_bow.shape, y_cv.shape)
print(X_test_add_desc_bow.shape)

['01', '01 call', '01 call 8oo', '01 call 8oo 8oo', '01 call us', '01 call us 8oo', '10', '10 month', '10 months', '100']
['year old', 'year used', 'year warranty', 'years', 'years old', 'yes', 'yet', 'yet used', 'yet used interested', 'yet used interested call']
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After vectorization
(1511, 1672) (1511,)
(815, 1672) (815,)
(997, 1672)


In [46]:
# 'preprocessed_add_desc' TFIDF encoding

vectorizer = TfidfVectorizer(min_df = 5, ngram_range = (1,4))
vectorizer.fit(X_train['preprocessed_add_desc'].values) # fit has to happen only on train data
print(vectorizer.get_feature_names()[:10]) 
print(vectorizer.get_feature_names()[-10:])

# we use the fitted CountVectorizer to convert the text to vector
X_train_add_desc_tfidf = vectorizer.transform(X_train['preprocessed_add_desc'].values)
X_cv_add_desc_tfidf = vectorizer.transform(X_cv['preprocessed_add_desc'].values)
X_test_add_desc_tfidf = vectorizer.transform(X_test['preprocessed_add_desc'].values)

print('- '*50)
print("After vectorization")
print(X_train_add_desc_tfidf.shape, y_train.shape)
print(X_cv_add_desc_tfidf.shape, y_cv.shape)
print(X_test_add_desc_tfidf.shape)

['01', '01 call', '01 call 8oo', '01 call 8oo 8oo', '01 call us', '01 call us 8oo', '10', '10 month', '10 months', '100']
['year old', 'year used', 'year warranty', 'years', 'years old', 'yes', 'yet', 'yet used', 'yet used interested', 'yet used interested call']
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After vectorization
(1511, 1672) (1511,)
(815, 1672) (815,)
(997, 1672)


## Stack all the features together

In [47]:
# bow encoded text features 
# drop locality from the list due to very high variation 

from scipy.sparse import hstack

X_train_bow = hstack((X_train_brand, X_train_city, X_train_state, 
                      X_train_model_info_bow, X_train_add_desc_bow))

X_cv_bow = hstack((X_cv_brand, X_cv_city, X_cv_state, 
                      X_cv_model_info_bow, X_cv_add_desc_bow))
                     
X_test_bow = hstack((X_test_brand, X_test_city, X_test_state, 
                      X_test_model_info_bow, X_test_add_desc_bow))

print(X_train_bow.shape, y_train.shape)
print(X_cv_bow.shape, y_cv.shape)
print(X_test_bow.shape)

(1511, 2697) (1511,)
(815, 2697) (815,)
(997, 2697)


In [48]:
# Tfidf encoded text features 
# drop locality from the list due to very high variation 

X_train_tfidf = hstack((X_train_brand, X_train_city, X_train_state, 
                      X_train_model_info_tfidf, X_train_add_desc_tfidf))

X_cv_tfidf = hstack((X_cv_brand, X_cv_city, X_cv_state, 
                      X_cv_model_info_tfidf, X_cv_add_desc_tfidf))

X_test_tfidf = hstack((X_test_brand, X_test_city, X_test_state, 
                      X_test_model_info_tfidf, X_test_add_desc_tfidf))

print(X_train_tfidf.shape, y_train.shape)
print(X_cv_tfidf.shape, y_cv.shape)
print(X_test_tfidf.shape)

(1511, 2697) (1511,)
(815, 2697) (815,)
(997, 2697)


## Training ML Models 

### 1. SGD Regressor

In [49]:
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_log_error 

for i in [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    sgd = SGDRegressor(alpha = i, random_state = 42)
    sgd.fit(X_train_bow, y_train)
    y_pred_train = sgd.predict(X_train_bow)
    y_pred_cv = sgd.predict(X_cv_bow)

    y_pred_train = [0 if m < 0 else m for m in y_pred_train]
    y_pred_cv = [0 if n < 0 else n for n in y_pred_cv]

    train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
    cv_loss = np.sqrt(mean_squared_log_error(y_cv, y_pred_cv))
    print('alpha :', i, 'train loss:', train_loss, 'cv loss:', cv_loss)

alpha : 1e-05 train loss: 0.6814291305941317 cv loss: 1.500500577302415
alpha : 0.0001 train loss: 0.6814585607444258 cv loss: 1.4977174104735234
alpha : 0.001 train loss: 0.6556896185061114 cv loss: 1.424707822247343
alpha : 0.01 train loss: 0.7598247191164191 cv loss: 0.9411936278636065
alpha : 0.1 train loss: 0.6320306570258495 cv loss: 0.650844332229113
alpha : 1 train loss: 0.6890591154875175 cv loss: 0.7096164066517379
alpha : 10 train loss: 0.8232061541620261 cv loss: 0.8040885639776103
alpha : 100 train loss: 0.8795602886800882 cv loss: 0.8526524984791323
alpha : 1000 train loss: 0.8808725277015917 cv loss: 0.8537126667687314


best alpha = 0.1  Validation RMSLE = 0.651

### 2. Linear SVM Regressor

In [50]:
from sklearn.svm import SVR
 

for i in [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    svr = SVR(kernel = 'linear', C = i)
    svr.fit(X_train_bow, y_train)
    y_pred_train = svr.predict(X_train_bow)
    y_pred_cv = svr.predict(X_cv_bow)

    y_pred_train = [0 if m < 0 else m for m in y_pred_train]
    y_pred_cv = [0 if n < 0 else n for n in y_pred_cv]

    train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
    cv_loss = np.sqrt(mean_squared_log_error(y_cv, y_pred_cv))
    print('alpha :', i, 'train loss:', train_loss, 'cv loss:', cv_loss)

alpha : 1e-05 train loss: 0.8453922283191411 cv loss: 0.8233353815221391
alpha : 0.0001 train loss: 0.8453914352563741 cv loss: 0.8233347704287839
alpha : 0.001 train loss: 0.8453835202305863 cv loss: 0.8233285960959753
alpha : 0.01 train loss: 0.8453047137975856 cv loss: 0.8232656965040939
alpha : 0.1 train loss: 0.8445173842243038 cv loss: 0.822637655846669
alpha : 1 train loss: 0.8367231406453428 cv loss: 0.8165256441253097
alpha : 10 train loss: 0.7668793945752427 cv loss: 0.7630296726169268
alpha : 100 train loss: 0.5390641554149683 cv loss: 0.6042214022292979
alpha : 1000 train loss: 0.41014978175608685 cv loss: 0.7144983695519462


best alpha = 100     Validation RMSLE = 0.604

### 3. XGB Regressor

> Note: Taking into consideration the time left to complete hackathon and current availibility of computational resources, I decided to go for manual hyperparameter tuning instead of GridSearchCV/RandomizedSearchCV as we normally do. 

In [52]:
# Fine-tuning the number of estimators 
from xgboost.sklearn import XGBRegressor
 
for i in [10, 50, 100, 150, 250, 500, 700]:
    xgb = XGBRegressor(objective ='reg:squarederror', n_estimators = i, random_state = 42)
    xgb.fit(X_train_bow, y_train)
    y_pred_train = xgb.predict(X_train_bow)
    y_pred_cv = xgb.predict(X_cv_bow)

    y_pred_train = [0 if m < 0 else m for m in y_pred_train]
    y_pred_cv = [0 if n < 0 else n for n in y_pred_cv]

    train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
    cv_loss = np.sqrt(mean_squared_log_error(y_cv, y_pred_cv))
    print('n_estimators:', i,'train loss:', train_loss, 'cv loss:', cv_loss)

n_estimators: 10 train loss: 0.698583207432532 cv loss: 0.7080222810021619
n_estimators: 50 train loss: 0.5797496814082188 cv loss: 0.6011568418210969
n_estimators: 100 train loss: 0.5271889510250836 cv loss: 0.5731493803574109
n_estimators: 150 train loss: 0.4939911927973297 cv loss: 0.5607844060361328
n_estimators: 250 train loss: 0.4522854888169123 cv loss: 0.5524689655780366
n_estimators: 500 train loss: 0.391517039729832 cv loss: 0.5468075085169569
n_estimators: 700 train loss: 0.3602038388497608 cv loss: 0.619415959593049


n_estimators = 500 Validation loss = 0.547

Next, with n_estimators = 500 we fine-tune the max_depth 

In [54]:
# fine tuning max-depth 

for i in [2, 3, 5, 7]:
    xgb = XGBRegressor(n_estimators = 500, objective ='reg:squarederror', max_depth = i, random_state = 42)
    xgb.fit(X_train_bow, y_train)
    y_pred_train = xgb.predict(X_train_bow)
    y_pred_cv = xgb.predict(X_cv_bow)

    y_pred_train = [0 if m < 0 else m for m in y_pred_train]
    y_pred_cv = [0 if n < 0 else n for n in y_pred_cv]

    train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
    cv_loss = np.sqrt(mean_squared_log_error(y_cv, y_pred_cv))
    print('max_depth:', i,'train loss:', train_loss, 'cv loss:', cv_loss)

max_depth: 2 train loss: 0.45320159362942775 cv loss: 0.5591004216423104
max_depth: 3 train loss: 0.391517039729832 cv loss: 0.5468075085169569
max_depth: 5 train loss: 0.30766295348246187 cv loss: 0.6196092133291659
max_depth: 7 train loss: 0.2471809523596218 cv loss: 0.6313522162930615


Thus, optimal max_depth = 3

Next, with n_estimator = 500 and max_depth = 3, we fine-tune 'colsample_bytree'

In [57]:
# fine tuning colsample_bytree

for i in [0.3, 0.4, 0.6, 0.8, 1]:
    xgb = XGBRegressor(colsample_bytree = i, n_estimators = 250, objective ='reg:squarederror', max_depth = 3, random_state = 42)
    xgb.fit(X_train_bow, y_train)
    y_pred_train = xgb.predict(X_train_bow)
    y_pred_cv = xgb.predict(X_cv_bow)

    y_pred_train = [0 if m < 0 else m for m in y_pred_train]
    y_pred_cv = [0 if n < 0 else n for n in y_pred_cv]

    train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
    cv_loss = np.sqrt(mean_squared_log_error(y_cv, y_pred_cv))
    print('col_sample_by_tree:', i,'train loss:', train_loss, 'cv loss:', cv_loss)

col_sample_by_tree: 0.3 train loss: 0.5573379952933728 cv loss: 0.613271276901688
col_sample_by_tree: 0.4 train loss: 0.4579210376138219 cv loss: 0.5519761605544066
col_sample_by_tree: 0.6 train loss: 0.456568406143203 cv loss: 0.5497182364692842
col_sample_by_tree: 0.8 train loss: 0.45342562215978754 cv loss: 0.5486401142401411
col_sample_by_tree: 1 train loss: 0.4522854888169123 cv loss: 0.5524689655780366


colsample_by_tree: 0.8

Check again for n_estimators using colsample_bytree = 0.8, max_depth = 3 to see if we get better performance 

In [58]:
from xgboost.sklearn import XGBRegressor
 
for i in [10, 50, 100, 150, 250, 500, 700]:
    xgb = XGBRegressor(colsample_bytree = 0.4,  n_estimators = i, objective ='reg:squarederror', max_depth = 3, random_state = 42)
    xgb.fit(X_train_bow, y_train)
    y_pred_train = xgb.predict(X_train_bow)
    y_pred_cv = xgb.predict(X_cv_bow)

    y_pred_train = [0 if m < 0 else m for m in y_pred_train]
    y_pred_cv = [0 if n < 0 else n for n in y_pred_cv]

    train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
    cv_loss = np.sqrt(mean_squared_log_error(y_cv, y_pred_cv))
    print('n_estimator:', i,'train loss:', train_loss, 'cv loss:', cv_loss)

n_estimator: 10 train loss: 0.6839034145304946 cv loss: 0.6860601178529161
n_estimator: 50 train loss: 0.5808605839541485 cv loss: 0.5999064585058881
n_estimator: 100 train loss: 0.5291280072897985 cv loss: 0.5710494904432875
n_estimator: 150 train loss: 0.49768520249697024 cv loss: 0.5612920783880736
n_estimator: 250 train loss: 0.4579210376138219 cv loss: 0.5519761605544066
n_estimator: 500 train loss: 0.40217276580720845 cv loss: 0.6066409769472102
n_estimator: 700 train loss: 0.3740074831951074 cv loss: 0.6165513219908468


Thus, n_estimators = 250 gives best score on validation data

with n_estimators as 250, colsample_bytree = 0.8, vary max depth and check for the performance 

In [60]:
# again tuning max-depth using n_estimators = 250 and colsample_bytree = 0.8
 
for i in [2, 3, 5, 7]:
    xgb = XGBRegressor(colsample_bytree = 0.8,  n_estimators = 250, objective ='reg:squarederror', max_depth = i, random_state = 42)
    xgb.fit(X_train_bow, y_train)
    y_pred_train = xgb.predict(X_train_bow)
    y_pred_cv = xgb.predict(X_cv_bow)

    y_pred_train = [0 if m < 0 else m for m in y_pred_train]
    y_pred_cv = [0 if n < 0 else n for n in y_pred_cv]

    train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
    cv_loss = np.sqrt(mean_squared_log_error(y_cv, y_pred_cv))
    print('max_depth:', i,'train loss:', train_loss, 'cv loss:', cv_loss)

max_depth: 2 train loss: 0.5019919826754776 cv loss: 0.5627084186481026
max_depth: 3 train loss: 0.45342562215978754 cv loss: 0.5486401142401411
max_depth: 5 train loss: 0.3780682012880054 cv loss: 0.5429486756748865
max_depth: 7 train loss: 0.3273621713892148 cv loss: 0.5523136647475682


Thus, now we select max_depth = 5


Next, with the above found hyperparameters, check the performance with Tf-IDF encoded text features 

In [63]:
xgb = XGBRegressor(colsample_bytree = 0.8,  n_estimators = 250, objective ='reg:squarederror', max_depth = 5, random_state = 42)
xgb.fit(X_train_tfidf, y_train)
y_pred_train = xgb.predict(X_train_tfidf)
y_pred_cv = xgb.predict(X_cv_tfidf)

y_pred_train = [0 if m < 0 else m for m in y_pred_train]
y_pred_cv = [0 if n < 0 else n for n in y_pred_cv]

train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
cv_loss = np.sqrt(mean_squared_log_error(y_cv, y_pred_cv))
print('train loss:', train_loss, 'cv loss:', cv_loss)

train loss: 0.36958254468413926 cv loss: 0.5552682965937445


Final optimal hyperparameters we got upon fine-tuning:

* n_estimators = 250
* max_depth = 5
* colsample_bytree = 0.8
* subsample = 1 (default)

Validation RMSLE = 0.543


Now, we'll train the model on the entire Train data to make the predictions on the Test data

### Reading Data 

In [64]:
# reading data from google drive 

mypath = '/content/gdrive/My Drive/MachineHack/Used Electronics Price Prediction/'
train_data = pd.read_csv(mypath + 'Train.csv')
test_data = pd.read_csv(mypath + 'Test.csv')

In [65]:
y_train = train_data['Price']
X_train = train_data.drop('Price', axis = 1)
X_test = test_data

print(X_train.shape, y_train.shape)
print(X_test.shape)

(2326, 6) (2326,)
(997, 6)


### Data cleaning and Preprocessing 

In [66]:
spell = Speller(lang='en')

preprocessed_model_info = []

for sent in tqdm(X_train['Model_Info'].values):
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = spell_check(sent)
    preprocessed_model_info.append(sent.strip())

X_train['preprocessed_model_info'] = preprocessed_model_info

100%|██████████| 2326/2326 [00:33<00:00, 69.93it/s]


In [67]:
X_train['preprocessed_model_info'] = X_train['preprocessed_model_info'].str.replace('phone', 'name0')

In [68]:
preprocessed_model_info = []

for sent in tqdm(X_test['Model_Info'].values):
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = spell_check(sent)
    preprocessed_model_info.append(sent.strip())

X_test['preprocessed_model_info'] = preprocessed_model_info

100%|██████████| 997/997 [00:05<00:00, 175.13it/s]


In [69]:
X_test['preprocessed_model_info'] = X_test['preprocessed_model_info'].str.replace('phone', 'name0')

In [70]:
preprocessed_add_desc = []

for sent in tqdm(X_train['Additional_Description'].values):
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = " ".join(filter(lambda x:x[:5]!='10100', sent.split()))
    sent = spell_check(sent)
    preprocessed_add_desc.append(sent.strip())

X_train['preprocessed_add_desc'] = preprocessed_add_desc

100%|██████████| 2326/2326 [02:57<00:00, 13.09it/s]


In [72]:
preprocessed_add_desc = []

for sent in tqdm(X_test['Additional_Description'].values):
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    sent = " ".join(filter(lambda x:x[:5]!='10100', sent.split()))
    sent = spell_check(sent)
    preprocessed_add_desc.append(sent.strip())

X_test['preprocessed_add_desc'] = preprocessed_add_desc

100%|██████████| 997/997 [00:59<00:00, 16.62it/s]


In [73]:
# brand

encoder = OneHotEncoder()
encoder.fit(X_train['Brand'].values.reshape(-1, 1))

X_train_brand = encoder.transform(X_train['Brand'].values.reshape(-1, 1))
X_test_brand = encoder.transform(X_test['Brand'].values.reshape(-1, 1))

print('- '*50)
print("After encoding")
print(X_train_brand.shape, y_train.shape)
print(X_test_brand.shape)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After encoding
(2326, 4) (2326,)
(997, 4)


In [74]:
# locality

encoder = OneHotEncoder(handle_unknown = 'ignore')
encoder.fit(X_train['Locality'].values.reshape(-1, 1))

X_train_locality = encoder.transform(X_train['Locality'].values.reshape(-1, 1))
X_test_locality = encoder.transform(X_test['Locality'].values.reshape(-1, 1))

print('- '*50)
print("After encoding")
print(X_train_locality.shape, y_train.shape)
print(X_test_locality.shape)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After encoding
(2326, 970) (2326,)
(997, 970)


In [75]:
# City

encoder = OneHotEncoder(handle_unknown = 'ignore')
encoder.fit(X_train['City'].values.reshape(-1, 1))

X_train_city = encoder.transform(X_train['City'].values.reshape(-1, 1))
X_test_city = encoder.transform(X_test['City'].values.reshape(-1, 1))

print('- '*50)
print("After encoding")
print(X_train_city.shape, y_train.shape)
print(X_test_city.shape)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After encoding
(2326, 16) (2326,)
(997, 16)


In [76]:
# State

encoder = OneHotEncoder(handle_unknown = 'ignore')
encoder.fit(X_train['State'].values.reshape(-1, 1))

X_train_state = encoder.transform(X_train['State'].values.reshape(-1, 1))
X_test_state = encoder.transform(X_test['State'].values.reshape(-1, 1))

print('- '*50)
print("After encoding")
print(X_train_state.shape, y_train.shape)
print(X_test_state.shape)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After encoding
(2326, 9) (2326,)
(997, 9)


In [79]:
# preprocessed_model_info BOW encoding

vectorizer = CountVectorizer(min_df = 3, ngram_range = (1,4))
vectorizer.fit(X_train['preprocessed_model_info'].values) # fit has to happen only on train data
print(vectorizer.get_feature_names()[:10]) 
print(vectorizer.get_feature_names()[-10:])

# we use the fitted CountVectorizer to convert the text to vector
X_train_model_info_bow = vectorizer.transform(X_train['preprocessed_model_info'].values)
X_test_model_info_bow = vectorizer.transform(X_test['preprocessed_model_info'].values)

print('- '*50)
print("After vectorization")
print(X_train_model_info_bow.shape, y_train.shape)
print(X_test_model_info_bow.shape)

['10', '10 months', '100', '100 condition', '11', '11 64', '11 64 gb', '11 64gb', '11 months', '11 name49']
['year apple', 'year name243', 'year name87', 'year old', 'year warranty', 'years', 'years old', 'z2', 'z2 plus', 'zu']
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After vectorization
(2326, 1544) (2326,)
(997, 1544)


In [80]:
# preprocessed_add_desc BOW encoding

vectorizer = CountVectorizer(min_df = 3, ngram_range = (1, 4))
vectorizer.fit(X_train['preprocessed_add_desc'].values) # fit has to happen only on train data
print(vectorizer.get_feature_names()[:10]) 
print(vectorizer.get_feature_names()[-10:])

# we use the fitted CountVectorizer to convert the text to vector
X_train_add_desc_bow = vectorizer.transform(X_train['preprocessed_add_desc'].values)
X_test_add_desc_bow = vectorizer.transform(X_test['preprocessed_add_desc'].values)

print('- '*50)
print("After vectorization")
print(X_train_add_desc_bow.shape, y_train.shape)
print(X_test_add_desc_bow.shape)

['01', '01 call', '01 call 8oo', '01 call 8oo 8oo', '01 call us', '01 call us 8oo', '10', '10 10', '10 10 100', '10 10 100 fixed']
['you', 'yr', 'yr old', 'yrs', 'ysame', 'ysame day', 'ysame day week', 'ysame day week return', 'zero', 'zoom']
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After vectorization
(2326, 5414) (2326,)
(997, 5414)


In [81]:
# preprocessed_model_info TF-IDF encoding

vectorizer = TfidfVectorizer(min_df = 3, ngram_range = (1, 4))
vectorizer.fit(X_train['preprocessed_model_info'].values) # fit has to happen only on train data
print(vectorizer.get_feature_names()[:10]) 
print(vectorizer.get_feature_names()[-10:])

# we use the fitted CountVectorizer to convert the text to vector
X_train_model_info_tfidf = vectorizer.transform(X_train['preprocessed_model_info'].values)
X_test_model_info_tfidf = vectorizer.transform(X_test['preprocessed_model_info'].values)

print('- '*50)
print("After vectorization")
print(X_train_model_info_tfidf.shape, y_train.shape)
print(X_test_model_info_tfidf.shape)

['10', '10 months', '100', '100 condition', '11', '11 64', '11 64 gb', '11 64gb', '11 months', '11 name49']
['year apple', 'year name243', 'year name87', 'year old', 'year warranty', 'years', 'years old', 'z2', 'z2 plus', 'zu']
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After vectorization
(2326, 1544) (2326,)
(997, 1544)


In [82]:
# preprocessed_add_desc TF-IDF encoding

vectorizer = TfidfVectorizer(min_df = 3, ngram_range = (1,4))
vectorizer.fit(X_train['preprocessed_add_desc'].values) # fit has to happen only on train data
print(vectorizer.get_feature_names()[:10]) 
print(vectorizer.get_feature_names()[-10:])

# we use the fitted CountVectorizer to convert the text to vector
X_train_add_desc_tfidf = vectorizer.transform(X_train['preprocessed_add_desc'].values)
X_test_add_desc_tfidf = vectorizer.transform(X_test['preprocessed_add_desc'].values)

print('- '*50)
print("After vectorization")
print(X_train_add_desc_tfidf.shape, y_train.shape)
print(X_test_add_desc_tfidf.shape)

['01', '01 call', '01 call 8oo', '01 call 8oo 8oo', '01 call us', '01 call us 8oo', '10', '10 10', '10 10 100', '10 10 100 fixed']
['you', 'yr', 'yr old', 'yrs', 'ysame', 'ysame day', 'ysame day week', 'ysame day week return', 'zero', 'zoom']
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
After vectorization
(2326, 5414) (2326,)
(997, 5414)


### Stacking the above perprocessed model 

In [83]:
# set 1: with BOW encoded text features 

X_train_bow = hstack((X_train_brand, X_train_locality, X_train_city, X_train_state, 
                      X_train_model_info_bow, X_train_add_desc_bow))
                     
X_test_bow = hstack((X_test_brand, X_test_locality, X_test_city, X_test_state, 
                      X_test_model_info_bow, X_test_add_desc_bow))

print(X_train_bow.shape, y_train.shape)
print(X_test_bow.shape)

(2326, 7957) (2326,)
(997, 7957)


In [84]:
# set 2: with TFIDF encoded text features 

X_train_tfidf = hstack((X_train_brand, X_train_locality, X_train_city, X_train_state, 
                      X_train_model_info_tfidf, X_train_add_desc_tfidf))
                     
X_test_tfidf = hstack((X_test_brand, X_test_locality, X_test_city, X_test_state, 
                      X_test_model_info_tfidf, X_test_add_desc_tfidf))

print(X_train_tfidf.shape, y_train.shape)
print(X_test_tfidf.shape)

(2326, 7957) (2326,)
(997, 7957)


### Model 1: Linear Support Vector Regressor 

In [None]:
for i in [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    svr = SVR(kernel = 'linear', C = i)
    svr.fit(X_train_bow, y_train)
    y_pred_train = svr.predict(X_train_bow)
    y_pred_cv = svr.predict(X_cv_bow)

    y_pred_train = [0 if m < 0 else m for m in y_pred_train]
    y_pred_cv = [0 if n < 0 else n for n in y_pred_cv]

    train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
    cv_loss = np.sqrt(mean_squared_log_error(y_cv, y_pred_cv))
    print('alpha :', i, 'train loss:', train_loss, 'cv loss:', cv_loss)

In [None]:
svr = SVR(kernel = 'linear', C = 100)
svr.fit(X_train_tfidf, y_train)
y_pred_train = xgb.predict(X_train_tfidf)
y_pred_test = xgb.predict(X_test_tfidf)

y_pred_train = [0 if m < 0 else m for m in y_pred_train]
y_pred_test = [0 if n < 0 else n for n in y_pred_test]

train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
print('train loss:', train_loss)

train loss: 0.5404383896436994


In [None]:
svr_df1 = pd.DataFrame(data = y_pred_test, columns = ['Price'])
svr_df1 .head()

Unnamed: 0,Price
0,21468.050781
1,20871.742188
2,20871.742188
3,27101.177734
4,13008.355469


In [None]:
# SUBMISSION 1

svr_df1.to_excel(mypath + 'svr_1.xlsx', index = False)

### Model 2: XGBoost Regressor with BOW encoded features 

In [86]:
# Machine Learning model: ensemble model XGBoost regressor with best hyperparameter found above 
from xgboost.sklearn import XGBRegressor
 
xgb = XGBRegressor(colsample_bytree = 0.8, n_estimators = 250, objective ='reg:squarederror', max_depth = 5, random_state = 42)
xgb.fit(X_train_bow, y_train)
y_pred_train = xgb.predict(X_train_bow)
y_pred_test = xgb.predict(X_test_bow)

y_pred_train = [0 if m < 0 else m for m in y_pred_train]
y_pred_test = [0 if n < 0 else n for n in y_pred_test]

train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
print('train loss:', train_loss)

train loss: 0.4221324284214125


In [87]:
xgb_df = pd.DataFrame(data = y_pred_test, columns = ['Price'])
xgb_df.head()

Unnamed: 0,Price
0,19834.861328
1,17818.236328
2,23766.871094
3,25269.828125
4,9485.232422


In [None]:
# SUBMISSION 2

xgb_df.to_excel(mypath + 'xgb_tuned_1.xlsx', index = False)

### Model 3: XGBoost Regressor with TF-IDF encode features

In [88]:
from xgboost.sklearn import XGBRegressor
 
xgb = XGBRegressor(colsample_bytree = 0.8, n_estimators = 250, objective ='reg:squarederror', max_depth = 5, random_state = 42)
xgb.fit(X_train_tfidf, y_train)
y_pred_train = xgb.predict(X_train_tfidf)
y_pred_test = xgb.predict(X_test_tfidf)

y_pred_train = [0 if m < 0 else m for m in y_pred_train]
y_pred_test = [0 if n < 0 else n for n in y_pred_test]

train_loss = np.sqrt(mean_squared_log_error(y_train, y_pred_train))
print('train loss:', train_loss)

train loss: 0.4054309571299996


In [89]:
xgb_df2 = pd.DataFrame(data = y_pred_test, columns = ['Price'])
xgb_df2.head()

Unnamed: 0,Price
0,20688.111328
1,17241.326172
2,17697.835938
3,27364.873047
4,8990.277344


In [None]:
# SUBMISSION 3
xgb_df2.to_excel(mypath + 'xgb_tuned_tfidf.xlsx', index = False)