## Problem Statement
In this project, I try to predict the price of Airbnb listings using **text features, categorical features, and numeric features of the listings**. More specifically, the text features are the name and description of each listing. The categorical features include, for example, the city and the country of the listing. The numeric features include, for example, the service fee.  
  
To use both the text features and all the other features in one prediction model, I plan to first **create a price prediction model using only the text features**. Then, I will **create another prediction model that uses the prediction of the first model as one of its features, together with all the other features, to predict the final price**. The final model works similarly to a **stacking model**.

## Data Source
This is a public dataset provided by Airbnb. The data source is linked below. Every row in the dataset represents one Airbnb listing. The original data has **26 columns and 102599 rows**.   
  
In the dataset, the **price** column is the outcome (dependent) variable.
  
**Data Source: https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata**

## 1. Import data
In this section, I import the raw data and rename the outcome variable as y.

In [1]:
import pandas as pd
import numpy as np
import os
os.getcwd()
os.chdir('/Users/haochunniu/Desktop/Kaggle Compatition/Airbnb Open Data')

In [2]:
raw = pd.read_csv('Airbnb_Open_Data.csv',header=0,low_memory=False)
raw = raw.rename(columns={'price':'y'})
raw.head(3)

Unnamed: 0,id,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,...,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,house_rules,license
0,1001254,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,...,$193,10.0,9.0,10/19/2021,0.21,4.0,6.0,286.0,Clean up and treat the home the way you'd like...,
1,1002102,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,...,$28,30.0,45.0,5/21/2022,0.38,4.0,2.0,228.0,Pet friendly but please confirm with me if the...,
2,1002403,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,...,$124,3.0,0.0,,,5.0,1.0,352.0,"I encourage you to use my kitchen, cooking and...",


## 2. Data pre-processing and cleaning
In this section, I check the data quality of the original dataset. More specifically, I perform the following steps.  
  
**a.   Fix the price columns (y, service fee)**  
**b.   Inspect and keep only the informative variables**  
**c.   Keep records with no NA values in remaining columns**  
**d.   Add the text of neighbourhood and room type into description**  
**e.   Train, validation, and test data split**  
**f.   Data visualization on train data**


In [3]:
# a. Fix the price columns (y, service fee)
# Originally, all prices were strings that started with a dollar sign, so I strip the dollar sign and the thousands separator and convert the columns to a numeric type
# regex=False treats "$" and "," as literal characters
raw['y']=raw['y'].replace([np.inf, -np.inf],np.nan)
raw['y']=raw['y'].str.replace("$","",regex=False)
raw['y']=raw['y'].str.replace(",","",regex=False)
raw['y']=raw['y'].astype(float)

raw['service fee']=raw['service fee'].replace([np.inf, -np.inf],np.nan)
raw['service fee']=raw['service fee'].str.replace("$","",regex=False)
raw['service fee']=raw['service fee'].str.replace(",","",regex=False)
raw['service fee']=raw['service fee'].astype(float)


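The same cleaning step can be wrapped in a small helper so that both price columns go through identical logic. This is only a sketch: the function name `to_number` and the demo values are made up for illustration, and it assumes that any value that still fails to parse should become NaN.

```python
import pandas as pd

def to_number(col: pd.Series) -> pd.Series:
    """Strip '$', ',' and whitespace, then convert to float (unparseable values become NaN)."""
    cleaned = col.str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce").astype(float)

# Made-up demo values in the same format as the raw price columns
demo = pd.Series(["$966 ", "$1,060", None])
prices = to_number(demo)
print(prices.tolist())
```

With such a helper, `raw['y']` and `raw['service fee']` could each be converted in one line.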


In [4]:
# b. Inspect and keep only the informative columns
# In the end, I kept only 8 columns: the outcome y and 7 useful features
raw['Construction year']=raw['Construction year'].astype('Int64')
df1=raw[['y','NAME','host_identity_verified','neighbourhood group','instant_bookable','cancellation_policy','room type','Construction year']]
df1=df1.rename(columns={'NAME':'describtion',
                        'neighbourhood group':'neighbourhood_group',
                        'room type':'room_type',
                        'Construction year':'construction_year'})
df1['neighbourhood_group']=np.where(df1['neighbourhood_group']=='brookln','Brooklyn',np.where(df1['neighbourhood_group']=='manhatan','Manhattan',df1['neighbourhood_group']))
df1.head(3)


Unnamed: 0,y,describtion,host_identity_verified,neighbourhood_group,instant_bookable,cancellation_policy,room_type,construction_year
0,966.0,Clean & quiet apt home by the park,unconfirmed,Brooklyn,False,strict,Private room,2020
1,142.0,Skylit Midtown Castle,verified,Manhattan,False,moderate,Entire home/apt,2007
2,620.0,THE VILLAGE OF HARLEM....NEW YORK !,,Manhattan,True,flexible,Private room,2005


In [5]:
# c. Keep records with no NA values in remaining columns
# In the remaining columns, all columns had less than 0.3% of NA
print('Before dropping the NAs, there are {} rows of data.'.format(len(df1)))
print('----------------------------------------------------')
print(round(df1.isna().sum()/len(df1)*100,2))

Before dropping the NAs, there are 102599 rows of data.
----------------------------------------------------
y                         0.24
describtion               0.24
host_identity_verified    0.28
neighbourhood_group       0.03
instant_bookable          0.10
cancellation_policy       0.07
room_type                 0.00
construction_year         0.21
dtype: float64


In [6]:
# Now there are no NA values left in the data
df2=df1.dropna()
df2=df2.reset_index(drop=True)
n=round((1-(len(df2)/len(df1)))*100,2)
print('After dropping the NAs, there are {} rows of data.\nAbout {}% of rows are dropped.'.format(len(df2),n))
print('-----------------------------------------------------')
print(round(df2.isna().sum()/len(df2)*100,2))

After dropping the NAs, there are 101544 rows of data.
About 1.03% of rows are dropped.
-----------------------------------------------------
y                         0.0
describtion               0.0
host_identity_verified    0.0
neighbourhood_group       0.0
instant_bookable          0.0
cancellation_policy       0.0
room_type                 0.0
construction_year         0.0
dtype: float64


In [7]:
# d. Add the text of neighbourhood and room type into description
# After several rounds of trial and error, I noticed that if the description of the listing did not include the location and room type,
# the first-stage prediction model, which predicts the price from the text description only, did not perform well.
# Hence, I add the neighbourhood and room type to the description text
x=[]
for i in range(len(df2)):
    tem="{}. This listing is a {} at {}.".format(df2['describtion'][i],df2['room_type'][i],df2['neighbourhood_group'][i])
    x.append(tem)
df2['describtion']=x
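The row-by-row loop above can also be written as a vectorized string concatenation, which is usually faster on a table of this size. The sketch below uses a one-row, made-up frame with the same column names, and it assumes all three columns are already free of missing values (as they are after step c).

```python
import pandas as pd

# One made-up row with the same column names as df2
demo = pd.DataFrame({
    "describtion": ["Skylit Midtown Castle"],
    "room_type": ["Entire home/apt"],
    "neighbourhood_group": ["Manhattan"],
})

# Element-wise concatenation produces the same sentence as the format string in the loop
demo["describtion"] = (
    demo["describtion"]
    + ". This listing is a " + demo["room_type"]
    + " at " + demo["neighbourhood_group"] + "."
)
print(demo["describtion"][0])
```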

In [8]:
# e. Train, validation, and test data split
# I used 72% of the data as train data, 18% of the data as validation data, and 10% as test data.
from sklearn.model_selection import train_test_split
x_train_val, x_test, y_train_val, y_test = train_test_split(df2.drop(columns=['y']),
                                                            df2['y'],
                                                            test_size=0.1,
                                                            random_state=9)
train_val=pd.concat([x_train_val,y_train_val],axis=1)
x_train, x_val, y_train, y_val = train_test_split(train_val.drop(columns=['y']),
                                                  train_val['y'],
                                                  test_size=0.2,
                                                  random_state=99)

print('There are totally {} rows of data in train data, {} rows of data in validation data, and {} rows of data in test data.'.format(len(y_train),len(y_val),len(y_test)))
                                            

There are totally 73111 rows of data in train data, 18278 rows of data in validation data, and 10155 rows of data in test data.
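The 72% / 18% / 10% proportions follow from applying the two splits in sequence: 10% is held out first, and 20% of the remaining 90% becomes validation data (0.9 × 0.2 = 0.18). The sketch below reproduces the row counts by hand. The helper name `split_sizes` is made up, and it assumes that the held-out share is rounded up in each split, which is consistent with the counts printed above.

```python
import math

def split_sizes(n_rows, test_frac=0.1, val_frac=0.2):
    """Row counts after a test split followed by a validation split."""
    n_test = math.ceil(n_rows * test_frac)        # held-out share, rounded up
    n_train_val = n_rows - n_test
    n_val = math.ceil(n_train_val * val_frac)     # share of the remainder, rounded up
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

sizes = split_sizes(101544)
print(sizes)
```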


In [9]:
x_train=x_train.reset_index(drop=True)
x_val=x_val.reset_index(drop=True)
x_test=x_test.reset_index(drop=True)
y_train=y_train.reset_index(drop=True)
y_val=y_val.reset_index(drop=True)
y_test=y_test.reset_index(drop=True)

In [10]:
# f. Data visualization on train data
# In this project, I use Tableau for visualization. In order to use Tableau, I output the full train data.
# The link to the Tableau dashboard is https://public.tableau.com/views/AirbnbOpenData/Visualization?:language=en-US&publish=yes&:display_count=n&:origin=viz_share_link .
# Feel free to view the dashboard!!!
train_all=pd.concat([x_train,y_train],axis=1)
train_all.to_csv('train_all.csv')


## 3. Model1: Price prediction model using listing description text  
In this project, I use the concept of a stacking ensemble model. I first create a price prediction model that uses only the description of the listing as its feature.  
More specifically, I try two different RNN models, a simple RNN and an LSTM, for this price prediction model. 

In [11]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping
from keras_tuner import RandomSearch  # the old `kerastuner` import path is deprecated



In [12]:
# Step1. Extract only the text field
train_text=np.array(x_train['describtion'])
val_text=np.array(x_val['describtion'])
test_text=np.array(x_test['describtion'])

In [13]:
# Step 2-1. Define the text pre-processing (vectorization) layer
# TextVectorization is available directly under keras.layers in TensorFlow 2.6 and later
vectorize_layer = keras.layers.TextVectorization(
    max_tokens = None,
    standardize = 'lower_and_strip_punctuation',
    split = 'whitespace',
    ngrams = None,
    output_mode = 'int',
    output_sequence_length = None)



In [14]:
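# Step 2-2. Fit the vectorization layer on the train text with "adapt".
# This step builds the vocabulary (tokenization); the word embedding itself is learned later by the Embedding layer.
vectorize_layer.adapt(train_text)
print('After adapting the vectorizer, the vocabulary contains {} tokens.'.format(len(vectorize_layer.get_vocabulary())))

After adapting the vectorizer, the vocabulary contains 13238 tokens.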
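To make the vectorization step more concrete, here is a pure-Python sketch of what the settings above ask for: lowercase the text, strip punctuation, split on whitespace, and map each token to an integer. It is a simplified stand-in and not TensorFlow's implementation. The function names are made up, index 0 is used for padding and 1 for unknown tokens, and the alphabetical tie-break between equally frequent tokens is an assumption of this sketch.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def build_vocab(texts):
    """Map tokens to integer ids; 0 is reserved for padding and 1 for unknown tokens."""
    counts = Counter(tok for t in texts for tok in standardize(t))
    # Most frequent tokens first; the alphabetical tie-break is an assumption
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i + 2 for i, tok in enumerate(ordered)}

def encode(text, vocab, length):
    """Convert a text to a fixed-length list of ids, padding with 0."""
    ids = [vocab.get(tok, 1) for tok in standardize(text)][:length]
    return ids + [0] * (length - len(ids))

vocab = build_vocab(["Clean & quiet apt home", "Quiet apt by the park"])
encoded = encode("Quiet loft by the park!", vocab, length=6)
print(encoded)
```

The word "loft" is not in the toy vocabulary, so it maps to the unknown id 1. In the notebook, `output_sequence_length = None` means the padded length is not fixed in advance as it is here.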


### a. Simple RNN model  
First, I try the most basic RNN model and use random search to find the best hyper-parameters.

In [20]:
#Create the RNN Random Search structure
def build_model(hp):
    model_rnn = keras.Sequential()

    model_rnn.add(vectorize_layer)

    model_rnn.add(keras.layers.Embedding(
    input_dim = len(vectorize_layer.get_vocabulary()),
    output_dim = 256,
    mask_zero = True
    ))

    model_rnn.add(keras.layers.SimpleRNN(units=hp.Int('units',
                                                      min_value=10,
                                                      max_value=400,
                                                      step=10),
                                         return_sequences=True,
                                         dropout=0.2))
    
    model_rnn.add(keras.layers.SimpleRNN(units=hp.Int('units',
                                                      min_value=10,
                                                      max_value=400,
                                                      step=10),
                                         return_sequences=True,
                                         dropout=0.2))

    model_rnn.add(keras.layers.SimpleRNN(units=hp.Int('units',
                                                      min_value=10,
                                                      max_value=400,
                                                      step=10),
                                         dropout=0.2))

    model_rnn.add(keras.layers.Dense(1,activation='linear'))

    model_rnn.compile(optimizer=keras.optimizers.Adam(hp.Choice('learning_rate',
                                                                values=[0.01,0.001,0.0001])),
                      loss='mse',
                      metrics=['mse'])
    return model_rnn

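# Note: objective='mse' ranks the trials by training MSE; 'val_mse' would rank them by validation MSE instead
tuner=RandomSearch(build_model,
                   objective='mse',
                   max_trials=2,
                   overwrite=True, # Always remember to add this: it discards the results of any earlier search in the same project directory
                   executions_per_trial=1)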
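It helps to compare the size of the search space with the number of trials. Because the hyper-parameter name `'units'` is reused for all three recurrent layers, a single sampled value is shared by them, so the space has only two dimensions. The sketch below enumerates it in plain Python, assuming both endpoints of the `units` range are included and, as a simplification, that trials are drawn uniformly at random. The seed is arbitrary.

```python
import itertools
import random

units_options = list(range(10, 401, 10))    # 10, 20, ..., 400
lr_options = [0.01, 0.001, 0.0001]
search_space = list(itertools.product(units_options, lr_options))

rng = random.Random(0)                       # arbitrary seed, for illustration only
sampled = rng.sample(search_space, k=2)      # max_trials=2
coverage = len(sampled) / len(search_space)

print(len(search_space), sampled, round(coverage, 3))
```

With two trials, under 2% of the 120 combinations are evaluated, so the reported "best" hyper-parameters are best among a very small sample.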

In [21]:
#Start the Search
tuner.search(x=train_text,y=y_train,epochs=50,batch_size=256,validation_data=(val_text,y_val))

Trial 2 Complete [01h 11m 46s]
mse: 123975.3984375

Best mse So Far: 109862.1484375
Total elapsed time: 02h 27m 46s
INFO:tensorflow:Oracle triggered exit


In [22]:
# Result
result=tuner.get_best_hyperparameters()[0].values
print('The best 3 layers RNN parameters would be {} neurons and {} learning rate.'.format(result['units'],result['learning_rate']))
print('------------------------------------------')
print(tuner.results_summary())

The best 3 layers RNN parameters would be 380 neurons and 0.01 learning rate.
------------------------------------------
Results summary
Results in ./untitled_project
Showing 10 best trials
<keras_tuner.engine.objective.Objective object at 0x1594d0df0>
Trial summary
Hyperparameters:
units: 380
learning_rate: 0.01
Score: 109862.1484375
Trial summary
Hyperparameters:
units: 360
learning_rate: 0.0001
Score: 123975.3984375
None


In [23]:
#Get the final model
from keras.models import load_model
RNNmodel=tuner.get_best_models()[0]

In [24]:
#Re-train and save the best model - Before saving the model, remember to re-train the model
RNNmodel.fit(train_text,y_train)
RNNmodel.save("RNNmodel")
#RNNmodel = load_model("RNNmodel")

INFO:tensorflow:Assets written to: RNNmodel/assets


In [17]:
#Predict with the final model
RNN_prediction_test=RNNmodel.predict(test_text)
RNN_prediction_test=[i[0]for i in RNN_prediction_test]
RNN_prediction_val=RNNmodel.predict(val_text)
RNN_prediction_val=[i[0]for i in RNN_prediction_val]
RNN_prediction_train=RNNmodel.predict(train_text)
RNN_prediction_train=[i[0]for i in RNN_prediction_train]



In [18]:
#RMSE on test data
from sklearn import metrics
import math
mse=metrics.mean_squared_error(y_test,RNN_prediction_test)
rmse=math.sqrt(mse)
print('The RMSE of test data with the simple RNN prediction model is {}'.format(round(rmse,2)))

The RMSE of test data with the simple RNN prediction model is 332.71


In [19]:
#Save the final prediction of the RNN model
RNN_prediction_test=pd.DataFrame({'RNN_prediction_test':RNN_prediction_test})
RNN_prediction_test.to_csv('RNN Prediction on Test data.csv')
RNN_prediction_val=pd.DataFrame({'RNN_prediction_val':RNN_prediction_val})
RNN_prediction_val.to_csv('RNN Prediction on Validation data.csv')
RNN_prediction_train=pd.DataFrame({'RNN_prediction_train':RNN_prediction_train})
RNN_prediction_train.to_csv('RNN Prediction on Train data.csv')

### b. LSTM model  
Next, given that the simple RNN model does not perform as well as expected, I try a more advanced method, the LSTM model.

In [34]:
#Create the LSTM Random Search structure
def build_model(hp):
    model_lstm = keras.Sequential()

    model_lstm.add(vectorize_layer)

    model_lstm.add(keras.layers.Embedding(
    input_dim = len(vectorize_layer.get_vocabulary()),
    output_dim = 256,
    mask_zero = True
    ))

    model_lstm.add(keras.layers.LSTM(units=hp.Int('units',
                                                  min_value=10,
                                                  max_value=400,
                                                  step=10),
                                     return_sequences=True,
                                     dropout=0.2))
    
    model_lstm.add(keras.layers.LSTM(units=hp.Int('units',
                                                  min_value=10,
                                                  max_value=400,
                                                  step=10),
                                     return_sequences=True,
                                     dropout=0.2))

    model_lstm.add(keras.layers.LSTM(units=hp.Int('units',
                                                  min_value=10,
                                                  max_value=400,
                                                  step=10),
                                     dropout=0.2))

    model_lstm.add(keras.layers.Dense(1,activation='linear'))

    model_lstm.compile(optimizer=keras.optimizers.Adam(hp.Choice('learning_rate',
                                                                values=[0.01,0.001,0.0001])),
                       loss='mse',
                       metrics=['mse'])
    return model_lstm

tuner=RandomSearch(build_model,
                   objective='mse',
                   max_trials=2,
                   overwrite=True, #Always remember to add this
                   executions_per_trial=1)

In [35]:
#Start the Search
tuner.search(x=train_text,y=y_train,epochs=50,batch_size=256,validation_data=(val_text,y_val))

Trial 2 Complete [01h 40m 25s]
mse: 207848.34375

Best mse So Far: 108563.9375
Total elapsed time: 03h 41m 22s
INFO:tensorflow:Oracle triggered exit


In [36]:
# Result
result=tuner.get_best_hyperparameters()[0].values
print('The best 3 layers LSTM parameters would be {} neurons and {} learning rate.'.format(result['units'],result['learning_rate']))
print('------------------------------------------')
print(tuner.results_summary())

The best 3 layers LSTM parameters would be 260 neurons and 0.01 learning rate.
------------------------------------------
Results summary
Results in ./untitled_project
Showing 10 best trials
<keras_tuner.engine.objective.Objective object at 0x15966d0f0>
Trial summary
Hyperparameters:
units: 260
learning_rate: 0.01
Score: 108563.9375
Trial summary
Hyperparameters:
units: 220
learning_rate: 0.0001
Score: 207848.34375
None


In [37]:
#Get the final model
from keras.models import load_model
LSTMmodel=tuner.get_best_models()[0]

#Re-train and save the best model - Before saving the model, remember to re-train the model
LSTMmodel.fit(train_text,y_train)
LSTMmodel.save("LSTMmodel")
#LSTMmodel = load_model("LSTMmodel") Load Model
#LSTMmodel.predict(...) Use the model, it is already trained

#Predict with the final model
LSTM_prediction_test=LSTMmodel.predict(test_text)
LSTM_prediction_test=[i[0]for i in LSTM_prediction_test]
LSTM_prediction_val=LSTMmodel.predict(val_text)
LSTM_prediction_val=[i[0]for i in LSTM_prediction_val]
LSTM_prediction_train=LSTMmodel.predict(train_text)
LSTM_prediction_train=[i[0]for i in LSTM_prediction_train]






INFO:tensorflow:Assets written to: LSTMmodel/assets

In [21]:
#RMSE on test data
from sklearn import metrics
import math
mse=metrics.mean_squared_error(y_test,LSTM_prediction_test)
rmse=math.sqrt(mse)
print('The RMSE of test data with the LSTM prediction model is {}'.format(round(rmse,2)))


The RMSE of test data with the LSTM prediction model is 331.23


In [22]:
#Save the final prediction of the LSTM model
LSTM_prediction_test=pd.DataFrame({'LSTM_prediction_test':LSTM_prediction_test})
LSTM_prediction_test.to_csv('LSTM Prediction on Test data.csv')
LSTM_prediction_val=pd.DataFrame({'LSTM_prediction_val':LSTM_prediction_val})
LSTM_prediction_val.to_csv('LSTM Prediction on Validation data.csv')
LSTM_prediction_train=pd.DataFrame({'LSTM_prediction_train':LSTM_prediction_train})
LSTM_prediction_train.to_csv('LSTM Prediction on Train data.csv')

## 4. Model2: Stacking price prediction model using the prediction from the previous model and the rest of the features
After finishing the first-stage model, I build a stacking model. This model uses the prediction of the first-stage model and the rest of the features as input. Here, I use nested random-search cross-validation to find the best-performing model type, and then random search to find the best hyper-parameters for it.
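The two-stage idea can be sketched end to end on toy data. This is not the notebook's pipeline: a TF-IDF plus ridge regression stands in for the LSTM, the data are made up, and the stage-1 feature for the training rows is built from out-of-fold predictions via `cross_val_predict`, which is a common stacking variant. The notebook itself feeds in-sample LSTM predictions to the second stage.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy data (made up): 12 listings with a text field, one numeric feature, and a price
texts = np.array(["cozy private room", "luxury entire home", "shared room downtown",
                  "sunny private room", "entire home with view", "quiet shared room"] * 2)
numeric = np.arange(12, dtype=float).reshape(-1, 1)
y = np.array([80, 300, 40, 90, 320, 45] * 2, dtype=float)

# Stage 1: text-only model (TF-IDF + ridge stands in for the LSTM)
stage1 = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
# Each row is predicted by a model that did not see that row during fitting
oof_pred = cross_val_predict(stage1, texts, y, cv=3)

# Stage 2: tabular model on [stage-1 prediction, other features]
X_stage2 = np.column_stack([oof_pred, numeric])
stage2 = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_stage2, y)

# For new listings, stage 1 is refit on all training rows and then applied to the new text
stage1.fit(texts, y)
new_text = np.array(["private room near park"])
new_X = np.column_stack([stage1.predict(new_text), [[3.0]]])
final_pred = stage2.predict(new_X)
print(final_pred)
```

The design question this raises is whether the second stage sees first-stage predictions of the same quality at training time and at test time. Out-of-fold predictions are one way to keep the two comparable.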

In [23]:
import matplotlib.pyplot as plt
from sklearn import metrics
from matplotlib.pyplot import figure
from sklearn.model_selection import cross_val_score,KFold,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
import lightgbm as lgb

In [22]:
#Predict with the final model again, so that the predictions are plain lists (the previous cell converted them to DataFrames)
LSTM_prediction_test=LSTMmodel.predict(test_text)
LSTM_prediction_test=[i[0]for i in LSTM_prediction_test]
LSTM_prediction_val=LSTMmodel.predict(val_text)
LSTM_prediction_val=[i[0]for i in LSTM_prediction_val]
LSTM_prediction_train=LSTMmodel.predict(train_text)
LSTM_prediction_train=[i[0]for i in LSTM_prediction_train]



In [24]:
x_train['first_stage_pred']=LSTM_prediction_train
x_val['first_stage_pred']=LSTM_prediction_val
x_test['first_stage_pred']=LSTM_prediction_test
x_train=x_train.drop(columns=['describtion'])
x_val=x_val.drop(columns=['describtion'])
x_test=x_test.drop(columns=['describtion'])

In [25]:
#Convert all the categorical variables to dummy variables
x_train['instant_bookable']=np.where(x_train['instant_bookable'],1,0)
x_train['instant_bookable']=x_train['instant_bookable'].astype('uint8')
x_train['construction_year']=x_train['construction_year'].astype(int)
x_train=pd.get_dummies(x_train,columns=['host_identity_verified','neighbourhood_group','cancellation_policy','room_type'])
x_val['instant_bookable']=np.where(x_val['instant_bookable'],1,0)
x_val['instant_bookable']=x_val['instant_bookable'].astype('uint8')
x_val['construction_year']=x_val['construction_year'].astype(int)
x_val=pd.get_dummies(x_val,columns=['host_identity_verified','neighbourhood_group','cancellation_policy','room_type'])
x_test['instant_bookable']=np.where(x_test['instant_bookable'],1,0)
x_test['instant_bookable']=x_test['instant_bookable'].astype('uint8')
x_test['construction_year']=x_test['construction_year'].astype(int)
x_test=pd.get_dummies(x_test,columns=['host_identity_verified','neighbourhood_group','cancellation_policy','room_type'])
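Calling `pd.get_dummies` separately on each split only yields identical columns when every category appears in every split. One hedged safeguard is to reindex the validation and test frames to the training columns. The sketch below uses made-up frames in which one room type is missing from the test split.

```python
import pandas as pd

# Made-up splits: "Hotel room" appears in train but not in test
train = pd.DataFrame({"room_type": ["Private room", "Hotel room", "Shared room"]})
test = pd.DataFrame({"room_type": ["Private room", "Shared room"]})

train_d = pd.get_dummies(train, columns=["room_type"])
test_d = pd.get_dummies(test, columns=["room_type"])

# Reuse the training columns; a category missing from a split becomes an all-zero column
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))
```

The same `reindex` call could be applied to `x_val` and `x_test` with `x_train.columns` before fitting the second-stage model.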

In [26]:
#For the second model, because I use cross-validation, I no longer need a separate validation set, so I merge the train and validation data.
x_train_val=pd.concat([x_train,x_val])
x_train_val=x_train_val.reset_index(drop=True)
y_train_val=pd.concat([y_train,y_val])
y_train_val=y_train_val.reset_index(drop=True)

### a. Nested random search to find the best model

In [27]:
# 1. Create the regressors
rf=RandomForestRegressor(random_state=9)
xgb=XGBRegressor(seed=9,objective='reg:squarederror',use_label_encoder =False,verbosity = 0)
lgbm=lgb.LGBMRegressor(objective='regression',random_state=9)

##############################################################
# 2. Create the parameter grid
rf_grid={'n_estimators':list(range(100,1100,100)),
         'max_depth':list(range(3,11))}
xgb_grid={'eta':np.arange(0.1,0.6,0.1),
          'max_depth':list(range(3,16)),
          'n_estimators':list(range(10,310,10)),
          'gamma':list(range(1,6))}
lgbm_grid={'learning_rate':np.arange(0.1,0.6,0.1),
           'max_depth':list(range(3,16)),
           'n_estimators':list(range(10,310,10))}

##############################################################
# 3. Create the CV
inner_cv = KFold(n_splits=3, shuffle=True, random_state=9)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=9)

##############################################################
# 4-1-1. Random-search CV for Random Forest
clf = RandomizedSearchCV(rf,rf_grid,cv=inner_cv,scoring='neg_root_mean_squared_error',n_iter=15,random_state=9)

# 4-1-2. Nested CV for Random Forest
nested_score = cross_val_score(clf,X=x_train_val, y=y_train_val, cv=outer_cv,scoring='neg_root_mean_squared_error')

# 4-1-3. Result for Nested CV
rf_result=nested_score.mean()

##############################################################
# 4-2-1. Random-search CV for XGBoost Regressor
clf = RandomizedSearchCV(xgb,xgb_grid,cv=inner_cv,scoring='neg_root_mean_squared_error',n_iter=15,random_state=9)

# 4-2-2. Nested CV for XGBoost Regressor
nested_score = cross_val_score(clf,X=x_train_val, y=y_train_val, cv=outer_cv,scoring='neg_root_mean_squared_error')

# 4-2-3. Result for Nested CV
xgb_result=nested_score.mean()

##############################################################
# 4-3-1. Random-search CV for LightGBM Regressor
clf = RandomizedSearchCV(lgbm,lgbm_grid,cv=inner_cv,scoring='neg_root_mean_squared_error',n_iter=15,random_state=9)

# 4-3-2. Nested CV for LightGBM Regressor
nested_score = cross_val_score(clf,X=x_train_val, y=y_train_val, cv=outer_cv,scoring='neg_root_mean_squared_error')

# 4-3-3. Result for Nested CV
lgbm_result=nested_score.mean()

In [28]:
print('Average RMSE of Random Forest Regressor: {}'.format(round(-1*rf_result,2)))
print('Average RMSE of XGBoost Regressor: {}'.format(round(-1*xgb_result,2)))
print('Average RMSE of LightGBM Regressor: {}'.format(round(-1*lgbm_result,2)))

Average RMSE of Random Forest Regressor: 328.85
Average RMSE of XGBoost Regressor: 323.04
Average RMSE of LightGBM Regressor: 330.2


### b. Use random search to find the best hyper-parameters for the XGBoost model 

In [30]:
# 1. Create the regressor
xgb=XGBRegressor(seed=9,objective='reg:squarederror',use_label_encoder =False,verbosity = 0)

# 2. Create the parameter grid
xgb_grid={'eta':np.arange(0.1,0.6,0.1),
          'max_depth':list(range(3,16)),
          'n_estimators':list(range(10,310,10)),
          'gamma':list(range(1,6))}

# 3. Create the CV (note: the search below passes cv=5, so this splitter is not used)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=9)

# 4-1. Random search
xgbmodel = RandomizedSearchCV(xgb,xgb_grid,cv=5,scoring='neg_root_mean_squared_error',n_iter=15,random_state=9)

# 4-2. Fit the model
xgbmodel.fit(x_train_val,y_train_val)

# 5. Predict
y_pred=xgbmodel.predict(x_test)


In [32]:
# 6. Result
best=xgbmodel.best_params_
print("With CV random search, I found that the best hyperparameters are eta (learning rate)={}, max_depth={}, gamma={}, and n_estimators={}.".format(best['eta'],best['max_depth'],best['gamma'],best['n_estimators']))
print("Prediction MSE on Test Data: {}".format(round(metrics.mean_squared_error(y_test,y_pred),2)))


With CV random search, I found that the best hyperparameters are eta (learning rate)=0.2, max_depth=13, gamma=5, and n_estimators=250.
Prediction MSE on Test Data: 92919.64
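The first-stage models were reported as RMSE, while the final model is reported as MSE, so the two are not directly comparable as printed. The sketch below converts the final test MSE to RMSE and compares it with the test RMSE of the text-only LSTM. Both numbers are copied from the outputs above, and the variable names are made up.

```python
import math

final_test_mse = 92919.64    # stacked XGBoost model, from the output above
lstm_test_rmse = 331.23      # text-only LSTM, reported in section 3

final_test_rmse = math.sqrt(final_test_mse)
relative_gain = 1 - final_test_rmse / lstm_test_rmse

print(round(final_test_rmse, 2), round(relative_gain * 100, 1))
```

On the same test set, the stacked model's RMSE comes out at roughly 305, a reduction of about 8% relative to the LSTM alone.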


In [35]:
# 7. Save the final model
import joblib
joblib.dump(xgbmodel.best_estimator_, 'final_XGB_model.pkl') #Save the model
#XGBmodel = joblib.load('final_XGB_model.pkl') Load the model
#XGBmodel.predict(...) Use the model, the model is already trained

['final_XGB_model.pkl']

## 5. Conclusion
Judging by its performance on the test data, the model does not perform as well as expected. The mean absolute percentage error is around 40%. Yet I believe the main cause is neither the model itself nor the techniques. The visualization makes it clear that most of the features have little predictive power. Furthermore, even taking the text feature into account does not improve the model's performance much. Hence, to improve the prediction model in further analysis, I would recommend collecting more informative features before spending more effort on modeling.
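The mean absolute percentage error quoted above is not computed in any cell of this notebook. A minimal sketch of how it could be computed is shown below. The helper name `mape` and the three demo values are made up, and skipping rows with a zero target is an assumption of this sketch.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; rows with a zero target are skipped."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100 * sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)

# Made-up prices and predictions, for illustration only
demo_mape = mape([100, 200, 400], [150, 180, 300])
print(round(demo_mape, 2))
```

Applied to the notebook's variables, the call would be `mape(y_test, y_pred)`.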