## Part 6 - Constructing an Optimal Model

##### Model Selection

- For this project we use Random Forest Regression rather than Decision Tree Regression. A regression model is appropriate because our target variable, price, is a continuous numeric value and not a categorical one. We chose Random Forest over a single decision tree because averaging many trees makes it less prone to overfitting, and it works well with the large number of features in our cleaned dataset.
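The overfitting argument can be illustrated on synthetic data. The sketch below is not run on the Airbnb table: `make_regression`, the sample sizes, and the noise level are all assumptions chosen only for illustration. It compares the train and test R2 of one unpruned tree with those of a forest.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the Chicago Airbnb dataset).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=25.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# An unpruned tree can memorise the training rows.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# A forest averages many trees grown on bootstrap samples.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_train_r2 = tree.score(X_tr, y_tr)
tree_test_r2 = tree.score(X_te, y_te)
forest_train_r2 = forest.score(X_tr, y_tr)
forest_test_r2 = forest.score(X_te, y_te)

print(f"Tree   train R2: {tree_train_r2:.3f}  test R2: {tree_test_r2:.3f}")
print(f"Forest train R2: {forest_train_r2:.3f}  test R2: {forest_test_r2:.3f}")
```

A large gap between a model's train and test R2 is the usual symptom of overfitting; comparing the two gaps shows how much the averaging helps on a given dataset.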

##### Model Training

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

import joblib
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
## get data

from sqlalchemy import create_engine

# connect to the database
# Note: make sure you use the information from your specific PostgreSQL installation
host = r'127.0.0.1' # denotes that the db is a local installation
db = r'MSDS610' # db we just created
user = r'postgres' # using the postgres user for this demo
pw = r'****' # this is the password established during installation
port = r'5432' # default port established during install
schema = r'cleaned' # schema we just created

In [3]:
db_conn = create_engine(f"postgresql://{user}:{pw}@{host}:{port}/{db}")

In [4]:
sql="select tables.table_name from information_schema.tables where (table_schema ='"+schema+"')order by 1;"
tbl_df = pd.read_sql(sql, db_conn, index_col=None)
tbl_df

Unnamed: 0,table_name
0,chicago_airbnb_cleaned
1,chicago_airbnb_cleaned_2
2,global_warming2
3,global_warming_Wth_Risk_Level
4,global_warming_cleaned


In [5]:
table_name = r'chicago_airbnb_cleaned_2'

In [6]:
sql=r'SELECT * FROM ' + schema + '.' + table_name
df = pd.read_sql(sql, db_conn, index_col=None)

In [7]:
df.head()

Unnamed: 0,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,room_type_Entire home/apt,room_type_Hotel room,...,neighbourhood_West Elsdon,neighbourhood_West Englewood,neighbourhood_West Garfield Park,neighbourhood_West Lawn,neighbourhood_West Pullman,neighbourhood_West Ridge,neighbourhood_West Town,neighbourhood_Woodlawn,booking_frequency,price_per_room
0,41.7879,-87.5878,60,2,178,2.56,1,353,False,False,...,False,False,False,False,False,False,False,False,0.504249,30.0
1,41.85495,-87.69696,105,2,395,2.81,1,155,True,False,...,False,False,False,False,False,False,False,False,2.548387,52.5
2,41.90289,-87.68182,60,2,384,2.81,1,321,True,False,...,False,False,False,False,False,False,True,False,1.196262,30.0
3,41.91769,-87.63788,65,4,49,0.63,9,300,True,False,...,False,False,False,False,False,False,False,False,0.163333,16.25
4,41.79612,-87.59261,21,1,44,0.61,5,168,False,False,...,False,False,False,False,False,False,False,False,0.261905,21.0


In [8]:
# prep & split data 

cols = df.columns 

prediction_col = 'price'
feature_cols = [c for c in cols if c != prediction_col]

X = df[feature_cols]
y = df[prediction_col]

In [10]:
# Train-Validation-Test Split - First split
# stratify=y removed: stratification only applies to classification targets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, random_state=42, test_size=0.3)

In [11]:
X_temp.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1428 entries, 2313 to 1840
Data columns (total 89 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   latitude                              1428 non-null   float64
 1   longitude                             1428 non-null   float64
 2   minimum_nights                        1428 non-null   int64  
 3   number_of_reviews                     1428 non-null   int64  
 4   reviews_per_month                     1428 non-null   float64
 5   calculated_host_listings_count        1428 non-null   int64  
 6   availability_365                      1428 non-null   int64  
 7   room_type_Entire home/apt             1428 non-null   bool   
 8   room_type_Hotel room                  1428 non-null   bool   
 9   room_type_Private room                1428 non-null   bool   
 10  room_type_Shared room                 1428 non-null   bool   
 11  neighbourhood_Alban

In [12]:
X_temp.head()

Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,...,neighbourhood_West Elsdon,neighbourhood_West Englewood,neighbourhood_West Garfield Park,neighbourhood_West Lawn,neighbourhood_West Pullman,neighbourhood_West Ridge,neighbourhood_West Town,neighbourhood_Woodlawn,booking_frequency,price_per_room
2313,41.88464,-87.70924,2,42,1.58,4,49,False,False,True,...,False,False,False,False,False,False,False,False,0.857143,15.0
315,41.84911,-87.67924,25,9,0.28,2,139,False,False,True,...,False,False,False,False,False,False,False,False,0.064748,2.0
2328,41.89699,-87.692,1,145,5.5,1,0,True,False,False,...,False,False,False,False,False,False,True,False,inf,75.0
472,41.78632,-87.62237,1,30,0.47,2,359,False,False,True,...,False,False,False,False,False,False,False,False,0.083565,40.0
534,41.93007,-87.69892,2,72,1.17,1,364,True,False,False,...,False,False,False,False,False,False,False,False,0.197802,90.0


In [13]:
y_temp.info()

<class 'pandas.core.series.Series'>
Index: 1428 entries, 2313 to 1840
Series name: price
Non-Null Count  Dtype
--------------  -----
1428 non-null   int64
dtypes: int64(1)
memory usage: 22.3 KB


In [14]:
y_temp.head()

2313     30
315      50
2328     75
472      40
534     180
Name: price, dtype: int64

In [15]:
# Train-Validation-Test Split - Second split
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, random_state=42, test_size=0.5)

In [16]:
print(X_test.shape)
X_test.head()

(714, 89)


Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,...,neighbourhood_West Elsdon,neighbourhood_West Englewood,neighbourhood_West Garfield Park,neighbourhood_West Lawn,neighbourhood_West Pullman,neighbourhood_West Ridge,neighbourhood_West Town,neighbourhood_Woodlawn,booking_frequency,price_per_room
3083,41.80725,-87.62364,2,27,1.48,5,0,False,False,True,...,False,False,False,False,False,False,False,False,inf,25.0
2971,41.94594,-87.6546,1,3,1.67,5,7,False,True,False,...,False,False,False,False,False,False,False,False,0.428571,125.0
683,41.98317,-87.70461,3,87,1.48,1,194,False,False,True,...,False,False,False,False,False,True,False,False,0.448454,20.0
1752,41.94027,-87.66565,2,29,0.83,37,125,False,False,True,...,False,False,False,False,False,False,False,False,0.232,17.5
191,41.89826,-87.70336,1,552,7.16,2,147,True,False,False,...,False,False,False,False,False,False,False,False,3.755102,45.0


In [17]:
print(X_val.shape)
X_val.head()

(714, 89)


Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,...,neighbourhood_West Elsdon,neighbourhood_West Englewood,neighbourhood_West Garfield Park,neighbourhood_West Lawn,neighbourhood_West Pullman,neighbourhood_West Ridge,neighbourhood_West Town,neighbourhood_Woodlawn,booking_frequency,price_per_room
3749,41.77314,-87.69052,3,3,0.27,2,155,False,False,True,...,False,False,False,False,False,False,False,False,0.019355,13.0
2886,41.91816,-87.68476,1,58,3.06,1,0,True,False,False,...,False,False,False,False,False,False,False,False,inf,103.0
4695,41.87674,-87.65295,1,2,1.36,5,362,True,False,False,...,False,False,False,False,False,False,False,False,0.005525,130.0
70,41.99244,-87.75569,3,66,0.71,2,363,False,False,True,...,False,False,False,False,False,False,False,False,0.181818,20.0
599,41.90184,-87.64516,1,90,1.7,2,365,False,False,True,...,False,False,False,False,False,False,False,False,0.246575,100.0


In [18]:
print(y_test.shape)
y_test.head()

(714,)


3083     50
2971    125
683      60
1752     35
191      45
Name: price, dtype: int64

In [19]:
print(y_val.shape)
y_val.head()

(714,)


3749     39
2886    103
4695    130
70       60
599     100
Name: price, dtype: int64

In [20]:
print(f"Training set size: {X_train.shape[0]}")
print(f"Validation set size: {X_val.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

Training set size: 3330
Validation set size: 714
Test set size: 714
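The printed sizes correspond to a roughly 70/15/15 split. As a quick arithmetic check, the sketch below reproduces them from the total row count. The helper `split_sizes` is hypothetical, and it assumes that a fractional `test_size` is rounded up with the remainder going to the other part, which is how I understand `train_test_split` to size a split when only `test_size` is given.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Hypothetical helper. Assumption: the second part is ceil(test_size * n)
    and the first part receives the remaining rows.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

n_total = 3330 + 714 + 714              # rows implied by the printed set sizes
n_train, n_temp = split_sizes(n_total, 0.3)   # first split: train vs. temp
n_test, n_val = split_sizes(n_temp, 0.5)      # second split: test vs. validation

print(n_train, n_val, n_test)
print(round(n_train / n_total, 3), round(n_val / n_total, 3))
```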


In [22]:
X_train

Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,...,neighbourhood_West Elsdon,neighbourhood_West Englewood,neighbourhood_West Garfield Park,neighbourhood_West Lawn,neighbourhood_West Pullman,neighbourhood_West Ridge,neighbourhood_West Town,neighbourhood_Woodlawn,booking_frequency,price_per_room
2487,41.76630,-87.56638,2,15,0.59,12,365,True,False,False,...,False,False,False,False,False,False,False,False,0.041096,37.500000
1786,41.92054,-87.71558,1,173,5.00,1,209,True,False,False,...,False,False,False,False,False,False,False,False,0.827751,79.000000
3093,41.93924,-87.64225,2,22,2.08,3,322,True,False,False,...,False,False,False,False,False,False,False,False,0.068323,119.000000
1705,41.92796,-87.70536,2,68,3.66,1,0,True,False,False,...,False,False,False,False,False,False,False,False,inf,45.500000
1402,41.89652,-87.63249,21,5,0.13,26,365,True,False,False,...,False,False,False,False,False,False,False,False,0.013699,7.095238
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4426,41.85464,-87.62422,2,1,0.29,4,243,True,False,False,...,False,False,False,False,False,False,False,False,0.004115,70.000000
466,41.85568,-87.64582,10,82,1.29,1,49,True,False,False,...,False,False,False,False,False,False,False,False,1.673469,9.100000
3092,41.74542,-87.61527,2,11,0.90,1,0,True,False,False,...,False,False,False,False,False,False,False,False,inf,32.500000
3772,41.87281,-87.63033,7,9,0.86,47,102,True,False,False,...,False,False,False,False,False,False,False,False,0.088235,24.571429



### Building Model 1

`Error: ValueError: Input X contains infinity or a value too large for dtype('float32').`

Fitting the model raised this error. It is caused by the following calculation:

`df_cleaned['booking_frequency'] = df_cleaned['number_of_reviews'] / df_cleaned['availability_365']`

When `availability_365` is 0, the division results in infinity (`inf`), which the model cannot accept as input.
To fix this, we replace the `inf` values with NaN and then replace the NaN values with 0.
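A small, self-contained reproduction of the problem and the fix, using a toy frame (the three rows are made up for illustration; only the column names come from the notebook):

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values): the second and third have availability_365 == 0.
toy = pd.DataFrame({
    "number_of_reviews": [178, 27, 0],
    "availability_365": [353, 0, 0],
})

# Division by zero gives inf; zero divided by zero gives NaN.
ratio = toy["number_of_reviews"] / toy["availability_365"]

# Same two-step fix as in the notebook: inf -> NaN, then NaN -> 0.
cleaned = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

print(ratio.tolist())
print(cleaned.tolist())
```

Filling with 0 treats a listing that is never available as having no booking activity. That is a modelling choice, not a neutral default, so it is worth revisiting if `booking_frequency` turns out to matter.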

In [27]:
# error: ValueError: Input X contains infinity or a value too large for dtype('float32').

print(np.isinf(X_train).sum())  
print(np.isinf(X_test).sum())
    
# Check for extremely large values
print(X_train.max())  
print(X_test.max())

latitude                      0
longitude                     0
minimum_nights                0
number_of_reviews             0
reviews_per_month             0
                           ... 
neighbourhood_West Ridge      0
neighbourhood_West Town       0
neighbourhood_Woodlawn        0
booking_frequency           650
price_per_room                0
Length: 89, dtype: int64
latitude                      0
longitude                     0
minimum_nights                0
number_of_reviews             0
reviews_per_month             0
                           ... 
neighbourhood_West Ridge      0
neighbourhood_West Town       0
neighbourhood_Woodlawn        0
booking_frequency           130
price_per_room                0
Length: 89, dtype: int64
latitude                    42.02251
longitude                  -87.53752
minimum_nights                   365
number_of_reviews                632
reviews_per_month              32.43
                              ...   
neighbourhood_West Ridge

In [29]:
# the check above shows that booking_frequency is the column with the inf values
# replace inf values with NaN
df['booking_frequency'] = df['booking_frequency'].replace([np.inf, -np.inf], np.nan)

In [31]:
# replace NaN with 0 for now
df['booking_frequency'] = df['booking_frequency'].fillna(0)


In [32]:
# prep & split data 

cols = df.columns 

prediction_col = 'price'
feature_cols = [c for c in cols if c != prediction_col]

X = df[feature_cols]
y = df[prediction_col]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, random_state=42, test_size=0.3)

# Train-Validation-Test Split - Second split
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, random_state=42, test_size=0.5)


print(f"Training set size: {X_train.shape[0]}")
print(f"Validation set size: {X_val.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

Training set size: 3330
Validation set size: 714
Test set size: 714


In [33]:
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [34]:
pred_X_test = model.predict(X_test)

In [35]:
# evaluate model

mse = mean_squared_error(y_test, pred_X_test)
rmse = mse ** 0.5 # measures the average difference between predicted and actual values
r2 = r2_score(y_test, pred_X_test)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared Score (R2): {r2:.4f}")

Mean Squared Error (MSE): 50.14
Root Mean Squared Error (RMSE): 7.08
R-squared Score (R²): 0.9843


- The MSE of 50.14 is the average squared difference between actual and predicted prices. A lower MSE indicates better model performance.
- The RMSE tells us that predicted prices typically differ from actual prices by about 7.08 units (in this case, dollars). This is small relative to our price range of 10 to 282.
- The R2 value tells us how much of the variance in the target variable, price, the model explains. At 0.9843, the model explains about 98.43% of the variance in the price data.
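To make the three metrics concrete, here is a hand computation on five values. The actual prices are the first five `y_test` values printed earlier; the predictions are hypothetical numbers invented for this illustration, not output from the model.

```python
import math

y_true = [50, 125, 60, 35, 45]   # first five y_test values shown earlier
y_pred = [52, 120, 63, 30, 45]   # hypothetical predictions, illustration only

n = len(y_true)
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

mse = sum(squared_errors) / n          # mean of the squared errors
rmse = math.sqrt(mse)                  # back in the units of price

mean_true = sum(y_true) / n
ss_res = sum(squared_errors)                         # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.2f}  RMSE: {rmse:.2f}  R2: {r2:.4f}")
```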

In [36]:
# assessing the scale of price for comparison to MSE and RMSE
price_min = df['price'].min()
price_max = df['price'].max()
price_mean = df['price'].mean()
price_std = df['price'].std()

print(f"Price Range: {price_min} - {price_max}")
print(f"Mean Price: {price_mean}")
print(f"Standard Deviation of Price: {price_std}")


Price Range: 10 - 282
Mean Price: 101.53656998738965
Standard Deviation of Price: 56.72476345023221
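One way to use these statistics is to express the RMSE relative to the spread of the target. The sketch below plugs in the numbers printed above. The last line is only a rough cross-check: it uses the standard deviation of the full table, not of the test set, so it should land near the reported R2 without matching it exactly.

```python
rmse = 7.08                      # Model 1 RMSE from the evaluation output
price_min, price_max = 10, 282   # printed price range
price_std = 56.72476345023221    # printed standard deviation of price

rmse_over_range = rmse / (price_max - price_min)  # error as a share of the range
rmse_over_std = rmse / price_std                  # error as a share of the spread
approx_r2 = 1 - rmse_over_std ** 2                # rough cross-check against R2

print(f"RMSE / range: {rmse_over_range:.3f}")
print(f"RMSE / std:   {rmse_over_std:.3f}")
print(f"1 - (RMSE/std)^2: {approx_r2:.4f}")
```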


In [None]:
## feature importance

In [42]:
# source: https://machinelearningmastery.com/calculate-feature-importance-with-python/
from matplotlib import pyplot as plt


importance = model.feature_importances_

importance_df = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': importance
})

# Sort descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print the sorted feature names and importance values
print("Feature Importance:")
for index, row in importance_df.iterrows(): # iterate through each row in the dataframe
    print(f"Feature: {row['Feature']}, Score: {row['Importance']:.5f}")
    



Feature Importance:
Feature: price_per_room, Score: 0.56001
Feature: minimum_nights, Score: 0.26183
Feature: room_type_Entire home/apt, Score: 0.13545
Feature: reviews_per_month, Score: 0.01085
Feature: availability_365, Score: 0.00995
Feature: longitude, Score: 0.00636
Feature: latitude, Score: 0.00499
Feature: calculated_host_listings_count, Score: 0.00237
Feature: booking_frequency, Score: 0.00170
Feature: number_of_reviews, Score: 0.00149
Feature: neighbourhood_Uptown, Score: 0.00124
Feature: neighbourhood_Grand Boulevard, Score: 0.00056
Feature: neighbourhood_Near North Side, Score: 0.00053
Feature: neighbourhood_Lake View, Score: 0.00053
Feature: neighbourhood_Lincoln Park, Score: 0.00026
Feature: neighbourhood_Loop, Score: 0.00021
Feature: neighbourhood_West Town, Score: 0.00021
Feature: neighbourhood_Washington Park, Score: 0.00014
Feature: neighbourhood_Kenwood, Score: 0.00013
Feature: neighbourhood_Lincoln Square, Score: 0.00011
Feature: neighbourhood_Albany Park, Score: 0.00
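The two top-ranked features deserve a closer look. In the five rows printed by `df.head()`, `price_per_room` equals `price / minimum_nights` exactly. Whether this holds for the whole table is an assumption, since the code that created the column is not shown here. The sketch below checks it on those five rows; the helper name `is_reconstructed` is hypothetical.

```python
# (price, minimum_nights, price_per_room) copied from the df.head() output above
head_rows = [
    (60, 2, 30.0),
    (105, 2, 52.5),
    (60, 2, 30.0),
    (65, 4, 16.25),
    (21, 1, 21.0),
]

def is_reconstructed(price, minimum_nights, price_per_room, tol=1e-9):
    """True when price_per_room matches price / minimum_nights (hypothetical check)."""
    return abs(price / minimum_nights - price_per_room) <= tol

matches = [is_reconstructed(*row) for row in head_rows]
print(all(matches))
```

If the relationship holds across the table, then price can be recovered as `price_per_room * minimum_nights`, and the high R2 may partly reflect the model reconstructing the target from these two columns. Retraining without `price_per_room` and comparing the metrics would be one way to test this.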

### Building Model 2

- To train a new, leaner model, we keep only the highest-ranked features from the previous model's feature importance scores.

We select the following eleven features for our model:
- Feature: price_per_room
- Feature: minimum_nights
- Feature: room_type_Entire home/apt
- Feature: reviews_per_month
- Feature: availability_365
- Feature: longitude
- Feature: latitude
- Feature: calculated_host_listings_count
- Feature: booking_frequency
- Feature: number_of_reviews
- Feature: neighbourhood_Uptown
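The same list can be derived from the scores with a cutoff, which avoids typing feature names by hand. The sketch below uses the top scores copied from the output above. The threshold of 0.001 is an assumption that happens to reproduce the eleven features; the notebook does not state how the list was chosen.

```python
# Top importance scores copied from the printed output (first 13 entries).
importance_scores = {
    "price_per_room": 0.56001,
    "minimum_nights": 0.26183,
    "room_type_Entire home/apt": 0.13545,
    "reviews_per_month": 0.01085,
    "availability_365": 0.00995,
    "longitude": 0.00636,
    "latitude": 0.00499,
    "calculated_host_listings_count": 0.00237,
    "booking_frequency": 0.00170,
    "number_of_reviews": 0.00149,
    "neighbourhood_Uptown": 0.00124,
    "neighbourhood_Grand Boulevard": 0.00056,
    "neighbourhood_Near North Side": 0.00053,
}

threshold = 0.001  # assumed cutoff, not taken from the notebook

# Keep every feature whose importance reaches the cutoff.
selected = [name for name, score in importance_scores.items() if score >= threshold]

print(len(selected))
print(selected)
```

In the notebook itself, the equivalent filter on the existing frame would be `importance_df.loc[importance_df['Importance'] >= threshold, 'Feature'].tolist()`.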

In [43]:
cols = df.columns 

prediction_col = 'price'
feature_cols = ['price_per_room', 'minimum_nights', 'room_type_Entire home/apt',
                'reviews_per_month','availability_365','longitude','latitude',
                'calculated_host_listings_count','booking_frequency','number_of_reviews',
                'neighbourhood_Uptown']

X = df[feature_cols]
y = df[prediction_col]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, random_state=42, test_size=0.3)

# Train-Validation-Test Split - Second split
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, random_state=42, test_size=0.5)


print(f"Training set size: {X_train.shape[0]}")
print(f"Validation set size: {X_val.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

Training set size: 3330
Validation set size: 714
Test set size: 714


In [57]:
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [58]:
pred_X_test = model.predict(X_test)

In [59]:
# evaluate model

mse = mean_squared_error(y_test, pred_X_test)
rmse = mse ** 0.5 # measures the average difference between predicted and actual values
r2 = r2_score(y_test, pred_X_test)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared Score (R2): {r2:.4f}")

Mean Squared Error (MSE): 50.16
Root Mean Squared Error (RMSE): 7.08
R-squared Score (R2): 0.9843


- Compared with the first model, the second model's RMSE (7.08) and R2 (0.9843) are unchanged at the reported precision. Only the MSE differs, by 0.02 (50.16 versus 50.14), so dropping the low-importance features had essentially no effect on accuracy.

In [63]:
n_estimators_values = [10,20,30,40, 50,60,70,80,90, 100, 200, 500]

for n in n_estimators_values:
    
    model = RandomForestRegressor(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)  # Fit the model with training data
    
    pred_X_test = model.predict(X_test)
    
    mse = mean_squared_error(y_test, pred_X_test)
    rmse = mse ** 0.5 # measures the average difference between predicted and actual values
    r2 = r2_score(y_test, pred_X_test)
    
    print(f"n value: {n:.2f}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
    print(f"R-squared Score (R2): {r2:.4f}")

n value: 10.00
Mean Squared Error (MSE): 39.42
Root Mean Squared Error (RMSE): 6.28
R-squared Score (R2): 0.9877
n value: 20.00
Mean Squared Error (MSE): 42.49
Root Mean Squared Error (RMSE): 6.52
R-squared Score (R2): 0.9867
n value: 30.00
Mean Squared Error (MSE): 48.30
Root Mean Squared Error (RMSE): 6.95
R-squared Score (R2): 0.9849
n value: 40.00
Mean Squared Error (MSE): 47.56
Root Mean Squared Error (RMSE): 6.90
R-squared Score (R2): 0.9851
n value: 50.00
Mean Squared Error (MSE): 49.27
Root Mean Squared Error (RMSE): 7.02
R-squared Score (R2): 0.9846
n value: 60.00
Mean Squared Error (MSE): 48.16
Root Mean Squared Error (RMSE): 6.94
R-squared Score (R2): 0.9849
n value: 70.00
Mean Squared Error (MSE): 48.93
Root Mean Squared Error (RMSE): 6.99
R-squared Score (R2): 0.9847
n value: 80.00
Mean Squared Error (MSE): 48.91
Root Mean Squared Error (RMSE): 6.99
R-squared Score (R2): 0.9847
n value: 90.00
Mean Squared Error (MSE): 50.01
Root Mean Squared Error (RMSE): 7.07
R-squared Sc

In [65]:
model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_train, y_train)  # Fit the model with training data
    
pred_X_test = model.predict(X_test)
    
mse = mean_squared_error(y_test, pred_X_test)
rmse = mse ** 0.5 # measures the average difference between predicted and actual values
r2 = r2_score(y_test, pred_X_test)
    

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared Score (R2): {r2:.4f}")

Mean Squared Error (MSE): 39.42
Root Mean Squared Error (RMSE): 6.28
R-squared Score (R2): 0.9877


- After looping over several values of n_estimators to find the best-performing model, we found that an n_estimators value of 10 gives a Mean Squared Error (MSE) of 39.42, a Root Mean Squared Error (RMSE) of 6.28, and an R-squared (R2) of 0.9877. These values are better than those of our first model:
    - Mean Squared Error (MSE): 50.14
    - Root Mean Squared Error (RMSE): 7.08
    - R-squared Score (R2): 0.9843
- Therefore, we choose this model as our optimal model.
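The loop above scores every candidate on the test set, so the winning score is no longer an unbiased estimate. The notebook already creates a validation split (`X_val`, `y_val`) that goes unused. A common alternative is to pick `n_estimators` on validation data and touch the test set once at the end. The sketch below shows that pattern on synthetic data; the dataset, candidate list, and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Chicago Airbnb dataset).
X_demo, y_demo = make_regression(
    n_samples=600, n_features=11, noise=10.0, random_state=42
)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

candidates = [10, 50, 100]
val_rmse = {}
for n in candidates:
    candidate = RandomForestRegressor(n_estimators=n, random_state=42)
    candidate.fit(X_tr, y_tr)
    # Score each candidate on the validation split only.
    val_rmse[n] = mean_squared_error(y_va, candidate.predict(X_va)) ** 0.5

# Choose the candidate with the lowest validation RMSE.
best_n = min(val_rmse, key=val_rmse.get)

# Refit the chosen setting and report the test score exactly once.
final_model = RandomForestRegressor(n_estimators=best_n, random_state=42)
final_model.fit(X_tr, y_tr)
test_rmse = mean_squared_error(y_te, final_model.predict(X_te)) ** 0.5

print(f"best n_estimators: {best_n}, test RMSE: {test_rmse:.2f}")
```

Differences between forest sizes can also be noise from a single split and seed, so repeating the comparison over several seeds or using cross-validation gives a steadier picture.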

In [69]:
# save model to joblib

import joblib

joblib.dump(model, 'optimal_model.joblib')


['optimal_model.joblib']
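A saved model is only useful if it reloads and predicts the same values. The sketch below demonstrates the round trip with `joblib.load` on a small stand-in model written to a temporary directory. The toy data and the file name `demo_model.joblib` are assumptions; the notebook's own file is `optimal_model.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (hypothetical): three features with a simple linear target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])

demo_model = RandomForestRegressor(n_estimators=10, random_state=42)
demo_model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(demo_model, path)     # save, as in the notebook
    reloaded = joblib.load(path)      # load it back into memory

# The reloaded model should reproduce the original predictions.
same_predictions = np.allclose(
    demo_model.predict(X_demo), reloaded.predict(X_demo)
)
print(same_predictions)
```

When the real model is reloaded, new data must contain the same feature columns, in the same order, as the `feature_cols` list used for training.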