**Context**

Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019.

**Content**

This data file includes the information needed to learn more about hosts and geographical availability, along with the metrics needed to make predictions and draw conclusions.

**Acknowledgements**

This public dataset is part of Airbnb, and the original source can be found on this website.

**Inspiration**

- What can we learn about different hosts and areas?
- What can we learn from predictions? (e.g. locations, prices, reviews)
- Which hosts are the busiest and why?
- Is there any noticeable difference in traffic among different areas, and what could be the reason for it?

**Columns**

    id : listing ID
    name : name of the listing
    host_id : host ID
    host_name : name of the host
    neighbourhood_group : location (borough)
    neighbourhood : area
    latitude : latitude coordinate
    longitude : longitude coordinate
    room_type : listing space type
    price : price in dollars
    minimum_nights : minimum number of nights
    number_of_reviews : number of reviews
    last_review : date of the latest review
    reviews_per_month : number of reviews per month
    calculated_host_listings_count : number of listings per host
    availability_365 : number of days when the listing is available for booking

https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv

Ridge, Lasso, and ElasticNet information : https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net
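
As a rough illustration of how the three penalties differ, the sketch below fits each model on synthetic data (not the Airbnb data) in which only two of five features influence the target. The `alpha` and `l1_ratio` values and the names `X_demo`, `y_demo`, and `demo_models` are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first two features influence the target; the other three are noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

demo_models = {
    'ridge': Ridge(alpha=1.0),                           # L2 penalty
    'lasso': Lasso(alpha=0.5),                           # L1 penalty
    'elasticnet': ElasticNet(alpha=0.5, l1_ratio=0.5),   # mix of L1 and L2
}

coefs = {}
for name, est in demo_models.items():
    est.fit(X_demo, y_demo)
    coefs[name] = est.coef_
    print(name, np.round(est.coef_, 3))
```

In this setup the L1 penalty tends to set the irrelevant coefficients exactly to zero, while the L2 penalty only shrinks them toward zero.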

In [1]:
import numpy as np
import pandas as pd
import geopandas
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from scipy import stats
import statsmodels.api as sm
plt.style.use('fivethirtyeight')


from sklearn.preprocessing import RobustScaler, MinMaxScaler, MaxAbsScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import PolynomialFeatures as PF
from sklearn.pipeline import make_pipeline

def find_outlier(x):
    """Return the values of x that fall outside the 1.5 * IQR fences."""
    q1 = x.quantile(0.25)
    q3 = x.quantile(0.75)
    iqr = q3 - q1
    bttm_threshold = q1 - (iqr * 1.5)
    top_threshold = q3 + (iqr * 1.5)
    # Keep only the values below the lower fence or above the upper fence
    outlier = x[(x < bttm_threshold) | (x > top_threshold)]
    return np.array(outlier)
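
To make the IQR rule concrete, here is a small self-contained sketch on a made-up price list. The function name `iqr_outliers` and the series `toy_prices` are hypothetical and exist only for this example; the rule is the same 1.5 * IQR fence used by `find_outlier`.

```python
import numpy as np
import pandas as pd

def iqr_outliers(s):
    """Return the values of s outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    return s[mask].to_numpy()

# Q1 = 65 and Q3 = 95, so the fences are 20 and 140.
toy_prices = pd.Series([50, 60, 70, 80, 90, 100, 1000])
toy_outliers = iqr_outliers(toy_prices)
print(toy_outliers)
```

Only the value 1000 lies beyond the upper fence of 140, so it is the single outlier reported.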

In [2]:
df = pd.read_csv('AB_NYC_2019.csv')
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


# Full Dataset

## Preparing Dataset 

In [3]:
df1 = df.copy()
df1 = df1.drop(columns=['name','id','host_name','last_review'])

In [4]:
df1['neighbourhood_group_enc'] = LabelEncoder().fit_transform(df['neighbourhood_group'])
df1['neighbourhood_enc'] = LabelEncoder().fit_transform(df['neighbourhood'])
df1['room_type_enc'] = LabelEncoder().fit_transform(df['room_type'])
df1 = df1.drop(columns=['neighbourhood', 'neighbourhood_group', 'room_type'])
df1

Unnamed: 0,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,neighbourhood_group_enc,neighbourhood_enc,room_type_enc
0,2787,40.64749,-73.97237,149,1,9,0.21,6,365,1,108,1
1,2845,40.75362,-73.98377,225,1,45,0.38,2,355,2,127,0
2,4632,40.80902,-73.94190,150,3,0,,1,365,2,94,1
3,4869,40.68514,-73.95976,89,1,270,4.64,1,194,1,41,0
4,7192,40.79851,-73.94399,80,10,9,0.10,1,0,2,61,0
...,...,...,...,...,...,...,...,...,...,...,...,...
48890,8232441,40.67853,-73.94995,70,2,0,,2,9,1,13,1
48891,6570630,40.70184,-73.93317,40,4,0,,2,36,1,28,1
48892,23492952,40.81475,-73.94867,115,10,0,,1,27,2,94,0
48893,30985759,40.75751,-73.99112,55,1,0,,6,2,2,95,2


In [5]:
df1 = df1.fillna(0)

In [6]:
df1.isna().sum()

host_id                           0
latitude                          0
longitude                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
neighbourhood_group_enc           0
neighbourhood_enc                 0
room_type_enc                     0
dtype: int64

In [7]:
df1


Unnamed: 0,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,neighbourhood_group_enc,neighbourhood_enc,room_type_enc
0,2787,40.64749,-73.97237,149,1,9,0.21,6,365,1,108,1
1,2845,40.75362,-73.98377,225,1,45,0.38,2,355,2,127,0
2,4632,40.80902,-73.94190,150,3,0,0.00,1,365,2,94,1
3,4869,40.68514,-73.95976,89,1,270,4.64,1,194,1,41,0
4,7192,40.79851,-73.94399,80,10,9,0.10,1,0,2,61,0
...,...,...,...,...,...,...,...,...,...,...,...,...
48890,8232441,40.67853,-73.94995,70,2,0,0.00,2,9,1,13,1
48891,6570630,40.70184,-73.93317,40,4,0,0.00,2,36,1,28,1
48892,23492952,40.81475,-73.94867,115,10,0,0.00,1,27,2,94,0
48893,30985759,40.75751,-73.99112,55,1,0,0.00,6,2,2,95,2


---

## Split Dataset

In [8]:
x1 = df1.drop(columns=['price'])
y1 = df1['price']

In [9]:
x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1, test_size=0.2, random_state=0)

## Testing Model 

### Linear Regression, Lasso, Ridge, ElasticNet

In [10]:
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for i in linear:
    model = i
    model.fit(x1_train, y1_train)
    y1_pred = model.predict(x1_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y1_test, y1_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y1_test,y1_pred), 2)}')
    print(f'R2 Score = {round( r2_score(y1_test,y1_pred) * 100 , 2)}')
    print('\n')

Model	 = LinearRegression()
RMSE	 = 222.08
MAE	 = 75.06
R2 Score = 10.39


Model	 = Lasso()
RMSE	 = 223.64
MAE	 = 76.91
R2 Score = 9.13


Model	 = Ridge()
RMSE	 = 222.09
MAE	 = 75.03
R2 Score = 10.38


Model	 = ElasticNet()
RMSE	 = 227.49
MAE	 = 83.04
R2 Score = 5.97
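
For reference, the three printed metrics can be reproduced by hand. The sketch below uses four made-up values (not the notebook's predictions) and checks the manual formulas against the scikit-learn functions. Note that the notebook prints R2 multiplied by 100, and that R2 becomes negative whenever a model predicts worse than the constant mean of the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up targets and predictions for illustration only.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

err = y_true - y_hat
rmse = np.sqrt(np.mean(err ** 2))       # root mean squared error
mae = np.mean(np.abs(err))              # mean absolute error
# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(rmse, mae, round(r2 * 100, 2))
```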






### Polynomial 2

In [11]:
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for i in linear:
    model = i
    poly_reg = make_pipeline(
        PF(2, include_bias=False),
        i
    )

    poly_reg.fit(x1_train, y1_train)
    # predict
    y_pred = poly_reg.predict(x1_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y1_test, y_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y1_test,y_pred), 2)}')
    print(f'R2 Score = {round( r2_score(y1_test,y_pred) * 100 , 2)}')
    print('\n')

Model	 = LinearRegression()
RMSE	 = 261.94
MAE	 = 131.48
R2 Score = -24.67






Model	 = Lasso()
RMSE	 = 219.95
MAE	 = 73.08
R2 Score = 12.1


Model	 = Ridge()
RMSE	 = 219.51
MAE	 = 72.35
R2 Score = 12.46


Model	 = ElasticNet()
RMSE	 = 220.45
MAE	 = 73.99
R2 Score = 11.7
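
Columns such as `host_id` hold values in the tens of millions, so their squares and cross terms become extremely large after polynomial expansion. This may contribute to the poor unregularized fit above. A common remedy, which this notebook does not use, is to standardize the features before expanding them. The sketch below shows the effect on synthetic data; `StandardScaler` and the names `X_syn`, `raw_poly`, and `scaled_poly` are assumptions introduced for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales:
# an ID-like column next to a small count-like column.
X_syn = np.column_stack([
    rng.uniform(1e3, 3e7, size=500),
    rng.uniform(1, 30, size=500),
])

# Degree-2 expansion of the raw features.
raw_poly = PolynomialFeatures(2, include_bias=False).fit_transform(X_syn)

# Same expansion, but standardizing each feature first.
scaled_poly = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(2, include_bias=False),
).fit_transform(X_syn)

print('max |value| without scaling:', np.abs(raw_poly).max())
print('max |value| with scaling   :', np.abs(scaled_poly).max())
```

In a full model the same idea would mean placing the scaler as the first step of the `make_pipeline` call, ahead of `PF` and the regressor.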






### Polynomial 3

In [12]:
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for i in linear:
    model = i
    poly_reg = make_pipeline(
        PF(3, include_bias=False),
        i
    )

    poly_reg.fit(x1_train, y1_train)
    # predict
    y_pred = poly_reg.predict(x1_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y1_test, y_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y1_test,y_pred), 2)}')
    print(f'R2 Score = {round( r2_score(y1_test,y_pred) * 100 , 2)}')
    print('\n')

Model	 = LinearRegression()
RMSE	 = 252.4
MAE	 = 102.9
R2 Score = -15.75






Model	 = Lasso()
RMSE	 = 219.08
MAE	 = 72.51
R2 Score = 12.8






Model	 = Ridge()
RMSE	 = 217.86
MAE	 = 71.45
R2 Score = 13.76


Model	 = ElasticNet()
RMSE	 = 219.37
MAE	 = 72.97
R2 Score = 12.57






### OLS for LinearModel

In [13]:
X_stat = df1.drop(columns=['price']).values
y_stat = df1['price'].values

X_stat = sm.add_constant(X_stat) # adding a constant

model = sm.OLS(y_stat, X_stat).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.091
Model:                            OLS   Adj. R-squared:                  0.091
Method:                 Least Squares   F-statistic:                     443.7
Date:                Thu, 06 Aug 2020   Prob (F-statistic):               0.00
Time:                        08:41:13   Log-Likelihood:            -3.3506e+05
No. Observations:               48895   AIC:                         6.701e+05
Df Residuals:                   48883   BIC:                         6.702e+05
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -5.102e+04   2002.162    -25.483      0.0

---

# Try Dropping More Columns

In [14]:
df2 = df.copy()
df2.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


## Preparing Dataset

In [15]:
df2.fillna(0, inplace=True)
df2.drop(columns=['id', 'name', 'host_id', 'host_name', 'latitude', 'longitude', 'last_review'], inplace=True)
df2['neighbourhood_group_enc'] = LabelEncoder().fit_transform(df2['neighbourhood_group'])
df2['neighbourhood_enc'] = LabelEncoder().fit_transform(df2['neighbourhood'])
df2['room_type_enc'] = LabelEncoder().fit_transform(df2['room_type'])
df2.drop(columns=['neighbourhood', 'neighbourhood_group', 'room_type'], inplace=True)
df2

Unnamed: 0,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,neighbourhood_group_enc,neighbourhood_enc,room_type_enc
0,149,1,9,0.21,6,365,1,108,1
1,225,1,45,0.38,2,355,2,127,0
2,150,3,0,0.00,1,365,2,94,1
3,89,1,270,4.64,1,194,1,41,0
4,80,10,9,0.10,1,0,2,61,0
...,...,...,...,...,...,...,...,...,...
48890,70,2,0,0.00,2,9,1,13,1
48891,40,4,0,0.00,2,36,1,28,1
48892,115,10,0,0.00,1,27,2,94,0
48893,55,1,0,0.00,6,2,2,95,2


## Split Dataset

In [16]:
x2 = df2.drop(columns=['price'])
y2 = df2['price']

In [17]:
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size=0.2, random_state=0)

## Testing Model

### Linear Regression, Lasso, Ridge, ElasticNet

In [18]:
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for i in linear:
    model = i
    model.fit(x2_train, y2_train)
    y2_pred = model.predict(x2_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y2_test, y2_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y2_test,y2_pred), 2)}')
    print(f'R2 Score = {round( r2_score(y2_test,y2_pred) * 100 , 2)}')
    print('\n')

Model	 = LinearRegression()
RMSE	 = 224.07
MAE	 = 77.84
R2 Score = 8.78


Model	 = Lasso()
RMSE	 = 224.1
MAE	 = 77.75
R2 Score = 8.75


Model	 = Ridge()
RMSE	 = 224.07
MAE	 = 77.84
R2 Score = 8.78


Model	 = ElasticNet()
RMSE	 = 227.5
MAE	 = 83.07
R2 Score = 5.96




### Polynomial 2

In [19]:
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for i in linear:
    model = i
    poly_reg = make_pipeline(
        PF(2, include_bias=False),
        i
    )

    poly_reg.fit(x2_train, y2_train)
    # predict
    y_pred = poly_reg.predict(x2_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y2_test, y_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y2_test,y_pred), 2)}')
    print(f'R2 Score = {round( r2_score(y2_test,y_pred) * 100 , 2)}')
    print('\n')

Model	 = LinearRegression()
RMSE	 = 220.99
MAE	 = 74.04
R2 Score = 11.27






Model	 = Lasso()
RMSE	 = 221.32
MAE	 = 74.41
R2 Score = 11.0


Model	 = Ridge()
RMSE	 = 220.99
MAE	 = 74.04
R2 Score = 11.27


Model	 = ElasticNet()
RMSE	 = 223.09
MAE	 = 78.41
R2 Score = 9.58






### Polynomial 3

In [20]:
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for i in linear:
    model = i
    poly_reg = make_pipeline(
        PF(3, include_bias=False),
        i
    )

    poly_reg.fit(x2_train, y2_train)
    # predict
    y_pred = poly_reg.predict(x2_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y2_test, y_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y2_test,y_pred), 2)}')
    print(f'R2 Score = {round( r2_score(y2_test,y_pred) * 100 , 2)}')
    print('\n')

Model	 = LinearRegression()
RMSE	 = 219.79
MAE	 = 73.3
R2 Score = 12.23






Model	 = Lasso()
RMSE	 = 220.37
MAE	 = 74.03
R2 Score = 11.77


Model	 = Ridge()
RMSE	 = 219.79
MAE	 = 73.29
R2 Score = 12.23


Model	 = ElasticNet()
RMSE	 = 220.56
MAE	 = 74.76
R2 Score = 11.61






### OLS for LinearModel

In [21]:
X_stat = df2.drop(columns=['price']).values
y_stat = df2['price'].values

X_stat = sm.add_constant(X_stat) # adding a constant

model = sm.OLS(y_stat, X_stat).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.077
Model:                            OLS   Adj. R-squared:                  0.077
Method:                 Least Squares   F-statistic:                     512.7
Date:                Thu, 06 Aug 2020   Prob (F-statistic):               0.00
Time:                        08:41:33   Log-Likelihood:            -3.3542e+05
No. Observations:               48895   AIC:                         6.709e+05
Df Residuals:                   48886   BIC:                         6.709e+05
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        166.4144      3.289     50.600      0.0

---

# Try Dropping More Columns and Using Dummy Variables for `room_type`, `neighbourhood`, and `neighbourhood_group`

## Preparing Dataset

In [22]:
df3 = df.copy()
df3.fillna(0, inplace=True)
df3.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,0,0.0,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [23]:
df3.drop(columns=['id', 'name', 'host_id', 'host_name','latitude', 'longitude', 'last_review'], inplace=True)
df3.head()

Unnamed: 0,neighbourhood_group,neighbourhood,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,Kensington,Private room,149,1,9,0.21,6,365
1,Manhattan,Midtown,Entire home/apt,225,1,45,0.38,2,355
2,Manhattan,Harlem,Private room,150,3,0,0.0,1,365
3,Brooklyn,Clinton Hill,Entire home/apt,89,1,270,4.64,1,194
4,Manhattan,East Harlem,Entire home/apt,80,10,9,0.1,1,0


In [24]:
df3 = pd.get_dummies(df3, prefix=['room_type'], columns=['room_type'])
df3 = pd.get_dummies(df3, prefix=['neighbourhood_group'], columns=['neighbourhood_group'])
df3 = pd.get_dummies(df3, prefix=['neighbourhood'], columns=['neighbourhood'])
df3

Unnamed: 0,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,neighbourhood_group_Bronx,...,neighbourhood_Westerleigh,neighbourhood_Whitestone,neighbourhood_Williamsbridge,neighbourhood_Williamsburg,neighbourhood_Willowbrook,neighbourhood_Windsor Terrace,neighbourhood_Woodhaven,neighbourhood_Woodlawn,neighbourhood_Woodrow,neighbourhood_Woodside
0,149,1,9,0.21,6,365,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,225,1,45,0.38,2,355,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,150,3,0,0.00,1,365,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,89,1,270,4.64,1,194,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,80,10,9,0.10,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,70,2,0,0.00,2,9,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
48891,40,4,0,0.00,2,36,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
48892,115,10,0,0.00,1,27,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48893,55,1,0,0.00,6,2,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
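
The difference between the label encoding used in the earlier sections and the dummy variables used here can be seen on a tiny made-up frame. `LabelEncoder` assigns integer codes in alphabetical order, which implies an ordering between categories that a linear model will treat as numeric. `get_dummies` instead creates one indicator column per category. The frame `rooms` is hypothetical, and `dtype=int` is set only so that the indicators print as 0/1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

rooms = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt', 'Shared room', 'Private room']
})

# One integer per category, assigned in sorted (alphabetical) order.
label_codes = LabelEncoder().fit_transform(rooms['room_type'])

# One 0/1 indicator column per category.
dummies = pd.get_dummies(rooms, prefix=['room_type'], columns=['room_type'], dtype=int)

print(label_codes)
print(dummies)
```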


## Split Dataset

In [25]:
x3 = df3.drop(columns=['price'])
y3 = df3['price']

In [26]:
x3_train, x3_test, y3_train, y3_test = train_test_split(x3, y3, test_size=0.2, random_state=0)

## Testing Model

### Linear Regression, Lasso, Ridge, ElasticNet

In [27]:
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for i in linear:
    model = i
    model.fit(x3_train, y3_train)
    y3_pred = model.predict(x3_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y3_test, y3_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y3_test,y3_pred), 2)}')
    print(f'R2 Score = {round( r2_score(y3_test,y3_pred) * 100 , 2)}')
    print('\n')

Model	 = LinearRegression()
RMSE	 = 219.29
MAE	 = 71.41
R2 Score = 12.63


Model	 = Lasso()
RMSE	 = 221.11
MAE	 = 72.93
R2 Score = 11.17


Model	 = Ridge()
RMSE	 = 219.25
MAE	 = 71.37
R2 Score = 12.66


Model	 = ElasticNet()
RMSE	 = 225.3
MAE	 = 78.76
R2 Score = 7.77




### Polynomial 2 (Too Heavy to Run)

In [28]:
# model = Ridge()
# poly_reg = make_pipeline(
#     PF(2, include_bias=False),
#     model
# )

# poly_reg.fit(x3_train, y3_train)
# # predict
# y_pred = poly_reg.predict(x3_test)
# print(f'Model\t = {model}')
# print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y3_test, y_pred)), 2)}')
# print(f'MAE\t = {round(mean_absolute_error(y3_test,y_pred), 2)}')
# print(f'R2 Score = {round( r2_score(y3_test,y_pred) * 100 , 2)}')
# print('\n')

### Polynomial 3 (Too Heavy to Run)

### OLS for LinearModel

In [29]:
X_stat = df3.drop(columns=['price']).values
y_stat = df3['price'].values

X_stat = sm.add_constant(X_stat) # adding a constant

model = sm.OLS(y_stat, X_stat).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.117
Model:                            OLS   Adj. R-squared:                  0.113
Method:                 Least Squares   F-statistic:                     28.34
Date:                Thu, 06 Aug 2020   Prob (F-statistic):               0.00
Time:                        08:41:36   Log-Likelihood:            -3.3435e+05
No. Observations:               48895   AIC:                         6.692e+05
Df Residuals:                   48667   BIC:                         6.712e+05
Df Model:                         227                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         60.8533      3.494     17.416      0.0

---

# Try Handling Outliers in the `price` Column (Currently Best)

## Preparing Dataset

In [30]:
df4 = df.copy()
df4.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [31]:
df4.drop(columns=['id', 'name', 'host_id', 'host_name','last_review'], inplace=True)
df4.head()

Unnamed: 0,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,6,365
1,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,2,355
2,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,1,365
3,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,1,194
4,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,0.1,1,0


In [32]:
find_outlier(df4['price']).min()

335

In [33]:
df4 = df4[df4['price'] < 335].copy()  # copy so the in-place fillna acts on an independent frame
df4.fillna(0, inplace=True)
df4  # about 3,000 rows removed

Unnamed: 0,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,6,365
1,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,2,355
2,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,0.00,1,365
3,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,1,194
4,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,0.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...
48890,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,0.00,2,9
48891,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,0.00,2,36
48892,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,0.00,1,27
48893,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,0.00,6,2


In [34]:
df4['neighbourhood_group_enc'] = LabelEncoder().fit_transform(df4['neighbourhood_group'])
df4['neighbourhood_enc'] = LabelEncoder().fit_transform(df4['neighbourhood'])
df4['room_type_enc'] = LabelEncoder().fit_transform(df4['room_type'])
df4 = df4.drop(columns=['neighbourhood', 'neighbourhood_group', 'room_type'])
df4

Unnamed: 0,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,neighbourhood_group_enc,neighbourhood_enc,room_type_enc
0,40.64749,-73.97237,149,1,9,0.21,6,365,1,107,1
1,40.75362,-73.98377,225,1,45,0.38,2,355,2,126,0
2,40.80902,-73.94190,150,3,0,0.00,1,365,2,93,1
3,40.68514,-73.95976,89,1,270,4.64,1,194,1,41,0
4,40.79851,-73.94399,80,10,9,0.10,1,0,2,61,0
...,...,...,...,...,...,...,...,...,...,...,...
48890,40.67853,-73.94995,70,2,0,0.00,2,9,1,13,1
48891,40.70184,-73.93317,40,4,0,0.00,2,36,1,28,1
48892,40.81475,-73.94867,115,10,0,0.00,1,27,2,93,0
48893,40.75751,-73.99112,55,1,0,0.00,6,2,2,94,2


## Split Dataset

In [35]:
x4 = df4.drop(columns=['price'])
y4 = df4['price']

In [36]:
x4_train, x4_test, y4_train, y4_test = train_test_split(x4, y4, test_size=0.2, random_state=0)

## Testing Model

### Linear Regression, Lasso, Ridge, ElasticNet

In [37]:
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for i in linear:
    model = i
    model.fit(x4_train, y4_train)
    y4_pred = model.predict(x4_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y4_test, y4_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y4_test,y4_pred), 2)}')
    print(f'R2 Score = {round( r2_score(y4_test,y4_pred) * 100 , 2)}')
    print('\n')

Model	 = LinearRegression()
RMSE	 = 50.65
MAE	 = 37.98
R2 Score = 44.86


Model	 = Lasso()
RMSE	 = 52.81
MAE	 = 40.37
R2 Score = 40.08


Model	 = Ridge()
RMSE	 = 50.65
MAE	 = 37.98
R2 Score = 44.87


Model	 = ElasticNet()
RMSE	 = 58.88
MAE	 = 46.49
R2 Score = 25.5




### Polynomial 2

In [38]:
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for i in linear:
    model = i
    poly_reg = make_pipeline(
        PF(2, include_bias=False),
        i
    )

    poly_reg.fit(x4_train, y4_train)
    # predict
    y_pred = poly_reg.predict(x4_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y4_test, y_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y4_test,y_pred), 2)}')
    print(f'R2 Score = {round( r2_score(y4_test,y_pred) * 100 , 2)}')
    print('\n')

Model	 = LinearRegression()
RMSE	 = 47.37
MAE	 = 34.66
R2 Score = 51.78






Model	 = Lasso()
RMSE	 = 49.33
MAE	 = 36.45
R2 Score = 47.69


Model	 = Ridge()
RMSE	 = 48.48
MAE	 = 35.65
R2 Score = 49.5


Model	 = ElasticNet()
RMSE	 = 49.86
MAE	 = 37.11
R2 Score = 46.58






### Polynomial 3

In [39]:
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for i in linear:
    model = i
    poly_reg = make_pipeline(
        PF(3, include_bias=False),
        i
    )

    poly_reg.fit(x4_train, y4_train)
    # predict
    y_pred = poly_reg.predict(x4_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y4_test, y_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y4_test,y_pred), 2)}')
    print(f'R2 Score = {round( r2_score(y4_test,y_pred) * 100 , 2)}')
    print('\n')

Model	 = LinearRegression()
RMSE	 = 46.3
MAE	 = 33.27
R2 Score = 53.94






Model	 = Lasso()
RMSE	 = 47.78
MAE	 = 35.09
R2 Score = 50.94






Model	 = Ridge()
RMSE	 = 46.63
MAE	 = 33.61
R2 Score = 53.27


Model	 = ElasticNet()
RMSE	 = 47.92
MAE	 = 35.22
R2 Score = 50.66






### OLS for LinearModel

In [40]:
X_stat = df4.drop(columns=['price']).values
y_stat = df4['price'].values

X_stat = sm.add_constant(X_stat) # adding a constant

model = sm.OLS(y_stat, X_stat).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.448
Model:                            OLS   Adj. R-squared:                  0.448
Method:                 Least Squares   F-statistic:                     3724.
Date:                Thu, 06 Aug 2020   Prob (F-statistic):               0.00
Time:                        08:42:02   Log-Likelihood:            -2.4540e+05
No. Observations:               45923   AIC:                         4.908e+05
Df Residuals:                   45912   BIC:                         4.909e+05
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -2.752e+04    452.577    -60.816      0.0

---

# Try Dropping More Columns, Using Dummy Variables for `room_type`, `neighbourhood`, and `neighbourhood_group`, and Handling Outliers in the `price` Column

## Preparing Dataset

In [67]:
df5 = df.copy()
df5.drop(columns=['id', 'name', 'host_id', 'host_name','latitude', 'longitude', 'last_review'], inplace=True)
df5.fillna(0, inplace=True)
df5 = df5[df5['price'] < 335]
df5 = pd.get_dummies(df5, prefix=['room_type'], columns=['room_type'])
df5 = pd.get_dummies(df5, prefix=['neighbourhood_group'], columns=['neighbourhood_group'])
df5 = pd.get_dummies(df5, prefix=['neighbourhood'], columns=['neighbourhood'])
df5

Unnamed: 0,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,neighbourhood_group_Bronx,...,neighbourhood_Westchester Square,neighbourhood_Westerleigh,neighbourhood_Whitestone,neighbourhood_Williamsbridge,neighbourhood_Williamsburg,neighbourhood_Willowbrook,neighbourhood_Windsor Terrace,neighbourhood_Woodhaven,neighbourhood_Woodlawn,neighbourhood_Woodside
0,149,1,9,0.21,6,365,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,225,1,45,0.38,2,355,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,150,3,0,0.00,1,365,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,89,1,270,4.64,1,194,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,80,10,9,0.10,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,70,2,0,0.00,2,9,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
48891,40,4,0,0.00,2,36,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
48892,115,10,0,0.00,1,27,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48893,55,1,0,0.00,6,2,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


## Split Dataset

In [68]:
x5 = df5.drop(columns=['price'])
y5 = df5['price']

x5_train, x5_test, y5_train, y5_test = train_test_split(x5, y5, test_size=0.2, random_state=0)

## Testing Model

### Linear Regression, Lasso, Ridge, ElasticNet

In [69]:
# Fit each linear model with default hyperparameters and report test-set metrics
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for model in linear:
    model.fit(x5_train, y5_train)
    y5_pred = model.predict(x5_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y5_test, y5_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y5_test, y5_pred), 2)}')
    print(f'R2 Score = {round(r2_score(y5_test, y5_pred) * 100, 2)}')  # R2 reported as a percentage
    print('\n')

Model	 = LinearRegression()
RMSE	 = 19612316.8
MAE	 = 346167.79
R2 Score = -8266149618402.91


Model	 = Lasso()
RMSE	 = 49.91
MAE	 = 37.17
R2 Score = 46.47


Model	 = Ridge()
RMSE	 = 47.23
MAE	 = 34.64
R2 Score = 52.06


Model	 = ElasticNet()
RMSE	 = 55.72
MAE	 = 43.31
R2 Score = 33.29
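
The enormous RMSE and negative R2 for `LinearRegression()`, next to reasonable Lasso and Ridge scores, are consistent with perfectly collinear dummy columns. Each full set of one-hot columns sums to 1, which duplicates the intercept. This is a likely explanation, not something verified in this notebook. The sketch below uses a made-up toy frame to show the rank deficiency and how `drop_first=True` in `pd.get_dummies` removes it; the helper `design_rank` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: every room_type / neighbourhood_group pair appears once
toy = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt', 'Shared room',
                  'Private room', 'Entire home/apt', 'Shared room'],
    'neighbourhood_group': ['Bronx', 'Queens', 'Bronx', 'Queens', 'Bronx', 'Queens'],
})

cols = ['room_type', 'neighbourhood_group']
full = pd.get_dummies(toy, columns=cols, dtype=float)                      # all levels kept
reduced = pd.get_dummies(toy, columns=cols, drop_first=True, dtype=float)  # one level dropped per column

def design_rank(frame):
    """Return (rank, number of columns) of the design matrix including an intercept."""
    X = np.column_stack([np.ones(len(frame)), frame.to_numpy()])
    return int(np.linalg.matrix_rank(X)), X.shape[1]

full_rank, full_cols = design_rank(full)
reduced_rank, reduced_cols = design_rank(reduced)

print(f'full dummies:    rank {full_rank} of {full_cols} columns')
print(f'drop_first=True: rank {reduced_rank} of {reduced_cols} columns')
```

On the real data `drop_first=True` alone may not be enough, because every `neighbourhood` belongs to exactly one `neighbourhood_group`, so the group dummies can still be reproduced by summing neighbourhood dummies. Regularized models such as Ridge tolerate this, which matches the results above.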




### OLS for Linear Model

In [66]:
X_stat = df5.drop(columns=['price']).values
y_stat = df5['price'].values

X_stat = sm.add_constant(X_stat) # adding a constant

model = sm.OLS(y_stat, X_stat).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.531
Model:                            OLS   Adj. R-squared:                  0.529
Method:                 Least Squares   F-statistic:                     228.1
Date:                Thu, 06 Aug 2020   Prob (F-statistic):               0.00
Time:                        08:58:43   Log-Likelihood:            -2.4164e+05
No. Observations:               45923   AIC:                         4.837e+05
Df Residuals:                   45695   BIC:                         4.857e+05
Df Model:                         227                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -1.93e+04   1816.581    -10.622      0.0

---

# Handle Outliers in the `price` Column with the Full Dataset and Dummy-Encoded Categorical Features (Continued in ML 2)

## Preparing Dataset

In [45]:
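df6 = df.copy()
df6.fillna(0, inplace=True)
# Original note: host_id is no longer dropped because Ridge already handles multicollinearity.
# As written, host_id is still in the drop list; latitude and longitude are the columns kept this time.
df6.drop(columns=['id', 'name', 'host_id', 'host_name', 'last_review'], inplace=True)
# Keep only listings priced below 335 to limit the effect of price outliers
df6 = df6[df6['price'] < 335]
# One-hot encode the three categorical columns
df6 = pd.get_dummies(df6, prefix=['room_type'], columns=['room_type'])
df6 = pd.get_dummies(df6, prefix=['neighbourhood_group'], columns=['neighbourhood_group'])
df6 = pd.get_dummies(df6, prefix=['neighbourhood'], columns=['neighbourhood'])
df6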

Unnamed: 0,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,room_type_Entire home/apt,room_type_Private room,...,neighbourhood_Westchester Square,neighbourhood_Westerleigh,neighbourhood_Whitestone,neighbourhood_Williamsbridge,neighbourhood_Williamsburg,neighbourhood_Willowbrook,neighbourhood_Windsor Terrace,neighbourhood_Woodhaven,neighbourhood_Woodlawn,neighbourhood_Woodside
0,40.64749,-73.97237,149,1,9,0.21,6,365,0,1,...,0,0,0,0,0,0,0,0,0,0
1,40.75362,-73.98377,225,1,45,0.38,2,355,1,0,...,0,0,0,0,0,0,0,0,0,0
2,40.80902,-73.94190,150,3,0,0.00,1,365,0,1,...,0,0,0,0,0,0,0,0,0,0
3,40.68514,-73.95976,89,1,270,4.64,1,194,1,0,...,0,0,0,0,0,0,0,0,0,0
4,40.79851,-73.94399,80,10,9,0.10,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,40.67853,-73.94995,70,2,0,0.00,2,9,0,1,...,0,0,0,0,0,0,0,0,0,0
48891,40.70184,-73.93317,40,4,0,0.00,2,36,0,1,...,0,0,0,0,0,0,0,0,0,0
48892,40.81475,-73.94867,115,10,0,0.00,1,27,1,0,...,0,0,0,0,0,0,0,0,0,0
48893,40.75751,-73.99112,55,1,0,0.00,6,2,0,0,...,0,0,0,0,0,0,0,0,0,0


## Split Dataset

In [46]:
x6 = df6.drop(columns=['price'])
y6 = df6['price']

x6_train, x6_test, y6_train, y6_test = train_test_split(x6, y6, test_size=0.2, random_state=0)

## Testing Model

### Linear Regression, Lasso, Ridge, ElasticNet

In [47]:
# Fit each linear model with default hyperparameters and report test-set metrics
linear = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
for model in linear:
    model.fit(x6_train, y6_train)
    y6_pred = model.predict(x6_test)
    print(f'Model\t = {model}')
    print(f'RMSE\t = {round(np.sqrt(mean_squared_error(y6_test, y6_pred)), 2)}')
    print(f'MAE\t = {round(mean_absolute_error(y6_test, y6_pred), 2)}')
    print(f'R2 Score = {round(r2_score(y6_test, y6_pred) * 100, 2)}')  # R2 reported as a percentage
    print('\n')

Model	 = LinearRegression()
RMSE	 = 409413474.07
MAE	 = 6639140.0
R2 Score = -3602215784320768.0


Model	 = Lasso()
RMSE	 = 49.91
MAE	 = 37.17
R2 Score = 46.47


Model	 = Ridge()
RMSE	 = 47.14
MAE	 = 34.54
R2 Score = 52.24


Model	 = ElasticNet()
RMSE	 = 55.71
MAE	 = 43.31
R2 Score = 33.3
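
For reading the numbers above, it may help to see the three metrics written out by hand. The sketch below is a plain NumPy version of RMSE, MAE, and R2 (the helper `regression_metrics` and the toy values are made up for illustration). It also shows why R2 has no lower bound: a single wildly wrong prediction can push it far below zero, as in the `LinearRegression()` rows.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Toy prices, invented for illustration
y_true = [100, 150, 200, 250]
good_pred = [110, 140, 210, 240]      # every prediction is off by 10
bad_pred = [100, 150, 200, 100000]    # one prediction explodes

good_rmse, good_mae, good_r2 = regression_metrics(y_true, good_pred)
bad_rmse, bad_mae, bad_r2 = regression_metrics(y_true, bad_pred)

print(f'good: RMSE={good_rmse:.2f} MAE={good_mae:.2f} R2={good_r2 * 100:.2f}')
print(f'bad : RMSE={bad_rmse:.2f} MAE={bad_mae:.2f} R2={bad_r2 * 100:.2f}')
```

The multiplication by 100 mirrors the notebook's convention of printing R2 as a percentage.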




### OLS for Linear Model

In [48]:
X_stat = df6.drop(columns=['price']).values
y_stat = df6['price'].values

X_stat = sm.add_constant(X_stat) # adding a constant

model = sm.OLS(y_stat, X_stat).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.531
Model:                            OLS   Adj. R-squared:                  0.529
Method:                 Least Squares   F-statistic:                     228.1
Date:                Thu, 06 Aug 2020   Prob (F-statistic):               0.00
Time:                        08:42:07   Log-Likelihood:            -2.4164e+05
No. Observations:               45923   AIC:                         4.837e+05
Df Residuals:                   45695   BIC:                         4.857e+05
Df Model:                         227                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -1.93e+04   1816.581    -10.622      0.0

---