# CS541 Applied Machine Learning Fall 2024 - Class Challenge

In this class challenge assignment, you will be building a machine learning model to predict the price of an Airbnb rental, given the dataset we have provided. Total points: **100 pts**

To submit your solution, upload a Python (.py) file named challenge.py to Gradescope.
Final Submission due Dec 10, 2024 (Initial submission due Nov 26)

There will be a Leaderboard for the challenge that can be seen by all students. You can give yourself a nickname on the leaderboard if you'd like to keep your score anonymous.

*If you choose a nickname, you are not allowed to use FULL CREDIT PERFORMANCE, 60 POINT SCORE BASELINE, or RANDOM BASELINE as they all are used by course staff (more on that below)*

To encourage you to get started early on the challenge, you are required to submit an initial submission due on **Nov 26, 11:59 pm**. For this submission, your model needs to be better than the linear model with random weights that we provided. The final submission will be due on **Dec 10, 11:59 pm**.


## Problem and dataset description
Pricing a rental property such as an apartment or house on Airbnb is a difficult challenge. A model that accurately predicts the price can potentially help renters and hosts on the platform make better decisions. In this assignment, your task is to train a model that takes features of a listing as input and predicts the price.

We have provided you with a dataset collected from the Airbnb website for New York, which has a total of 29,985 entries, each with 764 features. You may use the provided data as you wish in development. We will train your submitted code on the same provided dataset, and will evaluate it on 2 other test sets (one public, and one hidden during the challenge).

We have already done some minimal data cleaning for you, such as converting text fields into categorical values and getting rid of the NaN values. To convert text fields into categorical values, we used different strategies depending on the field. For example, sentiment analysis was applied to convert user reviews to numerical values ('comments' column). We added a separate column for each state name, with '1' indicating the location of the property. Column names are included in the data files and are mostly descriptive.

Also in this data cleaning step, the price value that we are trying to predict was computed by taking the log of the original price. Hence, on the training set, the output price ranges from about 2.302 to about 9.21.
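
Since the target is a log price, a prediction can be mapped back to a price by exponentiating it. The sketch below assumes the natural logarithm was used, which is consistent with the stated range (2.302 is close to ln 10 and 9.21 is close to ln 10000) but is not confirmed by the data files. The helper name `log_price_to_price` is hypothetical.

```python
import numpy as np


def log_price_to_price(log_price):
    """Invert the log transform, ASSUMING the natural logarithm was used."""
    return np.exp(np.asarray(log_price, dtype=float))


# The stated training-set bounds map back to roughly 10 and 10000.
bounds = log_price_to_price([2.302, 9.21])
print(bounds)
```

Note that the leaderboard MSE is computed on the log values, so this conversion is only useful for interpreting predictions, not for scoring.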


## Datasets and Codebase

Please download the zip file from the link posted on Piazza/Resources.
In this notebook, we implemented a linear regression model with random weights (**attached at the end**). The zip file contains this notebook and 2 CSV files for features and labels:

    challenge.ipynb (This file: you need to add your code in here, convert it to .py to submit)
    data_cleaned_train_comments_X.csv
    data_cleaned_train_y.csv


## Instructions to build your model
1.  Implement your model in **challenge.ipynb**. You need to modify the *train()* and *predict()* methods of the **Model** class (*attached at the end of this notebook*). You can also add other methods/attributes to the class, or even add new classes in the same file if needed, but do NOT change the signatures of *train()* and *predict()*, as we will call these 2 methods to evaluate your model.

2. To submit, you need to convert your notebook (.ipynb) to a Python **(.py)** file. Make sure the Python file has a class named **Model** with two methods: *train* and *predict*. Remove other experimental code where possible to avoid exceeding the time limit on Gradescope.

3.  You can submit your code on Gradescope to test your model. You can submit as many times as you like. The last submission will count as the final model.

An example linear regression model with random weights is provided to you in this notebook. Please take a look and replace the code with your own.


## Evaluation

We will evaluate your model as follows:

    model = Model() # Model class imported from your submission
    X_train = pd.read_csv("data_cleaned_train_comments_X.csv")  # pandas Dataframe
    y_train = pd.read_csv("data_cleaned_train_y.csv")  # pandas Dataframe
    model.train(X_train, y_train) # train your model on the dataset provided to you
    y_pred = model.predict(X_test) # test your model on the hidden test set (pandas Dataframe)
    mse = mean_squared_error(y_test, y_pred) # compute mean squared error


**There will be 2 test sets, one is public which means you can see MSE on this test set on the Leaderboard (denoted as *MSE (PUBLIC TESTSET)*), and the other one is hidden during the challenge (denoted as *MSE (HIDDEN TESTSET)*)**.
Your score on the hidden test set will be your performance measure. So, don’t try to overfit your model on the public test set. Your final grade will depend on the following criteria:


1.  	Is it original code (implemented by you)?
2.  	Does it take a reasonable time to complete?
    Your model needs to finish running in under 40 minutes on our machine. We run the code on a machine with 4 CPUs, 6.0GB RAM.
3.  	Does it achieve a reasonable MSE?
    - **Initial submission (10 pts)**: Your model has to be better than the random weights linear model (denoted as RANDOM BASELINE on Leaderboard) provided in the file. Note that this is due on **Nov 26, 11:59pm**.
    - **Final submission (90 pts):** Your last submission will count as the final submission. It will be graded against the following MSE checkpoints.
        - Random Chance MSE ~40 and above: Grade = 0
        - MSE 0.5: Grade = 30
        - MSE 0.157: Grade = 60 (denoted as 60 POINT SCORE BASELINE on the Leaderboard)
        - MSE 0.143: Grade = 76.5
        - MSE 0.1358 and below: Grade = 90 (denoted as FULL CREDIT PERFORMANCE on the Leaderboard)
    
    The grade will be linearly interpolated for the submissions that lie in between the checkpoints above. We will use MSE on the hidden test set to evaluate your model (lower is better).

    **Bonus**: **Top 3** with the best MSE on the hidden test set will get a 5 point bonus.
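
To estimate where a given hidden-test MSE would land, the checkpoint rule can be sketched with piecewise-linear interpolation. This is only an illustration: the checkpoint values are copied from the list above, but interpolating between adjacent checkpoints with `np.interp` is an assumption about how the rule is applied, and the function name `final_grade` is hypothetical.

```python
import numpy as np

# (MSE, grade) checkpoints copied from the list above, sorted by increasing MSE.
CHECKPOINTS = [(0.1358, 90.0), (0.143, 76.5), (0.157, 60.0), (0.5, 30.0), (40.0, 0.0)]


def final_grade(mse: float) -> float:
    """Piecewise-linear grade estimate; values outside the range are clamped."""
    xs = [m for m, _ in CHECKPOINTS]
    ys = [g for _, g in CHECKPOINTS]
    # np.interp requires increasing x values and clamps to the end points.
    return float(np.interp(mse, xs, ys))


print(final_grade(0.15))   # halfway between the 0.143 and 0.157 checkpoints
print(final_grade(0.12))   # better than the best checkpoint, clamped to 90
```

The estimate excludes the 10 points for the initial submission and the top-3 bonus.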

**Note 1: This is a regression problem** in which we want to predict the price for an Airbnb property. You should try different models and fine-tune their hyperparameters. A little feature engineering can also help to boost the performance.

**Note 2**: You may NOT use additional datasets. This assignment is meant to challenge you to build a better model, not collect more training data, so please only use the data we provided. We tested the code on Python 3.10 and 3.9, thus it’s highly recommended to use these Python versions for the challenge.


In this challenge, you can only use built-in Python modules and the following packages:
- Numpy
- pandas
- scikit_learn
- matplotlib
- scipy
- torchsummary
- xgboost
- torchmetrics
- lightgbm
- catboost
- torch



In [None]:
### Sample code for the challenge

from typing import Tuple
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

class Model:
    # Modify your model, default is a linear regression model with random weights

    def __init__(self):
        self.theta = None

    def train(self, X_train: pd.DataFrame, y_train: pd.DataFrame) -> None:
        """
        Train model with training data.
        Currently, we use a linear regression with random weights
        You need to modify this function.
        :param X_train: shape (N,d)
        :param y_train: shape (N,1)
            where N is the number of observations, d is feature dimension
        :return: None
        """
        N, d = X_train.shape
        self.theta = np.random.randn(d, 1)
        return None

    def predict(self, X_test: pd.DataFrame) -> np.array:
        """
        Use the trained model to predict on un-seen dataset
        You need to modify this function
        :param X_test: shape (N, d), where N is the number of observations, d is feature dimension
        return: prediction, shape (N,1)
        """
        y_pred = X_test @ self.theta
        return y_pred

In [67]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.7/98.7 MB 6.1 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.2.7


In [95]:
!pip install lightgbm



In [1]:
### My implementation for the challenge

from typing import Tuple
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from sklearn import tree, ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import MinMaxScaler, Normalizer, PolynomialFeatures
from catboost import CatBoostRegressor, Pool
import lightgbm as lgb

'''

if all else fails, use submission 23 with gradescope MSE of: .1575

Notes about problem:
- regression problem
- train on input data, need to predict based on features
- evaluated on MSE of LOG prices

things to try:
- regression problem
  - linear regression
  - logistic regression
- neural network approaches -> pytorch
  - linear models + activation functions
  - residual connections?
- regression trees? -> sklearn
- boosting?
-

what doesn't make sense:
- recurrent neural networks (not sequential data, so it doesn't make a ton of sense)
- convolutional NNs (the data comes from different features, different features can't necessarily be connected together)

approaches tried:
- sklearn.tree.DecisionTreeRegressor
  - train log MSE: 0.0417093276433341
  - validation log MSE: 0.3586
  - notes: overfitting to training data!!
      this is likely a naive approach to solving this problem (overfitting), let's try to do cross validation
    on the training data to gain a better understanding of how each model is performing and how much it is
    overfitting

- sklearn.ensemble.GradientBoostingRegressor
  - train MSE: 0.1511642710561646
  - val MSE: 0.14980329916397306
  - gradescope: validation log MSE: 0.1674
  - cross validation scores: [-0.13925216 -0.14871137 -0.13613333 -0.1396184  -0.14120412]
  - notes: trying to use gradient boosting to get things done

- linear feedforward neural networks with ReLU activation functions
  - train MSE: 0.216356884204702
  - val MSE: 0.21610146467657598
  - validation log MSE: 0.263 (target is 0.1429)
  - notes: likely overfitting to training data due to such a low training log MSE (we'll see based on validation MSE)
    - can try to reduce overfitting by adding dropout and adding regularization?
  - with L2 regularization
    - train log MSE: 0.007972340078077735

- Stochastic Gradient Descent: (no feature engineering)
  - train MSE: 0.19393384627986773
  - val MSE: 0.19056238721171054

- feature engineering:
  - limit the features that are selected to hopefully ignore the less useful features
  - feature engineering + GradientBoosting
  - gradescope validation MSE: 0.1688
    - train MSE: 0.13561107597420052
    - val MSE: 0.13823286088844255

  - feature engineering + random forest (def overfitting, but achieves better result on validation?)
    - train MSE: 0.01995328171540822
    - val MSE: 0.1317457920228709
    - gradescope: 0.1752, 0.1733
      -> def overfitting a lot
      -> try this again while reducing overfitting?

  - feature engineering + stochastic gradient descent (no big difference)
    - train MSE: 0.19393384627986773
    - val MSE: 0.19056238721171054

  - feature engineering + FF neural network:
    - train MSE: 0.23402373770303997
    - val MSE: 0.22856279811992486

    - using feature engineering + FF neural network with 4 layers + no regularization:
      - train MSE: 0.1526759606681928
      - val MSE: 0.17155963779852876
      # starting to overfit to the training data

    - feature engineering + FF neural network with 4 layers + regularization:
      - train MSE: 0.23190209938331155 (10 epochs)
      - val MSE: 0.22566287912980657 (10 epochs)
      - notes:
        - promising??? try more epochs
          - train MSE: 0.2147384080782502
          - val MSE: 0.20974559485949926
        - didn't improve much, try higher learning rate with lower epochs?

  - try with extremely limited set of features
    - random forest:
    - try gradient boosting (with tuned hyperparameters)
        train MSE: 0.1834618478516913
        val MSE: 0.18296780136097118
    - linear feedforward neural network on extremely limited number of features

- feature engineering + gradient boosting

- gradient booster + feature normalization:
  - train MSE: 0.14951480707858728
  - val MSE: 0.14733644514305974

- gradient booster + feature normalization + feature engineering:
  - train MSE: 0.12852229759815006
  - val MSE: 0.13355599449450226
  - gradescope: 0.1646

- only removing host feature information + feature normalization + GradientBoostingRegressor lr=.15
  train MSE: 0.12746668737438804
  val MSE: 0.1338171937136844
  gradescope: 0.1587

  hyperparameter tuning for above approach ^
    - n_estimators = 150,
      train MSE: 0.12079034668698367
      val MSE: 0.13124443551691406
      gradescope MSE: 0.1575

    - n_estimators = 150, criterion="squared_error" (expect to be worse than above)
      train MSE: 0.12079034668698364
      val MSE: 0.13127632640704945

    - criterion = "squared_error"
      train MSE: 0.12746668737438804
      val MSE: 0.133809265486341

    - n_estimators=200
      train MSE: 0.11672489373319596
      val MSE: 0.1302090433368369

    - try with n_estimators=150 + lr=.1
      train MSE: 0.1272075582435308
      val MSE: 0.1332393147846317

    - try with polynomial features

CATBOOST -> jk it sucks

  - catboost with default params
    train MSE: 0.6233636572663228
    val MSE: 0.5996787738336086

LGBMRegressor

  - LGBM regressor with default parameters + feature normalization
    train MSE: 0.10422598703208176
    val MSE: 0.12450862208109358
    gradescope MSE: 0.9349

  - LGBM regressor with default parameters without feature normalization
    train MSE: 0.10422598703208176
    val MSE: 0.12450862208109358

  - LGBM with custom parameters -> learning_rate=.15, n_estimators=150
    train MSE: 0.08549646114308068
    val MSE: 0.12359046743679325
    -> overfitting a lot...

  - LGBM with regularization (both L1 and L2)
    train MSE: 0.08546671162184732
    val MSE: 0.124909423022585

'''

class Model:
    def __init__(self):
        self.columns_of_interest = ['latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable', 'is_business_travel_ready', 'Firm_mattress', 'Step-free_access', 'Dishes_and_silverware', '24-hour_check-in', 'Refrigerator', 'First_aid_kit', 'Pocket_wifi', 'Long_term_stays_allowed', 'Mobile_hoist', 'Self_check-in', 'Children’s_dinnerware', 'Beach_essentials', 'Gym', 'Air_purifier', 'Single_level_home', 'Hair_dryer', 'BBQ_grill', 'Changing_table', 'Building_staff', 'Paid_parking_off_premises', 'Hot_water_kettle', 'Suitable_for_events', 'Laptop_friendly_workspace', 'Roll-in_shower', 'Ground_floor_access', 'Cable_TV', 'Ski-in/Ski-out', 'Cooking_basics', 'Buzzer/wireless_intercom', 'Luggage_dropoff_allowed', 'Wide_clearance_to_bed', 'Garden_or_backyard', 'Fixed_grab_bars_for_toilet', 'Dog(s)', 'Keypad', 'Internet', 'Bathtub', 'Baby_bath', 'Flat_path_to_front_door', 'Shower_chair', 'Breakfast', 'Window_guards', 'Family/kid_friendly', 'Ceiling_hoist', 'Room-darkening_shades', 'Pool', 'Waterfront', 'Ethernet_connection', 'Fireplace_guards', 'Oven', 'Indoor_fireplace', 'Coffee_maker', 'Smoke_detector', 'Bed_linens', 'Wide_hallway_clearance', 'Host_greets_you', 'Private_living_room', 'Private_bathroom', 'Wide_entryway', 'Lake_access', 'Other_pet(s)', 'High_chair', 'Carbon_monoxide_detector', 'Safety_card', 'Hangers', 'Essentials', 'Iron', 'Smart_lock', 'Hot_water', 'Heating', 'Pool_with_pool_hoist', 'Pack_’n_Play/travel_crib', 'Well-lit_path_to_entrance', 'Accessible-height_bed', 'EV_charger', '_toilet', 'Dishwasher', 'Children’s_books_and_toys', 'Wide_doorway', 'Beachfront', 'Pets_live_on_this_property', 'Shampoo', 'Table_corner_guards', 'Fixed_grab_bars_for_shower', 
'Washer', 'Outlet_covers', 'Bathtub_with_bath_chair', 'Free_parking_on_premises', 'Wifi', 'Lock_on_bedroom_door', 'Electric_profiling_bed', 'Stove', 'Lockbox', 'Wheelchair_accessible', 'Cleaning_before_checkout', 'Handheld_shower_head', 'Cat(s)', 'Fire_extinguisher', 'Air_conditioning', 'TV', 'Smoking_allowed', 'Baby_monitor', 'Extra_pillows_and_blankets', 'Game_console', 'Dryer', 'Disabled_parking_spot', 'Microwave', 'Stair_gates', 'Washer_/_Dryer', 'Patio_or_balcony', 'Paid_parking_on_premises', 'Pets_allowed', 'Babysitter_recommendations', 'Elevator', 'Accessible-height_toilet', 'Free_street_parking', 'Doorman', 'Wide_clearance_to_shower', 'Private_entrance', 'Hot_tub', 'Kitchen', 'Crib', 'Aparthotel', 'Apartment', 'Bed and breakfast', 'Boat', 'Boutique hotel', 'Bungalow', 'Cabin', 'Camper/RV', 'Casa particular (Cuba)', 'Castle', 'Cave', 'Condominium', 'Cottage', 'Guest suite', 'Guesthouse', 'Hostel', 'Hotel', 'House', 'Houseboat', 'Loft', 'Nature lodge', 'Resort', 'Serviced apartment', 'Tent', 'Timeshare', 'Tiny house', 'Townhouse', 'Train', 'Villa', 'Airbed', 'Couch', 'Futon', 'Pull-out Sofa', 'Real Bed', 'Entire home/apt', 'Private room', 'Shared room', 'comments']

        # try again with GradientBoostingRegressor with feature engineering
        self.model = ensemble.GradientBoostingRegressor(learning_rate=0.15, n_estimators=150)
        # self.model = SGDRegressor()
        self.scalar = MinMaxScaler()
        self.poly = PolynomialFeatures(degree=2)
        # self.model = lgb.LGBMRegressor()

    def train(self, X_train: pd.DataFrame, y_train: pd.DataFrame) -> None:
        """
        Train model with training data.
        Scales the selected columns to [0, 1], expands them with degree-2
        polynomial features, and fits a GradientBoostingRegressor.
        :param X_train: shape (N,d)
        :param y_train: shape (N,1)
            where N is the number of observations, d is feature dimension
        :return: None
        """
        # scale the selected columns, then add degree-2 polynomial terms
        new_data = self.scalar.fit_transform(X_train[self.columns_of_interest])
        new_data = self.poly.fit_transform(new_data)
        # new_data = X_train.to_numpy()
        # NOTE: [:, 6:] skips the first six polynomial outputs (the bias term and
        # the first five selected columns); predict() applies the same slice
        # flatten y to shape (N,) to avoid sklearn's column-vector warning
        self.model.fit(new_data[:, 6:], np.asarray(y_train).ravel())
        return None

    def predict(self, X_test: pd.DataFrame) -> np.array:
        """
        Use the trained model to predict on un-seen dataset
        :param X_test: shape (N, d), where N is the number of observations, d is feature dimension
        return: prediction, shape (N,)
        """
        # apply the same scaling, polynomial expansion, and column slice as train()
        data = self.scalar.transform(X_test[self.columns_of_interest])
        data = self.poly.transform(data)
        # data = X_test.to_numpy()
        return self.model.predict(data[:, 6:])
        # data = X_test.to_numpy()
        # test_pool = Pool(data)
        # return self.model.predict(test_pool)

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



Everything below this is for testing and is not part of the submission.

In [None]:
# exact code run to test
model = Model() # Model class imported from your submission
X_train = pd.read_csv("data_cleaned_train_comments_X.csv")  # pandas Dataframe
y_train = pd.read_csv("data_cleaned_train_y.csv")  # pandas Dataframe
model.train(X_train, y_train) # train your model on the dataset provided to you

# y_pred = model.predict(X_test) # test your model on the hidden test set (pandas Dataframe)
# mse = mean_squared_error(y_test, y_pred) # compute mean squared error

train_mse = mean_squared_error(y_train, model.predict(X_train))
print(train_mse)

In [3]:
# locally test with train and validation
X_train = pd.read_csv("data_cleaned_train_comments_X.csv")
y_train = pd.read_csv("data_cleaned_train_y.csv")

print(X_train.shape)
print(y_train.shape)

columns_of_interest = ['latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable', 'is_business_travel_ready', 'Firm_mattress', 'Step-free_access', 'Dishes_and_silverware', '24-hour_check-in', 'Refrigerator', 'First_aid_kit', 'Pocket_wifi', 'Long_term_stays_allowed', 'Mobile_hoist', 'Self_check-in', 'Children’s_dinnerware', 'Beach_essentials', 'Gym', 'Air_purifier', 'Single_level_home', 'Hair_dryer', 'BBQ_grill', 'Changing_table', 'Building_staff', 'Paid_parking_off_premises', 'Hot_water_kettle', 'Suitable_for_events', 'Laptop_friendly_workspace', 'Roll-in_shower', 'Ground_floor_access', 'Cable_TV', 'Ski-in/Ski-out', 'Cooking_basics', 'Buzzer/wireless_intercom', 'Luggage_dropoff_allowed', 'Wide_clearance_to_bed', 'Garden_or_backyard', 'Fixed_grab_bars_for_toilet', 'Dog(s)', 'Keypad', 'Internet', 'Bathtub', 'Baby_bath', 'Flat_path_to_front_door', 'Shower_chair', 'Breakfast', 'Window_guards', 'Family/kid_friendly', 'Ceiling_hoist', 'Room-darkening_shades', 'Pool', 'Waterfront', 'Ethernet_connection', 'Fireplace_guards', 'Oven', 'Indoor_fireplace', 'Coffee_maker', 'Smoke_detector', 'Bed_linens', 'Wide_hallway_clearance', 'Host_greets_you', 'Private_living_room', 'Private_bathroom', 'Wide_entryway', 'Lake_access', 'Other_pet(s)', 'High_chair', 'Carbon_monoxide_detector', 'Safety_card', 'Hangers', 'Essentials', 'Iron', 'Smart_lock', 'Hot_water', 'Heating', 'Pool_with_pool_hoist', 'Pack_’n_Play/travel_crib', 'Well-lit_path_to_entrance', 'Accessible-height_bed', 'EV_charger', '_toilet', 'Dishwasher', 'Children’s_books_and_toys', 'Wide_doorway', 'Beachfront', 'Pets_live_on_this_property', 'Shampoo', 'Table_corner_guards', 'Fixed_grab_bars_for_shower', 'Washer', 
'Outlet_covers', 'Bathtub_with_bath_chair', 'Free_parking_on_premises', 'Wifi', 'Lock_on_bedroom_door', 'Electric_profiling_bed', 'Stove', 'Lockbox', 'Wheelchair_accessible', 'Cleaning_before_checkout', 'Handheld_shower_head', 'Cat(s)', 'Fire_extinguisher', 'Air_conditioning', 'TV', 'Smoking_allowed', 'Baby_monitor', 'Extra_pillows_and_blankets', 'Game_console', 'Dryer', 'Disabled_parking_spot', 'Microwave', 'Stair_gates', 'Washer_/_Dryer', 'Patio_or_balcony', 'Paid_parking_on_premises', 'Pets_allowed', 'Babysitter_recommendations', 'Elevator', 'Accessible-height_toilet', 'Free_street_parking', 'Doorman', 'Wide_clearance_to_shower', 'Private_entrance', 'Hot_tub', 'Kitchen', 'Crib', 'Aparthotel', 'Apartment', 'Bed and breakfast', 'Boat', 'Boutique hotel', 'Bungalow', 'Cabin', 'Camper/RV', 'Casa particular (Cuba)', 'Castle', 'Cave', 'Condominium', 'Cottage', 'Guest suite', 'Guesthouse', 'Hostel', 'Hotel', 'House', 'Houseboat', 'Loft', 'Nature lodge', 'Resort', 'Serviced apartment', 'Tent', 'Timeshare', 'Tiny house', 'Townhouse', 'Train', 'Villa', 'Airbed', 'Couch', 'Futon', 'Pull-out Sofa', 'Real Bed', 'Entire home/apt', 'Private room', 'Shared room', 'comments']
print(len(columns_of_interest))

# train_X, val_X, train_y, val_y = train_test_split(X_train[columns_of_interest], y_train, test_size=0.2, random_state=42)
train_X, val_X, train_y, val_y = train_test_split(X_train, y_train, test_size=0.2, random_state=42)


model = Model()
model.train(train_X, train_y)

print(f'train MSE: {mean_squared_error(train_y.to_numpy(),model.predict(train_X))}')
print(f'val MSE: {mean_squared_error(val_y.to_numpy(),model.predict(val_X))}')

(29985, 764)
(29985, 1)
180


  y = column_or_1d(y, warn=True)  # TODO: Is this still required?


KeyboardInterrupt: 

In [23]:
for i, col in enumerate(X_train.columns):
  print(f'{i} {col}')

0 host_since
1 host_response_rate
2 host_is_superhost
3 host_total_listings_count
4 host_has_profile_pic
5 host_identity_verified
6 latitude
7 longitude
8 accommodates
9 bathrooms
10 bedrooms
11 beds
12 security_deposit
13 cleaning_fee
14 guests_included
15 extra_people
16 minimum_nights
17 maximum_nights
18 has_availability
19 number_of_reviews
20 review_scores_rating
21 review_scores_accuracy
22 review_scores_cleanliness
23 review_scores_checkin
24 review_scores_communication
25 review_scores_location
26 review_scores_value
27 instant_bookable
28 is_business_travel_ready
29 require_guest_profile_picture
30 require_guest_phone_verification
31 reviews_per_month
32 kba_verification
33 work_email_verification
34 facebook_verification
35 zhima_selfie_verification
36 weibo_verification
37 jumio_verification
38 email_verification
39 photographer_verification
40 offline_government_id_verification
41 sesame_verification
42 manual_online_verification
43 identity_manual_verification
44 google_v

In [None]:
# print(X_train.head())


(29985, 764)
(29985, 1)


In [None]:
print(X_train.columns.tolist())

['host_since', 'host_response_rate', 'host_is_superhost', 'host_total_listings_count', 'host_has_profile_pic', 'host_identity_verified', 'latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights', 'has_availability', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable', 'is_business_travel_ready', 'require_guest_profile_picture', 'require_guest_phone_verification', 'reviews_per_month', 'kba_verification', 'work_email_verification', 'facebook_verification', 'zhima_selfie_verification', 'weibo_verification', 'jumio_verification', 'email_verification', 'photographer_verification', 'offline_government_id_verification', 'sesame_verification', 'manual_online_verification', 'identity_manual_verification', 

In [None]:
columns_of_interest = ['latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable', 'is_business_travel_ready', 'Firm_mattress', 'Step-free_access', 'Dishes_and_silverware', '24-hour_check-in', 'Refrigerator', 'First_aid_kit', 'Pocket_wifi', 'Long_term_stays_allowed', 'Mobile_hoist', 'Self_check-in', 'Children’s_dinnerware', 'Beach_essentials', 'Gym', 'Air_purifier', 'Single_level_home', 'Hair_dryer', 'BBQ_grill', 'Changing_table', 'Building_staff', 'Paid_parking_off_premises', 'Hot_water_kettle', 'Suitable_for_events', 'Laptop_friendly_workspace', 'Roll-in_shower', 'Ground_floor_access', 'Cable_TV', 'Ski-in/Ski-out', 'Cooking_basics', 'Buzzer/wireless_intercom', 'Luggage_dropoff_allowed', 'Wide_clearance_to_bed', 'Garden_or_backyard', 'Fixed_grab_bars_for_toilet', 'Dog(s)', 'Keypad', 'Internet', 'Bathtub', 'Baby_bath', 'Flat_path_to_front_door', 'Shower_chair', 'Breakfast', 'Window_guards', 'Family/kid_friendly', 'Ceiling_hoist', 'Room-darkening_shades', 'Pool', 'Waterfront', 'Ethernet_connection', 'Fireplace_guards', 'Oven', 'Indoor_fireplace', 'Coffee_maker', 'Smoke_detector', 'Bed_linens', 'Wide_hallway_clearance', 'Host_greets_you', 'Private_living_room', 'Private_bathroom', 'Wide_entryway', 'Lake_access', 'Other_pet(s)', 'High_chair', 'Carbon_monoxide_detector', 'Safety_card', 'Hangers', 'Essentials', 'Iron', 'Smart_lock', 'Hot_water', 'Heating', 'Pool_with_pool_hoist', 'Pack_’n_Play/travel_crib', 'Well-lit_path_to_entrance', 'Accessible-height_bed', 'EV_charger', '_toilet', 'Dishwasher', 'Children’s_books_and_toys', 'Wide_doorway', 'Beachfront', 'Pets_live_on_this_property', 'Shampoo', 'Table_corner_guards', 'Fixed_grab_bars_for_shower', 'Washer', 
'Outlet_covers', 'Bathtub_with_bath_chair', 'Free_parking_on_premises', 'Wifi', 'Lock_on_bedroom_door', 'Electric_profiling_bed', 'Stove', 'Lockbox', 'Wheelchair_accessible', 'Cleaning_before_checkout', 'Handheld_shower_head', 'Cat(s)', 'Fire_extinguisher', 'Air_conditioning', 'TV', 'Smoking_allowed', 'Baby_monitor', 'Extra_pillows_and_blankets', 'Game_console', 'Dryer', 'Disabled_parking_spot', 'Microwave', 'Stair_gates', 'Washer_/_Dryer', 'Patio_or_balcony', 'Paid_parking_on_premises', 'Pets_allowed', 'Babysitter_recommendations', 'Elevator', 'Accessible-height_toilet', 'Free_street_parking', 'Doorman', 'Wide_clearance_to_shower', 'Private_entrance', 'Hot_tub', 'Kitchen', 'Crib', 'Aparthotel', 'Apartment', 'Bed and breakfast', 'Boat', 'Boutique hotel', 'Bungalow', 'Cabin', 'Camper/RV', 'Casa particular (Cuba)', 'Castle', 'Cave', 'Condominium', 'Cottage', 'Guest suite', 'Guesthouse', 'Hostel', 'Hotel', 'House', 'Houseboat', 'Loft', 'Nature lodge', 'Resort', 'Serviced apartment', 'Tent', 'Timeshare', 'Tiny house', 'Townhouse', 'Train', 'Villa', 'Airbed', 'Couch', 'Futon', 'Pull-out Sofa', 'Real Bed', 'Entire home/apt', 'Private room', 'Shared room', 'comments']
print(len(columns_of_interest))

180


In [None]:
# split into train and validation
train_X, val_X, train_y, val_y = train_test_split(X_train[columns_of_interest], y_train, test_size=0.2, random_state=42)

In [None]:
print(train_X.shape)
print(val_X.shape)
print(train_y.shape)
print(val_y.shape)

(23988, 180)
(5997, 180)
(23988, 1)
(5997, 1)


In [None]:
# train model
model = Model()
model.train(train_X, train_y)

  y = column_or_1d(y, warn=True)  # TODO: Is this still required?


In [None]:
# update ytrain
# y_train = y_train.to_numpy()

In [None]:
# data_tensor = torch.tensor(X_train.values, dtype=torch.float32)
# print(type(data_tensor))
output = model.predict(X_train)
print(output.shape)
print(y_train.to_numpy().shape)
print(mean_squared_error(y_train.to_numpy(),output))

train MSE: 0.795397393577965
val MSE: 0.7773877919829343


In [None]:
# evaluate model
# output = model.predict(val_X)
print(f'train MSE: {mean_squared_error(train_y.to_numpy(),model.predict(train_X))}')
print(f'val MSE: {mean_squared_error(val_y.to_numpy(),model.predict(val_X))}')
# print(mean_squared_error(val_y.to_numpy(),output))

train MSE: 0.13561107597420052
val MSE: 0.13823286088844255


In [None]:
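# test output
index = 1
print(val_X.iloc[index: index+1])
# print(model.predict())
# val_y is a DataFrame, so val_y[index] looks up a column labelled 1 and raises
# KeyError; .iloc selects the row by position instead
print(val_y.iloc[index])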

      latitude  longitude  accommodates  bathrooms  bedrooms  beds  \
7568  0.613014   0.460808      0.133333   0.064516  0.066667  0.05   

      security_deposit  cleaning_fee  guests_included  extra_people  ...  \
7568               0.0      0.972243         0.133333           0.0  ...   

      Villa  Airbed  Couch  Futon  Pull-out Sofa  Real Bed  Entire home/apt  \
7568    0.0     0.0    0.0    0.0            0.0       1.0              1.0   

      Private room  Shared room  comments  
7568           0.0          0.0       0.5  

[1 rows x 180 columns]


KeyError: 1

In [None]:
# try cross validation
from sklearn.model_selection import cross_val_score

model = ensemble.RandomForestRegressor()
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print(scores)

  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)


[-0.13935535 -0.14185557 -0.13202761 -0.1397198  -0.13793902]
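
The scores printed above are negative because scikit-learn's `neg_mean_squared_error` scorer returns the negated MSE, so that higher is always better. A minimal sketch of flipping the sign to recover a per-fold MSE follows. It uses synthetic data and a `Ridge` model purely as stand-ins (assumptions for illustration, not the challenge data or the model used above), and passes a 1-D target so no column-vector warning is raised.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the challenge dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.arange(1.0, 6.0) + rng.normal(scale=0.1, size=200)  # 1-D target

# The scorer returns -MSE for each of the 5 folds.
scores = cross_val_score(Ridge(), X, y, cv=5, scoring="neg_mean_squared_error")

# Negate to get an ordinary MSE per fold, then summarize.
mse_per_fold = -scores
print(mse_per_fold.mean(), mse_per_fold.std())
```

Comparing the mean and spread across folds gives a steadier estimate than a single train/validation split.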


**GOOD LUCK!**
