# NYC Airbnb Price Prediction - Lightning Flash model training

Use the dataset published on Kaggle (https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data) to train a simple Lightning Flash model that predicts prices for Airbnb properties.

This notebook contains the code to train the model from the dataset prepared in the [data cleanup](https://github.com/ryanmark1867/fastai_basics/blob/master/notebooks/data_cleanup.ipynb) notebook. It is adapted from the [fastai model training notebook](https://github.com/ryanmark1867/fastai_basics/blob/master/notebooks/model_training.ipynb) trained on the same dataset.


# Links to key parts of the notebook <a name='linkanchor' />
<a href=#ingestdash>Ingest data</a>

<a href=#buildpipe>Build pipeline</a>

<a href=#modelfit>Define and fit model</a>



# Common imports and global variable definitions

In [4]:

''' check to see if the notebook is being run in Colab, and if so, set the current directory appropriately'''
if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/drive')
  %cd /content/drive/MyDrive/machine_learning_tabular_book/code/lightning_flash_basics/notebooks

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/machine_learning_tabular_book/code/lightning_flash_basics/notebooks


In [42]:
import time
start_time = time.time()

In [13]:
# to get Flash to work in Colab you need this exact set of installs, in this order - see https://stackoverflow.com/questions/69323496/cant-install-pytorch-lightning-flash-on-google-colab
!pip install torch==1.8.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html
!pip install icevision #==0.9.0a1
!pip install effdet 
!pip install lightning-flash[image]
!pip install git+https://github.com/PyTorchLightning/lightning-flash.git
!pip install torchtext==0.9.1
!pip uninstall fastai -y
#There is a bug in the latest release of icevision. Manually apply the fix.
!curl https://raw.githubusercontent.com/airctic/icevision/944b47c5694243ba3f3c8c11a6ef56f05fb111eb/icevision/core/record_components.py --output /usr/local/lib/python3.7/dist-packages/icevision/core/record_components.py
#Restart the kernel

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.8.1+cu102
  Using cached https://download.pytorch.org/whl/cu102/torch-1.8.1%2Bcu102-cp37-cp37m-linux_x86_64.whl (804.1 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.9.0
    Uninstalling torch-1.9.0:
      Successfully uninstalled torch-1.9.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.10.0 requires torch==1.9.0, but you have torch 1.8.1+cu102 which is incompatible.
torchaudio 0.12.1+cu113 requires torch==1.12.1, but you have torch 1.8.1+cu102 which is incompatible.
lightning-lite 1.8.0 requires torch>=1.9.*, but you have torch 1.8.1+cu102 which is incompatible.
lightning-bolts 0.5.0 requir

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch<1.11,>=1.9.0
  Using cached torch-1.10.2-cp37-cp37m-manylinux1_x86_64.whl (881.9 MB)
  Using cached torch-1.9.0-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.8.1+cu102
    Uninstalling torch-1.8.1+cu102:
      Successfully uninstalled torch-1.8.1+cu102
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.9.1 requires torch==1.8.1, but you have torch 1.9.0 which is incompatible.
torchaudio 0.12.1+cu113 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.
lightning-bolts 0.5.0 requires pytorch-lightning>=1.4.0, but you have pytorch-lightning 1.3.6 which is incompatible.
Successfully installed torch-1.9.0


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytorch-lightning>=1.3.6
  Using cached pytorch_lightning-1.8.0-py3-none-any.whl (795 kB)
Installing collected packages: pytorch-lightning
  Attempting uninstall: pytorch-lightning
    Found existing installation: pytorch-lightning 1.3.6
    Uninstalling pytorch-lightning-1.3.6:
      Successfully uninstalled pytorch-lightning-1.3.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pytorch-tabular 0.7.0 requires pytorch-lightning==1.3.6, but you have pytorch-lightning 1.8.0 which is incompatible.
Successfully installed pytorch-lightning-1.8.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collectin

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15992  100 15992    0     0  14498      0  0:00:01  0:00:01 --:--:-- 14485


In [3]:
# need specific level of torchvision or Flash imports will fail
!pip install --upgrade torchvision==0.10.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch==1.9.0
  Using cached torch-1.9.0-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.8.1
    Uninstalling torch-1.8.1:
      Successfully uninstalled torch-1.8.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.9.1 requires torch==1.8.1, but you have torch 1.9.0 which is incompatible.
torchaudio 0.12.1+cu113 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.
pytorch-tabular 0.7.0 requires pytorch-lightning==1.3.6, but you have pytorch-lightning 1.8.0 which is incompatible.
Successfully installed torch-1.9.0


In [1]:
# Lightning Flash imports
import torch
import flash
#from flash.tabular import TabularRegressor, TabularRegressionData
#import flash.core.data.utils
#from flash.core.data.utils import download_data
from flash.tabular import TabularClassificationData, TabularClassifier

In [2]:
# common imports
import zipfile
import pandas as pd
import numpy as np
import time
import seaborn as sns
from matplotlib import pyplot
import matplotlib.pyplot as plt
%matplotlib inline
import pydotplus
# datetime here is the class, not the module; datetime.now() is used below
from datetime import datetime, timedelta, date
from dateutil import relativedelta
from io import StringIO
import pickle
from pickle import dump, load
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
import requests
import json
import os
import yaml
import math
import sys
from subprocess import check_output
from IPython.display import display



In [5]:
# load config file
current_path = os.getcwd()
print("current directory is: "+current_path)

path_to_yaml = os.path.join(current_path, 'model_training_config.yml')
print("path_to_yaml "+path_to_yaml)
try:
    with open (path_to_yaml, 'r') as c_file:
        config = yaml.safe_load(c_file)
except Exception as e:
    # include the exception so the cause (missing file, bad YAML) is visible
    print('Error reading the config file:', e)


current directory is: /content/drive/MyDrive/machine_learning_tabular_book/code/lightning_flash_basics/notebooks
path_to_yaml /content/drive/MyDrive/machine_learning_tabular_book/code/lightning_flash_basics/notebooks/model_training_config.yml
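The parameter cell that follows indexes `config` directly, so a missing section only surfaces as a `KeyError` partway through the cell. A minimal sketch of a check that could run first, assuming the top-level sections this notebook reads; the helper name `missing_sections` and the sample values are hypothetical and not part of the original notebook.

```python
# Top-level sections that the parameter-loading cell reads from the config.
REQUIRED_SECTIONS = ("general", "test_parms", "hyperparameters",
                     "categorical", "text", "continuous", "excluded")

def missing_sections(config, required=REQUIRED_SECTIONS):
    """Return the required top-level sections absent from a parsed config dict."""
    if not isinstance(config, dict):
        # yaml.safe_load returns None for an empty file
        return list(required)
    return [name for name in required if name not in config]

# Hypothetical, heavily trimmed config used only to exercise the check.
sample_config = {
    "general": {"target_col": "price"},
    "test_parms": {"testproportion": 0.2},
    "hyperparameters": {"learning_rate": 0.001},
}
print(missing_sections(sample_config))
```

An empty list means every section the notebook expects is present.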


In [6]:
# load parameters

repeatable_run = config['test_parms']['repeatable_run']
# fix seeds to get identical results on multiple runs
if repeatable_run:
    from numpy.random import seed
    seed(4)
    # torch is imported with the Flash imports above; TensorFlow is not used in this notebook
    torch.manual_seed(7)


testproportion = config['test_parms']['testproportion'] # proportion of data reserved for test set
trainproportion = config['test_parms']['trainproportion'] # proportion of non-test data dedicated to training (vs. validation)
get_test_train_acc = config['test_parms']['get_test_train_acc']
verboseout = config['general']['verboseout']
includetext = config['general']['includetext'] # switch to determine whether text columns are included in the model
save_model_plot = config['general']['save_model_plot'] # switch to determine whether to generate plot with plot_model
tensorboard_callback = config['general']['tensorboard_callback'] # switch to determine if tensorboard callback defined

presaved = config['general']['presaved']
savemodel = config['general']['savemodel']
picklemodel = config['general']['picklemodel']
hctextmax = config['general']['hctextmax']
maxwords = config['general']['maxwords']
textmax = config['general']['textmax']

targetthresh = config['general']['targetthresh']
targetcontinuous = config['general']['targetcontinuous']
target_col = config['general']['target_col']

#time of day thresholds
time_of_day = {'overnight':{'start':0,'end':5},'morning_rush':{'start':5,'end':10},
              'midday':{'start':10,'end':15},'aft_rush':{'start':15,'end':19},'evening':{'start':19,'end':24}}



emptythresh = config['general']['emptythresh']
zero_weight = config['general']['zero_weight']
one_weight = config['general']['one_weight']
one_weight_offset = config['general']['one_weight_offset']
patience_threshold = config['general']['patience_threshold']


# modifier for saved model elements
modifier = config['general']['modifier']

# control whether training controlled by early stop
early_stop = True

# default hyperparameter values
learning_rate = config['hyperparameters']['learning_rate']
dropout_rate = config['hyperparameters']['dropout_rate']
l2_lambda = config['hyperparameters']['l2_lambda']
loss_func = config['hyperparameters']['loss_func']
output_activation = config['hyperparameters']['output_activation']
batch_size = config['hyperparameters']['batch_size']
epochs = config['hyperparameters']['epochs']

# date values
date_today = datetime.now()
print("date today",date_today)

# pickled original dataset and post-preprocessing dataset
pickled_data_file = config['general']['pickled_data_file']
pickled_dataframe = config['general']['pickled_dataframe']

# experiment parameter

current_experiment = config['test_parms']['current_experiment']

# load lists of column categories
collist = config['categorical']
textcols = config['text']
continuouscols = config['continuous']
excludefromcolist = config['excluded']

date today 2022-11-02 01:07:20.516367


# Helper functions

In [7]:
# time_of_day = {'overnight':{'start':0,'end':5},'morning_rush':{'start':5,'end':10},
#              'midday':{'start':10,'end':15},'aft_rush':{'start':15,'end':19},'evening':{'start':19,'end':23}}


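def get_time(hour):
    '''return the time-of-day bucket whose [start, end) range contains hour'''
    tod_out = None  # stays None if hour falls outside every range
    for tod in time_of_day:
        if (hour >= time_of_day[tod]['start']) and (hour < time_of_day[tod]['end']):
            tod_out = tod
    return(tod_out)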

def weekend_time(day, tod):
    if (day=='Saturday') or (day=='Sunday'):
        return('w'+tod)
    else:
        return(tod)
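
The thresholds are treated as half-open ranges: the start hour is included and the end hour is not, so hour 5 belongs to `morning_rush` rather than `overnight`. A self-contained sketch of that lookup using the same thresholds; the function name `bucket_for_hour` and the choice to raise on an out-of-range hour are assumptions for illustration.

```python
# Same thresholds as the parameter cell; each range is [start, end).
time_of_day = {
    "overnight": {"start": 0, "end": 5},
    "morning_rush": {"start": 5, "end": 10},
    "midday": {"start": 10, "end": 15},
    "aft_rush": {"start": 15, "end": 19},
    "evening": {"start": 19, "end": 24},
}

def bucket_for_hour(hour):
    """Return the name of the bucket whose half-open range contains hour."""
    for name, limits in time_of_day.items():
        if limits["start"] <= hour < limits["end"]:
            return name
    raise ValueError(f"hour {hour} is outside 0-23")

print(bucket_for_hour(5))   # boundary hour belongs to the later bucket
print(bucket_for_hour(23))
```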




In [8]:
# get the paths required

def get_path():
    '''get the path for data files

    Returns:
        path: path for data files
    '''
    rawpath = os.getcwd()
    # data is in a directory called "data" that is a sibling to the directory containing the notebook
    path = os.path.abspath(os.path.join(rawpath, '..', 'data'))
    return(path)

def get_pipeline_path():
    '''get the path for pipeline files
    
    Returns:
        path: path for pipeline files
    '''
    rawpath = os.getcwd()
    # pipelines are in a directory called "pipelines" that is a sibling to the directory containing the notebook
    path = os.path.abspath(os.path.join(rawpath, '..', 'pipelines'))
    return(path)

def get_model_path():
    '''get the path for model files
    
    Returns:
        path: path for model files
    '''
    rawpath = os.getcwd()
    # models are in a directory called "models" that is a sibling to the directory containing the notebook
    path = os.path.abspath(os.path.join(rawpath, '..', 'models'))
    return(path)
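
The three path helpers differ only in the name of the sibling directory. A minimal sketch of a single parameterized helper, assuming the same layout (directories next to the notebook directory); the name `sibling_dir` and the `/project` example path are hypothetical.

```python
import os

def sibling_dir(name, base=None):
    """Return the path of directory `name` that sits next to `base`."""
    base = os.getcwd() if base is None else base
    return os.path.normpath(os.path.join(base, "..", name))

# Hypothetical notebook location used only for the demonstration.
notebooks = os.path.join(os.sep, "project", "notebooks")
for name in ("data", "pipelines", "models"):
    print(sibling_dir(name, base=notebooks))
```

Passing `base` explicitly keeps the helper independent of the current working directory, which is convenient when the notebook runs both locally and in Colab.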

In [9]:
def set_experiment_parameters(experiment_number, count_no_delay, count_delay):
    ''' set the appropriate parameters for the experiment 
    Args:
        experiment_number: number of the experiment whose parameters are applied
        count_no_delay: count of negative outcomes in the dataset
        count_delay: count of positive outcomes in the dataset

    Returns:
        early_stop: whether the experiment includes an early stop callback
        one_weight: weight applied to positive outcomes
        epochs: number of epochs in the experiment
        es_monitor: performance measurement tracked in callbacks
        es_mode: direction of performance being tracked in callbacks
    
    '''
    print("setting parameters for experiment ", experiment_number)
    # default settings for early stopping:
    es_monitor = "val_loss"
    es_mode = "min"
    if experiment_number == 0:
        #
        early_stop = False
        #
        one_weight = 1.0
        #
        epochs = 1
    elif experiment_number == 9:
        #
        early_stop = True
        es_monitor="val_accuracy"
        es_mode = "max"
        #
        one_weight = (count_no_delay/count_delay) + one_weight_offset
        #
        get_test_train_acc = False
        #
        epochs = 20    
    elif experiment_number == 1:
        #
        early_stop = False
        #
        one_weight = 1.0
        #
        epochs = 10
    elif experiment_number == 2:
        #
        early_stop = False
        #
        one_weight = 1.0
        #
        epochs = 50
    elif experiment_number == 3:
        #
        early_stop = False
        #
        one_weight = (count_no_delay/count_delay) + one_weight_offset
        #
        epochs = 50
    elif experiment_number == 4:
        #
        early_stop = True
        es_monitor = "val_loss"
        es_mode = "min"
        #
        one_weight = (count_no_delay/count_delay) + one_weight_offset
        #
        epochs = 50
    elif experiment_number == 5:
        #
        early_stop = True
        # if early stopping fails because of the level of TensorFlow/Python, comment out the following
        # line and uncomment the subsequent if statement
        es_monitor="val_accuracy"
        '''
        if sys.version_info >= (3,7):
            es_monitor="val_accuracy"
        else:
            es_monitor = "val_acc"
        '''
        es_mode = "max"
        #
        one_weight = (count_no_delay/count_delay) + one_weight_offset
        #
        epochs = 50
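    else:
        # one_weight and epochs are local to this function, so an unlisted
        # experiment number would otherwise fail at the return statement
        raise ValueError("unknown experiment number: %s" % experiment_number)
    return(early_stop, one_weight, epochs,es_monitor,es_mode)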
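
Several experiments weight the positive class by the ratio of negative to positive rows plus `one_weight_offset`. A worked sketch of that formula with hypothetical counts; the function name `positive_class_weight` is an assumption for illustration.

```python
def positive_class_weight(count_negative, count_positive, offset=0.0):
    """Weight for the positive class: negatives per positive, plus an offset."""
    if count_positive <= 0:
        raise ValueError("count_positive must be greater than zero")
    return (count_negative / count_positive) + offset

# Hypothetical counts: 3 negative rows for every positive row.
print(positive_class_weight(300, 100))              # 3.0
print(positive_class_weight(300, 100, offset=0.5))  # 3.5
```

With a weight of 3.0, an error on one positive row costs as much as errors on three negative rows, which offsets the imbalance in the counts.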






# Ingest data and create refactored dataframe <a name='ingestdash' />
- Ingest the prepared Airbnb dataset from the pickled dataframe
- Add the `target` column derived from the target column named in the config file


<a href=#linkanchor>Back to link list</a>

In [10]:
def ingest_data(path):
    '''load the prepared dataset from the pickle file into a dataframe
    Args:
        path: path for data files
    
    Returns:
        merged_data: dataframe loaded from pickle file
    '''
    file_name = os.path.join(path,pickled_dataframe)
    merged_data = pd.read_pickle(file_name)
    return(merged_data)

In [11]:
def prep_merged_data(merged_data,target_col):
    '''add derived columns to merged_data dataframe
    Args:
        merged_data: input dataframe
        target_col: column that is the target
    
    Returns:
        merged_data: dataframe with derived columns added
    '''
    if targetcontinuous:
        merged_data['target'] = merged_data[target_col]
    else:
        merged_data['target'] = np.where(merged_data[target_col] >= merged_data[target_col].mean(), 1, 0 )
    return(merged_data)
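
When `targetcontinuous` is false, the target becomes 1 for rows at or above the column mean and 0 otherwise. A minimal standard-library sketch of that rule on a handful of hypothetical prices; `binarize_at_mean` is an illustrative name, and the notebook itself uses `np.where` on the dataframe column.

```python
from statistics import mean

def binarize_at_mean(values):
    """Label each value 1 if it is at or above the mean of values, else 0."""
    threshold = mean(values)
    return [1 if v >= threshold else 0 for v in values]

prices = [50, 100, 150, 700]      # hypothetical nightly prices; mean is 250
print(binarize_at_mean(prices))   # [0, 0, 0, 1]
```

A single expensive listing pulls the mean above most rows, so the two classes can end up far from balanced. That is the situation the class weight `one_weight` is meant to compensate for.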

# Master Prep Cell
Contains calls to functions to load data, prep input dataframes, and create refactored dataframe

In [12]:
# master calls

path = get_path()
print("path is",path)
# load the prepared dataset into a dataframe (currently disabled; training reads the CSV files directly)
#full_df = ingest_data(path)
#full_df = prep_merged_data(full_df,target_col)
#merged_data = prep_merged_data(merged_data,target_col)


'''
print("shape of pre refactored dataset", merged_data.shape)
#merged_data['year'].value_counts()
#merged_data.groupby(['Route','Direction']).size().reset_index().rename(columns={0:'count'}).tail(50)
# create refactored dataframe with one row for each route / direction / timeslot combination
print("shape of refactored dataset", merged_data.shape)
count_no_delay = merged_data[merged_data['target']==0].shape[0]
count_delay = merged_data[merged_data['target']==1].shape[0]
print("count under mean ",count_no_delay)
print("count over mean ",count_delay)
# define parameters for the current experiment
experiment_number = current_experiment
early_stop, one_weight, epochs,es_monitor,es_mode = set_experiment_parameters(experiment_number, count_no_delay, count_delay)
print("early_stop is ",early_stop)
print("one_weight is ",one_weight)
print("epochs is ",epochs)
print("es_monitor is ",es_monitor)
print("es_mode is ",es_mode)
'''

path is /content/drive/MyDrive/machine_learning_tabular_book/code/lightning_flash_basics/data



In [None]:
# experiment with removing all but the features used to train the model
# Features to train with are:
# neighbourhood_group - 4
# neighbourhood - 5
# room_type - 8
# minimum_nights - 10
# number_of_reviews - 11
# reviews_per_month - 13
# calculated_host_listings_count - 14
#merged_data.drop(merged_data.columns[[0,1,2,3,6,7,9,12,15,16]],axis=1,inplace=True)

# Define training, validation, and test subsets of the dataset

In [13]:
def get_train_validation_test(dataset):
    '''split the dataset into training, validation, and test subsets
    Args:
        dataset: input dataframe
    
    Returns:
        dtrain: training subset of dataset
        dvalid: validation subset of dataset
        dtest: test subset of dataset
    '''
    train, test = train_test_split(dataset, test_size = testproportion)
    dtrain, dvalid = train_test_split(train, random_state=123, train_size=trainproportion)
    print("Through train test split. Test proportion:")
    print(testproportion)
    return(dtrain,dvalid,test)
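
The split is applied in two stages: `testproportion` is taken from the full dataset, then `trainproportion` is taken from what remains. A small arithmetic sketch of the resulting subset sizes; the rounding shown (ceiling for the test set, floor for the training set) is an assumption, and scikit-learn's own counts may differ by a row.

```python
import math

def split_sizes(n_rows, test_proportion, train_proportion):
    """Approximate row counts for the train, validation, and test subsets."""
    n_test = math.ceil(n_rows * test_proportion)
    n_remaining = n_rows - n_test
    n_train = math.floor(n_remaining * train_proportion)
    n_valid = n_remaining - n_train
    return n_train, n_valid, n_test

# Hypothetical dataset of 1000 rows with a 25% test set and 75% of the rest for training.
print(split_sizes(1000, 0.25, 0.75))   # (562, 188, 250)
```

Note that `trainproportion` is relative to the non-test rows, so 0.75 here yields about 56% of the full dataset for training, not 75%.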



# Build Pipeline <a name='buildpipe' />

Create pipeline objects to perform final data preparation steps for training and inference.

Note that cleanup on the training dataset is completed upstream in the [data cleanup notebook](https://github.com/ryanmark1867/end_to_end_deep_learning_liveproject/blob/master/notebooks/data_cleanup.ipynb). 
- The pipelines only accomplish the subset of preparation that is required for both training and inference
- Scoring data that arrives for inference through the web deployment is already constrained to valid values, so the pipelines do not need to handle the invalid values that the data cleanup notebook corrects.
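
The idea of sharing preparation between training and inference can be sketched as an ordered list of row-level steps that both code paths call. This is an illustration only: the two steps shown and every name in the block are hypothetical, not the pipeline classes used in the original project.

```python
def fill_missing_reviews(row):
    """Hypothetical step: treat a missing reviews_per_month as 0.0."""
    out = dict(row)
    if out.get("reviews_per_month") is None:
        out["reviews_per_month"] = 0.0
    return out

def normalize_room_type(row):
    """Hypothetical step: trim and lower-case the room_type label."""
    out = dict(row)
    out["room_type"] = out["room_type"].strip().lower()
    return out

PIPELINE = [fill_missing_reviews, normalize_room_type]

def apply_pipeline(row, steps=PIPELINE):
    """Run every step in order; the same call serves training rows and scoring rows."""
    for step in steps:
        row = step(row)
    return row

scoring_row = {"room_type": " Private room ", "reviews_per_month": None}
print(apply_pipeline(scoring_row))
```

Because each step copies the row before changing it, the input dictionary is left untouched and the steps can be reordered or reused safely.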

<a href=#linkanchor>Back to link list</a>

In [None]:
# Features are
# neighbourhood_group
# neighbourhood
# room_type
# minimum_nights
# number_of_reviews
# reviews_per_month
# calculated_host_listings_count



In [14]:
# explicitly define cont and cat
# Features are
# neighbourhood_group
# neighbourhood
# room_type
# minimum_nights
# number_of_reviews
# reviews_per_month
# calculated_host_listings_count
dep_var = 'target'
cat = ['neighbourhood_group','neighbourhood','room_type']
cont = ['minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count']
print("continuous columns are: ",cont)
print("categorical columns are: ",cat)

continuous columns are:  ['minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count']
categorical columns are:  ['neighbourhood_group', 'neighbourhood', 'room_type']
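
The categorical columns hold strings, so they have to be mapped to integer codes before a network can embed them. A minimal sketch of one common scheme (a sorted vocabulary with 0 reserved for unseen values); it illustrates the idea and is not a description of how Flash encodes categorical fields internally. The function names and sample values are assumptions.

```python
def build_vocab(values):
    """Map each distinct category to an integer code starting at 1; 0 is reserved for unseen values."""
    return {value: code for code, value in enumerate(sorted(set(values)), start=1)}

def encode(values, vocab):
    """Replace each category with its code, using 0 for categories not in vocab."""
    return [vocab.get(value, 0) for value in values]

# Sample room_type values; "Hotel room" stands in for a category absent from training.
room_types = ["Private room", "Entire home/apt", "Private room", "Shared room"]
vocab = build_vocab(room_types)
print(vocab)
print(encode(["Shared room", "Hotel room"], vocab))   # [3, 0]
```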


# Define and fit model <a name='modelfit' />
- use the Lightning Flash tabular data capabilities

<a href=#linkanchor>Back to link list</a>

In [44]:
# define TabularClassificationData object from the CSV files, using the categorical and continuous lists
# important to pick the right object - TabularRegressionData will generate obscure errors with a classification target
# for simplicity, use pre-canned train, validation, and test sets as CSVs

datamodule = TabularClassificationData.from_csv(
    categorical_fields=cat,
    numerical_fields=cont,
    target_fields="target",
    train_file='../data/train.csv',
    val_file='../data/valid.csv',
    predict_file='../data/test.csv',
    batch_size=64
)





In [32]:
# define the model
model = TabularClassifier.from_data(datamodule, learning_rate=0.1)

INFO:pytorch_lightning.utilities.rank_zero:Using 'tabnet' provided by manujosephv/PyTorch Tabular (https://github.com/manujosephv/pytorch_tabular).


In [39]:
trainer = flash.Trainer(max_epochs=3, gpus=torch.cuda.device_count())


  f"Setting `Trainer(gpus={gpus!r})` is deprecated in v1.7 and will be removed"
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


In [40]:
# train the model
trainer.fit(model, datamodule=datamodule)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name          | Type                  | Params
--------------------------------------------------------
0 | train_metrics | ModuleDict            | 0     
1 | val_metrics   | ModuleDict            | 0     
2 | test_metrics  | ModuleDict            | 0     
3 | adapter       | PytorchTabularAdapter | 12.7 K
--------------------------------------------------------
12.7 K    Trainable params
0         Non-trainable params
12.7 K    Total params
0.051     Total estimated model params size (MB)



INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


In [43]:

# print elapsed time to run the notebook
print("--- %s seconds ---" % (time.time() - start_time))

--- 14.501873254776001 seconds ---
