# Modelling
<u>Tests using the following models:</u>
* Linear regression
* Random forest regressor
* Ridge and Lasso regularisation (as an add-on to linear modelling?)

<u>Tests using the following variables:</u>
* Weather variables (rain, temperature, windspeed)
* Time variables (Day of week, month, year, time of day, public holiday)
* Sensor environment variables:
    * Sensor_id
    * Betweenness of the street 
    * Buildings in proximity to the sensor
    * Landmarks in proximity to the sensor  
    * Furniture in proximity to the sensor    
    * Lights in proximity to the sensor   


Normalise variables: should this be done with MinMaxScaler or StandardScaler?
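
To make the scaling question concrete, here is a minimal sketch comparing the two scalers on a made-up feature containing one large outlier. The values are illustrative only and are not taken from the sensor data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature with one large outlier (values are made up for illustration)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# MinMaxScaler maps the observed minimum to 0 and the observed maximum to 1
minmax_scaled = MinMaxScaler().fit_transform(x)

# StandardScaler subtracts the mean and divides by the standard deviation
standard_scaled = StandardScaler().fit_transform(x)

print("MinMax:  ", minmax_scaled.ravel())
print("Standard:", standard_scaled.ravel())
```

With MinMaxScaler the outlier squeezes the remaining values into a narrow band near 0, whereas StandardScaler keeps them centred around the mean. Tree-based models such as random forests are largely insensitive to this choice, so it probably matters most for the linear, Ridge and Lasso models.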


Process:
* Keep only data from sensors with relatively complete data
* Split data into training (75%) and test (25%)
* Define the models to use in testing (linear regression, random forest, XGBoost)
* Define the error metrics to use in evaluating the model performance
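
The process above could be sketched roughly as follows. This is an illustration on synthetic data, not the notebook's actual evaluation code: the names `X_demo`, `y_demo`, `models` and `metrics` are hypothetical, and XGBoost is left out to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and hourly counts (assumption)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Split data into training (75%) and test (25%)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Define the models to use in testing
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Define the error metrics to use in evaluating model performance
metrics = {
    "mae": mean_absolute_error,
    "rmse": lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)),
    "r2": r2_score,
}

# Fit each model on the training set and score it on the held-out test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {m: float(fn(y_test, predictions)) for m, fn in metrics.items()}

print(results)
```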

In [1]:
import copy
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, mean_squared_error,r2_score, accuracy_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.preprocessing import MinMaxScaler
import time as thetime
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline  # needed to build the final model pipeline below
from xgboost import XGBClassifier, XGBRegressor
from time import time
from sklearn.inspection import permutation_importance
from scipy import stats

from eli5.sklearn import PermutationImportance
from sklearn.model_selection import cross_val_predict

from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

from Functions import *
import joblib

In [2]:
buffer_size_m = 50

## Read in formatted data

In [21]:
data = pd.read_csv("../Cleaned_data/FormattedDataForModelling/formatted_data_for_modelling_allsensors_{}.csv".format(buffer_size_m), index_col = False)

### Delete unneeded columns
We currently include data from all sensors (even incomplete ones)

In [24]:
data = data.drop(['sensor_id'], axis=1)  # don't want this included as a predictor
# Get rid of columns in which none of the sensors have a non-zero value
for column in data.columns:
    if np.nanmax(data[column]) == 0:
        del data[column]

In [25]:
# Drop every column whose name ends in 'buildings', 'furniture' or 'landmarks'
regex_pattern = 'buildings$|furniture$|landmarks$'
data = data[data.columns.drop(list(data.filter(regex=regex_pattern)))].copy()

### Add a random variable (to compare performance of other variables against)

In [26]:
rng = np.random.RandomState(seed=42)
# Draw both random columns from the seeded generator so they are reproducible
data['random'] = rng.random_sample(size=len(data))
data["random_cat"] = rng.randint(3, size=data.shape[0])
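
As a rough illustration of why the random variable is useful, the sketch below fits a small random forest on synthetic data and compares permutation importances. The `demo` frame and its `informative` column are hypothetical; a real feature whose importance is no higher than that of `random` is probably contributing little.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(42)

# Synthetic data: one feature that drives the target and one pure-noise column
demo = pd.DataFrame({"informative": rng.normal(size=300)})
demo["random"] = rng.random_sample(size=len(demo))
target = 3.0 * demo["informative"] + rng.normal(scale=0.1, size=len(demo))

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(demo, target)

# Permutation importance: drop in score when a column's values are shuffled
result = permutation_importance(model, demo, target, n_repeats=5, random_state=0)
importances = pd.Series(result.importances_mean, index=demo.columns)

print(importances.sort_values(ascending=False))
```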

In [7]:
# print(len(data['random'].unique()))
# print(len(data))

### Date based variables: Option 1 - Use the Cos/sin cyclical variable versions
Store the (non-sin/cos) time columns and then remove them from the predictors.
They are needed later to segment the results by hour of the day.

In [9]:
# data_time_columns = data[['day_of_month_num', 'time', 'weekday_num', 'time_of_day']]
# data = data.drop(['day_of_month_num', 'time', 'weekday_num', 'time_of_day','year', 'month','day', 'datetime', 'month_num'],axis=1)
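
For reference, a sin/cos encoding of a cyclical variable is usually built as below. This is a sketch that assumes a period of 24 for hour of day; the column names `Sin_hour` and `Cos_hour` are hypothetical and only mirror the `Sin_`/`Cos_` naming used in the formatted data, whose construction is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hours 0..23, encoded as a point on the unit circle
hours = pd.DataFrame({"hour": np.arange(24)})
angle = 2 * np.pi * hours["hour"] / 24
hours["Sin_hour"] = np.sin(angle)
hours["Cos_hour"] = np.cos(angle)


def encoded_distance(hour_a, hour_b):
    """Euclidean distance between two hours in the (sin, cos) encoding."""
    a = hours.loc[hour_a, ["Sin_hour", "Cos_hour"]].to_numpy(dtype=float)
    b = hours.loc[hour_b, ["Sin_hour", "Cos_hour"]].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))


# 23:00 and 00:00 end up as close together as 00:00 and 01:00
print(encoded_distance(23, 0), encoded_distance(0, 1), encoded_distance(0, 12))
```

The point of the encoding is that the end of a cycle sits next to its start, which a raw integer column does not capture.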

### Date based variables: Option 2 - Create Dummy Variables

In [27]:
for date_col in ['day', 'month',]:
    date_col_dummy =  pd.get_dummies(data[date_col], drop_first = True)
    if date_col =='month':
        date_col_dummy.columns= prepend(date_col_dummy.columns.values, 'month_')
#     if date_col =='year':
#         date_col_dummy.columns= prepend(date_col_dummy.columns.values, 'year_')
    data = pd.concat([data, date_col_dummy],axis=1)
    del data[date_col]
# data_time_columns = data[['time']]
data = data.drop(['datetime', 'time_of_day',  'time', 'weekday_num','month_num','Sin_month_num', 'Cos_month_num',
       'Sin_weekday_num', 'Cos_weekday_num', "day_of_month_num" ],axis=1)
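
A possible alternative to renaming the dummy columns by hand is the `prefix` argument of `pd.get_dummies`. The sketch below uses a made-up `demo` frame and assumes that the `prepend` helper from `Functions` simply adds the prefix to each column name.

```python
import pandas as pd

# Made-up month values for illustration
demo = pd.DataFrame({"month": ["January", "February", "March", "January"]})

# drop_first=True drops the first category (alphabetically) to avoid
# perfectly collinear dummy columns in the linear models
month_dummies = pd.get_dummies(demo["month"], prefix="month", drop_first=True)

print(month_dummies.columns.tolist())
```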

In [29]:
data['year_normalised']=data['year']-2010
del data["year"]

## Prepare data for modelling 
### Split into predictor/predictand variables

In [30]:
# The predictor variables
Xfull = data.drop(['hourly_counts'], axis =1)
# The variable to be predicted
Yfull = data['hourly_counts'].values

## Fit the final model
Random Forest was the best-performing model in cross-validation.  
The final model is therefore fitted on all of the data.

In [32]:
## Fit the final model --- 1
print("fitting model 1")
# Scale the predictors, then fit a random forest on the full dataset
rf_model_pipeline1 = Pipeline(steps=[('scaler', StandardScaler()),
                                     ('rf_regressor', RandomForestRegressor(random_state=1, n_jobs=32))])
rf_model_pipeline1.fit(Xfull, Yfull)
print("saving pickled file")
# Save the fitted pipeline to a pickled file
filename = 'PickleFiles/FinalModels/rf_model_pipeline1_combined_features_{}.pkl'.format(buffer_size_m)
joblib.dump(rf_model_pipeline1, filename)
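
The saved pipeline can later be reloaded with `joblib.load` and used for prediction. The sketch below shows the round trip on a small synthetic pipeline written to a temporary directory; the data, the file name `demo_pipeline.pkl` and the reduced `n_estimators` are assumptions made to keep the example fast.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Xfull / Yfull (assumption)
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("rf_regressor", RandomForestRegressor(n_estimators=20, random_state=1)),
])
pipeline.fit(X_demo, y_demo)

# Dump the fitted pipeline to disk and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(pipeline, demo_path)
    reloaded = joblib.load(demo_path)

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.allclose(pipeline.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```

Predictions from a reloaded model require the input columns to be in the same order as in the training data, which is presumably why the predictors are also written out to CSV.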

In [19]:
Xfull.to_csv('PickleFiles/FinalModels/Xfull_rf_model_pipeline1_combined_features_{}.csv'.format(buffer_size_m), index=False)
Yfull_df=pd.DataFrame(Yfull)
Yfull_df.to_csv('PickleFiles/FinalModels/Yfull_rf_model_pipeline1_combined_features_{}.csv'.format(buffer_size_m), index=False)