# 4 Modeling Part 1<a id='4_Modeling'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Modeling](#4_Modeling)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
    * [4.2.1 Recap of Data Science Problem](#4.2.1_Recap_of_Data_Science_Problem)
    * [4.2.2 Introduction to Notebook](#4.2.2_Introduction_to_Notebook)
  * [4.3 Objectives](#4.3_Objectives)
  * [4.4 Imports](#4.4_Imports)
  * [4.5 Load Data](#4.5_Load_Data)
  * [4.6 Prepare Data for Modeling](#4.6_Prepare_Data_for_Modeling)
  * [4.7 Train the model](#4.7_Train_the_model)
    * [4.7.1 Dummy regressor](#4.7.1_Dummy_regressor)
      * [4.7.1.1 Fit the model](#4.7.1.1_Fit_the_model)
      * [4.7.1.2 Make prediction](#4.7.1.1_Make_prediction)
      * [4.7.1.3 Measure Accuracy](#4.7.1.1_Measure_Accuracy)

## 4.2 Introduction<a id='4.2_Introduction'></a>

### 4.2.1 Recap of Data Science Problem<a id='4.2.1_Recap_of_Data_Science_Problem'></a>

The goal of this project is to predict the energy baseline for a building.

### 4.2.2 Introduction to Notebook<a id='4.2.2_Introduction_to_Notebook'></a>

## 4.3 Objectives<a id='4.3_Objectives'></a>

The objective of Part 1 of this notebook is to train a dummy regressor as a baseline model. It ignores the features and always predicts the mean of the electricity meter readings in the training set.

The metrics I will use to assess the model are the mean absolute error (MAE) and the root mean squared error (RMSE).
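
As a rough illustration of what these two metrics compute, the sketch below evaluates them by hand on a few made-up readings. The helper names `mae` and `rmse` and the toy numbers are assumptions for illustration only; the notebook itself relies on scikit-learn's `mean_absolute_error` and `mean_squared_error`.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# toy values, not taken from the dataset
y_true = [0.0, 10.0, 20.0, 50.0]
y_pred = [20.0, 20.0, 20.0, 20.0]  # a constant prediction, as a dummy model would make

toy_mae = mae(y_true, y_pred)
toy_rmse = rmse(y_true, y_pred)
print(toy_mae, toy_rmse)
```

Because the errors are squared before averaging, RMSE weights large misses more heavily than MAE and is never smaller than it.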

## 4.4 Imports<a id='4.4_Imports'></a>

In [3]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

## 4.5 Load Data<a id='4.5_Load_Data'></a>

In [4]:
# the supplied CSV data files are in the data/processed directory
#load data
data = pd.read_csv('../data/processed/merged_data_clean.csv')
data.head(5)

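| Unnamed: 0.1 | Unnamed: 0 | building_id | meter | timestamp | meter_reading | meter_reading_log | month | day | dayofweek | hour | ... | air_temperature | cloud_coverage | dew_temperature | precip_depth_1_hr | sea_level_pressure | wind_direction | wind_speed | precip_depth_1_hr_log | cloud_coverage_log | wind_speed_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 2015-12-31 19:00:00 | 0.0 | 0.0 | 12 | 31 | 3 | 19 | ... | 25.0 | 6.0 | 20.0 | NaN | 1019.7 | 0.0 | 0.0 | NaN | 1.94591 | 0.0 |
| 1 | 1 | 1 | 0 | 2015-12-31 19:00:00 | 0.0 | 0.0 | 12 | 31 | 3 | 19 | ... | 25.0 | 6.0 | 20.0 | NaN | 1019.7 | 0.0 | 0.0 | NaN | 1.94591 | 0.0 |
| 2 | 2 | 2 | 0 | 2015-12-31 19:00:00 | 0.0 | 0.0 | 12 | 31 | 3 | 19 | ... | 25.0 | 6.0 | 20.0 | NaN | 1019.7 | 0.0 | 0.0 | NaN | 1.94591 | 0.0 |
| 3 | 3 | 3 | 0 | 2015-12-31 19:00:00 | 0.0 | 0.0 | 12 | 31 | 3 | 19 | ... | 25.0 | 6.0 | 20.0 | NaN | 1019.7 | 0.0 | 0.0 | NaN | 1.94591 | 0.0 |
| 4 | 4 | 4 | 0 | 2015-12-31 19:00:00 | 0.0 | 0.0 | 12 | 31 | 3 | 19 | ... | 25.0 | 6.0 | 20.0 | NaN | 1019.7 | 0.0 | 0.0 | NaN | 1.94591 | 0.0 |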


## 4.6 Prepare Data for Modeling<a id='4.6_Prepare_Data_for_Modeling'></a>

In [5]:
# select electricity data
electricity_data=data[data['meter']==0]

In [7]:
electricity_data.columns

Index(['Unnamed: 0', 'building_id', 'meter', 'timestamp', 'meter_reading',
       'meter_reading_log', 'month', 'day', 'dayofweek', 'hour', 'site_id',
       'primary_use', 'square_feet', 'year_built', 'floor_count',
       'floor_count_log', 'square_feet_log', 'air_temperature',
       'cloud_coverage', 'dew_temperature', 'precip_depth_1_hr',
       'sea_level_pressure', 'wind_direction', 'wind_speed',
       'precip_depth_1_hr_log', 'cloud_coverage_log', 'wind_speed_log'],
      dtype='object')

In [9]:
# drop the leftover index column, the derived date parts, and the log-transformed columns
elec_data_tr=electricity_data.drop(['Unnamed: 0', 'month', 'day', 'dayofweek', 'hour',
       'floor_count_log', 'square_feet_log',
       'precip_depth_1_hr_log', 'cloud_coverage_log', 'wind_speed_log','meter_reading_log'],axis=1)

In [10]:
#train/test split the data
X=elec_data_tr.drop(columns='meter_reading')
y=elec_data_tr.meter_reading
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7, random_state=47)

In [11]:
categ_list = ['site_id','building_id','primary_use','meter','timestamp']
# keep the identifier/categorical columns aside before removing them from the features
categ_train = X_train[categ_list]
categ_test = X_test[categ_list]
# reassign instead of dropping in place, which avoids pandas' SettingWithCopyWarning
X_train = X_train.drop(columns=categ_list)
X_test = X_test.drop(columns=categ_list)
X_train.shape, X_test.shape

((8412185, 10), (3605223, 10))
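
The two shapes are consistent with a 70/30 split of the electricity rows. The arithmetic below is a quick check. It assumes that scikit-learn rounds the training size down and gives the remaining rows to the test set when only `train_size` is passed, which is my understanding of `train_test_split` rather than something stated in this notebook.

```python
import math

n_train_reported = 8412185   # rows in X_train, from the output above
n_test_reported = 3605223    # rows in X_test, from the output above
n_total = n_train_reported + n_test_reported

# assumed rule: floor(train_size * n_samples) rows for training, the rest for testing
n_train_expected = math.floor(0.7 * n_total)
n_test_expected = n_total - n_train_expected

print(n_total, n_train_expected, n_test_expected)
```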

## 4.7 Train the model<a id='4.7_Train_the_model'></a>

### 4.7.1 Dummy regressor<a id='4.7.1_Dummy_regressor'></a>

#### 4.7.1.1 Fit the model<a id='4.7.1.1_Fit_the_model'></a>

In [12]:
dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train, y_train)
train_mean=dumb_reg.constant_[0][0]
train_mean

170.52851156558026
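
To make explicit what the fitted dummy model does, here is a minimal pure-Python stand-in. The class `MeanBaseline` is hypothetical and is not scikit-learn's implementation; it only mimics the behaviour of `DummyRegressor(strategy='mean')` as I understand it: ignore the features, remember the mean of the training target, and predict that value for every row.

```python
from statistics import fmean

class MeanBaseline:
    """Hypothetical stand-in for a mean-strategy dummy regressor."""

    def fit(self, X, y):
        # the features X are ignored; only the target mean is stored
        self.constant_ = fmean(y)
        return self

    def predict(self, X):
        # one identical prediction per input row
        return [self.constant_] * len(X)

# toy data, not taken from the dataset
X_toy = [[1.0], [2.0], [3.0], [4.0]]
y_toy = [100.0, 150.0, 200.0, 250.0]

baseline = MeanBaseline().fit(X_toy, y_toy)
toy_pred = baseline.predict([[10.0], [20.0]])
print(baseline.constant_, toy_pred)
```

The predictions do not depend on the inputs at all, which is why any model that learns something useful from the features should beat this baseline.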

#### 4.7.1.2 Make prediction<a id='4.7.1.1_Make_prediction'></a>

In [13]:
y_tr_pred = dumb_reg.predict(X_train)

In [14]:
# predict the training mean for every test row (same values as dumb_reg.predict(X_test))
y_te_pred = train_mean * np.ones(len(y_test))

#### 4.7.1.3 Measure Accuracy<a id='4.7.1.1_Measure_Accuracy'></a>

Here we measure the accuracy of the model by calculating the mean absolute error and root mean squared error for the training set and the test set.

In [15]:
# RMSE on the (training set, test set)
(mean_squared_error(y_train, y_tr_pred))**(1/2), (mean_squared_error(y_test, y_te_pred))**(1/2)

(377.9482054111115, 383.2159040478373)

In [16]:
# MAE on the (training set, test set)
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)

(179.5645605032982, 179.91815067942179)

The dummy regressor's scores (test RMSE ≈ 383.2, test MAE ≈ 179.9) form the baseline: the RMSE and MAE of future models will be compared against these values.
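
One way to make that comparison concrete is to express a later model's error as a percentage reduction relative to the dummy regressor. The sketch below uses the test-set baseline values reported above; the function name `improvement_over_baseline` and the candidate model's scores are hypothetical placeholders, not results from this project.

```python
# test-set scores of the dummy regressor, copied from the outputs above
BASELINE_RMSE = 383.2159040478373
BASELINE_MAE = 179.91815067942179

def improvement_over_baseline(baseline_score, model_score):
    """Percentage reduction in error relative to the baseline (positive is better)."""
    return 100.0 * (baseline_score - model_score) / baseline_score

# hypothetical scores for some future model, for illustration only
candidate_rmse = 300.0
candidate_mae = 120.0

rmse_gain = improvement_over_baseline(BASELINE_RMSE, candidate_rmse)
mae_gain = improvement_over_baseline(BASELINE_MAE, candidate_mae)
print(f"RMSE reduced by {rmse_gain:.1f}%, MAE reduced by {mae_gain:.1f}%")
```

A value at or below zero would mean the model is no better than always predicting the training mean.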