# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**Team 21**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable sources and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner working of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [None]:
# Install XGBoost from conda-forge; the anaconda channel package below is an alternative
conda install -c conda-forge xgboost
# conda install -c anaconda py-xgboost

In [None]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np  # for linear algebra
import pandas as pd  # data processing, CSV file import
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rc

# Libraries for data preparation and model building
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error  # accuracy metric
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              ExtraTreesRegressor, StackingRegressor,
                              AdaBoostRegressor)
import xgboost as xgb
from xgboost import XGBRegressor

# Numeric helpers
from math import sqrt
from numpy import hstack, vstack, asarray

# Utilities
from tqdm import tqdm
import math
import random


# Setting global constants to ensure notebook results are reproducible
#PARAMETER_CONSTANT = ###


<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

To load your data, first ensure that the raw data and the notebook file are in the same folder on your local machine. The code below will load both the train and test data set into your notebook. If the files are not in the same folder, you will have to point to the directory in your machine or cloud location where the file is located. After loading your data, it is good practice to call up the loaded data just to verify that the data actually loaded as it should.

In [None]:
# load the train data
train_data = df_train = pd.read_csv('df_train.csv') 
df_train

In [None]:
# Load the test data
test_data = df_test = pd.read_csv('df_test.csv') 
df_test

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


### Check the "Shape" of the data-sets
Looking at the shape of both datasets, it is clear that the data has been split into 2 sets. 75% of the data is designated as the train data while 25% of the data is designated as the test data. The shape also shows that the training set has 49 columns while the test set has only 48 columns. The missing column from the test set is the one our model has to predict. We can identify it by finding the column that is present in the train set but absent from the test set: it is the `load_shortfall_3h` column.

In [None]:
df_train.shape, df_test.shape

### Use the ".columns" attribute to view the columns in your data set
While `.shape` reveals the number of rows and columns in your data set, `.columns` lists the actual names of all the columns in the DataFrame. The two cells below output the column names of df_train and df_test respectively.

In [None]:
df_train.columns

In [None]:
df_test.columns

### The "describe" function
This function shows the summary statistics of the data. The count row gives the number of non-null entries in each column. The mean, standard deviation, minimum, maximum and quartile values (25%, 50%, 75%) are also included in the summary statistics returned by the describe function.

In [None]:
# look at data statistics for df_train
df_train.describe()

In [None]:
# look at data statistics for df_test
df_test.describe()

### The "isnull" function
It is important to identify the columns that have null entries, as null values can affect the performance of our model. The "isnull" function flags every null entry, and chaining ".sum()" onto it gives the number of null values in each column of the dataset. This data set is relatively clean: only the column "Valencia_pressure" contains null values.

In [None]:
# Identify column(s) that contain null values in df_train
df_train.isnull().sum()

In [None]:
# Identify column(s) that contain null values in df_test
df_test.isnull().sum()
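
The raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up frame (the `toy` values are hypothetical, not taken from the project data) to show one way of listing only the columns that contain nulls, together with the fraction of rows affected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train
toy = pd.DataFrame({
    "Valencia_pressure": [1010.0, np.nan, 1012.0, np.nan],
    "Madrid_temp": [280.1, 281.4, 279.9, 282.0],
})

# Count the nulls per column and express them as a share of all rows
null_counts = toy.isnull().sum()
null_share = toy.isnull().mean()

# Keep only the columns that actually contain nulls
summary = pd.DataFrame({"nulls": null_counts, "share": null_share})
summary = summary[summary["nulls"] > 0]
print(summary)
```

Applied to the real data, the same two calls would be run on df_train and df_test in place of `toy`.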

### Evaluate the correlation between the columns of the dataset
It is necessary to evaluate how the values in the columns correlate with each other. If several columns are strongly correlated, all but one of them should be removed before the data is used for model building: the redundant columns add little new information, increase the size of the model, slow it down and can make the coefficients of linear models unstable. A correlation value of 1 represents a perfect positive correlation while a value of -1 indicates a perfect negative correlation. The closer the value is to 0, the weaker the correlation.

In [None]:
# evaluate correlation for df_train
df_train.corr(numeric_only=True)  # skip the object columns, which newer pandas versions refuse to correlate

In [None]:
# evaluate correlation for df_test
df_test.corr(numeric_only=True)  # skip the object columns, which newer pandas versions refuse to correlate
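
To make the meaning of +1 and -1 concrete, here is a minimal sketch on made-up data. The column names `a`, `b`, `c` and `label` are hypothetical; the string column is excluded before the correlation is computed.

```python
import pandas as pd

# Hypothetical toy data: b rises with a, c falls with a, label is text
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "label": ["w", "x", "y", "z"]})
toy["b"] = 2 * toy["a"] + 1   # perfect positive relationship
toy["c"] = -toy["a"]          # perfect negative relationship

# Correlate the numeric columns only
corr = toy.select_dtypes(include="number").corr()
print(corr)
```

The entry for `a` and `b` comes out at 1 and the entry for `a` and `c` at -1, which are the two extremes described above.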

In [None]:
# have a look at feature distributions
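
The cell above is left empty in the original. One possible way to look at feature distributions numerically, sketched on randomly generated data (the column names `symmetric` and `right_skewed` are hypothetical), is to compare skewness and kurtosis per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one roughly symmetric and one right-skewed feature
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=1000),
    "right_skewed": rng.exponential(size=1000),
})

# Skew near 0 suggests a symmetric distribution; large positive skew means a long right tail
shape_stats = pd.DataFrame({"skew": toy.skew(), "kurtosis": toy.kurtosis()})
print(shape_stats)
```

For a visual check, `toy.hist(bins=30)` draws one histogram per numeric column; on the project data the same calls would be made on df_train.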

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

### Merge the data sets
Observe that so far we have had to repeat each step for both the Test and the Train data sets. The EDA, data cleaning and pre-processing can be simplified by merging the two data sets: any action carried out on the merged set then affects both, so each cleaning step only has to be written once. In this notebook the merge in the cell below is left commented out, so the remaining steps are still applied to df_train and df_test separately.

In [None]:
# # Merge the test and train data set to simplify your work 
# df_train = pd.concat([df_train, df_test])
df_train.describe()

In [None]:
df_train.shape
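
If you do want to merge, the sketch below shows one way to concatenate the two sets with a label so that they can be split apart again after cleaning. The frames `train` and `test`, the column `feature` and the fill value 0.0 are all hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df_train and df_test
train = pd.DataFrame({"feature": [1.0, np.nan, 3.0],
                      "load_shortfall_3h": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"feature": [4.0, np.nan]})

# Label each block so the rows can be recovered after cleaning
merged = pd.concat([train, test], keys=["train", "test"])

# One cleaning step now applies to both sets (0.0 is only a placeholder fill value)
merged["feature"] = merged["feature"].fillna(0.0)

# Split back; the response column is all null for the test rows, so drop it there
train_clean = merged.loc["train"]
test_clean = merged.loc["test"].drop(columns="load_shortfall_3h")
```

The `keys` argument adds an outer index level, which avoids having to remember how many rows came from each set.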

### Check for null Values in df_train
With the merge above commented out, the two DataFrames are checked separately, and each should show nulls only in "Valencia_pressure". Had the sets been merged, the result would be the sum of the null values in df_train and df_test, and the load_shortfall_3h column would additionally be null for every row that came from df_test.

In [None]:
print(df_train.isnull().sum())  # print explicitly: a cell only displays its last expression
df_test.isnull().sum()

### Check the tail of the dataset
Now let us check the tail of df_train and compare it with the tail of df_test. If the sets had been merged, the last rows of the merged DataFrame would be the df_test rows, so the two tails would match. With the merge commented out, they are simply the last rows of each set.

In [None]:
df_train.tail(5)

In [None]:
df_test.tail(5)

### Fix null entries (train)
Looking at the dataset, everything seems to be in order except the "Valencia_pressure" column, which records 2,522 null entries. Note that in a merged data set load_shortfall_3h would also show null entries, because it is the value to be predicted for the test rows. Had the train and test sets been merged, one cleaning step would fix both at once. Since the merge is commented out here, the same fix is applied to df_train and df_test in turn, starting with the distribution and the measures of central tendency of the train column.

In [None]:
sns.boxplot(x=df_train['Valencia_pressure'])

In [None]:
print('Mode')
print(df_train['Valencia_pressure'].mode())
print('Mean')
print(df_train['Valencia_pressure'].mean())
print('Median')
print(df_train['Valencia_pressure'].median())

### Fix null entries (test)
The same measures of central tendency are computed for "Valencia_pressure" in df_test.

In [None]:
print('Mode')
print(df_test['Valencia_pressure'].mode())
print('Mean')
print(df_test['Valencia_pressure'].mean())
print('Median')
print(df_test['Valencia_pressure'].median())

### Choosing the value to replace null values
To fix the null value problem, you can either remove the entries with nulls from your data set or fill in the null values with the mean, median or mode. The choice makes little difference here, because the measures of central tendency of "Valencia_pressure" printed above and the box plot show that these values are quite similar. For the purpose of this model, we shall fill the null values in "Valencia_pressure" with the mode.

In [None]:
# Fill the nulls in Valencia_pressure with the column mode
df_train['Valencia_pressure'] = df_train['Valencia_pressure'].fillna(df_train['Valencia_pressure'].mode()[0])

In [None]:
# Fill the nulls in Valencia_pressure with the column mode
df_test['Valencia_pressure'] = df_test['Valencia_pressure'].fillna(df_test['Valencia_pressure'].mode()[0])

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()
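
The cells above compute a separate mode for each set. A common alternative, sketched here on made-up values (the frames `train` and `test` are hypothetical), is to compute the fill value on the training data only and reuse it for the test data, so that both sets are treated identically.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two sets
train = pd.DataFrame({"Valencia_pressure": [1010.0, 1010.0, 1015.0, np.nan]})
test = pd.DataFrame({"Valencia_pressure": [np.nan, 1020.0]})

# Learn the fill value from the training data only
fill_value = train["Valencia_pressure"].mode()[0]

# Apply the same value to both sets
train_filled = train.fillna({"Valencia_pressure": fill_value})
test_filled = test.fillna({"Valencia_pressure": fill_value})
```

Whether this changes the results here is not established by the notebook; since the mean, median and mode are reported to be close, the difference is probably small.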

### Check data types
Now that we have taken care of the null values, we can check the data types contained in the dataset. Most machine learning models, including the scikit-learn regressors used here, only accept numeric inputs, so every feature must be a float or an integer. The code below reveals the data types in our data set. Three non-numeric columns are observed in df_train: "time", "Valencia_wind_deg" and "Seville_pressure" are all of type object (they hold strings). As with the null values, you could handle this by simply dropping the columns. This is not recommended, because every time you drop data you lose potentially valuable information that may be useful for model building. A better approach is to transform these columns to numeric form, or to encode them in a form that the model can use.

In [None]:
df_train.dtypes

In [None]:
df_test.dtypes

### The time column
Let us take a more thorough look at the object columns in our data frame, starting with the time column. We need to convert that column to a datetime format, from which numeric features can then be derived.

In [None]:
df_train['time']

In [None]:
df_test['time']

To convert to datetime, use the to_datetime function in the pandas library. When you output the converted column, it looks at first glance as though nothing has changed. Look at the last line of the output though: the data type (dtype) that was shown as "object" is now "datetime64[ns]". Strings cannot be read by any of the models used here, and most of them, linear regression included, cannot read the datetime type directly either, so the column still has to be broken down into numeric parts such as year, month and hour. Visit https://www.analyticsvidhya.com/blog/2020/05/datetime-variables-python-pandas/ for a resource with detailed instructions on how to adapt datetime data for use by a regression model.

In [None]:
df_train['time'] = pd.to_datetime(df_train['time'])
df_train.time

In [None]:
df_test['time'] = pd.to_datetime(df_test['time'])
df_test.time
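
Once a column has the datetime64 type, its numeric parts are available through the `.dt` accessor. A minimal sketch on two made-up timestamps (the values and the names `times` and `features` are hypothetical):

```python
import pandas as pd

# Two hypothetical timestamps in the same string format as the time column
times = pd.Series(pd.to_datetime(["2015-01-01 03:00:00", "2015-06-15 21:00:00"]))

# Each part is a plain integer column that a regression model can read
features = pd.DataFrame({
    "year": times.dt.year,
    "month": times.dt.month,
    "day_of_week": times.dt.dayofweek,  # Monday = 0
    "hour": times.dt.hour,
})
print(features)
```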

In [None]:

from datetime import datetime

From the dataset it seems we will rely heavily on the weather. We can in turn extract the week, time of day and year from the time column, as they may be useful predictors.

### The week column

In [None]:
# df_train['week'] = df_train['time'].apply(lambda x: x.isocalendar()[1])

In [None]:
df_train['hour'] = df_train['time'].apply(lambda x: x.hour)

In [None]:
df_test['hour'] = df_test['time'].apply(lambda x: x.hour)

In [None]:
import re

def add_datepart(df, fldname, drop=True, time=False):
    "Helper function that adds columns relevant to a date (modifies df in place)."
    fld = df[fldname]
    # Parse the column if it is not already a datetime (naive or timezone-aware)
    if not pd.api.types.is_datetime64_any_dtype(fld):
        df[fldname] = fld = pd.to_datetime(fld)
    # Prefix for the new columns, e.g. "time" -> "timeYear", "timeMonth", ...
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
            'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    if time:
        attr = attr + ['Hour', 'Minute', 'Second']
    for n in attr:
        if n == 'Week':
            # Series.dt.week was removed in pandas 2.0; use the ISO calendar week instead
            df[targ_pre + n] = fld.dt.isocalendar().week.astype('int64')
        else:
            df[targ_pre + n] = getattr(fld.dt, n.lower())
    # Seconds elapsed since the Unix epoch
    df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
    if drop:
        df.drop(fldname, axis=1, inplace=True)

In [None]:
add_datepart(df_train, "time", drop=True, time=False)

In [None]:
add_datepart(df_test, "time", drop=True, time=False)

In [None]:
# df_test['week'] = df_test['time'].apply(lambda x: x.isocalendar()[1])

In [None]:
# df_train['week'].unique()

In [None]:
# df_test['week'].unique()

### The time_of_day column

In [None]:
# df_train['hour'].unique()

In [None]:
# df_test['hour'].unique()

### The year column

In [None]:
# df_train['year'] = df_train['time'].apply(lambda x: x.year)

In [None]:
# df_test['year'] = df_test['time'].apply(lambda x: x.year)

In [None]:
# df_train['year'].unique()

In [None]:
# df_test['year'].unique()

### The Quarter column

In [None]:
# df_train['quarter'] = df_train['time'].dt.quarter

In [None]:
# df_test['quarter'] = df_test['time'].dt.quarter

In [None]:
# df_train['quarter'].unique()

In [None]:
# df_test['quarter'].unique()

### The Valencia_wind_deg Column
The next object in the dataset is the "Valencia_wind_deg". This is recorded as levels which are denoted by the string "level" followed by a number which describes that particular level. To encode this, we can simply extract the number from the column that identifies that level. This line of code can help us to achieve that.

In [None]:
df_train['Valencia_wind_deg']

In [None]:
df_test['Valencia_wind_deg']

In [None]:
df_train['Valencia_wind_deg'] = df_train['Valencia_wind_deg'].str.extract(r'(\d+)', expand=False)
df_train['Valencia_wind_deg']

In [None]:
df_test['Valencia_wind_deg'] = df_test['Valencia_wind_deg'].str.extract(r'(\d+)', expand=False)

In [None]:
df_test['Valencia_wind_deg']

As you can see above, the data has been reduced to a number without the string "level" in front of it, but there is still a problem: the data type is still an object. We can convert it to numeric form by using the pandas to_numeric function.

In [None]:
df_train['Valencia_wind_deg'] = pd.to_numeric(df_train['Valencia_wind_deg'])

In [None]:
df_train.Valencia_wind_deg

In [None]:
df_test['Valencia_wind_deg'] = pd.to_numeric(df_test['Valencia_wind_deg'])
df_test.Valencia_wind_deg

Repeat the process for Seville_pressure

In [None]:
df_train['Seville_pressure'] = df_train['Seville_pressure'].str.extract(r'(\d+)', expand=False)

In [None]:
df_test['Seville_pressure'] = df_test['Seville_pressure'].str.extract(r'(\d+)', expand=False)

In [None]:
df_train['Seville_pressure']

In [None]:
df_test['Seville_pressure']

In [None]:
df_train['Seville_pressure'] = pd.to_numeric(df_train['Seville_pressure'])
df_train.Seville_pressure

In [None]:
df_test['Seville_pressure'] = pd.to_numeric(df_test['Seville_pressure'])
df_test.Seville_pressure
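
The extract-then-convert steps used for both columns can be collapsed into one expression. A minimal sketch on made-up labels (the strings in `raw` are hypothetical examples of the "prefix plus number" pattern):

```python
import pandas as pd

# Hypothetical labels following the "prefix + number" pattern
raw = pd.Series(["level_5", "level_10", "sp25"])

# Pull out the first run of digits, then convert the strings to integers
numbers = pd.to_numeric(raw.str.extract(r"(\d+)", expand=False))
print(numbers.tolist())
```

With `expand=False`, `str.extract` returns a Series and not a one-column DataFrame, which keeps the column assignment unambiguous.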

### Variable Selection by Correlation and Significance

The code below will create a new DataFrame and store the correlation coefficients and p-values in that DataFrame for reference.

In [None]:
# Calculate correlations between predictor variables and the response variable
corrs = df_train[df_train['load_shortfall_3h'].notnull()].corr()['load_shortfall_3h'].sort_values(ascending=False)

In [None]:
from scipy.stats import pearsonr

# Build a dictionary of correlation coefficients and p-values
dict_cp = {}

# Rows with a known response value (selected once instead of on every iteration)
known = df_train[df_train['load_shortfall_3h'].notnull()]

column_titles = [col for col in corrs.index if col != 'load_shortfall_3h']
for col in column_titles:
    p_val = round(pearsonr(known[col], known['load_shortfall_3h'])[1], 6)
    dict_cp[col] = {'Correlation_Coefficient': corrs[col],
                    'P_Value': p_val}

df_cp = pd.DataFrame(dict_cp).T
df_cp_sorted = df_cp.sort_values('P_Value')
df_cp_sorted[df_cp_sorted['P_Value'] < 0.1]

All the features seem to be statistically significant at the 0.1 level.

However, we also need to look for pairs of predictor variables that are highly correlated with each other, in order to avoid multicollinearity.

In [None]:
# plot_corr is provided by statsmodels
from statsmodels.graphics.correlation import plot_corr

fig = plt.figure(figsize=(15,15));
ax = fig.add_subplot(111);
plot_corr(df_train.corr(), xnames = df_train.corr().columns, ax = ax);

In [None]:
X_names = [col for col in df_train.columns if col != 'load_shortfall_3h']

In [None]:
dfCorr = df_train[X_names].corr()
filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr !=1.000)]
plt.figure(figsize=(30,10))
sns.heatmap(filteredDf, annot=True, cmap="Reds")
plt.show()

From the plot we can see that quite a number of variables are highly correlated with each other.

In [None]:
# get the correlation matrix
corr = df_train[X_names].corr()

# mask away the lower triangle and diagonal
mask = np.triu(np.ones_like(corr),1) == 1

# get the upper triangle (excluding diagonal) by masking and stack:
corr = corr.where(mask).stack()

# 40 largest by absolute value
top40 = corr.abs().nlargest(40)
top40

### Dropping the noise
We need to drop columns that are not useful to the model. For now that would be the "Unnamed: 0" column, which is only a row index. The raw "time" column has already been replaced by the numeric date parts created with add_datepart above, which is the encoding that makes the time data usable by the model. Once the drops further below have run, the .head function confirms that the unwanted columns are gone and the data is ready for model building.

### Feature Selection with Correlation
In this step we will remove features that are highly correlated with other features.

In [None]:
# importing libraries
# (load_boston was removed from scikit-learn 1.2 and is not used in this notebook)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
X_names = [col for col in df_train.columns if col != 'load_shortfall_3h']
X = df_train[X_names]
y = df_train['load_shortfall_3h']



In [None]:
# separate dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=0)

X_train.shape, X_test.shape

In [None]:
X_train.corr()

In [None]:
import seaborn as sns
#Using Pearson Correlation
plt.figure(figsize=(15,15))
cor = X_train.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.CMRmap_r)
plt.show()

In [None]:
# with the following function we can select highly correlated features
# for every correlated pair it flags the later of the two columns for removal

def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

In [None]:
corr_features = correlation(X_train, 0.85)
len(set(corr_features))

In [None]:
corr_features
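
A quick check on made-up data shows why the absolute value matters when flagging correlated columns. The helper `correlated_columns` and the toy column names are hypothetical; the helper follows the same pairwise idea as the function above.

```python
import pandas as pd

def correlated_columns(frame, threshold):
    """Return the later column of every pair whose |correlation| exceeds threshold."""
    corr_matrix = frame.corr().abs()
    flagged = set()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if corr_matrix.iloc[i, j] > threshold:
                flagged.add(corr_matrix.columns[i])
    return flagged

# Hypothetical toy features
toy = pd.DataFrame({"temp": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["temp_max"] = toy["temp"] + 0.5     # moves exactly with temp
toy["neg_temp"] = -toy["temp"]          # moves exactly against temp
toy["wind"] = [2.0, 1.0, 4.0, 1.0, 2.0]  # unrelated to temp

flagged = correlated_columns(toy, 0.85)
print(sorted(flagged))
```

Both `temp_max` and `neg_temp` are flagged, while `wind` is kept. Without the absolute value, the strongly negative pair would be missed.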

In [None]:
# X_train.drop(corr_features,axis=1)
# X_test.drop(corr_features,axis=1)

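# 'time' was already dropped by add_datepart and 'year' was never created,
# so ignore missing labels instead of raising a KeyError
df_train = df_train.drop(
    ['Unnamed: 0' , 'time', 'year'], axis = 1, errors = 'ignore')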

In [None]:
# df_train = df_train.drop(
#     ['Unnamed: 0' , 'Madrid_temp_max', 'Bilbao_temp_max', 'Barcelona_temp_max', 'Valencia_temp_max', 'Seville_temp_max', 
#     'Madrid_temp_min', 'Bilbao_temp_min', 'Barcelona_temp_min', 'Valencia_temp_min', 'Seville_temp_min'], axis = 1)

# errors='ignore' keeps this cell from failing if 'Unnamed: 0' was already dropped above
df_train = df_train.drop(['Unnamed: 0' ,"Barcelona_temp",
 'Barcelona_temp_min',
 'Bilbao_temp',
 'Bilbao_temp_max',
 'Madrid_temp',
 'Madrid_temp_min',
 'Seville_temp_min',
 'Valencia_temp',
 'Valencia_temp_min',
 'timeDayofyear',
 'timeElapsed',
 'timeWeek',
 'timeYear'],axis =1, errors='ignore')

In [None]:
# df_test = df_test.drop(
#     ['Unnamed: 0' , 'Madrid_temp_max', 'Bilbao_temp_max', 'Barcelona_temp_max', 'Valencia_temp_max', 'Seville_temp_max', 
#     'Madrid_temp_min', 'Bilbao_temp_min', 'Barcelona_temp_min', 'Valencia_temp_min', 'Seville_temp_min'], axis = 1)
df_test = df_test.drop(['Unnamed: 0' ,"Barcelona_temp",
 'Barcelona_temp_min',
 'Bilbao_temp',
 'Bilbao_temp_max',
 'Madrid_temp',
 'Madrid_temp_min',
 'Seville_temp_min',
 'Valencia_temp',
 'Valencia_temp_min',
 'timeDayofyear',
 'timeElapsed',
 'timeWeek',
 'timeYear'],axis =1)

In [None]:
df_train.shape

In [None]:
df_train.head()

### Regularisation

In [None]:
# #Applying it on the data set
# clean_dataset(df_train)

In [None]:
df_train.head()

In [None]:
# Separate the features from the response
X_names = [col for col in df_train.columns if col != 'load_shortfall_3h']
X = df_train[X_names]
y = df_train['load_shortfall_3h']



In [None]:
# from sklearn.linear_model import Lasso
# from sklearn.model_selection import GridSearchCV
# lasso=Lasso()
# parameters={'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,0.01,1,5,10,20,30,35,40,45,50,55,100]}
# lasso_regressor=GridSearchCV(lasso,parameters,scoring='neg_mean_squared_error',cv=5)

# lasso_regressor.fit(X,y)
# print(lasso_regressor.best_params_)
# print(lasso_regressor.best_score_)

In [None]:
# Import the scaling module
from sklearn.preprocessing import StandardScaler

In [None]:
# Create standardization object
scaler = StandardScaler()

In [None]:
# Save standardized features into new variable
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X_names)
X_scaled.head()

In [None]:
# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, 
                                                    y, 
                                                    test_size=0.20,
                                                    random_state=1,
                                                    shuffle=False)
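
Note that the cells above standardise all rows before splitting, so the test rows influence the scaling. A common alternative, sketched here on made-up data (all names and values are hypothetical), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features and response
X = pd.DataFrame({"a": np.arange(10, dtype=float),
                  "b": np.arange(10, dtype=float) * 100})
y = pd.Series(np.arange(10, dtype=float))

# Split first (no shuffling, as in the cell above)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)

# Apply the same transformation to both parts
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), columns=X.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X.columns, index=X_te.index)
```

The same fitted `scaler` would then also be used to transform df_test before making predictions for submission.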

In [None]:
# get the correlation matrix
corr = X_scaled.corr()

# mask away the lower triangle and diagonal
mask = np.triu(np.ones_like(corr),1) == 1

# get the upper triangle (excluding diagonal) by masking and stack:
corr = corr.where(mask).stack()

# 60 largest by absolute value
top60 = corr.abs().nlargest(60)
top60

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three hour load shortfall. |

---

### Data splitting
The merge of the train and test data sets was left commented out, so df_train still holds only the labelled training rows and df_test holds the rows to be predicted for submission. The code below separates the response from the scaled features.

In [None]:
# split data
y = df_train[['load_shortfall_3h']]  # response, kept as a one-column DataFrame
x = X_scaled                         # standardised features

x_train = df_train.drop('load_shortfall_3h',axis=1)

# Ignore for now. Without the merge this slice is empty; x_train and x_test
# are reassigned by train_test_split further below
x_test = df_train[len(df_train):].drop('load_shortfall_3h',axis=1) 
# x_test = df_test 

In [None]:
x.head()

In [None]:
y.head()

### Load Your Model
You are now ready to set up your models. Instead of a single basic model such as linear regression, this notebook builds a stacked ensemble (a "super learner"): seven base models (XGBoost, ElasticNet, k-nearest neighbours, AdaBoost, bagging, random forest and extra trees) are trained with k-fold cross validation, and a Lasso meta-model learns how to combine their out-of-fold predictions. You are encouraged to try out other models and compare how they perform so as to choose the one that performs best. Swapping a model in or out is as easy as editing the list returned by get_models in the cell below and tweaking the hyperparameters.

In [None]:

# create a list of base-models
def get_models():
	models = list()
	models.append(XGBRegressor(booster="gbtree",eta=0.2,eval_metric= "rmse", n_estimators=1000))
	models.append(ElasticNet())
	models.append(KNeighborsRegressor())
	models.append(AdaBoostRegressor())
	models.append(BaggingRegressor(n_estimators=300))
	models.append(RandomForestRegressor(n_estimators=10))
	models.append(ExtraTreesRegressor(n_estimators=300))
	return models
 
# collect out of fold predictions from k-fold cross validation
def get_out_of_fold_predictions(X, y, models):
	meta_X, meta_y = list(), list()
	# define split of data (seeded so that the folds are reproducible)
	kfold = KFold(n_splits=10, shuffle=True, random_state=1)
	# enumerate splits
	for train_ix, test_ix in kfold.split(X):
		fold_y_pred = list()
		# get data
		train_X, test_X = X.iloc[train_ix], X.iloc[test_ix]
		train_y, test_y = y.iloc[train_ix], y.iloc[test_ix]
		meta_y.extend(test_y.values.ravel())
		# fit and make predictions with each sub-model
		for model in models:
			model.fit(train_X, train_y.values.ravel())
			y_pred = model.predict(test_X)
			# store columns
			fold_y_pred.append(y_pred.reshape(len(y_pred),1))
		# store fold y_pred as columns
		meta_X.append(hstack(fold_y_pred))
	return vstack(meta_X), asarray(meta_y)
 
# fit all base models on the training dataset
def fit_base_models(X, y, models):
	for model in models:
		model.fit(X, y.values.ravel())
 
# fit a meta model
def fit_meta_model(X, y):
	model = Lasso(alpha=1e-15)
	model.fit(X, y)
	return model
 
# evaluate a list of models on a dataset
def evaluate_models(X, y, models):
	for model in models:
		y_pred = model.predict(X)
		mse = mean_squared_error(y, y_pred)
		print('%s: RMSE %.3f' % (model.__class__.__name__, sqrt(mse)))
 
# make predictions with stacked model
def super_learner_predictions(X, models, meta_model):
	meta_X = list()
	for model in models:
		y_pred = model.predict(X)
		meta_X.append(y_pred.reshape(len(y_pred),1))
	meta_X = hstack(meta_X)
	# predict
	return meta_model.predict(meta_X)
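
scikit-learn also ships a ready-made stacking estimator that follows the same idea as the hand-written functions above: the meta-model is trained on out-of-fold predictions of the base models. The sketch below is an alternative under stated assumptions, not the notebook's method: it uses synthetic data from `make_regression`, only three of the base models, and a hypothetical `alpha` for the meta-model.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the scaled features and response
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Base models plus a Lasso meta-model fitted on 5-fold out-of-fold predictions
stack = StackingRegressor(
    estimators=[
        ("enet", ElasticNet()),
        ("knn", KNeighborsRegressor()),
        ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
    ],
    final_estimator=Lasso(alpha=1e-3),
    cv=5,
)
stack.fit(X_tr, y_tr)
y_hat = stack.predict(X_te)
```

After fitting, `stack.estimators_` holds the base models refitted on the full training data, which mirrors the `fit_base_models` step above.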

### Training your model
The test size represents the proportion of the data that is set aside as the "Test data". With the test size set at 0.20, you are telling the algorithm to use 80% of the data to train the model and 20% to test it. This value can be set at any figure that the model builder chooses.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.20, random_state=1)
print('Train', x_train.shape, y_train.shape, 'Test', x_test.shape, y_test.shape)

In [None]:
# get models
models = get_models()

In [None]:
# get out of fold predictions
meta_X, meta_y = get_out_of_fold_predictions(x_train, y_train, models)
print('Meta ', meta_X.shape, meta_y.shape)

In [None]:
# fit base models
fit_base_models(x_train, y_train, models)
# fit the meta model
meta_model = fit_meta_model(meta_X, meta_y)

In [None]:
# evaluate base models
evaluate_models(x_test, y_test, models)

In [None]:
# evaluate meta model
y_pred = super_learner_predictions(x_test, models, meta_model)
print('Super Learner: RMSE %.3f' % (sqrt(mean_squared_error(y_test, y_pred))))

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

### The Root Mean squared Error (RMSE)
The root mean squared error (RMSE) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the actual values observed. It is a very useful tool in telling how well your model predicted the values in the test data set. Below is a function that calculates and returns the RMSE of the model's predictions.


In [None]:
def rmse(y_test, y_predict):
    return np.sqrt(mean_squared_error(y_test, y_predict))

In [None]:
rmse(y_test, y_pred)

This means that the predictions of your model typically deviated from the actual values by about 4858.88. You can better appreciate the implications of this value when you compare it to the mean of y_train, the actual values used in training the model. As a rough guide, a load_shortfall_3h value that is actually 10,000 could typically have been predicted anywhere in the range 10,000 ± 4,859, that is, from roughly 5,100 to 14,900. Individual errors can be larger still, because the RMSE is an average and not an upper bound. Not too good a performance, you might say.

In [None]:
y_train.mean()
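One way to put an RMSE in context is to compare it with a naive baseline that always predicts the training mean, and to express it as a share of that mean. The sketch below does this with NumPy only. The toy arrays and the `rmse_np` helper are hypothetical and are not taken from the competition data.

```python
import numpy as np

def rmse_np(y_true, y_hat):
    """RMSE computed with NumPy only (hypothetical helper for this sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

# Hypothetical toy values, NOT the competition data
y_train_toy = np.array([8000.0, 12000.0, 10000.0, 14000.0])
y_test_toy = np.array([9000.0, 11000.0, 13000.0])
y_pred_toy = np.array([9500.0, 10000.0, 12500.0])

# Naive baseline: always predict the mean of the training target
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())

model_rmse = rmse_np(y_test_toy, y_pred_toy)
baseline_rmse = rmse_np(y_test_toy, baseline_pred)

print(f"model RMSE:    {model_rmse:.1f}")
print(f"baseline RMSE: {baseline_rmse:.1f}")
print(f"model RMSE as a share of the training mean: {model_rmse / y_train_toy.mean():.1%}")
```

A model whose RMSE is not clearly below the baseline RMSE is adding little beyond predicting the average.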

### The r squared score
Another metric that is useful in assessing model performance is the R squared score. This score measures the proportion of the variance in the target that your model's predictions explain: 1 is a perfect fit, and 0 is no better than always predicting the mean. We import this metric from sklearn with the code below. The r2_score of our model reveals that it explains only about 13.4% of the variance in load_shortfall_3h. A rather poor performance again, but bear in mind that this is a very basic model that has not yet been optimized.

In [None]:
# Another model evaluation metric that tells you how well your model performs
from sklearn.metrics import r2_score

In [None]:
r2_score(y_test, y_pred)
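R squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below shows that calculation on hypothetical toy values. `r2_by_hand` is an illustrative helper, not part of sklearn.

```python
import numpy as np

def r2_by_hand(y_true, y_hat):
    """R squared as 1 - SS_res / SS_tot (hypothetical helper for this sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Hypothetical toy values, NOT the competition data
y_true_toy = np.array([9000.0, 11000.0, 13000.0])
y_hat_toy = np.array([9500.0, 10000.0, 12500.0])

r2_model = r2_by_hand(y_true_toy, y_hat_toy)
# Always predicting the mean leaves all of the variation unexplained
r2_mean = r2_by_hand(y_true_toy, np.full(3, y_true_toy.mean()))

print("toy model R squared:", r2_model)
print("mean-only R squared:", r2_mean)
```

Note that R squared can be negative on a test set when a model does worse than predicting the mean.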

### How to Improve your Model and Optimize performance
Consider taking these steps to generate better models and enhance their performance.
* Better Model: This is just a basic linear regression model. Search online for other regression models and try them out to see whether they improve your performance metrics.

* Better Features: As you know, we dropped the date_time feature, which might be a very helpful feature for improving model performance. Try to bring it back into your model by working through the steps necessary to encode the "time" data in a form your model can use. It will also be helpful to drop highly correlated features from the model.

* Hyperparameter tuning: All models come with default hyperparameter values that can be adjusted to suit the data set and improve model performance. Be careful when modifying hyperparameters, though, as this may also have a negative impact on the performance of your model.

* Cross validation: Use cross validation to get a more reliable estimate of model performance and to guide model selection (refer to the cross validation train).
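As an illustration of the "Better Features" suggestion, the sketch below turns a timestamp column into numeric features that a regression model can consume. It assumes the column is named `time`, as in the submission cells of this notebook. The `add_time_features` helper and the toy frame are hypothetical.

```python
import pandas as pd

def add_time_features(df, time_col="time"):
    """Return a copy of df with the timestamp column expanded into numeric parts."""
    out = df.copy()
    ts = pd.to_datetime(out[time_col])
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    out["weekday"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
    # Drop the raw timestamp, which most regressors cannot use directly
    return out.drop(columns=[time_col])

# Hypothetical toy rows, NOT the competition data
toy = pd.DataFrame({
    "time": ["2015-01-01 03:00:00", "2015-06-15 18:00:00"],
    "pressure": [1012.0, 1008.5],
})
features = add_time_features(toy)
print(features)
```

The same function would need to be applied to both the training and the test frames so that their columns stay aligned.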

***(Please note that you can and should improve your notebook by including extensive relevant visualizations as a tool for your EDA. Also, this list of recommendations is by no means exhaustive. You are at liberty to research and apply other strategies to improve the performance of your model and make your presentation better.)***
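Hyperparameter tuning and cross validation are often combined in a grid search. The following is a minimal sketch using sklearn's `GridSearchCV` on synthetic data. The `Ridge` model, the `alpha` grid and the toy data are assumptions chosen for illustration and are not the models used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data (hypothetical, NOT the competition data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=120)

# Try each candidate alpha with 5-fold cross validation, scored by RMSE
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, so RMSE is negated
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_toy, y_toy)

print("best alpha:", search.best_params_["alpha"])
print("cross-validated RMSE:", -search.best_score_)
```

Because the shortfall data is ordered in time, a splitter such as sklearn's `TimeSeriesSplit` may be a better fit than shuffled folds. That choice is left open here.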

### Making a Kaggle submission
After you are done creating your model, you can make a Kaggle submission from your model's results by following these steps.


In [None]:
#Designate the dataframes to be used for model training and testing
x_train = df_train.drop('load_shortfall_3h', axis=1)
x_test = df_test

In [None]:
#Fit your models and make your predictions

# fit base models
fit_base_models(x_train, y, models)
# fit the meta model
meta_model = fit_meta_model(meta_X, meta_y)
# make predictions on the test set with the super learner
preds = super_learner_predictions(x_test, models, meta_model)


In [None]:
#confirm that your predictions have actually been generated
daf=pd.DataFrame(preds, columns=['load_shortfall_3h'])
daf.head()

In [None]:
# Reload the test set so that its 'time' column is available for the submission file
test_data = df_test = pd.read_csv('df_test.csv')
df_test.head()

In [None]:
#Run this code to generate a .csv file of your submission
output = pd.DataFrame({'time':df_test['time']})
submission2 = output.join(daf)
submission2.to_csv('submission2.csv', index=False)

In [None]:
submission2
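Before uploading, it may be worth checking that the submission has one prediction per test row and no missing values. Because `join` aligns on the index, a shorter predictions frame silently produces NaN rows. The sketch below uses a hypothetical `check_submission` helper and toy frames. The expected column layout (`time`, `load_shortfall_3h`) is assumed from the cells above.

```python
import pandas as pd

def check_submission(submission, n_expected, target="load_shortfall_3h"):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(submission.columns) != ["time", target]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
    if len(submission) != n_expected:
        problems.append(f"expected {n_expected} rows, found {len(submission)}")
    if submission.isna().any().any():
        problems.append("missing values present")
    return problems

# Hypothetical toy frames, NOT the competition data
toy_test = pd.DataFrame({"time": ["2018-01-01 00:00:00", "2018-01-01 03:00:00"]})
toy_preds = pd.DataFrame([9500.0, 10250.0], columns=["load_shortfall_3h"])
toy_submission = toy_test[["time"]].join(toy_preds)

# A predictions frame that is one row short leaves a NaN after the join
short_preds = pd.DataFrame([9500.0], columns=["load_shortfall_3h"])
bad_submission = toy_test[["time"]].join(short_preds)

print(check_submission(toy_submission, n_expected=len(toy_test)))
print(check_submission(bad_submission, n_expected=len(toy_test)))
```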

### The entry submission process
To run a submission on Kaggle, all of the steps involved in creating your model, as described above, should have been completed on Kaggle. Once your submission file has been generated and you are happy with the results, locate the Save Version button in the top right corner of the Kaggle page. Click the Save & Run All (Commit) option that appears at the center of your screen, and finally click Save to save the output. When you click on the Code button, you will find a file called out-put. Clicking on this file will reveal the .csv file generated by your notebook. You can now submit this file as an entry on Kaggle, or download it if you prefer.

Alternatively, you can submit by downloading the .csv file, going to the leaderboard, clicking Submit, and uploading your submission file from wherever it is saved on your local machine. Your entry will be evaluated immediately and you will be told your score.

***Note that you can make multiple submissions, as many times as you wish, before the official closing date. Each time, your entry will be evaluated and you will be told your score.***

In [None]:
# create one or more ML models


In [None]:
# evaluate one or more ML models

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss the chosen method's logic