# Regression-based neural networks - Using Keras and TensorFlow - US Economics Dataset
Keras is a high-level application programming interface (API) for building and training neural networks. It runs on top of TensorFlow, which was developed by Google. Keras is recognized as one of the most popular deep learning libraries in Python for research and development because of its ease of use and simplicity, while Scikit-learn is the most popular library for general machine learning in Python. Building a very complex deep learning network can be challenging, but with Keras it can often be achieved with only a few lines of code.

I will use the Keras library to build a regression model on the US economic time series dataset. The dataset can be downloaded and saved as a CSV file from https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/economics.csv or https://github.com/tidyverse/ggplot2/blob/master/data-raw/economics.csv.

The data set contains 574 rows and 6 variables.

- Date - The date the data was recorded
- Psavert - Personal savings rate.
- Pce     - Personal consumption expenditures, in billions of dollars.
- Uempmed - Median duration of unemployment, in weeks.
- Pop     - Total population, in thousands.
- Unemploy- Number of unemployed in thousands (dependent variable). 

# Deep Learning Neural Network
The basic architecture of the deep learning neural network, which we will be following, consists of three main components.

1) Input Layer: This is where the training observations are fed. The number of predictor variables is also specified here through the neurons.

2) Hidden Layers: These are the intermediate layers between the input and output layers. The deep neural network learns about the relationships involved in data in this component.

3) Output Layer: This is the layer where the final output is extracted from what’s happening in the previous two layers. In the case of regression problems, the output layer will have one neuron.
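
The three components can be sketched as a single forward pass in NumPy. This is an illustrative sketch only: the hidden layer size, the random weights and the names `relu`, `W1`, `b1`, `W2`, `b2` and `X_batch` are assumptions made for the example, and no training takes place.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: keeps positive values, sets negatives to zero."""
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
n_features, n_hidden = 4, 8          # 4 predictors, 8 hidden neurons (assumed size)

# Randomly initialised weights and zero biases (hypothetical, untrained values)
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)

X_batch = rng.random((3, n_features))  # input layer: 3 observations, 4 predictors
hidden = relu(X_batch @ W1 + b1)       # hidden layer: linear step followed by ReLU
output = hidden @ W2 + b2              # output layer: one neuron, no activation

print(hidden.shape, output.shape)      # (3, 8) (3, 1)
```

Each observation produces exactly one number at the output, which is what a regression network needs.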

# Problem Statement

Across the nations of the world, unemployment has become a major socio-economic and political problem. Each government has its own way of managing it, but alongside managing unemployment within an economy, it is also very important to be able to predict it. I will build a deep learning regression model using Keras to predict unemployment.

# Model Evaluation Metric

The performance of the model will be evaluated using the Root Mean Squared Error (RMSE), which is a commonly used metric when evaluating regression problems. RMSE measures the average magnitude of the residuals, or errors. Mathematically, it is computed as the square root of the average of the squared differences between predicted and actual values.
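
As a small worked example of that definition, the sketch below computes RMSE by hand with NumPy. The helper name `rmse` and the three predicted values are made up for illustration; the actual values are the first three 'unemploy' figures from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = [2944, 2945, 2958]       # first three 'unemploy' values
predicted = [2940, 2950, 2960]    # hypothetical predictions

# Errors are 4, -5, -2; squares are 16, 25, 4; mean is 15; root is about 3.873
score = rmse(actual, predicted)
print(round(score, 3))            # 3.873
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones.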

# Step 1 - To Load the Required Python Libraries and Modules 

In [74]:
# Import required libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import sklearn

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

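# scikit-learn utilities
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Keras specific (imported through the public tensorflow.keras API)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Note: the KerasRegressor scikit-learn wrapper is not used in this notebook,
# and its module is no longer available in recent TensorFlow releases, so it is not imported.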

# Step 2 - To read the Data and Conduct Basic Data Checks

In [75]:
# The first line of code reads in the data as pandas dataframe

data = pd.read_csv('economics.csv')


In [76]:
data.head()

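|   | date | pce | pop | psavert | uempmed | unemploy |
|---|------|-----|-----|---------|---------|----------|
| 0 | 01/07/1967 | 506.7 | 198712.0 | 12.6 | 4.5 | 2944 |
| 1 | 01/08/1967 | 509.8 | 198911.0 | 12.6 | 4.7 | 2945 |
| 2 | 01/09/1967 | 515.6 | 199113.0 | 11.9 | 4.6 | 2958 |
| 3 | 01/10/1967 | 512.2 | 199311.0 | 12.9 | 4.9 | 3143 |
| 4 | 01/11/1967 | 517.4 | 199498.0 | 12.8 | 4.7 | 3066 |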


In [77]:
# View the shape (rows, columns) of the data

data.shape

(574, 6)

The dataset contains 574 rows/observations and 6 columns/variables.

In [78]:
# this shows the summary statistics of the numerical variables

data.describe()

|   | pce | pop | psavert | uempmed | unemploy |
|---|-----|-----|---------|---------|----------|
| count | 574.0 | 574.0 | 574.0 | 574.0 | 574.0 |
| mean | 4820.092683 | 257159.652662 | 8.567247 | 8.608711 | 7771.310105 |
| std | 3556.803613 | 36682.398508 | 2.964179 | 4.106645 | 2641.95918 |
| min | 506.7 | 198712.0 | 2.2 | 4.0 | 2685.0 |
| 25% | 1578.3 | 224896.0 | 6.4 | 6.0 | 6284.0 |
| 50% | 3936.85 | 253060.0 | 8.4 | 7.5 | 7494.0 |
| 75% | 7626.325 | 290290.75 | 11.1 | 9.1 | 8685.5 |
| max | 12193.8 | 320402.295 | 17.3 | 25.2 | 15352.0 |


The statistical summary shows a 'count' of 574 for every variable, which is the same as the number of records in the dataset. In this case, there are no missing values.

However, in real-life data sets, especially large ones, there will often be missing values.
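
When missing values are present, a per-column count is a common first check. The sketch below uses a small made-up DataFrame (the name `toy` and the placement of the gaps are assumptions for illustration, not part of the economics data) to show how such a check could look with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with two deliberately missing entries
toy = pd.DataFrame({
    "pce": [506.7, np.nan, 515.6],
    "uempmed": [4.5, 4.7, np.nan],
    "unemploy": [2944, 2945, 2958],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One simple option: keep only the rows without any missing value
complete_rows = toy.dropna()
print(complete_rows.shape)  # (1, 3)
```

Dropping rows is only one option; filling gaps with a column statistic is another, and the better choice depends on the data.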

In [79]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 574 entries, 0 to 573
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      574 non-null    object 
 1   pce       574 non-null    float64
 2   pop       574 non-null    float64
 3   psavert   574 non-null    float64
 4   uempmed   574 non-null    float64
 5   unemploy  574 non-null    int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 27.0+ KB


The date column is not used as a predictor in this model, so we need to drop it.

In [80]:
data = data.drop(['date'], axis=1)

In [81]:
data.head()

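|   | pce | pop | psavert | uempmed | unemploy |
|---|-----|-----|---------|---------|----------|
| 0 | 506.7 | 198712.0 | 12.6 | 4.5 | 2944 |
| 1 | 509.8 | 198911.0 | 12.6 | 4.7 | 2945 |
| 2 | 515.6 | 199113.0 | 11.9 | 4.6 | 2958 |
| 3 | 512.2 | 199311.0 | 12.9 | 4.9 | 3143 |
| 4 | 517.4 | 199498.0 | 12.8 | 4.7 | 3066 |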


# Step 3 - To Create Arrays for the Features and the Response Variable

Here, we identify the target variable/column and define the feature columns, that is, the rest of the columns, which are used for prediction.

In [82]:
# To list all the variables/columns in the dataset

data.columns

Index(['pce', 'pop', 'psavert', 'uempmed', 'unemploy'], dtype='object')

In [83]:
# To create an object of the target variable

target_column = ['unemploy'] 

# To list all the features, excluding the target variable 'unemploy'
# (a list comprehension keeps the original column order, which a set would not guarantee)
predictors = [col for col in data.columns if col not in target_column]

# Normalization of the Predictor Columns
Next, we need to normalize the predictors. The units of the variables differ significantly, as the output of the 'describe()' function above shows, and this could influence the modeling process, so it is important to bring the columns used for prediction onto a common scale. Here we divide each predictor by its maximum value, so that every predictor has a maximum of 1 and, because all values are positive, lies between 0 and 1, as shown below:

In [84]:
data[predictors] = data[predictors]/data[predictors].max()


Now we display the summary of the normalized data set. All the independent variables (the predictors, 'X') have now been scaled to lie between 0 and 1. However, the target variable remains unchanged.

In [85]:
data.describe()

|   | pce | pop | psavert | uempmed | unemploy |
|---|-----|-----|---------|---------|----------|
| count | 574.0 | 574.0 | 574.0 | 574.0 | 574.0 |
| mean | 0.39529 | 0.802615 | 0.495217 | 0.341616 | 7771.310105 |
| std | 0.29169 | 0.114489 | 0.17134 | 0.162962 | 2641.95918 |
| min | 0.041554 | 0.620195 | 0.127168 | 0.15873 | 2685.0 |
| 25% | 0.129435 | 0.701918 | 0.369942 | 0.238095 | 6284.0 |
| 50% | 0.322857 | 0.78982 | 0.485549 | 0.297619 | 7494.0 |
| 75% | 0.625426 | 0.90602 | 0.641618 | 0.361111 | 8685.5 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 15352.0 |
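
Note that the minimum of each scaled predictor in the summary is above 0. Dividing by the maximum fixes only the upper end of the range at 1. The sketch below contrasts this with min-max scaling, the formula that scikit-learn's `MinMaxScaler` applies by default, on a tiny made-up column. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"x": [2.0, 4.0, 8.0]})   # hypothetical positive-valued predictor

# Max scaling, as used in this notebook: the largest value becomes 1
max_scaled = toy / toy.max()

# Min-max scaling: the smallest value becomes 0 and the largest becomes 1
min_max_scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(max_scaled["x"].tolist())       # [0.25, 0.5, 1.0]
print(min_max_scaled["x"].tolist())   # starts at 0.0 and ends at 1.0
```

Either choice puts the predictors on a comparable scale. Which one works better for a given model is an empirical question.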


# Step 4 - Creating the Training and Test Datasets

The data has to be split into training and testing data (i.e. 70% training data and 30% testing data).

In [86]:
# To create arrays of independent (X) and dependent (y) variables, respectively.

X = data[predictors].values
y = data[target_column].values


# To split the data set into training and test dataset ( divided into 70% training and 30% testing data set)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)


# To print the shape of the both training data set and Test data set
print(X_train.shape); print(X_test.shape)

(401, 4)
(173, 4)


As shown above, the training set has 401 observations of 4 variables, while the test set has 173 observations of 4 variables.
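
The sizes 401 and 173 follow from the 30% test fraction: 0.30 × 574 = 172.2, which is rounded up to 173 test rows, leaving 401 for training. The NumPy-only sketch below reproduces that arithmetic with a manual shuffled split. It is an illustration of the idea rather than a re-implementation of `train_test_split`, and the seed here will not reproduce the same rows as `random_state=40`.

```python
import math
import numpy as np

n_samples, test_size = 574, 0.30

# The test size is rounded up; the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Shuffle the row indices, then cut them into test and train parts
rng = np.random.default_rng(40)
indices = rng.permutation(n_samples)
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_demo = np.zeros((n_samples, 4))           # placeholder array with 4 predictors
X_demo_train, X_demo_test = X_demo[train_idx], X_demo[test_idx]

print(X_demo_train.shape, X_demo_test.shape)  # (401, 4) (173, 4)
```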

# Step 5 - Building the Deep Learning Regression Model

To build the regression model, the deep learning library Keras will be used. The activation function used for the hidden layers in the neural network is the "Rectified Linear Unit", or ReLU. ReLU is one of the most widely used activation functions because it is non-linear and does not activate all the neurons at the same time: it outputs zero for negative inputs. This implies that at any time only some of the neurons are activated, which makes the network sparse and efficient. We will first define the model using the Sequential class, because the network consists of a linear stack of layers. The first layer added specifies the activation function and the number of input dimensions. We then repeat the same process for the remaining hidden layers, this time without the input_dim parameter, as shown below:


In [88]:
# First we need to define the model

# Calls for sequential constructor
model = Sequential()

# To specify the activation function for first layer and the number of input dimensions which is 4 predictors in this case
model.add(Dense(500, input_dim=4, activation= "relu"))

# To repeat the process for the hidden layers without the input dimension (input_dim) parameter
model.add(Dense(100, activation= "relu"))
model.add(Dense(50, activation= "relu"))

# To create the output layer with one node that is expected to output the number of unemployed in thousands
model.add(Dense(1))

#model.summary() #Print model Summary
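
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-output pair plus one bias per output. The short sketch below does that arithmetic for the layer sizes used above. The helper name `dense_params` is an assumption for illustration, and the totals should match what `model.summary()` reports.

```python
# Layer widths: 4 inputs, then Dense(500), Dense(100), Dense(50), Dense(1)
layer_sizes = [4, 500, 100, 50, 1]

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of a fully connected layer."""
    return n_in * n_out + n_out

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)      # [2500, 50100, 5050, 51]
print(total_params)   # 57701
```

With only 401 training rows, a network of this size has far more parameters than observations, which is worth keeping in mind when comparing train and test error.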

# To Define an Optimizer and the Loss Measure for Training

I will define the optimizer and the loss measure for training. The Mean Squared Error serves as the loss measure, and the "adam" optimizer is the minimization algorithm. A practical benefit of the "adam" optimizer is that it adapts the step size for each parameter during training, so its default learning rate often works without manual tuning, unlike plain gradient descent. That reduces, although it does not always remove, the task of optimizing the learning rate for the model. This can be achieved as shown below:

In [89]:
# Using the Mean Square Error to serve as loss measure and using 'adam' optimizer as the minimization algorithm
model.compile(loss= "mean_squared_error" , optimizer="adam", metrics=["mean_squared_error"])

# To fit the model on the training dataset, using 'epochs', which represents the number of passes over the training data; here epochs equals 20
model.fit(X_train, y_train, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x1a16dcef108>
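
To illustrate what the "adam" optimizer does with the learning rate, here is a minimal NumPy sketch of the Adam update rule applied to a one-dimensional quadratic. The function name `adam_minimize` and the toy objective are assumptions for illustration, and Keras's own implementation is more general. The default arguments mirror the defaults Keras documents for its Adam optimizer, while the call at the bottom uses a larger learning rate so the toy problem converges in a few hundred steps.

```python
import numpy as np

def adam_minimize(grad_fn, w0, learning_rate=0.001, beta1=0.9,
                  beta2=0.999, eps=1e-7, steps=500):
    """Minimise a 1-D function with Adam, given a function returning its gradient."""
    w, m, v = float(w0), 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        # The step is scaled by the recent gradient magnitude
        w -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0, learning_rate=0.1)
print(w_final)   # close to 3
```

The division by the root of the squared-gradient average is what makes the effective step size adaptive: it shrinks where gradients have been large and grows where they have been small.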

# Step 6 - To Make Prediction on the Test Data and Compute Evaluation Metrics

I will make predictions on both the train and test data. The evaluation metric used is the Root Mean Squared Error (RMSE), and the RMSE values for the train and test data are printed in that order, as shown below:

In [90]:
# To Predict on the train data
pred_train= model.predict(X_train)

# To print the RMSE value on the train data
print(np.sqrt(mean_squared_error(y_train,pred_train)))

# To Predict on the test data
pred= model.predict(X_test)

# To print the RMSE value on the test data
print(np.sqrt(mean_squared_error(y_test,pred))) 

1866.394427714595
1825.6014425717678


# The Evaluation of the Model Performance

The RMSE is the evaluation metric, and the lower the RMSE value, the better the model performance. Based on the output above, the RMSE was approximately 1866 for the train data and 1826 for the test data. However, in contrast to accuracy, RMSE is not straightforward to interpret on its own: it is expressed in the units of the target variable, which in our case is thousands of unemployed people.
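
One way to put the RMSE into context is to compare it with the scale of the target itself. The sketch below divides the test RMSE printed above by the mean and the standard deviation of 'unemploy' taken from the `data.describe()` output. The variable names are assumptions for illustration, and the standard deviation comes from the full dataset, so the comparison is only approximate.

```python
test_rmse = 1825.6014425717678   # test RMSE printed above
target_mean = 7771.310105        # mean of 'unemploy' from data.describe()
target_std = 2641.95918          # std of 'unemploy' from data.describe()

rmse_vs_mean = test_rmse / target_mean
rmse_vs_std = test_rmse / target_std

print(f"RMSE / mean: {rmse_vs_mean:.3f}")   # 0.235
print(f"RMSE / std:  {rmse_vs_std:.3f}")    # 0.691
```

A model that always predicted the mean would have an RMSE close to the standard deviation, so a ratio below 1 suggests the network has learned something from the predictors, with clear room for improvement.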

# In Conclusion

I have built a regression model with the deep learning framework Keras, using the US Economics time series dataset, to predict the number of unemployed people in thousands. The model achieves similar RMSE values on the train and test sets, which suggests stable performance with little variance between the two. The target variable is measured in thousands, and this also affects the magnitude of the RMSE value.

The performance of the model could be further improved through additional iterations, such as changing the number of neurons, adding more hidden layers and increasing the number of 'epochs'. Each of these can be tried out to see its impact on the model performance.

Aside from the Keras deep learning library, other algorithms can be used to model the same dataset, such as Random Forest, Decision Tree, Gradient Boosting and Support Vector Machines.