# __Traffic Volume Prediction__
<h2 align="center"><b>Advanced Data Science Capstone Project by</b></h2>
<h2 align="center"><b>IBM / Coursera</b></h2>
<h2 align=center>Vasilis Kokkinos (September 2019)</h2>

 ## Introduction / Business Problem

USE CASE: Predictive model of traffic volume. It can be used as a template for similar situations.

DATA SET: Metro Interstate Traffic Volume Data Set
Hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN. Hourly weather features and holidays included for impacts on traffic volume.

Source: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume



### __Attribute Information:__

__holiday:__ Categorical US National holidays plus regional holiday, Minnesota State Fair

__temp:__ Numeric Average temp in kelvin

__rain_1h:__ Numeric Amount in mm of rain that occurred in the hour

__snow_1h:__ Numeric Amount in mm of snow that occurred in the hour

__clouds_all:__ Numeric Percentage of cloud cover

__weather_main:__ Categorical Short textual description of the current weather

__weather_description:__ Categorical Longer textual description of the current weather

__date_time:__ DateTime Hour of the data collected in local CST time

__traffic_volume:__ Numeric Hourly I-94 ATR 301 reported westbound traffic volume

------------------------------------------------------------------------------------------------------

#### __Feature Engineering__ was performed in a previous step in the notebook __traffic_volume.feature_eng.py.v01.ipynb__

In that step:
* the 'date_time' feature was broken down to three new features 'month', 'day_of_week', 'hour_of_day'
* categorical string features were transformed to indexes
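
The 'date_time' breakdown listed above can be sketched in plain Python. This is only an illustration and not the code of the feature engineering notebook: the function name `split_date_time`, the timestamp format and the day numbering (1 = Sunday ... 7 = Saturday) are assumptions.

```python
from datetime import datetime

def split_date_time(date_time, fmt="%Y-%m-%d %H:%M:%S"):
    """Break a timestamp string into month, day_of_week and hour_of_day."""
    dt = datetime.strptime(date_time, fmt)
    return {
        "month": dt.month,
        # Assumed numbering: 1 = Sunday ... 7 = Saturday
        "day_of_week": dt.isoweekday() % 7 + 1,
        "hour_of_day": dt.hour,
    }

# Hypothetical example timestamp (a Tuesday morning)
parts = split_date_time("2012-10-02 09:00:00")
print(parts)
```

If the real notebook uses a different day numbering, only the `day_of_week` line would need to change.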

The resulting data set was saved in _parquet_ format in __traffic_volume_feature_eng_df.parquet__
    
This notebook is the __Model Definition__ step

--------------------------------------------------------------------------------------------

Import necessary packages, initialize Apache Spark session, and add supporting functions

In [1]:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
spark = SparkSession.builder.appName('Traffic Volume Prediction').getOrCreate()

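# Create an SQL context so that we can query data files with SQL-like syntax
# (SQLContext expects the SparkContext, which the session exposes as spark.sparkContext)
sqlContext = SQLContext(spark.sparkContext)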
spark

#### __Read in the data set__

In [3]:
df = spark.read.parquet('traffic_volume_feature_eng_df.parquet')

df.createOrReplaceTempView('df')
df.printSchema()

root
 |-- temp: double (nullable = true)
 |-- rain_1h: double (nullable = true)
 |-- snow_1h: double (nullable = true)
 |-- clouds_all: integer (nullable = true)
 |-- traffic_volume: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- hour_of_day: integer (nullable = true)
 |-- holidayIndex: integer (nullable = true)
 |-- weatherMainIndex: integer (nullable = true)
 |-- weatherDescIndex: integer (nullable = true)



## Basic data set checks

In [4]:
df.show(10, truncate=False)

+------+-------+-------+----------+--------------+-----+-----------+-----------+------------+----------------+----------------+
|temp  |rain_1h|snow_1h|clouds_all|traffic_volume|month|day_of_week|hour_of_day|holidayIndex|weatherMainIndex|weatherDescIndex|
+------+-------+-------+----------+--------------+-----+-----------+-----------+------------+----------------+----------------+
|280.62|0.0    |0.0    |96        |432           |10   |6          |3          |0           |6               |10              |
|279.24|0.0    |0.0    |1         |5074          |11   |5          |18         |0           |1               |0               |
|284.82|0.0    |0.0    |75        |3083          |11   |4          |18         |0           |0               |2               |
|270.81|0.0    |0.0    |90        |2131          |12   |1          |20         |0           |4               |16              |
|269.47|0.0    |0.0    |20        |5113          |12   |5          |9          |0           |0          

In [5]:
print('Number of rows in the dataframe: {}'.format(df.count()))

Number of rows in the dataframe: 40569


#### Create a __VectorAssembler__ object that will create a column __'features'__ with the values of all the columns except for the target column __'traffic_volume'__

In [6]:
from pyspark.ml.feature import VectorAssembler

In [7]:
vectorAssembler = VectorAssembler(inputCols=['temp', 'rain_1h', 'snow_1h', 'clouds_all', 'month', 'day_of_week', \
                                             'hour_of_day', 'holidayIndex', 'weatherMainIndex', 'weatherDescIndex'], \
                                  outputCol='features')

#### Create a __MinMaxScaler__ object that will create another __Vector__ column, __'features_norm'__ that will be the normalized version of the 'features' column

In [8]:
from pyspark.ml.feature import MinMaxScaler

In [9]:
minMaxScaler = MinMaxScaler(inputCol='features', outputCol='features_norm')
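
As a rough sketch of what the scaler computes for each feature, min-max scaling maps a value to `(value - min) / (max - min)`. The helper name `min_max_scale` and the column ranges used below (hours 0 to 23, months 1 to 12) are assumptions made for illustration; the real ranges are learned from the data when the scaler is fitted.

```python
def min_max_scale(value, col_min, col_max):
    """Rescale a value to the [0, 1] range given the column's min and max."""
    if col_max == col_min:
        raise ValueError("constant column: min-max scaling is undefined here")
    return (value - col_min) / (col_max - col_min)

# Assumed ranges: hour_of_day in 0..23, month in 1..12
hour_norm = min_max_scale(3, 0, 23)
month_norm = min_max_scale(10, 1, 12)

print(hour_norm, month_norm)
```

Under these assumed ranges the results agree with the 'features_norm' entries printed later in this notebook for a row with hour_of_day = 3 and month = 10.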

#### Let's create a __Pipeline__ with the vectorAssembler and the minMaxScaler objects
This initial Pipeline will be used only to get the correlation matrix of the features.

In [10]:
from pyspark.ml import Pipeline

In [11]:
pipeline = Pipeline(stages=[vectorAssembler, minMaxScaler])
df = pipeline.fit(df).transform(df)

df.printSchema()

root
 |-- temp: double (nullable = true)
 |-- rain_1h: double (nullable = true)
 |-- snow_1h: double (nullable = true)
 |-- clouds_all: integer (nullable = true)
 |-- traffic_volume: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- hour_of_day: integer (nullable = true)
 |-- holidayIndex: integer (nullable = true)
 |-- weatherMainIndex: integer (nullable = true)
 |-- weatherDescIndex: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- features_norm: vector (nullable = true)



In [12]:
df.select('features', 'features_norm').show(5, truncate=False)

+------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
|features                                        |features_norm                                                                                                          |
+------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
|[280.62,0.0,0.0,96.0,10.0,6.0,3.0,0.0,6.0,10.0] |[0.5583383323335335,0.0,0.0,0.96,0.8181818181818182,0.8333333333333334,0.13043478260869565,0.0,0.6,0.30303030303030304]|
|[279.24,0.0,0.0,1.0,11.0,5.0,18.0,0.0,1.0,0.0]  |[0.5376424715056991,0.0,0.0,0.01,0.9090909090909091,0.6666666666666666,0.782608695652174,0.0,0.1,0.0]                  |
|[284.82,0.0,0.0,75.0,11.0,4.0,18.0,0.0,0.0,2.0] |[0.6213257348530294,0.0,0.0,0.75,0.9090909090909091,0.5,0.782608695652174,0.0,0.0,0.06060606060

#### Let's now normalize all the columns

New normalized features will be created (one for each existing feature) that will be used in the deep learning model. The names of the new fields will have the suffix ___Norm__

Source: https://stackoverflow.com/questions/40337744/scalenormalise-a-column-in-spark-dataframe-pyspark

In [13]:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

In [14]:
columns = ['temp', 'rain_1h', 'snow_1h', 'clouds_all', 'traffic_volume', 'month', 'day_of_week', \
           'hour_of_day', 'holidayIndex', 'weatherMainIndex', 'weatherDescIndex']

In [15]:
# UDF for converting column type from vector to double type
unlist = udf(lambda x: round(float(list(x)[0]), 7), DoubleType())
# Iterating over columns to be scaled
for column in columns:
    # VectorAssembler Transformation - Converting column to vector type
    assembler = VectorAssembler(inputCols=[column],outputCol=column+"_Vect")
    # MinMaxScaler Transformation
    scaler = MinMaxScaler(inputCol=column+"_Vect", outputCol=column+"_Norm")
    # Pipeline of VectorAssembler and MinMaxScaler
    pipeline = Pipeline(stages=[assembler, scaler])
    # Fitting pipeline on dataframe
    df = pipeline.fit(df).transform(df).withColumn(column+"_Norm", unlist(column+"_Norm")).drop(column+"_Vect")
df.createOrReplaceTempView('df')

In [16]:
df.createOrReplaceTempView('df')
df.printSchema()

root
 |-- temp: double (nullable = true)
 |-- rain_1h: double (nullable = true)
 |-- snow_1h: double (nullable = true)
 |-- clouds_all: integer (nullable = true)
 |-- traffic_volume: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- hour_of_day: integer (nullable = true)
 |-- holidayIndex: integer (nullable = true)
 |-- weatherMainIndex: integer (nullable = true)
 |-- weatherDescIndex: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- features_norm: vector (nullable = true)
 |-- temp_Norm: double (nullable = true)
 |-- rain_1h_Norm: double (nullable = true)
 |-- snow_1h_Norm: double (nullable = true)
 |-- clouds_all_Norm: double (nullable = true)
 |-- traffic_volume_Norm: double (nullable = true)
 |-- month_Norm: double (nullable = true)
 |-- day_of_week_Norm: double (nullable = true)
 |-- hour_of_day_Norm: double (nullable = true)
 |-- holidayIndex_Norm: double (nullable = true)
 |-- weatherMainIndex_N

In [17]:
df.show(5, truncate=False)

+------+-------+-------+----------+--------------+-----+-----------+-----------+------------+----------------+----------------+------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------+------------+------------+---------------+-------------------+----------+----------------+----------------+-----------------+---------------------+---------------------+
|temp  |rain_1h|snow_1h|clouds_all|traffic_volume|month|day_of_week|hour_of_day|holidayIndex|weatherMainIndex|weatherDescIndex|features                                        |features_norm                                                                                                          |temp_Norm|rain_1h_Norm|snow_1h_Norm|clouds_all_Norm|traffic_volume_Norm|month_Norm|day_of_week_Norm|hour_of_day_Norm|holidayIndex_Norm|weatherMainIndex_Norm|weatherDescIndex_Norm|
+------+-------+-------+----------+-------------

### Create the __correlation matrix__ based on the __'features'__ columns

In [18]:
from pyspark.ml.stat import Correlation

In [19]:
corr_matrix = Correlation.corr(df, 'features', method='pearson').collect()[0][0].toArray().round(3)
print(corr_matrix)

[[ 1.     0.061 -0.016 -0.109  0.241 -0.003  0.129 -0.013 -0.042  0.048]
 [ 0.061  1.     0.001  0.068  0.017 -0.008 -0.007 -0.003  0.093  0.209]
 [-0.016  0.001  1.     0.024  0.015 -0.012  0.008 -0.001  0.023  0.023]
 [-0.109  0.068  0.024  1.    -0.013 -0.04   0.055  0.     0.132  0.379]
 [ 0.241  0.017  0.015 -0.013  1.     0.005  0.004 -0.004  0.031  0.019]
 [-0.003 -0.008 -0.012 -0.04   0.005  1.     0.    -0.034 -0.018 -0.034]
 [ 0.129 -0.007  0.008  0.055  0.004  0.     1.    -0.053 -0.095  0.001]
 [-0.013 -0.003 -0.001  0.    -0.004 -0.034 -0.053  1.    -0.003 -0.003]
 [-0.042  0.093  0.023  0.132  0.031 -0.018 -0.095 -0.003  1.     0.668]
 [ 0.048  0.209  0.023  0.379  0.019 -0.034  0.001 -0.003  0.668  1.   ]]


#### Let's use __pandas data frame__ so that we can have a more 'colorful' version of the correlation matrix.

In [20]:
import pandas as pd
corr_matrix_df = pd.DataFrame(corr_matrix)
corr_matrix_df.style.background_gradient(cmap='coolwarm')

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.061 | -0.016 | -0.109 | 0.241 | -0.003 | 0.129 | -0.013 | -0.042 | 0.048 |
| 1 | 0.061 | 1.0 | 0.001 | 0.068 | 0.017 | -0.008 | -0.007 | -0.003 | 0.093 | 0.209 |
| 2 | -0.016 | 0.001 | 1.0 | 0.024 | 0.015 | -0.012 | 0.008 | -0.001 | 0.023 | 0.023 |
| 3 | -0.109 | 0.068 | 0.024 | 1.0 | -0.013 | -0.04 | 0.055 | 0.0 | 0.132 | 0.379 |
| 4 | 0.241 | 0.017 | 0.015 | -0.013 | 1.0 | 0.005 | 0.004 | -0.004 | 0.031 | 0.019 |
| 5 | -0.003 | -0.008 | -0.012 | -0.04 | 0.005 | 1.0 | 0.0 | -0.034 | -0.018 | -0.034 |
| 6 | 0.129 | -0.007 | 0.008 | 0.055 | 0.004 | 0.0 | 1.0 | -0.053 | -0.095 | 0.001 |
| 7 | -0.013 | -0.003 | -0.001 | 0.0 | -0.004 | -0.034 | -0.053 | 1.0 | -0.003 | -0.003 |
| 8 | -0.042 | 0.093 | 0.023 | 0.132 | 0.031 | -0.018 | -0.095 | -0.003 | 1.0 | 0.668 |
| 9 | 0.048 | 0.209 | 0.023 | 0.379 | 0.019 | -0.034 | 0.001 | -0.003 | 0.668 | 1.0 |


#### No two features are strongly correlated (the largest off-diagonal value is 0.668, between weatherMainIndex and weatherDescIndex), so none of them will be removed.
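
To make the matrix easier to read, the sketch below attaches feature names to the printed values and lists the most correlated pairs. The matrix values are copied from the output above and the name order follows the `inputCols` of the VectorAssembler; the helper `strongest_pairs` is a hypothetical name introduced only for this illustration.

```python
feature_names = ['temp', 'rain_1h', 'snow_1h', 'clouds_all', 'month', 'day_of_week',
                 'hour_of_day', 'holidayIndex', 'weatherMainIndex', 'weatherDescIndex']

# Values copied from the printed correlation matrix
corr = [
    [1.0, 0.061, -0.016, -0.109, 0.241, -0.003, 0.129, -0.013, -0.042, 0.048],
    [0.061, 1.0, 0.001, 0.068, 0.017, -0.008, -0.007, -0.003, 0.093, 0.209],
    [-0.016, 0.001, 1.0, 0.024, 0.015, -0.012, 0.008, -0.001, 0.023, 0.023],
    [-0.109, 0.068, 0.024, 1.0, -0.013, -0.04, 0.055, 0.0, 0.132, 0.379],
    [0.241, 0.017, 0.015, -0.013, 1.0, 0.005, 0.004, -0.004, 0.031, 0.019],
    [-0.003, -0.008, -0.012, -0.04, 0.005, 1.0, 0.0, -0.034, -0.018, -0.034],
    [0.129, -0.007, 0.008, 0.055, 0.004, 0.0, 1.0, -0.053, -0.095, 0.001],
    [-0.013, -0.003, -0.001, 0.0, -0.004, -0.034, -0.053, 1.0, -0.003, -0.003],
    [-0.042, 0.093, 0.023, 0.132, 0.031, -0.018, -0.095, -0.003, 1.0, 0.668],
    [0.048, 0.209, 0.023, 0.379, 0.019, -0.034, 0.001, -0.003, 0.668, 1.0],
]

def strongest_pairs(matrix, names, top=3):
    """Return the `top` feature pairs with the largest absolute correlation."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only, skips the diagonal
            pairs.append((abs(matrix[i][j]), names[i], names[j], matrix[i][j]))
    pairs.sort(key=lambda p: p[0], reverse=True)
    return [(a, b, r) for _, a, b, r in pairs[:top]]

top_pairs = strongest_pairs(corr, feature_names)
for a, b, r in top_pairs:
    print('{} / {}: {}'.format(a, b, r))
```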

In [21]:
# Delete corr_matrix_df to free up some memory
del corr_matrix_df

#### The two newly created columns __'features'__ and __'features_norm'__ will also be used for the Machine Learning algorithm

## Split the data into training, validation, testing data sets

I will split the data frame into three: training, validation and testing data frames

I will use the training data set for model definition and training and the validation and testing data for model evaluation.

In [22]:
df_train, df_val, df_test = df.randomSplit([0.7, 0.2, 0.1], seed=12345)
    
print('Number of rows in the training dataframe: {}'.format(df_train.count()))
print('Number of rows in the validation dataframe: {}'.format(df_val.count()))
print('Number of rows in the testing dataframe: {}'.format(df_test.count()))

Number of rows in the training dataframe: 28194
Number of rows in the validation dataframe: 8266
Number of rows in the testing dataframe: 4109
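
The counts above are close to, but not exactly, 70% / 20% / 10% of 40569 rows. A plausible reading is that each row is assigned to a split at random, so the weights are only matched on average. The sketch below imitates that idea in plain Python; it is a simplified stand-in and not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability proportional to its weight."""
    total = float(sum(weights))
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating point rounding
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform draw in [0, 1)
        for k, upper in enumerate(bounds):
            if u < upper:
                parts[k].append(row)
                break
    return parts

n_rows = 40569
train, val, test = random_split(range(n_rows), [0.7, 0.2, 0.1], seed=12345)
sizes = [len(train), len(val), len(test)]
print(sizes, [round(s / n_rows, 3) for s in sizes])
```

The exact sizes depend on the random number generator, so they will differ from the Spark counts, but the fractions should land near the requested weights.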


In [23]:
df_train.createOrReplaceTempView('df_train')
df_val.createOrReplaceTempView('df_val')
df_test.createOrReplaceTempView('df_test')

## __NON DEEP LEARNING MODEL DEFINITION__

### __LINEAR REGRESSION__

#### As a first attempt for the model definition, I will use __Linear Regression with pyspark.ml__

In [24]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

#### __GRID SEARCH FOR LINEAR REGRESSION__
#### Using Linear Regression, let's run it over different sets of hyperparameters
This will indicate whether the model works and, if so, give us a set of good hyperparameters

The grid search will iterate through the hyperparameters:
* regParams
* fitIntercepts
* elasticNetParams
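
As a small sketch, the grid over these three hyperparameters can be enumerated with `itertools.product`, which visits the combinations in the same order as three nested loops. The value lists are the ones used in this notebook; the enumeration itself is only an illustration.

```python
from itertools import product

regParams = [1, 0.1, 0.001]
fitIntercepts = [False, True]
elasticNetParams = [0, 0.5, 1]

# Every combination of the three hyperparameter lists
grid = list(product(regParams, fitIntercepts, elasticNetParams))

print('Number of combinations: {}'.format(len(grid)))
for regParam, fitIntercept, elasticNetParam in grid[:3]:
    print(regParam, fitIntercept, elasticNetParam)
```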

As performance measurement I will use __R-squared (r2)__ so I can get a normalized score
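
For reference, R-squared compares the squared error of the model with the squared error of always predicting the mean of the target. The following is a minimal sketch with toy values (not taken from the data set) and a hypothetical helper `r_squared`.

```python
def r_squared(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]                  # toy target values
perfect = r_squared(y_true, y_true)            # predictions equal to the target
baseline = r_squared(y_true, [2.5] * 4)        # always predicting the mean

print(perfect, baseline)
```

A score near 1 means the predictions explain most of the variance of the target, while a score near 0 means the model is no better than predicting the mean.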

In [25]:
evaluator = RegressionEvaluator().setMetricName('r2').setPredictionCol("prediction").setLabelCol("traffic_volume")

In [26]:
lr = LinearRegression(featuresCol='features_norm', labelCol='traffic_volume', predictionCol='prediction')

# Define the hyperparameters lists
regParams = [1, 0.1, 0.001]
fitIntercepts = [False, True]
elasticNetParams = [0, 0.5, 1]

# Run the Grid Search
for regParam in regParams:
    lr.set(lr.getParam('regParam'), regParam)
    for fitIntercept in fitIntercepts:
        lr.set(lr.getParam('fitIntercept'), fitIntercept)
        for elasticNetParam in elasticNetParams:
            lr.set(lr.getParam('elasticNetParam'), elasticNetParam)
            
            score = evaluator.evaluate(lr.fit(df_train).transform(df_train))
            print('For regparam = {}, fitIntercept = {}, elasticNetParam = {} -> score = {}'. \
                  format(regParam, fitIntercept, elasticNetParam, score))

For regparam = 1, fitIntercept = False, elasticNetParam = 0 -> score = 0.10640432780698006
For regparam = 1, fitIntercept = False, elasticNetParam = 0.5 -> score = 0.10640350809464594
For regparam = 1, fitIntercept = False, elasticNetParam = 1 -> score = 0.10640126378952575
For regparam = 1, fitIntercept = True, elasticNetParam = 0 -> score = 0.16659604729248756
For regparam = 1, fitIntercept = True, elasticNetParam = 0.5 -> score = 0.16659509914799697
For regparam = 1, fitIntercept = True, elasticNetParam = 1 -> score = 0.16659274643016087
For regparam = 0.1, fitIntercept = False, elasticNetParam = 0 -> score = 0.10640436138332643
For regparam = 0.1, fitIntercept = False, elasticNetParam = 0.5 -> score = 0.10640435318020769
For regparam = 0.1, fitIntercept = False, elasticNetParam = 1 -> score = 0.10640432987466553
For regparam = 0.1, fitIntercept = True, elasticNetParam = 0 -> score = 0.16659608618903876
For regparam = 0.1, fitIntercept = True, elasticNetParam = 0.5 -> score = 0.1665

#### It is apparent that Linear Regression does not produce good results.

#### As an extra step, let's see if there is a strong correlation between the target feature and each of the independent features.

In [27]:
for column in ['temp', 'rain_1h', 'snow_1h', 'clouds_all', 'month', 'day_of_week', 'hour_of_day', 'holidayIndex', \
               'weatherMainIndex', 'weatherDescIndex']:
    print('Correlation between {} and traffic_volume: {}'.format(column, df.corr(column, 'traffic_volume')))

Correlation between temp and traffic_volume: 0.13953016306327648
Correlation between rain_1h and traffic_volume: -0.013412084020647055
Correlation between snow_1h and traffic_volume: -0.0020894165320195415
Correlation between clouds_all and traffic_volume: 0.07839171108552294
Correlation between month and traffic_volume: -0.00471402637026558
Correlation between day_of_week and traffic_volume: -0.14418674924649147
Correlation between hour_of_day and traffic_volume: 0.3550783497016299
Correlation between holidayIndex and traffic_volume: -0.03943187953138766
Correlation between weatherMainIndex and traffic_volume: -0.0762702785255084
Correlation between weatherDescIndex and traffic_volume: 0.02565496866273948


None of the features is strongly linearly correlated with the dependent feature (the largest absolute value is about 0.36, for hour_of_day), so __it is not a Linear Regression problem__.
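
A low Pearson coefficient only rules out a linear relationship; a feature can still determine the target through a non-linear pattern. The sketch below uses synthetic values (not the real traffic data) in which a made-up volume peaks in the middle of the day, so the target depends entirely on the hour while the linear correlation is essentially zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

hours = list(range(24))
# Synthetic, symmetric daily profile: highest around midday, lowest at night
volume = [1000 - (h - 11.5) ** 2 for h in hours]

r = pearson(hours, volume)
print('Pearson correlation: {:.6f}'.format(r))
```

This is one reason why a tree-based or neural model can succeed where Linear Regression does not.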

### __NON LINEAR REGRESSION__

#### Let's experiment with the __RandomForestRegressor__ algorithm

Again __R-squared (r2)__ will be used as performance measurement

In [28]:
from pyspark.ml.regression import RandomForestRegressor

In [29]:
evaluator = RegressionEvaluator().setMetricName('r2').setPredictionCol("prediction").setLabelCol("traffic_volume")

In [30]:
randomForestRegressor = RandomForestRegressor(labelCol='traffic_volume', featuresCol='features_norm', predictionCol='prediction')

ndl_model_rfr = randomForestRegressor.fit(df_train)
ndl_prediction_rfr = ndl_model_rfr.transform(df_train)
ndl_score_rfr = evaluator.evaluate(ndl_prediction_rfr)

print('\tEvaluation Score: ', ndl_score_rfr)

	Evaluation Score:  0.8664582474124316


The __r2 score__ on the training set is around __0.87__, which is a good indication that the model __RandomForestRegressor__ is working.

Let's check the default parameters of the randomForestRegressor model

In [31]:
model_params = randomForestRegressor.extractParamMap()
for param, value in model_params.items():
    print(param, '=>', value)

RandomForestRegressor_931c203c4b3c__seed => 512816861531764284
RandomForestRegressor_931c203c4b3c__predictionCol => prediction
RandomForestRegressor_931c203c4b3c__labelCol => traffic_volume
RandomForestRegressor_931c203c4b3c__featuresCol => features_norm
RandomForestRegressor_931c203c4b3c__maxDepth => 5
RandomForestRegressor_931c203c4b3c__maxBins => 32
RandomForestRegressor_931c203c4b3c__minInstancesPerNode => 1
RandomForestRegressor_931c203c4b3c__minInfoGain => 0.0
RandomForestRegressor_931c203c4b3c__maxMemoryInMB => 256
RandomForestRegressor_931c203c4b3c__cacheNodeIds => False
RandomForestRegressor_931c203c4b3c__checkpointInterval => 10
RandomForestRegressor_931c203c4b3c__impurity => variance
RandomForestRegressor_931c203c4b3c__subsamplingRate => 1.0
RandomForestRegressor_931c203c4b3c__numTrees => 20
RandomForestRegressor_931c203c4b3c__featureSubsetStrategy => auto


In the __model training__ step, I will increase the __maxDepth__ parameter of the model so that its performance increases

Let's now create a new Pipeline that will only contain the __randomForestRegressor__ model. The normalizer stage is not included: tree-based models do not require normalized inputs, and the 'features_norm' column that the model reads already exists in the data frames.

In [32]:
rfr_model_pl = Pipeline(stages=[randomForestRegressor])

Let's now save the pipeline

In [33]:
rfr_model_pl.write().overwrite().save('rfr_model_pl.pkl')

## __DEEP LEARNING MODEL DEFINITION__

Let's check the TensorFlow configuration

In [34]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 12454945463495432793
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 1448044134
locality {
  bus_id: 1
  links {
  }
}
incarnation: 13439210984147270269
physical_device_desc: "device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0"
]


#### Again, I will work with the training data set __df_train__

Create arrays that correspond to the training and validation normalized data sets

In [35]:
X_train = df_train['temp_Norm', 'rain_1h_Norm', 'snow_1h_Norm', 'clouds_all_Norm', 'month_Norm', 'day_of_week_Norm', \
                   'hour_of_day_Norm', 'holidayIndex_Norm', 'weatherMainIndex_Norm', 'weatherDescIndex_Norm']
y_train = df_train[['traffic_volume_Norm']]
X_val = df_val['temp_Norm', 'rain_1h_Norm', 'snow_1h_Norm', 'clouds_all_Norm', 'month_Norm', 'day_of_week_Norm', \
               'hour_of_day_Norm', 'holidayIndex_Norm', 'weatherMainIndex_Norm', 'weatherDescIndex_Norm']
y_val = df_val[['traffic_volume_Norm']]

In [36]:
X_train.createOrReplaceTempView('X_train')
X_train_arr = np.array(spark.sql('select * from X_train').collect())

y_train.createOrReplaceTempView('y_train')
y_train_arr = np.array(spark.sql('select * from y_train').collect())

X_val.createOrReplaceTempView('X_val')
X_val_arr = np.array(spark.sql('select * from X_val').collect())

y_val.createOrReplaceTempView('y_val')
y_val_arr = np.array(spark.sql('select * from y_val').collect())

#### I will use __Keras__ to create the neural network

In [37]:
from keras.models import Sequential
from keras.layers import Dense

from keras.initializers import VarianceScaling

# For the model compilation
from keras.optimizers import Adam
from keras import metrics

Using TensorFlow backend.


After a lot of searching and experimentation, I settled on the following configuration

In [38]:
kernel_initializer=VarianceScaling(distribution='uniform')

optimizer = Adam()
model = Sequential()
model.add(Dense(20, input_shape=(10,), kernel_initializer=kernel_initializer, activation='relu'))
model.add(Dense(512, kernel_initializer=kernel_initializer, activation='relu'))
model.add(Dense(256, kernel_initializer=kernel_initializer, activation='relu'))
model.add(Dense(128, kernel_initializer=kernel_initializer, activation='relu'))
model.add(Dense(64, kernel_initializer=kernel_initializer, activation='relu'))
model.add(Dense(32, kernel_initializer=kernel_initializer, activation='relu'))
model.add(Dense(64, kernel_initializer=kernel_initializer, activation='relu'))
model.add(Dense(16, kernel_initializer=kernel_initializer, activation='relu'))
model.add(Dense(8, kernel_initializer=kernel_initializer, activation='relu'))
model.add(Dense(1, kernel_initializer=kernel_initializer, activation='linear'))

model.compile(loss='mse', optimizer=optimizer, metrics=['mse', 'mae'])

model.summary()





Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 20)                220       
_________________________________________________________________
dense_2 (Dense)              (None, 512)               10752     
_________________________________________________________________
dense_3 (Dense)              (None, 256)               131328    
_________________________________________________________________
dense_4 (Dense)              (None, 128)               32896     
_________________________________________________________________
dense_5 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_6 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_7 (Dense)              (None, 64)           
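
The parameter counts in the summary can be checked by hand: a Dense layer with a bias has `(inputs + 1) * units` trainable parameters. The sketch below recomputes them for the layer sizes defined in the model above; `dense_param_count` is a hypothetical helper and the counts for the layers cut off in the printed summary are computed from the formula, not read from the output.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units

layer_units = [20, 512, 256, 128, 64, 32, 64, 16, 8, 1]  # units of each Dense layer, in order
n_inputs = 10                                             # number of input features

counts = []
for units in layer_units:
    counts.append(dense_param_count(n_inputs, units))
    n_inputs = units  # the output of this layer feeds the next one

print(counts)
print('Total parameters: {}'.format(sum(counts)))
```

The first six values match the rows visible in the summary (220, 10752, 131328, 32896, 8256, 2080).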

Let's now fit the model on the training data set, using the validation data set as the validation data

It will be run for 50 epochs with batch size 127. I took __Ilja Rasin's__ advice from his relevant presentation. In this case the training set size is 28194, which is exactly divisible by 127
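
The divisibility claim is easy to verify, and the same check can list other batch sizes that would leave no partial batch. The search range of 32 to 256 below is an arbitrary assumption chosen for illustration.

```python
n_train = 28194
batch_size = 127

# Number of full batches per epoch and the size of the leftover batch
steps_per_epoch, remainder = divmod(n_train, batch_size)
print('Batches per epoch: {}, leftover rows: {}'.format(steps_per_epoch, remainder))

# Other batch sizes in an assumed range that divide the training set exactly
candidates = [b for b in range(32, 257) if n_train % b == 0]
print(candidates)
```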

In [39]:
model.fit(X_train_arr, y_train_arr, validation_data=(X_val_arr, y_val_arr), epochs=50, batch_size=127, verbose=2)



Train on 28194 samples, validate on 8266 samples
Epoch 1/50
 - 3s - loss: 0.0423 - mean_squared_error: 0.0423 - mean_absolute_error: 0.1550 - val_loss: 0.0169 - val_mean_squared_error: 0.0169 - val_mean_absolute_error: 0.0973
Epoch 2/50
 - 1s - loss: 0.0120 - mean_squared_error: 0.0120 - mean_absolute_error: 0.0785 - val_loss: 0.0081 - val_mean_squared_error: 0.0081 - val_mean_absolute_error: 0.0610
Epoch 3/50
 - 1s - loss: 0.0073 - mean_squared_error: 0.0073 - mean_absolute_error: 0.0575 - val_loss: 0.0076 - val_mean_squared_error: 0.0076 - val_mean_absolute_error: 0.0557
Epoch 4/50
 - 1s - loss: 0.0063 - mean_squared_error: 0.0063 - mean_absolute_error: 0.0519 - val_loss: 0.0062 - val_mean_squared_error: 0.0062 - val_mean_absolute_error: 0.0534
Epoch 5/50
 - 1s - loss: 0.0057 - mean_squared_error: 0.0057 - mean_absolute_error: 0.0490 - val_loss: 0.0063 - val_mean_squared_error: 0.0063 - val_mean_absolute_error: 0.0527
Epoch 6/50
 - 1s - loss: 0.0052 - mean_squared_error: 0.0052 - m

<keras.callbacks.History at 0x1c606f86508>

#### Let's see the evaluation metrics of the model on the training set.

In [40]:
# Training set
score = model.evaluate(X_train_arr, y_train_arr)
print('Evaluation Mean Squared Error: {}'.format(score[1]))
print('Evaluation Mean Absolute Error: {}'.format(score[2]))

Evaluation Mean Squared Error: 0.003915158861432421
Evaluation Mean Absolute Error: 0.03845701188286661
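
These errors are measured on the min-max normalized target, so they are fractions of the traffic volume range rather than vehicle counts. The sketch below derives the RMSE from the reported MSE and shows how an absolute error could be mapped back to the original units. The helper `to_original_units` and the range 0 to 7000 are hypothetical; the real minimum and maximum of 'traffic_volume' should be used instead.

```python
import math

mse_norm = 0.003915158861432421  # reported by model.evaluate on the training set
mae_norm = 0.03845701188286661

rmse_norm = math.sqrt(mse_norm)
print('RMSE on the normalized scale: {:.4f}'.format(rmse_norm))

def to_original_units(error_norm, target_min, target_max):
    """Undo min-max scaling for an absolute error (MAE or RMSE); the offset cancels out."""
    return error_norm * (target_max - target_min)

# Hypothetical target range, for illustration only
example_mae = to_original_units(mae_norm, target_min=0, target_max=7000)
print('MAE under the assumed range: {:.1f} vehicles per hour'.format(example_mae))
```

Note that this rescaling applies to MAE and RMSE; an MSE would scale with the square of the range.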


The algorithm looks promising. In the __Model Training__ step I will try to improve its performance

#### Let's save the model configuration only

In [41]:
from keras.models import model_from_json
import json

In [42]:
traffic_volume_dl_model_json = model.to_json()
with open('traffic_volume_dl_model.json', 'w') as outfile:
    json.dump(traffic_volume_dl_model_json, outfile)
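
One detail worth keeping in mind when reading this file back: `model.to_json()` already returns a JSON string, so `json.dump` stores it as a quoted JSON string. Loading therefore takes one `json.load` to recover the original string, which can then be passed to `model_from_json`. The sketch below shows the round trip with the standard library only, using a small stand-in string instead of a real model configuration.

```python
import json

# Stand-in for the string returned by model.to_json() (not a real Keras configuration)
config_json = json.dumps({"class_name": "Sequential", "config": {"layers": []}})

written = json.dumps(config_json)   # what json.dump(config_json, outfile) writes to disk
first_pass = json.loads(written)    # back to the original JSON string
config = json.loads(first_pass)     # parsed into a dictionary, if needed for inspection

print(type(first_pass).__name__, type(config).__name__)
```

Writing the string directly with `outfile.write(...)` would avoid the extra layer, but either approach works as long as saving and loading stay consistent.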

#### Finally, let's save the final dataframe and the training, validation and testing dataframes in __parquet__ format for future use.

In [43]:
df = df.repartition(1)
df.write.parquet('traffic_volume_df_norm.parquet')
df_train = df_train.repartition(1)
df_train.write.parquet('traffic_volume_df_train.parquet')
df_val = df_val.repartition(1)
df_val.write.parquet('traffic_volume_df_val.parquet')
df_test = df_test.repartition(1)
df_test.write.parquet('traffic_volume_df_test.parquet')

## __Summary__

We have now defined an initial non deep learning algorithm and a deep learning algorithm for our non-linear regression problem and saved them in:
__rfr_model_pl.pkl__ and __traffic_volume_dl_model.json__

We have also saved the final dataframe and the training, validation and testing data sets in: __traffic_volume_df_norm.parquet__, __traffic_volume_df_train.parquet__, __traffic_volume_df_val.parquet__ and __traffic_volume_df_test.parquet__ to use them in the next steps

The Pipeline and the non-linear algorithm will be taken to the next steps where the algorithms will be trained