<a href="https://colab.research.google.com/github/matthew110395/12004210_DataAnalytics/blob/main/12004210_DAOTW_Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction
This assignment builds on the New York taxi problem identified in assignment 1, where the City of New York is looking for a way to accurately predict the number of collisions on a particular day of the week. Within this document two different types of machine learning models will be utilised to predict the number of collisions. The linear relationships identified in assignment 1 will be used to create linear regression models. Deep Learning Neural Network (DNN) models will be used to predict the number of collisions where the relationship is not linear. 

#Imports
To prepare data for machine learning the pandas package has been used, alongside the numpy package which has been used to aid with mathematical functions.

As within part 1 of this assignment, the data file containing location data exceeds the size limit for hosting within github. To overcome this the file was zipped. To extract the data the zipfile package has been used.

Within this document, TensorFlow is used for machine learning, with both linear regression models and a Deep Neural Network models. TensorFlow version 1 is unsupported within Google Colab, therefore must be installed using a package manager.

Shutil is also imported to allow for file management, in particular the removal of saved models.

In [77]:
#Import Packages
import pandas as pd
import numpy as np
import zipfile
!pip install tensorflow==1.15.2
import tensorflow as tf
import shutil  

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#Linear Regressor
Throughout assignment 1 a number of linear relationships were uncovered within the dataset. These relationships form the basis of the linear regression models below.

A linear regressor is used to predict an output variable based on one or more input variables (IBM n.d.).

To improve the accuracy of the model the target values are scaled. This reduces the range of collisions from 188-1161 to 0.1619... - 1 which allows for quicker training  (Zhang 2019).

In [78]:
#Scale to maximum number of collisions
SCALE_COLLISIONS=1161

##Precipitation
As uncovered in assignment 1; as the volume of precipitation increases, the number of collisions increase. 

The datafile produced in assignment is imported.

In [79]:
#Read File
df_prcp = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/prcp_clean.csv', index_col=0, )
print(df_prcp[:6])

   day  year  mo  da collision_date  NUM_COLLISIONS  temp  dewp     slp  \
1    5  2020   1  24     2020-01-24             524  37.3  33.7  1028.5   
2    2  2021   1  12     2021-01-12             278  37.0  29.1  1019.0   
3    5  2021   1  22     2021-01-22             254  36.5  28.4  1003.1   
4    3  2021   1  27     2021-01-27             262  34.6  33.8  1012.8   
5    2  2021   1  26     2021-01-26             263  31.9  23.4  1016.9   
6    1  2022   1  24     2022-01-24             237  34.5  23.8  1010.6   

   visib  ...   max   min  prcp   sndp  fog  rain_drizzle  snow_ice_pellets  \
1    6.5  ...  46.0  19.9  0.00  999.9    1             0                 0   
2   10.0  ...  44.1  21.0  0.00  999.9    0             0                 0   
3   10.0  ...  44.1  19.9  0.00  999.9    0             0                 0   
4    8.0  ...  41.0  28.9  0.25  999.9    1             1                 0   
5    9.0  ...  37.9  21.0  0.00  999.9    1             0                 0   


In order to create the linear regression model, extra columns are removed to simplify the model with the aim of reducing error values.

The incomplete years (2012 and 2022) are removed, along with the erroneous data for 2020 and 2021.

To aid with the production of the model the target is moved to the end of the data table.

In [80]:
#Remove Cols not Required 
df_prcp = df_prcp.drop(columns=['collision_date', 'temp', 'dewp', 'slp','visib','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud'])
df_prcp = df_prcp.loc[df_prcp["year"] != 2012]
df_prcp = df_prcp.loc[df_prcp["year"] < 2020]
cols = df_prcp['NUM_COLLISIONS']
df_prcp = df_prcp.drop(columns=['NUM_COLLISIONS'])
#Move NUM_COLLISIONS to end
df_prcp.insert(loc=9, column='NUM_COLLISIONS', value=cols)
print(df_prcp[:6])
df_prcp.describe()

    day  year  mo  da  prcp  fog  rain_drizzle  snow_ice_pellets  hail  \
49    4  2016   1  28  0.09    0             0                 0     0   
51    5  2014   1  17  0.00    1             0                 0     0   
54    1  2016   1  25  0.02    0             0                 0     0   
55    5  2016   1  29  0.00    0             0                 0     0   
58    5  2017   1  20  0.00    0             0                 0     0   
59    7  2013   1  13  0.01    1             0                 0     0   

    NUM_COLLISIONS  
49             681  
51             589  
54             658  
55             645  
58             605  
59             373  


Unnamed: 0,day,year,mo,da,prcp,fog,rain_drizzle,snow_ice_pellets,hail,NUM_COLLISIONS
count,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0
mean,3.998425,2015.989366,6.518708,15.745569,0.122588,0.253249,0.375345,0.085467,0.000394,599.135093
std,2.003542,1.996126,3.455211,8.803199,0.329143,0.434958,0.484307,0.27963,0.019846,100.299164
min,1.0,2013.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2.0,2014.0,4.0,8.0,0.0,0.0,0.0,0.0,0.0,531.0
50%,4.0,2016.0,7.0,16.0,0.0,0.0,0.0,0.0,0.0,602.0
75%,6.0,2018.0,10.0,23.0,0.06,1.0,1.0,0.0,0.0,665.0
max,7.0,2019.0,12.0,31.0,3.76,1.0,1.0,1.0,1.0,1161.0


To remove any bias within the dataset, it is randomly shuffled. The data is then split into the predictors and the target.

In [81]:
# Shuffle the data
shuffle = df_prcp.iloc[np.random.permutation(len(df_prcp))]

# Select all apart from last col
predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])
print(shuffle[:6])

      day  year  mo  da  prcp  fog  rain_drizzle  snow_ice_pellets  hail
3492    4  2015  12  31  0.00    0             1                 0     0
1307    5  2013   5  10  0.00    1             0                 0     0
506     2  2016   2   9  0.06    0             0                 1     0
1716    3  2019   6   5  0.00    0             1                 0     0
486     5  2017   2  17  0.00    0             0                 0     0
2648    5  2018   9  28  0.10    0             1                 0     0
      day  year  mo  da  prcp  fog  rain_drizzle  snow_ice_pellets  hail  \
3492    4  2015  12  31  0.00    0             1                 0     0   
1307    5  2013   5  10  0.00    1             0                 0     0   
506     2  2016   2   9  0.06    0             0                 1     0   
1716    3  2019   6   5  0.00    0             1                 0     0   
486     5  2017   2  17  0.00    0             0                 0     0   
2648    5  2018   9  28  0.10    

In [82]:
# Select Target (last col)
targets = shuffle.iloc[:,-1]

print(targets[:6])

3492    527
1307    698
506     571
1716    655
486     680
2648    712
Name: NUM_COLLISIONS, dtype: int64


In [83]:
#Split to test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
print(trainsize)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize
print(testsize)
nppredictors = 9
noutputs = 1

2031
508


In [84]:
# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model_prcp', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_prcp', optimizer=tf.train.AdamOptimizer(learning_rate=0.00001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))
print("starting to train");
#Train Model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)
preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS

pred = format(str(predslistscale))
#Test Model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));

avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2184c36790>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/linear_regression_trained_model_prcp/model.ckpt.
INFO:tensorflow:loss = 0.28435007, step = 1
INFO:tensorflow:global_step/sec: 610.428
INFO:tensorflow:loss = 0.006029467, step = 101 (0.170 sec)
INFO:tensorflow:global_step/sec: 830.111
INFO:tensorflow:loss = 0.006208708, step = 201 (0.120 sec)
INFO:tensorflow:global_step/sec: 763.028
INFO:tensorflow:loss = 0.00811196, step = 301 (0.129 sec)
INFO:tensorflow:global_step/sec: 739.671
INFO:tensorflow:loss = 0.006580451, step = 401 (0.134 sec)
INFO:tensorflow:global_step/sec: 742.39
INFO:tensorflow:loss = 0.0064119017, step = 501 (0.139 sec)
INFO:tensorflow:global_step/sec: 483.678
INFO:tensorflow:loss = 0.006235397, step = 601 (0.207 sec)
INFO:tensorflow:global_step/sec: 537.563
INFO:tensorflow:loss = 0.0064507676, step = 701 (0.1

LinearRegression has RMSE of 102.05434614543951
Just using average = 600.2299359921221 has RMSE of 105.566287878724


A number of learning rates were used to determine a suitable learning rate for the model. As the learning rate decreases the overall time to train the dataset increases (Zulkifli 2018).

In [85]:
#Check Error Rate
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_prcp', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21845a0e50>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_prc

508
508
[0.5122645  0.47875375 0.52287126 0.5635435  0.46389306 0.55077493
 0.56979203 0.47948697 0.54274005 0.5035545  0.5411757  0.54123
 0.5133076  0.48793462 0.55686814 0.48470786 0.53721666 0.50310105
 0.47432306 0.5792667  0.54587805 0.4638464  0.5302459  0.4894051
 0.55784094 0.5486665  0.50999916 0.46061707 0.5070047  0.5136187
 0.53376424 0.53320867 0.51746637 0.5568299  0.51152563 0.5712332
 0.5349731  0.49143696 0.4980978  0.53075117 0.50611013 0.5668902
 0.5099612  0.53449446 0.5607861  0.5116442  0.54498    0.5537348
 0.54023683 0.5213465  0.55704856 0.49917585 0.52356577 0.55058366
 0.489171   0.5526224  0.5368833  0.48330578 0.5391398  0.48670578
 0.51017135 0.54889625 0.5786473  0.56032413 0.50977975 0.51447284
 0.48049292 0.48267332 0.5532888  0.5355576  0.46934167 0.4938137
 0.54982024 0.4537593  0.48576254 0.5497053  0.5488554  0.54657066
 0.59345555 0.49571893 0.4655688  0.504705   0.49902675 0.5296054
 0.48964164 0.54712546 0.50650275 0.46399084 0.5053166  0.550896

In [86]:
input = pd.DataFrame.from_dict(data = 
				{
         'day' : [1,1,1,10],
         'year' : [2019,2019,2019,2020],
         'mo' : [3,3,3,12],
         'da' : [10,10,10,12],
         'prcp' : [0,2.34,5.5,2.24],
         'fog' : [0,0,1,1],
         'rain_drizzle' : [0,1,1,1],
         'snow_ice_pellets' : [0,0,0,0],
         'hail' : [0,0,0,0]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_prcp', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180c349d0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_prc

[631.5835  628.92267 644.65247 542.87445]


Two main tests have been applied to the model. The Route Mean Squared Error (RMSE) and a comparison between the target values in the testing dataset and the predicted values using the predictors in the testing dataset.

Predominantly the RMSE of the model is lower than that of the average. This indicates that the model makes more accurate predictions compared to the average.

##Dew Point (dewp)
A relationship between dew point and the number of collisions was also uncovered in assignment 1. This linear relationship suggests that as the dew point increases the number of collisions increase. 

The process to produce the model follows the same process as the precipitation model.

In [87]:
#Read Data
df_dewp = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/dewp_clean.csv', index_col=0, )
print(df_dewp[:6])

   day  year  mo  da collision_date  NUM_COLLISIONS  temp  dewp     slp  \
1    5  2020   1  24     2020-01-24             524  37.3  33.7  1028.5   
2    2  2021   1  12     2021-01-12             278  37.0  29.1  1019.0   
3    5  2021   1  22     2021-01-22             254  36.5  28.4  1003.1   
4    3  2021   1  27     2021-01-27             262  34.6  33.8  1012.8   
5    2  2021   1  26     2021-01-26             263  31.9  23.4  1016.9   
6    1  2022   1  24     2022-01-24             237  34.5  23.8  1010.6   

   visib  ...   max   min  prcp   sndp  fog  rain_drizzle  snow_ice_pellets  \
1    6.5  ...  46.0  19.9  0.00  999.9    1             0                 0   
2   10.0  ...  44.1  21.0  0.00  999.9    0             0                 0   
3   10.0  ...  44.1  19.9  0.00  999.9    0             0                 0   
4    8.0  ...  41.0  28.9  0.25  999.9    1             1                 0   
5    9.0  ...  37.9  21.0  0.00  999.9    1             0                 0   


In [88]:
#Remove Cols not Required 
df_dewp = df_dewp.drop(columns=['collision_date', 'temp', 'prcp', 'slp','visib','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_dewp = df_dewp.loc[df_dewp["year"] != 2012]
df_dewp = df_dewp.loc[df_dewp["year"] < 2020]
cols = df_dewp['NUM_COLLISIONS']
df_dewp = df_dewp.drop(columns=['NUM_COLLISIONS'])
#Move target to end
df_dewp.insert(loc=5, column='NUM_COLLISIONS', value=cols)
print(df_dewp[:6])
df_dewp.describe()

    day  year  mo  da  dewp  NUM_COLLISIONS
49    4  2016   1  28  24.4             681
51    5  2014   1  17  35.8             589
54    1  2016   1  25  21.2             658
55    5  2016   1  29  36.8             645
58    5  2017   1  20  32.5             605
59    7  2013   1  13  44.9             373


Unnamed: 0,day,year,mo,da,dewp,NUM_COLLISIONS
count,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0
mean,3.998434,2015.999217,6.52407,15.723679,44.16317,599.10998
std,2.000391,2.0,3.449676,8.801271,16.995303,100.277185
min,1.0,2013.0,1.0,1.0,-6.7,188.0
25%,2.0,2014.0,4.0,8.0,32.15,531.0
50%,4.0,2016.0,7.0,16.0,45.3,602.0
75%,6.0,2018.0,10.0,23.0,58.5,665.0
max,7.0,2019.0,12.0,31.0,74.1,1161.0


In [89]:
#Shuffle Data
shuffle = df_dewp.iloc[np.random.permutation(len(df_dewp))]
#Select Predictors
predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])
print(shuffle[:6])

      day  year  mo  da  dewp
1587    2  2019   6  18  58.8
2698    3  2016   9   7  66.8
1724    5  2017   6  23  63.4
1383    1  2018   5   7  49.7
1877    5  2019   7  26  58.9
2879    2  2013  10   1  51.8
      day  year  mo  da  dewp  NUM_COLLISIONS
1587    2  2019   6  18  58.8             721
2698    3  2016   9   7  66.8             648
1724    5  2017   6  23  63.4             793
1383    1  2018   5   7  49.7             695
1877    5  2019   7  26  58.9             650
2879    2  2013  10   1  51.8             616


In [90]:
#Select last col as target
targets = shuffle.iloc[:,-1]

print(targets[:6])

1587    721
2698    648
1724    793
1383    695
1877    650
2879    616
Name: NUM_COLLISIONS, dtype: int64


In [91]:
#Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
print(trainsize)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize
print(testsize)
nppredictors = 5
noutputs = 1

2044
511


In [92]:
# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model_dewp', ignore_errors=True)

#Setup Model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_dewp', optimizer=tf.train.AdamOptimizer(learning_rate=0.00001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))
print("starting to train");
#Train Model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)
preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS

#test Model
pred = format(str(predslistscale))
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));

avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2184825110>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_dewp', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/linear_regression_trained_model_dewp/model.ckpt.
INFO:tensorflow:loss = 0.27910894, step = 1
INFO:tensorflow:global_step/sec: 585.842
INFO:tensorflow:loss = 0.0071757007, step = 101 (0.174 sec)
INFO:tensorflow:global_step/sec: 865.49
INFO:tensorflow:loss = 0.009145996, step = 201 (0.116 sec)
INFO:tensorflow:global_step/sec: 616.143
INFO:tensorflow:loss = 0.006120279, step = 301 (0.162 sec)
INFO:tensorflow:global_step/sec: 509.497
INFO:tensorflow:loss = 0.0068645384, step = 401 (0.196 sec)
INFO:tensorflow:global_step/sec: 536.93
INFO:tensorflow:loss = 0.0065441383, step = 501 (0.186 sec)
INFO:tensorflow:global_step/sec: 718.277
INFO:tensorflow:loss = 0.005712265, step = 601 (0.140 sec)
INFO:tensorflow:global_step/sec: 756.416
INFO:tensorflow:loss = 0.0065529943, step = 701 (0

LinearRegression has RMSE of 93.96416755097607
Just using average = 598.9432485322897 has RMSE of 102.3375683663379


In [93]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_dewp', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2193f8d590>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_dewp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_dew

511
511
[0.54108864 0.5150208  0.51542944 0.51455796 0.47583237 0.5577232
 0.56107754 0.52491874 0.5058224  0.5312116  0.57578826 0.4976822
 0.5146251  0.47125548 0.54165584 0.5187817  0.54353875 0.48916832
 0.5086365  0.51474327 0.4938765  0.47406805 0.54567677 0.57583743
 0.53132784 0.5793128  0.5340886  0.5401705  0.53567487 0.56672806
 0.507975   0.5596787  0.53980374 0.52301604 0.47560015 0.49049875
 0.52553046 0.48754847 0.5383884  0.49556622 0.5098874  0.529255
 0.56506574 0.49334618 0.5572748  0.5790375  0.55949205 0.4743112
 0.5073691  0.53109753 0.46992546 0.5120298  0.5085686  0.57224846
 0.5523468  0.50943124 0.53462005 0.48818082 0.51841134 0.5310553
 0.5350835  0.580653   0.5600857  0.5329062  0.5573527  0.5525702
 0.5150917  0.518853   0.49951407 0.5492718  0.5072489  0.548728
 0.51278436 0.48733842 0.48348555 0.5607031  0.46873987 0.5772446
 0.54728687 0.5182029  0.47003058 0.48797196 0.58404195 0.5136731
 0.5490102  0.44753367 0.5624638  0.5363252  0.5077698  0.5223185

In [94]:
input = pd.DataFrame.from_dict(data = 
				{
         'day' : [1,1,1,10],
         'year' : [2019,2019,2019,2020],
         'mo' : [3,3,3,12],
         'da' : [10,10,10,12],
         'dewp' : [0,2.34,5.5,2.24],
         
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_dewp', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f218474e550>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_dewp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_dew

[596.0713  598.4295  601.614   503.96735]


The RMSE for the dewp model is slightly lower in comparison to the model produced for precipitation. As the RMSE is lower than the RMSE of the mean, it shows that the model has a higher level of accuracy in comparison to the using the mean.

##Visibility (visib)
A relationship was also uncovered between visibility and the number of collisions. This is a negative linear relationship where the visibility increases the number of collisions decrease. 

The process to produce the model follows the same process as above.

In [95]:
#Read Data
df_visib = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/coldata.csv', index_col=0, )
print(df_visib[:6])

   day  year  mo  da collision_date  NUM_COLLISIONS  temp  dewp     slp  \
1    5  2020   1  24     2020-01-24             524  37.3  33.7  1028.5   
2    2  2021   1  12     2021-01-12             278  37.0  29.1  1019.0   
3    5  2021   1  22     2021-01-22             254  36.5  28.4  1003.1   
4    3  2021   1  27     2021-01-27             262  34.6  33.8  1012.8   
5    2  2021   1  26     2021-01-26             263  31.9  23.4  1016.9   
6    1  2022   1  24     2022-01-24             237  34.5  23.8  1010.6   

   visib  ...   max   min  prcp   sndp  fog  rain_drizzle  snow_ice_pellets  \
1    6.5  ...  46.0  19.9  0.00  999.9    1             0                 0   
2   10.0  ...  44.1  21.0  0.00  999.9    0             0                 0   
3   10.0  ...  44.1  19.9  0.00  999.9    0             0                 0   
4    8.0  ...  41.0  28.9  0.25  999.9    1             1                 0   
5    9.0  ...  37.9  21.0  0.00  999.9    1             0                 0   


In [96]:
#Remove Cols not Required 
df_visib = df_visib.drop(columns=['collision_date', 'temp', 'prcp', 'slp','dewp','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_visib = df_visib.loc[df_visib["year"] != 2012]
df_visib = df_visib.loc[df_visib["year"] < 2020]
cols = df_visib['NUM_COLLISIONS']
df_visib = df_visib.drop(columns=['NUM_COLLISIONS'])
#Move target to end
df_visib.insert(loc=5, column='NUM_COLLISIONS', value=cols)
print(df_visib[:6])
df_visib.describe()

    day  year  mo  da  visib  NUM_COLLISIONS
49    4  2016   1  28   10.0             681
51    5  2014   1  17    6.7             589
54    1  2016   1  25   10.0             658
55    5  2016   1  29   10.0             645
58    5  2017   1  20   10.0             605
59    7  2013   1  13    4.3             373


Unnamed: 0,day,year,mo,da,visib,NUM_COLLISIONS
count,2556.0,2556.0,2556.0,2556.0,2556.0,2556.0
mean,3.999218,2016.0,6.524257,15.725743,8.295618,599.118936
std,2.000391,2.0,3.449013,8.800168,2.20787,100.258581
min,1.0,2013.0,1.0,1.0,0.2,188.0
25%,2.0,2014.0,4.0,8.0,7.1,531.0
50%,4.0,2016.0,7.0,16.0,9.4,602.0
75%,6.0,2018.0,10.0,23.0,10.0,665.0
max,7.0,2019.0,12.0,31.0,10.0,1161.0


In [97]:
#Shuffle the data
shuffle = df_visib.iloc[np.random.permutation(len(df_visib))]
predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])
print(shuffle[:6])

      day  year  mo  da  visib
2825    4  2019  10  24   10.0
1856    1  2019   7   8   10.0
1021    7  2015   4  12   10.0
2935    3  2016  10   5   10.0
1685    6  2018   6  23    5.7
2187    2  2016   8   9   10.0
      day  year  mo  da  visib  NUM_COLLISIONS
2825    4  2019  10  24   10.0             613
1856    1  2019   7   8   10.0             592
1021    7  2015   4  12   10.0             443
2935    3  2016  10   5   10.0             695
1685    6  2018   6  23    5.7             513
2187    2  2016   8   9   10.0             650


In [98]:
# Select the last col as target
targets = shuffle.iloc[:,-1]

print(targets[:6])

2825    613
1856    592
1021    443
2935    695
1685    513
2187    650
Name: NUM_COLLISIONS, dtype: int64


In [99]:
#Split to test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
print(trainsize)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize
print(testsize)
nppredictors = 5
noutputs = 1

2044
512


In [100]:
# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model_visib', ignore_errors=True)

#Setup Model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_visib', optimizer=tf.train.AdamOptimizer(learning_rate=0.0001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))
print("starting to train");
#train Model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)
preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS
#Test Model
pred = format(str(predslistscale))
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));

avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21848e5050>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_visib', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/linear_regression_trained_model_visib/model.ckpt.
INFO:tensorflow:loss = 0.27459064, step = 1
INFO:tensorflow:global_step/sec: 657.885
INFO:tensorflow:loss = 0.009505197, step = 101 (0.157 sec)
INFO:tensorflow:global_step/sec: 849.991
INFO:tensorflow:loss = 0.008307781, step = 201 (0.119 sec)
INFO:tensorflow:global_step/sec: 826.181
INFO:tensorflow:loss = 0.0071506426, step = 301 (0.121 sec)
INFO:tensorflow:global_step/sec: 755.623
INFO:tensorflow:loss = 0.0070540262, step = 401 (0.132 sec)
INFO:tensorflow:global_step/sec: 754.632
INFO:tensorflow:loss = 0.007330095, step = 501 (0.132 sec)
INFO:tensorflow:global_step/sec: 797.108
INFO:tensorflow:loss = 0.0065802094, step = 601 (0.125 sec)
INFO:tensorflow:global_step/sec: 687.471
INFO:tensorflow:loss = 0.0069297077, step = 701

LinearRegression has RMSE of 93.30438690969395
Just using average = 600.527397260274 has RMSE of 97.94271362023213


In [101]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

#testd = pd.DataFrame.from_records(predictors[trainsize:].values,columns=['day','year','month','da','prcp','fog','rain','snow','hail'])
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_visib', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2184612150>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_visib', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_vi

512
512
[0.51203763 0.56929165 0.49971002 0.53724235 0.5331647  0.4596194
 0.5065841  0.5196762  0.4785367  0.50867176 0.5652723  0.4853061
 0.52714485 0.5508507  0.5110668  0.49564546 0.5065315  0.4916242
 0.51037836 0.5458444  0.5211129  0.48834977 0.5527546  0.50391304
 0.5037089  0.49517938 0.5110578  0.5189451  0.5335094  0.5112895
 0.4963031  0.5103712  0.5012339  0.53205377 0.5731611  0.55361545
 0.55337316 0.50122494 0.5149181  0.49802685 0.5079251  0.48917392
 0.51711273 0.5218819  0.4799218  0.5363916  0.49335063 0.5628964
 0.5526853  0.47750828 0.48871735 0.4885171  0.5009921  0.52025706
 0.49132308 0.5235904  0.5267433  0.50128156 0.5249217  0.4807761
 0.51871735 0.54851705 0.46648568 0.51754755 0.51672596 0.4948084
 0.4541079  0.55862695 0.50122607 0.5371028  0.4885851  0.563091
 0.45256427 0.5198303  0.51368785 0.50315446 0.52299356 0.55308145
 0.49334005 0.5486676  0.45334056 0.46320236 0.51783854 0.5497701
 0.5620634  0.5276797  0.5162332  0.5170213  0.4965934  0.535922

In [102]:
input = pd.DataFrame.from_dict(data = 
				{
         'day' : [1,1,1,10],
         'year' : [2019,2019,2019,2020],
         'mo' : [3,3,3,12],
         'da' : [10,10,10,12],
         'visib' : [1,5,9.5,5],
         
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_visib', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21848b7d90>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_visib', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_vi

[648.7901  633.99225 617.3446  546.5106 ]


As the difference RMSE between the mean and the model is less, it is arguable that the model is not as accurate.

Although the RMSE value indicates a weaker model, the error rate is lower indicating that there is a significant error increasing the RMSE.

#Deep Learning Neural Network (DNN)
Although the primary outcome of assignment 1 was uncovering linear relationships, more complex relationships which cannot be predicted using a linear regressor were uncovered. In this case a Deep Learning Neural Network (DNN) can be used.

A Deep Learning Neural Network (DNN) is a form of unsupervised learning, where a number of hidden layers are used to uncover non-linear relationships. Karhunen, Raiko and Cho (2015) infer that deep learning neural networks work in a similar way to the human brain. This is where both the relationship between the input and output data is explored, as well as the relationship between the underlying data.

##Precipitation (prcp)
The process for training a DNN follows a similar process followed above for a Linear Regressor. The data cleansed and one hot encoded as part of assignment 1 is loaded from GitHub.

In [103]:
#Read Data
df = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/prcp_clean_dnn.csv', index_col=0, )
print(df[:6])

   year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd   gust  \
1  2020  24             524  37.3  33.7  1028.5    6.5   3.3    8.0  999.9   
2  2021  12             278  37.0  29.1  1019.0   10.0   6.5   12.0  999.9   
3  2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0   20.0   
4  2021  27             262  34.6  33.8  1012.8    8.0   7.8   12.0  999.9   
5  2021  26             263  31.9  23.4  1016.9    9.0   7.4   12.0  999.9   
6  2022  24             237  34.5  23.8  1010.6    9.7   7.4   12.0  999.9   

   ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  ...    0    0    0    0    0    0    0    1    0    0  
2  ...    0    0    0    0    1    0    0    0    0    0  
3  ...    0    0    0    0    0    0    0    1    0    0  
4  ...    0    0    0    0    0    0    0    0    1    0  
5  ...    0    0    0    0    1    0    0    0    0    0  
6  ...    0    0    0    0    0    0    1    0    0    0  

[6 rows x 39 columns]


In [104]:
#Remove Cols not Required 
df_prcp_dnn = df.drop(columns=['temp', 'dewp', 'slp','visib','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud'])
df_prcp_dnn = df_prcp_dnn.loc[df_prcp_dnn["year"] != 2012]
df_prcp_dnn = df_prcp_dnn.loc[df_prcp_dnn["year"] < 2020]
#Move target to the end
cols = df_prcp_dnn['NUM_COLLISIONS']
df_prcp_dnn = df_prcp_dnn.drop(columns=['NUM_COLLISIONS'])
df_prcp_dnn.insert(loc=26, column='NUM_COLLISIONS', value=cols)
print(df_prcp_dnn[:6])
df_prcp_dnn.describe()

    year  da  prcp  fog  rain_drizzle  snow_ice_pellets  hail  Apr  Aug  Dec  \
49  2016  28  0.09    0             0                 0     0    0    0    0   
51  2014  17  0.00    1             0                 0     0    0    0    0   
54  2016  25  0.02    0             0                 0     0    0    0    0   
55  2016  29  0.00    0             0                 0     0    0    0    0   
58  2017  20  0.00    0             0                 0     0    0    0    0   
59  2013  13  0.01    1             0                 0     0    0    0    0   

    ...  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
49  ...    0    0    0    0    0    0    0    0    1             681  
51  ...    0    0    0    0    0    0    1    0    0             589  
54  ...    0    0    0    0    0    1    0    0    0             658  
55  ...    0    0    0    0    0    0    1    0    0             645  
58  ...    0    0    0    0    0    0    1    0    0             605  
59  ...    0 

Unnamed: 0,year,da,prcp,fog,rain_drizzle,snow_ice_pellets,hail,Apr,Aug,Dec,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,...,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0
mean,2015.989366,15.745569,0.122588,0.253249,0.375345,0.085467,0.000394,0.082316,0.083497,0.085467,...,0.085467,0.079953,0.14297,0.14297,0.143364,0.143757,0.142182,0.142576,0.142182,599.135093
std,1.996126,8.803199,0.329143,0.434958,0.484307,0.27963,0.019846,0.274899,0.276687,0.27963,...,0.27963,0.271273,0.350111,0.350111,0.350512,0.350913,0.349305,0.349709,0.349305,100.299164
min,2013.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,531.0
50%,2016.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,602.0
75%,2018.0,23.0,0.06,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,665.0
max,2019.0,31.0,3.76,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [105]:
# Shuffle Data
shuffle = df_prcp_dnn.iloc[np.random.permutation(len(df_prcp_dnn))]
predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da  prcp  fog  rain_drizzle  snow_ice_pellets  hail  Apr  Aug  \
3308  2019  30  0.00    0             0                 0     0    0    0   
1460  2018  25  0.00    1             0                 0     0    0    0   
638   2015   1  0.00    0             0                 1     0    0    0   
279   2013  31  0.02    1             1                 0     0    0    0   
1007  2014   4  0.00    0             0                 0     0    1    0   
3447  2017   4  0.00    0             0                 0     0    0    0   

      Dec  ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
3308    0  ...    1    0    0    1    0    0    0    0    0    0  
1460    0  ...    0    0    0    0    0    0    0    1    0    0  
638     0  ...    0    0    0    0    0    1    0    0    0    0  
279     0  ...    0    0    0    0    0    0    0    0    0    1  
1007    0  ...    0    0    0    0    0    0    0    1    0    0  
3447    1  ...    0    0    0    0    0    0    1    0    

In [106]:
#Select target as last col
targets = shuffle.iloc[:,-1]
print(targets[:6])

3308    465
1460    799
638     693
279     592
1007    622
3447    684
Name: NUM_COLLISIONS, dtype: int64


In [107]:
#Split to test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize
noutputs = 1
# calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

26


The difference between the training of a DNN and linear regression model, is the addition of hidden layers. In order to optimise the trained model, the number of hidden layers, the number of nodes within each layer and the learning rate were all modified.

In [108]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_prcp', ignore_errors=True)

#Setup Model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_prcp', hidden_units=[20,18,13], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))
print("starting to train");
#Train Model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)

predslistscale = preds['scores']*SCALE_COLLISIONS

#Test model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2186e71cd0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_prcp/model.ckpt.
INFO:tensorflow:loss = 28197.518, step = 1
INFO:tensorflow:global_step/sec: 368.348
INFO:tensorflow:loss = 0.058571868, step = 101 (0.275 sec)
INFO:tensorflow:global_step/sec: 528.723
INFO:tensorflow:loss = 0.018285077, step = 201 (0.192 sec)
INFO:tensorflow:global_step/sec: 532.143
INFO:tensorflow:loss = 0.015707014, step = 301 (0.188 sec)
INFO:tensorflow:global_step/sec: 569.757
INFO:tensorflow:loss = 0.013750879, step = 401 (0.173 sec)
INFO:tensorflow:global_step/sec: 479.117
INFO:tensorflow:loss = 0.012924065, step = 501 (0.210 sec)
INFO:tensorflow:global_step/sec: 568.161
INFO:tensorflow:loss = 0.010729466, step = 601 (0.175 sec)
INFO:tensorflow:global_step/sec: 525.617
INFO:tensorflow:loss = 0.009711044, step = 701 (0.192 s

DNNRegression has RMSE of 213.61763574817323
Just using average = 598.4052191038897 has RMSE of 107.25214718599702


In [109]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))
#Ensure hidden layers match the model trained above
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_prcp', hidden_units=[20,18,13], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2184cc6ad0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}


508
508


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_prcp/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.47063005 0.6991395  0.8176476  0.66398084 0.61244667 0.894935
 0.5318612  0.7843565  0.7408217  0.6858996  0.6934792  0.76377094
 0.464527   0.28804004 0.62882197 0.58927953 0.82757103 0.7063514
 0.82870376 0.21499814 0.3918761  0.5395268  0.6454295  0.23120536
 0.6257101  0.34818208 0.6918303  0.47304833 0.8110062  0.41884673
 0.7397896  0.7759863  0.55743444 0.8293468  0.35304892 0.5767642
 0.7565862  0.61564577 0.6500515  0.23896016 0.5030142  0.45227635
 0.23676957 0.64034975 0.38316047 0.7324356  0.4970585  0.3158275
 0.44244444 0.71325433 0.30481946 0.4245578  0.3584088  0.49355876
 0.77216065 0.45941007 0.6146046  0.47531474 0.43749297 0.40387356
 0.7924274  0.34974182 0.74032223 0.77735436 0.6322795  0.39708936
 0.6952187  0.16465868 0.5950438  0.77372396 0.44510806 0.15381922
 0.46390307 0.44802248 0.48448598 0.3597015  0.5361577  0.47861397
 0.22917189 0.49556386 0.82228625 0.2928096  0.65191424 0.5200051
 0.54712164 0.22848238 0.37621343 0.46496904 0.32403052 0.33914268
 

In [110]:
input = pd.DataFrame.from_dict(data = 
				{
            'year':[2019,2019,2019,2020],
          'da':[10,10,10,20],
         'prcp' : [0,2.34,5.5,2.24],
         'fog' : [0,0,1,1],
         'rain_drizzle' : [0,1,1,1],
         'snow_ice_pellets' : [0,0,0,0],
         'hail' : [0,0,0,0],
         'Apr' : [0,0,0,0],
         'Aug' : [1,1,1,1],
         'Dec' : [0,0,0,0],
         'Feb' : [0,0,0,0],
         'Jan' : [0,0,0,0],
         'Jul' : [0,0,0,0],
         'Jun' : [0,0,0,0],
         'Mar' : [0,0,0,0],
         'May' : [0,0,0,0],
         'Nov' : [0,0,0,0],
         'Oct' : [0,0,0,0],
         'Sep' : [0,0,0,0],
         'Fri' : [0,0,0,0],
         'Mon' : [1,1,1,1],
         'Sat' : [0,0,0,0],
         'Sun' : [0,0,0,0],
         'Thu' : [0,0,0,0],
         'Tue' : [0,0,0,0],
         'Wed' : [0,0,0,0]
      
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_prcp', hidden_units=[20,18,13], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180fb9950>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_prcp/mode

[932.5393 882.323  850.8384 659.843 ]


The RMSE value is very similar to that of the linear regression model, indicating that both models are accurate predictors. The error rate of the DNN is higher, indicating the there is a higher number of errors, but the margin of error is lower.

##Dew Point (dewp)
As with the linear regressor the process of training each model follows a very similar process, with the number of hidden layers, the number of nodes within each layer and the learning rate changing dependant on the dataset.

In [111]:
#Read Data
df_dewp_dnn = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/dewp_clean_dnn.csv', index_col=0, )
print(df_dewp_dnn[:6])

   year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd   gust  \
1  2020  24             524  37.3  33.7  1028.5    6.5   3.3    8.0  999.9   
2  2021  12             278  37.0  29.1  1019.0   10.0   6.5   12.0  999.9   
3  2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0   20.0   
4  2021  27             262  34.6  33.8  1012.8    8.0   7.8   12.0  999.9   
5  2021  26             263  31.9  23.4  1016.9    9.0   7.4   12.0  999.9   
6  2022  24             237  34.5  23.8  1010.6    9.7   7.4   12.0  999.9   

   ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  ...    0    0    0    0    0    0    0    1    0    0  
2  ...    0    0    0    0    1    0    0    0    0    0  
3  ...    0    0    0    0    0    0    0    1    0    0  
4  ...    0    0    0    0    0    0    0    0    1    0  
5  ...    0    0    0    0    1    0    0    0    0    0  
6  ...    0    0    0    0    0    0    1    0    0    0  

[6 rows x 39 columns]


In [112]:
#Remove Cols not Required 
df_dewp_dnn = df_dewp_dnn.drop(columns=['temp', 'prcp', 'slp','visib','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_dewp_dnn = df_dewp_dnn.loc[df_dewp_dnn["year"] != 2012]
df_dewp_dnn = df_dewp_dnn.loc[df_dewp_dnn["year"] < 2020]
#Move target to end
cols = df_dewp_dnn['NUM_COLLISIONS']
df_dewp_dnn = df_dewp_dnn.drop(columns=['NUM_COLLISIONS'])
df_dewp_dnn.insert(loc=22, column='NUM_COLLISIONS', value=cols)
print(df_dewp_dnn[:6])
df_dewp_dnn.describe()

    year  da  dewp  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Oct  Sep  Fri  \
49  2016  28  24.4    0    0    0    0    1    0    0  ...    0    0    0   
51  2014  17  35.8    0    0    0    0    1    0    0  ...    0    0    0   
54  2016  25  21.2    0    0    0    0    1    0    0  ...    0    0    0   
55  2016  29  36.8    0    0    0    0    1    0    0  ...    0    0    0   
58  2017  20  32.5    0    0    0    0    1    0    0  ...    0    0    0   
59  2013  13  44.9    0    0    0    0    1    0    0  ...    0    0    0   

    Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
49    0    0    0    0    0    1             681  
51    0    0    0    1    0    0             589  
54    0    0    1    0    0    0             658  
55    0    0    0    1    0    0             645  
58    0    0    0    1    0    0             605  
59    0    1    0    0    0    0             373  

[6 rows x 23 columns]


Unnamed: 0,year,da,dewp,Apr,Aug,Dec,Feb,Jan,Jul,Jun,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,...,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0
mean,2015.999217,15.723679,44.16317,0.082192,0.084932,0.084932,0.077104,0.084932,0.08454,0.082192,...,0.084932,0.082192,0.142466,0.143249,0.142857,0.142857,0.142857,0.142857,0.142857,599.10998
std,2.0,8.801271,16.995303,0.27471,0.278834,0.278834,0.266808,0.278834,0.278251,0.27471,...,0.278834,0.27471,0.349596,0.350395,0.349996,0.349996,0.349996,0.349996,0.349996,100.277185
min,2013.0,1.0,-6.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,32.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,531.0
50%,2016.0,16.0,45.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,602.0
75%,2018.0,23.0,58.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,665.0
max,2019.0,31.0,74.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [113]:
#Shuffle Data
shuffle = df_dewp_dnn.iloc[np.random.permutation(len(df_dewp_dnn))]

predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da  dewp  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Nov  Oct  Sep  \
2556  2018  21  54.6    0    0    0    0    0    0    0  ...    0    0    1   
2514  2013  25  45.3    0    0    0    0    0    0    0  ...    0    0    1   
502   2019   7  37.9    0    0    0    1    0    0    0  ...    0    0    0   
3349  2014   1  46.9    0    0    0    0    0    0    0  ...    1    0    0   
2916  2019   7  61.7    0    0    0    0    0    0    0  ...    0    1    0   
189   2017  12  45.5    0    0    0    0    1    0    0  ...    0    0    0   

      Fri  Mon  Sat  Sun  Thu  Tue  Wed  
2556    0    0    0    0    1    0    0  
2514    0    0    0    0    0    1    0  
502     0    0    0    0    0    0    1  
3349    1    0    0    0    0    0    0  
2916    0    0    0    1    0    0    0  
189     0    0    0    0    0    0    1  

[6 rows x 22 columns]


In [114]:
#Select last col as target
targets = shuffle.iloc[:,-1]
print(targets[:6])

2556    715
2514    539
502     588
3349    656
2916    575
189     646
Name: NUM_COLLISIONS, dtype: int64


In [115]:
#Split to test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
#Calculate number of outputs
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

22


In [116]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_dewp', ignore_errors=True)

#Setup Model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_dewp', hidden_units=[21,17,9], optimizer=tf.train.AdamOptimizer(learning_rate=0.0001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

#Train model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS
#Test Model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21846dd690>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_dewp', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_dewp/model.ckpt.
INFO:tensorflow:loss = 6797.327, step = 1
INFO:tensorflow:global_step/sec: 389.68
INFO:tensorflow:loss = 9.369931, step = 101 (0.264 sec)
INFO:tensorflow:global_step/sec: 442.251
INFO:tensorflow:loss = 5.141411, step = 201 (0.224 sec)
INFO:tensorflow:global_step/sec: 528.123
INFO:tensorflow:loss = 4.6838293, step = 301 (0.188 sec)
INFO:tensorflow:global_step/sec: 480.193
INFO:tensorflow:loss = 3.7785985, step = 401 (0.208 sec)
INFO:tensorflow:global_step/sec: 481.663
INFO:tensorflow:loss = 2.8587873, step = 501 (0.210 sec)
INFO:tensorflow:global_step/sec: 475.056
INFO:tensorflow:loss = 2.564952, step = 601 (0.217 sec)
INFO:tensorflow:global_step/sec: 480.993
INFO:tensorflow:loss = 2.0920012, step = 701 (0.201 sec)
INFO:tensorflow

DNNRegression has RMSE of 82.12004390938685
Just using average = 598.3923679060665 has RMSE of 99.44265935333182


In [117]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

# Ensure number of hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_dewp', hidden_units=[21,17,9], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21809a56d0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_dewp', '_session_creation_timeout_secs': 7200}


511
511


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_dewp/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.5936745  0.5735634  0.5188759  0.5288704  0.5355385  0.5756081
 0.56106645 0.5687416  0.4068001  0.43673784 0.50471574 0.55792314
 0.5229042  0.4728554  0.41969377 0.55380327 0.5918434  0.40025407
 0.51411515 0.4107979  0.5256966  0.4297188  0.51452714 0.588105
 0.52067643 0.42454606 0.42288285 0.4568947  0.50116044 0.4597786
 0.39795    0.4763649  0.41940385 0.43850785 0.51870805 0.5434273
 0.56633073 0.53166276 0.5426491  0.5306862  0.4337929  0.47738725
 0.5656288  0.46231157 0.5537575  0.5093697  0.51821977 0.47634965
 0.48704606 0.50370866 0.49202043 0.5383614  0.5662697  0.47586137
 0.4889534  0.53044206 0.5877693  0.59246904 0.6017464  0.44694597
 0.5096901  0.49504167 0.5294655  0.53344804 0.53041154 0.50907975
 0.5255287  0.5096596  0.4078377  0.53514177 0.5080574  0.5009926
 0.5144966  0.5104378  0.5056465  0.6004036  0.54144365 0.52833635
 0.47476274 0.5176094  0.5780495  0.547242   0.4281624  0.5160988
 0.5456551  0.52584916 0.47622758 0.552079   0.5000923  0.3929146
 0.

As shown by the RMSE value, this model is a more efficient way to predict the number of collisions in comparison to using the mean. In comparison to the linear regression model trained, the RMSE is lower indicating the DNN makes more accurate predictions. As with the DNN for precipitation, the RMSE is lower than the linear model the error rate is higher indicating there is more errors but the margin of error is lower.

##Sea Level Pressure(slp)
Through the analysis carried out in assignment 1, no clear relationship between sea level pressure and the number of collisions was uncovered. A DNN will be used to attempt to predict the number of collisions at a given pressure point.

In [118]:
#Read Data
df_slp_dnn = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/slp_clean_dnn.csv', index_col=0, )
print(df[:6])

   year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd   gust  \
1  2020  24             524  37.3  33.7  1028.5    6.5   3.3    8.0  999.9   
2  2021  12             278  37.0  29.1  1019.0   10.0   6.5   12.0  999.9   
3  2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0   20.0   
4  2021  27             262  34.6  33.8  1012.8    8.0   7.8   12.0  999.9   
5  2021  26             263  31.9  23.4  1016.9    9.0   7.4   12.0  999.9   
6  2022  24             237  34.5  23.8  1010.6    9.7   7.4   12.0  999.9   

   ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  ...    0    0    0    0    0    0    0    1    0    0  
2  ...    0    0    0    0    1    0    0    0    0    0  
3  ...    0    0    0    0    0    0    0    1    0    0  
4  ...    0    0    0    0    0    0    0    0    1    0  
5  ...    0    0    0    0    1    0    0    0    0    0  
6  ...    0    0    0    0    0    0    1    0    0    0  

[6 rows x 39 columns]


In [119]:
#Remove Cols not Required 
df_slp_dnn = df_slp_dnn.drop(columns=['temp', 'prcp', 'dewp','visib','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_slp_dnn = df_slp_dnn.loc[df_slp_dnn["year"] != 2012]
df_slp_dnn = df_slp_dnn.loc[df_slp_dnn["year"] < 2020]
#Move target to the end
cols = df_slp_dnn['NUM_COLLISIONS']
df_slp_dnn = df_slp_dnn.drop(columns=['NUM_COLLISIONS'])
df_slp_dnn.insert(loc=22, column='NUM_COLLISIONS', value=cols)
print(df_slp_dnn[:6])
df_slp_dnn.describe()

    year  da     slp  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Oct  Sep  Fri  \
49  2016  28  1016.1    0    0    0    0    1    0    0  ...    0    0    0   
51  2014  17  1014.8    0    0    0    0    1    0    0  ...    0    0    0   
54  2016  25  1021.4    0    0    0    0    1    0    0  ...    0    0    0   
55  2016  29   999.4    0    0    0    0    1    0    0  ...    0    0    0   
58  2017  20  1015.5    0    0    0    0    1    0    0  ...    0    0    0   
59  2013  13  1020.7    0    0    0    0    1    0    0  ...    0    0    0   

    Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
49    0    0    0    0    0    1             681  
51    0    0    0    1    0    0             589  
54    0    0    1    0    0    0             658  
55    0    0    0    1    0    0             645  
58    0    0    0    1    0    0             605  
59    0    1    0    0    0    0             373  

[6 rows x 23 columns]


Unnamed: 0,year,da,slp,Apr,Aug,Dec,Feb,Jan,Jul,Jun,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,...,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0
mean,2016.000391,15.719765,1016.777221,0.082192,0.084932,0.08454,0.077104,0.084932,0.084932,0.082192,...,0.084932,0.082192,0.142857,0.143249,0.142857,0.142857,0.142857,0.142857,0.142466,599.147162
std,2.000294,8.796698,7.628429,0.27471,0.278834,0.278251,0.266808,0.278834,0.278834,0.27471,...,0.278834,0.27471,0.349996,0.350395,0.349996,0.349996,0.349996,0.349996,0.349596,100.268048
min,2013.0,1.0,989.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,1012.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,531.0
50%,2016.0,16.0,1016.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,602.0
75%,2018.0,23.0,1021.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,665.0
max,2019.0,31.0,1044.2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [120]:
#Shuffle dataset
shuffle = df_slp_dnn.iloc[np.random.permutation(len(df_slp_dnn))]

predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da     slp  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Nov  Oct  Sep  \
3150  2014   5  1018.3    0    0    0    0    0    0    0  ...    1    0    0   
3100  2015  18  1034.9    0    0    0    0    0    0    0  ...    1    0    0   
1807  2017   6  1006.3    0    0    0    0    0    0    1  ...    0    0    0   
3133  2013  17  1024.9    0    0    0    0    0    0    0  ...    1    0    0   
3666  2014   8  1035.8    0    0    1    0    0    0    0  ...    0    0    0   
857   2019   2  1016.8    0    0    0    0    0    0    0  ...    0    0    0   

      Fri  Mon  Sat  Sun  Thu  Tue  Wed  
3150    0    0    0    0    0    1    0  
3100    0    0    0    0    0    1    0  
1807    0    1    0    0    0    0    0  
3133    0    0    1    0    0    0    0  
3666    0    0    0    1    0    0    0  
857     1    0    0    0    0    0    0  

[6 rows x 22 columns]


In [121]:
#Select Target
targets = shuffle.iloc[:,-1]

# print out the first 6 rows of the targets data.
print(targets[:6])

3150    554
3100    628
1807    660
3133    532
3666    579
857     642
Name: NUM_COLLISIONS, dtype: int64


In [122]:
#Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
#Calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

22


In [123]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_slp', ignore_errors=True)
#Setup model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_slp', hidden_units=[19,15,11], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

# Train model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS
#Test Model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));

avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21846fe950>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_slp', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_slp/model.ckpt.
INFO:tensorflow:loss = 4669.8906, step = 1
INFO:tensorflow:global_step/sec: 382.448
INFO:tensorflow:loss = 0.052570075, step = 101 (0.266 sec)
INFO:tensorflow:global_step/sec: 503.265
INFO:tensorflow:loss = 0.041005157, step = 201 (0.200 sec)
INFO:tensorflow:global_step/sec: 546.107
INFO:tensorflow:loss = 0.030781014, step = 301 (0.181 sec)
INFO:tensorflow:global_step/sec: 476.473
INFO:tensorflow:loss = 0.035366643, step = 401 (0.213 sec)
INFO:tensorflow:global_step/sec: 531.736
INFO:tensorflow:loss = 0.025151623, step = 501 (0.188 sec)
INFO:tensorflow:global_step/sec: 538.3
INFO:tensorflow:loss = 0.032797255, step = 601 (0.185 sec)
INFO:tensorflow:global_step/sec: 466.196
INFO:tensorflow:loss = 0.017441947, step = 701 (0.212 sec)

DNNRegression has RMSE of 98.85118429161771
Just using average = 596.4124266144814 has RMSE of 103.92185858384369


In [124]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

#Ensure hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_slp', hidden_units=[19,15,11], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21845ebd50>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_slp', '_session_creation_timeout_secs': 7200}


511
511


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_slp/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.5917966  0.5788905  0.5427968  0.6050431  0.60605115 0.53256863
 0.55283135 0.6348626  0.58504266 0.56610745 0.5229327  0.57837456
 0.57006425 0.57838696 0.42912263 0.62351006 0.5539052  0.598273
 0.544069   0.5425717  0.55858487 0.48077077 0.6013991  0.593394
 0.60126084 0.62328976 0.52626103 0.56571263 0.56194466 0.5511157
 0.5327956  0.5393569  0.5615527  0.59272546 0.541166   0.5956523
 0.57379025 0.62063664 0.44211453 0.48356122 0.58795995 0.6356494
 0.56791276 0.55647916 0.5654122  0.6186139  0.5639493  0.53519696
 0.44032258 0.43082494 0.5323474  0.57730263 0.48051232 0.528588
 0.61530846 0.5364129  0.5203511  0.58852357 0.5159642  0.621514
 0.59628934 0.5974147  0.626517   0.44697922 0.55336446 0.60993356
 0.53923196 0.5500466  0.60500497 0.60565156 0.5661542  0.6001603
 0.5909774  0.5644137  0.5798623  0.6243207  0.62324303 0.50806206
 0.6059577  0.56019753 0.5835902  0.6280181  0.6066901  0.58339375
 0.5862071  0.4644019  0.6453149  0.6018407  0.5802838  0.577357
 0.550438

Although no linear relationship was uncovered, it can be argued there is a relationship present due to the RMSE value which is lower than the mean value. This relationship is also shown in the error rate which is comparative to the other models produced.

##Gust
As with sea level pressure, no linear relationship between the maximum gust and the number of collisions was uncovered.

In [125]:
#Read data
df_gust_dnn = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/gust_clean_dnn.csv', index_col=0, )
print(df_gust_dnn[:6])

    year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd  gust  \
3   2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0  20.0   
11  2020  15             508  43.9  38.3  1019.4    8.2   5.4   14.0  15.0   
12  2021   1             257  39.6  29.3  1029.3   10.0   7.6   14.0  20.0   
14  2022  25             235  41.6  31.8  1013.2   10.0   9.6   15.0  19.0   
18  2021   3             186  41.1  32.3  1018.0   10.0  10.3   19.0  27.0   
19  2020   2             413  39.6  28.9  1011.8   10.0  13.0   19.0  26.0   

    ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
3   ...    0    0    0    0    0    0    0    1    0    0  
11  ...    0    0    0    0    0    0    0    0    1    0  
12  ...    0    0    0    0    0    0    0    1    0    0  
14  ...    0    0    0    0    1    0    0    0    0    0  
18  ...    0    0    0    0    0    1    0    0    0    0  
19  ...    0    0    0    0    0    0    0    0    0    1  

[6 rows x 39 columns]


In [126]:
#Remove Cols not Required 
df_gust_dnn = df_gust_dnn.drop(columns=['temp', 'prcp', 'dewp','visib','max','min','sndp','wdsp','mxpsd','slp','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_gust_dnn = df_gust_dnn.loc[df_gust_dnn["year"] != 2012]
df_gust_dnn = df_gust_dnn.loc[df_gust_dnn["year"] < 2020]
#Move target col to end
cols = df_gust_dnn['NUM_COLLISIONS']
df_gust_dnn = df_gust_dnn.drop(columns=['NUM_COLLISIONS'])
df_gust_dnn.insert(loc=22, column='NUM_COLLISIONS', value=cols)
print(df_gust_dnn[:6])
df_gust_dnn.describe()

    year  da  gust  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Oct  Sep  Fri  \
74  2016  17  18.1    0    0    0    0    1    0    0  ...    0    0    0   
76  2014   9  20.0    0    0    0    0    1    0    0  ...    0    0    0   
79  2019  19  21.0    0    0    0    0    1    0    0  ...    0    0    1   
80  2015  11  17.1    0    0    0    0    1    0    0  ...    0    0    0   
83  2015  29  20.0    0    0    0    0    1    0    0  ...    0    0    0   
85  2019  13  15.9    0    0    0    0    1    0    0  ...    0    0    0   

    Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
74    0    1    0    0    0    0             451  
76    0    0    0    0    0    1             561  
79    0    0    0    0    0    0             479  
80    0    1    0    0    0    0             341  
83    0    0    0    0    0    1             519  
85    0    1    0    0    0    0             374  

[6 rows x 23 columns]


Unnamed: 0,year,da,gust,Apr,Aug,Dec,Feb,Jan,Jul,Jun,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,...,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0
mean,2015.91283,15.702885,27.511602,0.095764,0.042357,0.104359,0.09515,0.108656,0.046041,0.061387,...,0.087784,0.071209,0.143646,0.139963,0.141191,0.139963,0.151627,0.138122,0.145488,596.513198
std,2.01341,8.667634,7.36677,0.294358,0.201465,0.305819,0.293513,0.311302,0.209637,0.240113,...,0.283067,0.257253,0.350839,0.347055,0.348325,0.347055,0.358769,0.345133,0.3527,104.47966
min,2013.0,1.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,526.0
50%,2016.0,16.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,597.0
75%,2018.0,23.0,31.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,663.0
max,2019.0,31.0,71.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [127]:
# Shuffle Data
shuffle = df_gust_dnn.iloc[np.random.permutation(len(df_gust_dnn))]

predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da  gust  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Nov  Oct  Sep  \
1346  2018  19  21.0    0    0    0    0    0    0    0  ...    0    0    0   
1793  2017  20  27.0    0    0    0    0    0    0    1  ...    0    0    0   
2313  2014  22  19.0    0    1    0    0    0    0    0  ...    0    0    0   
2261  2017   7  26.0    0    1    0    0    0    0    0  ...    0    0    0   
2658  2018  11  24.1    0    0    0    0    0    0    0  ...    0    0    1   
2396  2018  16  20.0    0    1    0    0    0    0    0  ...    0    0    0   

      Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1346    1    0    0    0    0    0    0  
1793    0    1    0    0    0    0    0  
2313    0    0    0    0    1    0    0  
2261    0    0    0    1    0    0    0  
2658    0    1    0    0    0    0    0  
2396    0    0    0    0    0    0    1  

[6 rows x 22 columns]


In [128]:
# Select last col as a target
targets = shuffle.iloc[:,-1]
print(targets[:6])

1346    583
1793    813
2313    568
2261    606
2658    544
2396    647
Name: NUM_COLLISIONS, dtype: int64


In [129]:
# Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
# Calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

22


In [130]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_gust', ignore_errors=True)

#Setup model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_gust', hidden_units=[23,19,11], optimizer=tf.train.AdamOptimizer(learning_rate=0.001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

# Train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS

#Test the model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180f5fed0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_gust', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_gust/model.ckpt.
INFO:tensorflow:loss = 17590.316, step = 1
INFO:tensorflow:global_step/sec: 382.345
INFO:tensorflow:loss = 0.15763947, step = 101 (0.266 sec)
INFO:tensorflow:global_step/sec: 556.143
INFO:tensorflow:loss = 0.02768094, step = 201 (0.179 sec)
INFO:tensorflow:global_step/sec: 494.775
INFO:tensorflow:loss = 0.02600477, step = 301 (0.203 sec)
INFO:tensorflow:global_step/sec: 505.587
INFO:tensorflow:loss = 0.025448497, step = 401 (0.197 sec)
INFO:tensorflow:global_step/sec: 509.087
INFO:tensorflow:loss = 0.016932437, step = 501 (0.199 sec)
INFO:tensorflow:global_step/sec: 530.131
INFO:tensorflow:loss = 0.022061246, step = 601 (0.189 sec)
INFO:tensorflow:global_step/sec: 495.747
INFO:tensorflow:loss = 0.020206688, step = 701 (0.201 sec)

DNNRegression has RMSE of 83.0932609202468
Just using average = 598.0214888718342 has RMSE of 102.22298263083186


In [131]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

#Ensure the hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_gust', hidden_units=[23,19,11], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180c94650>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_gust', '_session_creation_timeout_secs': 7200}


326
326


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_gust/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.5188271  0.36390457 0.49615058 0.56507266 0.51946986 0.47704086
 0.48370704 0.49621543 0.47898063 0.4349266  0.56818354 0.5488583
 0.48026428 0.42894325 0.4329563  0.42191657 0.35769233 0.5655476
 0.5220982  0.46409377 0.37677345 0.50938    0.41703567 0.4025627
 0.5555798  0.39144668 0.40776405 0.3493229  0.41965637 0.52045786
 0.4768215  0.51956904 0.48221168 0.37879142 0.54401743 0.548946
 0.5190159  0.4634796  0.55416644 0.45651206 0.46388015 0.42749175
 0.48287925 0.42678413 0.47890243 0.49764976 0.40844688 0.44855842
 0.45801505 0.54503787 0.4487072  0.5102097  0.4700218  0.4360977
 0.54078066 0.47660407 0.4482666  0.50360453 0.51975214 0.5024048
 0.4851223  0.50352824 0.5201527  0.4850479  0.5313755  0.486652
 0.4737049  0.5966431  0.37022552 0.47613868 0.45756492 0.5713154
 0.48762283 0.5024067  0.41706237 0.51967394 0.5035378  0.5616566
 0.4640747  0.35955963 0.594936   0.50145495 0.48008308 0.423328
 0.37171516 0.52702487 0.5273453  0.36366233 0.56755793 0.52733004
 0.48752

This model shows there is no relationship between gust and the number of collisions.

The RMSE value is over 4 times the mean value indicating the model is unable to make accurate predictions. This is also reflected in the very high error rate.

##Maximum Sustained Wind Speed (mxpsd)
In an attempt to predict the number of collisions given the maximum sustained wind speed a DNN will be used as no linear relationship was uncovered within assignment 1.

In [132]:
#Read the data
df_mxpsd_dnn = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/mxpsd_clean_dnn.csv', index_col=0, )
print(df_mxpsd_dnn[:6])

   year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd   gust  \
1  2020  24             524  37.3  33.7  1028.5    6.5   3.3    8.0  999.9   
2  2021  12             278  37.0  29.1  1019.0   10.0   6.5   12.0  999.9   
3  2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0   20.0   
4  2021  27             262  34.6  33.8  1012.8    8.0   7.8   12.0  999.9   
5  2021  26             263  31.9  23.4  1016.9    9.0   7.4   12.0  999.9   
6  2022  24             237  34.5  23.8  1010.6    9.7   7.4   12.0  999.9   

   ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  ...    0    0    0    0    0    0    0    1    0    0  
2  ...    0    0    0    0    1    0    0    0    0    0  
3  ...    0    0    0    0    0    0    0    1    0    0  
4  ...    0    0    0    0    0    0    0    0    1    0  
5  ...    0    0    0    0    1    0    0    0    0    0  
6  ...    0    0    0    0    0    0    1    0    0    0  

[6 rows x 39 columns]


In [133]:
#Remove the cols not required
df_mxpsd_dnn = df_mxpsd_dnn.drop(columns=['temp', 'prcp', 'dewp','visib','max','min','sndp','wdsp','gust','slp','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_mxpsd_dnn = df_mxpsd_dnn.loc[df_mxpsd_dnn["year"] != 2012]
df_mxpsd_dnn = df_mxpsd_dnn.loc[df_mxpsd_dnn["year"] < 2020]
#Move the target to the end
cols = df_mxpsd_dnn['NUM_COLLISIONS']
df_mxpsd_dnn = df_mxpsd_dnn.drop(columns=['NUM_COLLISIONS'])
df_mxpsd_dnn.insert(loc=22, column='NUM_COLLISIONS', value=cols)
print(df_mxpsd_dnn[:6])
df_mxpsd_dnn.describe()

    year  da  mxpsd  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Oct  Sep  Fri  \
49  2016  28    8.9    0    0    0    0    1    0    0  ...    0    0    0   
51  2014  17    8.9    0    0    0    0    1    0    0  ...    0    0    0   
54  2016  25    8.9    0    0    0    0    1    0    0  ...    0    0    0   
55  2016  29    9.9    0    0    0    0    1    0    0  ...    0    0    0   
58  2017  20    9.9    0    0    0    0    1    0    0  ...    0    0    0   
59  2013  13    9.9    0    0    0    0    1    0    0  ...    0    0    0   

    Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
49    0    0    0    0    0    1             681  
51    0    0    0    1    0    0             589  
54    0    0    1    0    0    0             658  
55    0    0    0    1    0    0             645  
58    0    0    0    1    0    0             605  
59    0    1    0    0    0    0             373  

[6 rows x 23 columns]


Unnamed: 0,year,da,mxpsd,Apr,Aug,Dec,Feb,Jan,Jul,Jun,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,...,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0
mean,2016.001567,15.737172,17.24011,0.082256,0.084998,0.084998,0.077164,0.084998,0.084998,0.082256,...,0.084998,0.081473,0.142969,0.143361,0.142969,0.142969,0.142969,0.142577,0.142186,599.033686
std,2.000587,8.797367,5.858333,0.274808,0.278933,0.278933,0.266904,0.278933,0.278933,0.274808,...,0.278933,0.273613,0.35011,0.350509,0.35011,0.35011,0.35011,0.34971,0.349309,100.284761
min,2013.0,1.0,5.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,531.0
50%,2016.0,16.0,15.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,602.0
75%,2018.0,23.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,665.0
max,2019.0,31.0,49.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [134]:
# shuffle the data
shuffle = df_mxpsd_dnn.iloc[np.random.permutation(len(df_mxpsd_dnn))]

predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da  mxpsd  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Nov  Oct  Sep  \
762   2016   3   19.0    0    0    0    0    0    0    0  ...    0    0    0   
1093  2017  21   19.0    1    0    0    0    0    0    0  ...    0    0    0   
3139  2015   3   14.0    0    0    0    0    0    0    0  ...    1    0    0   
2040  2014   2   14.0    0    0    0    0    0    1    0  ...    0    0    0   
1266  2017  31   11.1    0    0    0    0    0    0    0  ...    0    0    0   
3456  2018   8   14.0    0    0    1    0    0    0    0  ...    0    0    0   

      Fri  Mon  Sat  Sun  Thu  Tue  Wed  
762     0    0    0    0    0    0    1  
1093    0    0    0    0    1    0    0  
3139    0    1    0    0    0    0    0  
2040    0    0    0    0    0    1    0  
1266    0    0    0    0    0    1    0  
3456    1    0    0    0    0    0    0  

[6 rows x 22 columns]


In [135]:
# Select the target as the last col
targets = shuffle.iloc[:,-1]
print(targets[:6])

762     625
1093    690
3139    655
2040    658
1266    669
3456    652
Name: NUM_COLLISIONS, dtype: int64


In [136]:
#Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
# Calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

22


In [137]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_mxpsd', ignore_errors=True)

#Setup the model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_mxpsd', hidden_units=[17,13,7], optimizer=tf.train.AdamOptimizer(learning_rate=0.001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

# Train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS

# Test the model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2184aec1d0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_mxpsd', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_mxpsd/model.ckpt.
INFO:tensorflow:loss = 3820.8137, step = 1
INFO:tensorflow:global_step/sec: 399.426
INFO:tensorflow:loss = 0.10032459, step = 101 (0.256 sec)
INFO:tensorflow:global_step/sec: 451.258
INFO:tensorflow:loss = 0.09438558, step = 201 (0.220 sec)
INFO:tensorflow:global_step/sec: 481.886
INFO:tensorflow:loss = 0.06858734, step = 301 (0.210 sec)
INFO:tensorflow:global_step/sec: 475.562
INFO:tensorflow:loss = 0.088231534, step = 401 (0.207 sec)
INFO:tensorflow:global_step/sec: 434.642
INFO:tensorflow:loss = 0.06222921, step = 501 (0.229 sec)
INFO:tensorflow:global_step/sec: 419.405
INFO:tensorflow:loss = 0.06289874, step = 601 (0.242 sec)
INFO:tensorflow:global_step/sec: 466.426
INFO:tensorflow:loss = 0.05743025, step = 701 (0.213 sec)
I

DNNRegression has RMSE of 126.8893000616561
Just using average = 600.2110675808032 has RMSE of 101.47555317119034


In [138]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

# Ensure the hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_mxpsd', hidden_units=[17,13,7], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180cbe250>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_mxpsd', '_session_creation_timeout_secs': 7200}


511
511


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_mxpsd/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.39268243 0.4446085  0.39462262 0.4229653  0.40537995 0.4528101
 0.46087682 0.4553193  0.37730294 0.5131892  0.38035578 0.37657112
 0.54442966 0.46268564 0.38788593 0.46993047 0.42935663 0.52092224
 0.3892156  0.41852123 0.39377582 0.4082241  0.46014094 0.4402147
 0.49742794 0.46659845 0.3072858  0.42291182 0.40117145 0.46393573
 0.47856373 0.4420275  0.3226732  0.36103457 0.2902478  0.4914152
 0.45491165 0.35645604 0.44823664 0.38749093 0.3582512  0.39780056
 0.40780783 0.45763743 0.41064835 0.45277768 0.39463216 0.44341397
 0.4106112  0.431199   0.43909818 0.4483075  0.31129444 0.40828365
 0.43611014 0.30921847 0.43347925 0.45490915 0.34705853 0.34650886
 0.51033145 0.44218838 0.34909105 0.3695376  0.4356609  0.48392373
 0.51446706 0.5027646  0.2930268  0.4717847  0.46217448 0.46470493
 0.31352776 0.48107713 0.5097548  0.42300713 0.4510312  0.32764596
 0.30814892 0.4236434  0.36185902 0.41950643 0.4597814  0.40655673
 0.48074442 0.28789425 0.4793766  0.46170086 0.38569885 0.4393376

The low RMSE value suggests that a relationship between the maximum sustained wind speed and the number of collisions exist. The model produced can be used to predict the number of collisions with a degree of accuracy. As with the other DNN models produced the error rate is higher.

##Whole Dataset
As the purpose of this assignment is to accurately predict the number of collisions given the weather condition**s**, all available weather conditions are used as input variables to train a DNN model.

In [139]:
#Read the data
df = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/datadnn.csv', index_col=0, )
print(df[:6])

   year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd   gust  \
1  2020  24             524  37.3  33.7  1028.5    6.5   3.3    8.0  999.9   
2  2021  12             278  37.0  29.1  1019.0   10.0   6.5   12.0  999.9   
3  2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0   20.0   
4  2021  27             262  34.6  33.8  1012.8    8.0   7.8   12.0  999.9   
5  2021  26             263  31.9  23.4  1016.9    9.0   7.4   12.0  999.9   
6  2022  24             237  34.5  23.8  1010.6    9.7   7.4   12.0  999.9   

   ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  ...    0    0    0    0    0    0    0    1    0    0  
2  ...    0    0    0    0    1    0    0    0    0    0  
3  ...    0    0    0    0    0    0    0    1    0    0  
4  ...    0    0    0    0    0    0    0    0    1    0  
5  ...    0    0    0    0    1    0    0    0    0    0  
6  ...    0    0    0    0    0    0    1    0    0    0  

[6 rows x 39 columns]


As the whole dataset contains the error values for Dew Point, Sea Level Pressure, Maximum Sustained Wind Speed and Gust; these must be removed.

In [140]:
#Clean the data
dnn = df.loc[df["year"] != 2012]
dnn = dnn.loc[dnn["year"] < 2020]
dnn = dnn.loc[dnn["dewp"] != 9999]
dnn = dnn.loc[dnn["slp"] != 9999]
dnn = dnn.loc[dnn["mxpsd"] != 999.9]
dnn = dnn.loc[dnn["gust"] != 999.9]
#Move the target to the end
cols = dnn['NUM_COLLISIONS']
dnn = dnn.drop(columns=['NUM_COLLISIONS'])
dnn.insert(loc=38, column='NUM_COLLISIONS', value=cols)
print(dnn[:6])
dnn.describe()

    year  da  temp  dewp     slp  visib  wdsp  mxpsd  gust   max  ...  Oct  \
74  2016  17  40.2  32.3  1007.3    9.2   7.7   12.0  18.1  51.1  ...    0   
76  2014   9  23.5   8.3  1034.2   10.0   7.9   12.0  20.0  28.9  ...    0   
79  2019  19  34.5  29.7  1022.0    9.8   6.9   13.0  21.0  39.9  ...    0   
80  2015  11  27.1  12.1  1035.5   10.0   8.8   13.0  17.1  37.0  ...    0   
83  2015  29  29.2  20.9  1022.9   10.0   8.5   13.0  20.0  36.0  ...    0   
85  2019  13  26.0  12.8  1030.5   10.0   8.0   13.0  15.9  30.9  ...    0   

    Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
74    0    0    0    1    0    0    0    0             451  
76    0    0    0    0    0    0    0    1             561  
79    0    1    0    0    0    0    0    0             479  
80    0    0    0    1    0    0    0    0             341  
83    0    0    0    0    0    0    0    1             519  
85    0    0    0    1    0    0    0    0             374  

[6 rows x 39 columns]


Unnamed: 0,year,da,temp,dewp,slp,visib,wdsp,mxpsd,gust,max,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,...,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0
mean,2015.91283,15.702885,47.909638,45.903254,1015.632904,8.225599,12.602087,20.060896,27.511602,55.73407,...,0.087784,0.071209,0.143646,0.139963,0.141191,0.139963,0.151627,0.138122,0.145488,596.513198
std,2.01341,8.667634,13.746339,247.35284,8.134237,2.227285,3.986056,5.294117,7.36677,13.52726,...,0.283067,0.257253,0.350839,0.347055,0.348325,0.347055,0.358769,0.345133,0.3527,104.47966
min,2013.0,1.0,5.8,-6.7,989.5,0.6,4.5,8.9,14.0,18.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,38.1,28.2,1010.6,7.0,10.0,15.9,22.0,46.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,526.0
50%,2016.0,16.0,47.0,40.2,1015.4,9.3,12.0,19.0,26.0,55.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,597.0
75%,2018.0,23.0,58.8,52.8,1021.1,10.0,14.4,22.9,31.1,66.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,663.0
max,2019.0,31.0,77.5,9999.9,1039.1,10.0,39.3,49.0,71.1,87.1,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [141]:
# Shuffle the data
shuffle = dnn.iloc[np.random.permutation(len(dnn))]
predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da  temp  dewp     slp  visib  wdsp  mxpsd  gust   max  ...  Nov  \
573   2016  25  51.0  47.1   995.4    4.2  20.9   29.9  38.1  57.2  ...    0   
878   2018  13  35.0  33.4   995.9    3.4  31.0   44.1  58.1  46.0  ...    0   
870   2018  14  34.8  29.0   999.1    9.7  22.8   31.1  40.0  39.0  ...    0   
2716  2019  29  66.9  62.0  1021.4    8.7  11.4   20.0  26.0  72.0  ...    0   
782   2014  28  41.2  35.6  1023.4    8.9  13.7   20.0  27.0  48.0  ...    0   
1086  2013  11  46.2  41.0  1013.7    6.6  10.0   19.0  26.0  52.0  ...    0   

      Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
573     0    0    0    0    0    0    0    0    1  
878     0    0    0    1    0    0    0    0    0  
870     0    0    0    0    0    0    0    1    0  
2716    0    1    0    0    1    0    0    0    0  
782     0    0    0    0    0    0    1    0    0  
1086    0    0    0    0    0    0    0    0    1  

[6 rows x 38 columns]


In [142]:
#select the target as the last col
targets = shuffle.iloc[:,-1]
print(targets[:6])

573     663
878     578
870     652
2716    488
782     530
1086    517
Name: NUM_COLLISIONS, dtype: int64


In [143]:
# Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
# Calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)


38


In [144]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model', ignore_errors=True)

#Setup the model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[17,9,5,3], optimizer=tf.train.AdamOptimizer(learning_rate=0.001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

# Train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS
#Test the model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2186e65550>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model/model.ckpt.
INFO:tensorflow:loss = 54331.492, step = 1
INFO:tensorflow:global_step/sec: 368.083
INFO:tensorflow:loss = 584.1095, step = 101 (0.274 sec)
INFO:tensorflow:global_step/sec: 542.85
INFO:tensorflow:loss = 72.88313, step = 201 (0.188 sec)
INFO:tensorflow:global_step/sec: 509.544
INFO:tensorflow:loss = 14.796068, step = 301 (0.194 sec)
INFO:tensorflow:global_step/sec: 491.36
INFO:tensorflow:loss = 21.750216, step = 401 (0.206 sec)
INFO:tensorflow:global_step/sec: 552.56
INFO:tensorflow:loss = 12.787994, step = 501 (0.182 sec)
INFO:tensorflow:global_step/sec: 450.561
INFO:tensorflow:loss = 0.4161414, step = 601 (0.220 sec)
INFO:tensorflow:global_step/sec: 514.989
INFO:tensorflow:loss = 2.3250408, step = 701 (0.197 sec)
INFO:tensorflow:glob

DNNRegression has RMSE of 108.55842237291203
Just using average = 595.7129700690714 has RMSE of 108.48606071336582


In [145]:
print(predictors[trainsize:].values)

#Ensure the hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[17,9,5,3], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180990cd0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}


[[2.018e+03 2.300e+01 4.640e+01 ... 0.000e+00 0.000e+00 0.000e+00]
 [2.017e+03 1.000e+00 3.870e+01 ... 0.000e+00 1.000e+00 0.000e+00]
 [2.019e+03 2.600e+01 5.520e+01 ... 0.000e+00 0.000e+00 0.000e+00]
 ...
 [2.018e+03 1.900e+01 2.980e+01 ... 0.000e+00 1.000e+00 0.000e+00]
 [2.018e+03 3.000e+01 3.840e+01 ... 1.000e+00 0.000e+00 0.000e+00]
 [2.015e+03 1.300e+01 6.820e+01 ... 0.000e+00 0.000e+00 0.000e+00]]


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052  0.5135052  0.5135052  0.5135052  0.5135052  0.5135052
 0.5135052

The low RMSE and the low percentage error rate suggest that using the whole cleaned dataset is an accurate predictor of the number of collisions.

## Location
As identified in assignment 1, as the location tends towards the centre of New York there is stronger linear relationships. This suggests there is a link between the number of collisions, location and the observed weather conditions. A DNN will be trained to attempt to predict the number of collisions given the location, day and weather conditions.

In [146]:
#Read the data and extract from the zip
#Reference - (geeksforgeeks.org 2021)
df_loc = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/locdnn.zip', index_col=0,compression='zip' )
print(df_loc[:6])

   year  da  NUM_COLLISIONS   latitude  longitude  temp  dewp     slp  visib  \
1  2018   2               1  40.681750 -73.967480  14.7   2.0  1024.9   10.0   
2  2018   2               1  40.645370 -73.945110  14.7   2.0  1024.9   10.0   
3  2018   2               1  40.614830 -73.998380  14.7   2.0  1024.9   10.0   
4  2018   2               1  40.592190 -74.087395  14.7   2.0  1024.9   10.0   
5  2018   2               1  40.769817 -73.782370  14.7   2.0  1024.9   10.0   
6  2018   2               1  40.660175 -73.928200  14.7   2.0  1024.9   10.0   

   wdsp  ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  12.9  ...    0    0    0    0    1    0    0    0    0    0  
2  12.9  ...    0    0    0    0    1    0    0    0    0    0  
3  12.9  ...    0    0    0    0    1    0    0    0    0    0  
4  12.9  ...    0    0    0    0    1    0    0    0    0    0  
5  12.9  ...    0    0    0    0    1    0    0    0    0    0  
6  12.9  ...    0    0    0    0    1    0    0  

In [147]:
#Remove unrequired cols
df_loc_dnn = df_loc.drop(columns=['thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
#Clean data
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["year"] != 2012]
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["year"] < 2020]
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["dewp"] != 9999]
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["slp"] != 9999]
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["mxpsd"] != 999.9]
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["gust"] != 999.9]
#Move the target to the end
cols = df_loc_dnn['NUM_COLLISIONS']
df_loc_dnn = df_loc_dnn.drop(columns=['NUM_COLLISIONS'])
df_loc_dnn.insert(loc=34, column='NUM_COLLISIONS', value=cols)
print(df_loc_dnn[:6])
df_loc_dnn.describe()

   year  da   latitude  longitude  temp  dewp     slp  visib  wdsp  mxpsd  \
1  2018   2  40.681750 -73.967480  14.7   2.0  1024.9   10.0  12.9   20.0   
2  2018   2  40.645370 -73.945110  14.7   2.0  1024.9   10.0  12.9   20.0   
3  2018   2  40.614830 -73.998380  14.7   2.0  1024.9   10.0  12.9   20.0   
4  2018   2  40.592190 -74.087395  14.7   2.0  1024.9   10.0  12.9   20.0   
5  2018   2  40.769817 -73.782370  14.7   2.0  1024.9   10.0  12.9   20.0   
6  2018   2  40.660175 -73.928200  14.7   2.0  1024.9   10.0  12.9   20.0   

   ...  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
1  ...    0    0    0    1    0    0    0    0    0               1  
2  ...    0    0    0    1    0    0    0    0    0               1  
3  ...    0    0    0    1    0    0    0    0    0               1  
4  ...    0    0    0    1    0    0    0    0    0               1  
5  ...    0    0    0    1    0    0    0    0    0               1  
6  ...    0    0    0    1    0    0    

Unnamed: 0,year,da,latitude,longitude,temp,dewp,slp,visib,wdsp,mxpsd,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,...,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0
mean,2016.070154,15.638823,40.723907,-73.920916,48.419836,47.2749,1015.556656,8.199143,12.598551,20.021545,...,0.093866,0.074335,0.133032,0.146763,0.114698,0.142687,0.170187,0.141897,0.150735,1.02709
std,1.991298,8.613159,0.078454,0.086634,13.750834,261.636367,8.13658,2.230079,3.921832,5.219745,...,0.291643,0.262315,0.339609,0.35387,0.318657,0.349754,0.375797,0.348945,0.357791,0.180994
min,2013.0,1.0,40.498949,-74.253006,5.8,-6.7,989.5,0.6,4.5,8.9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,2014.0,8.0,40.66886,-73.976715,38.4,28.8,1010.7,6.9,10.0,15.9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,2016.0,16.0,40.72247,-73.92921,47.8,41.3,1015.3,9.3,12.0,19.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2018.0,23.0,40.768165,-73.86665,59.5,53.4,1021.0,10.0,14.4,22.9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,2019.0,31.0,40.912884,-73.66301,77.5,9999.9,1039.1,10.0,39.3,49.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0


In [148]:
#Set the scale to the maximum value
SCALE_LOC=11
# Shuffle data
shuffle = df_loc_dnn.iloc[np.random.permutation(len(df_loc_dnn))]

predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

         year  da   latitude  longitude  temp  dewp     slp  visib  wdsp  \
1428318  2015  28  40.587210 -73.923860  21.6  14.7  1009.6    6.6  16.6   
661367   2016   3  40.761223 -73.923800  58.8  56.2  1014.6    5.1   8.4   
928834   2015   8  40.761486 -73.960599  69.7  61.1  1013.8   10.0  11.8   
312058   2013  17  40.736005 -73.993617  63.2  59.2  1011.3    7.0   9.1   
1551367  2016  18  40.789950 -73.942927  36.0  23.1  1023.3   10.0  12.5   
1482098  2013  16  40.764736 -73.988033  32.0  15.5  1012.3   10.0  12.5   

         mxpsd  ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1428318   29.9  ...    0    0    0    0    0    0    0    0    1    0  
661367    15.0  ...    1    0    0    0    0    0    0    0    0    1  
928834    15.0  ...    0    0    0    1    0    0    0    0    0    0  
312058    14.0  ...    0    0    0    0    0    0    1    0    0    0  
1551367   18.1  ...    0    0    0    0    0    0    0    0    0    1  
1482098   17.1  ...    0    0    0 

In [149]:
# Select last col as target
targets = shuffle.iloc[:,-1]
print(targets[:6])

1428318    1
661367     1
928834     1
312058     1
1551367    1
1482098    1
Name: NUM_COLLISIONS, dtype: int64


In [150]:
# Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
# Calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

34


In [151]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_loc', ignore_errors=True)

#Setup the model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_loc', hidden_units=[19,15,11,7], optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

# Train the model.
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_LOC, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_LOC
#Test Model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180f91550>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_loc', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_loc/model.ckpt.
INFO:tensorflow:loss = 1954.7893, step = 1
INFO:tensorflow:global_step/sec: 349.222
INFO:tensorflow:loss = 0.38380194, step = 101 (0.291 sec)
INFO:tensorflow:global_step/sec: 478.001
INFO:tensorflow:loss = 0.2817147, step = 201 (0.208 sec)
INFO:tensorflow:global_step/sec: 425.272
INFO:tensorflow:loss = 0.19359416, step = 301 (0.238 sec)
INFO:tensorflow:global_step/sec: 454.515
INFO:tensorflow:loss = 0.12018517, step = 401 (0.218 sec)
INFO:tensorflow:global_step/sec: 478.549
INFO:tensorflow:loss = 0.068043634, step = 501 (0.212 sec)
INFO:tensorflow:global_step/sec: 472.06
INFO:tensorflow:loss = 0.036159124, step = 601 (0.207 sec)
INFO:tensorflow:global_step/sec: 435.458
INFO:tensorflow:loss = 0.018172484, step = 701 (0.231 sec)
INF

DNNRegression has RMSE of 0.17941017063178782
Just using average = 1.0270641352408854 has RMSE of 0.17934385792799779


In [152]:
print(predictors[trainsize:].values)
#Ensure the number of hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_loc', hidden_units=[19,15,11,7], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_LOC
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180b7e290>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_loc', '_session_creation_timeout_secs': 7200}


[[2.01700000e+03 2.80000000e+01 4.07359900e+01 ... 1.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [2.01300000e+03 2.90000000e+01 4.06181866e+01 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]
 [2.01900000e+03 1.60000000e+01 4.07223320e+01 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 ...
 [2.01300000e+03 2.00000000e+01 4.05779450e+01 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [2.01800000e+03 2.80000000e+01 4.06322940e+01 ... 1.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [2.01900000e+03 1.10000000e+01 4.05783580e+01 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]]


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_loc/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.09293779 0.09293779 0.09293779 ... 0.09293779 0.09293779 0.09293779]
[0.00086133 0.00086133 0.00086133 ... 0.00086133 0.00086133 0.00086133]
The trained model has an aproximate error rate of 0.5149880768946036 which equates to 50%


Despite extensive training, the model is not accurately able to predict the number of collisions. This is inferred by the RMSE which is close to that of the mean RMSE. 

#Conclusion
As shown above the given weather conditions for a particular day within New York, the number of collisions can be accurately predicted.

Arguments can be made for both models; the linear regression models appear to be less efficient due to the higher RMSE values in comparison to the DNN models. In contrast the error rate is lower for the linear regression models. This suggests that although the DNN models produce more errors, but the margin of error is lower, due to the RMSE placing a larger weighting on larger errors.

In hindsight the way location has been encoded does not allow for accurate predictions to be made as the number of collisions always tends towards 1 for a given location.

It is clear that the models produced above can accurately predict the number of collisions as set out in the specification of the assignment.


#References
geeksforgeeks.org (2021) Read a zipped file as a Pandas DataFrame [online]. Available from <<https://www.geeksforgeeks.org/read-a-zipped-file-as-a-pandas-dataframe/>? [12 November 2022] 

IBM (n.d.) What is linear regression? [online]. Available from <<https://www.ibm.com/uk-en/topics/linear-regression>> [17 November 2022] 

Karhunen, J., Raiko, T. and Cho, K. (2015) 'Chapter 7 - Unsupervised deep learning: A short review.' In Advances in Independent Component Analysis and Learning Machines. Academic Press. Ch. 7. 135-142.

Zhang, Z. (2019) Understand Data Normalization in Machine Learning [online]. Available from <<https://towardsdatascience.com/understand-data-normalization-in-machine-learning-8ff3062101f0>> [17 November 2022] 

Zulkifli, H. (2018) Understanding Learning Rates and How It Improves Performance in Deep Learning [online]. Available from <<https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10>> [17 November 2022] 

