<a href="https://colab.research.google.com/github/matthew110395/12004210_DataAnalytics/blob/main/12004210_DAOTW_Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction
This assignment builds on the New York taxi problem identified in assignment 1, where the City of New York is looking for a way to accurately predict the number of collisions on a particular day of the week. Within this document two different types of machine learning models will be utilised to predict the number of collisions. The linear relationships identified in assignment 1 will be used to create linear regression models. Deep Learning Neural Network (DNN) models will be used to predict the number of collisions where the relationship is not linear. 

#Imports
To prepare data for machine learning the pandas package has been used, alongside the numpy package which has been used to aid with mathematical functions.

As within part 1 of this assignment, the data file containing location data exceeds the size limit for hosting within github. To overcome this the file was zipped. To extract the data the zipfile package has been used.

Within this document, TensorFlow is used for machine learning, with both linear regression models and a Deep Neural Network models. TensorFlow version 1 is unsupported within Google Colab, therefore must be installed using a package manager.

Shutil is also imported to allow for file management, in particular the removal of saved models.

In [77]:
#Import Packages
import pandas as pd
import numpy as np
import zipfile
!pip install tensorflow==1.15.2
import tensorflow as tf
import shutil  

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#Linear Regressor
Throughout assignment 1 a number of linear relationships were uncovered within the dataset. These relationships form the basis of the linear regression models below.

A linear regressor is used to predict an output variable based on one or more input variables (IBM n.d.).

To improve the accuracy of the model the target values are scaled. This reduces the range of collisions from 188-1161 to 0.1619... - 1 which allows for quicker training  (Zhang 2019).

In [78]:
#Scale to maximum number of collisions
SCALE_COLLISIONS=1161

##Precipitation
As uncovered in assignment 1; as the volume of precipitation increases, the number of collisions increase. 

The datafile produced in assignment is imported.

In [79]:
#Read File
df_prcp = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/prcp_clean.csv', index_col=0, )
print(df_prcp[:6])

   day  year  mo  da collision_date  NUM_COLLISIONS  temp  dewp     slp  \
1    5  2020   1  24     2020-01-24             524  37.3  33.7  1028.5   
2    2  2021   1  12     2021-01-12             278  37.0  29.1  1019.0   
3    5  2021   1  22     2021-01-22             254  36.5  28.4  1003.1   
4    3  2021   1  27     2021-01-27             262  34.6  33.8  1012.8   
5    2  2021   1  26     2021-01-26             263  31.9  23.4  1016.9   
6    1  2022   1  24     2022-01-24             237  34.5  23.8  1010.6   

   visib  ...   max   min  prcp   sndp  fog  rain_drizzle  snow_ice_pellets  \
1    6.5  ...  46.0  19.9  0.00  999.9    1             0                 0   
2   10.0  ...  44.1  21.0  0.00  999.9    0             0                 0   
3   10.0  ...  44.1  19.9  0.00  999.9    0             0                 0   
4    8.0  ...  41.0  28.9  0.25  999.9    1             1                 0   
5    9.0  ...  37.9  21.0  0.00  999.9    1             0                 0   


In order to create the linear regression model, extra columns are removed to simplify the model with the aim of reducing error values.

The incomplete years (2012 and 2022) are removed, along with the erroneous data for 2020 and 2021.

To aid with the production of the model the target is moved to the end of the data table.

In [80]:
#Remove Cols not Required 
df_prcp = df_prcp.drop(columns=['collision_date', 'temp', 'dewp', 'slp','visib','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud'])
df_prcp = df_prcp.loc[df_prcp["year"] != 2012]
df_prcp = df_prcp.loc[df_prcp["year"] < 2020]
cols = df_prcp['NUM_COLLISIONS']
df_prcp = df_prcp.drop(columns=['NUM_COLLISIONS'])
#Move NUM_COLLISIONS to end
df_prcp.insert(loc=9, column='NUM_COLLISIONS', value=cols)
print(df_prcp[:6])
df_prcp.describe()

    day  year  mo  da  prcp  fog  rain_drizzle  snow_ice_pellets  hail  \
49    4  2016   1  28  0.09    0             0                 0     0   
51    5  2014   1  17  0.00    1             0                 0     0   
54    1  2016   1  25  0.02    0             0                 0     0   
55    5  2016   1  29  0.00    0             0                 0     0   
58    5  2017   1  20  0.00    0             0                 0     0   
59    7  2013   1  13  0.01    1             0                 0     0   

    NUM_COLLISIONS  
49             681  
51             589  
54             658  
55             645  
58             605  
59             373  


Unnamed: 0,day,year,mo,da,prcp,fog,rain_drizzle,snow_ice_pellets,hail,NUM_COLLISIONS
count,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0
mean,3.998425,2015.989366,6.518708,15.745569,0.122588,0.253249,0.375345,0.085467,0.000394,599.135093
std,2.003542,1.996126,3.455211,8.803199,0.329143,0.434958,0.484307,0.27963,0.019846,100.299164
min,1.0,2013.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2.0,2014.0,4.0,8.0,0.0,0.0,0.0,0.0,0.0,531.0
50%,4.0,2016.0,7.0,16.0,0.0,0.0,0.0,0.0,0.0,602.0
75%,6.0,2018.0,10.0,23.0,0.06,1.0,1.0,0.0,0.0,665.0
max,7.0,2019.0,12.0,31.0,3.76,1.0,1.0,1.0,1.0,1161.0


To remove any bias within the dataset, it is randomly shuffled. The data is then split into the predictors and the target.

In [81]:
# Shuffle the data
shuffle = df_prcp.iloc[np.random.permutation(len(df_prcp))]

# Select all apart from last col
predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])
print(shuffle[:6])

      day  year  mo  da  prcp  fog  rain_drizzle  snow_ice_pellets  hail
3492    4  2015  12  31  0.00    0             1                 0     0
1307    5  2013   5  10  0.00    1             0                 0     0
506     2  2016   2   9  0.06    0             0                 1     0
1716    3  2019   6   5  0.00    0             1                 0     0
486     5  2017   2  17  0.00    0             0                 0     0
2648    5  2018   9  28  0.10    0             1                 0     0
      day  year  mo  da  prcp  fog  rain_drizzle  snow_ice_pellets  hail  \
3492    4  2015  12  31  0.00    0             1                 0     0   
1307    5  2013   5  10  0.00    1             0                 0     0   
506     2  2016   2   9  0.06    0             0                 1     0   
1716    3  2019   6   5  0.00    0             1                 0     0   
486     5  2017   2  17  0.00    0             0                 0     0   
2648    5  2018   9  28  0.10    

In [82]:
# Select Target (last col)
targets = shuffle.iloc[:,-1]

print(targets[:6])

3492    527
1307    698
506     571
1716    655
486     680
2648    712
Name: NUM_COLLISIONS, dtype: int64


In [83]:
#Split to test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
print(trainsize)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize
print(testsize)
nppredictors = 9
noutputs = 1

2031
508


In [84]:
# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model_prcp', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_prcp', optimizer=tf.train.AdamOptimizer(learning_rate=0.00001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))
print("starting to train");
#Train Model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)
preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS

pred = format(str(predslistscale))
#Test Model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));

avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2184c36790>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/linear_regression_trained_model_prcp/model.ckpt.
INFO:tensorflow:loss = 0.28435007, step = 1
INFO:tensorflow:global_step/sec: 610.428
INFO:tensorflow:loss = 0.006029467, step = 101 (0.170 sec)
INFO:tensorflow:global_step/sec: 830.111
INFO:tensorflow:loss = 0.006208708, step = 201 (0.120 sec)
INFO:tensorflow:global_step/sec: 763.028
INFO:tensorflow:loss = 0.00811196, step = 301 (0.129 sec)
INFO:tensorflow:global_step/sec: 739.671
INFO:tensorflow:loss = 0.006580451, step = 401 (0.134 sec)
INFO:tensorflow:global_step/sec: 742.39
INFO:tensorflow:loss = 0.0064119017, step = 501 (0.139 sec)
INFO:tensorflow:global_step/sec: 483.678
INFO:tensorflow:loss = 0.006235397, step = 601 (0.207 sec)
INFO:tensorflow:global_step/sec: 537.563
INFO:tensorflow:loss = 0.0064507676, step = 701 (0.1

LinearRegression has RMSE of 102.05434614543951
Just using average = 600.2299359921221 has RMSE of 105.566287878724


A number of learning rates were used to determine a suitable learning rate for the model. As the learning rate decreases the overall time to train the dataset increases (Zulkifli 2018).

In [85]:
#Check Error Rate
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_prcp', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21845a0e50>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_prc

508
508
[0.5122645  0.47875375 0.52287126 0.5635435  0.46389306 0.55077493
 0.56979203 0.47948697 0.54274005 0.5035545  0.5411757  0.54123
 0.5133076  0.48793462 0.55686814 0.48470786 0.53721666 0.50310105
 0.47432306 0.5792667  0.54587805 0.4638464  0.5302459  0.4894051
 0.55784094 0.5486665  0.50999916 0.46061707 0.5070047  0.5136187
 0.53376424 0.53320867 0.51746637 0.5568299  0.51152563 0.5712332
 0.5349731  0.49143696 0.4980978  0.53075117 0.50611013 0.5668902
 0.5099612  0.53449446 0.5607861  0.5116442  0.54498    0.5537348
 0.54023683 0.5213465  0.55704856 0.49917585 0.52356577 0.55058366
 0.489171   0.5526224  0.5368833  0.48330578 0.5391398  0.48670578
 0.51017135 0.54889625 0.5786473  0.56032413 0.50977975 0.51447284
 0.48049292 0.48267332 0.5532888  0.5355576  0.46934167 0.4938137
 0.54982024 0.4537593  0.48576254 0.5497053  0.5488554  0.54657066
 0.59345555 0.49571893 0.4655688  0.504705   0.49902675 0.5296054
 0.48964164 0.54712546 0.50650275 0.46399084 0.5053166  0.550896

In [86]:
input = pd.DataFrame.from_dict(data = 
				{
         'day' : [1,1,1,10],
         'year' : [2019,2019,2019,2020],
         'mo' : [3,3,3,12],
         'da' : [10,10,10,12],
         'prcp' : [0,2.34,5.5,2.24],
         'fog' : [0,0,1,1],
         'rain_drizzle' : [0,1,1,1],
         'snow_ice_pellets' : [0,0,0,0],
         'hail' : [0,0,0,0]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_prcp', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180c349d0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_prc

[631.5835  628.92267 644.65247 542.87445]


Two main tests have been applied to the model. The Route Mean Squared Error (RMSE) and a comparison between the target values in the testing dataset and the predicted values using the predictors in the testing dataset.

Predominantly the RMSE of the model is lower than that of the average. This indicates that the model makes more accurate predictions compared to the average.

Based on the relationship, the test data outputs results as expected.

##Dew Point (dewp)
A relationship between dew point and the number of collisions was also uncovered in assignment 1. This linear relationship suggests that as the dew point increases the number of collisions increase. 

The process to produce the model follows the same process as the precipitation model.

In [87]:
#Read Data
df_dewp = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/dewp_clean.csv', index_col=0, )
print(df_dewp[:6])

   day  year  mo  da collision_date  NUM_COLLISIONS  temp  dewp     slp  \
1    5  2020   1  24     2020-01-24             524  37.3  33.7  1028.5   
2    2  2021   1  12     2021-01-12             278  37.0  29.1  1019.0   
3    5  2021   1  22     2021-01-22             254  36.5  28.4  1003.1   
4    3  2021   1  27     2021-01-27             262  34.6  33.8  1012.8   
5    2  2021   1  26     2021-01-26             263  31.9  23.4  1016.9   
6    1  2022   1  24     2022-01-24             237  34.5  23.8  1010.6   

   visib  ...   max   min  prcp   sndp  fog  rain_drizzle  snow_ice_pellets  \
1    6.5  ...  46.0  19.9  0.00  999.9    1             0                 0   
2   10.0  ...  44.1  21.0  0.00  999.9    0             0                 0   
3   10.0  ...  44.1  19.9  0.00  999.9    0             0                 0   
4    8.0  ...  41.0  28.9  0.25  999.9    1             1                 0   
5    9.0  ...  37.9  21.0  0.00  999.9    1             0                 0   


In [88]:
#Remove Cols not Required 
df_dewp = df_dewp.drop(columns=['collision_date', 'temp', 'prcp', 'slp','visib','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_dewp = df_dewp.loc[df_dewp["year"] != 2012]
df_dewp = df_dewp.loc[df_dewp["year"] < 2020]
cols = df_dewp['NUM_COLLISIONS']
df_dewp = df_dewp.drop(columns=['NUM_COLLISIONS'])
#Move target to end
df_dewp.insert(loc=5, column='NUM_COLLISIONS', value=cols)
print(df_dewp[:6])
df_dewp.describe()

    day  year  mo  da  dewp  NUM_COLLISIONS
49    4  2016   1  28  24.4             681
51    5  2014   1  17  35.8             589
54    1  2016   1  25  21.2             658
55    5  2016   1  29  36.8             645
58    5  2017   1  20  32.5             605
59    7  2013   1  13  44.9             373


Unnamed: 0,day,year,mo,da,dewp,NUM_COLLISIONS
count,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0
mean,3.998434,2015.999217,6.52407,15.723679,44.16317,599.10998
std,2.000391,2.0,3.449676,8.801271,16.995303,100.277185
min,1.0,2013.0,1.0,1.0,-6.7,188.0
25%,2.0,2014.0,4.0,8.0,32.15,531.0
50%,4.0,2016.0,7.0,16.0,45.3,602.0
75%,6.0,2018.0,10.0,23.0,58.5,665.0
max,7.0,2019.0,12.0,31.0,74.1,1161.0


In [89]:
#Shuffle Data
shuffle = df_dewp.iloc[np.random.permutation(len(df_dewp))]
#Select Predictors
predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])
print(shuffle[:6])

      day  year  mo  da  dewp
1587    2  2019   6  18  58.8
2698    3  2016   9   7  66.8
1724    5  2017   6  23  63.4
1383    1  2018   5   7  49.7
1877    5  2019   7  26  58.9
2879    2  2013  10   1  51.8
      day  year  mo  da  dewp  NUM_COLLISIONS
1587    2  2019   6  18  58.8             721
2698    3  2016   9   7  66.8             648
1724    5  2017   6  23  63.4             793
1383    1  2018   5   7  49.7             695
1877    5  2019   7  26  58.9             650
2879    2  2013  10   1  51.8             616


In [90]:
#Select last col as target
targets = shuffle.iloc[:,-1]

print(targets[:6])

1587    721
2698    648
1724    793
1383    695
1877    650
2879    616
Name: NUM_COLLISIONS, dtype: int64


In [91]:
#Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
print(trainsize)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize
print(testsize)
nppredictors = 5
noutputs = 1

2044
511


In [92]:
# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model_dewp', ignore_errors=True)

#Setup Model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_dewp', optimizer=tf.train.AdamOptimizer(learning_rate=0.00001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))
print("starting to train");
#Train Model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)
preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS

#test Model
pred = format(str(predslistscale))
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));

avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2184825110>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_dewp', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/linear_regression_trained_model_dewp/model.ckpt.
INFO:tensorflow:loss = 0.27910894, step = 1
INFO:tensorflow:global_step/sec: 585.842
INFO:tensorflow:loss = 0.0071757007, step = 101 (0.174 sec)
INFO:tensorflow:global_step/sec: 865.49
INFO:tensorflow:loss = 0.009145996, step = 201 (0.116 sec)
INFO:tensorflow:global_step/sec: 616.143
INFO:tensorflow:loss = 0.006120279, step = 301 (0.162 sec)
INFO:tensorflow:global_step/sec: 509.497
INFO:tensorflow:loss = 0.0068645384, step = 401 (0.196 sec)
INFO:tensorflow:global_step/sec: 536.93
INFO:tensorflow:loss = 0.0065441383, step = 501 (0.186 sec)
INFO:tensorflow:global_step/sec: 718.277
INFO:tensorflow:loss = 0.005712265, step = 601 (0.140 sec)
INFO:tensorflow:global_step/sec: 756.416
INFO:tensorflow:loss = 0.0065529943, step = 701 (0

LinearRegression has RMSE of 93.96416755097607
Just using average = 598.9432485322897 has RMSE of 102.3375683663379


In [93]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_dewp', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2193f8d590>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_dewp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_dew

511
511
[0.54108864 0.5150208  0.51542944 0.51455796 0.47583237 0.5577232
 0.56107754 0.52491874 0.5058224  0.5312116  0.57578826 0.4976822
 0.5146251  0.47125548 0.54165584 0.5187817  0.54353875 0.48916832
 0.5086365  0.51474327 0.4938765  0.47406805 0.54567677 0.57583743
 0.53132784 0.5793128  0.5340886  0.5401705  0.53567487 0.56672806
 0.507975   0.5596787  0.53980374 0.52301604 0.47560015 0.49049875
 0.52553046 0.48754847 0.5383884  0.49556622 0.5098874  0.529255
 0.56506574 0.49334618 0.5572748  0.5790375  0.55949205 0.4743112
 0.5073691  0.53109753 0.46992546 0.5120298  0.5085686  0.57224846
 0.5523468  0.50943124 0.53462005 0.48818082 0.51841134 0.5310553
 0.5350835  0.580653   0.5600857  0.5329062  0.5573527  0.5525702
 0.5150917  0.518853   0.49951407 0.5492718  0.5072489  0.548728
 0.51278436 0.48733842 0.48348555 0.5607031  0.46873987 0.5772446
 0.54728687 0.5182029  0.47003058 0.48797196 0.58404195 0.5136731
 0.5490102  0.44753367 0.5624638  0.5363252  0.5077698  0.5223185

In [94]:
input = pd.DataFrame.from_dict(data = 
				{
         'day' : [1,1,1,10],
         'year' : [2019,2019,2019,2020],
         'mo' : [3,3,3,12],
         'da' : [10,10,10,12],
         'dewp' : [0,2.34,5.5,2.24],
         
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_dewp', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f218474e550>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_dewp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_dew

[596.0713  598.4295  601.614   503.96735]


The RMSE for the dewp model is slightly lower in comparison to the model produced for precipitation. As the RMSE is lower than the RMSE of the mean, it shows that the model has a higher level of accuracy in comparison to the using the mean.

Based on the relationship, the test data outputs results as expected.

##Visibility (visib)
A relationship was also uncovered between visibility and the number of collisions. This is a negative linear relationship where the visibility increases the number of collisions decrease. 

The process to produce the model follows the same process as above.

In [95]:
#Read Data
df_visib = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/coldata.csv', index_col=0, )
print(df_visib[:6])

   day  year  mo  da collision_date  NUM_COLLISIONS  temp  dewp     slp  \
1    5  2020   1  24     2020-01-24             524  37.3  33.7  1028.5   
2    2  2021   1  12     2021-01-12             278  37.0  29.1  1019.0   
3    5  2021   1  22     2021-01-22             254  36.5  28.4  1003.1   
4    3  2021   1  27     2021-01-27             262  34.6  33.8  1012.8   
5    2  2021   1  26     2021-01-26             263  31.9  23.4  1016.9   
6    1  2022   1  24     2022-01-24             237  34.5  23.8  1010.6   

   visib  ...   max   min  prcp   sndp  fog  rain_drizzle  snow_ice_pellets  \
1    6.5  ...  46.0  19.9  0.00  999.9    1             0                 0   
2   10.0  ...  44.1  21.0  0.00  999.9    0             0                 0   
3   10.0  ...  44.1  19.9  0.00  999.9    0             0                 0   
4    8.0  ...  41.0  28.9  0.25  999.9    1             1                 0   
5    9.0  ...  37.9  21.0  0.00  999.9    1             0                 0   


In [96]:
#Remove Cols not Required 
df_visib = df_visib.drop(columns=['collision_date', 'temp', 'prcp', 'slp','dewp','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_visib = df_visib.loc[df_visib["year"] != 2012]
df_visib = df_visib.loc[df_visib["year"] < 2020]
cols = df_visib['NUM_COLLISIONS']
df_visib = df_visib.drop(columns=['NUM_COLLISIONS'])
#Move target to end
df_visib.insert(loc=5, column='NUM_COLLISIONS', value=cols)
print(df_visib[:6])
df_visib.describe()

    day  year  mo  da  visib  NUM_COLLISIONS
49    4  2016   1  28   10.0             681
51    5  2014   1  17    6.7             589
54    1  2016   1  25   10.0             658
55    5  2016   1  29   10.0             645
58    5  2017   1  20   10.0             605
59    7  2013   1  13    4.3             373


Unnamed: 0,day,year,mo,da,visib,NUM_COLLISIONS
count,2556.0,2556.0,2556.0,2556.0,2556.0,2556.0
mean,3.999218,2016.0,6.524257,15.725743,8.295618,599.118936
std,2.000391,2.0,3.449013,8.800168,2.20787,100.258581
min,1.0,2013.0,1.0,1.0,0.2,188.0
25%,2.0,2014.0,4.0,8.0,7.1,531.0
50%,4.0,2016.0,7.0,16.0,9.4,602.0
75%,6.0,2018.0,10.0,23.0,10.0,665.0
max,7.0,2019.0,12.0,31.0,10.0,1161.0


In [97]:
#Shuffle the data
shuffle = df_visib.iloc[np.random.permutation(len(df_visib))]
predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])
print(shuffle[:6])

      day  year  mo  da  visib
2825    4  2019  10  24   10.0
1856    1  2019   7   8   10.0
1021    7  2015   4  12   10.0
2935    3  2016  10   5   10.0
1685    6  2018   6  23    5.7
2187    2  2016   8   9   10.0
      day  year  mo  da  visib  NUM_COLLISIONS
2825    4  2019  10  24   10.0             613
1856    1  2019   7   8   10.0             592
1021    7  2015   4  12   10.0             443
2935    3  2016  10   5   10.0             695
1685    6  2018   6  23    5.7             513
2187    2  2016   8   9   10.0             650


In [98]:
# Select the last col as target
targets = shuffle.iloc[:,-1]

print(targets[:6])

2825    613
1856    592
1021    443
2935    695
1685    513
2187    650
Name: NUM_COLLISIONS, dtype: int64


In [99]:
#Split to test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
print(trainsize)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize
print(testsize)
nppredictors = 5
noutputs = 1

2044
512


In [100]:
# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model_visib', ignore_errors=True)

#Setup Model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_visib', optimizer=tf.train.AdamOptimizer(learning_rate=0.0001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))
print("starting to train");
#train Model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)
preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS
#Test Model
pred = format(str(predslistscale))
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));

avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21848e5050>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_visib', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/linear_regression_trained_model_visib/model.ckpt.
INFO:tensorflow:loss = 0.27459064, step = 1
INFO:tensorflow:global_step/sec: 657.885
INFO:tensorflow:loss = 0.009505197, step = 101 (0.157 sec)
INFO:tensorflow:global_step/sec: 849.991
INFO:tensorflow:loss = 0.008307781, step = 201 (0.119 sec)
INFO:tensorflow:global_step/sec: 826.181
INFO:tensorflow:loss = 0.0071506426, step = 301 (0.121 sec)
INFO:tensorflow:global_step/sec: 755.623
INFO:tensorflow:loss = 0.0070540262, step = 401 (0.132 sec)
INFO:tensorflow:global_step/sec: 754.632
INFO:tensorflow:loss = 0.007330095, step = 501 (0.132 sec)
INFO:tensorflow:global_step/sec: 797.108
INFO:tensorflow:loss = 0.0065802094, step = 601 (0.125 sec)
INFO:tensorflow:global_step/sec: 687.471
INFO:tensorflow:loss = 0.0069297077, step = 701

LinearRegression has RMSE of 93.30438690969395
Just using average = 600.527397260274 has RMSE of 97.94271362023213


In [101]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

#testd = pd.DataFrame.from_records(predictors[trainsize:].values,columns=['day','year','month','da','prcp','fog','rain','snow','hail'])
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_visib', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2184612150>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_visib', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_vi

512
512
[0.51203763 0.56929165 0.49971002 0.53724235 0.5331647  0.4596194
 0.5065841  0.5196762  0.4785367  0.50867176 0.5652723  0.4853061
 0.52714485 0.5508507  0.5110668  0.49564546 0.5065315  0.4916242
 0.51037836 0.5458444  0.5211129  0.48834977 0.5527546  0.50391304
 0.5037089  0.49517938 0.5110578  0.5189451  0.5335094  0.5112895
 0.4963031  0.5103712  0.5012339  0.53205377 0.5731611  0.55361545
 0.55337316 0.50122494 0.5149181  0.49802685 0.5079251  0.48917392
 0.51711273 0.5218819  0.4799218  0.5363916  0.49335063 0.5628964
 0.5526853  0.47750828 0.48871735 0.4885171  0.5009921  0.52025706
 0.49132308 0.5235904  0.5267433  0.50128156 0.5249217  0.4807761
 0.51871735 0.54851705 0.46648568 0.51754755 0.51672596 0.4948084
 0.4541079  0.55862695 0.50122607 0.5371028  0.4885851  0.563091
 0.45256427 0.5198303  0.51368785 0.50315446 0.52299356 0.55308145
 0.49334005 0.5486676  0.45334056 0.46320236 0.51783854 0.5497701
 0.5620634  0.5276797  0.5162332  0.5170213  0.4965934  0.535922

In [102]:
input = pd.DataFrame.from_dict(data = 
				{
         'day' : [1,1,1,10],
         'year' : [2019,2019,2019,2020],
         'mo' : [3,3,3,12],
         'da' : [10,10,10,12],
         'visib' : [1,5,9.5,5],
         
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_visib', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21848b7d90>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_visib', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model_vi

[648.7901  633.99225 617.3446  546.5106 ]


As the difference RMSE between the mean and the model is less, it is arguable that the model is not as accurate.

Although the RMSE value indicates a weaker model, the error rate is lower indicating that there is a significant error increasing the RMSE.

Based on the relationship, the test data outputs results as expected.

#Deep Learning Neural Network (DNN)
Although the primary outcome of assignment 1 was uncovering linear relationships, more complex relationships which cannot be predicted using a linear regressor were uncovered. In this case a Deep Learning Neural Network (DNN) can be used.

A Deep Learning Neural Network (DNN) is a form of unsupervised learning, where a number of hidden layers are used to uncover non-linear relationships. Karhunen, Raiko and Cho (2015) infer that deep learning neural networks work in a similar way to the human brain. This is where both the relationship between the input and output data is explored, as well as the relationship between the underlying data.

##Precipitation (prcp)
The process for training a DNN follows a similar process followed above for a Linear Regressor. The data cleansed and one hot encoded as part of assignment 1 is loaded from GitHub.

In [153]:
#Read Data
df = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/prcp_clean_dnn.csv', index_col=0, )
print(df[:6])

   year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd   gust  \
1  2020  24             524  37.3  33.7  1028.5    6.5   3.3    8.0  999.9   
2  2021  12             278  37.0  29.1  1019.0   10.0   6.5   12.0  999.9   
3  2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0   20.0   
4  2021  27             262  34.6  33.8  1012.8    8.0   7.8   12.0  999.9   
5  2021  26             263  31.9  23.4  1016.9    9.0   7.4   12.0  999.9   
6  2022  24             237  34.5  23.8  1010.6    9.7   7.4   12.0  999.9   

   ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  ...    0    0    0    0    0    0    0    1    0    0  
2  ...    0    0    0    0    1    0    0    0    0    0  
3  ...    0    0    0    0    0    0    0    1    0    0  
4  ...    0    0    0    0    0    0    0    0    1    0  
5  ...    0    0    0    0    1    0    0    0    0    0  
6  ...    0    0    0    0    0    0    1    0    0    0  

[6 rows x 39 columns]


In [154]:
#Remove Cols not Required 
df_prcp_dnn = df.drop(columns=['temp', 'dewp', 'slp','visib','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud'])
df_prcp_dnn = df_prcp_dnn.loc[df_prcp_dnn["year"] != 2012]
df_prcp_dnn = df_prcp_dnn.loc[df_prcp_dnn["year"] < 2020]
#Move target to the end
cols = df_prcp_dnn['NUM_COLLISIONS']
df_prcp_dnn = df_prcp_dnn.drop(columns=['NUM_COLLISIONS'])
df_prcp_dnn.insert(loc=26, column='NUM_COLLISIONS', value=cols)
print(df_prcp_dnn[:6])
df_prcp_dnn.describe()

    year  da  prcp  fog  rain_drizzle  snow_ice_pellets  hail  Apr  Aug  Dec  \
49  2016  28  0.09    0             0                 0     0    0    0    0   
51  2014  17  0.00    1             0                 0     0    0    0    0   
54  2016  25  0.02    0             0                 0     0    0    0    0   
55  2016  29  0.00    0             0                 0     0    0    0    0   
58  2017  20  0.00    0             0                 0     0    0    0    0   
59  2013  13  0.01    1             0                 0     0    0    0    0   

    ...  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
49  ...    0    0    0    0    0    0    0    0    1             681  
51  ...    0    0    0    0    0    0    1    0    0             589  
54  ...    0    0    0    0    0    1    0    0    0             658  
55  ...    0    0    0    0    0    0    1    0    0             645  
58  ...    0    0    0    0    0    0    1    0    0             605  
59  ...    0 

Unnamed: 0,year,da,prcp,fog,rain_drizzle,snow_ice_pellets,hail,Apr,Aug,Dec,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,...,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0,2539.0
mean,2015.989366,15.745569,0.122588,0.253249,0.375345,0.085467,0.000394,0.082316,0.083497,0.085467,...,0.085467,0.079953,0.14297,0.14297,0.143364,0.143757,0.142182,0.142576,0.142182,599.135093
std,1.996126,8.803199,0.329143,0.434958,0.484307,0.27963,0.019846,0.274899,0.276687,0.27963,...,0.27963,0.271273,0.350111,0.350111,0.350512,0.350913,0.349305,0.349709,0.349305,100.299164
min,2013.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,531.0
50%,2016.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,602.0
75%,2018.0,23.0,0.06,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,665.0
max,2019.0,31.0,3.76,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [155]:
# Shuffle Data
shuffle = df_prcp_dnn.iloc[np.random.permutation(len(df_prcp_dnn))]
predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da  prcp  fog  rain_drizzle  snow_ice_pellets  hail  Apr  Aug  \
2542  2013   2  0.00    1             1                 0     0    0    0   
3410  2019  21  0.00    0             0                 0     0    0    0   
220   2019   5  0.25    0             1                 0     0    0    0   
3446  2015   6  0.00    1             0                 0     0    0    0   
1328  2016  13  0.00    1             1                 0     0    0    0   
2090  2015  14  0.00    0             1                 0     0    0    0   

      Dec  ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
2542    0  ...    0    0    1    0    0    0    1    0    0    0  
3410    1  ...    0    0    0    1    0    0    0    0    0    0  
220     0  ...    0    0    0    1    0    0    0    0    0    0  
3446    1  ...    0    0    0    0    0    1    0    0    0    0  
1328    0  ...    0    0    0    0    0    0    0    1    0    0  
2090    0  ...    0    0    0    0    1    0    0    0    

In [156]:
#Select target as last col
targets = shuffle.iloc[:,-1]
print(targets[:6])

2542    431
3410    520
220     455
3446    485
1328    766
2090    657
Name: NUM_COLLISIONS, dtype: int64


In [157]:
#Split to test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize
noutputs = 1
# calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

26


The difference between the training of a DNN and linear regression model, is the addition of hidden layers. In order to optimise the trained model, the number of hidden layers, the number of nodes within each layer and the learning rate were all modified.

In [158]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_prcp', ignore_errors=True)

#Setup Model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_prcp', hidden_units=[20,18,13], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))
print("starting to train");
#Train Model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)

predslistscale = preds['scores']*SCALE_COLLISIONS

#Test model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180bba350>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_prcp/model.ckpt.
INFO:tensorflow:loss = 1514.6118, step = 1
INFO:tensorflow:global_step/sec: 346.132
INFO:tensorflow:loss = 0.029177908, step = 101 (0.295 sec)
INFO:tensorflow:global_step/sec: 536.571
INFO:tensorflow:loss = 0.016058322, step = 201 (0.184 sec)
INFO:tensorflow:global_step/sec: 490.067
INFO:tensorflow:loss = 0.015163561, step = 301 (0.207 sec)
INFO:tensorflow:global_step/sec: 503.908
INFO:tensorflow:loss = 0.016746871, step = 401 (0.195 sec)
INFO:tensorflow:global_step/sec: 534.915
INFO:tensorflow:loss = 0.013372488, step = 501 (0.187 sec)
INFO:tensorflow:global_step/sec: 488.683
INFO:tensorflow:loss = 0.0124777965, step = 601 (0.207 sec)
INFO:tensorflow:global_step/sec: 474.707
INFO:tensorflow:loss = 0.016399944, step = 701 (0.210 

DNNRegression has RMSE of 84.055815964759
Just using average = 599.69768586903 has RMSE of 107.78744673447711


In [159]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))
#Ensure hidden layers match the model trained above
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_prcp', hidden_units=[20,18,13], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21b8005a10>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}


508
508


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_prcp/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.50444597 0.44498342 0.49839872 0.5123872  0.5352444  0.52600855
 0.569745   0.5445346  0.52861947 0.5235066  0.52895635 0.4999084
 0.560666   0.5415916  0.54303545 0.48940796 0.4481277  0.47860855
 0.5308289  0.5452089  0.5274996  0.5099086  0.48506683 0.42798704
 0.48449367 0.4985029  0.53439444 0.57339257 0.5179083  0.5041413
 0.42731422 0.48793334 0.43446678 0.5611958  0.55776614 0.488429
 0.5438995  0.45518392 0.56067795 0.5307037  0.49741262 0.51700443
 0.5169677  0.44484895 0.56041664 0.45509714 0.53598565 0.5090565
 0.52829546 0.52326006 0.52145904 0.56543297 0.48814863 0.38606304
 0.5357327  0.51590675 0.5131163  0.41825742 0.55698675 0.5412697
 0.53855985 0.5941927  0.539051   0.52240103 0.57341546 0.51572937
 0.5401296  0.44628114 0.51654714 0.46380728 0.5290274  0.39833015
 0.5418796  0.4861681  0.4899754  0.43878764 0.5648858  0.5424122
 0.5152473  0.5093693  0.52032846 0.515449   0.5158457  0.538521
 0.5287613  0.5183181  0.58786386 0.504466   0.45372146 0.510529
 0.525

In [160]:
input = pd.DataFrame.from_dict(data = 
				{
            'year':[2019,2019,2019,2020],
          'da':[10,10,10,20],
         'prcp' : [0,2.34,5.5,2.24],
         'fog' : [0,0,1,1],
         'rain_drizzle' : [0,1,1,1],
         'snow_ice_pellets' : [0,0,0,0],
         'hail' : [0,0,0,0],
         'Apr' : [0,0,0,0],
         'Aug' : [1,1,1,1],
         'Dec' : [0,0,0,0],
         'Feb' : [0,0,0,0],
         'Jan' : [0,0,0,0],
         'Jul' : [0,0,0,0],
         'Jun' : [0,0,0,0],
         'Mar' : [0,0,0,0],
         'May' : [0,0,0,0],
         'Nov' : [0,0,0,0],
         'Oct' : [0,0,0,0],
         'Sep' : [0,0,0,0],
         'Fri' : [0,0,0,0],
         'Mon' : [1,1,1,1],
         'Sat' : [0,0,0,0],
         'Sun' : [0,0,0,0],
         'Thu' : [0,0,0,0],
         'Tue' : [0,0,0,0],
         'Wed' : [0,0,0,0]
      
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_prcp', hidden_units=[20,18,13], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2186e34810>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_prcp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_prcp/mode

[555.3892 561.1319 568.3101 559.7017]


The RMSE value is very similar to that of the linear regression model, indicating that both models are accurate predictors. The error rate of the DNN is higher, indicating the there is a higher number of errors, but the margin of error is lower.

Based on the linear relationship found, the test data outputs results as expected.

##Dew Point (dewp)
As with the linear regressor the process of training each model follows a very similar process, with the number of hidden layers, the number of nodes within each layer and the learning rate changing dependant on the dataset.

In [165]:
#Read Data
df_dewp_dnn = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/dewp_clean_dnn.csv', index_col=0, )
print(df_dewp_dnn[:6])

   year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd   gust  \
1  2020  24             524  37.3  33.7  1028.5    6.5   3.3    8.0  999.9   
2  2021  12             278  37.0  29.1  1019.0   10.0   6.5   12.0  999.9   
3  2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0   20.0   
4  2021  27             262  34.6  33.8  1012.8    8.0   7.8   12.0  999.9   
5  2021  26             263  31.9  23.4  1016.9    9.0   7.4   12.0  999.9   
6  2022  24             237  34.5  23.8  1010.6    9.7   7.4   12.0  999.9   

   ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  ...    0    0    0    0    0    0    0    1    0    0  
2  ...    0    0    0    0    1    0    0    0    0    0  
3  ...    0    0    0    0    0    0    0    1    0    0  
4  ...    0    0    0    0    0    0    0    0    1    0  
5  ...    0    0    0    0    1    0    0    0    0    0  
6  ...    0    0    0    0    0    0    1    0    0    0  

[6 rows x 39 columns]


In [166]:
#Remove Cols not Required 
df_dewp_dnn = df_dewp_dnn.drop(columns=['temp', 'prcp', 'slp','visib','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_dewp_dnn = df_dewp_dnn.loc[df_dewp_dnn["year"] != 2012]
df_dewp_dnn = df_dewp_dnn.loc[df_dewp_dnn["year"] < 2020]
#Move target to end
cols = df_dewp_dnn['NUM_COLLISIONS']
df_dewp_dnn = df_dewp_dnn.drop(columns=['NUM_COLLISIONS'])
df_dewp_dnn.insert(loc=22, column='NUM_COLLISIONS', value=cols)
print(df_dewp_dnn[:6])
df_dewp_dnn.describe()

    year  da  dewp  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Oct  Sep  Fri  \
49  2016  28  24.4    0    0    0    0    1    0    0  ...    0    0    0   
51  2014  17  35.8    0    0    0    0    1    0    0  ...    0    0    0   
54  2016  25  21.2    0    0    0    0    1    0    0  ...    0    0    0   
55  2016  29  36.8    0    0    0    0    1    0    0  ...    0    0    0   
58  2017  20  32.5    0    0    0    0    1    0    0  ...    0    0    0   
59  2013  13  44.9    0    0    0    0    1    0    0  ...    0    0    0   

    Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
49    0    0    0    0    0    1             681  
51    0    0    0    1    0    0             589  
54    0    0    1    0    0    0             658  
55    0    0    0    1    0    0             645  
58    0    0    0    1    0    0             605  
59    0    1    0    0    0    0             373  

[6 rows x 23 columns]


Unnamed: 0,year,da,dewp,Apr,Aug,Dec,Feb,Jan,Jul,Jun,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,...,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0
mean,2015.999217,15.723679,44.16317,0.082192,0.084932,0.084932,0.077104,0.084932,0.08454,0.082192,...,0.084932,0.082192,0.142466,0.143249,0.142857,0.142857,0.142857,0.142857,0.142857,599.10998
std,2.0,8.801271,16.995303,0.27471,0.278834,0.278834,0.266808,0.278834,0.278251,0.27471,...,0.278834,0.27471,0.349596,0.350395,0.349996,0.349996,0.349996,0.349996,0.349996,100.277185
min,2013.0,1.0,-6.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,32.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,531.0
50%,2016.0,16.0,45.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,602.0
75%,2018.0,23.0,58.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,665.0
max,2019.0,31.0,74.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [167]:
#Shuffle Data
shuffle = df_dewp_dnn.iloc[np.random.permutation(len(df_dewp_dnn))]

predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da  dewp  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Nov  Oct  Sep  \
264   2018   8  28.0    0    0    0    0    1    0    0  ...    0    0    0   
1431  2016  17  44.5    0    0    0    0    0    0    0  ...    0    0    0   
3029  2017  31  44.2    0    0    0    0    0    0    0  ...    0    1    0   
3594  2015  21  40.8    0    0    1    0    0    0    0  ...    0    0    0   
3222  2019  20  39.8    0    0    0    0    0    0    0  ...    1    0    0   
1945  2016  28  69.0    0    0    0    0    0    1    0  ...    0    0    0   

      Fri  Mon  Sat  Sun  Thu  Tue  Wed  
264     0    0    0    1    0    0    0  
1431    0    1    0    0    0    0    0  
3029    0    1    0    0    0    0    0  
3594    0    0    0    1    0    0    0  
3222    0    0    0    0    0    1    0  
1945    0    0    0    0    0    0    1  

[6 rows x 22 columns]


In [168]:
#Select last col as target
targets = shuffle.iloc[:,-1]
print(targets[:6])

264     720
1431    624
3029    696
3594    612
3222    526
1945    709
Name: NUM_COLLISIONS, dtype: int64


In [169]:
#Split to test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
#Calculate number of outputs
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

22


In [171]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_dewp', ignore_errors=True)

#Setup Model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_dewp', hidden_units=[19,17,9], optimizer=tf.train.AdamOptimizer(learning_rate=0.0001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

#Train model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS
#Test Model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180c8abd0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_dewp', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_dewp/model.ckpt.
INFO:tensorflow:loss = 2278.502, step = 1
INFO:tensorflow:global_step/sec: 180.808
INFO:tensorflow:loss = 9.152033, step = 101 (0.555 sec)
INFO:tensorflow:global_step/sec: 463.81
INFO:tensorflow:loss = 8.72504, step = 201 (0.217 sec)
INFO:tensorflow:global_step/sec: 455.05
INFO:tensorflow:loss = 8.032806, step = 301 (0.220 sec)
INFO:tensorflow:global_step/sec: 408.298
INFO:tensorflow:loss = 7.5503483, step = 401 (0.246 sec)
INFO:tensorflow:global_step/sec: 461.096
INFO:tensorflow:loss = 6.216921, step = 501 (0.216 sec)
INFO:tensorflow:global_step/sec: 426.439
INFO:tensorflow:loss = 7.2335544, step = 601 (0.233 sec)
INFO:tensorflow:global_step/sec: 396.964
INFO:tensorflow:loss = 5.8428593, step = 701 (0.254 sec)
INFO:tensorflow:gl

DNNRegression has RMSE of 81.18373280690219
Just using average = 598.4936399217221 has RMSE of 97.45723895859679


In [173]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

# Ensure number of hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_dewp', hidden_units=[19,17,9], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f217fa89290>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_dewp', '_session_creation_timeout_secs': 7200}


511
511


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_dewp/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.5161224  0.5144156  0.5638021  0.56961    0.50995    0.54459804
 0.5268664  0.48013327 0.5465937  0.5828879  0.550459   0.43661296
 0.5443691  0.5433554  0.50073284 0.5932928  0.5222144  0.52036756
 0.6468448  0.62040687 0.416818   0.5732942  0.5902322  0.5571462
 0.5390077  0.5489542  0.42926723 0.5936663  0.5482859  0.4821805
 0.49113512 0.43909934 0.5138586  0.3759073  0.5293005  0.4887133
 0.57810795 0.49117145 0.50808334 0.5481959  0.56424105 0.43003097
 0.5633474  0.5062528  0.5125602  0.59050876 0.4751144  0.49893475
 0.50104403 0.5693581  0.523823   0.44281793 0.569378   0.519079
 0.4985885  0.5440442  0.5873928  0.57331705 0.46690565 0.46823338
 0.5815407  0.41638747 0.4140756  0.5021293  0.58359385 0.5062102
 0.5639096  0.59023625 0.5109083  0.5284078  0.5416432  0.5555529
 0.6271585  0.5454535  0.41284668 0.4172884  0.5568577  0.5328828
 0.5241073  0.5762602  0.5381281  0.44959676 0.5053994  0.52595794
 0.51329064 0.58196455 0.5911612  0.59167945 0.48942715 0.5690166
 0.5

In [174]:
input = pd.DataFrame.from_dict(data = 
				{
            'year':[2019,2019,2019,2020],
          'da':[10,10,10,20],
         'dewp' : [0,10.5,66.1,10.5],
         'Apr' : [0,0,0,0],
         'Aug' : [1,1,1,1],
         'Dec' : [0,0,0,0],
         'Feb' : [0,0,0,0],
         'Jan' : [0,0,0,0],
         'Jul' : [0,0,0,0],
         'Jun' : [0,0,0,0],
         'Mar' : [0,0,0,0],
         'May' : [0,0,0,0],
         'Nov' : [0,0,0,0],
         'Oct' : [0,0,0,0],
         'Sep' : [0,0,0,0],
         'Fri' : [0,0,0,0],
         'Mon' : [1,1,1,1],
         'Sat' : [0,0,0,0],
         'Sun' : [0,0,0,0],
         'Thu' : [0,0,0,0],
         'Tue' : [0,0,0,0],
         'Wed' : [0,0,0,0]
      
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_dewp', hidden_units=[19,17,9], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f217f995a10>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_dewp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_dewp/mode

[428.91623 463.7521  648.1726  458.87842]


As shown by the RMSE value, this model is a more efficient way to predict the number of collisions in comparison to using the mean. In comparison to the linear regression model trained, the RMSE is lower indicating the DNN makes more accurate predictions. As with the DNN for precipitation, the RMSE is lower than the linear model the error rate is higher indicating there is more errors but the margin of error is lower.

Based on the linear relationship found, the test data outputs results as expected.

##Sea Level Pressure(slp)
Through the analysis carried out in assignment 1, no clear relationship between sea level pressure and the number of collisions was uncovered. A DNN will be used to attempt to predict the number of collisions at a given pressure point.

In [175]:
#Read Data
df_slp_dnn = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/slp_clean_dnn.csv', index_col=0, )
print(df[:6])

   year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd   gust  \
1  2020  24             524  37.3  33.7  1028.5    6.5   3.3    8.0  999.9   
2  2021  12             278  37.0  29.1  1019.0   10.0   6.5   12.0  999.9   
3  2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0   20.0   
4  2021  27             262  34.6  33.8  1012.8    8.0   7.8   12.0  999.9   
5  2021  26             263  31.9  23.4  1016.9    9.0   7.4   12.0  999.9   
6  2022  24             237  34.5  23.8  1010.6    9.7   7.4   12.0  999.9   

   ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  ...    0    0    0    0    0    0    0    1    0    0  
2  ...    0    0    0    0    1    0    0    0    0    0  
3  ...    0    0    0    0    0    0    0    1    0    0  
4  ...    0    0    0    0    0    0    0    0    1    0  
5  ...    0    0    0    0    1    0    0    0    0    0  
6  ...    0    0    0    0    0    0    1    0    0    0  

[6 rows x 39 columns]


In [176]:
#Remove Cols not Required 
df_slp_dnn = df_slp_dnn.drop(columns=['temp', 'prcp', 'dewp','visib','max','min','sndp','wdsp','mxpsd','gust','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_slp_dnn = df_slp_dnn.loc[df_slp_dnn["year"] != 2012]
df_slp_dnn = df_slp_dnn.loc[df_slp_dnn["year"] < 2020]
#Move target to the end
cols = df_slp_dnn['NUM_COLLISIONS']
df_slp_dnn = df_slp_dnn.drop(columns=['NUM_COLLISIONS'])
df_slp_dnn.insert(loc=22, column='NUM_COLLISIONS', value=cols)
print(df_slp_dnn[:6])
df_slp_dnn.describe()

    year  da     slp  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Oct  Sep  Fri  \
49  2016  28  1016.1    0    0    0    0    1    0    0  ...    0    0    0   
51  2014  17  1014.8    0    0    0    0    1    0    0  ...    0    0    0   
54  2016  25  1021.4    0    0    0    0    1    0    0  ...    0    0    0   
55  2016  29   999.4    0    0    0    0    1    0    0  ...    0    0    0   
58  2017  20  1015.5    0    0    0    0    1    0    0  ...    0    0    0   
59  2013  13  1020.7    0    0    0    0    1    0    0  ...    0    0    0   

    Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
49    0    0    0    0    0    1             681  
51    0    0    0    1    0    0             589  
54    0    0    1    0    0    0             658  
55    0    0    0    1    0    0             645  
58    0    0    0    1    0    0             605  
59    0    1    0    0    0    0             373  

[6 rows x 23 columns]


Unnamed: 0,year,da,slp,Apr,Aug,Dec,Feb,Jan,Jul,Jun,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,...,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0,2555.0
mean,2016.000391,15.719765,1016.777221,0.082192,0.084932,0.08454,0.077104,0.084932,0.084932,0.082192,...,0.084932,0.082192,0.142857,0.143249,0.142857,0.142857,0.142857,0.142857,0.142466,599.147162
std,2.000294,8.796698,7.628429,0.27471,0.278834,0.278251,0.266808,0.278834,0.278834,0.27471,...,0.278834,0.27471,0.349996,0.350395,0.349996,0.349996,0.349996,0.349996,0.349596,100.268048
min,2013.0,1.0,989.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,1012.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,531.0
50%,2016.0,16.0,1016.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,602.0
75%,2018.0,23.0,1021.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,665.0
max,2019.0,31.0,1044.2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [177]:
#Shuffle dataset
shuffle = df_slp_dnn.iloc[np.random.permutation(len(df_slp_dnn))]

predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da     slp  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Nov  Oct  Sep  \
774   2015   9  1020.1    0    0    0    0    0    0    0  ...    0    0    0   
2308  2017   1  1017.1    0    1    0    0    0    0    0  ...    0    0    0   
3546  2013   3  1011.1    0    0    1    0    0    0    0  ...    0    0    0   
1587  2019  18  1015.1    0    0    0    0    0    0    1  ...    0    0    0   
3594  2015  21  1025.2    0    0    1    0    0    0    0  ...    0    0    0   
2965  2013   8  1019.7    0    0    0    0    0    0    0  ...    0    1    0   

      Fri  Mon  Sat  Sun  Thu  Tue  Wed  
774     0    0    0    1    0    0    0  
2308    0    1    0    0    0    0    0  
3546    0    1    0    0    0    0    0  
1587    0    1    0    0    0    0    0  
3594    0    0    0    1    0    0    0  
2965    0    1    0    0    0    0    0  

[6 rows x 22 columns]


In [178]:
#Select Target
targets = shuffle.iloc[:,-1]

# print out the first 6 rows of the targets data.
print(targets[:6])

774     661
2308    705
3546    553
1587    721
3594    612
2965    574
Name: NUM_COLLISIONS, dtype: int64


In [179]:
#Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
#Calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

22


In [183]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_slp', ignore_errors=True)
#Setup model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_slp', hidden_units=[15,13,9], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

# Train model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS
#Test Model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));

avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21849832d0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_slp', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_slp/model.ckpt.
INFO:tensorflow:loss = 4011.2021, step = 1
INFO:tensorflow:global_step/sec: 349.044
INFO:tensorflow:loss = 0.0119057875, step = 101 (0.290 sec)
INFO:tensorflow:global_step/sec: 503.932
INFO:tensorflow:loss = 0.008684389, step = 201 (0.201 sec)
INFO:tensorflow:global_step/sec: 437.11
INFO:tensorflow:loss = 0.008322796, step = 301 (0.227 sec)
INFO:tensorflow:global_step/sec: 464.541
INFO:tensorflow:loss = 0.009944219, step = 401 (0.214 sec)
INFO:tensorflow:global_step/sec: 513.014
INFO:tensorflow:loss = 0.010556124, step = 501 (0.196 sec)
INFO:tensorflow:global_step/sec: 512.38
INFO:tensorflow:loss = 0.009488182, step = 601 (0.199 sec)
INFO:tensorflow:global_step/sec: 457.278
INFO:tensorflow:loss = 0.011358846, step = 701 (0.217 sec

DNNRegression has RMSE of 83.90977431833942
Just using average = 600.5239726027397 has RMSE of 101.79804376051676


In [187]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

#Ensure hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_slp', hidden_units=[15,13,9], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180da6910>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_slp', '_session_creation_timeout_secs': 7200}


511
511


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_slp/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.52125245 0.5034854  0.41281563 0.476237   0.5458724  0.50280666
 0.58176255 0.45544517 0.4562766  0.52456474 0.41281563 0.49613345
 0.48526388 0.53654647 0.55378985 0.5539732  0.6000898  0.5526836
 0.53982186 0.49123958 0.5526306  0.5224725  0.536389   0.5291162
 0.547381   0.508787   0.46184903 0.49955156 0.5659125  0.45230004
 0.53561115 0.57052994 0.5384188  0.49192184 0.43280986 0.4471418
 0.5352424  0.5217596  0.57406735 0.43629843 0.547762   0.517613
 0.5077557  0.5034744  0.48244873 0.5791251  0.5494153  0.56851065
 0.58760595 0.53095555 0.45680854 0.47154695 0.5294121  0.5141434
 0.48286062 0.5549488  0.5594372  0.4903386  0.5159296  0.41809854
 0.5472297  0.53090256 0.5271747  0.41281563 0.54033655 0.52469814
 0.44946957 0.49765342 0.52984214 0.5348796  0.46027333 0.47828263
 0.5220268  0.5768114  0.48278582 0.56255877 0.42644298 0.52820206
 0.5510924  0.57657695 0.54934394 0.52745605 0.55426097 0.54239506
 0.41281563 0.48803777 0.55779743 0.5424226  0.41281563 0.51517284
 

In [189]:
input = pd.DataFrame.from_dict(data = 
				{
            'year':[2019,2019,2019,2020],
          'da':[10,10,10,20],
         'slp' : [990.2,1022.4,1039.0,1022.4],
         'Apr' : [0,0,0,0],
         'Aug' : [1,1,1,1],
         'Dec' : [0,0,0,0],
         'Feb' : [0,0,0,0],
         'Jan' : [0,0,0,0],
         'Jul' : [0,0,0,0],
         'Jun' : [0,0,0,0],
         'Mar' : [0,0,0,0],
         'May' : [0,0,0,0],
         'Nov' : [0,0,0,0],
         'Oct' : [0,0,0,0],
         'Sep' : [0,0,0,0],
         'Fri' : [0,0,0,0],
         'Mon' : [1,1,1,1],
         'Sat' : [0,0,0,0],
         'Sun' : [0,0,0,0],
         'Thu' : [0,0,0,0],
         'Tue' : [0,0,0,0],
         'Wed' : [0,0,0,0]
      
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_slp', hidden_units=[15,13,9], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f218097e310>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_slp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_slp/model.

[112.89884 147.19633 164.878   141.17247]


Although no linear relationship was uncovered, it can be argued there is a relationship present due to the RMSE value which is lower than the mean value. This relationship is also shown in the error rate which is comparative to the other models produced.

##Gust
As with sea level pressure, no linear relationship between the maximum gust and the number of collisions was uncovered.

In [190]:
#Read data
df_gust_dnn = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/gust_clean_dnn.csv', index_col=0, )
print(df_gust_dnn[:6])

    year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd  gust  \
3   2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0  20.0   
11  2020  15             508  43.9  38.3  1019.4    8.2   5.4   14.0  15.0   
12  2021   1             257  39.6  29.3  1029.3   10.0   7.6   14.0  20.0   
14  2022  25             235  41.6  31.8  1013.2   10.0   9.6   15.0  19.0   
18  2021   3             186  41.1  32.3  1018.0   10.0  10.3   19.0  27.0   
19  2020   2             413  39.6  28.9  1011.8   10.0  13.0   19.0  26.0   

    ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
3   ...    0    0    0    0    0    0    0    1    0    0  
11  ...    0    0    0    0    0    0    0    0    1    0  
12  ...    0    0    0    0    0    0    0    1    0    0  
14  ...    0    0    0    0    1    0    0    0    0    0  
18  ...    0    0    0    0    0    1    0    0    0    0  
19  ...    0    0    0    0    0    0    0    0    0    1  

[6 rows x 39 columns]


In [191]:
#Remove Cols not Required 
df_gust_dnn = df_gust_dnn.drop(columns=['temp', 'prcp', 'dewp','visib','max','min','sndp','wdsp','mxpsd','slp','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_gust_dnn = df_gust_dnn.loc[df_gust_dnn["year"] != 2012]
df_gust_dnn = df_gust_dnn.loc[df_gust_dnn["year"] < 2020]
#Move target col to end
cols = df_gust_dnn['NUM_COLLISIONS']
df_gust_dnn = df_gust_dnn.drop(columns=['NUM_COLLISIONS'])
df_gust_dnn.insert(loc=22, column='NUM_COLLISIONS', value=cols)
print(df_gust_dnn[:6])
df_gust_dnn.describe()

    year  da  gust  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Oct  Sep  Fri  \
74  2016  17  18.1    0    0    0    0    1    0    0  ...    0    0    0   
76  2014   9  20.0    0    0    0    0    1    0    0  ...    0    0    0   
79  2019  19  21.0    0    0    0    0    1    0    0  ...    0    0    1   
80  2015  11  17.1    0    0    0    0    1    0    0  ...    0    0    0   
83  2015  29  20.0    0    0    0    0    1    0    0  ...    0    0    0   
85  2019  13  15.9    0    0    0    0    1    0    0  ...    0    0    0   

    Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
74    0    1    0    0    0    0             451  
76    0    0    0    0    0    1             561  
79    0    0    0    0    0    0             479  
80    0    1    0    0    0    0             341  
83    0    0    0    0    0    1             519  
85    0    1    0    0    0    0             374  

[6 rows x 23 columns]


Unnamed: 0,year,da,gust,Apr,Aug,Dec,Feb,Jan,Jul,Jun,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,...,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0
mean,2015.91283,15.702885,27.511602,0.095764,0.042357,0.104359,0.09515,0.108656,0.046041,0.061387,...,0.087784,0.071209,0.143646,0.139963,0.141191,0.139963,0.151627,0.138122,0.145488,596.513198
std,2.01341,8.667634,7.36677,0.294358,0.201465,0.305819,0.293513,0.311302,0.209637,0.240113,...,0.283067,0.257253,0.350839,0.347055,0.348325,0.347055,0.358769,0.345133,0.3527,104.47966
min,2013.0,1.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,526.0
50%,2016.0,16.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,597.0
75%,2018.0,23.0,31.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,663.0
max,2019.0,31.0,71.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [192]:
# Shuffle Data
shuffle = df_gust_dnn.iloc[np.random.permutation(len(df_gust_dnn))]

predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da  gust  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Nov  Oct  Sep  \
204   2018  15  25.1    0    0    0    0    1    0    0  ...    0    0    0   
797   2014   5  26.0    0    0    0    0    0    0    0  ...    0    0    0   
388   2019  20  18.1    0    0    0    1    0    0    0  ...    0    0    0   
3265  2018   4  32.1    0    0    0    0    0    0    0  ...    1    0    0   
3260  2014   8  32.1    0    0    0    0    0    0    0  ...    1    0    0   
3264  2016  13  28.0    0    0    0    0    0    0    0  ...    1    0    0   

      Fri  Mon  Sat  Sun  Thu  Tue  Wed  
204     0    0    0    1    0    0    0  
797     0    0    0    0    0    1    0  
388     0    0    0    0    0    1    0  
3265    0    0    1    0    0    0    0  
3260    1    0    0    0    0    0    0  
3264    0    0    1    0    0    0    0  

[6 rows x 22 columns]


In [193]:
# Select last col as a target
targets = shuffle.iloc[:,-1]
print(targets[:6])

204     523
797     491
388     615
3265    502
3260    543
3264    518
Name: NUM_COLLISIONS, dtype: int64


In [194]:
# Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
# Calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

22


In [195]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_gust', ignore_errors=True)

#Setup model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_gust', hidden_units=[23,19,11], optimizer=tf.train.AdamOptimizer(learning_rate=0.001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

# Train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS

#Test the model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180284350>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_gust', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_gust/model.ckpt.
INFO:tensorflow:loss = 120400.266, step = 1
INFO:tensorflow:global_step/sec: 405.676
INFO:tensorflow:loss = 7.0375686, step = 101 (0.254 sec)
INFO:tensorflow:global_step/sec: 443.996
INFO:tensorflow:loss = 5.923665, step = 201 (0.221 sec)
INFO:tensorflow:global_step/sec: 457.858
INFO:tensorflow:loss = 5.86942, step = 301 (0.219 sec)
INFO:tensorflow:global_step/sec: 515.249
INFO:tensorflow:loss = 6.0937967, step = 401 (0.193 sec)
INFO:tensorflow:global_step/sec: 488.124
INFO:tensorflow:loss = 4.2698884, step = 501 (0.209 sec)
INFO:tensorflow:global_step/sec: 478.422
INFO:tensorflow:loss = 4.467372, step = 601 (0.205 sec)
INFO:tensorflow:global_step/sec: 483.527
INFO:tensorflow:loss = 4.017475, step = 701 (0.208 sec)
INFO:tensorflo

DNNRegression has RMSE of 1018.3540724737733
Just using average = 595.6431312356101 has RMSE of 107.5373681714606


In [196]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

#Ensure the hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_gust', hidden_units=[23,19,11], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21809540d0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_gust', '_session_creation_timeout_secs': 7200}


326
326


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_gust/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[-0.3049663  -0.42005202 -0.51450616 -0.29954398 -0.3548231  -0.33809265
 -0.44262335 -0.39822903 -0.38705528 -0.31102693 -0.38927925 -0.29266882
 -0.41832456 -0.38759193 -0.32961118 -0.302848   -0.3841639  -0.35516366
 -0.39627936 -0.33033198 -0.3804949  -0.4882057  -0.3625027  -0.38677642
 -0.29193735 -0.32585177 -0.2907221  -0.35136354 -0.32740298 -0.38262534
 -0.3065188  -0.28920543 -0.31905624 -0.37320375 -0.3748302  -0.39114377
 -0.3285573  -0.3862763  -0.29360756 -0.357711   -0.4675518  -0.39903703
 -0.36056608 -0.31410253 -0.30655792 -0.516221   -0.35794276 -0.34592134
 -0.29268256 -0.3290304  -0.3581174  -0.3279106  -0.44864726 -0.40452954
 -0.32755962 -0.46700737 -0.34779194 -0.43543026 -0.3385287  -0.36341026
 -0.38059017 -0.33614656 -0.30604503 -0.40815502 -0.4003447  -0.32367313
 -0.36357993 -0.28842196 -0.31632268 -0.29129422 -0.28562742 -0.4094085
 -0.35539058 -0.3862904  -0.25717908 -0.28122094 -0.28967875 -0.34104088
 -0.39588308 -0.30864412 -0.3474564  -0.3633494  -0.

This model shows there is no relationship between gust and the number of collisions.

The RMSE value is over 4 times the mean value indicating the model is unable to make accurate predictions. This is also reflected in the very high error rate.

##Maximum Sustained Wind Speed (mxpsd)
In an attempt to predict the number of collisions given the maximum sustained wind speed a DNN will be used as no linear relationship was uncovered within assignment 1.

In [197]:
#Read the data
df_mxpsd_dnn = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/mxpsd_clean_dnn.csv', index_col=0, )
print(df_mxpsd_dnn[:6])

   year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd   gust  \
1  2020  24             524  37.3  33.7  1028.5    6.5   3.3    8.0  999.9   
2  2021  12             278  37.0  29.1  1019.0   10.0   6.5   12.0  999.9   
3  2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0   20.0   
4  2021  27             262  34.6  33.8  1012.8    8.0   7.8   12.0  999.9   
5  2021  26             263  31.9  23.4  1016.9    9.0   7.4   12.0  999.9   
6  2022  24             237  34.5  23.8  1010.6    9.7   7.4   12.0  999.9   

   ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  ...    0    0    0    0    0    0    0    1    0    0  
2  ...    0    0    0    0    1    0    0    0    0    0  
3  ...    0    0    0    0    0    0    0    1    0    0  
4  ...    0    0    0    0    0    0    0    0    1    0  
5  ...    0    0    0    0    1    0    0    0    0    0  
6  ...    0    0    0    0    0    0    1    0    0    0  

[6 rows x 39 columns]


In [198]:
#Remove the cols not required
df_mxpsd_dnn = df_mxpsd_dnn.drop(columns=['temp', 'prcp', 'dewp','visib','max','min','sndp','wdsp','gust','slp','thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
df_mxpsd_dnn = df_mxpsd_dnn.loc[df_mxpsd_dnn["year"] != 2012]
df_mxpsd_dnn = df_mxpsd_dnn.loc[df_mxpsd_dnn["year"] < 2020]
#Move the target to the end
cols = df_mxpsd_dnn['NUM_COLLISIONS']
df_mxpsd_dnn = df_mxpsd_dnn.drop(columns=['NUM_COLLISIONS'])
df_mxpsd_dnn.insert(loc=22, column='NUM_COLLISIONS', value=cols)
print(df_mxpsd_dnn[:6])
df_mxpsd_dnn.describe()

    year  da  mxpsd  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Oct  Sep  Fri  \
49  2016  28    8.9    0    0    0    0    1    0    0  ...    0    0    0   
51  2014  17    8.9    0    0    0    0    1    0    0  ...    0    0    0   
54  2016  25    8.9    0    0    0    0    1    0    0  ...    0    0    0   
55  2016  29    9.9    0    0    0    0    1    0    0  ...    0    0    0   
58  2017  20    9.9    0    0    0    0    1    0    0  ...    0    0    0   
59  2013  13    9.9    0    0    0    0    1    0    0  ...    0    0    0   

    Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
49    0    0    0    0    0    1             681  
51    0    0    0    1    0    0             589  
54    0    0    1    0    0    0             658  
55    0    0    0    1    0    0             645  
58    0    0    0    1    0    0             605  
59    0    1    0    0    0    0             373  

[6 rows x 23 columns]


Unnamed: 0,year,da,mxpsd,Apr,Aug,Dec,Feb,Jan,Jul,Jun,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,...,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0,2553.0
mean,2016.001567,15.737172,17.24011,0.082256,0.084998,0.084998,0.077164,0.084998,0.084998,0.082256,...,0.084998,0.081473,0.142969,0.143361,0.142969,0.142969,0.142969,0.142577,0.142186,599.033686
std,2.000587,8.797367,5.858333,0.274808,0.278933,0.278933,0.266904,0.278933,0.278933,0.274808,...,0.278933,0.273613,0.35011,0.350509,0.35011,0.35011,0.35011,0.34971,0.349309,100.284761
min,2013.0,1.0,5.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,531.0
50%,2016.0,16.0,15.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,602.0
75%,2018.0,23.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,665.0
max,2019.0,31.0,49.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [204]:
# shuffle the data
shuffle = df_mxpsd_dnn.iloc[np.random.permutation(len(df_mxpsd_dnn))]

predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da  mxpsd  Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Nov  Oct  Sep  \
2243  2014   3    9.9    0    1    0    0    0    0    0  ...    0    0    0   
1415  2014  29   17.1    0    0    0    0    0    0    0  ...    0    0    0   
3358  2018   3   32.1    0    0    0    0    0    0    0  ...    1    0    0   
573   2016  25   29.9    0    0    0    1    0    0    0  ...    0    0    0   
1876  2015  25    8.9    0    0    0    0    0    1    0  ...    0    0    0   
659   2016   8   13.0    0    0    0    0    0    0    0  ...    0    0    0   

      Fri  Mon  Sat  Sun  Thu  Tue  Wed  
2243    0    0    1    0    0    0    0  
1415    0    0    0    0    0    0    1  
3358    1    0    0    0    0    0    0  
573     0    0    0    0    0    0    1  
1876    1    0    0    0    0    0    0  
659     0    1    0    0    0    0    0  

[6 rows x 22 columns]


In [205]:
# Select the target as the last col
targets = shuffle.iloc[:,-1]
print(targets[:6])

2243    476
1415    663
3358    656
573     663
1876    451
659     632
Name: NUM_COLLISIONS, dtype: int64


In [206]:
#Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
# Calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

22


In [212]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_mxpsd', ignore_errors=True)

#Setup the model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_mxpsd', hidden_units=[23,13,9], optimizer=tf.train.AdamOptimizer(learning_rate=0.001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

# Train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS

# Test the model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2187f87450>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_mxpsd', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_mxpsd/model.ckpt.
INFO:tensorflow:loss = 144257.9, step = 1
INFO:tensorflow:global_step/sec: 399.379
INFO:tensorflow:loss = 3.7238111, step = 101 (0.256 sec)
INFO:tensorflow:global_step/sec: 485.75
INFO:tensorflow:loss = 0.23482522, step = 201 (0.204 sec)
INFO:tensorflow:global_step/sec: 503.828
INFO:tensorflow:loss = 0.2148373, step = 301 (0.201 sec)
INFO:tensorflow:global_step/sec: 509.616
INFO:tensorflow:loss = 0.20698708, step = 401 (0.193 sec)
INFO:tensorflow:global_step/sec: 476.841
INFO:tensorflow:loss = 0.20983556, step = 501 (0.212 sec)
INFO:tensorflow:global_step/sec: 522.498
INFO:tensorflow:loss = 0.20757458, step = 601 (0.190 sec)
INFO:tensorflow:global_step/sec: 484.111
INFO:tensorflow:loss = 0.21551603, step = 701 (0.210 sec)
INFO:t

DNNRegression has RMSE of 84.23085911504302
Just using average = 598.1993143976493 has RMSE of 104.23183328393432


In [213]:
print(len(predictors[trainsize:].values))
print(len(targets[trainsize:].values.reshape(testsize, noutputs)/SCALE_COLLISIONS))

# Ensure the hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_mxpsd', hidden_units=[23,13,9], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180284490>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_mxpsd', '_session_creation_timeout_secs': 7200}


511
511


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_mxpsd/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.5051993  0.45869815 0.43577182 0.47302425 0.5498141  0.600767
 0.5123404  0.53738964 0.5240153  0.52028644 0.5878581  0.5521468
 0.42057598 0.5662631  0.4849261  0.5182208  0.51844394 0.5679873
 0.48783863 0.55446804 0.38091075 0.54891384 0.43582904 0.455078
 0.5394019  0.5436133  0.52703655 0.52970874 0.42868984 0.51997936
 0.5729331  0.53062236 0.51496494 0.52057445 0.51554286 0.5385779
 0.54045093 0.5253848  0.484537   0.54777324 0.5728167  0.56724155
 0.43326557 0.5131911  0.508314   0.5161704  0.49952495 0.56435764
 0.5183524  0.5124892  0.52438915 0.48305118 0.4849547  0.46947658
 0.5106658  0.53285396 0.52265155 0.5624198  0.53384197 0.4575976
 0.41304004 0.46495235 0.56669605 0.5417403  0.502962   0.5676383
 0.48058116 0.37955463 0.54059017 0.5121707  0.47007358 0.42322338
 0.58208835 0.54260814 0.42033184 0.52436626 0.42542255 0.51416194
 0.5596732  0.56292903 0.48566806 0.5127467  0.51593006 0.5169562
 0.46529186 0.5085639  0.53960025 0.5076922  0.4327277  0.5140475
 0.507

In [218]:
input = pd.DataFrame.from_dict(data = 
				{
            'year':[2019,2019,2019,2020],
          'da':[10,10,10,20],
         'mxpsd' : [9.2,20.5,45.2,20.5],
         'Apr' : [0,0,0,0],
         'Aug' : [1,1,1,1],
         'Dec' : [0,0,0,0],
         'Feb' : [0,0,0,0],
         'Jan' : [0,0,0,0],
         'Jul' : [0,0,0,0],
         'Jun' : [0,0,0,0],
         'Mar' : [0,0,0,0],
         'May' : [0,0,0,0],
         'Nov' : [0,0,0,0],
         'Oct' : [0,0,0,0],
         'Sep' : [0,0,0,0],
         'Fri' : [0,0,0,0],
         'Mon' : [1,1,1,1],
         'Sat' : [0,0,0,0],
         'Sun' : [0,0,0,0],
         'Thu' : [0,0,0,0],
         'Tue' : [0,0,0,0],
         'Wed' : [0,0,0,0]
      
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_mxpsd', hidden_units=[23,13,9], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180da0410>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_mxpsd', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_mxpsd/mo

[787.65375 784.37445 777.20844 780.4174 ]


The low RMSE value suggests that a relationship between the maximum sustained wind speed and the number of collisions exist. The model produced can be used to predict the number of collisions with a degree of accuracy. As with the other DNN models produced the error rate is higher.

##Whole Dataset
As the purpose of this assignment is to accurately predict the number of collisions given the weather condition**s**, all available weather conditions are used as input variables to train a DNN model.

In [242]:
#Read the data
df = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/datadnn.csv', index_col=0, )
print(df[:6])

   year  da  NUM_COLLISIONS  temp  dewp     slp  visib  wdsp  mxpsd   gust  \
1  2020  24             524  37.3  33.7  1028.5    6.5   3.3    8.0  999.9   
2  2021  12             278  37.0  29.1  1019.0   10.0   6.5   12.0  999.9   
3  2021  22             254  36.5  28.4  1003.1   10.0   7.8   12.0   20.0   
4  2021  27             262  34.6  33.8  1012.8    8.0   7.8   12.0  999.9   
5  2021  26             263  31.9  23.4  1016.9    9.0   7.4   12.0  999.9   
6  2022  24             237  34.5  23.8  1010.6    9.7   7.4   12.0  999.9   

   ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  ...    0    0    0    0    0    0    0    1    0    0  
2  ...    0    0    0    0    1    0    0    0    0    0  
3  ...    0    0    0    0    0    0    0    1    0    0  
4  ...    0    0    0    0    0    0    0    0    1    0  
5  ...    0    0    0    0    1    0    0    0    0    0  
6  ...    0    0    0    0    0    0    1    0    0    0  

[6 rows x 39 columns]


As the whole dataset contains the error values for Dew Point, Sea Level Pressure, Maximum Sustained Wind Speed and Gust; these must be removed.

In [243]:
#Clean the data
dnn = df.loc[df["year"] != 2012]
dnn = dnn.loc[dnn["year"] < 2020]
dnn = dnn.loc[dnn["dewp"] != 9999]
dnn = dnn.loc[dnn["slp"] != 9999]
dnn = dnn.loc[dnn["mxpsd"] != 999.9]
dnn = dnn.loc[dnn["gust"] != 999.9]
#Move the target to the end
cols = dnn['NUM_COLLISIONS']
dnn = dnn.drop(columns=['NUM_COLLISIONS'])
dnn.insert(loc=38, column='NUM_COLLISIONS', value=cols)
print(dnn[:6])
dnn.describe()

    year  da  temp  dewp     slp  visib  wdsp  mxpsd  gust   max  ...  Oct  \
74  2016  17  40.2  32.3  1007.3    9.2   7.7   12.0  18.1  51.1  ...    0   
76  2014   9  23.5   8.3  1034.2   10.0   7.9   12.0  20.0  28.9  ...    0   
79  2019  19  34.5  29.7  1022.0    9.8   6.9   13.0  21.0  39.9  ...    0   
80  2015  11  27.1  12.1  1035.5   10.0   8.8   13.0  17.1  37.0  ...    0   
83  2015  29  29.2  20.9  1022.9   10.0   8.5   13.0  20.0  36.0  ...    0   
85  2019  13  26.0  12.8  1030.5   10.0   8.0   13.0  15.9  30.9  ...    0   

    Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
74    0    0    0    1    0    0    0    0             451  
76    0    0    0    0    0    0    0    1             561  
79    0    1    0    0    0    0    0    0             479  
80    0    0    0    1    0    0    0    0             341  
83    0    0    0    0    0    0    0    1             519  
85    0    0    0    1    0    0    0    0             374  

[6 rows x 39 columns]


Unnamed: 0,year,da,temp,dewp,slp,visib,wdsp,mxpsd,gust,max,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,...,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0,1629.0
mean,2015.91283,15.702885,47.909638,45.903254,1015.632904,8.225599,12.602087,20.060896,27.511602,55.73407,...,0.087784,0.071209,0.143646,0.139963,0.141191,0.139963,0.151627,0.138122,0.145488,596.513198
std,2.01341,8.667634,13.746339,247.35284,8.134237,2.227285,3.986056,5.294117,7.36677,13.52726,...,0.283067,0.257253,0.350839,0.347055,0.348325,0.347055,0.358769,0.345133,0.3527,104.47966
min,2013.0,1.0,5.8,-6.7,989.5,0.6,4.5,8.9,14.0,18.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,188.0
25%,2014.0,8.0,38.1,28.2,1010.6,7.0,10.0,15.9,22.0,46.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,526.0
50%,2016.0,16.0,47.0,40.2,1015.4,9.3,12.0,19.0,26.0,55.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,597.0
75%,2018.0,23.0,58.8,52.8,1021.1,10.0,14.4,22.9,31.1,66.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,663.0
max,2019.0,31.0,77.5,9999.9,1039.1,10.0,39.3,49.0,71.1,87.1,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1161.0


In [244]:
# Shuffle the data
shuffle = dnn.iloc[np.random.permutation(len(dnn))]
predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

      year  da  temp  dewp     slp  visib  wdsp  mxpsd  gust   max  ...  Nov  \
384   2013   6  34.8  30.4  1014.9    9.0   6.1   13.0  17.1  39.2  ...    0   
3569  2015  27  52.9  49.0  1013.0    4.5  13.1   19.0  24.1  57.0  ...    0   
2452  2013  23  70.6  65.0  1013.1    7.8  12.1   22.9  28.0  77.0  ...    0   
2918  2014  21  57.7  52.3  1015.5    9.3  10.5   15.9  21.0  60.8  ...    0   
760   2015   6  22.7  12.6  1028.3    8.3  11.7   19.0  26.0  37.0  ...    0   
1038  2015  17  47.5  43.2  1021.5    6.9  12.5   15.9  24.1  54.0  ...    0   

      Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
384     0    0    0    0    0    0    0    1    0  
3569    0    0    0    0    1    0    0    0    0  
2452    0    0    0    0    0    0    1    0    0  
2918    1    0    0    1    0    0    0    0    0  
760     0    0    0    0    0    0    1    0    0  
1038    0    0    0    0    0    0    1    0    0  

[6 rows x 38 columns]


In [245]:
#select the target as the last col
targets = shuffle.iloc[:,-1]
print(targets[:6])

384     566
3569    518
2452    636
2918    607
760     939
1038    679
Name: NUM_COLLISIONS, dtype: int64


In [246]:
# Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
# Calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)


38


In [247]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model', ignore_errors=True)

#Setup the model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[17,11,5,3], optimizer=tf.train.AdamOptimizer(learning_rate=0.001), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

# Train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_COLLISIONS, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_COLLISIONS
#Test the model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f217f9f3150>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model/model.ckpt.
INFO:tensorflow:loss = 2.1504958, step = 1
INFO:tensorflow:global_step/sec: 356.625
INFO:tensorflow:loss = 0.00729577, step = 101 (0.286 sec)
INFO:tensorflow:global_step/sec: 428.872
INFO:tensorflow:loss = 0.0069492403, step = 201 (0.231 sec)
INFO:tensorflow:global_step/sec: 469.751
INFO:tensorflow:loss = 0.008179862, step = 301 (0.215 sec)
INFO:tensorflow:global_step/sec: 477.627
INFO:tensorflow:loss = 0.008755423, step = 401 (0.207 sec)
INFO:tensorflow:global_step/sec: 490.636
INFO:tensorflow:loss = 0.0059746047, step = 501 (0.206 sec)
INFO:tensorflow:global_step/sec: 508.109
INFO:tensorflow:loss = 0.007540924, step = 601 (0.193 sec)
INFO:tensorflow:global_step/sec: 486.519
INFO:tensorflow:loss = 0.006275481, step = 701 (0.206 sec)


DNNRegression has RMSE of 88.93854176237826
Just using average = 594.9838833461243 has RMSE of 106.37525048712067


In [248]:
print(predictors[trainsize:].values)

#Ensure the hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[17,11,5,3], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_COLLISIONS
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2180139dd0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}


[[2.016e+03 2.000e+00 6.920e+01 ... 1.000e+00 0.000e+00 0.000e+00]
 [2.013e+03 6.000e+00 3.570e+01 ... 0.000e+00 0.000e+00 0.000e+00]
 [2.013e+03 4.000e+00 3.700e+01 ... 0.000e+00 0.000e+00 0.000e+00]
 ...
 [2.015e+03 3.000e+00 2.100e+01 ... 0.000e+00 0.000e+00 0.000e+00]
 [2.015e+03 1.000e+00 3.180e+01 ... 0.000e+00 0.000e+00 1.000e+00]
 [2.019e+03 2.800e+01 5.600e+01 ... 0.000e+00 0.000e+00 0.000e+00]]


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.5804361  0.36458325 0.5005898  0.56453085 0.4090857  0.53698355
 0.5216763  0.51892114 0.51624924 0.54910743 0.4637513  0.540053
 0.48329538 0.40637717 0.41842368 0.49793607 0.46268165 0.4855113
 0.47952563 0.50860465 0.5278021  0.47708243 0.5283677  0.45664427
 0.3662992  0.4640445  0.5292875  0.49303505 0.4456845  0.48155963
 0.5617424  0.52067834 0.3760687  0.42993337 0.45064798 0.52033186
 0.50704384 0.4621208  0.5054912  0.54431456 0.45516363 0.4289306
 0.5706761  0.5585941  0.55604494 0.5415943  0.55249345 0.37524503
 0.53940195 0.46085212 0.5589418  0.4878664  0.5276209  0.50284505
 0.5512342  0.5349946  0.4657466  0.49300832 0.5120399  0.5029217
 0.5759807  0.3934291  0.48674312 0.5018239  0.5229831  0.5630135
 0.463028   0.47338292 0.54236007 0.531649   0.5279408  0.36029345
 0.50616556 0.5223327  0.5123998  0.5269077  0.49584347 0.46650746
 0.5183588  0.44963884 0.5538241  0.4947494  0.45724574 0.50156635
 0.5072025  0.57479453 0.44723415 0.54868627 0.40763664 0.5558562
 0

In [249]:
input = pd.DataFrame.from_dict(data = 
				{
        'year':[2019,2019,2019,2020],
        'da':[10,10,10,20],
        'temp':[35,25.5,10,1],
        'dewp' : [33.2,10.5,66.1,10.5],
        'slp' : [990.2,1022.4,1039.0,1022.4],
        'visib' : [8.7,5,9.5,10],
        'wdsp':[5,10,15,10],
         'mxpsd' : [8.2,20.5,45.2,10.5],
        'gust': [9,11,20,12],
        'max':[37,26.2,12,2],
        'min':[30,22,8.5,-1.2],
        'prcp':[0,0,10.5,1.2],
        'sndp':[0,0,0,1],
        'fog':[1,0,0,1],
        'rain_drizzle':[0,0,1,0],
        'snow_ice_pellets':[0,0,0,0],
        'hail':[0,0,0,0],
        'thunder':[0,0,0,0],
        'tornado_thunder_cloud':[0,0,0,0],
         'Apr' : [0,0,0,0],
         'Aug' : [1,1,1,1],
         'Dec' : [0,0,0,0],
         'Feb' : [0,0,0,0],
         'Jan' : [0,0,0,0],
         'Jul' : [0,0,0,0],
         'Jun' : [0,0,0,0],
         'Mar' : [0,0,0,0],
         'May' : [0,0,0,0],
         'Nov' : [0,0,0,0],
         'Oct' : [0,0,0,0],
         'Sep' : [0,0,0,0],
         'Fri' : [0,0,0,0],
         'Mon' : [1,1,1,1],
         'Sat' : [0,0,0,0],
         'Sun' : [0,0,0,0],
         'Thu' : [0,0,0,0],
         'Tue' : [0,0,0,0],
         'Wed' : [0,0,0,0]
      
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[17,11,5,3], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))
preds = estimator.predict(x=input.values*SCALE_COLLISIONS)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f217f762490>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model/model.ckpt-100

[289.78656 242.54999 330.0182  280.4751 ]


The low RMSE and the low percentage error rate suggest that using the whole cleaned dataset is an accurate predictor of the number of collisions.

## Location
As identified in assignment 1, as the location tends towards the centre of New York there is stronger linear relationships. This suggests there is a link between the number of collisions, location and the observed weather conditions. A DNN will be trained to attempt to predict the number of collisions given the location, day and weather conditions.

In [233]:
#Read the data and extract from the zip
#Reference - (geeksforgeeks.org 2021)
df_loc = pd.read_csv('https://raw.githubusercontent.com/matthew110395/12004210_DataAnalytics/main/locdnn.zip', index_col=0,compression='zip' )
print(df_loc[:6])

   year  da  NUM_COLLISIONS   latitude  longitude  temp  dewp     slp  visib  \
1  2018   2               1  40.681750 -73.967480  14.7   2.0  1024.9   10.0   
2  2018   2               1  40.645370 -73.945110  14.7   2.0  1024.9   10.0   
3  2018   2               1  40.614830 -73.998380  14.7   2.0  1024.9   10.0   
4  2018   2               1  40.592190 -74.087395  14.7   2.0  1024.9   10.0   
5  2018   2               1  40.769817 -73.782370  14.7   2.0  1024.9   10.0   
6  2018   2               1  40.660175 -73.928200  14.7   2.0  1024.9   10.0   

   wdsp  ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
1  12.9  ...    0    0    0    0    1    0    0    0    0    0  
2  12.9  ...    0    0    0    0    1    0    0    0    0    0  
3  12.9  ...    0    0    0    0    1    0    0    0    0    0  
4  12.9  ...    0    0    0    0    1    0    0    0    0    0  
5  12.9  ...    0    0    0    0    1    0    0    0    0    0  
6  12.9  ...    0    0    0    0    1    0    0  

In [234]:
#Remove unrequired cols
df_loc_dnn = df_loc.drop(columns=['thunder','tornado_funnel_cloud','fog','rain_drizzle','snow_ice_pellets','hail'])
#Clean data
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["year"] != 2012]
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["year"] < 2020]
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["dewp"] != 9999]
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["slp"] != 9999]
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["mxpsd"] != 999.9]
df_loc_dnn = df_loc_dnn.loc[df_loc_dnn["gust"] != 999.9]
#Move the target to the end
cols = df_loc_dnn['NUM_COLLISIONS']
df_loc_dnn = df_loc_dnn.drop(columns=['NUM_COLLISIONS'])
df_loc_dnn.insert(loc=34, column='NUM_COLLISIONS', value=cols)
print(df_loc_dnn[:6])
df_loc_dnn.describe()

   year  da   latitude  longitude  temp  dewp     slp  visib  wdsp  mxpsd  \
1  2018   2  40.681750 -73.967480  14.7   2.0  1024.9   10.0  12.9   20.0   
2  2018   2  40.645370 -73.945110  14.7   2.0  1024.9   10.0  12.9   20.0   
3  2018   2  40.614830 -73.998380  14.7   2.0  1024.9   10.0  12.9   20.0   
4  2018   2  40.592190 -74.087395  14.7   2.0  1024.9   10.0  12.9   20.0   
5  2018   2  40.769817 -73.782370  14.7   2.0  1024.9   10.0  12.9   20.0   
6  2018   2  40.660175 -73.928200  14.7   2.0  1024.9   10.0  12.9   20.0   

   ...  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  NUM_COLLISIONS  
1  ...    0    0    0    1    0    0    0    0    0               1  
2  ...    0    0    0    1    0    0    0    0    0               1  
3  ...    0    0    0    1    0    0    0    0    0               1  
4  ...    0    0    0    1    0    0    0    0    0               1  
5  ...    0    0    0    1    0    0    0    0    0               1  
6  ...    0    0    0    1    0    0    

Unnamed: 0,year,da,latitude,longitude,temp,dewp,slp,visib,wdsp,mxpsd,...,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,NUM_COLLISIONS
count,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,...,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0,830297.0
mean,2016.070154,15.638823,40.723907,-73.920916,48.419836,47.2749,1015.556656,8.199143,12.598551,20.021545,...,0.093866,0.074335,0.133032,0.146763,0.114698,0.142687,0.170187,0.141897,0.150735,1.02709
std,1.991298,8.613159,0.078454,0.086634,13.750834,261.636367,8.13658,2.230079,3.921832,5.219745,...,0.291643,0.262315,0.339609,0.35387,0.318657,0.349754,0.375797,0.348945,0.357791,0.180994
min,2013.0,1.0,40.498949,-74.253006,5.8,-6.7,989.5,0.6,4.5,8.9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,2014.0,8.0,40.66886,-73.976715,38.4,28.8,1010.7,6.9,10.0,15.9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,2016.0,16.0,40.72247,-73.92921,47.8,41.3,1015.3,9.3,12.0,19.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2018.0,23.0,40.768165,-73.86665,59.5,53.4,1021.0,10.0,14.4,22.9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,2019.0,31.0,40.912884,-73.66301,77.5,9999.9,1039.1,10.0,39.3,49.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0


In [235]:
#Set the scale to the maximum value
SCALE_LOC=11
# Shuffle data
shuffle = df_loc_dnn.iloc[np.random.permutation(len(df_loc_dnn))]

predictors = shuffle.iloc[:,0:-1]
print(predictors[:6])

        year  da   latitude  longitude  temp  dewp     slp  visib  wdsp  \
756931  2015  24  40.710662 -73.798702  56.5  48.2  1026.6   10.0  15.5   
103311  2013  25  40.608203 -73.920710  27.0   4.4  1025.3   10.0  15.4   
500172  2016  22  40.862580 -73.925385  61.3  58.9   993.1    3.5  12.2   
355778  2019  19  40.735360 -73.919310  35.2  22.6  1027.5   10.0   8.2   
449942  2018  18  40.801190 -73.930030  32.4  29.5  1013.5    8.4  10.2   
976418  2019   4  40.638268 -74.078800  37.9  34.0  1004.8    7.7  15.3   

        mxpsd  ...  Nov  Oct  Sep  Fri  Mon  Sat  Sun  Thu  Tue  Wed  
756931   22.0  ...    0    0    0    0    0    1    0    0    0    0  
103311   27.0  ...    1    0    0    0    0    0    1    0    0    0  
500172   27.0  ...    0    1    0    1    0    0    0    0    0    0  
355778   15.9  ...    0    0    0    0    1    0    0    0    0    0  
449942   15.0  ...    0    0    0    0    0    0    0    0    0    1  
976418   28.0  ...    0    0    0    0    0    0

In [236]:
# Select last col as target
targets = shuffle.iloc[:,-1]
print(targets[:6])

756931    1
103311    1
500172    1
355778    1
449942    2
976418    1
Name: NUM_COLLISIONS, dtype: int64


In [237]:
# Split into test and train data
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

noutputs = 1
# Calculate the number of predictors
nppredictors = len(shuffle.columns) - noutputs
print(nppredictors)

34


In [238]:
# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_regression_trained_model_loc', ignore_errors=True)

#Setup the model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_loc', hidden_units=[19,15,11,7], optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

print("starting to train");

# Train the model.
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_LOC, steps=10000)

preds = estimator.predict(x=predictors[trainsize:].values)
predslistscale = preds['scores']*SCALE_LOC
#Test Model
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f21802a8c90>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_loc', '_session_creation_timeout_secs': 7200}


starting to train


INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/DNN_regression_trained_model_loc/model.ckpt.
INFO:tensorflow:loss = 96392.33, step = 1
INFO:tensorflow:global_step/sec: 295.648
INFO:tensorflow:loss = 0.02186725, step = 101 (0.341 sec)
INFO:tensorflow:global_step/sec: 419.821
INFO:tensorflow:loss = 0.0002902802, step = 201 (0.241 sec)
INFO:tensorflow:global_step/sec: 404.391
INFO:tensorflow:loss = 0.00025037606, step = 301 (0.245 sec)
INFO:tensorflow:global_step/sec: 390.266
INFO:tensorflow:loss = 0.00044077414, step = 401 (0.257 sec)
INFO:tensorflow:global_step/sec: 418.134
INFO:tensorflow:loss = 0.00018970028, step = 501 (0.241 sec)
INFO:tensorflow:global_step/sec: 395.707
INFO:tensorflow:loss = 0.00031834198, step = 601 (0.250 sec)
INFO:tensorflow:global_step/sec: 397.002
INFO:tensorflow:loss = 6.653251e-05, step = 701 (

DNNRegression has RMSE of 0.18208458200764135
Just using average = 1.027133387631222 has RMSE of 0.17962055916059586


In [239]:
print(predictors[trainsize:].values)
#Ensure the number of hidden units match the trained model
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model_loc', hidden_units=[19,15,11,7], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors[trainsize:].values)))
preds = estimator.predict(x=predictors[trainsize:].values)

print(preds['scores'])
print(targets[trainsize:].values/SCALE_COLLISIONS)

testdf = pd.DataFrame.from_dict(data={
    'pred': preds['scores'],
    'actual': targets[trainsize:].values/SCALE_LOC
})

testdf['diff'] = testdf['actual'] - testdf['pred']

error=testdf['diff'].mean()*SCALE_COLLISIONS
avgcol=targets[trainsize:].mean()
print('The trained model has an aproximate error rate of {0} which equates to {1}%'.format(error, round((error/avgcol)*100),1));


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f217aa7c9d0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model_loc', '_session_creation_timeout_secs': 7200}


[[2.01700000e+03 2.00000000e+01 4.07041700e+01 ... 1.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [2.01800000e+03 4.00000000e+00 4.06948740e+01 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [2.01400000e+03 2.20000000e+01 4.07432113e+01 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 ...
 [2.01500000e+03 3.00000000e+00 4.08240317e+01 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [2.01900000e+03 2.00000000e+01 4.05924100e+01 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]
 [2.01900000e+03 2.10000000e+01 4.06609340e+01 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]]


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model_loc/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[0.09064212 0.09064212 0.09064212 ... 0.09064212 0.09064212 0.09064212]
[0.00086133 0.00086133 0.00086133 ... 0.00086133 0.00086133 0.00086133]
The trained model has an aproximate error rate of 3.1510186881339606 which equates to 307%


Despite extensive training, the model is not accurately able to predict the number of collisions. This is inferred by the RMSE which is close to that of the mean RMSE. 

#Conclusion
As shown above the given weather conditions for a particular day within New York, the number of collisions can be accurately predicted.

Arguments can be made for both models; the linear regression models appear to be less efficient due to the higher RMSE values in comparison to the DNN models. In contrast the error rate is lower for the linear regression models. This suggests that although the DNN models produce more errors, but the margin of error is lower, due to the RMSE placing a larger weighting on larger errors.

When testing the models using made data, the outputs followed the expected results. With the DNN models it is not as clear as the relationship between the variables is not as clear.

In hindsight the way location has been encoded does not allow for accurate predictions to be made as the number of collisions always tends towards 1 for a given location.

It is clear that the models produced above can accurately predict the number of collisions as set out in the specification of the assignment.


#References
geeksforgeeks.org (2021) Read a zipped file as a Pandas DataFrame [online]. Available from <<https://www.geeksforgeeks.org/read-a-zipped-file-as-a-pandas-dataframe/>? [12 November 2022] 

IBM (n.d.) What is linear regression? [online]. Available from <<https://www.ibm.com/uk-en/topics/linear-regression>> [17 November 2022] 

Karhunen, J., Raiko, T. and Cho, K. (2015) 'Chapter 7 - Unsupervised deep learning: A short review.' In Advances in Independent Component Analysis and Learning Machines. Academic Press. Ch. 7. 135-142.

Zhang, Z. (2019) Understand Data Normalization in Machine Learning [online]. Available from <<https://towardsdatascience.com/understand-data-normalization-in-machine-learning-8ff3062101f0>> [17 November 2022] 

Zulkifli, H. (2018) Understanding Learning Rates and How It Improves Performance in Deep Learning [online]. Available from <<https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10>> [17 November 2022] 

