In [None]:
Author = "Dennis C. Norton"
Collaborators = ["Bruno de Almeida",
                 "Anna Harris",
                 "Maggie Lau",
                 "Fan Ye",
                 "Echo Zhang"
                ]

# National Vehicle Collision Data Review

---

**Data Source**: Government of Canada National Collision Database

**Data File**: https://open.canada.ca/data/en/dataset/1eb9eba7-71d1-4b30-9fb1-30cbdab7e63a

**Data Dictionary**: https://open.canada.ca/data/en/dataset/1eb9eba7-71d1-4b30-9fb1-30cbdab7e63a/resource/09b74afc-2745-4382-8a02-3e256c4b28fd 

**Data Licence**: https://open.canada.ca/en/open-government-licence-canada

---

Contains information licensed under the Open Government Licence – Canada.  See below for details.

---

The data captures information about the nature of collisions, the vehicles involved, and the people involved.  It will be compared with Toronto collision data.

This module reviews the characteristics of the data.

In [None]:
# Initialize the environment

import math
import pandas as pd
import numpy as np

file_path = './Data Files/'
file_in = 'NCDB_1999_to_2017.csv'

The source data is mainly numeric, but to review which values appear in the file, every column is read as a string.

In [None]:
# Read all information into a dataframe

file_content = pd.read_csv(file_path + file_in,
                           nrows = None, dtype=str)
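
As a small illustration of why reading as string helps here, the sketch below parses a made-up three-row sample twice, once with `dtype=str` and once with pandas' default type inference. The sample rows and the names `sample_csv`, `as_str`, and `as_inferred` are assumptions for illustration only and are not taken from the NCDB file.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the NCDB layout (not taken from the real file)
sample_csv = "V_TYPE,P_AGE\n01,05\n06,UU\n01,41\n"

as_str = pd.read_csv(io.StringIO(sample_csv), dtype=str)
as_inferred = pd.read_csv(io.StringIO(sample_csv))

# With dtype=str every code is kept exactly as written in the file
print(as_str['V_TYPE'].tolist())
# With type inference an all-digit column becomes integers and loses its leading zeros
print(as_inferred['V_TYPE'].tolist())
```

Keeping the codes as text means a value such as `01` can be matched directly against the data dictionary.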

In [None]:
file_content.head(10)

Unnamed: 0,C_YEAR,C_MNTH,C_WDAY,C_HOUR,C_SEV,C_VEHS,C_CONF,C_RCFG,C_WTHR,C_RSUR,...,V_TYPE,V_YEAR,P_ID,P_SEX,P_AGE,P_PSN,P_ISEV,P_SAFE,P_USER,C_CASE
0,1999,1,1,20,2,2,34,UU,1,5,...,06,1990,1,M,41,11,1,UU,1,752
1,1999,1,1,20,2,2,34,UU,1,5,...,01,1987,1,M,19,11,1,UU,1,752
2,1999,1,1,20,2,2,34,UU,1,5,...,01,1987,2,F,20,13,2,02,2,752
3,1999,1,1,8,2,1,01,UU,5,3,...,01,1986,1,M,46,11,1,UU,1,753
4,1999,1,1,8,2,1,01,UU,5,3,...,NN,NNNN,1,M,05,99,2,UU,3,753
5,1999,1,1,17,2,3,QQ,QQ,1,2,...,01,1984,1,M,28,11,1,UU,1,820
6,1999,1,1,17,2,3,QQ,QQ,1,2,...,01,1991,1,M,21,11,1,UU,1,820
7,1999,1,1,17,2,3,QQ,QQ,1,2,...,01,1991,2,F,UU,13,2,UU,2,820
8,1999,1,1,17,2,3,QQ,QQ,1,2,...,01,1992,1,M,UU,11,2,UU,1,820
9,1999,1,1,15,2,1,04,UU,1,5,...,01,1997,1,M,61,11,1,UU,1,932


Save the row and column counts in variables for use later.

In [None]:
row_count, column_count = file_content.shape
print('Number of rows:', row_count, '\nNumber of columns:', column_count)

Number of rows: 6772563 
Number of columns: 23


In [None]:
file_content.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6772563 entries, 0 to 6772562
Data columns (total 23 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   C_YEAR  object
 1   C_MNTH  object
 2   C_WDAY  object
 3   C_HOUR  object
 4   C_SEV   object
 5   C_VEHS  object
 6   C_CONF  object
 7   C_RCFG  object
 8   C_WTHR  object
 9   C_RSUR  object
 10  C_RALN  object
 11  C_TRAF  object
 12  V_ID    object
 13  V_TYPE  object
 14  V_YEAR  object
 15  P_ID    object
 16  P_SEX   object
 17  P_AGE   object
 18  P_PSN   object
 19  P_ISEV  object
 20  P_SAFE  object
 21  P_USER  object
 22  C_CASE  object
dtypes: object(23)
memory usage: 1.2+ GB


In [None]:
file_content.isnull().sum()

C_YEAR    0
C_MNTH    0
C_WDAY    0
C_HOUR    0
C_SEV     0
C_VEHS    0
C_CONF    0
C_RCFG    0
C_WTHR    0
C_RSUR    0
C_RALN    0
C_TRAF    0
V_ID      0
V_TYPE    0
V_YEAR    0
P_ID      0
P_SEX     0
P_AGE     0
P_PSN     0
P_ISEV    0
P_SAFE    0
P_USER    0
C_CASE    0
dtype: int64

There are no null values in the data, but we know from the data dictionary that there are values which indicate that the data is not available for various reasons.  To simplify the analysis, these values are converted to a single value of "Unknown".

In [None]:
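# Placeholder codes that mark unavailable data (see the data dictionary)
unknowns = ['N', 'NN', 'NNNN', 'Q', 'U', 'X', 
            'QQ', 'UU', 'XX', 'UUUU', 'XXXX']

# Collapse every placeholder code into the single label 'Unknown'
file_content = file_content.applymap(lambda x: 'Unknown' if x in unknowns else x)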
file_content.head(10)

Unnamed: 0,C_YEAR,C_MNTH,C_WDAY,C_HOUR,C_SEV,C_VEHS,C_CONF,C_RCFG,C_WTHR,C_RSUR,...,V_TYPE,V_YEAR,P_ID,P_SEX,P_AGE,P_PSN,P_ISEV,P_SAFE,P_USER,C_CASE
0,1999,1,1,20,2,2,34,Unknown,1,5,...,06,1990,1,M,41,11,1,Unknown,1,752
1,1999,1,1,20,2,2,34,Unknown,1,5,...,01,1987,1,M,19,11,1,Unknown,1,752
2,1999,1,1,20,2,2,34,Unknown,1,5,...,01,1987,2,F,20,13,2,02,2,752
3,1999,1,1,8,2,1,01,Unknown,5,3,...,01,1986,1,M,46,11,1,Unknown,1,753
4,1999,1,1,8,2,1,01,Unknown,5,3,...,Unknown,Unknown,1,M,05,99,2,Unknown,3,753
5,1999,1,1,17,2,3,Unknown,Unknown,1,2,...,01,1984,1,M,28,11,1,Unknown,1,820
6,1999,1,1,17,2,3,Unknown,Unknown,1,2,...,01,1991,1,M,21,11,1,Unknown,1,820
7,1999,1,1,17,2,3,Unknown,Unknown,1,2,...,01,1991,2,F,Unknown,13,2,Unknown,2,820
8,1999,1,1,17,2,3,Unknown,Unknown,1,2,...,01,1992,1,M,Unknown,11,2,Unknown,1,820
9,1999,1,1,15,2,1,04,Unknown,1,5,...,01,1997,1,M,61,11,1,Unknown,1,932
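
The conversion above calls a Python function once per cell, and `applymap` emits a deprecation warning from pandas 2.1 onward. A vectorized alternative, sketched here on a made-up two-row frame, might use `isin` together with `mask`. The frame `sample` and the name `cleaned` are illustrative assumptions; this is not the approach used in the notebook.

```python
import pandas as pd

unknowns = ['N', 'NN', 'NNNN', 'Q', 'U', 'X',
            'QQ', 'UU', 'XX', 'UUUU', 'XXXX']

# Hypothetical rows using codes of the kind seen in the output above
sample = pd.DataFrame({'C_CONF': ['34', 'QQ'],
                       'V_YEAR': ['1990', 'NNNN'],
                       'P_SEX': ['M', 'F']})

# isin() flags every placeholder cell; mask() overwrites the flagged cells
cleaned = sample.mask(sample.isin(unknowns), 'Unknown')
print(cleaned)
```

On a table with several million rows this form would likely run faster, although that has not been measured here.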


To see how often each value occurs, we count the values in every column and, to make the counts more meaningful, convert them to percentages of the total number of rows.

In [None]:
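# Count how often each value occurs in each column (NaN where a value never occurs)
file_summary = file_content.apply(pd.value_counts)
# Convert counts to percentages of all rows, treating missing counts as 0.0
file_summary = file_summary.applymap(lambda x: 0.0 if math.isnan(x) else (x / row_count) * 100)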
file_summary.tail(5)

Unnamed: 0,C_YEAR,C_MNTH,C_WDAY,C_HOUR,C_SEV,C_VEHS,C_CONF,C_RCFG,C_WTHR,C_RSUR,...,V_TYPE,V_YEAR,P_ID,P_SEX,P_AGE,P_PSN,P_ISEV,P_SAFE,P_USER,C_CASE
98,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.003839,0.10891,0.0,0.0,0.0,0.0
99,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3e-05,0.0,0.005448,3.609077,0.0,0.0,0.0,0.0
F,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,41.767098,0.0,0.0,0.0,0.0,0.0,0.0
M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,53.740895,0.0,0.0,0.0,0.0,0.0,0.0
Unknown,0.0,0.006275,0.02014,0.979496,0.0,0.008372,8.005655,10.663408,1.708925,4.11063,...,4.903092,9.899945,0.262353,4.492007,6.762344,2.000173,6.414484,21.193838,3.262,0.0
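
For a single column, the same percentages can be obtained directly with `value_counts(normalize=True)`. The sketch below uses a made-up five-row column; the series `p_sex` and the name `p_sex_pct` are assumptions for illustration. Because the data contains no nulls, dividing by the non-null count is equivalent to dividing by `row_count`.

```python
import pandas as pd

# Hypothetical P_SEX column after the Unknown conversion
p_sex = pd.Series(['M', 'F', 'M', 'Unknown', 'M'], name='P_SEX')

# normalize=True returns fractions of the row count; multiply by 100 for percentages
p_sex_pct = p_sex.value_counts(normalize=True) * 100
print(p_sex_pct)
```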


We are particularly interested in the percentage of Unknown values in each column.

In [None]:
unknown_values = file_summary.loc['Unknown']
unknown_values

C_YEAR     0.000000
C_MNTH     0.006275
C_WDAY     0.020140
C_HOUR     0.979496
C_SEV      0.000000
C_VEHS     0.008372
C_CONF     8.005655
C_RCFG    10.663408
C_WTHR     1.708925
C_RSUR     4.110630
C_RALN     7.525290
C_TRAF     5.314738
V_ID       0.006940
V_TYPE     4.903092
V_YEAR     9.899945
P_ID       0.262353
P_SEX      4.492007
P_AGE      6.762344
P_PSN      2.000173
P_ISEV     6.414484
P_SAFE    21.193838
P_USER     3.262000
C_CASE     0.000000
Name: Unknown, dtype: float64

Note: Not all unknowns are created equal.  The vehicle type (V_TYPE) is set to Unknown when a pedestrian is involved in the incident, so the 4.9% of Unknown vehicle types is reasonable.
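
One way to check the link between Unknown vehicle types and pedestrians is a cross-tabulation of V_TYPE against P_USER. The sketch below uses made-up rows and assumes that P_USER code `3` denotes a pedestrian, which should be confirmed against the data dictionary. The names `sample` and `table` are illustrative.

```python
import pandas as pd

# Hypothetical person-level rows; P_USER '3' is assumed to mean pedestrian
sample = pd.DataFrame({
    'V_TYPE': ['01', 'Unknown', '06', 'Unknown', '01'],
    'P_USER': ['1', '3', '1', '3', '2'],
})

# Rows: vehicle type, columns: road-user class, cells: number of records
table = pd.crosstab(sample['V_TYPE'], sample['P_USER'])
print(table)
```

If the explanation holds for the real data, most of the count in the `Unknown` row would be expected to fall in the pedestrian column.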

Other Unknown values are harder to explain.  For example, the month, day, and hour that an incident occurred should be known, yet C_MNTH, C_WDAY, and C_HOUR contain Unknown values.  These account for 0.006275%, 0.020140%, and 0.979496% of rows respectively, small enough that dropping them will not have a significant impact on the analysis.
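
Dropping the rows with an Unknown month, day, or hour could look like the following sketch. It runs on a made-up four-row frame; `sample`, `time_columns`, `has_unknown_time`, and `with_known_time` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows after the Unknown conversion
sample = pd.DataFrame({
    'C_MNTH': ['1', 'Unknown', '3', '4'],
    'C_WDAY': ['1', '2', 'Unknown', '4'],
    'C_HOUR': ['20', '8', '17', '15'],
})

time_columns = ['C_MNTH', 'C_WDAY', 'C_HOUR']

# Flag rows where any of the time columns is 'Unknown', then keep the rest
has_unknown_time = sample[time_columns].eq('Unknown').any(axis=1)
with_known_time = sample[~has_unknown_time]
print(len(sample) - len(with_known_time), 'rows dropped')
```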

The highest percentage of Unknown values is in P_SAFE, which represents the safety device used by the person involved in the incident.  This is not part of the comparison to the Toronto data and is therefore not a concern.  Similarly, the road configuration (C_RCFG) and vehicle year (V_YEAR) have a higher percentage of Unknown values but are not used in the comparison to the Toronto data.
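
The ranking described above can be reproduced by sorting the Unknown percentages in descending order. The sketch below copies five of the 23 values from the earlier output into a new series; the name `unknown_subset` is an illustrative assumption, and in the notebook the same sort could be applied to `unknown_values` directly.

```python
import pandas as pd

# Unknown percentages copied from the output above (five of the 23 columns)
unknown_subset = pd.Series({'C_CONF': 8.005655, 'C_RCFG': 10.663408,
                            'C_RALN': 7.525290, 'V_YEAR': 9.899945,
                            'P_SAFE': 21.193838})

# Largest share of Unknown values first
ranked = unknown_subset.sort_values(ascending=False)
print(ranked.head(3))
```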

Overall, the data is well suited to analysis, although information such as date and location is not included and would have increased the analysis opportunities.

# Open Government Licence - Canada

You are encouraged to use the Information that is available under this licence with only a few conditions.

**Using Information under this licence**

Use of any Information indicates your acceptance of the terms below.
The Information Provider grants you a worldwide, royalty-free, perpetual, non-exclusive licence to use the Information, including for commercial purposes, subject to the terms below.
You are free to:
Copy, modify, publish, translate, adapt, distribute or otherwise use the Information in any medium, mode or format for any lawful purpose.
You must, where you do any of the above:
Acknowledge the source of the Information by including any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence.
If the Information Provider does not provide a specific attribution statement, or if you are using Information from several information providers and multiple attributions are not practical for your product or application, you must use the following attribution statement:
Contains information licensed under the Open Government Licence – Canada.

The terms of this licence are important, and if you fail to comply with any of them, the rights granted to you under this licence, or any similar licence granted by the Information Provider, will end automatically.

**Exemptions**

This licence does not grant you any right to use:

Personal Information;
third party rights the Information Provider is not authorized to license;
the names, crests, logos, or other official symbols of the Information Provider; and
Information subject to other intellectual property rights, including patents, trade-marks and official marks.

**Non-endorsement**

This licence does not grant you any right to use the Information in a way that suggests any official status or that the Information Provider endorses you or your use of the Information.

**No Warranty**

The Information is licensed “as is”, and the Information Provider excludes all representations, warranties, obligations, and liabilities, whether express or implied, to the maximum extent permitted by law.

The Information Provider is not liable for any errors or omissions in the Information, and will not under any circumstances be liable for any direct, indirect, special, incidental, consequential, or other loss, injury or damage caused by its use or otherwise arising in connection with this licence or the Information, even if specifically advised of the possibility of such loss, injury or damage.

**Governing Law**

This licence is governed by the laws of the province of Ontario and the applicable laws of Canada.

Legal proceedings related to this licence may only be brought in the courts of Ontario or the Federal Court of Canada.

**Definitions**

In this licence, the terms below have the following meanings:

"Information"
means information resources protected by copyright or other information that is offered for use under the terms of this licence.
"Information Provider"
means Her Majesty the Queen in right of Canada.
“Personal Information”
means “personal information” as defined in section 3 of the Privacy Act, R.S.C. 1985, c. P-21.
"You"
means the natural or legal person, or body of persons corporate or incorporate, acquiring rights under this licence.

**Versioning**

This is version 2.0 of the Open Government Licence – Canada. The Information Provider may make changes to the terms of this licence from time to time and issue a new version of the licence. Your use of the Information will be governed by the terms of the licence in force as of the date you accessed the information.