# Data Processing
---
### First inspection of dataset, data filtering and cleaning

Creation: 05.02.2021

## Required Libraries
---

In [106]:
import pandas as pd # used to store the datasets
import numpy as np  # used for numerical calculations on individual columns
from collections import Counter #used for the number of vehicles

## Constants

In [107]:
# path lookup dictionary to store the relative paths from the directory containing the jupyter notebooks to important directories in the project
PATH = {}
PATH["data_raw"] = "../data/raw/"
PATH["data_interim"] = "../data/interim/"
PATH["data_processed"] = "../data/processed/"
PATH["data_external"] = "../data/external/"
PATH["references"] = "../data/references"

# filename lookup dictionary storing the most relevant filenames
FILENAME = {}
FILENAME["accidents"] = "Road Safety Data - Accidents 2019.csv"
FILENAME["casualties"] = "Road Safety Data - Casualties 2019.csv"
FILENAME["vehicles"] = "Road Safety Data- Vehicles 2019.csv" # the original dataset has a small typing mistake
FILENAME["variable_lookup"] = "variable lookup.xls"

DATA_RAW = {}

DATA_LEEDS = {}

# list of internal names for datasets
TABLENAMES = ["accidents", "casualties", "vehicles"]

## Loading in the Dataset
---
To start the data analysis, we load the three given datasets into 'pandas' dataframes, one per dataset.

In [108]:
# load all three datasets using pandas into the predefined dictionary 'DATA_RAW' where the key corresponds to the internal dataset name
for name in TABLENAMES:
    DATA_RAW[name] = pd.read_csv(PATH['data_raw'] + FILENAME[name])



## Initial Sanity Check
---
Before continuing with the data analysis, we want to make sure that the dataset is clean. We do this in several ways:
- (a) We check whether there are duplicate indexes in 'accidents.csv'
- (b) We check whether there are indexes in the sub datasets 'raw_data_vehicles' and 'raw_data_casualties' that are not in the central dataset 'raw_accidents.csv'
- (c) We check the missing values in each column

### Multiple Indexes
We first evaluate whether any accident index occurs more than once in the main dataset 'accidents.csv'. This is a first, very basic check of whether the data has been entered correctly. In this case it is even more important, since the accident indexes link the main dataset 'accidents.csv' to the two sub datasets.

In [109]:
# here we check if any accident index occurs more than once in the accidents dataset
DATA_RAW['accidents'].shape[0] == len(set(DATA_RAW['accidents']['Accident_Index']))

True

Perfect. There do not seem to be any duplicate indexes. Each row in the dataset 'accidents.csv' seems to refer to a unique accident that we can reference in the two sub datasets.
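
The same uniqueness check can also be written with pandas' built-in helpers. The following is a minimal sketch on a made-up toy table (the index values are hypothetical, not taken from the dataset); it additionally shows how the offending rows could be listed if the check failed.

```python
import pandas as pd

# Toy stand-in for the accidents table; the index values are made up.
toy_accidents = pd.DataFrame({"Accident_Index": ["A1", "A2", "A2", "A3"]})

index_column = toy_accidents["Accident_Index"]

# True only if no value occurs more than once.
all_unique = index_column.is_unique

# All rows whose index value occurs more than once (keep=False marks every copy).
duplicated_rows = toy_accidents[index_column.duplicated(keep=False)]

print(all_unique)
print(duplicated_rows)
```

Listing the duplicated rows, rather than only returning a boolean, makes it easier to decide how to handle them.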

### Wrong Indexes in Sub-Datasets
It would be problematic if the two sub datasets contained information on vehicles or casualties referenced by an accident index that is not in the main dataset 'accidents.csv'. We check for such indexes using the code below.

In [110]:
def check_indexes_in_subset(sub_dataset, _in=DATA_RAW['accidents']):
    """ 
    Helper-Function to evaluate whether there are indexes in the two linked sub datasets that do not appear in the main dataset.

    Parameters:
        sub_dataset         : pd.DataFrame
        _in                 : pd.DataFrame
    Return:
        Number of wrong indexes : int (None if there are none)
    """
    
    accidents_indexes = set(_in['Accident_Index'])
    wrong_indexes = [i for i in set(sub_dataset['Accident_Index']) if i not in accidents_indexes]

    if len(wrong_indexes) == 0:
        return None
    else:
        return len(wrong_indexes)

In [111]:
# computing the wrong indexes for each sub dataset
for dataset in TABLENAMES:
    if dataset != 'accidents':
        print(f"#Missing indexes in {dataset.capitalize()}: {check_indexes_in_subset(DATA_RAW[dataset], DATA_RAW['accidents'])}")

#Missing indexes in Casualties: 19325
#Missing indexes in Vehicles: 21498


This is rather bad. Roughly 21,500 indexes in the raw 'vehicles.csv' dataset link to an accident that is not registered in 'accidents.csv', and roughly 19,300 indexes in 'casualties.csv' link casualties to accidents that are not registered in 'accidents.csv'. If similar behavior can be observed in the dataset that is filtered for Leeds, we have to think about how to deal with this apparently incorrect data entry.

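The check for wrong indexes can be sketched with `isin`, as below. The sketch also illustrates one possible explanation for a large number of wrong indexes, which this notebook does not confirm: if the same index is stored as a number in one table and as text in the other, the two values never match. The function name `count_orphan_indexes`, its `as_text` parameter and the toy data are hypothetical.

```python
import pandas as pd

def count_orphan_indexes(sub, main, key="Accident_Index", as_text=False):
    """Count distinct key values in `sub` that never occur in `main`."""
    sub_keys, main_keys = sub[key], main[key]
    if as_text:
        # Compare the keys as strings so that 123 and "123" count as the same key.
        sub_keys, main_keys = sub_keys.astype(str), main_keys.astype(str)
    return sub_keys[~sub_keys.isin(main_keys)].nunique()

# Toy data (made up): one accident is stored as text in the main table
# and as an integer in the sub table.
toy_main = pd.DataFrame({"Accident_Index": ["2019010128300", "20191358F1730"]})
toy_sub = pd.DataFrame(
    {"Accident_Index": [2019010128300, "20191358F1730", "2019999999999"]}
)

orphans_raw = count_orphan_indexes(toy_sub, toy_main)
orphans_text = count_orphan_indexes(toy_sub, toy_main, as_text=True)
print(orphans_raw, orphans_text)
```

If the count drops noticeably once the keys are compared as text, the mismatch is at least partly a type problem rather than a data entry problem.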
### Check for missing values
Here we check all columns of all datasets for missing values (empty strings) to get a feeling for which columns need further processing.

In [112]:
def check_column_for_missing_value(column):
    """ 
    For a column of a pd.DataFrame (one-dimensional array), the function returns the number of null values, which corresponds to the number of missing values in the column.

    Parameters:
        column              : pd.Series
    Return:
        #Null Values        : int
    """

    return sum(column.isnull())

In [113]:
def check_all_columns(data):
    """
    For a dataset provided as a pd.DataFrame, the function prints an informative string for each column containing null values, namely the number of missing values, the variable name of the column and the column index.

    Parameters:
        data                : pd.DataFrame
    Return:
        None (one line is printed for each column containing null values)
    """

    for column in range(data.shape[1]):
        if check_column_for_missing_value(data.iloc[:,column]) != 0:
            print(f'{check_column_for_missing_value(data.iloc[:,column])} ({data.columns[column]}({column}))')

In [114]:
for dataset in TABLENAMES:
    print(dataset.capitalize())
    check_all_columns(DATA_RAW[dataset])
    print('\n')

Accidents
28 (Location_Easting_OSGR(1))
28 (Location_Northing_OSGR(2))
28 (Longitude(3))
28 (Latitude(4))
63 (Time(11))
5714 (LSOA_of_Accident_Location(31))


Casualties


Vehicles




That's not bad! Neither of the two sub datasets has any missing values. 
In the accidents dataset, 28 accidents have no information on their location. This only matters for our analysis if one of those accidents is located in Leeds, in which case we need to deal with it later. The LSOA metric, another measure of the accident location, has not been recorded for 5714 accidents. This is not important for our analysis, since we will use longitude and latitude to plot the accidents' locations. 
There are, however, 63 accidents for which the time of the accident is not recorded. If any of those accidents are located in Leeds, we have to deal with them later.
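
The per-column count of missing values can also be obtained in one vectorized step. This is a minimal sketch on made-up toy data; the helper name `missing_per_column` is hypothetical.

```python
import numpy as np
import pandas as pd

def missing_per_column(data):
    """Return the number of null values for every column that has at least one."""
    counts = data.isnull().sum()
    return counts[counts > 0]

# Toy data (made up) with one missing time and one missing longitude.
toy = pd.DataFrame(
    {
        "Time": ["17:30", np.nan, "08:05"],
        "Longitude": [-1.3, -1.4, np.nan],
        "Police_Force": [13, 13, 13],
    }
)

missing = missing_per_column(toy)
print(missing)
```

Note that `isnull()` only counts empty cells. Coded placeholders such as the -1 values visible in the vehicles table are ordinary numbers and are not counted by this check.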

## Filtering 
---
Next, we filter the main dataset 'accidents.csv' for the city of interest, 'Leeds', which can be identified by several variables in the dataset. Here we choose the column 'Local Authority (District)', in which 'Leeds' is identified as 204. The resulting filtered dataframe is saved into a new variable.

In [115]:
DATA_LEEDS["accidents"] = DATA_RAW['accidents'][DATA_RAW['accidents']['Local_Authority_(District)'] == 204]

The other two datasets, however, have no attribute that identifies the city. They need to be filtered through the unique accident indexes that we can obtain from our filtered dataframe of accidents in 'Leeds'. We obtain a list of the indexes of all accidents that occurred in Leeds and use this list to filter both the 'vehicles.csv' and 'casualties.csv' datasets.

In [116]:
leeds_indexes = list(DATA_LEEDS['accidents']['Accident_Index']) # we can do it with list because we know that all indexes are unique

In [117]:
DATA_LEEDS["casualties"] = DATA_RAW['casualties'][DATA_RAW['casualties']['Accident_Index'].isin(leeds_indexes)]
DATA_LEEDS["vehicles"] = DATA_RAW['vehicles'][DATA_RAW['vehicles']['Accident_Index'].isin(leeds_indexes)]

## Saving Filtered Data
---

In [118]:
for dataset in TABLENAMES:
    DATA_LEEDS[dataset].to_csv(PATH['data_interim'] + FILENAME[dataset], index=False)

## Sanity Check for Leeds
--- 
Now that we have filtered the dataset, we perform the exact same sanity checks that we performed on the raw datasets.

### Multiple Indexes

In [119]:
DATA_LEEDS['accidents'].shape[0] == len(set(DATA_LEEDS['accidents']['Accident_Index']))

True

### Wrong Indexes in Sub-Datasets

In [120]:
for dataset in TABLENAMES:
    if dataset != 'accidents':
        print(f"#Missing indexes in {dataset.capitalize()}: {check_indexes_in_subset(DATA_LEEDS[dataset], DATA_LEEDS['accidents'])}")

#Missing indexes in Casualties: None
#Missing indexes in Vehicles: None


### Missing Values

In [121]:
for dataset in TABLENAMES:
    print(dataset.capitalize())
    check_all_columns(DATA_LEEDS[dataset])
    print('\n')

Accidents
11 (Time(11))


Casualties


Vehicles




Perfect. Apart from 11 accidents without a recorded time, which we handle in the section below, none of the sanity checks reports any problems in our dataset. At this point we could export the filtered datasets and work on them. However, we make some adjustments in the section below to make our analysis easier.

## Process Data
---
In this section, the 'Date' and 'Time' attributes of the 'accidents.csv' dataset are cleaned for easy use in the single variable analysis: 'Time' is reduced to the hour and 'Date' to the month.

### Time

In [122]:
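# reduce each time of the form 'HH:MM' to its hour 'HH'; missing times (NaN) are encoded as '-1'
time = np.array(DATA_LEEDS['accidents']['Time'])
for i in range(len(time)):
    try: 
        time[i] = time[i][:2]
    except TypeError:  # NaN is a float and cannot be sliced
        time[i] = '-1'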

DATA_LEEDS['accidents']['Time'] = time

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DATA_LEEDS['accidents']['Time'] = time


### Date

In [123]:
# reduce each date to its month as an integer (characters 3 and 4 of the date string)
date = np.array(DATA_LEEDS['accidents']['Date'])

for i in range(len(date)):
    date[i] = int(date[i][3:5])

DATA_LEEDS['accidents']['Date'] = date

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DATA_LEEDS['accidents']['Date'] = date
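
The 'Time' and 'Date' cleaning above can also be sketched in vectorized form. The sketch works on an explicit copy, which is one common way to avoid the pandas warning about setting values on a copy of a slice. It uses made-up toy rows and assumes that 'Time' looks like 'HH:MM' and 'Date' like 'DD/MM/YYYY'.

```python
import numpy as np
import pandas as pd

# Toy rows (made up), assuming 'Time' is 'HH:MM' and 'Date' is 'DD/MM/YYYY'.
toy = pd.DataFrame(
    {
        "Time": ["17:30", np.nan, "08:05"],
        "Date": ["09/01/2019", "15/08/2019", "31/12/2019"],
    }
)

# Work on an explicit copy so that assignments do not target a slice of another frame.
cleaned = toy.copy()

# Hour as a two-character string; missing times become the placeholder '-1'.
cleaned["Time"] = cleaned["Time"].str[:2].fillna("-1")

# Month as an integer, taken from the characters at positions 3 and 4 (zero-based).
cleaned["Date"] = cleaned["Date"].str[3:5].astype(int)

print(cleaned)
```

The result has the same encoding as the loops above: hours as strings, '-1' for missing times, and months as integers.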


### 2nd Road Class
---

In [124]:
# replace the code -1 in '2nd_Road_Class' with 0
second_road_class = np.array(DATA_LEEDS['accidents']['2nd_Road_Class'])

for i in range(len(second_road_class)):
    if second_road_class[i] == -1:
        second_road_class[i] = 0

DATA_LEEDS['accidents']['2nd_Road_Class'] = second_road_class

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DATA_LEEDS['accidents']['2nd_Road_Class'] = second_road_class


## Overview 
---
For each of the datasets, we want to get a good first impression of its size and the information it stores. To this end, we display each dataset and summarize the number of unique values in each of its columns.

### Accidents
---

In [125]:
DATA_LEEDS['accidents'].shape # returns the number of rows and columns

(1451, 32)

In [126]:
DATA_LEEDS['accidents'] # prints out an overview of the dataframe (and the number of rows and columns)

Unnamed: 0,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,...,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location
41052,2019131901350,443576.0,438198.0,-1.339288,53.838231,13,2,4,1,9,...,0,0,1,2,2,0,0,2,1,E01011309
41053,20191358F1730,436147.0,434957.0,-1.452556,53.809670,13,3,2,7,8,...,0,0,1,1,1,0,0,1,2,E01011666
41055,2019136111190,435904.0,425850.0,-1.457300,53.727837,13,3,2,1,1,...,0,0,1,1,1,0,0,2,1,E01011636
41057,2019136111674,423194.0,438111.0,-1.649019,53.838752,13,3,1,1,1,...,0,0,1,1,1,0,0,1,1,E01011461
41058,2019136111836,429149.0,431736.0,-1.559127,53.781158,13,2,2,1,1,...,0,0,4,1,1,0,0,1,1,E01011366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44657,2019136CT0238,430040.0,434040.0,-1.545382,53.801815,13,2,1,1,12,...,0,0,4,1,1,0,0,1,1,E01033010
44661,2019136CU0181,442094.0,434619.0,-1.362295,53.806186,13,3,1,1,12,...,0,0,4,4,2,0,0,2,1,E01011297
44662,2019136CU0363,423019.0,437653.0,-1.651713,53.834643,13,2,1,1,12,...,0,0,1,1,4,7,0,1,1,E01011452
44669,2019136CV0723,436853.0,442515.0,-1.440932,53.877548,13,2,2,1,12,...,0,0,1,1,1,0,0,2,1,E01011713


In [127]:
DATA_LEEDS['accidents'].nunique() # prints out the column names and the corresponding number of unique values 

Accident_Index                                 1451
Location_Easting_OSGR                          1353
Location_Northing_OSGR                         1319
Longitude                                      1407
Latitude                                       1404
Police_Force                                      1
Accident_Severity                                 3
Number_of_Vehicles                                6
Number_of_Casualties                              9
Date                                             12
Day_of_Week                                       7
Time                                             25
Local_Authority_(District)                        1
Local_Authority_(Highway)                         1
1st_Road_Class                                    6
1st_Road_Number                                  38
Road_Type                                         6
Speed_limit                                       6
Junction_Detail                                   9
Junction_Con

We see that the main dataset 'accidents_processed.csv' stores all accidents recorded in Leeds in 2019. It consists of 1451 rows (one per recorded accident, matching the 1451 unique accident indexes) and 32 columns providing more detailed information about each accident. The variables and their numbers of unique values can be studied in the output of the above cell. We can differentiate the attributes as follows:
- Categorical Attributes (Most of the columns are categorical)
- Geographical Attributes (There are several measures of the location of the accident)
- Time Attribute (Each accident specifies a date and time)

### Vehicles
---

In [128]:
DATA_LEEDS['vehicles'].shape # returns the number of rows and columns

(2688, 23)

In [129]:
DATA_LEEDS['vehicles'] # prints out an overview of the dataframe (and the number of rows and columns)

Unnamed: 0,Accident_Index,Vehicle_Reference,Vehicle_Type,Towing_and_Articulation,Vehicle_Manoeuvre,Vehicle_Location-Restricted_Lane,Junction_Location,Skidding_and_Overturning,Hit_Object_in_Carriageway,Vehicle_Leaving_Carriageway,...,Journey_Purpose_of_Driver,Sex_of_Driver,Age_of_Driver,Age_Band_of_Driver,Engine_Capacity_(CC),Propulsion_Code,Age_of_Vehicle,Driver_IMD_Decile,Driver_Home_Area_Type,Vehicle_IMD_Decile
74565,2019131901350,1,21,1,11,0,0,0,0,0,...,1,1,39,7,12419,2,3,1,1,1
74566,2019131901350,2,21,0,4,0,0,0,0,0,...,1,1,56,9,6374,2,5,2,1,2
74567,2019131901350,3,21,1,4,0,0,0,0,0,...,1,1,40,7,10837,2,2,4,1,4
74568,2019131901350,4,21,1,4,0,0,0,0,0,...,1,1,35,6,-1,-1,-1,-1,-1,-1
74569,20191358F1730,1,9,0,18,0,0,0,0,0,...,6,3,-1,-1,-1,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81157,2019136CV0723,1,9,0,18,0,0,0,0,0,...,6,1,72,10,1968,2,4,10,1,10
81158,2019136CV0723,2,1,0,18,0,0,0,0,0,...,6,1,59,9,-1,-1,-1,10,1,10
81159,2019136CV1518,1,9,0,4,0,0,0,0,0,...,6,3,-1,-1,1598,2,7,7,1,7
81160,2019136CV1518,2,9,0,4,0,0,0,0,0,...,6,1,41,7,2183,2,7,3,1,3


In [130]:
DATA_LEEDS['vehicles'].nunique() # prints out the column names and the corresponding number of unique values 

Accident_Index                      1451
Vehicle_Reference                      6
Vehicle_Type                          17
Towing_and_Articulation                5
Vehicle_Manoeuvre                     18
Vehicle_Location-Restricted_Lane       7
Junction_Location                      9
Skidding_and_Overturning               5
Hit_Object_in_Carriageway              9
Vehicle_Leaving_Carriageway            9
Hit_Object_off_Carriageway            11
1st_Point_of_Impact                    5
Was_Vehicle_Left_Hand_Drive?           2
Journey_Purpose_of_Driver              5
Sex_of_Driver                          3
Age_of_Driver                         84
Age_Band_of_Driver                    11
Engine_Capacity_(CC)                 229
Propulsion_Code                        6
Age_of_Vehicle                        30
Driver_IMD_Decile                     11
Driver_Home_Area_Type                  4
Vehicle_IMD_Decile                    11
dtype: int64

We see that the side dataset 'vehicles_processed.csv' provides more detailed information about all vehicles involved in each of the accidents. It consists of 2688 rows (one record per involved vehicle) and 23 columns providing more detailed information about each vehicle. The variables and their numbers of unique values can be studied in the output of the above cell. We can differentiate the attributes as follows:
- Linking Attributes (Accident Indexes link the vehicles to the accidents dataset and the vehicle references the casualties)
- Categorical Attributes (Most of the columns are categorical)

### Casualties
---

In [131]:
DATA_LEEDS['casualties'].shape # returns the number of rows and columns

(1908, 16)

In [132]:
DATA_LEEDS['casualties'] # prints out an overview of the dataframe (and the number of rows and columns)

Unnamed: 0,Accident_Index,Vehicle_Reference,Casualty_Reference,Casualty_Class,Sex_of_Casualty,Age_of_Casualty,Age_Band_of_Casualty,Casualty_Severity,Pedestrian_Location,Pedestrian_Movement,Car_Passenger,Bus_or_Coach_Passenger,Pedestrian_Road_Maintenance_Worker,Casualty_Type,Casualty_Home_Area_Type,Casualty_IMD_Decile
51140,2019131901350,1,1,1,1,39,7,2,0,0,0,0,0,21,1,1
51141,20191358F1730,2,1,2,2,6,2,3,0,0,0,4,0,11,1,1
51142,20191358F1730,2,2,2,1,9,2,3,0,0,0,4,0,11,1,1
51143,20191358F1730,2,3,2,2,39,7,3,0,0,0,4,0,11,1,2
51144,20191358F1730,2,4,2,1,5,1,3,0,0,0,4,0,11,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55919,2019136CU0181,1,1,1,1,27,6,3,0,0,0,0,0,9,2,9
55920,2019136CU0363,1,1,1,1,36,7,2,0,0,0,0,0,5,1,3
55929,2019136CV0723,2,1,1,1,59,9,2,0,0,0,0,0,1,1,10
55930,2019136CV1518,2,1,1,1,41,7,3,0,0,0,0,0,9,1,3


In [133]:
DATA_LEEDS['casualties'].nunique() # prints out the column names and the corresponding number of unique values 

Accident_Index                        1451
Vehicle_Reference                        5
Casualty_Reference                      13
Casualty_Class                           3
Sex_of_Casualty                          2
Age_of_Casualty                         92
Age_Band_of_Casualty                    11
Casualty_Severity                        3
Pedestrian_Location                      9
Pedestrian_Movement                     10
Car_Passenger                            3
Bus_or_Coach_Passenger                   5
Pedestrian_Road_Maintenance_Worker       1
Casualty_Type                           18
Casualty_Home_Area_Type                  4
Casualty_IMD_Decile                     11
dtype: int64

We see that the side dataset 'casualties_processed.csv' provides more detailed information about the casualties of each accident. It consists of 1908 rows (one record per casualty) and 16 columns providing more detailed information about each casualty. The variables and their numbers of unique values can be studied in the output of the above cell. We can differentiate the attributes as follows:
- Linking Attributes (Accident Indexes link the casualties to the accidents dataset and the vehicle references link them to the vehicles)
- Categorical Attributes (Most of the columns are categorical)

## Export Processed Datasets
--- 
Finally, we export the processed datasets into a new subfolder. From now on, all Jupyter Notebooks will work with those processed datasets.

In [134]:
for dataset in TABLENAMES:
    DATA_LEEDS[dataset].to_csv(PATH['data_processed'] + FILENAME[dataset], index=False)

## Number of vehicles
--- 
Check that all Number_of_Vehicles and Number_of_Casualties values in the accidents table are correct, by counting all the corresponding records in the vehicles and casualties tables.

In [135]:
casualties = DATA_RAW['casualties']
vehicles = DATA_RAW['vehicles']
vehicles

Unnamed: 0,Accident_Index,Vehicle_Reference,Vehicle_Type,Towing_and_Articulation,Vehicle_Manoeuvre,Vehicle_Location-Restricted_Lane,Junction_Location,Skidding_and_Overturning,Hit_Object_in_Carriageway,Vehicle_Leaving_Carriageway,...,Journey_Purpose_of_Driver,Sex_of_Driver,Age_of_Driver,Age_Band_of_Driver,Engine_Capacity_(CC),Propulsion_Code,Age_of_Vehicle,Driver_IMD_Decile,Driver_Home_Area_Type,Vehicle_IMD_Decile
0,2019010128300,1,9,0,-1,-1,-1,-1,-1,-1,...,6,1,58,9,-1,-1,-1,2,1,2
1,2019010128300,2,9,0,-1,-1,-1,-1,-1,-1,...,6,3,-1,-1,-1,-1,-1,2,1,2
2,2019010152270,1,9,0,18,-1,0,-1,-1,-1,...,6,2,24,5,-1,-1,-1,3,1,3
3,2019010152270,2,9,0,18,-1,0,-1,-1,-1,...,6,3,-1,-1,-1,-1,-1,6,1,6
4,2019010155191,1,9,0,3,0,1,0,0,0,...,6,1,45,7,-1,-1,-1,4,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
216376,2019984107019,4,19,0,18,0,0,0,0,0,...,1,1,20,4,2198,2,4,5,3,5
216377,2019984107219,1,9,0,18,0,1,0,0,0,...,6,1,33,6,1997,2,12,3,2,3
216378,2019984107219,2,9,0,18,0,1,0,0,0,...,6,1,61,9,2967,2,5,7,3,7
216379,2019984107419,1,9,0,7,0,6,0,0,3,...,5,1,78,11,1597,2,6,8,3,8


In [136]:
#Casualties
Number_of_Casualties = [ID for ID in casualties['Accident_Index']]
Number_of_Casualties = Counter(Number_of_Casualties) # Counter (dict-like): key is the accident ID, value is the number of casualty records for that ID

#Vehicles
Num_Veh_ID = [ID for ID in vehicles['Accident_Index']]
Num_Veh_Ref = [ID for ID in vehicles['Vehicle_Reference']]

Number_of_Vehicles = dict(zip(Num_Veh_ID, Num_Veh_Ref)) # dict: key is the accident ID, value is the Vehicle_Reference of the last vehicle row for that ID, which the comparison below treats as the number of vehicles

missing_vehicles_ID = {} #Contains number of missing vehicles for the ID


stats_missing_vehicles = {} #Contains stats over the number of missing values

for key, value in Number_of_Vehicles.items():
    if key in Number_of_Casualties and value == Number_of_Casualties[key]:
        pass
    else: 
        missing_vehicles_ID[key] = value - Number_of_Casualties[key]
        

missing_vehicles_ID

{2019010128300: -1,
 2019010152270: 1,
 2019010155191: 1,
 2019010155195: -1,
 2019010155198: -2,
 2019010155206: 1,
 2019010155207: 2,
 2019010155217: 1,
 2019010155220: 2,
 2019010155225: 1,
 2019010155226: 1,
 2019010155232: 1,
 2019010155234: 1,
 2019010155242: 1,
 2019010155254: 1,
 2019010155256: 1,
 2019010155257: -1,
 2019010155263: 1,
 2019010155276: 1,
 2019010155282: 1,
 2019010155284: 1,
 2019010155294: 1,
 2019010155298: 1,
 2019010155301: 1,
 2019010155302: 1,
 2019010155303: -1,
 2019010155305: 1,
 2019010155306: 1,
 2019010155310: 1,
 2019010155312: 2,
 2019010155315: 1,
 2019010155321: 1,
 2019010155325: -1,
 2019010155341: 1,
 2019010155351: 1,
 2019010155354: 1,
 2019010155363: 1,
 2019010155369: 1,
 2019010155371: -2,
 2019010155376: 1,
 2019010155400: 1,
 2019010155402: 1,
 2019010155414: 2,
 2019010155436: -1,
 2019010155440: 1,
 2019010155442: 1,
 2019010155446: 1,
 2019010155449: 1,
 2019010155453: -1,
 2019010155454: 2,
 2019010155455: 1,
 2019010155464: 1,
 20

In [137]:
#Casualties
Number_of_Casualties = [ID for ID in casualties['Accident_Index']]
Number_of_Casualties = Counter(Number_of_Casualties) # Counter (dict-like): key is the accident ID, value is the number of casualty records for that ID

In [138]:
#Vehicles
Num_Veh_ID = [ID for ID in vehicles['Accident_Index']]
Num_Veh_Ref = [ID for ID in vehicles['Vehicle_Reference']]
Number_of_Vehicles = dict(zip(Num_Veh_ID, Num_Veh_Ref)) # dict: key is the accident ID, value is the Vehicle_Reference of the last vehicle row for that ID, which the comparison below treats as the number of vehicles

In [143]:
# Missing values
missing_vehicles_ID = {} #Contains number of missing vehicles for the ID

In [147]:
for key, value in Number_of_Vehicles.items():
    if key in Number_of_Casualties and value == Number_of_Casualties[key]:
        pass
    else: 
        missing_vehicles_ID[key] = value - Number_of_Casualties[key]

len(missing_vehicles_ID)

84353

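Conclusion: for 84353 accidents, the value derived from the vehicles table does not match the number of casualty records in the casualties table. Note that this loop compares vehicles with casualties; it does not compare either of them with the Number_of_Vehicles and Number_of_Casualties columns of the accidents table.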
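
The check described at the top of this section, comparing the Number_of_Vehicles and Number_of_Casualties columns of the accidents table with the number of matching rows in the two sub tables, could be sketched as follows. This is a minimal sketch on made-up toy data, not the method used above; the helper name `count_mismatches` is hypothetical.

```python
import pandas as pd

# Toy tables (made up): accident A3 claims 3 casualties but only 2 are recorded.
toy_accidents = pd.DataFrame(
    {
        "Accident_Index": ["A1", "A2", "A3"],
        "Number_of_Vehicles": [2, 1, 2],
        "Number_of_Casualties": [1, 1, 3],
    }
)
toy_vehicles = pd.DataFrame({"Accident_Index": ["A1", "A1", "A2", "A3", "A3"]})
toy_casualties = pd.DataFrame({"Accident_Index": ["A1", "A2", "A3", "A3"]})

def count_mismatches(accidents, sub, count_column, key="Accident_Index"):
    """Return the accident keys whose stored count differs from the rows found in `sub`."""
    # Rows per accident in the sub table; accidents without any row count as 0.
    rows_per_accident = sub[key].value_counts()
    observed = accidents[key].map(rows_per_accident).fillna(0).astype(int)
    return accidents.loc[observed != accidents[count_column], key].tolist()

vehicle_mismatches = count_mismatches(toy_accidents, toy_vehicles, "Number_of_Vehicles")
casualty_mismatches = count_mismatches(toy_accidents, toy_casualties, "Number_of_Casualties")
print(vehicle_mismatches, casualty_mismatches)
```

Counting rows per accident index, instead of reading the last Vehicle_Reference, does not rely on the references being numbered consecutively from 1.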