# Data Processing
---
### First inspection of dataset, data filtering and cleaning

Creation: 05.02.2021

## Required Libraries
---

In [2]:
import numpy as np
import pandas as pd

## Loading in the Dataset
---
To start the data analysis, we load the three given datasets into 'pandas' dataframes.

In [3]:
# load the three raw datasets into pandas dataframes
raw_data_accidents = pd.read_csv('../data/raw/accidents.csv') # 117,536 rows × 32 columns
raw_data_vehicles = pd.read_csv('../data/raw/vehicles.csv') # 216,381 rows × 23 columns
raw_data_casualties = pd.read_csv('../data/raw/casualties.csv') # 153,158 rows × 16 columns

## Filtering 
---
Next, we filter the main dataset 'accidents.csv' for the city of interest, Leeds. Several variables in the dataset identify the city; we use the column 'Local_Authority_(District)', in which Leeds is coded as 204. The filtered dataframe is saved in a new variable.

In [4]:
leeds_accidents = raw_data_accidents[raw_data_accidents['Local_Authority_(District)'] == 204]
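The district filter is a boolean mask applied to the dataframe. A minimal sketch on made-up rows is shown here; the toy values and the names `toy_accidents`, `is_leeds` and `LEEDS_DISTRICT_CODE` are assumptions for illustration only. The `.copy()` call is optional, but it yields an independent dataframe, so later column assignments do not raise pandas' `SettingWithCopyWarning`.

```python
import pandas as pd

# Toy stand-in for the accidents table; the values are made up for illustration.
toy_accidents = pd.DataFrame({
    'Accident_Index': ['A1', 'A2', 'A3', 'A4'],
    'Local_Authority_(District)': [204, 300, 204, 215],
})

LEEDS_DISTRICT_CODE = 204  # district code used for Leeds in this notebook

# Boolean mask: True for every row whose district code equals 204
is_leeds = toy_accidents['Local_Authority_(District)'] == LEEDS_DISTRICT_CODE

# Keep only the matching rows; .copy() makes the result independent of the source
toy_leeds = toy_accidents[is_leeds].copy()

print(toy_leeds['Accident_Index'].tolist())  # ['A1', 'A3']
```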

The other two datasets cannot be filtered by their own attributes. Instead, they are filtered through the unique accident indexes taken from our filtered dataframe of accidents in Leeds. We obtain a list of the indexes of all accidents that occurred in Leeds and use this list to filter both the 'vehicles.csv' and 'casualties.csv' datasets.

In [5]:
leeds_indexes = list(leeds_accidents['Accident_Index'])

In [6]:
leeds_vehicles = raw_data_vehicles[raw_data_vehicles['Accident_Index'].isin(leeds_indexes)]
leeds_casualties = raw_data_casualties[raw_data_casualties['Accident_Index'].isin(leeds_indexes)]
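The index-based filtering keeps exactly those rows of a side table whose accident index appears in the filtered accidents table. A small sketch on toy data follows; the column 'Vehicle_Type' and all values are hypothetical, and a `set` is used here instead of a list only because membership tests on a set are cheap (`Series.isin` accepts both).

```python
import pandas as pd

# Toy stand-ins; 'Vehicle_Type' and all values are made up for illustration.
toy_leeds_accidents = pd.DataFrame({'Accident_Index': ['A1', 'A3']})
toy_vehicles = pd.DataFrame({
    'Accident_Index': ['A1', 'A1', 'A2', 'A3', 'A4'],
    'Vehicle_Type': [9, 1, 9, 19, 9],
})

# Accident indexes that survived the Leeds filter
wanted = set(toy_leeds_accidents['Accident_Index'])

# Keep every vehicle row that belongs to one of those accidents
toy_leeds_vehicles = toy_vehicles[toy_vehicles['Accident_Index'].isin(wanted)]

# Sanity check: no vehicle refers to an accident outside the filtered set
assert set(toy_leeds_vehicles['Accident_Index']) <= wanted

print(len(toy_leeds_vehicles))  # 3
```

An accident with several vehicles keeps all of its rows, which is why the side tables can end up with more rows than the accidents table.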

## Process Data
---
In this section, the 'Date' and 'Time' attributes of the 'accidents.csv' dataset are cleaned for easier use in the single-variable analysis.

### Time

In [7]:
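# Keep only the hour (first two characters of 'HH:MM'); missing times become '-1'.
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning.
hour = leeds_accidents['Time'].str[:2].fillna('-1')
leeds_accidents = leeds_accidents.assign(Time=hour)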
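The same hour extraction can be tried on a few toy values with pandas' vectorized string methods. This is only a sketch: the sample times are made up, and the column names 'Hour' and 'Hour_int' are hypothetical. Missing times stay missing after slicing and are then replaced by '-1'.

```python
import numpy as np
import pandas as pd

# Toy 'Time' column in 'HH:MM' format with one missing entry (values are made up)
toy = pd.DataFrame({'Time': ['08:15', '17:40', np.nan, '00:05']})

# Slice the first two characters of each string; NaN stays NaN, then becomes '-1'
toy['Hour'] = toy['Time'].str[:2].fillna('-1')

# Optional: an integer version, convenient for sorting or binning by hour
toy['Hour_int'] = toy['Hour'].astype(int)

print(toy['Hour'].tolist())      # ['08', '17', '-1', '00']
print(toy['Hour_int'].tolist())  # [8, 17, -1, 0]
```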

### Date

In [8]:
date = np.array(leeds_accidents['Date'])
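The cell above only copies the 'Date' column into an array. One possible next step, assuming the dates are day-first strings such as 27/09/2019, is to parse them with `pd.to_datetime` and derive calendar fields. This sketch is not part of the original notebook, and the derived column names 'Month' and 'Day' are hypothetical.

```python
import pandas as pd

# Toy 'Date' column, assuming the day-first format DD/MM/YYYY
toy = pd.DataFrame({'Date': ['27/09/2019', '15/08/2019', '01/01/2019']})

# An explicit format avoids any day/month ambiguity during parsing
parsed = pd.to_datetime(toy['Date'], format='%d/%m/%Y')

# Derive simple calendar fields for later grouping
toy['Month'] = parsed.dt.month
toy['Day'] = parsed.dt.day

print(toy['Month'].tolist())  # [9, 8, 1]
```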

## Overview 
---
For each of the datasets, we want a first impression of its size and the information it stores. To this end, we display each dataframe and summarise the number of unique values in each of its columns.

### Accidents
---

In [9]:
leeds_accidents.shape # returns the number of rows and columns

(1451, 32)

In [10]:
leeds_accidents # prints out an overview of the dataframe (and the number of rows and columns)

Unnamed: 0,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,...,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location
41052,2019131901350,443576.0,438198.0,-1.339288,53.838231,13,2,4,1,27/09/2019,...,0,0,1,2,2,0,0,2,1,E01011309
41053,20191358F1730,436147.0,434957.0,-1.452556,53.809670,13,3,2,7,15/08/2019,...,0,0,1,1,1,0,0,1,2,E01011666
41055,2019136111190,435904.0,425850.0,-1.457300,53.727837,13,3,2,1,01/01/2019,...,0,0,1,1,1,0,0,2,1,E01011636
41057,2019136111674,423194.0,438111.0,-1.649019,53.838752,13,3,1,1,01/01/2019,...,0,0,1,1,1,0,0,1,1,E01011461
41058,2019136111836,429149.0,431736.0,-1.559127,53.781158,13,2,2,1,01/01/2019,...,0,0,4,1,1,0,0,1,1,E01011366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44657,2019136CT0238,430040.0,434040.0,-1.545382,53.801815,13,2,1,1,29/12/2019,...,0,0,4,1,1,0,0,1,1,E01033010
44661,2019136CU0181,442094.0,434619.0,-1.362295,53.806186,13,3,1,1,30/12/2019,...,0,0,4,4,2,0,0,2,1,E01011297
44662,2019136CU0363,423019.0,437653.0,-1.651713,53.834643,13,2,1,1,30/12/2019,...,0,0,1,1,4,7,0,1,1,E01011452
44669,2019136CV0723,436853.0,442515.0,-1.440932,53.877548,13,2,2,1,31/12/2019,...,0,0,1,1,1,0,0,2,1,E01011713


In [11]:
leeds_accidents.nunique() # prints out the column names and the corresponding number of unique values 

Accident_Index                                 1451
Location_Easting_OSGR                          1353
Location_Northing_OSGR                         1319
Longitude                                      1407
Latitude                                       1404
Police_Force                                      1
Accident_Severity                                 3
Number_of_Vehicles                                6
Number_of_Casualties                              9
Date                                            348
Day_of_Week                                       7
Time                                             25
Local_Authority_(District)                        1
Local_Authority_(Highway)                         1
1st_Road_Class                                    6
1st_Road_Number                                  38
Road_Type                                         6
Speed_limit                                       6
Junction_Detail                                   9
Junction_Con

We see that the main dataset 'accidents_processed.csv' stores all accidents recorded in Leeds in 2019. It consists of 1451 rows (one per recorded accident, since every 'Accident_Index' is unique) and 32 columns providing more detailed information about each accident. The variables and their numbers of unique values can be studied in the output of the cell above. The attributes can be grouped as follows:
- Categorical Attributes (Most of the columns are categorical)
- Geographical Attributes (There are several measures of the location of the accident)
- Time Attribute (Each accident specifies a date and time)
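A compact per-column summary can support such a grouping. The sketch below combines each column's data type with its number of unique values; the toy values and the name `summary` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with one categorical, one geographical and one time attribute
toy = pd.DataFrame({
    'Accident_Severity': [2, 3, 3, 2],
    'Longitude': [-1.339288, -1.452556, -1.457300, -1.649019],
    'Date': ['27/09/2019', '15/08/2019', '01/01/2019', '01/01/2019'],
})

# One row per column: its dtype and how many distinct values it holds
summary = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'n_unique': toy.nunique(),
})

print(summary)
```

Columns with few unique values relative to the number of rows are usually categorical codes, while columns such as coordinates are close to unique per row.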

### Vehicles
---

In [None]:
leeds_vehicles.shape # returns the number of rows and columns

In [None]:
leeds_vehicles # prints out an overview of the dataframe (and the number of rows and columns)

In [None]:
leeds_vehicles.nunique() # prints out the column names and the corresponding number of unique values 

We see that the side dataset 'vehicles_processed.csv' provides more detailed information about all vehicles involved in each of the accidents. It consists of 2688 rows (one record per involved vehicle) and 23 columns providing more detailed information about each vehicle. The variables and their numbers of unique values can be studied in the output of the cell above. The attributes can be grouped as follows:
- Linking Attributes (the accident index links each vehicle to the accidents dataset, and the vehicle reference links it to the casualties)
- Categorical Attributes (Most of the columns are categorical)

### Casualties
---

In [None]:
leeds_casualties.shape # returns the number of rows and columns

In [None]:
leeds_casualties # prints out an overview of the dataframe (and the number of rows and columns)

In [None]:
leeds_casualties.nunique() # prints out the column names and the corresponding number of unique values 

We see that the side dataset 'casualties_processed.csv' provides more detailed information about the casualties of the recorded accidents. It consists of 1908 rows (one record per casualty) and 16 columns providing more detailed information about each casualty. The variables and their numbers of unique values can be studied in the output of the cell above. The attributes can be grouped as follows:
- Linking Attributes (the accident index links each casualty to the accidents dataset, and the vehicle reference links it to the vehicles dataset)
- Categorical Attributes (Most of the columns are categorical)

## Export Processed Datasets
--- 
Finally, we export the processed datasets into a new subfolder. From now on, all Jupyter Notebooks will work with those processed datasets.

In [7]:
leeds_accidents.to_csv('../data/processed/accidents_processed.csv', index=False)
leeds_vehicles.to_csv('../data/processed/vehicles_processed.csv', index=False)
leeds_casualties.to_csv('../data/processed/casualties_processed.csv', index=False)
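One detail to keep in mind when the processed files are read back: CSV stores no type information, so hour strings such as '08' are parsed as integers by default. The sketch below, which assumes a temporary directory and a hypothetical file name `toy_processed.csv`, shows a round trip that keeps 'Time' as text by passing `dtype` to `pd.read_csv`.

```python
import os
import tempfile

import pandas as pd

# Toy processed frame; the values are made up for illustration
toy = pd.DataFrame({'Accident_Index': ['A1', 'A3'], 'Time': ['08', '-1']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_processed.csv')  # hypothetical file name

    # index=False leaves the row index out of the file, as in the export above
    toy.to_csv(path, index=False)

    # Default parsing turns '08' into the integer 8
    reloaded_default = pd.read_csv(path)

    # An explicit dtype keeps the hour as the two-character string '08'
    reloaded_text = pd.read_csv(path, dtype={'Time': str})

print(reloaded_default['Time'].tolist())  # [8, -1]
print(reloaded_text['Time'].tolist())     # ['08', '-1']
```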