# Data Processing
---
### First inspection of dataset, data filtering and cleaning

Creation: 05.02.2021

## Required Libraries
---

In [1]:
import numpy as np
import pandas as pd

## Loading in the Dataset
---
To start the data analysis, we load the three given datasets into separate pandas DataFrames.

In [2]:
# load each raw dataset into its own pandas DataFrame
raw_data_accidents = pd.read_csv('../data/raw/accidents.csv') # 117536 rows × 32 columns
raw_data_vehicles = pd.read_csv('../data/raw/vehicles.csv') # 216381 rows × 23 columns
raw_data_casualties = pd.read_csv('../data/raw/casualties.csv') # 153158 rows × 16 columns

## Filtering 
---
Next, we filter the main dataset 'accidents.csv' for the city of interest, Leeds. Several variables in the dataset identify the city; we use the column 'Local_Authority_(District)', in which Leeds is coded as 204. The filtered dataframe is stored in a new variable.

In [3]:
leeds_accidents = raw_data_accidents[raw_data_accidents['Local_Authority_(District)'] == 204]

However, the other two datasets cannot be filtered by their own attributes. Instead, they are filtered through the unique accident indexes taken from our filtered dataframe of accidents in Leeds. We collect the indexes of all accidents that occurred in Leeds and use this list to filter both the 'vehicles.csv' and 'casualties.csv' datasets.

In [4]:
leeds_indexes = list(leeds_accidents['Accident_Index'])

In [5]:
leeds_vehicles = raw_data_vehicles[raw_data_vehicles['Accident_Index'].isin(leeds_indexes)]
leeds_casualties = raw_data_casualties[raw_data_casualties['Accident_Index'].isin(leeds_indexes)]
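
The index-based filtering described above can be illustrated on a toy example. This is only a sketch: the two small frames, their values, and the 'Vehicle_Reference' column are invented for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy stand-ins for the raw tables (all values are hypothetical)
accidents = pd.DataFrame({
    'Accident_Index': ['A1', 'A2', 'A3'],
    'Local_Authority_(District)': [204, 300, 204],
})
vehicles = pd.DataFrame({
    'Accident_Index': ['A1', 'A1', 'A2', 'A3'],
    'Vehicle_Reference': [1, 2, 1, 1],
})

# Step 1: keep only accidents coded as district 204
leeds = accidents[accidents['Local_Authority_(District)'] == 204]

# Step 2: use the accident indexes of that subset to filter the second table
leeds_ids = set(leeds['Accident_Index'])
leeds_veh = vehicles[vehicles['Accident_Index'].isin(leeds_ids)]

# Sanity check: every remaining vehicle row belongs to a kept accident
assert set(leeds_veh['Accident_Index']) <= leeds_ids
```

A check like the final assertion is a cheap way to confirm that no rows from other districts slipped through.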

## Process Data
---
In this section, the 'Date' and 'Time' attributes of the 'accidents.csv' dataset are cleaned for easier use in the single-variable analysis.

### Time

In [6]:
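# reduce each 'HH:MM' time string to its hour; missing values become '-1'
time = np.array(leeds_accidents['Time'])
for i in range(len(time)):
    try:
        time[i] = time[i][:2]
    except TypeError:  # missing times are NaN (a float) and cannot be sliced
        time[i] = '-1'

# work on an explicit copy so that pandas does not warn about writing to a slice
leeds_accidents = leeds_accidents.copy()
leeds_accidents['Time'] = time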
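
As an alternative to the loop above, the same hour extraction can be sketched with pandas string methods. The helper name `extract_hour` and the sample values are hypothetical. The sketch assumes times are stored as 'HH:MM' strings with missing entries as NaN.

```python
import numpy as np
import pandas as pd

def extract_hour(times: pd.Series) -> pd.Series:
    """Return the first two characters of each time string, '-1' where missing."""
    # .str[:2] slices every string and leaves missing values as NaN,
    # which fillna then replaces with the placeholder '-1'
    return times.str[:2].fillna('-1')

# Hypothetical sample: two valid times and one missing entry
sample = pd.Series(['17:45', '08:05', np.nan])
hours = extract_hour(sample)
print(hours.tolist())
```

The vectorized form avoids the explicit `try`/`except` and keeps the result as a pandas Series.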

### Date

In [47]:
date = np.array(leeds_accidents['Date'])

(array(['01/01/2019', '01/02/2019', '01/03/2019', '01/05/2019',
       '01/06/2019', '01/07/2019', '01/08/2019', '01/09/2019',
       '01/10/2019', '01/11/2019', '01/12/2019', '02/01/2019',
       '02/02/2019', '02/03/2019', '02/04/2019', '02/05/2019',
       '02/07/2019', '02/08/2019', '02/09/2019', '02/10/2019',
       '02/11/2019', '03/01/2019', '03/03/2019', '03/04/2019',
       '03/05/2019', '03/07/2019', '03/09/2019', '03/10/2019',
       '03/11/2019', '03/12/2019', '04/01/2019', '04/02/2019',
       '04/03/2019', '04/04/2019', '04/05/2019', '04/06/2019',
       '04/07/2019', '04/08/2019', '04/09/2019', '04/10/2019',
       '04/11/2019', '04/12/2019', '05/02/2019', '05/03/2019',
       '05/04/2019', '05/06/2019', '05/07/2019', '05/08/2019',
       '05/09/2019', '05/10/2019', '05/11/2019', '05/12/2019',
       '06/02/2019', '06/03/2019', '06/04/2019', '06/05/2019',
       '06/06/2019', '06/07/2019', '06/08/2019', '06/09/2019',
       '06/10/2019', '06/11/2019', '06/12/2019', '07/0
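
The cleaning step for 'Date' is not shown above, so the following is only a sketch of one possible approach, not the notebook's actual method. It assumes the strings are in day/month/year order, which is an assumption based on the values displayed; the sample dates are copied from that output.

```python
import pandas as pd

# A few date strings as they appear in the output above
sample = pd.Series(['01/01/2019', '02/07/2019', '06/12/2019'])

# Assumption: day/month/year ordering; an explicit format avoids ambiguous parsing
parsed = pd.to_datetime(sample, format='%d/%m/%Y')

# Derived attributes that may be convenient for a single-variable analysis
months = parsed.dt.month
days = parsed.dt.day
print(months.tolist(), days.tolist())
```

Passing `format` explicitly makes the parser raise an error on strings that do not match, instead of silently guessing the order of day and month.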

## Export Processed Datasets
--- 
Finally, we export the processed datasets into a new subfolder. From now on, all Jupyter notebooks work with these processed datasets.

In [7]:
leeds_accidents.to_csv('../data/processed/accidents_processed.csv', index=False)
leeds_vehicles.to_csv('../data/processed/vehicles_processed.csv', index=False)
leeds_casualties.to_csv('../data/processed/casualties_processed.csv', index=False)
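
`to_csv` does not create missing directories, so the export fails if the subfolder does not exist yet. A minimal sketch of a helper that creates the folder first is shown below. The name `export_csv` is hypothetical, and the demonstration writes to a temporary directory instead of '../data/processed/'.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_csv(df: pd.DataFrame, path) -> Path:
    """Write df to path as CSV, creating parent folders if necessary."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    df.to_csv(path, index=False)
    return path

# Demonstration in a temporary directory with an invented one-row frame
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({'Accident_Index': ['A1'], 'Time': ['17']})
    out = export_csv(demo, Path(tmp) / 'processed' / 'accidents_processed.csv')
    written = out.exists()
    # read back as strings so the round trip preserves the original values
    reloaded = pd.read_csv(out, dtype=str)
```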