# Initial Exploratory Data Analysis

## Project 1 of First Year Project 2021

This notebook wrangles and explores the data set from the project.

Contact: Michael Szell (misz@itu.dk)  
Created: 2021-01-11  
Last modified: 2021-01-11

## Preliminaries

### Imports

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

### Constants

Constants are written in all caps, following PEP 8: https://www.python.org/dev/peps/pep-0008/#constants

In [None]:
PATH = {}
PATH["data_raw"] = "../data/raw/"
PATH["data_derived"] = "../data/derived/"

FILENAME = {}
FILENAME["accidents"] = "Road Safety Data - Accidents 2019.csv"
FILENAME["casualties"] = "Road Safety Data - Casualties 2019.csv"
FILENAME["vehicles"] = "Road Safety Data- Vehicles 2019.csv" # Note the inconsistent file naming (no space before "-" here)
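The shell commands further down spell out each file path by hand. As a minimal sketch, the constants above could instead be combined into full paths. The helper name `raw_path` is hypothetical and not part of the original notebook, and the constants are repeated only so the cell runs on its own.

```python
import os

# Same constants as defined above, repeated so this cell is self-contained.
PATH = {"data_raw": "../data/raw/"}
FILENAME = {
    "accidents": "Road Safety Data - Accidents 2019.csv",
    "casualties": "Road Safety Data - Casualties 2019.csv",
    "vehicles": "Road Safety Data- Vehicles 2019.csv",
}


def raw_path(table):
    """Hypothetical helper: full path of a raw table ("accidents", "casualties", "vehicles")."""
    return os.path.join(PATH["data_raw"], FILENAME[table])


# Report which of the expected raw files are present on this machine.
for table in FILENAME:
    status = "found" if os.path.exists(raw_path(table)) else "not found"
    print(f"{table}: {raw_path(table)} ({status})")
```

Building paths in one place means the inconsistent Vehicles file name only has to be typed once.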

## Data wrangling

The data were downloaded from here on Jan 4th: https://data.gov.uk/dataset/road-accidents-safety-data  
That page was updated afterwards (Jan 8th), so local and online data may be inconsistent.

Let's get a first overview using `head`. If this doesn't work for you, try `less`. There are 3 data tables: Accidents, Casualties, and Vehicles.

In [None]:
!head -n 6 "../data/raw/Road Safety Data - Accidents 2019.csv" 
!head -n 6 "../data/raw/Road Safety Data - Casualties 2019.csv"
!head -n 6 "../data/raw/Road Safety Data- Vehicles 2019.csv" 

### General insights

#### Variable types

The Accidents table has mixed data types: strings, floats, and integers.  
Casualties and Vehicles have categorical variables encoded as integers. The meaning of these categories can be looked up in `../references/variable lookup.xls`.

#### Link between data tables

Records in the three data tables are linked through their `Accident_Index`.

Looking at the first accident (`Accident_Index` 2019010128300), we can see a one-to-many relation from accidents to casualties and from accidents to vehicles: one accident can involve multiple casualties and multiple vehicles.
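The one-to-many relation can also be checked by counting how many rows share each `Accident_Index`. The following is a minimal sketch: the function name `rows_per_accident` and the tiny demo table are made up for illustration and do not come from the real data.

```python
import csv
import io
from collections import Counter


def rows_per_accident(lines, key="Accident_Index"):
    """Count how many rows of a table share each accident index."""
    return Counter(row[key] for row in csv.DictReader(lines))


# Made-up miniature casualties table (illustrative values only).
casualties_demo = io.StringIO(
    "Accident_Index,Some_Field\n"
    "A1,x\n"
    "A1,y\n"
    "A2,z\n"
)

counts = rows_per_accident(casualties_demo)
# Any accident with a count above 1 confirms the one-to-many relation.
multi = {index: n for index, n in counts.items() if n > 1}
print(multi)
```

For the real tables, an open file handle (for example `open(path, newline="")`) can be passed in place of the `StringIO` object.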

#### Dimensions

Number of records (`wc -l` counts lines, so each figure normally includes the header line)

In [None]:
!wc -l "../data/raw/Road Safety Data - Accidents 2019.csv" 
!wc -l "../data/raw/Road Safety Data - Casualties 2019.csv" 
!wc -l "../data/raw/Road Safety Data- Vehicles 2019.csv"

Number of fields (in first line)

https://www.geeksforgeeks.org/awk-command-unixlinux-examples/

In [None]:
!head -n 1 "../data/raw/Road Safety Data - Accidents 2019.csv" | awk -F ',' '{print NF}'

### Sanity checks

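Does each record have the same number of fields? If so, the command below prints a single line with the field count and the number of lines that have it. Note that splitting on commas is naive: a quoted field that contains a comma would inflate the count.

https://shapeshed.com/unix-uniq/

In [None]:
!awk -F ',' '{print NF}' "../data/raw/Road Safety Data - Accidents 2019.csv" | sort -n | uniq -c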
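Counting fields by splitting on commas would miscount any quoted field that itself contains a comma. As a hedged alternative, Python's `csv` module parses quoted fields properly. The function name `field_count_distribution` and the demo rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def field_count_distribution(lines):
    """Map 'number of fields' to 'number of records with that many fields'."""
    return Counter(len(record) for record in csv.reader(lines))


# Made-up demo: the second line has a quoted comma, the third line is short.
demo = io.StringIO(
    'id,place,value\n'
    '1,"Leeds, UK",3\n'
    '2,York\n'
)

dist = field_count_distribution(demo)
# A table is consistent when this distribution has exactly one key.
print(dict(dist), "consistent" if len(dist) == 1 else "inconsistent")
```

A distribution with a single key means every record, header included, has the same number of fields.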

How many duplicate lines are there? (If more than 0, there could be a problem)

In [None]:
# uniq only detects adjacent duplicates, so sort each file first
!sort "../data/raw/Road Safety Data - Accidents 2019.csv" | uniq -d | wc -l
!sort "../data/raw/Road Safety Data - Casualties 2019.csv" | uniq -d | wc -l
!sort "../data/raw/Road Safety Data- Vehicles 2019.csv" | uniq -d | wc -l
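Because `uniq -d` only compares adjacent lines, duplicates that are far apart in a file go unnoticed unless the input is sorted. The same check can be sketched in Python without sorting. The function name `count_duplicate_lines` and the demo lines are hypothetical.

```python
from collections import Counter


def count_duplicate_lines(lines):
    """Number of distinct lines that occur more than once, adjacent or not."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return sum(1 for n in counts.values() if n > 1)


# Made-up demo: "a,1" appears twice, but not on adjacent lines.
demo_lines = ["a,1\n", "b,2\n", "a,1\n", "c,3\n"]
n_duplicates = count_duplicate_lines(demo_lines)
print(n_duplicates)
```

An open file handle can be passed directly, since iterating over a file yields its lines.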

Does every `Accident_Index` in Casualties and Vehicles have a corresponding record in Accidents?
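One way to answer this question is a set difference: collect the accident indices of each table and look for indices in the child table that are missing from Accidents. This is a minimal sketch under assumptions: the function names `accident_ids` and `orphaned_ids` and the demo tables are made up, and the key column is assumed to be named `Accident_Index` in all three files.

```python
import csv
import io


def accident_ids(lines, key="Accident_Index"):
    """Set of all accident indices that occur in a table."""
    return {row[key] for row in csv.DictReader(lines)}


def orphaned_ids(child_lines, parent_lines):
    """Indices present in the child table but missing from the parent table."""
    return accident_ids(child_lines) - accident_ids(parent_lines)


# Made-up miniature tables: casualty "A3" has no matching accident.
accidents_demo = io.StringIO("Accident_Index\nA1\nA2\n")
casualties_demo = io.StringIO("Accident_Index\nA1\nA3\n")

orphans = orphaned_ids(casualties_demo, accidents_demo)
print(orphans if orphans else "every index has a matching accident")
```

For the real files, open file handles (for example `open(path, newline="")`) can replace the `StringIO` objects. If a header turns out to start with a byte-order mark, opening with `encoding="utf-8-sig"` would strip it; whether that is needed for these files has not been checked here.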

## Data exploration