This project uses Python and PySpark to demonstrate how big data processing can be leveraged to analyze car crashes in the UK. The attached Jupyter Notebook can be used in conjunction with Databricks to process the data across a real cluster.
- data - contains the four files used in the analysis:
  a. Acc.csv - 2017 accident data reported by the UK police force.
  b. Cas.csv - 2017 casualty data reported by the UK police force.
  c. Veh.csv - 2017 vehicle data reported by the UK police force.
  d. dictionary.xls - data dictionary used to define the coded categorical values within the datasets.
- images - contains visualizations:
  a. uk_accidents.png - heatmap showing accidents in the UK by accident severity.
- car_crash.ipynb - Jupyter Notebook containing all analysis performed on the datasets, along with visualizations.
Python is used in conjunction with PySpark for all analysis performed.
The following commands will import all necessary packages:
```python
import pyspark, os, zipfile
import pandas as pd
import urllib.request
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql.types import IntegerType
```
PySpark takes special configuration to install and run within Jupyter Notebook:

- If you're using Windows, Michael Galarnyk has an excellent tutorial on installing PySpark for Windows.
- If you are installing on Linux or macOS, Charles Bochet's article will get you started.
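Once Spark itself is installed, one common way to wire it into Jupyter (an assumed setup, not the only one the tutorials above describe) is to point the `pyspark` launcher at the notebook server via environment variables:

```shell
# Assumed shell configuration: makes running `pyspark` open a Jupyter notebook
# with a SparkContext ready to use (add to ~/.bashrc or ~/.zshrc to persist).
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```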
- Would like to thank the UK government for posting the data on their website.
- Would like to thank the Stack Overflow user whose function I stole; because of you lot I get to stand on the shoulders of giants.
MIT License Copyright (c) 2019 Ian Jeffries