# Assignment 1 - Data Mining Using Apache Spark 
## 1. Preparation
### 1.1 Requirements

1. Operating System: Ubuntu 18.04.2 LTS
2. Apache Spark 2.3.3 Binary (https://spark.apache.org/downloads.html)
3. Python 3.6.5 (Anaconda, Inc.)
4. PySpark 2.4.0 (Apache Spark Python API)
5. Findspark 1.3.0 (Python library)
6. Jupyter Notebook (https://jupyter.org/install)
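
The list above pairs a Spark 2.3.3 binary with PySpark 2.4.0. The source does not say whether this combination caused any problems. As a rough sketch, a small check like the one below can make such a difference visible; `major_minor` and `same_release_line` are hypothetical helpers written for illustration, not part of Spark or PySpark.

```python
def major_minor(version):
    """Return the (major, minor) part of a dotted version string as integers."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def same_release_line(spark_version, pyspark_version):
    """Return True when both versions share the same major.minor release line."""
    return major_minor(spark_version) == major_minor(pyspark_version)


# Versions taken from the requirements list above.
print(same_release_line("2.3.3", "2.4.0"))  # prints False
```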

### 1.2 Installation
* [How To Install Apache Spark on Ubuntu 18.04 LTS](https://idroot.us/linux/install-apache-spark-ubuntu-18-04-lts/)
* [Pyspark and Jupyter notebook setup in Ubuntu](https://jmedium.com/pyspark-in-python/)

### 1.3 Dataset Description
* Dataset name: [UK Road Safety : Traffic Accidents and Vehicles](https://www.kaggle.com/tsiaras/uk-road-safety-accidents-and-vehicles)
* Description: Detailed dataset of road accidents and involved vehicles in the UK (2005-2017). Each line represents a single traffic accident (identified by the Accident_Index column) and its various properties.
* Dataset details:
    * Number of rows: 2047256
    * Number of columns: 34
    * Size: 134 MB
    * Format: CSV

## 2. Steps
### 2.1 Spark Initialization

In [1]:
# Import findspark to make pyspark importable as a regular library
import findspark
findspark.init('/home/mocatfrio/spark')
# /home/mocatfrio/spark is a symbolic link to /bin/spark-2.3.3-bin-hadoop2.7
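
The comment in cell In [1] notes that the path passed to `findspark.init` is a symbolic link. The following is a minimal sketch, using only the standard library, of checking where such a link points before handing it to findspark. `resolve_spark_home` is a hypothetical helper introduced here for illustration; it is not part of findspark.

```python
import os


def resolve_spark_home(path):
    """Follow symbolic links and return the real Spark directory.

    Raises FileNotFoundError when the resolved path is not a directory.
    """
    real_path = os.path.realpath(path)
    if not os.path.isdir(real_path):
        raise FileNotFoundError("Spark home not found: " + real_path)
    return real_path


# Path used in cell In [1]; adjust it for your own machine.
try:
    print(resolve_spark_home("/home/mocatfrio/spark"))
except FileNotFoundError as error:
    print(error)
```

If the directory exists, the printed path is the link target, which makes it easy to confirm which Spark build the notebook will use.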

In [2]:
# Import the required Python library
from pyspark.sql import SparkSession

# Create a Spark session
# SparkSession is the entry point to programming Spark with the Dataset and DataFrame API
spark = SparkSession \
    .builder \
    .appName("Traffic Accidents and Vehicles") \
    .getOrCreate()

In [3]:
# Print the SparkSession object to confirm it was created
print(spark)

<pyspark.sql.session.SparkSession object at 0x7f79c9d88b00>


### 2.2 Load Dataset 

In [4]:
# Load the dataset
df = spark.read.csv("/home/mocatfrio/Documents/big-data/tugas-1/Accident_Information.csv", header=True, inferSchema=True)

In [5]:
# Show the first 20 rows
df.show()

+--------------+--------------+---------------+--------------+---------------+-----------------+-------------------+-------------------+-----------+-------------------------------------------+--------------------+--------------------+---------+--------------------+--------------------------+-------------------------+---------------------+----------------------+---------+-------------------------+--------------------+------------------+---------------------------------+---------------------------------------+-------------------+-----------------------+------------------+--------------------------+-----------+-----+-------------------+--------------------+----+----------+
|Accident_Index|1st_Road_Class|1st_Road_Number|2nd_Road_Class|2nd_Road_Number|Accident_Severity|Carriageway_Hazards|               Date|Day_of_Week|Did_Police_Officer_Attend_Scene_of_Accident|    Junction_Control|     Junction_Detail| Latitude|    Light_Conditions|Local_Authority_(District)|Local_Authority_(Highway)|Loc

In [6]:
# Count the rows in the dataset
df.count()

2047256

In [7]:
# inferSchema=True (set when loading) makes Spark infer each column's data type, e.g. timestamps, instead of reading every column as a string
df.printSchema()

root
 |-- Accident_Index: string (nullable = true)
 |-- 1st_Road_Class: string (nullable = true)
 |-- 1st_Road_Number: string (nullable = true)
 |-- 2nd_Road_Class: string (nullable = true)
 |-- 2nd_Road_Number: string (nullable = true)
 |-- Accident_Severity: string (nullable = true)
 |-- Carriageway_Hazards: string (nullable = true)
 |-- Date: timestamp (nullable = true)
 |-- Day_of_Week: string (nullable = true)
 |-- Did_Police_Officer_Attend_Scene_of_Accident: string (nullable = true)
 |-- Junction_Control: string (nullable = true)
 |-- Junction_Detail: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Light_Conditions: string (nullable = true)
 |-- Local_Authority_(District): string (nullable = true)
 |-- Local_Authority_(Highway): string (nullable = true)
 |-- Location_Easting_OSGR: string (nullable = true)
 |-- Location_Northing_OSGR: string (nullable = true)
 |-- Longitude: string (nullable = true)
 |-- LSOA_of_Accident_Location: string (nullable = true)
 |-

In [8]:
# Register the dataframe as a SQL temporary view
df.createOrReplaceTempView("traffic_accidents")

In [9]:
# SQL query to list the distinct accident severity levels
result = spark.sql("SELECT DISTINCT Accident_Severity FROM traffic_accidents")

In [10]:
result.show()

+-----------------+
|Accident_Severity|
+-----------------+
|           Slight|
|            Fatal|
|          Serious|
+-----------------+



### 2.3 Data Mining Process