# Draw insights from car accident reports

(C) Copyright IBM Corp. 2016

This notebook shows you how to analyze motor vehicle accidents based on accident reports for New York. The analysis steps in the notebook show how you can use the information about accidents to learn more about the possible causes of collisions. You will learn how to install additional Scala jars and how to perform descriptive data analysis.

## Table of contents
- [Get data](#get_data)
- [Access data](#access_data)
- [Load data](#load_data)
- [Load visualization packages](#load_visualization_packages)
- [Explore data](#explore_data)
- [Clean and shape the data](#data_cleaning)
- [Summary](#summary)

    

<a id="get_data"></a>
## Get data

Begin by getting the data about car accidents in the New York area. Click [NYPD Motor Vehicle Collisions](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95) to access the data set. Then click **Export** to download the data as a CSV file. 

This data set covers all reported vehicle collisions in New York from July 2012 to the present and contains detailed information about the incidents.

After you download the file, load the data file to the notebook by clicking **Palette>Data Sources**. Click **Add Source**, select **From file**, and browse to the data file. 

**Note**: Because the CSV file is relatively large, it may take a few minutes until the data file is loaded. The progress bar indicates the load status of the file.

This file is stored in the Object Storage instance that is associated with your IBM Analytics for Apache Spark service.

<a id="access_data"></a>
## Access data

Before you can access the data file in the Object Storage instance, you must set the Hadoop configuration with the service credentials of the Object Storage instance.

**Note**: You will not be using Hadoop in this sample; however, Spark leverages some Hadoop components.

Use the following configuration function:

In [1]:
/*found at http://stackoverflow.com/questions/33725500/load-data-from-bluemix-object-store-in-spark*/
import scala.collection.breakOut

def setConfig(name:String, dsConfiguration:String) : Unit = {
    // Swift properties have the form fs.swift.service.<name>.<key>, so the prefix must end with a dot
    val pfx = "fs.swift.service." + name + "."
    // Parse the "key: value" lines of the credentials into a Map
    val settings:Map[String,String] = dsConfiguration.split("\\n").
        map(l=>(l.split(":",2)(0).trim(), l.split(":",2)(1).trim()))(breakOut)

    // sc.getConf only returns a copy of the SparkConf, so set the values on the Hadoop configuration
    val conf = sc.hadoopConfiguration
    conf.set(pfx + "auth.url", settings.getOrElse("auth_url",""))
    conf.set(pfx + "tenant", settings.getOrElse("tenantId", ""))
    conf.set(pfx + "username", settings.getOrElse("username", ""))
    conf.set(pfx + "password", settings.getOrElse("password", ""))
    conf.set(pfx + "apikey", settings.getOrElse("password", ""))
    conf.set(pfx + "auth.endpoint.prefix", "endpoints")
}
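
The parsing and property naming used by the configuration function can be tried out without a Spark context. The following is a minimal sketch in plain Scala. The object `SwiftConfigSketch`, its helper names, and the sample values are illustrative assumptions and not part of the notebook. It assumes that the credentials arrive as `key: value` lines and that Swift properties follow the Hadoop naming pattern `fs.swift.service.<name>.<key>`.

```scala
object SwiftConfigSketch {
  // Parse "key: value" lines into a Map; lines without a colon are skipped.
  def parseSettings(text: String): Map[String, String] =
    text.split("\n").toSeq
      .map(_.split(":", 2))
      .collect { case Array(key, value) => key.trim -> value.trim }
      .toMap

  // Build the fully qualified property name for one setting of a Swift service.
  def propertyName(service: String, key: String): String =
    s"fs.swift.service.$service.$key"
}
```

For example, `SwiftConfigSketch.propertyName("spark", "auth.url")` yields `fs.swift.service.spark.auth.url`. Skipping lines without a colon avoids an index error on blank or malformed lines.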

Click the next code cell to set the focus on the cell. Now add the credentials for accessing the data file to this code cell by selecting **Palette>Data Sources** and clicking the `Insert to code` function below the data file in the **Data Source** pane.

When you select the `Insert to code` function, a code cell with a Scala Map is created for you. Adjust the credentials in the Map so that they correspond with the credentials inserted by the `Insert to code` function, and then run the code cell that defines the Map. The credentials in the Map are used later to access the Object Storage instance.

In [3]:
setConfig("spark", credentials_1.toString())

<a id="load_data"></a> 
## Load data

Instead of specifying the schema for a Spark DataFrame programmatically, you can use the `spark-csv` module, which can infer the schema from the data.

Load the NYPD motor vehicle collisions data set from Object Storage directly into a Spark DataFrame:

In [4]:
/*found at https://github.com/databricks/spark-csv*/
import org.apache.spark.sql.SQLContext

val sqlctx = new SQLContext(sc);
import sqlctx.implicits._
sqlctx.setConf("spark.sql.shuffle.partitions", "10");

/* Parentheses allow line continuation, which helps with readability */
val df = (sqlctx.read
    .format("com.databricks.spark.csv")
    .option("delimiter",",")
    .option("header","true")
    .option("inferschema","true")
    .option("mode","DROPMALFORMED")
    .option("treatEmptyValuesAsNulls", "true")
    .load("swift://notebooks.spark/NYPD_Motor_Vehicle_Collisions.csv")
);

println("created df")
println("")

created df
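
To get a feel for what the `DROPMALFORMED` and `treatEmptyValuesAsNulls` options do, here is a simplified sketch in plain Scala. It is not how `spark-csv` is implemented: it splits on commas naively, so it does not handle quoted fields (such as the LOCATION column), and it only checks the number of fields per row. All names in it are hypothetical.

```scala
object CsvModeSketch {
  // Split one CSV line; keep trailing empty fields (limit -1) and map empty values to None.
  def parseLine(line: String): Vector[Option[String]] =
    line.split(",", -1).toVector.map { field =>
      val trimmed = field.trim
      if (trimmed.isEmpty) None else Some(trimmed)
    }

  // Return the header and all rows whose field count matches the header.
  // Rows with too few or too many fields are dropped.
  def parse(lines: Seq[String]): (Vector[String], Seq[Vector[Option[String]]]) = {
    val header = lines.head.split(",", -1).toVector.map(_.trim)
    val rows = lines.tail.map(parseLine).filter(_.length == header.length)
    (header, rows)
  }
}
```

Representing empty values as `None` mirrors the idea of null values in the DataFrame, which later cells remove with `na.drop` or replace with `na.fill`.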


Now that the data is loaded, check the inferred schema:

In [5]:
// printSchema() prints the schema itself and returns Unit, so no println is needed
df.printSchema()
println("")

root
 |-- DATE: string (nullable = true)
 |-- TIME: string (nullable = true)
 |-- BOROUGH: string (nullable = true)
 |-- ZIP CODE: integer (nullable = true)
 |-- LATITUDE: double (nullable = true)
 |-- LONGITUDE: double (nullable = true)
 |-- LOCATION: string (nullable = true)
 |-- ON STREET NAME: string (nullable = true)
 |-- CROSS STREET NAME: string (nullable = true)
 |-- OFF STREET NAME: string (nullable = true)
 |-- NUMBER OF PERSONS INJURED: string (nullable = true)
 |-- NUMBER OF PERSONS KILLED: integer (nullable = true)
 |-- NUMBER OF PEDESTRIANS INJURED: integer (nullable = true)
 |-- NUMBER OF PEDESTRIANS KILLED: integer (nullable = true)
 |-- NUMBER OF CYCLIST INJURED: integer (nullable = true)
 |-- NUMBER OF CYCLIST KILLED: string (nullable = true)
 |-- NUMBER OF MOTORIST INJURED: string (nullable = true)
 |-- NUMBER OF MOTORIST KILLED: integer (nullable = true)
 |-- CONTRIBUTING FACTOR VEHICLE 1: string (nullable = true)
 |-- CONTRIBUTING FACTOR VEHICLE 2: string (nullable

Run the next code cell to look at the data itself:

In [33]:
df.take(1)

Array([03/07/2016,10:40,BROOKLYN,11208,40.6801678,-73.8765734,(40.6801678, -73.8765734),ATLANTIC AVENUE,FOUNTAIN AVENUE,null,1,0,0,0,0,0,1,0,Fatigued/Drowsy,Unspecified,Unspecified,null,null,3401775,TAXI,PASSENGER VEHICLE,TAXI,null,null])

Run the following cell to count the number of vehicle collisions:

In [7]:
println(df.count())
println("")

769053


<a id="load_visualization_packages"></a> 
## Load visualization packages

This notebook makes use of the [Brunel](https://github.com/Brunel-Visualization/Brunel) visualization package. `Brunel` is "a highly succinct and novel language that defines interactive data visualizations based on tabular data." IBM Bluemix adds this library by default, but if you are not on IBM Bluemix, you will need to uncomment the line below in order to add the Brunel JAR file.

In [None]:
//%AddJar -magic https://brunelvis.org/jar/spark-kernel-brunel-all-1.2.jar

<a id="explore_data"></a> 
## Explore data

Now that the data is loaded, you can start exploring it and visualizing patterns by using scatter plots. 

In the next code cell, select the columns with data that you want to explore, for example, NUMBER OF PERSONS INJURED or CONTRIBUTING FACTOR VEHICLE, and transform these columns into a DataFrame.

In [8]:
val trimmedDF = df.filter(df("LATITUDE") !== 0).select("LATITUDE", "LONGITUDE", "DATE", "TIME", "BOROUGH", "ON STREET NAME", "CROSS STREET NAME", 
                                                        "NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED", "CONTRIBUTING FACTOR VEHICLE 1")

In [9]:
//found at http://stackoverflow.com/questions/35592917/renaming-column-names-of-a-data-frame-in-spark-scala
val columnNames = Seq("Latitude", "Longitude", "Date", "Time", "Borough", "On Street", 
                      "Cross Street", "Persons Injured", "Persons Killed", "Contributing Factor")

var collisionsDF = trimmedDF.toDF(columnNames: _*)

//take a subset because Brunel cannot handle a data set this large
collisionsDF = collisionsDF.limit(25000)

### Create an explorative scatter plot of the data

Using an explorative scatter plot is a way to analyze certain characteristics of the data set. 

Create an initial explorative scatter plot of all collisions by using the latitude and longitude information in the raw data:

In [10]:
//creating a new dataframe is not required, but I am trying to minimize the amount of data passed to Brunel
val allCollisions = collisionsDF.select("Latitude", "Longitude", "Borough")

In [None]:
%%brunel
data('allCollisions') x(Longitude) y(Latitude) style('size:20%') interaction(none)
          ::
width=1200, height=600

Although this scatter plot is not a street map of New York City, you will notice that the dots roughly trace the street map of New York City. You can see that very few collisions happen in Central Park and on bridges, and that the density of collisions is higher in areas such as street crossings and curves.

### Enhance the scatter plot with information about city boroughs

Now add information about the city boroughs and use a different color to depict each borough on the scatter plot:

In [11]:
//remove collisions with null values for Borough
val collisionsByBorough = allCollisions.na.drop("any", Seq("Borough"))

In [None]:
%%brunel
data('collisionsByBorough') x(Longitude) y(Latitude) color(Borough) style('size:20%') interaction(none) 
title('Motor Vehicle Collisions in New York City by Borough') axes(x:'Longitude',y:'Latitude')
          ::
width=1200, height=600

### Create a bar graph to visualize patterns

Create a bar graph to show the total number of collisions by borough:

In [12]:
//replace null values with 'NONE' because Brunel will not display null values
val boroughBarDF = collisionsDF.groupBy("Borough").count().sort("count").na.fill("NONE")

In [None]:
%%brunel
data('boroughBarDF') transpose bar x(Borough) y(count) color(Borough) legends(none) sort(count:ascending) interaction(none) 
          ::
width=600, height=300

The bar graph shows that, in this subset of the data, the most collisions happen in Brooklyn and the fewest on Staten Island.
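
The counting step behind the bar graph (group by borough, count, replace missing values, sort by count) can be sketched in plain Scala without Spark. This is only an illustration of the logic. The object `BoroughCountSketch` and the representation of a missing borough as `None` are assumptions made for this example.

```scala
object BoroughCountSketch {
  // Count records per borough, label missing boroughs "NONE",
  // and sort ascending by count (ties are broken by name).
  def countByBorough(boroughs: Seq[Option[String]]): Seq[(String, Int)] =
    boroughs
      .map(_.getOrElse("NONE"))
      .groupBy(identity)
      .map { case (name, group) => name -> group.size }
      .toSeq
      .sortBy { case (name, count) => (count, name) }
}
```

Breaking ties by name keeps the result deterministic, which a sort on the count alone does not guarantee.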

Adjust the scatter plot settings to use color codes to indicate collisions resulting in car body damage, personal injury, and fatal accidents:

In [13]:
import org.apache.spark.sql._
import org.apache.spark.sql.types._

var severityRDD = collisionsDF.select("Latitude", "Longitude", "Persons Injured", "Persons Killed").rdd.map(row => 
    if (row(3) != 0) {
        Row.fromSeq(row.toSeq :+ "fatal accidents")
    } else if (row(2) != "0" && row(3) == 0){
        //may need to change the type of row(2) in the future
        Row.fromSeq(row.toSeq :+ "personal injury")
    } else {
        Row.fromSeq(row.toSeq :+ "car body damage")
    })

var struct =
    StructType(
    StructField("Latitude", DoubleType, true) ::
    StructField("Longitude", DoubleType, true) ::
    StructField("Persons Injured", StringType, true) ::
    StructField("Persons Killed", IntegerType, true) ::
    StructField("Severity", StringType, true) :: Nil)


In [14]:
val severityDF = sqlctx.createDataFrame(severityRDD, struct)

In [None]:
%%brunel
data('severityDF') x(Longitude) y(Latitude) color(Severity) size(Severity:[30%,80%,40%]) interaction(none)
          ::
width=1200, height=600

The resulting scatter plot shows that there are fatal accident hot spots throughout the city. You can see that in some areas car body damage is prevalent, while in other areas personal injuries happen more often.
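
The severity rule used for the color coding can be written as a small standalone function. The following plain Scala sketch is one possible formulation, not the notebook's code: it assumes both counts arrive as strings and treats missing or unparsable values as 0, whereas the cell above compares the raw column values directly. The names `SeveritySketch`, `toCount`, and `classify` are hypothetical.

```scala
import scala.util.Try

object SeveritySketch {
  // Counts may arrive as strings or be missing; treat anything unparsable as 0.
  def toCount(raw: String): Int =
    Try(raw.trim.toInt).getOrElse(0)

  // A fatality takes precedence over an injury; everything else counts as car body damage.
  def classify(injured: String, killed: String): String =
    if (toCount(killed) > 0) "fatal accidents"
    else if (toCount(injured) > 0) "personal injury"
    else "car body damage"
}
```

Checking for fatalities first means that the injury branch does not need to test the number of persons killed again.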

<a id="data_cleaning"></a> 
## Clean and shape the data

After using scatter plots to analyze certain characteristics of the raw data set, you will now learn how to clean and shape the data set to enable more plotting and further analysis. 

Begin by looking at the column names again to better assess which information you can use:

In [15]:
/*
Many of the following charts have fewer data points than the scatter plots above. 
For this reason, I went back to using the entire DataFrame instead of the 25000-row
subset used above.
*/
df.columns

Array(DATE, TIME, BOROUGH, ZIP CODE, LATITUDE, LONGITUDE, LOCATION, ON STREET NAME, CROSS STREET NAME, OFF STREET NAME, NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUMBER OF PEDESTRIANS INJURED, NUMBER OF PEDESTRIANS KILLED, NUMBER OF CYCLIST INJURED, NUMBER OF CYCLIST KILLED, NUMBER OF MOTORIST INJURED, NUMBER OF MOTORIST KILLED, CONTRIBUTING FACTOR VEHICLE 1, CONTRIBUTING FACTOR VEHICLE 2, CONTRIBUTING FACTOR VEHICLE 3, CONTRIBUTING FACTOR VEHICLE 4, CONTRIBUTING FACTOR VEHICLE 5, UNIQUE KEY, VEHICLE TYPE CODE 1, VEHICLE TYPE CODE 2, VEHICLE TYPE CODE 3, VEHICLE TYPE CODE 4, VEHICLE TYPE CODE 5)

Run the next cell to obtain the number of entries with valid information about the street and borough:

In [16]:
val collisions = df.na.drop("any", Seq("ON STREET NAME", "BOROUGH"))
collisions.count()

577477

### Spatial and temporal normalization by using Spark

To obtain a consistent representation of the spatial and temporal information about collisions, you have to normalize the data. Here, normalization means bringing the street names into a canonical form (lowercase, without special characters, and with common abbreviations) and splitting the date and time into separate year, month, day, and hour columns. This step will help you in future analyses.

In [17]:
import org.apache.spark.sql.Row

var collisionsOutArray = Array("Time", "Street", "Borough")
for(col <- df.columns){
    if (!(Array("ON STREET NAME", "OFF STREET NAME", "CROSS STREET NAME", "BOROUGH", "DATE", "TIME") contains col)) {
        collisionsOutArray = collisionsOutArray :+ col
    }
}

val normalizationCode = scala.collection.mutable.HashMap[String, String](
    "avenue" -> "av",
    "ave" -> "av",
    "avnue" -> "av",
    "street" -> "st",
    "road" -> "rd",
    "boulevard" -> "blvd",
    "place" -> "pl",
    "plaza" -> "pl",
    "square" -> "sq",
    "drive" -> "dr",
    "lane" -> "ln",
    "parkway" -> "pkwy",
    "turnpike" -> "tp",
    "terrace" -> "ter",
    "1st" -> "1",
    "2nd" -> "2",
    "3rd" -> "3",
    "1th" -> "1",
    "2th" -> "2",
    "3th" -> "3",
    "4th" -> "4",
    "5th" -> "5",
    "6th" -> "6",
    "7th" -> "7", 
    "8th" -> "8",
    "9th" -> "9",
    "0th" -> "0",
    "west " -> "w ",
    "north " -> "n ",
    "east " -> "e ",
    "south " -> "s "
)

def isalnum(c: Char) : Boolean = {
    if (c.isLetter && c <= 'z') {
        return true
    } else if (c.isDigit) {
        return true
    }
    return false;
}

def normalizeStreet(s:String) : String = {
    // Lowercase
    var str = s.toLowerCase

    // Delete all non-alphanumeric characters
    if (!str.matches("[a-zA-Z0-9]*")) {
        str = str.filter(isalnum(_))
    }

    // Replace common abbreviations
    for (k <- normalizationCode.keys) {
        str = str.replace(k, normalizationCode(k))
    }

    return str
}
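
Two details of the function above are worth knowing. It removes spaces before it applies the replacements, so keys that end with a space (such as `"west "`) can no longer match. In addition, the iteration order of a `HashMap` is not specified, so overlapping keys (such as `"avenue"` and `"ave"`) can be applied in either order. The following plain Scala sketch shows one possible alternative that works word by word. The object `StreetNormalizerSketch` and its reduced abbreviation table are assumptions made for this example, and its results can differ from those of `normalizeStreet`.

```scala
object StreetNormalizerSketch {
  // Word-level abbreviations; a small illustrative subset of the notebook's table.
  private val abbreviations = Map(
    "avenue" -> "av", "ave" -> "av", "street" -> "st", "road" -> "rd",
    "boulevard" -> "blvd", "west" -> "w", "north" -> "n", "east" -> "e", "south" -> "s"
  )

  // Matches ordinals such as "1st", "42nd", or "5th" and captures the number.
  private val ordinal = "^(\\d+)(st|nd|rd|th)$".r

  def normalize(street: String): String =
    street.toLowerCase
      .split("\\s+").toSeq
      .map(_.filter(c => c.isDigit || (c >= 'a' && c <= 'z'))) // drop punctuation
      .filter(_.nonEmpty)
      .map {
        case ordinal(number, _) => number                        // "5th" becomes "5"
        case word               => abbreviations.getOrElse(word, word)
      }
      .mkString // join without spaces, like the notebook's output
}
```

Because every word is looked up as a whole, the order of the entries in the table no longer matters, and different spellings such as `Atlantic Ave` and `ATLANTIC AVENUE` map to the same key.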

In [18]:
def getSpatial(row:Row) : List[Any] = {
    /*
    Computes the location identifier from the input row

    Returns:
        List of spatial key columns
    */
    val rowMap = row.getValuesMap(Array("ON STREET NAME", "BOROUGH"))
    var loc : String = rowMap("ON STREET NAME")
    loc = normalizeStreet(loc)
    var borough : String = rowMap("BOROUGH")
    borough = borough.toLowerCase
    
    return List(loc, borough) 
}

def getTemporal(row:Row) : List[Any] = {
    /*
    Computes the temporal key from a given row. 
    Unlike the Python example, I do not return a Datetime object. 
    I ran into multiple issues with Datetimes, and there seemed to be no real advantage to
    having them since later examples split the date and time into multiple columns again.
    */
    val rowMap = row.getValuesMap(Array("DATE", "TIME"))
    var date : String = rowMap("DATE")
    var time : String = rowMap("TIME")
    
    //return year, month, day, and hour (DATE is formatted as month/day/year)
    return List(date.split("/")(2).toInt, date.split("/")(0).toInt, date.split("/")(1).toInt, time.split(":")(0).toInt)

}

def getRest(row:Row) : List[Any] = {
    /*
    Computes the rest from a given row
    */
    var columns = Seq.empty[String]
    for (field <- row.schema) {
        columns :+= field.name
    }
    val rowMap = row.getValuesMap(columns)
    var rest = List.empty[Any]
    for (col <- columns) {
        if (!(Array("ON STREET NAME", "OFF STREET NAME", "CROSS STREET NAME", "BOROUGH", "DATE", "TIME")  contains col)) {
            rest :+= rowMap(col)
        }
    }
    return rest
}
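
The temporal key can also be computed in isolation, which makes it easy to check how dates and times are split. The following plain Scala sketch assumes the `month/day/year` date format and the `hour:minute` time format that the cell above relies on. The `TemporalKeySketch` object and its `TemporalKey` case class are hypothetical names, and returning `None` for unparsable input is a design choice of this example, not of the notebook.

```scala
import scala.util.Try

object TemporalKeySketch {
  case class TemporalKey(year: Int, month: Int, day: Int, hour: Int)

  // Parse DATE (month/day/year) and TIME (hour:minute) into a key.
  // Any missing or malformed input yields None instead of an exception.
  def parse(date: String, time: String): Option[TemporalKey] =
    Try {
      val hour = time.trim.split(":")(0).toInt
      date.trim.split("/").map(_.toInt) match {
        case Array(month, day, year) => TemporalKey(year, month, day, hour)
      }
    }.toOption
}
```

With the sample record shown earlier, `TemporalKeySketch.parse("03/07/2016", "10:40")` returns `Some(TemporalKey(2016, 3, 7, 10))`.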

In [19]:
def getCollisionsOutStruct(row:Row) : StructType = {
    var struct = StructType(
    StructField("Year", IntegerType, true) ::
    StructField("Month", IntegerType, true) ::
    StructField("Day", IntegerType, true) ::
    StructField("Hour", IntegerType, true) ::
    StructField("Street", StringType, true) ::
    StructField("Borough", StringType, true) :: Nil)
    for (field <- row.schema) {
        if (collisionsOutArray contains field.name) {
            struct = struct.add(field)
        }
    }
    return struct
}

In [20]:
val collisionsOut = collisions.rdd.map(row => Row(getTemporal(row) ::: getSpatial(row) ::: getRest(row):_*))
var row = collisions.take(1)(0)
var struct = getCollisionsOutStruct(row)
var collisionsOutDF = sqlctx.createDataFrame(collisionsOut, struct)

In [21]:
collisionsOutDF.columns

Array(Year, Month, Day, Hour, Street, Borough, ZIP CODE, LATITUDE, LONGITUDE, LOCATION, NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUMBER OF PEDESTRIANS INJURED, NUMBER OF PEDESTRIANS KILLED, NUMBER OF CYCLIST INJURED, NUMBER OF CYCLIST KILLED, NUMBER OF MOTORIST INJURED, NUMBER OF MOTORIST KILLED, CONTRIBUTING FACTOR VEHICLE 1, CONTRIBUTING FACTOR VEHICLE 2, CONTRIBUTING FACTOR VEHICLE 3, CONTRIBUTING FACTOR VEHICLE 4, CONTRIBUTING FACTOR VEHICLE 5, UNIQUE KEY, VEHICLE TYPE CODE 1, VEHICLE TYPE CODE 2, VEHICLE TYPE CODE 3, VEHICLE TYPE CODE 4, VEHICLE TYPE CODE 5)

## Investigate variables

You have to investigate the variables to find out whether they are useful. Begin by plotting one of the contributing factors.

### Contributing factors to collisions

In [22]:
var collisionsByFactor = (collisionsOutDF
                            .groupBy(collisionsOutDF("CONTRIBUTING FACTOR VEHICLE 1").alias("contributingFactor"))
                            .count().sort($"count".desc).limit(24))

In [None]:
%%brunel
data('collisionsByFactor') transpose bar x(contributingFactor) y(count:linear) sort(count:ascending) interaction(none)
          ::
width=800, height=600

Running the code cell above shows you that the contributing factor is unspecified in most cases. However, factors like distraction, failure to yield right-of-way, and fatigue can play a role. You can investigate and plot the other contributing factors by modifying the code above.

### Sorting the vehicle types into groups

The data set has entries for a large number of car types. The following code cell regroups the car types into main categories like auto, bus, truck, taxi or other.

In [23]:
val grouping = scala.collection.mutable.HashMap[String, String](
    "TAXI" -> "Taxi",
    "AMBULANCE" -> "Other",
    "BICYCLE" -> "Other",
    "BUS" -> "Bus",
    "FIRE TRUCK" -> "Other", 
    "LARGE COM VEH(6 OR MORE TIRES)" -> "Truck",
    "LIVERY VEHICLE" -> "Truck",
    "MOTORCYCLE" -> "Other", 
    "OTHER" -> "Other",
    "PASSENGER VEHICLE" -> "Auto",
    "PICK-UP TRUCK" -> "Other",
    "PEDICAB" -> "Other", 
    "SCOOTER" -> "Other",
    "SMALL COM VEH(4 TIRES)" -> "Truck",
    "SPORT UTILITY / STATION WAGON" -> "Auto", 
    "UNKNOWN" -> "Other",
    "VAN" -> "Auto",
    "UNSPECIFIED" -> "Other"
)

In [24]:
def groupVehicle(row : Row) : Row = {
    var rowMap = row.getValuesMap(row.schema.fieldNames)
    var rowSeq = row.toSeq
    var resultSeq = Seq.empty[Any]
    for (field <- row.schema.fieldNames) {
        if (field.startsWith("VEHICLE TYPE CODE")) {
            if (rowMap(field) != null) {
                rowSeq = rowSeq.updated(row.fieldIndex(field), grouping(rowMap(field)))
            }
        }
    }
    for (field <- List("Year", "Month", "Day", "Hour", "Street", "Borough", "NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED")) {
        resultSeq :+= rowMap(field)
    }
    for (vehicle <- List("Auto", "Bus", "Truck", "Taxi", "Other")) {
        resultSeq :+= rowSeq.count(_ == vehicle)
    }
    resultSeq :+= (if (rowMap("NUMBER OF PERSONS INJURED").asInstanceOf[String].toInt > 0) 1 else 0)
    resultSeq :+= (if (rowMap("NUMBER OF PERSONS KILLED").asInstanceOf[Integer] > 0) 1 else 0)
    return Row(resultSeq:_*)
}

def getCollisionsOutCategoriesStruct(row:Row) : StructType = {
    var struct = StructType(Nil)
    for (field <- row.schema) {
        if (List("Year", "Month", "Day", "Hour", "Street", "Borough", "NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED") contains field.name) {
            struct = struct.add(field)
        }
    }
    for (vehicle <- List("Auto", "Bus", "Truck", "Taxi", "Other")) {
        struct = struct.add(StructField(vehicle, IntegerType, true))
    }
    struct = struct.add(StructField("AccidentsWithInjuries", IntegerType, true))
    struct = struct.add(StructField("AccidentsWithDeaths", IntegerType, true))
    return struct
}
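
The per-collision category counts can be illustrated without Spark. The following plain Scala sketch uses a reduced grouping table and falls back to `Other` for vehicle types that are not in the table. That fallback is an assumption of this example: looking up a missing key directly in a Scala `HashMap`, as `grouping(...)` does, throws a `NoSuchElementException`. The object and method names are hypothetical.

```scala
object VehicleGroupSketch {
  val categories = Seq("Auto", "Bus", "Truck", "Taxi", "Other")

  // Small illustrative subset of the notebook's grouping table.
  private val grouping = Map(
    "TAXI" -> "Taxi", "BUS" -> "Bus", "PASSENGER VEHICLE" -> "Auto",
    "VAN" -> "Auto", "LARGE COM VEH(6 OR MORE TIRES)" -> "Truck"
  )

  // Count the vehicles per category for one collision.
  // None stands for an empty vehicle slot and is not counted.
  def countCategories(vehicleTypes: Seq[Option[String]]): Map[String, Int] = {
    val mapped = vehicleTypes.flatten.map(t => grouping.getOrElse(t.trim.toUpperCase, "Other"))
    categories.map(category => category -> mapped.count(_ == category)).toMap
  }
}
```

For the sample record shown earlier (two taxis and one passenger vehicle), this yields two vehicles in `Taxi`, one in `Auto`, and zero in the remaining categories.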

In [25]:
var collisionsOutCategories = collisionsOutDF.map(row => groupVehicle(row))

In [26]:
var row = collisionsOutDF.take(1)(0)
var struct = getCollisionsOutCategoriesStruct(row)
var collisionsOutCategoriesDF = sqlctx.createDataFrame(collisionsOutCategories, struct)
collisionsOutCategoriesDF = collisionsOutCategoriesDF.withColumnRenamed("NUMBER OF PERSONS INJURED", "Injured")
collisionsOutCategoriesDF = collisionsOutCategoriesDF.withColumnRenamed("NUMBER OF PERSONS KILLED", "Killed")

Count the accidents per street, borough, and hour, and sum the vehicle type and severity columns for each of these groups:

In [27]:
var aggregationColumns = scala.collection.mutable.HashMap[String, String]("*"->"count")
for (field <- List("AccidentsWithInjuries", "AccidentsWithDeaths", "Auto", "Bus", "Truck", "Taxi", "Other", "Injured", "Killed")) {
    aggregationColumns(field) = "sum"
}

In [28]:
var collisionsGrouped = collisionsOutCategoriesDF.groupBy("Year", "Month", "Day", "Hour", "Street", "Borough").agg(aggregationColumns.toMap)
for (c <- collisionsGrouped.columns) {
    if (c.startsWith("sum")) {
        collisionsGrouped = collisionsGrouped.withColumnRenamed(c, c.substring(4,c.length -1))
    }
    if (c.startsWith("count")) {
        collisionsGrouped = collisionsGrouped.withColumnRenamed(c, "NumberOfAccidents")
    }
}
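
The renaming loop relies on the aggregated columns having names such as `sum(Auto)`, from which it cuts out the inner column name. As a small illustration of that string handling, here is a plain Scala sketch that uses a regular expression instead of fixed substring positions. The object `AggregateNameSketch` and its `rename` helper are hypothetical and not part of the notebook.

```scala
object AggregateNameSketch {
  // Captures the column name inside "sum(...)".
  private val sumColumn = "sum\\((.+)\\)".r

  def rename(column: String): String = column match {
    case sumColumn(inner)              => inner               // "sum(Auto)" becomes "Auto"
    case c if c.startsWith("count")    => "NumberOfAccidents" // the row count per group
    case other                         => other               // grouping columns stay as they are
  }
}
```

A pattern match makes the three cases explicit and leaves columns that are not aggregates untouched.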

### Determine the streets with the most collisions

Find the top ten streets in New York where the most vehicle collisions occurred. Display the results in a bar graph and as a scatter plot:

In [29]:
var collisionsByStreet = collisionsOutDF.groupBy("Borough", "Street").count().sort($"count".desc).limit(10)

In [None]:
%%brunel
data('collisionsByStreet') transpose bar x(Street) y(count:linear) sort(count:ascending) interaction(none)
          ::
width=800, height=400

Now you can add the information about the top 10 streets into the scatter plot.

In [30]:
var top10Streets = collisionsByStreet.select("Borough", "Street").rdd.map(row => (row(0), row(1))).collect()

def filterStreets(row : Row) : Row = {
    /**
        We only want the top 10 streets to be colored. To accomplish this we
        will remove all street values that are not in the top 10. The nulls will
        all be colored the same in the display generated by Brunel.
    */
    var rowMap = row.getValuesMap(Array("Borough", "Street"))
    var rowSeq = row.toSeq
    if (!(top10Streets contains (rowMap("Borough"),rowMap("Street")))) {
      return Row(rowSeq.updated(row.fieldIndex("Street"), null):_*)
    }
    return Row(rowSeq:_*)
}
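
The two steps used here, finding the streets with the most collisions and blanking out all other street names, can be sketched in plain Scala as follows. The object `TopStreetsSketch`, the `StreetKey` alias, and the use of `Option` instead of null values are assumptions made for this illustration.

```scala
object TopStreetsSketch {
  type StreetKey = (String, String) // (borough, street)

  // Return the n most frequent keys; ties are broken by the key itself.
  def topN(keys: Seq[StreetKey], n: Int): Set[StreetKey] =
    keys
      .groupBy(identity)
      .map { case (key, group) => key -> group.size }
      .toSeq
      .sortBy { case (key, count) => (-count, key) }
      .take(n)
      .map { case (key, _) => key }
      .toSet

  // Keep the street name only if the key is one of the top streets.
  def maskStreet(key: StreetKey, top: Set[StreetKey]): Option[String] =
    if (top.contains(key)) Some(key._2) else None
}
```

Keeping the borough in the key matters because the same street name can occur in more than one borough.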

In [31]:
var top10StreetsRDD = collisionsOutDF.map(row => filterStreets(row))
var top10StreetsDF = sqlctx.createDataFrame(top10StreetsRDD, collisionsOutDF.schema)
//take a subset because Brunel cannot handle a data set this large
top10StreetsDF = top10StreetsDF.limit(25000)

In [None]:
%%brunel
data('top10StreetsDF') x(Longitude) y(Latitude) color(Street) size(Street:[80%,80%,80%,80%,80%,80%,80%,80%,80%,80%,20%]) interaction(none)
          ::
width=1200, height=600

### Determine when the most collisions occurred

Now find out at what time of the day the most accidents occurred and see if you can detect any interesting patterns by running the following cell:

In [32]:
import org.apache.spark.sql.functions._
collisionsGrouped = (collisionsGrouped
                      .select("Bus", "Truck", "Taxi", "Other", "Hour", "Auto")
                      .groupBy("Hour")
                      .agg(sum("Bus").alias("Bus"), sum("Truck").alias("Truck"), 
                           sum("Taxi").alias("Taxi"), sum("Other").alias("Other"), sum("Auto").alias("Auto")))

In [None]:
%%brunel
data("collisionsGrouped") stack bar x(Hour) y(Auto, Taxi, Truck, Bus) color(#series) interaction(none)

This plot shows how collisions are spread across the day, with peaks during the morning and afternoon rush hours. You can see that significantly more collisions occurred during the afternoon rush hour than during the morning rush hour. Also, by far the most collisions involve cars, while buses, taxis, and trucks are involved in accidents far less frequently.
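
The aggregation behind this chart, summing the vehicle counts per hour of the day, can be sketched in plain Scala. The case classes `Collision` and `Totals` and the object `HourlyTotalsSketch` are hypothetical names introduced for this example, and the sample values used to try it out are made up.

```scala
object HourlyTotalsSketch {
  case class Collision(hour: Int, auto: Int, taxi: Int, truck: Int, bus: Int)
  case class Totals(auto: Int, taxi: Int, truck: Int, bus: Int)

  // Sum the vehicle counts of all collisions that share the same hour,
  // and return the result ordered by hour.
  def totalsByHour(collisions: Seq[Collision]): Seq[(Int, Totals)] =
    collisions
      .groupBy(_.hour)
      .map { case (hour, group) =>
        hour -> Totals(group.map(_.auto).sum, group.map(_.taxi).sum,
                       group.map(_.truck).sum, group.map(_.bus).sum)
      }
      .toSeq
      .sortBy { case (hour, _) => hour }
}
```

Sorting by hour gives the bars the same left-to-right order as the hours of the day.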

<a id="summary"></a>
## Summary

This notebook showed you how to analyze motor vehicle accidents based on accident reports for New York and how you can use this information to learn more about the causes of collisions. If you extract this type of information from the data, you can use it to help develop measures for preventing vehicle accidents in accident hotspots.