# Collision Data Exploration

The goal of this exercise is to visualize and understand the data before further processing.

The dataset is available from https://data.lacity.org/A-Safe-City/Traffic-Collision-Data-from-2010-to-Present/d5tf-ez2w. You can try to download it directly using wget. If the connection fails, download the file manually onto your computer and upload it to Colab. If you do so, make sure to name the file 'Traffic_Collision_Data.csv'.

In [None]:
pip install wget

In [None]:
import wget
wget.download('https://data.lacity.org/api/views/d5tf-ez2w/rows.csv?accessType=DOWNLOAD','Traffic_Collision_Data.csv')
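
If you re-run the notebook often, it can help to skip the download when the file is already present (for example after a manual upload). The following is a minimal sketch using only the standard library; `fetch_if_missing` is a hypothetical helper introduced here for illustration, not part of the original exercise.

```python
import os
import urllib.request

# Same download URL as in the cell above.
URL = "https://data.lacity.org/api/views/d5tf-ez2w/rows.csv?accessType=DOWNLOAD"

def fetch_if_missing(path, url=URL):
    """Download `url` to `path` unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper, for illustration only.)
    """
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True
```

Calling `fetch_if_missing('Traffic_Collision_Data.csv')` would then download the file only on the first run.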

## Exploring the tabular data - Manipulating Dataframes

The collision data is time series data stored in tabular format. The cell below loads the csv and displays the column names along with the first few rows of the table.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import datetime
df = pd.read_csv("Traffic_Collision_Data.csv")
df.head()

Use the following cell to get a sense of the size of the dataset.

In [None]:
df.shape

If you're interested in the code, the csv was imported into a [pandas](https://pandas.pydata.org) dataframe. Pandas is a widely used library for working with this kind of data. The `df.info()` function lists the column names, the number of non-null values in each column (a quick way to spot missing data), and the data type of each column. The cell after it, `df.isnull().sum()`, counts the missing values per column directly.

In [None]:
df.info()

In [None]:
df.isnull().sum()
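
Raw counts of missing values are often easier to read as a share of all rows. Here is a minimal sketch on a small made-up frame; the `toy` data and its values are illustrative only and are not taken from the collision dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the collision data.
toy = pd.DataFrame({
    "Victim Age": [25, np.nan, 40, np.nan],
    "Area Name": ["Central", "Hollywood", None, "Van Nuys"],
    "DR Number": [1, 2, 3, 4],
})

# isnull() gives a boolean frame; the mean of each column is the
# fraction of rows where that column is missing.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression, `df.isnull().mean()`, should work on the full dataframe.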

The following cell parses the date columns with `pd.to_datetime`, the datetime format used by many Python libraries, and then keeps only the year of each date.

In [None]:
df['Date Reported'] = pd.to_datetime(df['Date Reported']).dt.year 
df['Date Occurred'] = pd.to_datetime(df['Date Occurred']).dt.year
df.head()
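
To see what the conversion does in isolation, here is a small sketch on made-up date strings. It assumes the dates arrive as `MM/DD/YYYY` text, which is an assumption about the csv rather than something checked here.

```python
import pandas as pd

# Made-up date strings, assumed to be in MM/DD/YYYY format.
dates = pd.Series(["01/15/2019", "12/31/2020", "07/04/2021"])

# Parse into pandas datetimes; an explicit format avoids ambiguous guesses.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

# The .dt accessor exposes date components as integer Series.
years = parsed.dt.year
months = parsed.dt.month
print(years.tolist(), months.tolist())
```

Keeping the full datetime in a separate column (instead of overwriting it with the year) leaves the month and weekday available for later analysis.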

## Number of collisions through time

This chart shows the number of collisions per year from 2010 to 2022.


In [None]:
plt.subplots(figsize=(20, 5))
# We skip 2023 because it doesn't have the entire year's data.
# 'Date Occurred' holds integer years after the conversion above,
# so filter with integers rather than strings.
df1 = df[df['Date Occurred'].between(2010, 2022)]
sns.countplot(x=df1['Date Occurred'])
plt.title('Collisions per year')
plt.show()
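
The numbers behind a chart like this can also be produced as a table. The sketch below uses a made-up Series of years (`toy_years` is illustrative, not real data) and drops the incomplete final year before counting.

```python
import pandas as pd

# Made-up years standing in for the 'Date Occurred' column.
toy_years = pd.Series([2021, 2022, 2022, 2023, 2021, 2022], name="Date Occurred")

# Keep only complete years; between() is inclusive on both ends.
complete = toy_years[toy_years.between(2010, 2022)]

# Count rows per year and order the result chronologically.
per_year = complete.value_counts().sort_index()
print(per_year)
```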

## Location of collisions

In [None]:
df['Premise Description'].value_counts().head(10)

## Collisions by age group

In [None]:
plt.subplots(figsize = (15,7))
sns.countplot(x=df['Victim Age'].sort_values(ascending = False))
plt.title('Collisions by Victim Age') 
plt.xticks(rotation = 90)
plt.show()
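
The plot above counts each individual age. To look at age groups instead, one option is to bin the ages with `pd.cut`. This is a minimal sketch on made-up ages; the bin edges and labels are arbitrary choices made for illustration.

```python
import pandas as pd

# Made-up ages standing in for the 'Victim Age' column.
ages = pd.Series([5, 17, 18, 34, 35, 64, 65, 90], name="Victim Age")

# Hypothetical bin edges; right=False makes each bin include its left edge,
# so 18 falls in "18-34" and 65 falls in "65+".
bins = [0, 18, 35, 65, 120]
labels = ["0-17", "18-34", "35-64", "65+"]
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Count victims per group, ordered from youngest to oldest group.
counts = age_group.value_counts().sort_index()
print(counts)
```

On the real data, `sns.countplot(x=pd.cut(df['Victim Age'], bins=bins, labels=labels, right=False))` should give a bar per group.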

## Collisions by time of day

In [None]:
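import datetime as dt

def convert(x):
    # Parse an 'HH:MM' string into a datetime (the date part defaults to 1900-01-01).
    return dt.datetime.strptime(x, '%H:%M')

def getTime(t):
    # Assumes 'Time Occurred' is 24-hour military time (HHMM) read as an integer,
    # so leading zeros are lost: 5 means 00:05 and 945 means 09:45.
    # Zero-pad back to four digits before splitting into hours and minutes.
    t = str(t).zfill(4)
    return t[:2] + ':' + t[2:]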

In [None]:
df['Time Occurred']= df['Time Occurred'].apply(getTime)

df['Time Occurred']=df['Time Occurred'].apply(convert)
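
The same conversion can be sketched without `apply`, using vectorized pandas string and datetime operations. The example assumes the times are 24-hour `HHMM` values that were read as integers, so that `5` stands for 00:05; the `times` Series is made up for illustration.

```python
import pandas as pd

# Made-up values standing in for the raw 'Time Occurred' column.
times = pd.Series([5, 45, 945, 2359])

# Restore the leading zeros that are lost when HHMM is read as an integer.
padded = times.astype(str).str.zfill(4)

# Parse as hours and minutes; the date part defaults to 1900-01-01.
parsed = pd.to_datetime(padded, format="%H%M")

hours = parsed.dt.hour
minutes = parsed.dt.minute
print(hours.tolist(), minutes.tolist())
```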

In [None]:
# Extract the hour of day (0-23) from each parsed timestamp.
hours = [t.hour for t in df['Time Occurred']]
plt.subplots(figsize=(15, 6))
sns.countplot(x=hours)
plt.title('Collisions by hour of day')
plt.show()