<a href="https://colab.research.google.com/github/jiobu1/DS-Unit-1-Sprint-4-Data-Storytelling-Portfolio-Project/blob/master/Jisha_Obukwelu_Unit_1_Data_Storytelling_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ATLANTA ARREST RECORDS FOR THE PAST DECADE**
# **2009-2019**

#### Loading Libraries

In [0]:
#Import different libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.cm as mplcm
import matplotlib.colors as colors
import seaborn as sns

# Plotly imports
import plotly
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot

from scipy import stats 
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel
from scipy.stats import chisquare
from scipy.stats import normaltest
from scipy.stats import ks_2samp

#### Loading Directories

In [0]:
# #Loading files
# from google.colab import files
# upload = files.upload()

Initially, I worked with this file locally. I then pushed it to GitHub from the command line after installing Git LFS, the extension for versioning large files.

Links:

https://help.github.com/en/github/authenticating-to-github/adding-a-new-ssh-key-to-your-github-account
https://help.github.com/en/github/managing-large-files/working-with-large-files
https://git-lfs.github.com/
https://github.com/git-lfs/git-lfs/releases/tag/v2.10.0

In [0]:
url = "https://raw.githubusercontent.com/jiobu1/DS-Unit-1-Sprint-4-Data-Storytelling-Portfolio-Project/master/COBRA-2009-2019.csv"

In [0]:
arrests = pd.read_csv(url)
                    
# dtype was not passed to read_csv because doing so dropped a lot of the information

#### Exploring Data

In [0]:
arrests.shape

In [0]:
arrests.head()

## Data Cleaning

In [0]:
arrests[['UCR #', 'IBR Code']].nunique()

In [0]:
#Choosing columns
arrests = arrests[['Report Date', 'Occur Date', 'Occur Time',
                   'Beat', 'Location', 'Shift Occurence', 'Location Type', 
                   'UCR Literal', 'UCR #', 'IBR Code', 'Neighborhood', 'NPU', 
                   'Latitude','Longitude']]
arrests.head()

In [0]:
arrests.shape

#### Still cleaning data
* Analyzing how many null values are in the dataframe 

In [0]:
arrests.isnull().sum()
# Not sure if I want to drop the null values just yet
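
Raw null counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up frame (the `toy` data and the `null_share` name are illustrative and not part of the arrests dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the arrests data (values are invented)
toy = pd.DataFrame({
    'Neighborhood': ['Midtown', None, 'Downtown', None],
    'Latitude': [33.78, 33.75, np.nan, 33.76],
})

# isnull().mean() gives the fraction of missing values per column;
# multiplying by 100 turns it into a percentage
null_share = toy.isnull().mean().mul(100).round(1)
print(null_share)
```

A column with only a small share of nulls can usually be left alone until a later step actually needs it.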

## **Exploration**

In [0]:
arrests.head()

In [0]:
arrests.describe()

In [0]:
arrests.describe(exclude  = 'number')

In [0]:
sns.pairplot(arrests, hue = 'UCR Literal');

### **Has crime gone down?**

1.   Crime increase/decrease by years
2.   Specific crimes by year
3.   Crime by the time of year

In [0]:
#Parsing dates into years and months

##### **Creating New Features**

In [0]:
# Using occur date instead of report date or possible date
arrests['Occur Date'].head()

In [0]:
# Assigning to a column of the sliced DataFrame raises a SettingWithCopyWarning,
# so take an explicit copy first and then convert the whole column at once
arrests = arrests.copy()
arrests['Occur Date'] = pd.to_datetime(arrests['Occur Date'])

In [0]:
arrests['Occur_Year'] = arrests['Occur Date'].dt.year
arrests['Occur_Month'] = arrests['Occur Date'].dt.month
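
As a small illustration of the date parsing and the `.dt` accessors used for the new features, here is a sketch on invented strings. The `errors='coerce'` option is an addition for illustration, not something the notebook uses; it shows one way to survive malformed entries.

```python
import pandas as pd

# Made-up date strings, one of them deliberately invalid
toy_dates = pd.Series(['2019-03-15', '2018-11-02', 'not a date'])

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(toy_dates, errors='coerce')

# The .dt accessor exposes the date parts; NaT rows come back as NaN
years = parsed.dt.year
months = parsed.dt.month
```

Rows that fail to parse can then be counted with `parsed.isna().sum()` before deciding whether to drop them.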

#### **Crime increase/decrease by years**

In [0]:
# Not including 2020 
year = arrests['Occur_Year'].value_counts()[:11].to_dict()
year
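
Taking the first 11 entries of `value_counts()` relies on the 11 most frequent years being exactly 2009-2019. A possible alternative, sketched here on invented data, is to select the years explicitly by label (the `toy_years` values are assumptions for illustration):

```python
import pandas as pd

# Made-up years, including stray out-of-range values
toy_years = pd.Series([2009, 2009, 2010, 2019, 2019, 2019, 1916, 2020])

# Sort by year so that a label slice can be taken
counts = toy_years.value_counts().sort_index()

# .loc slicing on a sorted integer index is inclusive at both ends
decade = counts.loc[2009:2019].to_dict()
```

This keeps the selection correct even if a year outside the decade happens to have a high count.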

In [0]:
# sorted by key,  a list of tuples
lists = sorted(year.items()) 

# unpack a list of pairs into two tuples
x, y = zip(*lists) 

# fig size
fig, ax = plt.subplots(figsize = (13, 5))
fig.set_facecolor('white')

# graphing line
plt.plot(x, y, marker = 'o', markerfacecolor='blue')

# labeling graph
plt.text (x = 2012, 
          y = 41000, 
          s = 'Atlanta Crime Statistics', 
          fontsize = 18, 
          fontweight = 'bold')

plt.xlabel('Years', fontsize = 12, fontweight = 'bold')
plt.ylabel('Crime', fontsize = 12, fontweight = 'bold')

# labeling data
for i, txt in enumerate(y):
    plt.annotate(txt, (x[i], y[i]))

plt.xticks (range(2009, 2020, 1))

plt.show()
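
The sort-and-unpack step at the top of the plotting cell can be seen in isolation in this short sketch, which uses invented counts rather than the real yearly totals:

```python
# Invented yearly counts, deliberately out of order
year_counts = {2011: 35, 2009: 40, 2010: 38}

# sorted() on dict items orders the (year, count) tuples by year
pairs = sorted(year_counts.items())

# zip(*pairs) transposes the pairs into two parallel tuples
x, y = zip(*pairs)
```

Here `x` holds the years in ascending order and `y` holds the matching counts, which is the shape `plt.plot` expects.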

From this graph we can see that overall crime has declined, but does this hold across all crime types, or are some crimes becoming more or less prevalent?

### **Have specific crimes gone down?**

#### **Has all crime been declining?**

In [0]:
arrests['UCR Literal'].value_counts()

In [0]:
#Calculating  specific crimes by years
crime_year = pd.crosstab(arrests['Occur_Year'], arrests['UCR Literal'])

crime_year  = crime_year.drop([1916, 1920, 1970, 1973, 1976, 1979, 1980, 1991,
                 1993, 2000, 2001, 2003, 2004, 2005, 2006, 2007, 2008])

In [0]:
chi_squared, p_value, dof, expected = stats.chi2_contingency(crime_year)
print('Chi2:',chi_squared,'\n', 'P-Value:', p_value,'\n', 'DOF:',dof,'\n')

The chi-squared test returns a p-value of 0, meaning the value is too small to be represented. We can therefore reject the hypothesis that crime type is independent of year: the mix of crimes has shifted over the decade. This means that we will have to dig further to see which crimes have increased or decreased over the years.
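
To make the test concrete, here is a minimal sketch of a chi-squared test of independence on a small contingency table. The counts are invented and only illustrate the mechanics; they are not taken from the arrests data.

```python
import numpy as np
from scipy import stats

# Rows are years, columns are crime types; counts are invented for illustration
observed = np.array([
    [120, 80],
    [100, 100],
    [70, 130],
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Degrees of freedom are (rows - 1) * (columns - 1)
print(f'chi2={chi2:.2f}, p={p_value:.4g}, dof={dof}')
```

The `expected` array holds the counts that would be seen if crime type and year were independent, so comparing it with `observed` shows where the largest deviations come from.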

In [0]:
crime_year

In [0]:
# Playing with loop to generate both lines and labels
plt.style.use('fivethirtyeight')
plt.figure(figsize = (10, 8))
plt.gcf().set_facecolor('#f0f0f0')

# Creating line plots
plot_points = []

crimes = crime_year.columns

colors_array = ['#003f5c','#2f4b7c','#665191','#a05195',
          '#d45087','#f95d6a','#ff7c43','#ffa600', 
          'black', 'purple', 'green']

for crime, color in zip(crimes, colors_array):
    points = plt.plot(crime_year[crime], marker='*', color = color, lw=1)
    plot_points.append(points)

plot_points;

# Title, Axes
plt.title ("Crime Rates in Atlanta over the Past Decade",
          fontsize = 16,
          fontweight = 'bold', 
          loc = 'left');

plt.xticks(range(2009, 2020, 1 ))
plt.yticks(range(0, 11000, 1000))

# Axis labels: plt.xlabel and plt.ylabel are functions and must be called, not assigned to
plt.xlabel('Years')
plt.ylabel('Arrests')

#Labeling Lines
text = []

x_coordinates = [2013.1, 2010.1, 2012.1, 2012.5, 2009.5, 2013, 2014.5, 2013.5, 2011.1, 2010.1, 2016.1]
y_coordinates = [2550, 5200, 1000, 4800, 100, 9800, 7000, -10, 400, 2100, 400]
labels = ['Aggravated Assault', 'Auto Theft', 'Burglary - Nonresidential',
          'Burglary - Residential', '', 'Larceny from Vehicle', 'Larceny Non-Vehicle', 
          '', 'Robbery - Commercial', 'Robbery - Pedestrian', 'Robbery - Residential']
rotations = [-2, 6, 3, -17, 0, 5, -16, 0, 0, 2, -1]


for x, y, label, color, rotation in zip(x_coordinates, y_coordinates, labels, colors_array, rotations):
  position = plt.text(x = x, y = y, s = label , color = color, fontsize = 9, weight = 'bold', rotation = rotation)
  text.append(position)

text;

#Two Other Labels
plt.text(x = 2009.5, y = 100, s = 'Homicide', color = colors_array[4], 
         fontsize = 9,  weight = 'bold', backgroundcolor = '#f0f0f0', rotation = 0)
plt.text(x = 2013.5, y = -15, s = 'Manslaughter', color = colors_array[7], 
         fontsize = 9,  weight = 'bold', backgroundcolor = '#f0f0f0', rotation = 0 )


plt.show()

In [0]:
types = pd.DataFrame(arrests.groupby(by = ['Occur_Year','UCR Literal']).size()).unstack('Occur_Year')
types = types.reset_index(level = ['UCR Literal'])
types[(0,'Change')] = (types[(0, 2019)] - types[(0, 2018)])/types[(0, 2018)]*100

In [0]:
types.loc[:,[('UCR Literal', ''),(0,'Change')]]

Overall crime may have been on a downward trajectory, but several types of crime rose from 2018 to 2019: aggravated assault (12%), non-vehicle larceny (4%), homicide (18%), and commercial (9%) and residential (10%) burglary.
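
The percentage change is computed as (2019 count - 2018 count) / 2018 count * 100. A small worked sketch with invented counts (the `toy_counts` frame is an assumption for illustration) shows the arithmetic:

```python
import pandas as pd

# Invented counts for two crime types in two consecutive years
toy_counts = pd.DataFrame(
    {2018: [200, 50], 2019: [224, 45]},
    index=['Aggravated Assault', 'Robbery'],
)

# (new - old) / old * 100 gives the year-over-year change in percent
toy_counts['Change'] = (toy_counts[2019] - toy_counts[2018]) / toy_counts[2018] * 100
```

A positive value means the crime type rose relative to the previous year, and a negative value means it fell.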

### **Neighborhood**

### **Looking at Crime in 2019**
Creating a plotly.scatter_mapbox of crime in Atlanta for 2019

In [0]:
!pip install chart_studio
import chart_studio

In [0]:
condition = arrests['Occur_Year'] == 2019
crime_2019 = arrests[condition]
crime_2019

In [0]:
import plotly.express as px

fig = px.scatter_mapbox(crime_2019, lat='Latitude', lon='Longitude', color='UCR Literal', opacity=1.0)
fig.update_layout(mapbox_style='stamen-terrain')
fig.show()

#### Creating an Interactive Plotly Graph

In [0]:
# username = '' # your username
# api_key =  '' # your api key - go to profile > settings > regenerate key
# chart_studio.tools.set_credentials_file(username=username, api_key=api_key)

In [0]:
# import chart_studio.plotly as py
# py.plot(fig, filename = 'index.html', auto_open=True)

I was not able to do this because my file was too large.

In [0]:
import plotly.io as pio
pio.write_html(fig, file = 'index.html', auto_open=True)

In [0]:
import chart_studio.tools as tls
tls.get_embed('https://plot.ly/~jiobu/1/') #change to your url

Source: https://towardsdatascience.com/how-to-create-a-plotly-visualization-and-embed-it-on-websites-517c1a78568b

#### **Crime by neighborhoods**

In [0]:
# Too many neighborhoods to evaluate
# Will look at top 10

In [0]:
arrests['Neighborhood'].value_counts()[:10]

In [0]:
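# Neighborhoods to keep (the 10 with the most reports)
list_to_remove = ['Downtown','Midtown','Old Fourth Ward','West End','Lenox',
                  'North Buckhead', 'Greenbriar','Vine City', 'Sylvan Hills',
                  'Grant Park']

# Every other neighborhood goes into final_list, which is dropped from the crosstab below.
# dropna() excludes the missing-value entry, which never appears in the crosstab index.
areas = arrests['Neighborhood'].dropna().unique()
final_list = list(set(areas).difference(list_to_remove))
final_list;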

In [0]:
#Checking that the top 10 neighborhoods are not in the final list to be deleted.
'Vine City' in final_list
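
Another way to restrict the analysis to the most frequent neighborhoods, without typing their names by hand, is to take them from `value_counts()` and filter with `isin`. This is a sketch on invented rows (the `toy` frame and the choice of the top 2 are assumptions for illustration):

```python
import pandas as pd

# Invented rows; one neighborhood is missing on purpose
toy = pd.DataFrame({
    'Neighborhood': ['Midtown', 'Midtown', 'Downtown', 'Downtown', 'Downtown',
                     'Grant Park', None],
    'Occur_Year': [2018, 2019, 2018, 2019, 2019, 2019, 2019],
})

# value_counts() sorts by frequency and skips missing values by default
top_n = toy['Neighborhood'].value_counts().index[:2]

# Keep only rows in the most frequent neighborhoods, then tabulate by year
subset = toy[toy['Neighborhood'].isin(top_n)]
by_year = pd.crosstab(subset['Occur_Year'], subset['Neighborhood'])
```

Filtering first means the crosstab only ever contains the columns of interest, so nothing has to be dropped afterwards.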

In [0]:
neighborhood = pd.crosstab(arrests['Neighborhood'], arrests['Occur_Year'])
neighborhood = neighborhood.drop(final_list);

In [0]:
neighborhood = neighborhood.T.drop([1916, 1920, 1970, 1973, 1976, 1979, 1980, 1991,
                                    1993, 2000, 2001, 2003, 2004, 2005, 2006, 2007, 2008])
neighborhood

According to this table, 3 of these top 10 neighborhoods saw a surge in crime in 2019: Grant Park, Midtown, and Old Fourth Ward.

In [0]:
#Plot figure
plt.figure(figsize = (10, 8))
# neighborhood is wide-form (one column per neighborhood), so seaborn draws one line per column
ax = sns.lineplot(data = neighborhood, dashes= False, legend='brief', lw = 1)
plt.legend(fontsize = 10,  loc = 'best')

# Title, Axes
plt.title ("Change in Crime in Specific Neighborhoods",
          fontsize = 16,
          fontweight = 'bold', 
          loc = 'left');

plt.xticks(range(2009, 2020, 1 ))
plt.yticks(range(0, 3500, 500))

# Axis labels
ax.set(xlabel='Years', ylabel='Arrests')

plt.show()

#### **Further Analysis of Crime** 
Looking specifically at neighborhoods where crime has risen in the last year. 

In [0]:
condition = (arrests['Neighborhood']=='Grant Park')|(arrests['Neighborhood']=='Midtown') | (arrests['Neighborhood']=='Old Fourth Ward')
rise = arrests[condition]
rise.head()

In [0]:
#Calculating  specific crimes by years
rise_crime = pd.crosstab(rise['Occur_Year'], rise['UCR Literal'])

rise_crime  = rise_crime.drop([1916, 1976, 1993, 2004, 2007, 2008])

rise_crime

In [0]:
plt.figure(figsize = (10, 8))
# rise_crime is wide-form (one column per crime type), so seaborn draws one line per column
ax = sns.lineplot(data = rise_crime, dashes= False, legend='brief', lw = 1)
ax.legend(loc='upper center', bbox_to_anchor=(1, 1), shadow=True, ncol=1, fontsize = 9)

plt.xticks(range(2009, 2020, 1 ))
plt.yticks(range(0, 1800, 250))

# Axis labels
ax.set(xlabel='Years', ylabel='Arrests')

# Title, Axes
plt.title ("Crime Rise?",
          fontsize = 16,
          fontweight = 'bold', 
          loc = 'left');

plt.text (x = 2008.5, y = 1900,  s = 'Looking at crime in neighborhoods where there has been an increase in the last year', 
          color = 'black', fontsize = 12)

plt.show ()

According to the graph, 8 out of the 10 reported crimes have increased in these neighborhoods in the past year.

#### **Looking Deeper**
Taking a look to see which crimes increased and by how much in these neighborhoods.

In [0]:
condition_year = (rise['Occur_Year'] == 2019) |(rise['Occur_Year'] == 2018)

In [0]:
rise_year = rise[condition_year]
rise_year

In [0]:
test = rise_year[['Occur_Year','Neighborhood','UCR Literal']]
increase = pd.DataFrame(test.groupby(by = ['Occur_Year','UCR Literal']).size()).unstack('Occur_Year')
increaseT = increase.reset_index(level = ['UCR Literal'])
increaseT[(0,'Change')] = (increaseT[(0, 2019)] - increaseT[(0, 2018)])/increaseT[(0, 2018)]*100

In [0]:
increaseT

## **Other Questions**

### **Time**

#### **Finding out if time of year affects crime type**

In [0]:
cut_points = [0, 3, 6, 9, 12]
label_names = ['1-3','4-6','7-9','10-12']
arrests['months_categories'] = pd.cut(arrests['Occur_Month'], cut_points, labels=label_names)
arrests['months_categories'].value_counts()
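
`pd.cut` uses right-closed bins by default, so the cut points `[0, 3, 6, 9, 12]` place months 1-3 in the first bin, 4-6 in the second, and so on. A short sketch on hand-picked months (the `Q1`-`Q4` labels are chosen here for illustration) makes the boundaries visible:

```python
import pandas as pd

# The first and last month of each quarter
months = pd.Series([1, 3, 4, 6, 7, 9, 10, 12])

# Bins are right-closed by default: (0, 3], (3, 6], (6, 9], (9, 12]
quarters = pd.cut(months, [0, 3, 6, 9, 12], labels=['Q1', 'Q2', 'Q3', 'Q4'])
```

Because the left edge of the first bin is open, a value of 0 would fall outside every bin and become NaN, which is harmless here since months start at 1.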

In [0]:
specific_months = pd.crosstab(arrests['Occur_Year'], arrests['months_categories']).drop([1916, 1920, 1970, 1973, 1976, 1979, 1980, 1991,
                                    1993, 2000, 2001, 2003, 2004, 2005, 2006, 2007, 2008])
specific_months = specific_months.T
specific_months

In [0]:
#Plot figure
plt.figure(figsize = (8, 8))
# specific_months is wide-form (one column per year), so seaborn draws one line per column
ax = sns.lineplot(data = specific_months, dashes= False, legend='brief', lw = 2)
ax.legend(loc='upper center', bbox_to_anchor=(1.45, 1), shadow=True, ncol=1, fontsize = 11)

# Title, Axes
plt.title ("Crimes Viewed by Quarter",
          fontsize = 16,
          fontweight = 'bold', 
          loc = 'left')


plt.xticks(np.arange(4), ('Q1', 'Q2', 'Q3', 'Q4'))
      

plt.legend(fontsize = 8, 
           loc = 'upper right')

plt.show()

### **Time of Day** 
Using the 2019 subset, this section looks at how time of day affects the types of crimes that occur.

In [0]:
crime_2019['Occur Time'].value_counts()

I tried several methods to sort this column. The column has a dtype of object, so I first tried to convert it to an hours-and-minutes format using pd.to_datetime(crime_2019['Occur Time'], unit = 'm').dt.strftime('%H:%M'). This converted the column into hours and minutes, but unfortunately I was unable to sort the result. Therefore, I did the next best thing and converted the column to integers so that it could be sorted and used for further exploration.
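
As a sketch of why the integer conversion is enough for this purpose, assume the times are stored as HHMM strings (the `toy_times` values below are invented). Integer division and remainder by 100 then recover the hour and the minutes:

```python
import pandas as pd

# Made-up times stored as HHMM strings
toy_times = pd.Series(['0005', '0930', '1345', '2359'])

# Leading zeros are dropped on conversion: '0005' becomes 5
as_int = toy_times.astype(int)

# Integer division keeps the hour part, the remainder keeps the minutes
hour = as_int // 100
minute = as_int % 100
```

The integers sort in chronological order, which is what the binning by hour below depends on.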



In [0]:
crime_2019['Occur Time'].dtype

In [0]:
crime_2019 = crime_2019.copy()

In [0]:
crime_2019['Occur Time'] = crime_2019['Occur Time'].astype(int)

In [0]:
crime_2019['Occur Time'].dtype

In [0]:
bins = pd.interval_range(start = 0, freq = 100, end = 2400, closed ='left')
crime_2019['time_categories'] = pd.cut(crime_2019['Occur Time'], bins = bins, duplicates = 'raise')

In [0]:
crime_2019['time_categories'].value_counts()

In [0]:
def new_column(x):
  if 0 <= x < 100:
    val = '12:00am'
  elif 100 <= x < 200:
    val = '1:00am'
  elif 200 <= x < 300:
    val = '2:00am' 
  elif 300 <= x < 400:
    val = '3:00am'
  elif 400 <= x < 500:
    val = '4:00am'
  elif 500 <= x < 600:
    val = '5:00am'
  elif 600 <= x < 700:
    val = '6:00am'
  elif 700 <= x < 800:
    val = '7:00am'
  elif 800 <= x < 900:
    val = '8:00am'
  elif 900 <= x < 1000:
    val = '9:00am'
  elif 1000 <= x < 1100:
    val = '10:00am'
  elif 1100 <= x < 1200:
    val = '11:00am'
  elif 1200 <= x < 1300:
    val = '12:00pm'
  elif 1300 <= x < 1400:
    val = '1:00pm'
  elif 1400 <= x < 1500:
    val = '2:00pm' 
  elif 1500 <= x < 1600:
    val = '3:00pm'
  elif 1600 <= x < 1700:
    val = '4:00pm'
  elif 1700 <= x < 1800:
    val = '5:00pm'
  elif 1800 <= x < 1900:
    val = '6:00pm'
  elif 1900 <= x < 2000:
    val = '7:00pm'
  elif 2000 <= x < 2100:
    val = '8:00pm'
  elif 2100 <= x < 2200:
    val = '9:00pm'
  elif 2200 <= x < 2300:
    val = '10:00pm'
  else:
    val = '11:00pm'
  return val
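
The same mapping can also be written arithmetically. The helper below is a hypothetical alternative (the name `hour_label` is not from the notebook) that produces labels in the same style as `new_column`, assuming the input is an HHMM integer between 0 and 2359:

```python
def hour_label(x):
    """Map an HHMM integer (0-2359) to a label such as '1:00pm'."""
    hour = (x // 100) % 24           # hour of day, 0-23
    suffix = 'am' if hour < 12 else 'pm'
    hour_12 = hour % 12 or 12        # hours 0 and 12 both display as 12
    return f'{hour_12}:00{suffix}'


# A few spot checks across the day
for value in (0, 130, 1200, 1345, 2359):
    print(value, hour_label(value))
```

It could be applied in the same way, for example `crime_2019['Occur Time'].apply(hour_label)`, and it labels a time of exactly 0 as '12:00am'.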

In [0]:
crime_2019['Hours'] = crime_2019['Occur Time'].apply(new_column)

#### **Plotly Histogram (used loosely)**

In [0]:
fig2 = px.histogram(crime_2019, x = "Hours", color="UCR Literal")
fig2.update_xaxes({'categoryorder':'array', 'categoryarray':['12:00am', '1:00am','2:00am','3:00am','4:00am', '5:00am', '6:00am', '7:00am','8:00am','9:00am','10:00am', '11:00am', '12:00pm', '1:00pm','2:00pm','3:00pm','4:00pm', '5:00pm', '6:00pm', '7:00pm','8:00pm','9:00pm','10:00pm', '11:00pm']})
fig2.show()

#### **GeoPandas**
Looking at times of crime by neighborhood

Source: https://towardsdatascience.com/geopandas-101-plot-any-data-with-a-latitude-and-longitude-on-a-map-98e01944b972

In [0]:
!pip install geopandas

In [0]:
!pip install descartes

In [0]:
import geopandas as gpd
import pyproj 
from pyproj import CRS
import descartes
import shapely
from shapely.geometry import Point, Polygon

In [0]:
from google.colab import files
upload = files.upload()

In [0]:
shape = gpd.read_file('Cities_Georgia.shp')

In [0]:
df = crime_2019.copy()
crs_4326 = CRS("WGS84")
df.head()

#https://pyproj4.github.io/pyproj/stable/gotchas.html#init-auth-auth-code-should-be-replaced-with-auth-auth-code (+init=<auth>:<auth_code> should be replaced with <auth>:<auth_code> )

In [0]:
geometry = [Point(xy) for xy in zip(df['Longitude'], df['Latitude'])]
geometry[:3]

In [0]:
geo_df = gpd.GeoDataFrame(df, #specify our data
                          crs = crs_4326, #specify our coordinate reference system
                          geometry = geometry) #specify the geometry list we created
geo_df.head()

In [0]:
print(geo_df['Latitude'].min(),geo_df['Latitude'].max() )
print(geo_df['Longitude'].min(),geo_df['Longitude'].max() )

In [0]:
fig, ax = plt.subplots(figsize = (10, 15))

plt.xlim(-84.55, -84.29)
plt.ylim(33.64, 33.88)
shape.plot(ax = ax, alpha = 0.4, color = 'grey' )

# One color per hour, in chronological order
hour_colors = {
    '12:00am': '#a6cee3', '1:00am': '#1f78b4', '2:00am': '#b2df8a', '3:00am': '#33a02c',
    '4:00am': '#fb9a99', '5:00am': '#e31a1c', '6:00am': '#fdbf6f', '7:00am': '#ff7f00',
    '8:00am': '#cab2d6', '9:00am': '#6a3d9a', '10:00am': '#ffff99', '11:00am': '#b15928',
    '12:00pm': '#8dd3c7', '1:00pm': '#ffffb3', '2:00pm': '#bebada', '3:00pm': '#fb8072',
    '4:00pm': '#80b1d3', '5:00pm': '#fdb462', '6:00pm': '#b3de69', '7:00pm': '#fccde5',
    '8:00pm': '#d9d9d9', '9:00pm': '#bc80bd', '10:00pm': '#ccebc5', '11:00pm': '#ffed6f',
}

# Plot each hour as its own layer so that it gets its own color and legend entry
for hour, color in hour_colors.items():
    geo_df[geo_df['Hours'] == hour].plot(ax = ax, markersize = 20, color = color, marker = 'o', label = hour)

ax.legend(loc='upper center', bbox_to_anchor=(1, 1), shadow=True, ncol=1, fontsize = 9)
plt.show()

I wanted to use a shapefile to show crimes by time of day, but I did not find this as useful as the Plotly Express map.

#### **Scatter_mapbox**

In [0]:
fig = px.scatter_mapbox(crime_2019, lat='Latitude', lon='Longitude', color='Hours', opacity=1.0)
fig.update_layout(mapbox_style='stamen-terrain')
fig.show()

This plot, while interactive, conveys less information than the histogram created above from the Hours column of the crime_2019 subset.