# INTRODUCTION

#### Context of data
 Crime incident reports are provided by the Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. This dataset contains records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred. (Records begin on June 14, 2015 and continue to September 3, 2018.)
 
 
#### [Information about Boston](https://en.wikipedia.org/wiki/Boston)



## Content: 

1. [Load and Prepare Data](#1)
     1. [Explanation of Features](#2)
     1. [Missing Value Analysis](#3)    
1. [Exploratory Data Analysis](#4) 
1. [Analysis Objective](#8)
     1. [How has crime changed over the years?](#5)
     1. [Is it possible to predict where or when a crime will be committed?](#7) 
     1. [What can you say about the distribution of different offenses over the city?](#6)
1. [Serious Crime Analysis](#9)



<a id = "1"></a><br>
<font color='Grey'>
## Load and Prepare Data
First, all required libraries are loaded, followed by the 'crime.csv' data set.

In [None]:
!pip install plotly --user

In [None]:
!pip install missingno --user

In [None]:
!pip install cufflinks --user

In [None]:
!pip install folium --user

In [3]:
#data tools:
import numpy as np 
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import pandas_profiling 
#visual tools:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode, plot, iplot
import plotly as py
init_notebook_mode(connected=True) 
import plotly.graph_objs as go
import plotly.tools as tls
import missingno as msno
import cufflinks as cf
import folium
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False) 

#data set:
data = pd.read_csv("crime.csv", engine='python')

In [4]:
data.profile_report()




<a id = "2"></a><br>
<font color='Grey'>
### Explanation of Features
    
<font color='Black'>
    The Pandas Profiling report provides detailed information about each column. I also examine each column below, using the report above.
    
**BOSTON CRIME DATA** includes 17 features that are:

1. **INCIDENT_NUMBER:**  A unique number given to each case.
    (The number of unique incident numbers is lower than the number of rows because a single case can include several offense types; some examples are shown below.)
   **#data.INCIDENT_NUMBER.nunique() : 282517** (As the report above shows, only 88.5% of the values in this column are unique.)
1. **OFFENSE_CODE:**     The code of the crime type; a separate list explains each code. (Because OFFENSE_CODE_GROUP exists, this column will not be used in the analysis.)
1. **OFFENSE_CODE_GROUP:**  The general name of each crime type.
1. **OFFENSE_DESCRIPTION:** Explanation of specific crime.
    (It can be useful for further investigation)
1. **DISTRICT:**     Code of the police district where the crime happened.
    (Because the code alone is not meaningful, it will be mapped to the name of the district.)
1. **REPORTING_AREA:** Number of the area where the crime was reported.
1. **SHOOTING:**    'Y' if the crime involved a shooting.
1. **OCCURRED_ON_DATE:** The date and time of the crime (year, month, day and time).
1. **YEAR:**     2015, 2016, 2017 or 2018.
1. **MONTH:**    The month in which the crime happened.
1. **DAY_OF_WEEK:** The day of the week on which the crime happened.
1. **HOUR:**        The hour at which the crime happened.
1. **UCR_PART:** [Uniform Crime Reporting](https://en.wikipedia.org/wiki/Uniform_Crime_Reports) offense category, as defined by the Federal Bureau of Investigation for reporting data on crimes.
1. **STREET:** The name of the street where the crime happened.
1. **Lat:**    The latitude of the crime location.
1. **Long:** The longitude of the crime location.
1. **Location:** The latitude and longitude of the crime location together.
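
To make the difference between rows and cases concrete, here is a minimal sketch on a toy table. The incident numbers and offense groups in `toy` are invented for illustration and are not taken from 'crime.csv'; only the column names follow the data set.

```python
import pandas as pd

# Toy stand-in for the crime table (all values are invented).
toy = pd.DataFrame({
    "INCIDENT_NUMBER": ["I001", "I001", "I002", "I003", "I003", "I003"],
    "OFFENSE_CODE_GROUP": ["Larceny", "Vandalism", "Towed",
                           "Robbery", "Aggravated Assault", "Larceny"],
})

n_rows = len(toy)                               # one row per recorded offense
n_incidents = toy["INCIDENT_NUMBER"].nunique()  # one entry per case
unique_share = n_incidents / n_rows             # share of rows that start a new case

print(f"{n_incidents} incidents over {n_rows} rows ({unique_share:.1%} unique)")
```

The same three lines applied to `data` should reproduce the unique share reported by the profiling report, assuming the column is named as above.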

## District Names

District codes alone do not make it clear which area they refer to, so I added neighborhood names to analyze the districts.

In [8]:
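# Map each police district code to its neighborhood name.
# Assigning the result back avoids a chained inplace replace,
# which newer pandas versions warn about or silently ignore.
district_names = {'A1' : 'Downtown',
'A15': 'Charlestown',
'A7': 'East Boston',
'B2': 'Roxbury',
'B3': 'Mattapan',
'C6': 'South Boston',
'C11': 'Dorchester',
'D4': 'South End',
'D14': 'Brighton',
'E5': 'West Roxbury',
'E13': 'Jamaica Plain',
'E18':'Hyde Park'}
data['district_name'] = data['DISTRICT'].replace(district_names)
#https://www.boston.gov/departments/police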

In [9]:
data.head()

Unnamed: 0,INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,Location,district_name
0,I182070945,619,Larceny,LARCENY ALL OTHERS,D14,808,,"Sunday, September 2, 2018",2018,9,Sunday,13,Part One,LINCOLN ST,42.357791,-71.139371,"(42.35779134, -71.13937053)",Brighton
1,I182070943,1402,Vandalism,VANDALISM,C11,347,,"Tuesday, August 21, 2018",2018,8,Tuesday,0,Part Two,HECLA ST,42.306821,-71.0603,"(42.30682138, -71.06030035)",Dorchester
2,I182070941,3410,Towed,TOWED MOTOR VEHICLE,D4,151,,"Monday, September 3, 2018",2018,9,Monday,19,Part Three,CAZENOVE ST,42.346589,-71.072429,"(42.34658879, -71.07242943)",South End
3,I182070940,3114,Investigate Property,INVESTIGATE PROPERTY,D4,272,,"Monday, September 3, 2018",2018,9,Monday,21,Part Three,NEWCOMB ST,42.334182,-71.078664,"(42.33418175, -71.07866441)",South End
4,I182070938,3114,Investigate Property,INVESTIGATE PROPERTY,B3,421,,"Monday, September 3, 2018",2018,9,Monday,21,Part Three,DELHI ST,42.275365,-71.090361,"(42.27536542, -71.09036101)",Mattapan


## Duplication

I noticed that some cases appear more than once: a single case can include several offense types, and each type gets its own row. If we count rows instead of case numbers, the number of cases will be overstated.

In [6]:
data['INCIDENT_NUMBER'].value_counts().to_frame()

Unnamed: 0,INCIDENT_NUMBER
I162030584,13
I152080623,11
I172013170,10
I182065208,10
I172096394,10
...,...
I182052638,1
I162037379,1
I172048065,1
I172081978,1


To get the exact number of cases, I will delete all duplicate rows. However, I want to keep the crime types to examine the distribution of crimes. For that reason, the deduplication is done on a copy of the data set.

In [None]:
# check duplication
#case_count[case_count['INCIDENT_NUMBER'] == 'I162030584']

In [None]:
case_count = data.copy()
case_count.sort_values("INCIDENT_NUMBER", inplace=True)  # sort so rows of the same incident are adjacent
case_count.drop_duplicates(subset="INCIDENT_NUMBER", inplace=True)  # keep only the first row of each incident

<a id = "3"></a><br>
<font color='Grey'>
### Missing Value Analysis
    
<font color='Black'>
    
1. The missing value tables are in the Pandas Profiling report above. SHOOTING has the most missing values; a closer look shows that the column is filled only for crimes that involved a shooting. For that reason, shooting cases must be analyzed separately. (Replacing the missing values with 'No' or 'None' could be a wrong assumption, because the data set does not tell us what a missing value means.)
    
1. Secondly, UCR_PART has only 90 missing values. It holds the offense category, so it is related to the other offense columns, yet those columns have no missing values. These 90 rows must therefore be examined closely.
    
1. Thirdly, DISTRICT, REPORTING_AREA, STREET, Lat, Long and Location all describe the crime location. REPORTING_AREA and Location have no missing values, while the others have different numbers of missing values. district_name is the name for the DISTRICT code, and names are more meaningful than codes. DISTRICT has the fewest missing values of these columns (1765/319073, about 0.55%), so to analyze places correctly I will drop all rows with a missing DISTRICT.
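
The missing-value check described above can be sketched on a small toy frame. The rows in `toy` are invented; only the column names come from the data set, and the `missing` table is a hypothetical helper for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the three columns discussed above (all values are invented).
toy = pd.DataFrame({
    "DISTRICT": ["B2", None, "D4", "A1"],
    "SHOOTING": [np.nan, np.nan, "Y", np.nan],
    "UCR_PART": ["Part One", "Part Two", None, "Part Three"],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isna().sum().to_frame("missing")
missing["share"] = missing["missing"] / len(toy)
print(missing)

# Drop only the rows where DISTRICT is missing; other gaps are kept.
cleaned = toy.dropna(subset=["DISTRICT"])
```

Dropping on a `subset` keeps rows whose only gap is in SHOOTING or UCR_PART, which matters here because SHOOTING is empty for almost every row.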
    

In [None]:
#data.isnull().sum() #can be learn missing number this code
#sns.set() #visualzation of missingno library
#msno.bar(data)
#plt.show()

In [None]:
data.dropna(subset=['DISTRICT'], inplace=True)

## 1 Shooting Column Analysis
- Which crimes
- Density of place
- When

        

In [None]:
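x = data[data.SHOOTING == 'Y'] #shooting cases

# value_counts() returns counts and labels in the same order, so each slice keeps its own name
# (pairing value_counts() with unique() can attach counts to the wrong offense group)
shooting_counts = x.OFFENSE_CODE_GROUP.value_counts()
fig = px.pie(values=shooting_counts.values, names=shooting_counts.index, title='Shooting Crimes')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()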

The pie chart above shows that most shooting incidents are recorded as 'Aggravated Assault'.

In [None]:
max_districtname_crime = x['district_name'].value_counts().index[0]
max_street_crime = x['STREET'].value_counts().index[0]
max_year_crime = x['YEAR'].value_counts().index[0]
max_hour_crime = x['HOUR'].value_counts().index[0]
max_month_crime = x['MONTH'].value_counts().index[0]
max_day_crime = x['DAY_OF_WEEK'].value_counts().index[0]

month = ['January','February','March','April','May','June','July',
         'August','September','October','November','December']

print('Street with higher occurrence of shooting crimes:', max_street_crime)
print('Year with highest shooting crime occurrence:', max_year_crime)
print('Hour with highest shooting crime occurrence:', max_hour_crime)
print('Month with highest shooting crime occurrence:', month[max_month_crime-1])
print('Day with highest shooting crime occurrence:', max_day_crime)

Because the number of these crimes is low and the column carries little other information, we can drop the 'SHOOTING' column after this short analysis.

In [None]:
data.drop("SHOOTING", axis=1, inplace = True)

## UCR Part Missing Value Analysis

In [None]:
# Offense groups of the rows where UCR_PART is missing
a = data['OFFENSE_CODE_GROUP'][data['UCR_PART'].isnull()]
# Take labels and counts from the same value_counts() result so they stay matched
missing_ucr_counts = a.value_counts()
fig = px.pie(values=missing_ucr_counts.values, names=missing_ucr_counts.index, title='UCR Part Missing Values')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

The chart above covers the rows with a missing UCR_PART. It clearly shows that 'Home Invasion', 'Investigate Person', 'Human Trafficking' and 'Human Trafficking-Involuntary Servitude' do not have UCR_PART codes. I tried to find the UCR part of 'Home Invasion' online, but I could not find legal information about it, so I skip this step.

However, the data set provides detailed information about the crime types. UCR Part One offenses are known to be quite serious crimes, so they can be analyzed closely without any visible effect from the missing values.

<a id = "4"></a><br>
<font color='Grey'>
## Exploratory Data Analysis

In [None]:
data.describe()

Even though these summary numbers carry little meaning on their own, they show that the variables are in valid ranges: all years are between 2015 and 2018, all hours between 0 and 23, and all months between 1 and 12.

<a id = "5"></a><br>
<font color='Grey'>
## 1. **How has crime changed over the years?**
<font color='Black'>   
The unique incident number will be used to count crimes, because some cases include several crime types but are actually just one case. Additionally, in this question I will examine the offense groups for each year to find the yearly pattern of each crime type.

#### Case Number Count, Yearly

In [None]:
# Pass the column name directly so the axis label mapping applies
fig = px.histogram(case_count, x='YEAR', template='plotly_white', 
                opacity=0.7,log_y=True, labels={'YEAR':'YEARS'} )
fig.update_layout(yaxis_title='Case Number Count', bargap=0.2, showlegend=False)

fig.show()

Because 2015 and 2018 do not have full-year data, their case counts are lower. However, we can compare 2016 and 2017, and 2017 appears to have more cases than 2016.
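
One way to compare partial and full years is to divide each year's case count by the number of months actually observed in that year. The sketch below uses invented rows, and the column `cases_per_month` is a hypothetical name; it is not something I computed on the real data.

```python
import pandas as pd

# Invented rows: one per unique incident, with the year and month it occurred.
toy = pd.DataFrame({
    "YEAR":  [2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016],
    "MONTH": [11,   12,   12,   1,    1,    2,    3,    3,    4],
})

per_year = toy.groupby("YEAR").agg(
    cases=("MONTH", "count"),              # number of incidents in the year
    months_observed=("MONTH", "nunique"),  # how many months have any record
)
per_year["cases_per_month"] = per_year["cases"] / per_year["months_observed"]
print(per_year)
```

In this toy example the raw counts differ by a factor of two, yet the monthly rate is identical, which is the kind of effect a partial year can hide. The first and last months of the real data are themselves partial, so this remains a rough normalization.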

<a id = "10"></a><br>
<font color='Black'>
#### Offense Type Distribution, Yearly

In [None]:
df_2015 = data.loc[data['YEAR'] == 2015]
value_counts1 = df_2015.OFFENSE_CODE_GROUP.value_counts()
crime_counts1 = pd.DataFrame(value_counts1)
crime_counts1 = crime_counts1.reset_index()
crime_counts1.columns = ['Offense Type', '2015']

df_2016 = data.loc[data['YEAR'] == 2016]
value_counts2 =  df_2016.OFFENSE_CODE_GROUP.value_counts()
crime_counts2 = pd.DataFrame(value_counts2)
crime_counts2 = crime_counts2.reset_index()
crime_counts2.columns = ['Offense Type', '2016']

df_2017 = data.loc[data['YEAR'] == 2017]
value_counts3 =  df_2017.OFFENSE_CODE_GROUP.value_counts()
crime_counts3 = pd.DataFrame(value_counts3)
crime_counts3 = crime_counts3.reset_index()
crime_counts3.columns = ['Offense Type', '2017']

df_2018 = data.loc[data['YEAR'] == 2018]
value_counts4 =  df_2018.OFFENSE_CODE_GROUP.value_counts()
crime_counts4 = pd.DataFrame(value_counts4)
crime_counts4 = crime_counts4.reset_index()
crime_counts4.columns = ['Offense Type', '2018']

crime_counts = pd.concat([crime_counts1,crime_counts2,crime_counts3,crime_counts4],axis=1)


#crime_counts

In [None]:
crime_counts.describe()

These descriptive statistics give a clue about the distribution of offense groups in each year. All years have a similar pattern, and the distributions are not symmetric: the mean is higher than the median, which means the distributions are right-skewed.
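
The mean-versus-median reading can be checked directly with the sample skewness. This is a small sketch on invented counts, where a single very frequent offense group pulls the mean upwards.

```python
import pandas as pd

# Invented per-group counts: many small groups and one very large one.
counts = pd.Series([5, 8, 10, 12, 15, 20, 400], name="offense_count")

mean_value = counts.mean()
median_value = counts.median()
skewness = counts.skew()   # positive values indicate a right-skewed distribution

print(f"mean={mean_value:.1f}, median={median_value:.1f}, skew={skewness:.2f}")
```

A mean above the median together with a positive skewness is the pattern described above; on the real data the same calls can be applied to each year column of `crime_counts`.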

In [None]:
data.head()

In [None]:
fig = px.histogram(data, x='OFFENSE_CODE_GROUP', template='plotly_white',opacity=0.7,
                   animation_frame='YEAR' )
fig.update_layout(xaxis_tickangle=60)
padding_top = 200

fig['layout']['sliders'][0]['pad']['t'] = padding_top
fig['layout']['updatemenus'][0]['pad']['t'] = padding_top

fig.show()

While the first graph shows the total number of crimes each year, the second one shows the distribution of crimes each year.
- The number of crimes seemed to be increasing, and this trend broke in 2018. However, our data set does not include the full year of 2018 (it only runs until September 2018). Also, four years are not enough for a yearly trend analysis. For these reasons, we cannot say that the number of cases is decreasing.
- Motor Vehicle Accident Response has the highest count in every year; Medical Assistance and Larceny follow it.
- One remarkable point is that Medical Assistance has an increasing trend: it was the fifth highest in 2015, third in 2016 and 2017, and second in 2018.
- Another remarkable point is that the ranks of 'Vandalism', 'Residential Burglary', 'Larceny From Motor Vehicle', 'Assembly or Gathering Violations' and 'Robbery' decrease year by year.
- Additionally, the ranks of 'Verbal Disputes', 'Fraud', 'Auto Theft Recovery' and 'Property Found' show a remarkable increasing trend.
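
The rank changes listed above can also be computed rather than read off the chart. The following is a minimal sketch on invented rows; `counts`, `ranks` and `rank_change` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Invented rows: one per recorded offense.
toy = pd.DataFrame({
    "YEAR": [2015] * 6 + [2016] * 6,
    "OFFENSE_CODE_GROUP": [
        "Larceny", "Larceny", "Larceny", "Fraud", "Fraud", "Robbery",
        "Larceny", "Larceny", "Larceny", "Fraud", "Robbery", "Robbery",
    ],
})

# Offense groups as rows, years as columns, counts as values.
counts = pd.crosstab(toy["OFFENSE_CODE_GROUP"], toy["YEAR"])

# Rank 1 is the most frequent group within each year.
ranks = counts.rank(ascending=False, method="min").astype(int)

# Positive values mean the group moved down the ranking from 2015 to 2016.
rank_change = ranks[2016] - ranks[2015]
print(ranks)
print(rank_change)
```

Because the table is indexed by offense group, each row compares the same group across years, which is easier to check than side-by-side sorted lists.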



<a id = "7"></a><br>
<font color='Grey'>
## 2.  Is it possible to predict where or when a crime will be committed?

There is a strong association between past and present criminal behaviour (Nagin and Paternoster, 2000). Future events are related to past events, which has helped in the prediction of crime (Johnson and Bowers, 2004). Given the time data and records, **crime prediction can be done with time series and other ML prediction methods.** At this stage, however, we can also use relative frequencies (empirical probabilities) to find the place and time with the most criminal activity. So, first I will find the place with the highest frequency of criminal records, and then the time interval with the highest frequency.
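
As a sketch of the frequency-based approach, the relative frequency of each district, and of each district and hour combination, can serve as an empirical probability. The rows below are invented, and this is only an illustration of the idea, not a forecasting model.

```python
import pandas as pd

# Invented records: where and at which hour each crime happened.
toy = pd.DataFrame({
    "district_name": ["Roxbury", "Roxbury", "Roxbury",
                      "Dorchester", "Dorchester", "Downtown"],
    "HOUR": [18, 18, 11, 18, 9, 12],
})

# Share of all records per district.
district_prob = toy["district_name"].value_counts(normalize=True)

# Share of all records per (district, hour) combination.
joint_prob = toy.groupby(["district_name", "HOUR"]).size() / len(toy)

# The combination with the highest empirical probability.
most_likely = joint_prob.idxmax()
print(district_prob)
print(most_likely, joint_prob.max())
```

These shares only describe the past records; treating them as predictions assumes that the pattern stays stable over time.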

In [None]:
df_2015 = data.loc[data['YEAR'] == 2015]
value_count1 = df_2015.district_name.value_counts()
place_counts1 = pd.DataFrame(value_count1)
place_counts1 = place_counts1.reset_index()
place_counts1.columns = ['district_name', 'count']
place_counts1['year'] = 2015


df_2016 = data.loc[data['YEAR'] == 2016]
value_count2 = df_2016.district_name.value_counts()
place_counts2 = pd.DataFrame(value_count2)
place_counts2 = place_counts2.reset_index()
place_counts2.columns = ['district_name', 'count']
place_counts2['year'] = 2016

df_2017 = data.loc[data['YEAR'] == 2017]
value_count3 = df_2017.district_name.value_counts()
place_counts3 = pd.DataFrame(value_count3)
place_counts3 = place_counts3.reset_index()
place_counts3.columns = ['district_name', 'count']
place_counts3['year'] = 2017

df_2018 = data.loc[data['YEAR'] == 2018]
value_count4 = df_2018.district_name.value_counts()
place_counts4 = pd.DataFrame(value_count4)
place_counts4 = place_counts4.reset_index()
place_counts4.columns = ['district_name', 'count']
place_counts4['year'] = 2018

place_counts = pd.concat([place_counts1,place_counts2,place_counts3,place_counts4],axis=0)

# Lookup tables keyed by neighborhood name. Using map() and assigning the result
# avoids chained inplace replace calls, which newer pandas versions warn about or ignore.
district_codes = { 'Downtown':'A1',
'Charlestown':'A15',
'East Boston':'A7',
'Roxbury':'B2',
'Mattapan':'B3',
'South Boston':'C6',
'Dorchester':'C11',
'South End':'D4' ,
'Brighton':'D14' ,
'West Roxbury':'E5',
'Jamaica Plain':'E13' ,
'Hyde Park':'E18'}
#https://www.boston.gov/departments/police

populations = { 'Downtown':39286,
'Charlestown':16685,
'East Boston':40508,
'Roxbury':76917,
'Mattapan':36480,
'South Boston':35200,
'Dorchester':91982,
'South End':77773,
'Brighton':74997 ,
'West Roxbury':50983,
'Jamaica Plain':37468 ,
'Hyde Park':30631}
#Source: https://en.wikipedia.org/wiki/Boston_Police_Department

median_incomes = { 'Downtown':93484,
'Charlestown':91998,
'East Boston':56961,
'Roxbury':34616,
'Mattapan':45798,
'South Boston':86753,
'Dorchester':58915,
'South End':72022,
'Brighton':61281 ,
'West Roxbury':91763,
'Jamaica Plain':75652 ,
'Hyde Park':65408}
#Source: https://en.wikipedia.org/wiki/Boston_Police_Department

# Names that are not in a lookup table become NaN instead of staying as text
place_counts['DISTRICT'] = place_counts['district_name'].map(district_codes)
place_counts['population'] = place_counts['district_name'].map(populations)
place_counts['median_household_income'] = place_counts['district_name'].map(median_incomes)

#place_counts

#### Maps of Boston's Neighborhoods crime distribution, yearly based

In [None]:
from urllib.request import urlopen
import json
with open('../input/police-districts-boston/Police_Districts (1).geojson') as response:
    geojson = json.load(response)
#geojson["features"][0]
#https://opendata.arcgis.com/datasets/9a3a8c427add450eaf45a470245680fc_5.geojson

In [None]:
fig = px.choropleth_mapbox(place_counts, geojson=geojson, color='count',
                    locations="DISTRICT", featureidkey="properties.DISTRICT",
                   hover_name="district_name", animation_frame=place_counts["year"],color_continuous_scale=px.colors.sequential.Cividis_r,
                           
                           mapbox_style="carto-positron",
                           zoom=9, center = {"lat": 42.35843, "lon": -71.05977},
                           opacity=0.7,
                           labels={'color':'crime count','DISTRICT': 'DISTRICT', 'district_name':'district name'}
                          )

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
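# Restrict to numeric columns; recent pandas versions raise an error on text columns otherwise
place_counts.corr(numeric_only=True)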

In [None]:
place_counts

I looked up the population and the median household income of each district. I expected the number of crimes to be highly correlated with population or income, but the correlations are not high enough to use in further analysis.
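
Even when the correlation is weak, a per-capita rate shows how the ranking of districts can change once population is taken into account. In the sketch below the populations are the figures used above, while the yearly counts are invented for illustration only.

```python
import pandas as pd

# Populations as listed above for three of the districts.
population = pd.Series({"Roxbury": 76917, "Mattapan": 36480, "Charlestown": 16685})

# Invented yearly crime counts (NOT taken from the data set).
count = pd.Series({"Roxbury": 15000, "Mattapan": 9000, "Charlestown": 2000})

# Crimes per 1,000 residents.
rate_per_1000 = (count / population * 1000).round(1)

print(rate_per_1000.sort_values(ascending=False))
print("Highest raw count:", count.idxmax())
print("Highest per-capita rate:", rate_per_1000.idxmax())
```

With these made-up counts the district with the most crimes is not the one with the highest rate per resident, so raw counts and rates answer different questions.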

In [None]:
#place_counts['crime_density_ratio'] = place_counts['count'] / place_counts['population']
place_counts['probability'] = place_counts['count'] / place_counts['count'].sum()
#place_counts.head()

In [None]:
b = place_counts.pivot(index='district_name', columns='year', values=['count','median_household_income'])

In [None]:
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

b.style.apply(highlight_max)

In [None]:
fig = px.line(place_counts, x="year", y="probability", template='plotly_white', color='district_name' ,title='Crime Ratios in Boston')
fig.show()

The map shows the distribution of case numbers among neighborhoods year by year. Roxbury has the highest number in all four years, while Charlestown has the lowest.

The line graph shows the trend of each district, ranked from the highest to the lowest share:

1.  Roxbury
1.  Dorchester & South End (nearly the same)
1.  Downtown & Mattapan (nearly the same)
1.  South Boston
1.  Brighton
1.  Jamaica Plain & Hyde Park (nearly the same)
1.  East Boston & West Roxbury (nearly the same)
1.  Charlestown
 
The density of police officers can be arranged according to this list; no district shows a remarkable change across the years.
 
 Rank of the median household income:
 
1.  Downtown (Central)
1.  Charlestown
1.  West Roxbury
1.  South Boston
1.  Jamaica Plain
1.  South End
1.  Hyde Park
1.  Brighton
1.  Dorchester
1.  East Boston
1.  Mattapan
1.  Roxbury
 
Secondly, I looked up the median household incomes and populations of the neighborhoods. I thought that the size of the population would affect crime numbers. However, the correlation between population and crime numbers is around 50%, which is not enough to justify normalizing by population. Additionally, I thought that the median household income could also affect crime numbers. This correlation is not high enough either. However, when I compare the income ranking with the crime ranking, the two look highly correlated once a few districts are set aside.
There is an interesting point here: because Downtown is the city center, I guess there are lots of workplaces there, which is probably why Downtown has a high crime number even though it has a higher income. Similarly, 'South End', 'South Boston' and 'East Boston' do not seem to follow the income-crime relation (I guess these places also have lots of workplaces). So I checked the correlation without these districts, and the results showed that the remaining districts are highly correlated with income.
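
Since the comparison above is between rankings, a rank correlation (Spearman) may describe it better than the default Pearson correlation. The sketch below computes Spearman as the Pearson correlation of the ranks. The incomes are the figures used above; the counts are invented, so the resulting numbers say nothing about the real data.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        # Median household incomes as listed above.
        "median_household_income": [91998, 91763, 75652, 65408, 45798, 34616],
        # Invented crime counts (NOT taken from the data set).
        "count": [1500, 3000, 4500, 4000, 8000, 12000],
    },
    index=["Charlestown", "West Roxbury", "Jamaica Plain",
           "Hyde Park", "Mattapan", "Roxbury"],
)

# Pearson correlation on the raw values.
pearson = toy["median_household_income"].corr(toy["count"])

# Spearman correlation: Pearson correlation of the ranks.
spearman = toy["median_household_income"].rank().corr(toy["count"].rank())

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

A strongly negative Spearman value would mean that higher-income districts tend to sit lower in the crime ranking, which is the relation described above.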


What if I remove some districts:

In [None]:
d = b.copy()
b.drop(['Downtown','South End', 'South Boston','East Boston'],inplace= True)

In [None]:
b.corr()

In [None]:
sns.heatmap(b.corr())


Because the number of criminal cases is highest in Roxbury, Roxbury has the highest probability of being the place of the next criminal activity. So, in the next stage I will examine the times at which crimes are committed in Roxbury.

### Crime Time Frequency of Roxbury:

In [None]:
dfr = data.loc[data['district_name'] == 'Roxbury']
df16 = dfr.loc[dfr['YEAR'] == 2016]
df3 = df16.loc[df16['MONTH'] == 3]
dfw = df3.loc[df3['DAY_OF_WEEK'] == 'Tuesday']


#### Which year, month, day and hour in specific sequence for Roxbury

In [None]:
fig, ax =plt.subplots(1,4, figsize=(30, 6), sharey=False)
# Pass the column as x=...; recent seaborn versions treat the first positional argument as data
sns.countplot(x=dfr['YEAR'], ax=ax[0])
sns.countplot(x=df16['MONTH'], ax=ax[1])
sns.countplot(x=df3['DAY_OF_WEEK'], ax=ax[2])
sns.countplot(x=dfw['HOUR'], ax=ax[3])
plt.show()

In these graphs I tried to find the busiest specific time interval by drilling down step by step: eleven o'clock on Tuesdays in March 2016 is the interval with the most criminal activity in Roxbury. This could point to a special day or event, but that needs to be investigated.
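
A step-by-step drill-down picks the busiest year first, then the busiest month within that year, and so on. It does not necessarily find the busiest combination overall. The sketch below contrasts the two approaches on invented rows; the names `greedy_slot` and `joint_slot` are hypothetical.

```python
import pandas as pd

# Invented records for one district.
toy = pd.DataFrame({
    "YEAR":  [2016, 2016, 2016, 2016, 2017, 2017, 2017],
    "MONTH": [3,    3,    5,    8,    7,    7,    7],
    "HOUR":  [11,   11,   9,    20,   17,   17,   17],
})

# Drill-down: fix the busiest year, then the busiest month in it, then the hour.
busiest_year = toy["YEAR"].value_counts().idxmax()
in_year = toy[toy["YEAR"] == busiest_year]
busiest_month = in_year["MONTH"].value_counts().idxmax()
in_month = in_year[in_year["MONTH"] == busiest_month]
busiest_hour = in_month["HOUR"].value_counts().idxmax()
greedy_slot = (busiest_year, busiest_month, busiest_hour)

# Joint count: look at every (year, month, hour) combination at once.
joint = toy.groupby(["YEAR", "MONTH", "HOUR"]).size()
joint_slot = joint.idxmax()

print("Drill-down result:", greedy_slot)
print("Busiest combination overall:", joint_slot, "with", joint.max(), "records")
```

In this toy example the drill-down ends in 2016 because that year has more records in total, while the single busiest slot lies in 2017. Comparing both results on the real data would show whether the drill-down answer is also the overall peak.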

#### Frequency of year, month, day and hour total numbers for Roxbury

In [None]:
fig, ax =plt.subplots(1,4, figsize=(30, 6), sharey=False)
# Pass the column as x=...; recent seaborn versions treat the first positional argument as data
sns.countplot(x=dfr['YEAR'], ax=ax[0])
sns.countplot(x=dfr['MONTH'], ax=ax[1])
sns.countplot(x=dfr['DAY_OF_WEEK'], ax=ax[2])
sns.countplot(x=dfr['HOUR'], ax=ax[3])
plt.show()

This set of graphs shows the frequency of total crime numbers. 2016 and August have the highest crime numbers as year and month, while criminal activity is highest on Thursday and at six p.m. as day and hour.

#### Frequency of year, month, day and hour total numbers for all places

In [None]:
fig, ax =plt.subplots(1,4, figsize=(30, 6), sharey=False)
# Pass the column as x=...; recent seaborn versions treat the first positional argument as data
sns.countplot(x=data['YEAR'], ax=ax[0])
sns.countplot(x=data['MONTH'], ax=ax[1])
sns.countplot(x=data['DAY_OF_WEEK'], ax=ax[2])
sns.countplot(x=data['HOUR'], ax=ax[3])
plt.show()

As you can see clearly, Roxbury and the whole data set have nearly the same pattern over time.

In [None]:
max_districtname_crime = dfr['district_name'].value_counts().index[0]
max_street_crime = dfr['STREET'].value_counts().index[0]
max_year_crime = dfr['YEAR'].value_counts().index[0]
max_hour_crime = dfr['HOUR'].value_counts().index[0]
max_month_crime = dfr['MONTH'].value_counts().index[0]
max_day_crime = dfr['DAY_OF_WEEK'].value_counts().index[0]

month = ['January','February','March','April','May','June','July',
         'August','September','October','November','December']
print('For Roxbury:')
print('Street with higher occurrence of crimes:', max_street_crime)
print('Year with highest  crime occurrence:', max_year_crime)
print('Hour with highest  crime occurrence:', max_hour_crime)
print('Month with highest  crime occurrence:', month[max_month_crime-1])
print('Day with highest  crime occurrence:', max_day_crime)

In the previous analysis we found that Roxbury has the highest crime number, so its probability is also the highest. Because of that, we can say that Roxbury is the most likely place of a new crime. In the second stage, we tried to find the most frequent crime time. As the graphs and printed notes above show, most cases happened in 2016, so I chose 2016 to continue the analysis. In Roxbury in 2016, most crimes were committed in March; in March 2016, most were committed on Tuesdays; and on those Tuesdays, most were committed around 11 am. However, as I noted above, this may reflect a specific day or event and needs to be examined.
If we look at the general trend of crime numbers in Roxbury, however, the crime rate increases in the summer months, and rush hours such as 17:00-19:00 have the highest crime rate. So we can say that summer months and rush hours have the highest probability for a new crime. The days of the week are very close to each other, so the day is hard to predict.

<a id = "6"></a><br>
<font color='Grey'>
## 3. What can you say about the distribution of different offenses over the city?

Actually, on a yearly basis we found the answer to [this question above](#10). In this stage I will examine crime numbers without filtering by time interval and try to find the distribution of specific crime types for each neighborhood.

**The 15 most frequent crimes across Boston's neighborhoods**

In [None]:
order = data['OFFENSE_CODE_GROUP'].value_counts().head(15).index
plt.figure(figsize = (30,10))
sns.countplot(data = data, x='OFFENSE_CODE_GROUP',hue=data.district_name, order = order ,palette="cubehelix");
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xticks(rotation=90)
plt.show()

While the first graph shows the fifteen most frequent types of crime in Boston's neighborhoods, the second one focuses on the five most frequent crime groups. Additionally, the second graph shows the crime distribution for each neighborhood more clearly. Generally, the distribution of crime types is similar for each neighborhood.

In [None]:
order = data['OFFENSE_CODE_GROUP'].value_counts().head(5).index
plt.figure(figsize = (30,10))
sns.countplot(data = data, x='OFFENSE_CODE_GROUP',hue='district_name', order = order,palette="cubehelix");
plt.title("The 5 most frequent crimes across Boston's neighborhoods")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.show()

In [None]:
df_d = data.groupby(['district_name','OFFENSE_CODE_GROUP'])['INCIDENT_NUMBER'].count().reset_index()

In [None]:
xy = df_d.pivot(index='OFFENSE_CODE_GROUP', columns='district_name', values=['INCIDENT_NUMBER'])
xy.style.apply(highlight_max)

In [None]:
xy.describe()

In [None]:
plt.figure(figsize = (30,10))
sns.boxplot(x="district_name", y="INCIDENT_NUMBER", data=df_d)

plt.xticks(rotation=90)
plt.show()

These descriptive statistics give a clue about the distribution of offense groups for each neighborhood. The distributions are not symmetric: the mean is higher than the median, which means the distributions are right-skewed.

In [None]:
fig = px.histogram(df_d, x='district_name', y='INCIDENT_NUMBER',color='OFFENSE_CODE_GROUP')
fig.show()

This graph shows the distribution of offense groups among neighborhoods. Because there are so many crime types, the graph looks confusing; the graphs above explain the distribution more clearly. In order, Motor Vehicle Accident Response, Larceny, Medical Assistance and Investigate Person have the highest frequency, and for all these types Roxbury has the highest number. On the other hand, 'Motor Vehicle Accident Response' has the highest frequency in all neighborhoods except Downtown and South End, where 'Larceny' is the most frequent crime.
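
The most frequent offense group per neighborhood can also be read from a table instead of the stacked chart. This is a minimal sketch on an invented frame that has the same columns as `df_d`; the counts are made up.

```python
import pandas as pd

# Invented counts in the same layout as df_d (one row per district and offense group).
df_toy = pd.DataFrame({
    "district_name": ["Roxbury", "Roxbury", "Downtown", "Downtown"],
    "OFFENSE_CODE_GROUP": ["Motor Vehicle Accident Response", "Larceny",
                           "Motor Vehicle Accident Response", "Larceny"],
    "INCIDENT_NUMBER": [120, 90, 60, 150],
})

# For each district, find the row label with the highest count, then select those rows.
top_idx = df_toy.groupby("district_name")["INCIDENT_NUMBER"].idxmax()
top_rows = df_toy.loc[top_idx]

# District name -> most frequent offense group.
top_offense = top_rows.set_index("district_name")["OFFENSE_CODE_GROUP"]
print(top_offense)
```

Applied to `df_d`, the same pattern would list the leading offense group of every neighborhood in one short table.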

In [None]:
order = data['STREET'].value_counts().head(15).index
sns.countplot(data = data, x='STREET', order = order);
plt.xticks(rotation= 90)
plt.show()

I also wanted to mention street frequency. Because Boston has so many streets, I selected the fifteen streets with the most crimes. 'Washington Street' has the highest frequency, but I guess the length of the street and how crowded it is affect this. Crime density can also be calculated per street, but that needs a more detailed analysis.

In [None]:
max_street_crime = data['STREET'].value_counts().index[0]
max_year_crime = data['YEAR'].value_counts().index[0]
max_hour_crime = data['HOUR'].value_counts().index[0]
max_month_crime = data['MONTH'].value_counts().index[0]
max_day_crime = data['DAY_OF_WEEK'].value_counts().index[0]

month = ['January','February','March','April','May','June','July',
         'August','September','October','November','December']

print('Street with higher occurrence of crimes:', max_street_crime)
print('Year with highest crime occurrence:', max_year_crime)
print('Hour with highest crime occurrence:', max_hour_crime)
print('Month with highest crime occurrence:', month[max_month_crime-1], max_month_crime)
print('Day with highest crime occurrence:', max_day_crime)

## Conclusions

In summary, this EDA shows:

Crime rates are low between 1 and 8 in the morning, then rise gradually throughout the day and peak around 6 pm. There is some variation across days of the week, with Friday having the highest crime rate and Sunday the lowest. The month also seems to have some influence: the late-winter and spring months of February-April have the lowest crime rates, and the summer and early fall months of June-October have the highest. There is also a spike in crime rates in January.

The most frequent "crime" in Boston is Motor Vehicle Accident Response, and the second most frequent is Larceny. Guns are mostly used in Aggravated Assault, followed by Homicide. In absolute numbers most crime happens in Roxbury; however, once population is taken into account, Mattapan has the highest density.

Finally, because UCR Part One covers the most serious crime types, I find it valuable to analyze this part more closely:

<a id = "9"></a><br>
<font color='Grey'>
# Serious Crimes (UCR Part One)

In [None]:
dfs = data[data.UCR_PART == 'Part One']

In [None]:
dfs.describe()

Even though these summary numbers carry little meaning on their own, they show that the variables are in valid ranges: all years are between 2015 and 2018, all hours between 0 and 23, and all months between 1 and 12.

In [None]:
fig, ax =plt.subplots(1,5, figsize=(30, 6), sharey=False)
# Pass the column as x=...; recent seaborn versions treat the first positional argument as data
sns.countplot(x=dfs['OFFENSE_CODE_GROUP'], ax=ax[0])
sns.countplot(x=dfs['YEAR'], ax=ax[1])
sns.countplot(x=dfs['MONTH'], ax=ax[2])
sns.countplot(x=dfs['DAY_OF_WEEK'], ax=ax[3])
sns.countplot(x=dfs['HOUR'], ax=ax[4])
plt.show()

In [None]:
dfsL = dfs[dfs.OFFENSE_CODE_GROUP == 'Larceny']

In [None]:
fig, ax =plt.subplots(1,4, figsize=(30, 6), sharey=False)

# Pass the column as x=...; recent seaborn versions treat the first positional argument as data
sns.countplot(x=dfsL['YEAR'], ax=ax[0])
sns.countplot(x=dfsL['MONTH'], ax=ax[1])
sns.countplot(x=dfsL['DAY_OF_WEEK'], ax=ax[2])
sns.countplot(x=dfsL['HOUR'], ax=ax[3])
plt.show()

In [None]:
fig = px.histogram(dfs, x='OFFENSE_CODE_GROUP', template='simple_white',
                   animation_frame='YEAR' )
fig.update_layout(xaxis_tickangle=15)

fig.show()

In [None]:
# Count Part One incidents per district for each year, then stack the yearly tables.
# The year filter is built from dfs itself rather than from the full `data` frame.
yearly_counts = []
for year in [2015, 2016, 2017, 2018]:
    counts = dfs.loc[dfs['YEAR'] == year, 'district_name'].value_counts()
    counts = counts.rename_axis('district_name').reset_index(name='count')
    counts['year'] = year
    yearly_counts.append(counts)

place_countsS = pd.concat(yearly_counts, axis=0)

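# Start each derived column from this table's own district names
# (not from the earlier `place_counts` table, whose row order can differ).
place_countsS['DISTRICT'] = place_countsS['district_name']
place_countsS['population'] = place_countsS['district_name']
place_countsS['median_household_income'] = place_countsS['district_name']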


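# Map each district name to its BPD district code (the key used in the GeoJSON).
# Names missing from the dictionary become NaN.
place_countsS['DISTRICT'] = place_countsS['district_name'].map({
    'Downtown': 'A1',
    'Charlestown': 'A15',
    'East Boston': 'A7',
    'Roxbury': 'B2',
    'Mattapan': 'B3',
    'South Boston': 'C6',
    'Dorchester': 'C11',
    'South End': 'D4',
    'Brighton': 'D14',
    'West Roxbury': 'E5',
    'Jamaica Plain': 'E13',
    'Hyde Park': 'E18'})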
#https://www.boston.gov/departments/police

# Population per district; mapping from the names yields a numeric column.
place_countsS['population'] = place_countsS['district_name'].map({
    'Downtown': 39286,
    'Charlestown': 16685,
    'East Boston': 40508,
    'Roxbury': 76917,
    'Mattapan': 36480,
    'South Boston': 35200,
    'Dorchester': 91982,
    'South End': 77773,
    'Brighton': 74997,
    'West Roxbury': 50983,
    'Jamaica Plain': 37468,
    'Hyde Park': 30631})
#Source: https://en.wikipedia.org/wiki/Boston_Police_Department

# Median household income per district.
place_countsS['median_household_income'] = place_countsS['district_name'].map({
    'Downtown': 93484,
    'Charlestown': 91998,
    'East Boston': 56961,
    'Roxbury': 34616,
    'Mattapan': 45798,
    'South Boston': 86753,
    'Dorchester': 58915,
    'South End': 72022,
    'Brighton': 61281,
    'West Roxbury': 91763,
    'Jamaica Plain': 75652,
    'Hyde Park': 65408})
#Source: https://en.wikipedia.org/wiki/Boston_Police_Department


#place_countsS

In [None]:
# Animated choropleth of Part One crime counts per district and year.
# The animation frame and the colour label refer to columns of place_countsS.
fig = px.choropleth_mapbox(place_countsS, geojson=geojson, color='count',
                           locations='DISTRICT', featureidkey='properties.DISTRICT',
                           hover_name='district_name', animation_frame='year',
                           color_continuous_scale='Inferno',
                           mapbox_style='carto-positron',
                           zoom=9, center={'lat': 42.35843, 'lon': -71.05977},
                           opacity=0.7,
                           labels={'count': 'crime count', 'DISTRICT': 'DISTRICT',
                                   'district_name': 'district name'})

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
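# Correlate only the numeric columns; the text columns cannot be correlated.
place_countsS.select_dtypes('number').corr()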

In [None]:
a = place_countsS.pivot(index='district_name', columns='year', values=['count'])

In [None]:
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

a.style.apply(highlight_max)

Larceny is by far the most common type of serious crime. Serious crimes are most likely to occur in the afternoon and evening, most likely on Friday, and least likely on Sunday. They are most frequent in the summer and early fall and least frequent in the winter (with the exception of January, whose crime rate is closer to that of the summer). There is no obvious connection between major holidays and crime rates. Serious crimes are most common in the city center, especially in the South End and Downtown districts.

Another interesting direction would be to combine this dataset with other data about Boston, such as demographics or even the weather, to investigate which factors predict crime rates across time and space.