<h1 align='center'>Denver Crime Report (2014-Present)</h1>

In [65]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from IPython.display import Markdown as md
%matplotlib inline

## Section 1: Dataset Description and Exploratory Analysis

Read in the data from the [csv url](https://www.denvergov.org/media/gis/DataCatalog/crime/csv/crime.csv) and `denver_offense_codes.csv`.

In [70]:
url = 'https://www.denvergov.org/media/gis/DataCatalog/crime/csv/crime.csv'
df1 = pd.read_csv(url)
df2 = pd.read_csv('denver_offense_codes.csv')

These dataframes can easily be merged on their shared columns; the second dataset, `df2`, is essentially a 'key' for `df1`. We merge them because `df2` contains more readable names for our crime categories and types.

In [71]:
df = df1.merge(df2)
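
Called with no arguments, `merge` joins on every column name the two frames share. A minimal sketch on made-up rows, assuming the key columns are `OFFENSE_CODE` and `OFFENSE_CODE_EXTENSION` (the real files may share more columns than this), names the keys explicitly and asks pandas to check that each code appears only once in the lookup table:

```python
import pandas as pd

# Toy stand-ins for df1 (incidents) and df2 (offense-code key); values are made up.
incidents = pd.DataFrame({
    'INCIDENT_ID': [1, 2, 3],
    'OFFENSE_CODE': [2399, 2399, 1313],
    'OFFENSE_CODE_EXTENSION': [0, 0, 0],
})
codes = pd.DataFrame({
    'OFFENSE_CODE': [2399, 1313],
    'OFFENSE_CODE_EXTENSION': [0, 0],
    'OFFENSE_CATEGORY_NAME': ['Larceny', 'Other Crimes Against Persons'],
})

# Name the (assumed) key columns and let pandas verify that the
# right-hand table has at most one row per key.
merged = incidents.merge(
    codes,
    on=['OFFENSE_CODE', 'OFFENSE_CODE_EXTENSION'],
    how='left',
    validate='many_to_one',
)
print(merged)
```

If the lookup table ever contained a duplicated code, `validate='many_to_one'` would raise an error instead of silently duplicating incident rows.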

In [74]:
rows, columns = df.shape

In [78]:
md("We now have a dataset, `df`, containing %i rows and %i columns of data - that's %i datapoints!"%(rows, columns, rows*columns))

We now have a dataset, `df`, containing 510871 rows and 21 columns of data - that's 10728291 datapoints!

### Dataset Description
The dataset `df` contains information about criminal and traffic incidents taking place in the city and county of Denver, CO reported to the police in the timeframe 1/2/2014 - present.

A general description of each column follows:
- `INCIDENT_ID`: identifier for an incident of an offense or multiple offenses (root of `OFFENSE_ID`)
- `OFFENSE_ID`: identifier for a singular offense from an incident
- `OFFENSE_CODE`: codified value for particular `OFFENSE_TYPE_ID`, `OFFENSE_CATEGORY_ID`, `OFFENSE_TYPE_NAME`, and `OFFENSE_CATEGORY_NAME` (primarily serves as part of the key between `df1` and `df2` above)
- `OFFENSE_CODE_EXTENSION`: extension to `OFFENSE_CODE`, serves similar purpose
- `OFFENSE_TYPE_ID`: a descriptive name for type of offense committed, more specific than `OFFENSE_CATEGORY_ID` (in dash separated format - 'stolen-property-possession')
- `OFFENSE_CATEGORY_ID`: a more general categorical name for the type of offense committed (dash separated format)
- `FIRST_OCCURRENCE_DATE`: the date and time the incident first occurred
- `LAST_OCCURRENCE_DATE`: the date and time the incident ended
- `REPORTED_DATE`: the date and time the incident was reported
- `INCIDENT_ADDRESS`: the street address where the incident took place (if applicable)
- `GEO_X`: the 'easting' value of the location of an incident in the Colorado Central (C-0502) zone of the State Plane Coordinate System (SPCS)
- `GEO_Y`: the 'northing' value of the location of an incident in the Colorado Central (C-0502) zone of the State Plane Coordinate System (SPCS)
- `GEO_LON`: the longitude location of the incident
- `GEO_LAT`: the latitude location of the incident
- `DISTRICT_ID`: police districts for the city of Denver, sectioning of the districts can be seen [here](https://www.denvergov.org/content/denvergov/en/police-department/police-stations.html)
- `PRECINCT_ID`: police precincts for the city of Denver, sectioning of the precincts can be seen [here](https://www.denvergov.org/content/dam/denvergov/Portals/720/documents/maps/Citywide_Map.pdf)
- `NEIGHBORHOOD_ID`: the name of the neighborhood in dash format
- `IS_CRIME`: if offense is criminal this value is 1, otherwise 0
- `IS_TRAFFIC`: if offense is traffic related this value is 1, otherwise 0
- `OFFENSE_TYPE_NAME`: a descriptive name for type of offense committed, more specific than `OFFENSE_CATEGORY_NAME` (in normal phrase format - 'Possession of stolen property')
- `OFFENSE_CATEGORY_NAME`: a more general categorical name for the type of offense committed (normal phrase format)

The source for the dataset, as well as additional documentation, can be found at the [Denver Open Data Catalog](https://www.denvergov.org/opendata/dataset/city-and-county-of-denver-crime).

More specific definitions of the types/categories of crime can be found [here](https://www.denvergov.org/media/gis/DataCatalog/crime/pdf/NIBRS_Crime_Types.pdf).

### Dataset Manipulation

In [57]:
#convert date and time columns to datetime dtype
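#assign with df[col] so the column takes on the new datetime dtype (newer pandas keeps the old dtype with df.loc[:, col] = ...)
df['FIRST_OCCURRENCE_DATE'] = pd.to_datetime(df['FIRST_OCCURRENCE_DATE'], format="%m/%d/%Y %I:%M:%S %p")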

In [60]:
df['REPORTED_DATE'] = pd.to_datetime(df['REPORTED_DATE'], format="%m/%d/%Y %I:%M:%S %p")
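
The `format` string above describes a month/day/year date followed by a 12-hour time with an AM/PM marker. A small sketch on made-up timestamps (not rows from the dataset) shows how that layout is parsed:

```python
import pandas as pd

# Made-up timestamps in the same month/day/year 12-hour layout as the dataset.
raw = pd.Series(['6/15/2016 11:31:00 PM', '1/2/2014 12:05:00 AM'])
parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p')

# %I reads the 12-hour clock value and %p shifts it into 24-hour time.
print(parsed.dt.hour.tolist())
```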

In [None]:
df.info()

In [None]:
#convert offense category names to category type (save memory)
df['OFFENSE_CATEGORY_NAME'] = df.OFFENSE_CATEGORY_NAME.astype('category')

In [None]:
#remove columns with incomplete data
df = df.dropna(axis=1)

In [None]:
#check out how the dataframe is looking
df.info(memory_usage='deep')
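
To see why the category conversion saves memory, here is a rough sketch on a made-up column of repeated labels. A categorical column stores each distinct label once plus a small integer code per row:

```python
import pandas as pd

# A made-up column with a few long labels repeated many times.
labels = pd.Series(['Theft from Motor Vehicle', 'Public Disorder', 'Larceny'] * 10000)

# Compare the deep memory footprint of plain Python strings vs. categories.
as_object = labels.astype(object).memory_usage(deep=True)
as_category = labels.astype('category').memory_usage(deep=True)
print(as_object, as_category)
```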

In [None]:
#latest accurate date for the dataset is said to be 30 days prior to update date
latest_date = pd.to_datetime('August 26, 2019')
acc_date = latest_date - pd.Timedelta(30, unit='d')

In [None]:
#eliminate rows that occurred too recently to be deemed 'accurate'
df = df[df.FIRST_OCCURRENCE_DATE < acc_date]

In [None]:
#graph the total incidents by their category name
df.OFFENSE_CATEGORY_NAME.value_counts().plot(kind='barh', legend=False, figsize=(10, 5))
plt.gca().invert_yaxis()
plt.show()

In [None]:
#eliminate 'Traffic Accident' rows --> not crime
df = df[df.IS_CRIME == 1]

In [None]:
#create a semi-flexible horizontal graph function for our dataset df
def graph_valcts_by_col(column, column2, val=None):
    #plot every unique value of `column` unless a single value is requested
    values = df[column].unique() if val is None else [val]
    for value in values:
        category = df[df[column] == value][column2].value_counts()
        if len(category) > 20:
            #keep the 19 largest counts and lump everything else into one bar
            top = category.nlargest(19)
            top['All others'] = category.sum() - top.sum()
            category = top
        category.plot(kind='barh', legend=False, figsize=(10,5), title='{} - Value Counts'.format(value))
        plt.xlabel('Number of Offenses')
        plt.ylabel(column2)
        plt.gca().invert_yaxis()
        plt.show()
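
The 'All others' bar is meant to hold everything outside the plotted top values. A sketch on made-up counts, computing the remainder by subtraction so that no value falls between the top slice and the rest:

```python
import pandas as pd

# Made-up value counts: 25 offense types with counts 25, 24, ..., 1.
category = pd.Series(range(25, 0, -1), index=['type_{}'.format(i) for i in range(25)])

# Keep the 19 largest counts and fold the remaining 6 into one extra entry.
top = category.nlargest(19)
top['All others'] = category.sum() - top.sum()
print(top.tail(3))
```

Because the remainder is derived from the totals, the plotted bars always add up to the full count for the category.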

In [None]:
#get rid of redundant columns (mostly useful just for merging of initial datasets)
df = df.drop(['OFFENSE_TYPE_ID', 'OFFENSE_CATEGORY_ID', 'OFFENSE_ID', 'OFFENSE_CODE', 'OFFENSE_CODE_EXTENSION'], axis=1)

In [None]:
#quick look at overview of data
df.describe(include='all')

In [None]:
df.loc[:, 'NEIGHBORHOOD_ID'] = df.NEIGHBORHOOD_ID.str.replace('-', ' ').str.title()

In [None]:
df.rename(columns={'NEIGHBORHOOD_ID': 'NEIGHBORHOOD_NAME'}, inplace=True)

In [None]:
graph_valcts_by_col('OFFENSE_CATEGORY_NAME', 'OFFENSE_TYPE_NAME')

In [None]:
df.OFFENSE_CATEGORY_NAME.value_counts()
#'Traffic Accident' still appears here, likely with a count of 0: a categorical column keeps categories that no longer have any rows
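
A likely explanation for the lingering 'Traffic Accident' entry: a categorical column remembers every category it was created with, even after the matching rows are filtered out. A sketch on made-up data (not the real column):

```python
import pandas as pd

# Made-up categorical column standing in for OFFENSE_CATEGORY_NAME.
cats = pd.Series(['Larceny', 'Traffic Accident', 'Larceny', 'Arson'], dtype='category')

# Filtering rows does not shrink the list of known categories...
crimes_only = cats[cats != 'Traffic Accident']
counts_before = crimes_only.value_counts()

# ...so the unused label is reported with a count of zero until it is removed.
counts_after = crimes_only.cat.remove_unused_categories().value_counts()
print(counts_before)
print(counts_after)
```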

In [None]:
#MULT_COUNTS = pd.DataFrame(df.INCIDENT_ID.value_counts())
#cols = ['MULT_COUNTS']
#MULT_COUNTS.columns = cols
#df = df.join(MULT_COUNTS, how='outer')

In [None]:
#df.loc[:, 'MULT_INCIDENT'] = df.TOTAL_INCIDENT_COUNTS > 1

In [None]:
df.describe(include='all')

In [None]:
#decent looking dataframe - questions of interest?
#What are the most prominent types of crime in Denver?
df.OFFENSE_CATEGORY_NAME.value_counts().nlargest(14).plot(kind='barh', figsize=(10, 5), title='Denver Crime by Category')
plt.gca().invert_yaxis()
plt.xlabel('Number of Offenses')
plt.ylabel('Categories')
plt.tight_layout()
plt.show()

In [None]:
#'All Other Crimes' and 'Other Crimes Against Persons' don't seem very descriptive - drilling in should be useful
graph_valcts_by_col('OFFENSE_CATEGORY_NAME', 'OFFENSE_TYPE_NAME', val='All Other Crimes')
graph_valcts_by_col('OFFENSE_CATEGORY_NAME', 'OFFENSE_TYPE_NAME', val='Other Crimes Against Persons')

In [None]:
#Seems like Denver's categorization of crime could use an update, not sure why Assault doesn't have its own category, also
#wonder why 'Traffic offense - other' isn't a category in and of itself - it certainly contains enough occurrences to warrant
#the change, in fact it appears there are a few more traffic related crimes on this list, we can get an idea of all others 
#using a quick value count
df[df.OFFENSE_CATEGORY_NAME == 'All Other Crimes'].OFFENSE_TYPE_NAME.value_counts()[20:]

In [None]:
#looks like even more assault is hidden within this 'All Other Crimes' category...
#we can look into the most criminal neighborhoods now
plt.subplot(1,2,1)
df.NEIGHBORHOOD_NAME.value_counts().nlargest(15).plot(kind='barh', figsize=(12,5), title='Denver - Fifteen Most Criminal Neighborhoods')
plt.xlabel('Number of Offenses')
plt.ylabel('Neighborhood')
plt.gca().invert_yaxis()
plt.tight_layout()

plt.subplot(1,2,2)
df[df.OFFENSE_TYPE_NAME != 'Traffic offense - other'].NEIGHBORHOOD_NAME.value_counts().nlargest(15).plot(kind='barh', figsize=(12,5), title='Denver - Fifteen Most Criminal Neighborhoods - No Traffic')
plt.xlabel('Number of Offenses')
plt.gca().invert_yaxis()
plt.tight_layout()

plt.show()

In [None]:
#Taking 'Traffic offense - other' out of the equation moves Capitol Hill up two ranks - above both Stapleton and Montbello. We
#also see Gateway Green Valley Ranch swap with West Colfax, Union Station moves up two positions and East Colfax swaps with 
#Civic Center. What will we see when we sort neighborhoods by category type?
plt.figure(figsize=(12,32))
graph_ct = 1
for value in df.OFFENSE_CATEGORY_NAME.unique():
    plt.subplot(7, 2, graph_ct)
    df[df.OFFENSE_CATEGORY_NAME == value].NEIGHBORHOOD_NAME.value_counts().nlargest(15).plot(kind='barh',
                                                                                             title='{} - Top 15 Neighborhoods'.format(value))
    plt.ylabel('Neighborhood')
    plt.xlabel('Number of Offenses')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    graph_ct += 1
plt.show()                                                                                             

In [None]:
ranks = pd.DataFrame(index=df.NEIGHBORHOOD_NAME.unique())

In [None]:
for value in df.OFFENSE_CATEGORY_NAME.unique():
    #rank neighborhoods within this category (1 = most offenses) and name the result after the category
    x = df[df.OFFENSE_CATEGORY_NAME == value].NEIGHBORHOOD_NAME.value_counts().rank(ascending=False).rename(str(value))
    ranks = ranks.join(x, how='outer')

In [None]:
ranks.loc[:, 'Total Score'] = ranks.sum(axis=1)

In [None]:
#rank the fifteen neighborhoods with the lowest (best) total score
comp_top15 = ranks['Total Score'].nsmallest(15).rank().to_frame('cat_rank')
#rank the fifteen neighborhoods with the most offenses overall
occ_rank = df.NEIGHBORHOOD_NAME.value_counts().nlargest(15).rank(ascending=False).rename('occ_rank')
comp_top15 = comp_top15.join(occ_rank)
comp_top15['occ-cat_diff'] = comp_top15.occ_rank - comp_top15.cat_rank

In [None]:
comp_top15

In [None]:
ranks.loc[comp_top15.index]

From this quick ranking score, we can find the most criminal neighborhoods by their prevalence across crime categories rather than by the raw counts of crime. This is useful for comparing which categories of crime stand out in a particular neighborhood, rather than just the number of occurrences.
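
As a toy illustration of the rank-sum score (made-up counts for three hypothetical neighborhoods, not the Denver data), the neighborhood with the most offenses overall is not necessarily the one with the best (lowest) total score:

```python
import pandas as pd

# Made-up offense counts for three neighborhoods across two categories.
counts = pd.DataFrame(
    {'Larceny': [50, 30, 10], 'Arson': [1, 5, 3]},
    index=['A', 'B', 'C'],
)

# Rank within each category (1 = most offenses), then add the ranks up.
toy_ranks = counts.rank(ascending=False)
toy_ranks['Total Score'] = toy_ranks.sum(axis=1)
print(toy_ranks.sort_values('Total Score'))
```

Here neighborhood A has the most offenses (51), but B gets the lowest total score because it ranks near the top in both categories.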

### Thoughts on some Neighborhoods
1. Five Points holds the top spot for nine of the fourteen categories. Only one category, White Collar Crime, ranks worse than fifth. This shows that Five Points not only has the most occurrences of crime, but that nearly every category of crime is prevalent there.
2. Montbello may not have as much overall crime as CBD or Stapleton, but it ranks higher categorically than both. This indicates more variation in the types of crime in Montbello. Twelve of its fourteen categories rank in the top five, and all fourteen rank in the top fourteen among all neighborhoods.
3. Capitol Hill is similar to Montbello, but its mix of crime looks different (it ranks higher in Larceny, Drug and Alcohol, and Sexual Assault). Thirteen of the fourteen categories land in the top fifteen; the only exception is Arson (where Montbello ranks first).
4. Stapleton, like Capitol Hill, has thirteen of the fourteen categories in the top fifteen, with the exception of Arson. However, certain categories are more prevalent than others. Five of the fourteen categories land in the top five among all neighborhoods: Larceny (rank 1), Theft from Motor Vehicle (3), Auto Theft (4), White Collar Crime (2), and Burglary (2). Based on this, Stapleton's main categories of crime appear to be theft related.
5. East and West Colfax have almost identical scores; it could be of interest to see how closely they are related to one another.

In [None]:
#select the two rows with a list (a tuple would be read as a single MultiIndex key)
ranks.loc[['West Colfax', 'East Colfax'], 'All Other Crimes':'Arson'].transpose().plot(kind='bar', figsize=(12,5))
plt.tight_layout()
plt.show()

Their rankings certainly appear to be quite similar. This is interesting because these neighborhoods are actually about 8 miles apart; they just center around the same street, Colfax Ave. Drug and Alcohol crime appears to be more prevalent in East Colfax than West Colfax, as is robbery. However, larceny and theft from motor vehicle are more prevalent in West Colfax. Perhaps most interesting is that these two neighborhoods both rank so highly in arson: West Colfax is second and East Colfax is third.

In [None]:
ranks.loc[comp_top15.index]

6. Gateway / Green Valley Ranch (along with West Colfax) had the greatest increase in position. This makes sense: it ranks very highly in numerous categories, but because one or two categories rank lower (especially 'All Other Crimes', which is offense heavy), its number of total offenses is not as high. Even though the majority of its categories rank in the top fifteen, 'All Other Crimes' and especially 'Drug & Alcohol' rank much lower than most of the ranks in this slice of the data. Could this be because it is much farther from downtown Denver than most of these neighborhoods? I wonder if there is a correlation between location and category of crime - this could be a hint that there is.
7. CBD (Central Business District) is right in the heart of downtown Denver, so its ranks mostly make sense. Some of these crime categories require certain personal property to be present (Theft from Motor Vehicle - car, Auto Theft - car, Burglary - house), which would explain why CBD ranks lower in them. These ranks aren't very surprising; the higher law enforcement presence in this area would probably also have an effect on many of these categories.
8. Perhaps the most interesting neighborhood on this list is Northeast Park Hill, whose highest ranking is in Murder. It has some top fifteen categories, but eight of the fourteen categories rank outside the top fifteen. I do wonder why the murder rank is so much higher here though.

In [None]:
df[(df.NEIGHBORHOOD_NAME == 'Northeast Park Hill') & (df.OFFENSE_CATEGORY_NAME == 'Murder')].sort_values('FIRST_OCCURRENCE_DATE')

In [None]:
df[(df.NEIGHBORHOOD_NAME == 'Northeast Park Hill') & (df.OFFENSE_CATEGORY_NAME == 'Murder')].shape

It looks like seventeen murders have occurred in Northeast Park Hill since the beginning of 2014. I wonder why murder ranks so high here compared to the other crime categories.

In [None]:
#What about the most dangerous/violent neighborhoods?
#There are a few ideas on what violent crime includes, for the purpose of this project the FBI's UCR classification will be
#used, the definition is as follows:

In the FBI’s Uniform Crime Reporting (UCR) Program, violent crime is composed of four offenses:  murder and nonnegligent manslaughter, forcible rape, robbery, and aggravated assault. Violent crimes are defined in the UCR Program as those offenses which involve force or threat of force.

In [None]:
#In order to create a DataFrame called 'v_crime' we need to find the relevant rows from the original DataFrame. This will likely
#require some filtering and value counting to ensure we get the relevant rows. Murder should be relatively easy, since it has
#the fewest occurrences.

In [None]:
murder = df[df.OFFENSE_CATEGORY_NAME == 'Murder']
#checked for manslaughter in OFFENSE_TYPE_NAME, but did not return any results

In [None]:
def cat():
    return df.OFFENSE_CATEGORY_NAME.value_counts()

In [None]:
df[(df.OFFENSE_CATEGORY_NAME == 'Sexual Assault')].OFFENSE_TYPE_NAME.value_counts()

From this we can see that this is all Rape/sexual assault, so we should make a DataFrame for it. But first let's see if there are any other sexual crimes that may need to be considered.

In [None]:
df[(df.OFFENSE_CATEGORY_NAME != 'Sexual Assault')].OFFENSE_TYPE_NAME.str.contains('sex', case=False).sum()

Looks like there are 1,879 cases where this is true; we can dig deeper to see which `OFFENSE_TYPE_NAME` values these rows contain.

In [None]:
#rows outside the 'Sexual Assault' category whose offense type mentions 'sex'
not_sa = df[df.OFFENSE_CATEGORY_NAME != 'Sexual Assault']
not_sa[not_sa.OFFENSE_TYPE_NAME.str.contains('sex', case=False)].OFFENSE_TYPE_NAME.value_counts()

Based on the above, it does not appear these need to be added to the violent crimes DataFrame. Sexual harassment could be included, but without further details of this type, or further classification, it is not right to add.

In [None]:
rape = df[(df.OFFENSE_CATEGORY_NAME == 'Sexual Assault')]

On to robbery then!

In [None]:
cat()

In [None]:
def cat_bool(value):
    return df.OFFENSE_CATEGORY_NAME == value

In [None]:
def types_from_cat(value):
    return df[cat_bool(value)].OFFENSE_TYPE_NAME.value_counts()

In [None]:
types_from_cat('Larceny')

These are all non-violent, no need to add.

In [None]:
cat()

In [None]:
types_from_cat('Other Crimes Against Persons')

Good to look at these; however, assault with minor bodily injury is not considered aggravated assault, so it will not be included in violent crimes.

In [None]:
cat()

In [None]:
types_from_cat('Robbery')

In [None]:
robbery = df[cat_bool('Robbery')]

Last category is aggravated assault, luckily enough - all these incidents are nicely packaged in the OFFENSE_CATEGORY_NAME column under 'Aggravated Assault'.

In [None]:
types_from_cat('Aggravated Assault')

In [None]:
assault = df[cat_bool('Aggravated Assault')]

In [None]:
v_crime = pd.concat([murder, rape, robbery, assault])

In [None]:
len(murder) + len(rape) + len(robbery) + len(assault)

In [None]:
v_crime.shape
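
The four pieces could also be selected in one step. A minimal sketch on made-up rows, assuming the same four category labels used above:

```python
import pandas as pd

# Made-up rows standing in for df.
toy = pd.DataFrame({
    'INCIDENT_ID': [1, 2, 3, 4, 5],
    'OFFENSE_CATEGORY_NAME': ['Murder', 'Larceny', 'Robbery', 'Sexual Assault', 'Aggravated Assault'],
})

# Category labels treated as violent crime in this notebook.
violent = ['Murder', 'Sexual Assault', 'Robbery', 'Aggravated Assault']
toy_v_crime = toy[toy.OFFENSE_CATEGORY_NAME.isin(violent)]
print(len(toy_v_crime))
```

Unlike `pd.concat`, which stacks the four subsets one after another, `isin` keeps the rows in their original order.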

## Success! We have our violent crime DataFrame `v_crime`!

So, the question was, what neighborhoods have the most violent crime? But quick aside, how much of the total crime is actually violent crime?

In [None]:
len(v_crime)/len(df)

We can see that, out of all the crime reported in Denver, only 6.1% is violent.

In [None]:
v_crime.NEIGHBORHOOD_NAME.value_counts().nlargest(10).index

By nearly 700 occurrences, Five Points again leads the way. Let's look at its breakdown.

In [None]:
def v_types_from_n(value):
    return v_crime[v_crime.NEIGHBORHOOD_NAME == value].OFFENSE_TYPE_NAME.value_counts()
def v_cats_from_n(value):
    return v_crime[v_crime.NEIGHBORHOOD_NAME == value].OFFENSE_CATEGORY_NAME.value_counts()

In [None]:
v_types_from_n('Five Points')

In [None]:
plt.figure(figsize=(18,25))
graph_ct = 1
for value in v_crime.NEIGHBORHOOD_NAME.value_counts().nlargest(10).index:
    plt.subplot(5, 2, graph_ct)
    v_cats_from_n(value).plot(kind='pie', title='Rank {}: {}'.format(graph_ct, value))
    plt.ylabel('Violent Crime Category')
    graph_ct += 1
plt.show()

## Is there a significant difference in the number of violent offenses in the Summer compared to violent offenses the rest of the year? If not, how about in comparison to Winter?

To answer this we should consider the proper unit at which to compare the two populations. We could go by month, but summer does not align with month boundaries; it spans a particular range of days that shifts slightly each year. So the best way to separate and compare these two sets of data is likely by day: the days that fall within summer for a given year against the days that fall outside it.

### Step 1: Separating `v_crime` into two DataFrames - one with summer offenses and one without

First, we need to find when summer actually occurred each year; this isn't too hard with a quick search on Google.

In [None]:
s2014_start = pd.to_datetime('June 21, 2014')
s2014_end = pd.to_datetime('September 22, 2014')
s2015_start = pd.to_datetime('June 21, 2015')
s2015_end = pd.to_datetime('September 23, 2015')
s2016_start = pd.to_datetime('June 20, 2016')
s2016_end = pd.to_datetime('September 22, 2016')
s2017_start = pd.to_datetime('June 20, 2017')
s2017_end = pd.to_datetime('September 22, 2017')
s2018_start = pd.to_datetime('June 21, 2018')
s2018_end = pd.to_datetime('September 22, 2018')
s2019_start = pd.to_datetime('June 21, 2019')
s2019_end = pd.to_datetime('September 23, 2019')

Now let's multifilter `v_crime` to get only summer offenses.

In [None]:
def f_occ():
    return v_crime.FIRST_OCCURRENCE_DATE

In [None]:
#format we need to properly filter
print(v_crime[((s2014_start < f_occ()) & (f_occ() < s2014_end))].FIRST_OCCURRENCE_DATE.min())
print(v_crime[((s2014_start < f_occ()) & (f_occ() < s2014_end))].FIRST_OCCURRENCE_DATE.max())

In [None]:
s2014 = ((s2014_start < f_occ()) & (f_occ() < s2014_end))
s2015 = ((s2015_start < f_occ()) & (f_occ() < s2015_end))
s2016 = ((s2016_start < f_occ()) & (f_occ() < s2016_end))
s2017 = ((s2017_start < f_occ()) & (f_occ() < s2017_end))
s2018 = ((s2018_start < f_occ()) & (f_occ() < s2018_end))
s2019 = ((s2019_start < f_occ()) & (f_occ() < s2019_end))
sfilt = s2014 | s2015 | s2016 | s2017 | s2018 | s2019
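
The same kind of mask can be built in a loop. A sketch on made-up timestamps, reusing two of the summer windows defined above (the `summers` dict and `dates` series are illustrative names, not part of the notebook):

```python
import pandas as pd

# Start/end dates copied from two of the summers defined above.
summers = {
    2014: ('June 21, 2014', 'September 22, 2014'),
    2015: ('June 21, 2015', 'September 23, 2015'),
}

# Made-up occurrence timestamps standing in for v_crime.FIRST_OCCURRENCE_DATE.
dates = pd.Series(pd.to_datetime([
    '2014-07-04 13:00', '2014-12-25 09:00', '2015-09-22 23:59', '2015-09-23 00:00',
]))

# OR together one strict (start, end) window per year.
mask = pd.Series(False, index=dates.index)
for start, end in summers.values():
    mask |= (pd.to_datetime(start) < dates) & (dates < pd.to_datetime(end))
print(mask.tolist())
```

Because the comparisons are strict and the bounds sit at midnight, offenses on the end date itself fall outside the window.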

In [None]:
#see if we get same result as above for 2014
print(v_crime[s2014].FIRST_OCCURRENCE_DATE.min())
print(v_crime[s2014].FIRST_OCCURRENCE_DATE.max())

In [None]:
v_crime[sfilt].FIRST_OCCURRENCE_DATE.min()

In [None]:
v_crime[sfilt].FIRST_OCCURRENCE_DATE.max()

It appears that our filtering works, so now we can assign this to a new summer violent crime DataFrame, `s_v_crime`.

In [None]:
s_v_crime = v_crime[sfilt]

Then to get our non-summer values we can just find the opposite of our `sfilt`.

In [None]:
#make sure our totals match up correctly
len(v_crime[~sfilt]) + len(s_v_crime) == len(v_crime)

In [None]:
ns_v_crime = v_crime[~sfilt]

Now we need to find our average violent crimes per day for both Summer and then for everything else.

In [None]:
sf_year = s_v_crime.FIRST_OCCURRENCE_DATE.dt.year.rename('year')
sf_dayofyear = s_v_crime.FIRST_OCCURRENCE_DATE.dt.dayofyear.rename('dayofyear')

In [None]:
s_counts = s_v_crime.groupby([sf_year, sf_dayofyear]).count().INCIDENT_ID

In [None]:
s_counts.mean()

The above, `12.4107...`, is the average number of violent crimes per day in Denver during Summer. The non-summer mean can be found similarly.

In [None]:
nsf_year = ns_v_crime.FIRST_OCCURRENCE_DATE.dt.year.rename('year')
nsf_dayofyear = ns_v_crime.FIRST_OCCURRENCE_DATE.dt.dayofyear.rename('dayofyear')
ns_counts = ns_v_crime.groupby([nsf_year, nsf_dayofyear]).count().INCIDENT_ID
ns_counts.mean()
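
One caveat with grouping this way: a day with no recorded violent offense never forms a group, so it is left out of the mean rather than counted as zero. With roughly a dozen offenses a day that probably changes little here, but a sketch on made-up timestamps shows how resampling keeps the empty days:

```python
import pandas as pd

# Made-up offense timestamps: two on Jan 1, none on Jan 2, one on Jan 3.
occurred = pd.Series(pd.to_datetime([
    '2014-01-01 08:00', '2014-01-01 21:30', '2014-01-03 02:15',
]))

# groupby only sees days that have at least one row.
grouped = occurred.groupby(occurred.dt.date).count()

# Resampling on the timestamp fills the gap with a zero-count day.
daily = pd.Series(1, index=pd.DatetimeIndex(occurred)).resample('D').sum()
print(grouped.mean(), daily.mean())
```

For the seasonal subsets, a resampled index would still need to be restricted to the relevant date windows, since resampling fills every day between the first and last timestamp.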

<h4>How normal are the datasets?</h4>

In [None]:
s_mu, s_std = stats.norm.fit(s_counts)
ns_mu, ns_std = stats.norm.fit(ns_counts)

In [None]:
plt.figure(figsize=(12,5))
plt.subplot(121)
plt.hist(s_counts, bins=25, density=True, color='red', alpha=.75)
plt.title('Violent Crime Histogram: Summer - Normalized')
plt.ylabel('Density')
plt.xlabel('Number of Offenses')
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, len(s_counts))
p = stats.norm.pdf(x, s_mu, s_std)
plt.plot(x, p, 'k', linewidth=1)
plt.subplot(122)
plt.title('Violent Crime Histogram: Non-Summer - Normalized')
plt.hist(ns_counts, bins=25, density=True, color='blue', alpha=.75)
plt.ylabel('Density')
plt.xlabel('Number of Offenses')
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, len(ns_counts))
p = stats.norm.pdf(x, ns_mu, ns_std)
plt.plot(x, p, 'k', linewidth=1)

plt.tight_layout()
plt.show()

Based on a quick view of our histograms, we can see that violent crime in Summer looks much more normal than non-Summer crime. But how normal or not-normal is our data? We can do a QQ Plot on both of these to inspect more closely.

In [None]:
plt.figure(figsize=(15,10))

plt.subplot(221)
stats.probplot(s_counts, dist='norm', plot=plt)
plt.title('Probability Plot - Summer Violent Crime')

plt.subplot(222)
stats.probplot(ns_counts, dist='norm', plot=plt)
plt.title('Probability Plot - Non-Summer Violent Crime')

plt.show()

These look normally distributed to a point. The extremes or outliers seem to skew the data a bit, but most of our data appears close to normal for both the Summer and non-Summer datasets. Let's take one last look at the skew and kurtosis.

In [None]:
print('Summer Skew: {} | Summer Kurtosis: {}'.format(s_counts.skew(), s_counts.kurt()))
print('Non-Summer Skew: {} | Non-Summer Kurtosis: {}'.format(ns_counts.skew(), ns_counts.kurt()))

The Summer data's skew is slightly positive but still very close to zero, which indicates that our data trails off more in the positive direction than a normal distribution. The Non-Summer data's skew is similar, but almost double, which indicates a longer right tail (meaning we either have a few very large outliers or many small ones). The kurtosis (or rather "excess kurtosis" in this case) indicates that the tails of our distributions are 'heavy', or that more values fall within the tails compared to the normal distribution. These values in and of themselves aren't necessarily useful in determining how normal our data is; rather, they tell us in what ways it is not normal.
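
For a sense of scale, here is a sketch on synthetic samples (fixed seed, not the Denver data). pandas' `.kurt()` reports excess kurtosis, so a normal sample should land near zero on both measures, while a count-like Poisson sample with a small mean shows positive skew and slightly heavy tails:

```python
import numpy as np
import pandas as pd

# Synthetic samples (fixed seed) - not the Denver data.
rng = np.random.default_rng(0)
normal_like = pd.Series(rng.normal(loc=12, scale=4, size=20000))
count_like = pd.Series(rng.poisson(lam=3, size=20000))

# pandas reports excess kurtosis, so a normal sample lands near 0.
print(normal_like.skew(), normal_like.kurt())
# A Poisson sample with a small mean is right-skewed with slightly heavy tails.
print(count_like.skew(), count_like.kurt())
```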

<h4>Null Hypothesis: Summer has no effect on the violent crime rate
    
Alternative Hypothesis: Summer does have an effect on the violent crime rate</h4>


To test this let's use a t-test (Welch's version, which does not assume equal variances) and a Mann-Whitney U test (in case our data is not close enough to normal).

In [None]:
print(stats.ttest_ind(s_counts, ns_counts, equal_var=False))
print(stats.mannwhitneyu(s_counts, ns_counts, alternative='two-sided'))

In both cases, we see an extremely small p-value, which means we can reject the null hypothesis and conclude that there is a significant difference in violent crime in Denver between Summer and the rest of the year. The rate only increases by about two offenses a day, but it seems likely that when Summer arrives, violent crime will increase.
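
To illustrate how a gap of about two offenses a day can still give a tiny p-value, here is a sketch on synthetic Poisson counts. The seed, sample sizes, and the non-summer rate of 10.4 are assumptions chosen to resemble the figures discussed above, not values taken from the dataset:

```python
import numpy as np
import scipy.stats as stats

# Synthetic daily counts (fixed seed), loosely shaped like the rates discussed above.
rng = np.random.default_rng(1)
summer = rng.poisson(lam=12.4, size=500)
rest = rng.poisson(lam=10.4, size=1500)

# Welch's t-test compares the means without assuming equal variances.
t_stat, t_p = stats.ttest_ind(summer, rest, equal_var=False)
# Mann-Whitney U compares the two samples without assuming normality.
u_stat, u_p = stats.mannwhitneyu(summer, rest, alternative='two-sided')

print(summer.mean() - rest.mean(), t_p, u_p)
```

With hundreds of days in each group, the standard error of the difference is small, so even a modest shift in the daily rate stands out clearly.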