# Homework 1 Writeup

### Lauren Li

This writeup contains the figures and written answers to problems 1-4 of homework 1.


Helper functions can be found in hw1_base.py (located in the same repository).

### Problem 1: Data Acquisition and Analysis 

The table below shows the number of crimes by type for 2017 and 2018. A few things stand out.

- Assault, Theft, Criminal Damage, Deceptive Practice, and Battery are among the most frequent types of crime in both years. 


- Theft, Deceptive Practice, and Battery are steady across the two years (the percent increase is low at approximately 1%).


- Assault increased by 5.6% from 2017 to 2018 while Criminal Damage decreased by 4.3%.


In [None]:
import hw1_base
import pandas as pd

In [None]:
df1, df2 = hw1_base.pull_data([2017, 2018])

#summary stats
x = hw1_base.num_crimes_type(df1, df2)

x

Next, we can look at the percent change in the number of crimes committed per type from 2017 to 2018.

- Concealed carry license violation and Human Trafficking had the highest percent increases. Referring back to the table above, neither type has a very high frequency. However, Concealed carry license violations are more frequent than Human trafficking crimes and had a large spike from 2017 to 2018.


- Notably, Robbery and Motor Vehicle Theft, which are more frequent than Other narcotic violations, also decreased substantially from 2017 to 2018. Robbery decreased by 18.5% while Motor Vehicle Theft decreased by 12.4%.


- Interference with police officer crimes increased by 20% from 2017 to 2018.


- The type of crime with the highest percent decrease was Other narcotic violation. Given the infrequency of this type, the drop from 11 violations in 2017 to 1 in 2018 has a pronounced percent decrease.


In [None]:
hw1_base.mk_bar(x, 'Type', 'Percent Change', 'Percent Change in Crimes by Type (2017 to 2018)', 'charts/bar_type_chg.png')
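
The helper `num_crimes_type` lives in hw1_base.py and is not reproduced here. As a rough sketch of the percent change calculation discussed above, assuming each year's data frame has one row per report and a `primary_type` column (the function name `pct_change_by_type` and the toy data are hypothetical, not the actual helper or the real counts):

```python
import pandas as pd

def pct_change_by_type(df_prev, df_curr, type_col="primary_type"):
    """Count reports per crime type in each year and compute the percent change."""
    prev = df_prev[type_col].value_counts()
    curr = df_curr[type_col].value_counts()
    # Align the two years on crime type; a type missing in one year counts as 0.
    out = pd.concat([prev, curr], axis=1, keys=["prev", "curr"]).fillna(0)
    # Note: a type with 0 reports in the earlier year yields an infinite change.
    out["pct_change"] = (out["curr"] / out["prev"] - 1) * 100
    return out.sort_values("pct_change", ascending=False)

# Toy data, made up for illustration only.
toy_2017 = pd.DataFrame({"primary_type": ["THEFT"] * 4 + ["ASSAULT"] * 2})
toy_2018 = pd.DataFrame({"primary_type": ["THEFT"] * 3 + ["ASSAULT"] * 3})
result = pct_change_by_type(toy_2017, toy_2018)
```

A sketch like this also makes the small-base effect visible: a type that drops from 11 reports to 1 shows a decrease of about 91%, even though only 10 reports are involved.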

You can also see the breakdown of crime types by community area below.

In [None]:
z = hw1_base.avg_crime_nhood(df1, df2)
z

The Chicago Open Data Portal crime reports also provide information on the community area. The charts below show the percentage of community areas in which each type of crime is the most frequent.

In 2017, Theft was the most frequent type of crime for 63% of community areas, followed by Battery for 36% of the areas, and Assault for 1% of the areas.

In [None]:
a = hw1_base.most_common_yr(z)
hw1_base.mk_pie(a, 2018, 'charts/2018 pie.png')

In comparison, Theft was the most frequent type of crime for 59% of community areas, followed by Battery for 38% of the areas, and Narcotics for 3% of the areas in 2018. 

Compared with 2017, more community areas had Battery as their most frequent type of crime and fewer had Theft. Moreover, the third most common "most frequent" type by area switched from Assault in 2017 to Narcotics in 2018.

In [None]:
b = hw1_base.most_common_yr(z, 2017)
hw1_base.mk_pie(b, 2017, 'charts/2017 pie.png')
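
The percentages in the pie charts come from helpers in hw1_base.py. A minimal sketch of the underlying idea, assuming one row per report with a `primary_type` column and a `community_area` column (the column name `community_area`, the function name, and the toy data are assumptions for illustration):

```python
import pandas as pd

def share_of_areas_by_top_type(df, area_col="community_area", type_col="primary_type"):
    """Find each area's most frequent crime type, then return the percentage
    of areas for which each type is the most frequent."""
    counts = df.groupby([area_col, type_col]).size().reset_index(name="n")
    # Keep the highest-count row per area; ties are broken arbitrarily.
    top = counts.sort_values("n", ascending=False).drop_duplicates(area_col)
    return top[type_col].value_counts(normalize=True) * 100

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "community_area": [1, 1, 1, 2, 2, 2, 3],
    "primary_type": ["THEFT", "THEFT", "BATTERY",
                     "THEFT", "BATTERY", "BATTERY",
                     "THEFT"],
})
shares = share_of_areas_by_top_type(toy)
```

In the toy data, Theft is the most frequent type in two of three areas and Battery in one, so the shares are roughly 67% and 33%.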

To provide for detailed analysis of neighborhood crime levels, summary statistics can also be generated for each community area. This can easily be replicated using the code in hw1_nb.ipynb (function summary_nhood), which allows the user to enter the specific community area number of interest to generate a table and graph of summary statistics. Community Area 3 is shown below as an example.

Based on the table and bar chart below for Community Area 3:

- Theft and Battery are among the most frequent in the neighborhood, which is similar to other neighborhoods and the city overall based on data shown above.


- Deceptive Practice and Criminal Damages are also relatively more frequent, again similar to the city overall. 


- In contrast to the overall city statistics, Assault crimes decreased by 6.5% from 2017 to 2018 while Motor Vehicle Thefts increased by 17%.

In [None]:
hw1_base.summary_nhood(z, '3')

## Problem 2: Data Augmentation and APIs 

The table below provides summary statistics to answer questions 1, 2, and 3. To identify blocks with "frequent" reports of battery/homicide, I look only at blocks whose number of battery/homicide violations falls in the 75th percentile of frequency. Without this filter, there were many blocks with just one or two reports, which I didn't think would add to the analysis.
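
As an illustration of the filtering step, here is a minimal sketch that keeps blocks whose report count is at or above the 75th percentile of per-block counts. The `block` column name, the function name, and the choice of "at or above" as the cutoff rule are assumptions; the actual logic is in hw1_base.py.

```python
import pandas as pd

def frequent_blocks(df, block_col="block", q=0.75):
    """Return report counts for blocks at or above the q-th quantile of counts."""
    counts = df[block_col].value_counts()
    threshold = counts.quantile(q)  # linear interpolation by default
    return counts[counts >= threshold]

# Toy data, made up for illustration: block A has 5 reports, B has 2, C and D have 1.
toy = pd.DataFrame({"block": ["A"] * 5 + ["B"] * 2 + ["C", "D"]})
frequent = frequent_blocks(toy)
```

With these toy counts the threshold is 2.75, so only block A is kept and the blocks with one or two reports drop out.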


#### 1. What types of blocks have reports of “Battery”?

- The blocks that have reports of "Battery" had a large proportion of residents (40% on average) with household income less than 25,000. Over 80% (on average) of the households had household income less than 100,000.


- On average, a majority of the residents in the block are Black (75%).


- The median age is 35, on average.



#### 2. What types of blocks get “Homicide”?


- The blocks that have reports of "Homicide" had a large proportion of residents (40% on average) with household income less than 25,000. Over 80% (on average) of the households had household income less than 100,000.


- On average, a majority of the residents in the block are Black (80%).


- The median age is 33 on average.


#### 3. Does that change over time in the data you collected?

- The demographics listed for blocks that report Battery do not change substantially from 2017 to 2018 on average. Note that the number of Battery crimes also stayed fairly steady from 2017 to 2018 (see the summary table for Problem 1).


- There is a more pronounced difference when comparing blocks that had Homicides in 2017 to those in 2018. In 2018, these blocks had a higher percentage of households with income less than 25,000 (49% vs 43% in 2017 on average). There is also a notable increase in the percentage of Black residents in the block: 90% in 2018 vs 80% in 2017 on average. The number of homicides in the city decreased from 2017 to 2018. This could indicate that while the number of homicides decreased, homicides became more concentrated in Black, low-income neighborhoods.

In [None]:
#census api pull - takes a long time to run!

census_pull = hw1_base.get_acs_blk_data('17','031')

In [None]:
census_df = pd.read_csv('data/census_data.csv',skiprows=1)
census_df = census_df.drop('0', axis=1)
census_df.columns = ['Name', 'Median Age', 'Total Households', 'Households < 10k', '10k < Household < 15k',
                    '15k < Household < 20k','20k < Household < 25k','25k < Household < 30k',
                    '30k < Household < 35k','35k < Household < 40k','40k < Household < 45k',
                    '45k < Household < 50k', '50k < Household < 60k','60k < Household < 75k',
                    '75k < Household < 100k', 'Total Hispanic', 'Total Not Hispanic', 'Total White',
                    'Total Black', 'Total Asian', 'state', 'county', 'tract', 'block']

In [None]:
#spatial join to identify census block and tract
joined1 = hw1_base.multiple_joins(['BATTERY', 'HOMICIDE'], [df1, df2])

joined2 = hw1_base.multiple_joins(['DECEPTIVE PRACTICE','SEX OFFENSE'], [df1, df2])

metrics = ['Median Age', 'Pct household income < 25k', '25k < Pct household income < 50k',
              '50k < Pct household income < 100k', 'Percent Hispanic',
              'Percent White', 'Percent Black', 'Percent Asian']

tbl1 = hw1_base.build_summary_table(metrics, joined1, census_df)
tbl2 = hw1_base.build_summary_table(metrics, joined2, census_df)

In [None]:
tbl1

For question 4, as in questions 1-3 above, I look only at blocks whose number of deceptive practice/sex offense violations falls in the 75th percentile of frequency. Without this filter, there were many blocks with only a few cases, which I thought would add noise.



#### 4. What is the difference in blocks that get “Deceptive Practice” vs “Sex Offense”?


- Blocks with Deceptive Practice reports have, on average, a lower percentage of households with income less than 100,000 than blocks with Sex Offense reports. In 2018, 76.7% of households in Sex Offense blocks had income less than 100,000 on average, compared with 74.4% in Deceptive Practice blocks.


- Blocks that receive Deceptive Practice have a lower percentage (14% on average) of Hispanic/Latinx residents in the community compared to blocks that get Sex Offense (21% on average). Furthermore, blocks that receive Deceptive Practice have a higher percentage of White residents (32.5% vs 31% on average).


- The median age is around 35 for both.


In [None]:
tbl2

## Problem 3: Analysis and Communication

#### 1. Describe how crime has changed in Chicago from 2017 to 2018?


This is also shown in the tables and charts in Problem 1, with some of the explanation repeated here. 

From 2017 to 2018, overall crime reports decreased by less than a percent. 

- Assault, Theft, Criminal Damage, Deceptive Practice, and Battery are among the most frequent types of crime in both years. 


- Theft, Deceptive Practice, and Battery are steady across the two years (the percent increase is low at approximately 1%).


- Assault increased by 5.6% from 2017 to 2018 while Criminal Damage decreased by 4.3%.


- Notably, Robbery and Motor Vehicle Theft, which are more frequent than Other narcotic violations, also decreased substantially from 2017 to 2018. Robbery decreased by 18.5% while Motor Vehicle Theft decreased by 12.4%.


- Concealed carry license violation and Human Trafficking had the highest percent increases. Referring back to the table above, neither type has a very high frequency. However, Concealed carry license violations are more frequent than Human trafficking crimes and had a large spike from 2017 to 2018.


#### 2A. Are these statistics correct?


No. The table below shows that the statistics on the website for the 28-day period leading up to July 26, 2018, compared to the same period in 2017, are not consistent with the Chicago Open Data Portal data for any category.


In [None]:
end2 = pd.Timestamp(year=2018,month=7,day=26, hour = 23, minute= 59, second=59)
end1 = pd.Timestamp(year=2017,month=7,day=26, hour = 23, minute= 59, second=59)
sub = pd.Timedelta('29 days')
tgt_types = ['ROBBERY', 'BURGLARY', 'MOTOR VEHICLE THEFT', 'BATTERY']
claims = [16, 136, 50, 41, 21]

hw1_base.compare_yr(df1, df2, 2017, 2018, '43', end1, end2, sub, sub, tgt_types, claims)
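
The comparison itself is done by `compare_yr` in hw1_base.py. As a rough illustration of the idea only, the sketch below counts reports in a 28-day window ending on a given date and computes the year-over-year percent change. The `date` column name, both function names, and the exact window boundaries are assumptions and may differ from how `compare_yr` defines the period.

```python
import pandas as pd

def count_in_window(df, end, days=28, date_col="date"):
    """Count rows whose timestamp falls in the `days`-day window ending at `end`."""
    # Start at midnight of the first day so the window covers `days` calendar days.
    start = end.normalize() - pd.Timedelta(days=days - 1)
    in_window = (df[date_col] >= start) & (df[date_col] <= end)
    return int(in_window.sum())

def pct_change(prev, curr):
    """Percent change from prev to curr."""
    return (curr / prev - 1) * 100

# Toy timestamps, made up for illustration: two inside the window, two outside.
toy = pd.DataFrame({"date": pd.to_datetime([
    "2018-06-28 12:00", "2018-06-29 00:00", "2018-07-26 23:00", "2018-07-27 00:00",
])})
end_2018 = pd.Timestamp(year=2018, month=7, day=26, hour=23, minute=59, second=59)
n_2018 = count_in_window(toy, end_2018)
```

Under this definition the window runs from June 29 through July 26 inclusive. Running the same count with the 2017 end date and passing both counts to `pct_change` gives a figure that can be set against a claimed percentage.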

The year-to-date numbers are also not correct, although they are closer: they have the same magnitude and are within a couple of percentage points of the claimed figures.

In [None]:
#pull in 2016 data
df3 = hw1_base.pull_data([2016])[0]

end_date_16 = pd.Timestamp(year=2016, month=7, day=26, hour = 23, minute= 59, second=59)
end_date_17 = pd.Timestamp(year=2017, month=7, day=26, hour = 23, minute= 59, second=59)
end_date_18 = pd.Timestamp(year=2018, month=7, day=26, hour = 23, minute= 59, second=59)

ytd_16 = hw1_base.ytd_ward_df(df3, '43', end_date_16)
ytd_17 = hw1_base.ytd_ward_df(df1, '43', end_date_17)
ytd_18 = hw1_base.ytd_ward_df(df2, '43', end_date_18)

ytd_totl_17 = ytd_17.shape[0]
ytd_totl_16 = ytd_16.shape[0]
ytd_totl_18 = ytd_18.shape[0]

ytd_counts = pd.DataFrame([['2016', ytd_totl_16], ['2017', ytd_totl_17], ['2018', ytd_totl_18]], columns = ['Year', 'YTD Total'])
ytd_counts['Pct Change compared to 2018'] = round((ytd_totl_18/ytd_counts['YTD Total']-1)*100,2)
ytd_counts['Claimed Pct Change compared to 2018'] = [22, 10, 'Nan']

ytd_counts

#### 2B. Could they be misleading or would you agree with the conclusions he’s drawing? Why or why not?

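Yes, they can definitely be misleading. As shown in 2A, the claimed figures do not match the Chicago Open Data Portal data, so conclusions drawn from them should be treated with caution.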

#### 3. Based on these summary statistics, provide 5 key findings to the new mayor’s office about crime in Chicago and what they should focus on in order to deal with crime in Chicago.


1. Homicides are decreasing but concentrated in low-income, predominantly Black neighborhoods.


Recommendation: target social workers and violence prevention programs in these neighborhoods.


2. Homicide decreased but Assault increased by 5.6%.


3. Theft and Battery are the most frequent types of crime in both 2017 and 2018.


4. Concealed carry license violation had the highest percentage increase in frequency from 2017 to 2018.


5. Interference with police officer increased from 2017 to 2018.


#### 4. What are some of the key caveats of your recommendations and limitations of the analysis that you just did?

- Limited dataset


- Small timeframe


- Not taking into account policy changes from 2017 to 2018

## Problem 4

Note: A bit of methodology is explained below, but all code can be found in hw1_nb.ipynb

#### A. Of the types of crimes you have data for, which crime type is the most likely given the call came from 2111 S Michigan Ave? What are the probabilities for each type of request?

I filtered the 2017 and 2018 crime reports to crimes at that block address and then aggregated by type.

It is most likely a call about Battery. The table below shows the probability for each type of request (in column "Percent of Total"). For example, the probability that the call is about Battery is 26.67%, while the probability that it is about Other Offense is 21.67%.

In [None]:
address = '021XX S MICHIGAN AVE'
df_totl = pd.concat([df1,df2])
hw1_base.address_info(address, df_totl)

<img src='probability_table.png'>
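
The table is produced by `address_info` in hw1_base.py. A minimal sketch of the same kind of calculation, assuming the reports have a `block` column holding the block address and a `primary_type` column (the `block` column name, the function name, and the toy rows are assumptions for illustration):

```python
import pandas as pd

def type_probabilities(df, address, block_col="block", type_col="primary_type"):
    """Percent of reports at `address` that fall under each crime type."""
    at_address = df[df[block_col] == address]
    return at_address[type_col].value_counts(normalize=True) * 100

# Toy data, made up for illustration only (not the real reports).
toy = pd.DataFrame({
    "block": ["021XX S MICHIGAN AVE"] * 4 + ["001XX N STATE ST"],
    "primary_type": ["BATTERY", "BATTERY", "THEFT", "OTHER OFFENSE", "THEFT"],
})
probs = type_probabilities(toy, "021XX S MICHIGAN AVE")
most_likely = probs.idxmax()
```

The percentages sum to 100 across types at the address, and the type with the largest share is the most likely one for a call from that block.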

#### B. Let’s now assume that a call comes in about Theft. Which is more likely – that the call came from Garfield Park or Uptown? How much more or less likely is it to be from Garfield Park vs Uptown?

Note: I combined East Garfield Park and West Garfield Park to be Garfield Park.

I used the 2017 and 2018 crime report data and found the total number of Thefts across both years. After filtering by Garfield Park and Uptown and aggregating by type, I found:

- The probability that the call came from Garfield Park: 1.93%


- The probability that the call came from Uptown: 1.51%


Therefore,

- The call is more likely to have come from Garfield Park, by approximately 0.42 percentage points (roughly 1.28 times as likely as Uptown).


In [None]:
gp_counts = hw1_base.nhood_count(z, 'Garfield Park', ['2017 Total', '2018 Total'], 'Theft')
upt_counts = hw1_base.nhood_count(z, 'Uptown', ['2017 Total', '2018 Total'], 'Theft')

totl_theft = df2[df2['primary_type'] == 'THEFT'].shape[0] + df1[df1['primary_type'] == 'THEFT'].shape[0]

upt_pr = upt_counts/totl_theft
gp_pr = gp_counts/totl_theft

In [None]:
gp_pr - upt_pr

#### C. There are a total of 1000 calls, 600 from Garfield Park and 400 from Uptown. Of the 600 calls from Garfield Park, 100 of them are about Battery. Of the 400 calls from Uptown, 160 are about Battery. If a call comes about Battery, how much more/less likely is it that the call came from Garfield Park versus Uptown?

If a call comes in about Battery, it is 23.08 percentage points less likely to have come from Garfield Park than from Uptown.

Pr(Battery | Garfield Park) = 1/6

Pr(Battery | Uptown) = 160/400

Pr(Uptown) = 400/1000

Pr(Garfield Park) = 600/1000

Pr(Battery) = (100+160)/1000

If a call comes in about Battery, the probability that it is from Garfield Park:

Pr(Garfield Park | Battery) = (Pr(Battery | Garfield Park) * Pr(Garfield Park))/Pr(Battery) = 38.46%


If a call comes in about Battery, the probability that it is from Uptown:

Pr(Uptown | Battery) = (Pr(Battery | Uptown) * Pr(Uptown))/Pr(Battery) = 61.54%
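
The Bayes' rule arithmetic above can be checked with a short script. This is just a sketch using the call counts given in the question; the variable names are my own.

```python
# Call counts from the problem statement.
calls = {"Garfield Park": 600, "Uptown": 400}
battery_calls = {"Garfield Park": 100, "Uptown": 160}

total_calls = sum(calls.values())
total_battery = sum(battery_calls.values())
evidence = total_battery / total_calls  # Pr(Battery)

posterior = {}
for area in calls:
    likelihood = battery_calls[area] / calls[area]  # Pr(Battery | area)
    prior = calls[area] / total_calls               # Pr(area)
    posterior[area] = likelihood * prior / evidence # Pr(area | Battery)

# Gap between the two posteriors, in percentage points, and their ratio.
difference_pp = (posterior["Uptown"] - posterior["Garfield Park"]) * 100
ratio = posterior["Uptown"] / posterior["Garfield Park"]

for area, p in posterior.items():
    print(f"Pr({area} | Battery) = {p:.2%}")
print(f"Difference: {difference_pp:.2f} percentage points")
```

The posteriors reduce to 100/260 and 160/260, which matches the 38.46% and 61.54% above. Expressed as a ratio, a Battery call is 1.6 times as likely to have come from Uptown as from Garfield Park.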