In this assignment we will continue to explore the Chicago crimes dataset using the more advanced operations we've learned, including merging and group aggregation. Put the code for each question in files named q1.py, ..., q6.py. Do not commit the CSV files to your git repository.
-
Load the 2017 crime data from assignment 5. Also load the community area socioeconomic data and get rid of the row corresponding to the whole city of Chicago (it's missing a Community Area Number). Calculate the number of crimes per community area. Merge this with the socioeconomic data to plot a scatter with per capita income on the x axis and crime count on the y axis. Save the plot as crimes_by_income.png.
Hint:
- Turn the crime counts into a dataframe using reset_index() as discussed in lecture. Either name the community area number column Community Area Number to match the corresponding column in the socioeconomic data, or use separate left_on and right_on arguments to merge().
- To make a scatter plot, call plot() on the DataFrame with an argument kind='scatter' and additional arguments x and y specifying the names of the columns to use for x and y data.
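The two hints for #1 can be sketched on toy data as follows. This is only an illustration under stated assumptions: the column names (Community Area, PER CAPITA INCOME, Crime Count) and all values are made up for the example and may not match the headers in the real CSV files.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import pandas as pd

# Toy stand-ins for the real files; column names and values are assumptions.
crimes = pd.DataFrame({"Community Area": [1.0, 1.0, 2.0, 3.0, 3.0, 3.0]})
socio = pd.DataFrame({
    "Community Area Number": [1.0, 2.0, 3.0, None],  # None stands for the citywide row
    "PER CAPITA INCOME": [23939, 23040, 35787, 28202],
})

# Drop the citywide row, which has no community area number.
socio = socio.dropna(subset=["Community Area Number"])

# Count crimes per area, turn the Series into a DataFrame,
# and rename the key column so it matches the socioeconomic data.
counts = (
    crimes.groupby("Community Area")
    .size()
    .reset_index(name="Crime Count")
    .rename(columns={"Community Area": "Community Area Number"})
)

# Default (inner) merge on the shared key column.
merged = counts.merge(socio, on="Community Area Number")

# Scatter plot with income on x and crime count on y.
ax = merged.plot(kind="scatter", x="PER CAPITA INCOME", y="Crime Count")
```

Calling ax.figure.savefig("crimes_by_income.png") afterwards would write the plot to disk. Instead of renaming the key column, passing left_on="Community Area" and right_on="Community Area Number" to merge() should give the same pairing.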
-
Repeat #1 for homicide counts and save the plot as homocides_by_income.png.
Hint: The homicide counts will be missing rows for community areas that had no homicides. The default inner merge between the homicide counts and the socioeconomic data will thus be missing these areas. To get the right answer, you will need to select a different merge type using the how argument and then fill in the missing homicide counts with zeros.
-
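For the homicide merge described in the hint for #2, here is a small sketch of how a left merge keeps areas with no matching count. The frames and column names are toy assumptions, not the real data.

```python
import pandas as pd

# Toy data; names and values are assumptions for illustration only.
socio = pd.DataFrame({
    "Community Area Number": [1, 2, 3],
    "PER CAPITA INCOME": [23939, 23040, 35787],
})
# Area 2 had no homicides, so it has no row in the counts.
homicides = pd.DataFrame({
    "Community Area Number": [1, 3],
    "Homicides": [4, 1],
})

# how="left" keeps every row of socio; areas without a match get NaN.
merged = socio.merge(homicides, on="Community Area Number", how="left")

# Replace the NaN counts with zero and restore an integer dtype.
merged["Homicides"] = merged["Homicides"].fillna(0).astype(int)
```

With the default how="inner", area 2 would silently disappear from the result, which is the problem the hint warns about.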
Create a plot where the x-axis is the hour of the day and the y-axis is the proportion of crimes occurring that hour that are domestic. Save this as prop_domestic_by_hour.png.
Hint: You can extract the hour of a Timestamp using the hour attribute (similar to month and dayofweek shown in lecture). Then groupby the hour and aggregate the Domestic column.
-
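For the hourly proportion in #3, one possible approach is sketched below, assuming Date has already been parsed to timestamps and Domestic is a boolean column. The mean of a boolean column is the fraction of True values, which is one way to get the proportion. The data is made up.

```python
import pandas as pd

# Toy data; assumes Date is a datetime column and Domestic is boolean.
crimes = pd.DataFrame({
    "Date": pd.to_datetime([
        "2017-01-01 09:15", "2017-01-01 09:40",
        "2017-01-02 21:05", "2017-01-03 21:30",
        "2017-01-04 21:45", "2017-01-05 21:50",
    ]),
    "Domestic": [True, False, True, False, False, False],
})

# The .dt accessor exposes Timestamp attributes such as hour for a whole column.
crimes["Hour"] = crimes["Date"].dt.hour

# Mean of a boolean column per group = proportion of True values in that group.
prop_domestic = crimes.groupby("Hour")["Domestic"].mean()
```

The result is a Series indexed by hour, so calling prop_domestic.plot() would put the hour on the x-axis and the proportion on the y-axis.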
Chicago is divided into 77 Community Areas, which were originally designated by sociologists at the University of Chicago in the 1920s. They are large neighborhoods. The Census Bureau, however, uses its own geographic partition to aggregate data, such as population counts. These are called census blocks, which are grouped into census tracts.
Download the census block population data, which contains the population of each census block. Also download the census tract to community area mapping, which tells us which Community Area each tract belongs to. (Note that some census tracts cross a Community Area boundary but this file ignores that.)
Merge these datasets on census tract to calculate the population of each community area. Put the result in a dataframe with two columns: Community Area and Population. Write the result to a CSV file called community_populations.csv using the DataFrame.to_csv() function. (You may want to pass index=False so that it doesn't write the index column to the file.)
Hint:
- To merge the datasets you need to find the census tract for each block in the population data. By definition this is the first 6 digits of the block number. (See the U.S. Census page on geoidentifiers, especially the section titled "GEOID Structure for Geographic Areas".)
- However, the data portal has a bug where it dropped the leading digit if it was a zero. Thus you need to convert the census blocks to strings, and then pad them to length 10 with a leading zero using the .str.zfill() function.
- The census tracts in the tract_community.csv mapping are full GeoIDs. The first few numbers represent Cook County. To match the tract in the population data you can ignore these digits and take only the last 6 digits by converting them to strings and indexing.
  - For example, you will want to pair Census Tract 17031031000 in tract_community.csv with CENSUS BLOCK 310003002 in Population_by_2010_Census_Block.csv.
- Finally you can merge the two datasets, group by community area, and aggregate the populations.
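The key-building steps in these hints can be sketched as below. The first block and tract values come from the example above. Everything else, including the TOTAL POPULATION column name, the community area numbers, and the population figures, is a made-up assumption for illustration.

```python
import pandas as pd

# Toy rows; TOTAL POPULATION and the numeric values are assumptions.
blocks = pd.DataFrame({
    "CENSUS BLOCK": [310003002, 310003003, 8390011001],
    "TOTAL POPULATION": [50, 70, 200],
})
tracts = pd.DataFrame({
    "Census Tract": [17031031000, 17031839001],
    "Community Area": [3, 32],
})

# Restore the dropped leading zero (pad to 10 characters),
# then keep the first 6 characters as the tract code.
blocks["tract"] = blocks["CENSUS BLOCK"].astype(str).str.zfill(10).str[:6]

# The full GeoID starts with state and county digits; keep only the last 6.
tracts["tract"] = tracts["Census Tract"].astype(str).str[-6:]

# Merge on the shared tract code, then total the population per community area.
merged = blocks.merge(tracts, on="tract")
populations = (
    merged.groupby("Community Area")["TOTAL POPULATION"]
    .sum()
    .reset_index(name="Population")
)
```

From here, populations.to_csv("community_populations.csv", index=False) would write the two-column file the question asks for.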
-
Merge the dataset you created in #4 with a count of homicides by community area to calculate the homicide rate per 100,000 residents in each community area. Merge this with the socioeconomic data, which contains the name of each community area, to find the community area with the highest homicide rate. Include the name of this community area and its homicide rate in a comment at the end of q5.py.
-
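For the rate calculation in #5, the arithmetic is homicides divided by population, scaled by 100,000. A toy sketch follows, with invented numbers and assumed column names.

```python
import pandas as pd

# Toy data; values and column names are assumptions for illustration only.
populations = pd.DataFrame({
    "Community Area": [1, 2, 3],
    "Population": [50000, 20000, 80000],
})
homicides = pd.DataFrame({
    "Community Area": [1, 3],
    "Homicides": [5, 2],
})

# Left merge so areas with no homicides are kept, then fill their counts with 0.
rates = populations.merge(homicides, on="Community Area", how="left")
rates["Homicides"] = rates["Homicides"].fillna(0)

# Rate per 100,000 residents.
rates["Rate per 100k"] = rates["Homicides"] * 100_000 / rates["Population"]

# Row with the highest rate.
top = rates.loc[rates["Rate per 100k"].idxmax()]
```

In this toy example area 1 has 5 homicides among 50,000 residents, so its rate is 10 per 100,000. The area name would come from one further merge with the socioeconomic data.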
Load the police stations data. Join the crime data with the stations on police district. For each crime, calculate the distance to the police station in miles. Then plot a histogram of these distances and save it to crimes_by_distance.png.
Hint:
- The station DISTRICT is a text field (because one of them is 'Headquarters') so you'll need to convert the crime District to text as well and strip the decimals so they match.
- To calculate the distance, use the included coordinates (called X Coordinates and Y Coordinates in the crime file; X COORDINATES, Y COORDINATES in the stations file). Find their distance in feet by taking the Euclidean distance (the square root of dx squared plus dy squared) and then divide by 5280 to convert to miles.
- By default the histogram will have 10 bins. You can increase that using the bins argument to hist().
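The key conversion and distance calculation from these hints can be sketched on toy data as follows. The column names follow the hint text and should be checked against the actual CSV headers. The coordinates are invented so that the distances come out to round numbers.

```python
import numpy as np
import pandas as pd

# Toy data; column names follow the hint and values are made up.
crimes = pd.DataFrame({
    "District": [1.0, 2.0],  # floats, as when the column is read with decimals
    "X Coordinates": [1000.0, 5000.0],
    "Y Coordinates": [2000.0, 9000.0],
})
stations = pd.DataFrame({
    "DISTRICT": ["Headquarters", "1", "2"],
    "X COORDINATES": [0.0, 4168.0, 5000.0],
    "Y COORDINATES": [0.0, 6224.0, 11640.0],
})

# Convert 1.0 -> "1" so the key matches the text DISTRICT column.
# Assumes no missing districts; rows with NaN would need dropping first.
crimes["DISTRICT"] = crimes["District"].astype(int).astype(str)

joined = crimes.merge(stations, on="DISTRICT")

# Euclidean distance in feet, converted to miles.
dx = joined["X Coordinates"] - joined["X COORDINATES"]
dy = joined["Y Coordinates"] - joined["Y COORDINATES"]
joined["Miles"] = np.sqrt(dx**2 + dy**2) / 5280
```

Calling joined["Miles"].hist(bins=50) would then draw the histogram with more than the default 10 bins; the value 50 is an arbitrary choice for illustration.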