In [1]:
# Import libraries
import csv
import calendar

from pprint import pprint

from collections import Counter
from collections import defaultdict

from datetime import datetime

## 05 Answering Data Science Questions

Time for a case study to reinforce all of your learning so far! You'll use all the containers and data types you've learned about to answer several real-world questions about a dataset containing information about crime in Chicago. Have fun!

## 05.01 Counting within Date Ranges

See the video.

## 05.02 Reading your data with CSV Reader and Establishing your Data Containers

Let's get started! The exercises in this chapter are intentionally more challenging, to give you a chance to really solidify your knowledge. Don't lose heart if you find yourself stuck; think back to the concepts you've learned in previous chapters and how you can apply them to this crime dataset. Good luck!

Your data file, __crime_sampler.csv__, contains the date (1st column), block where it occurred (2nd column), primary type of the crime (3rd), description of the crime (4th), description of the location (5th), whether an arrest was made (6th), whether it was a domestic case (7th), and city district (8th).

Here, however, you'll focus on only 4 columns: the date, type of crime, location, and whether or not the crime resulted in an arrest.

Your job in this exercise is to use a CSV Reader to load up a list to hold the data you're going to analyze.

**Instructions**

1. Import the Python csv module.
2. Create a Python file object in read mode for crime_sampler.csv called csvfile.
3. Create an empty list called crime_data.
4. Loop over a csv reader on the file object:
5. Inside the loop, append the date (first element), type of crime (third element), location description (fifth element), and arrest (sixth element) to the crime_data list.
6. Remove the first element (headers) from the crime_data list.
7. Print the first 10 records of the crime_data list. This has been done for you, so hit 'Submit Answer' to see the result!

**Results:**<br>
<font color=darkgreen>Great start! Have a look at the output and notice its structure. How are arrests denoted?</font>

In [2]:
# Create the file object: csvfile
file = 'data/crime_sampler.csv'
with open(file, 'r') as csvfile:
    # Create a list: crime_data, with the date, type of crime, location description, and arrest on the file object
    crime_data = [(row[0], row[2], row[4], row[5]) for row in csv.reader(csvfile)]
    
# Remove the first element from crime_data
del crime_data[0]

# Print the first 10 records
pprint(crime_data[:10])

[('05/23/2016 05:35:00 PM', 'ASSAULT', 'STREET', 'false'),
 ('03/26/2016 08:20:00 PM', 'BURGLARY', 'SMALL RETAIL STORE', 'false'),
 ('04/25/2016 03:05:00 PM', 'THEFT', 'DEPARTMENT STORE', 'true'),
 ('04/26/2016 05:30:00 PM', 'BATTERY', 'SIDEWALK', 'false'),
 ('06/19/2016 01:15:00 AM', 'BATTERY', 'SIDEWALK', 'false'),
 ('05/28/2016 08:00:00 PM', 'BATTERY', 'GAS STATION', 'false'),
 ('07/03/2016 03:43:00 PM', 'THEFT', 'OTHER', 'false'),
 ('06/11/2016 06:55:00 PM', 'PUBLIC PEACE VIOLATION', 'STREET', 'true'),
 ('10/04/2016 10:20:00 AM', 'BATTERY', 'STREET', 'true'),
 ('02/14/2017 09:00:00 PM', 'CRIMINAL DAMAGE', 'PARK PROPERTY', 'false')]
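
As the output shows, arrests are stored as the strings `'true'` and `'false'`. The following is a minimal sketch, not the course solution, of an alternative way to drop the header (calling `next()` on the reader before looping) and of converting the arrest flag to a boolean. The in-memory sample stands in for the real file: its header names and the block, description, domestic, and district values are placeholders chosen for illustration.

```python
import csv
import io

# Hypothetical two-row sample laid out like crime_sampler.csv.
# Header names and the block, description, domestic, and district
# values are placeholders, not data from the real file.
sample = io.StringIO(
    "Date,Block,Primary Type,Description,Location Description,Arrest,Domestic,District\n"
    "05/23/2016 05:35:00 PM,BLOCK A,ASSAULT,DESC,STREET,false,false,1\n"
    "04/25/2016 03:05:00 PM,BLOCK B,THEFT,DESC,DEPARTMENT STORE,true,false,2\n"
)

reader = csv.reader(sample)
header = next(reader)  # consume the header row so it never enters the list

# Keep date, crime type, location, and arrest; turn 'true'/'false' into a bool
sample_data = [(row[0], row[2], row[4], row[5] == 'true') for row in reader]

print(sample_data)
```

Skipping the header up front avoids the separate `del crime_data[0]` step; which of the two reads better is largely a matter of taste.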


## 05.03 Find the Months with the Highest Number of Crimes

Using the __crime_data__ list from the prior exercise, you'll answer a common question that arises when dealing with crime data: _How many crimes are committed each month?_

Feel free to use the IPython Shell to explore the __crime_data__ list - it has been pre-loaded for you. For example, __crime_data[0][0]__ will show you the first column of the first row which, in this case, is the date and time that the crime occurred.

**Instructions**

1. Import Counter from collections and datetime from datetime.
2. Create a Counter object called crimes_by_month.
3. Loop over the crime_data list:
4. Using the datetime.strptime() function, convert the first element of each item into a Python Datetime Object called date.
5. Increment the counter for the month associated with this row by one. You can access the month of date using date.month.
6. Print the 3 most common months for crime.
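
The key step in this list is the `datetime.strptime()` call. As a small sketch, here is how the format string maps onto one of the date strings printed in the previous exercise; the variable names `raw` and `parsed` are only for illustration.

```python
from datetime import datetime

# Date string copied from the first record printed in the previous exercise
raw = '05/23/2016 05:35:00 PM'

# %m month, %d day, %Y four-digit year, %I hour on a 12-hour clock,
# %M minutes, %S seconds, %p AM/PM marker
parsed = datetime.strptime(raw, '%m/%d/%Y %I:%M:%S %p')

print(parsed)        # 2016-05-23 17:35:00
print(parsed.month)  # 5
```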

**Results:**<br>
<font color=darkgreen>Well done! It looks like the months with the highest number of crimes are January, February, and July.</font>

In [3]:
# Create a Counter Object: crimes_by_month
crimes_by_month = Counter()

# Loop over the crime_data list
for crime in crime_data:
    
    # Convert the first element of each item into a Python Datetime Object: date
    date = datetime.strptime(crime[0], '%m/%d/%Y %I:%M:%S %p')
    
    # Increment the counter for the month of the row by one
    crimes_by_month[date.month] += 1
    
# Print the 3 most common months for crime
print(crimes_by_month.most_common(3))

[(1, 1948), (2, 1862), (7, 1257)]
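
The counter is keyed by month number, which is harder to read than a month name. A short sketch of one way to translate the result, using the standard `calendar` module (the name `top_months` is introduced here for illustration):

```python
import calendar

# Result copied from the output above
top_months = [(1, 1948), (2, 1862), (7, 1257)]

# calendar.month_name is indexed 1..12 (index 0 is an empty string)
named = [(calendar.month_name[month], count) for month, count in top_months]

for name, count in named:
    print(f'{name}: {count} crimes')
```

Note that this counter pools every year in the file. That is probably why January and February lead: the 2016-only counts computed later in this notebook are 956 and 919 for those months.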


## 05.04 Transforming your Data Containers to Month and Location

Now let's flip your __crime_data__ list into a dictionary keyed by month with a list of location values for each month, and filter down to the records for the year 2016. Remember you can use the shell to look at the __crime_data__ list, such as __crime_data[1][2]__ to see the location of the crime in the second item of the list (since lists start at 0).

**Instructions**

1. Import defaultdict from collections and datetime from datetime.
2. Create a dictionary that defaults to a list called locations_by_month.
3. Loop over the crime_data list:
4. Convert the first element to a date object exactly like you did in the previous exercise.
5. If the year is 2016, set the key of locations_by_month to be the month of date and .append() the location (third element of each crime_data tuple) to the values list.
6. Print the dictionary. This has been done for you, so hit 'Submit Answer' to see the result!
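
The grouping pattern in these steps can be sketched on a few hand-written pairs. The `pairs` list below is illustrative and is not read from the file.

```python
from collections import defaultdict

# (month, location) pairs; illustrative values, not read from the file
pairs = [(5, 'STREET'), (3, 'SMALL RETAIL STORE'), (5, 'GAS STATION')]

grouped = defaultdict(list)
for month, location in pairs:
    # A missing key is created as an empty list on first access,
    # so no "if month not in grouped" check is needed
    grouped[month].append(location)

print(dict(grouped))
# {5: ['STREET', 'GAS STATION'], 3: ['SMALL RETAIL STORE']}
```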

**Results:**<br>
<font color=darkgreen>Well done! It is difficult to draw quick insights from this output - the .most_common() method would be useful here!</font>

In [4]:
# Create a dictionary that defaults to a list: locations_by_month
locations_by_month = defaultdict(list)

# Loop over the crime_data list
for row in crime_data:
    # Convert the first element to a date object
    date = datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p')
    
    # If the year is 2016 
    if date.year == 2016:
        # Set the dictionary key to the month and append the location (third tuple element) to the values list
        locations_by_month[date.month].append(row[2])

# Print the number of recorded locations per month
# (datetime(1900, m, 1).strftime('%b') gives the same abbreviation as calendar.month_abbr[m])
for m in sorted(locations_by_month):
    print(f'In {calendar.month_abbr[m]}, we registered {len(locations_by_month[m])} locations.')

In Jan, we registered 956 locations.
In Feb, we registered 919 locations.
In Mar, we registered 1080 locations.
In Apr, we registered 1036 locations.
In May, we registered 1092 locations.
In Jun, we registered 1136 locations.
In Jul, we registered 1257 locations.
In Aug, we registered 1218 locations.
In Sep, we registered 1146 locations.
In Oct, we registered 1128 locations.
In Nov, we registered 1026 locations.
In Dec, we registered 928 locations.


In [5]:
datetime(1900, m, 1).strftime('%b')

'Dec'

## 05.05 Find the Most Common Crimes by Location Type by Month in 2016

Using the __locations_by_month__ dictionary from the prior exercise, you'll now determine common crimes by month and location type. Because your dataset is so large, it's a good idea to use Counter to reduce one aspect of it to a more manageable size and learn more about it.

**Instructions**
1. Import Counter from collections.
2. Loop over the items from your dictionary, using tuple expansion to unpack locations_by_month.items() into month and locations.
3. Make a Counter of the locations called location_count.
4. Print the month.
5. Print the five most common crime locations.

**Results:**<br>
<font color=darkgreen>Fantastic work. It looks like most crimes in Chicago in 2016 took place on the street.</font>

In [6]:
# Loop over the items from locations_by_month using tuple expansion of the month and locations
for month, locations in locations_by_month.items():
    # Make a Counter of the locations
    location_count = Counter(locations)
    # Print the month 
    print(month)
    # Print the five most common locations
    print(location_count.most_common(5))

5
[('STREET', 241), ('RESIDENCE', 175), ('APARTMENT', 128), ('SIDEWALK', 111), ('OTHER', 41)]
3
[('STREET', 240), ('RESIDENCE', 190), ('APARTMENT', 139), ('SIDEWALK', 99), ('OTHER', 52)]
4
[('STREET', 213), ('RESIDENCE', 171), ('APARTMENT', 152), ('SIDEWALK', 96), ('OTHER', 40)]
6
[('STREET', 245), ('RESIDENCE', 164), ('APARTMENT', 159), ('SIDEWALK', 123), ('PARKING LOT/GARAGE(NON.RESID.)', 44)]
7
[('STREET', 309), ('RESIDENCE', 177), ('APARTMENT', 166), ('SIDEWALK', 125), ('OTHER', 47)]
10
[('STREET', 248), ('RESIDENCE', 206), ('APARTMENT', 122), ('SIDEWALK', 92), ('OTHER', 62)]
12
[('STREET', 207), ('RESIDENCE', 158), ('APARTMENT', 136), ('OTHER', 47), ('SIDEWALK', 46)]
1
[('STREET', 196), ('RESIDENCE', 160), ('APARTMENT', 153), ('SIDEWALK', 72), ('PARKING LOT/GARAGE(NON.RESID.)', 43)]
9
[('STREET', 279), ('RESIDENCE', 183), ('APARTMENT', 144), ('SIDEWALK', 121), ('OTHER', 39)]
11
[('STREET', 236), ('RESIDENCE', 182), ('APARTMENT', 154), ('SIDEWALK', 75), ('OTHER', 41)]
8
[('STREET',

In [7]:
# Create a dictionary that defaults to a Counter: locations_by_month2
locations_by_month2 = defaultdict(Counter)

# Loop over the crime_data list
for row in crime_data:
    # Convert the first element to a date object
    date = datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p')
    
    # If the year is 2016 
    if date.year == 2016:
        # Increment that month's count for the location (third tuple element)
        locations_by_month2[date.month][row[2]] += 1

# Print the dictionary
for m in sorted(locations_by_month2):
    print('In {}, most common crime locations: '.format(datetime(1900, m, 1).strftime('%b')))
    print(locations_by_month2[m].most_common(5))

In Jan, most common crime locations: 
[('STREET', 196), ('RESIDENCE', 160), ('APARTMENT', 153), ('SIDEWALK', 72), ('PARKING LOT/GARAGE(NON.RESID.)', 43)]
In Feb, most common crime locations: 
[('STREET', 188), ('RESIDENCE', 159), ('APARTMENT', 144), ('SIDEWALK', 73), ('OTHER', 40)]
In Mar, most common crime locations: 
[('STREET', 240), ('RESIDENCE', 190), ('APARTMENT', 139), ('SIDEWALK', 99), ('OTHER', 52)]
In Apr, most common crime locations: 
[('STREET', 213), ('RESIDENCE', 171), ('APARTMENT', 152), ('SIDEWALK', 96), ('OTHER', 40)]
In May, most common crime locations: 
[('STREET', 241), ('RESIDENCE', 175), ('APARTMENT', 128), ('SIDEWALK', 111), ('OTHER', 41)]
In Jun, most common crime locations: 
[('STREET', 245), ('RESIDENCE', 164), ('APARTMENT', 159), ('SIDEWALK', 123), ('PARKING LOT/GARAGE(NON.RESID.)', 44)]
In Jul, most common crime locations: 
[('STREET', 309), ('RESIDENCE', 177), ('APARTMENT', 166), ('SIDEWALK', 125), ('OTHER', 47)]
In Aug, most common crime locations: 
[('STR

## 05.06 Dictionaries with Time Windows for Keys

See the video.

## 05.07 Reading your Data with DictReader and Establishing your Data Containers

Your data file, __crime_sampler.csv__ contains in positional order: the date, block where it occurred, primary type of the crime, description of the crime, description of the location, if an arrest was made, was it a domestic case, and city district.

You'll now use a DictReader to load up a dictionary to hold your data with the district as the key and the rest of the data in a list. The __csv__, __defaultdict__, and __datetime__ modules have already been imported for you.

**Instructions**

1. Create a Python file object in read mode for crime_sampler.csv called csvfile.
2. Create a dictionary that defaults to a list called crimes_by_district.
3. Loop over a DictReader of the CSV file:
4. Pop 'District' from each row and store it as district.
5. Append the rest of the data (row) to the district key of crimes_by_district.
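
These steps can be tried on a small in-memory sample before touching the real file. The sketch below assumes a cut-down header with only four of the columns; the rows reuse dates and crime types printed earlier, while the district numbers are placeholders.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample with a reduced set of columns.
# The district numbers are placeholders, not values from the real file.
sample = io.StringIO(
    "Date,Primary Type,Arrest,District\n"
    "05/23/2016 05:35:00 PM,ASSAULT,false,1\n"
    "04/25/2016 03:05:00 PM,THEFT,true,2\n"
    "06/19/2016 01:15:00 AM,BATTERY,false,1\n"
)

by_district = defaultdict(list)
for row in csv.DictReader(sample):
    # pop() returns the value and removes the key from the row dictionary
    district = row.pop('District')
    by_district[district].append(row)

print(len(by_district['1']))   # 2
print(by_district['2'][0])
```

DictReader yields every value as a string, so the keys here are `'1'` and `'2'`. Wrapping the key in `int()` makes the districts sort numerically rather than alphabetically.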

**Results:**<br>
<font color=darkgreen>Brilliant work. You're now ready to analyze crime by district.</font>

In [27]:
# Create the file object: csvfile
with open(file, 'r') as csvfile:
    # Create a dictionary that defaults to a list: crimes_by_district
    crimes_by_district = defaultdict(list)
    
    # Loop over a DictReader of the CSV file
    for row in csv.DictReader(csvfile):
        # Pop the district from each row: district
        district = row.pop('District')
        # Append the rest of the data to the list for proper district in crimes_by_district
        crimes_by_district[int(district)].append(row)

for district in sorted(crimes_by_district):
    print(f'In {district:>2}, it has been registered {len(crimes_by_district[district]):>4} crime(s).')

pprint(crimes_by_district[district][0])

In  1, it has been registered  753 crime(s).
In  2, it has been registered  647 crime(s).
In  3, it has been registered  650 crime(s).
In  4, it has been registered  785 crime(s).
In  5, it has been registered  685 crime(s).
In  6, it has been registered  929 crime(s).
In  7, it has been registered  847 crime(s).
In  8, it has been registered  971 crime(s).
In  9, it has been registered  697 crime(s).
In 10, it has been registered  728 crime(s).
In 11, it has been registered 1028 crime(s).
In 12, it has been registered  740 crime(s).
In 14, it has been registered  601 crime(s).
In 15, it has been registered  632 crime(s).
In 16, it has been registered  498 crime(s).
In 17, it has been registered  429 crime(s).
In 18, it has been registered  727 crime(s).
In 19, it has been registered  713 crime(s).
In 20, it has been registered  241 crime(s).
In 22, it has been registered  461 crime(s).
In 24, it has been registered  409 crime(s).
In 25, it has been registered  828 crime(s).
In 31, it 

## 05.08 Determine the Arrests by District by Year

Using your __crimes_by_district__ dictionary from the previous exercise, you'll now determine the number of arrests in each city district for each year. __Counter__ is already imported for you. You'll want to use the IPython Shell to explore the __crimes_by_district__ dictionary to determine how to check if an arrest was made.

**Instructions**

1. Loop over the crimes_by_district dictionary, unpacking it into the variables district and crimes.
2. Create an empty Counter object called year_count.
3. Loop over the crimes:
4. If there was an arrest,
5. Convert crime['Date'] to a datetime object and store its year as year.
6. Add the crime to the Counter for the year, by using year as the key of year_count.
7. Print the Counter. This has been done for you, so hit 'Submit Answer' to see the result!

**Results:**<br>
<font color=darkgreen>Interesting. It looked like most arrests took place in the 11th District.</font>

In [32]:
# Loop over the crimes_by_district items in district order, unpacking district and crimes
for district, crimes in sorted(crimes_by_district.items()):
    # Create an empty Counter object: year_count
    year_count = Counter()
    
    # Loop over the crimes:
    for crime in crimes:
        # If there was an arrest
        if crime['Arrest'] == 'true':
            # Convert the Date to a datetime and get the year
            year = datetime.strptime(crime['Date'], '%m/%d/%Y %I:%M:%S %p').year
            # Increment the Counter for the year
            year_count[year] += 1
            
    # Print the district and the counter
    print(f'In district {district:>2}: {year_count}')

In district  1: Counter({2016: 124, 2017: 15})
In district  2: Counter({2016: 84, 2017: 15})
In district  3: Counter({2016: 98, 2017: 18})
In district  4: Counter({2016: 134, 2017: 15})
In district  5: Counter({2016: 149, 2017: 30})
In district  6: Counter({2016: 157, 2017: 32})
In district  7: Counter({2016: 181, 2017: 27})
In district  8: Counter({2016: 124, 2017: 26})
In district  9: Counter({2016: 116, 2017: 17})
In district 10: Counter({2016: 144, 2017: 20})
In district 11: Counter({2016: 275, 2017: 53})
In district 12: Counter({2016: 72, 2017: 9})
In district 14: Counter({2016: 59, 2017: 8})
In district 15: Counter({2016: 154, 2017: 16})
In district 16: Counter({2016: 66, 2017: 9})
In district 17: Counter({2016: 38, 2017: 5})
In district 18: Counter({2016: 92, 2017: 17})
In district 19: Counter({2016: 88, 2017: 11})
In district 20: Counter({2016: 27, 2017: 8})
In district 22: Counter({2016: 78, 2017: 12})
In district 24: Counter({2016: 51, 2017: 10})
In district 25: Counter({2016
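
To compare districts directly, the per-year counters can be collapsed into one total per district. A minimal sketch using three of the lines printed above (the names `arrests` and `totals` are introduced here for illustration):

```python
from collections import Counter

# Per-year arrest counts copied from three lines of the output above
arrests = {
    1: Counter({2016: 124, 2017: 15}),
    7: Counter({2016: 181, 2017: 27}),
    11: Counter({2016: 275, 2017: 53}),
}

# Sum each district's yearly counts, then rank districts by total
totals = Counter({district: sum(years.values())
                  for district, years in arrests.items()})

print(totals.most_common(1))  # [(11, 328)]
```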

## 05.09 Unique Crimes by City Block

You're in the home stretch!

Here, your data has been reshaped into a dictionary called __crimes_by_block__ in which crimes are listed by city block. Your task in this exercise is to get a unique list of crimes that have occurred on a couple of the blocks that have been selected for you to learn more about. You might remember that you used __set()__ to solve problems like this in Chapter 1.

Go for it!

**Instructions**

1. Create a unique list of crimes for the '001XX N STATE ST' block called n_state_st_crimes and print it.
2. Create a unique list of crimes for the '0000X W TERMINAL ST' block called w_terminal_st_crimes and print it.
3. Find the crimes committed on 001XX N STATE ST but not 0000X W TERMINAL ST. Store the result as crime_differences and print it.

**Results:**<br>
<font color=darkgreen>Well done! There are some curious differences in crime between these two city blocks.</font>

In [68]:
crimes_by_district_and_block = defaultdict(dict)
for district, crimes in sorted(crimes_by_district.items()):
    
    crimes_by_block = defaultdict(set)
    for crime in crimes:
        crimes_by_block[crime['Block']].add(crime['Primary Type'])
    crimes_by_district_and_block[district] = crimes_by_block
    
# Print the first five blocks for district 1
district = 1
for i, block in enumerate(sorted(crimes_by_district_and_block[district]), start=1):
    if i > 5: break
    print('District No.{}: Block "{}" registered {} different types of crime(s).'.format(
                district, block,
                len(crimes_by_district_and_block[district][block])))
    

District No.1: Block "0000X E 21ST ST" registered 1 different types of crime(s).
District No.1: Block "0000X E 26TH ST" registered 1 different types of crime(s).
District No.1: Block "0000X E 8TH ST" registered 1 different types of crime(s).
District No.1: Block "0000X E ADAMS ST" registered 3 different types of crime(s).
District No.1: Block "0000X E CERMAK RD" registered 1 different types of crime(s).


In [69]:
# Create a unique list of crimes for the first block: n_state_st_crimes
n_state_st_crimes = set(crimes_by_district_and_block[1]['001XX N STATE ST'])

# Print the list
print(n_state_st_crimes)

# Create a unique list of crimes for the second block: w_terminal_st_crimes
w_terminal_st_crimes = set(crimes_by_district_and_block[16]['0000X W TERMINAL ST'])

# Print the list
print(w_terminal_st_crimes)

# Find the differences between the two blocks: crime_differences
crime_differences = n_state_st_crimes.difference(w_terminal_st_crimes)

# Print the differences
print(crime_differences)

{'THEFT', 'CRIMINAL TRESPASS', 'OTHER OFFENSE', 'CRIMINAL DAMAGE', 'BATTERY', 'ASSAULT', 'DECEPTIVE PRACTICE', 'ROBBERY'}
{'THEFT', 'CRIMINAL TRESPASS', 'OTHER OFFENSE', 'CRIMINAL DAMAGE', 'ASSAULT', 'DECEPTIVE PRACTICE', 'PUBLIC PEACE VIOLATION', 'NARCOTICS'}
{'ROBBERY', 'BATTERY'}
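
The same comparison can be written with set operators. This sketch copies the two sets from the output above and also looks at the reverse difference and the overlap, which the exercise does not ask for.

```python
# Sets copied from the output above
n_state = {'THEFT', 'CRIMINAL TRESPASS', 'OTHER OFFENSE', 'CRIMINAL DAMAGE',
           'BATTERY', 'ASSAULT', 'DECEPTIVE PRACTICE', 'ROBBERY'}
w_terminal = {'THEFT', 'CRIMINAL TRESPASS', 'OTHER OFFENSE', 'CRIMINAL DAMAGE',
              'ASSAULT', 'DECEPTIVE PRACTICE', 'PUBLIC PEACE VIOLATION', 'NARCOTICS'}

only_n_state = n_state - w_terminal      # same as n_state.difference(w_terminal)
only_w_terminal = w_terminal - n_state   # the difference is not symmetric
in_both = n_state & w_terminal           # same as n_state.intersection(w_terminal)

# Sets are unordered, so sort before printing for a stable display
print(sorted(only_n_state))     # ['BATTERY', 'ROBBERY']
print(sorted(only_w_terminal))  # ['NARCOTICS', 'PUBLIC PEACE VIOLATION']
print(len(in_both))             # 6
```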


## 05.10 Final thoughts

See the video.

# Additional material

- **Datacamp course**: https://learn.datacamp.com/courses/data-types-for-data-science-in-python