# Answering Data Science Questions
> Time for a case study to reinforce all of your learning so far! You'll use all the containers and data types you've learned about to answer several real-world questions about a dataset containing information about crime in Chicago. This is a summary of the lecture "Data Types for Data Science in Python" from DataCamp.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Datacamp, Data_Science]
- image: 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10, 8)

## Counting within Date Ranges


### Reading your data with CSV Reader and Establishing your Data Containers
Let's get started! The exercises in this chapter are intentionally more challenging, to give you a chance to really solidify your knowledge. Don't lose heart if you find yourself stuck; think back to the concepts you've learned in previous chapters and how you can apply them to this crime dataset. Good luck!

Your data file, `crime_sampler.csv`, contains the date (1st column), the block where the crime occurred (2nd), the primary type of the crime (3rd), a description of the crime (4th), a description of the location (5th), whether an arrest was made (6th), whether it was a domestic case (7th), and the city district (8th).

Here, however, you'll focus on only 4 columns: the date, type of crime, location, and whether or not the crime resulted in an arrest.

Your job in this exercise is to use a CSV Reader to load up a list to hold the data you're going to analyze.

In [4]:
import csv

# Create the file object: csvfile
csvfile = open('./dataset/crime_sampler.csv', 'r')

# Create an empty list: crime_data
crime_data = []

# Loop over a csv read on the file object
for row in csv.reader(csvfile):
    # Append the date, type of crime, location description, and arrest
    crime_data.append((row[0], row[2], row[4], row[5]))
    
# Remove the first element from crime_data
crime_data.pop(0)

# Print the first 10 records
print(crime_data[:10])

[('05/23/2016 05:35:00 PM', 'ASSAULT', 'STREET', 'false'), ('03/26/2016 08:20:00 PM', 'BURGLARY', 'SMALL RETAIL STORE', 'false'), ('04/25/2016 03:05:00 PM', 'THEFT', 'DEPARTMENT STORE', 'true'), ('04/26/2016 05:30:00 PM', 'BATTERY', 'SIDEWALK', 'false'), ('06/19/2016 01:15:00 AM', 'BATTERY', 'SIDEWALK', 'false'), ('05/28/2016 08:00:00 PM', 'BATTERY', 'GAS STATION', 'false'), ('07/03/2016 03:43:00 PM', 'THEFT', 'OTHER', 'false'), ('06/11/2016 06:55:00 PM', 'PUBLIC PEACE VIOLATION', 'STREET', 'true'), ('10/04/2016 10:20:00 AM', 'BATTERY', 'STREET', 'true'), ('02/14/2017 09:00:00 PM', 'CRIMINAL DAMAGE', 'PARK PROPERTY', 'false')]
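
The loading step can also be written with a context manager, so the file is closed automatically, and with `next()` to skip the header instead of popping it afterwards. The sketch below is a variation, not the course's solution. It reads from an in-memory string (`sample_csv`, a hypothetical two-row stand-in) so that it runs without the dataset file; the Block, Description, Domestic and District values in it are made-up placeholders.

```python
import csv
import io

# Hypothetical stand-in for crime_sampler.csv: same column order, but the
# Block, Description, Domestic and District values are made-up placeholders.
sample_csv = io.StringIO(
    "Date,Block,Primary Type,Description,Location Description,Arrest,Domestic,District\n"
    "05/23/2016 05:35:00 PM,BLOCK A,ASSAULT,DESC,STREET,false,false,1\n"
    "03/26/2016 08:20:00 PM,BLOCK B,BURGLARY,DESC,SMALL RETAIL STORE,false,false,2\n"
)

crime_data = []
# With a real file this would be: with open(path, newline='') as csvfile:
with sample_csv as csvfile:
    reader = csv.reader(csvfile)
    # Skip the header row before looping
    next(reader)
    for row in reader:
        # Keep the date, type of crime, location description, and arrest flag
        crime_data.append((row[0], row[2], row[4], row[5]))

print(crime_data)
```

Skipping the header up front means `crime_data` never contains the column names, so no later `pop(0)` is needed.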


### Find the Months with the Highest Number of Crimes
Using the `crime_data` list from the prior exercise, you'll answer a common question that arises when dealing with crime data: How many crimes are committed each month?

For example, `crime_data[0][0]` returns the first column of the first row which, in this case, is the date and time that the crime occurred.



In [5]:
from collections import Counter
from datetime import datetime

# Create a Counter object: crimes_by_month
crimes_by_month = Counter()

# Loop over the crime_data list
for row in crime_data:
    
    # Convert the first element of each item into a Python datetime object: date
    date = datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p')
    
    # Increment the counter for the month of the row by one
    crimes_by_month[date.month] += 1
    
# Print the 3 most common months for crime
print(crimes_by_month.most_common(3))

[(1, 1948), (2, 1862), (7, 1257)]
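
To see what the format string does on its own, here is a minimal sketch that parses a few of the date strings printed earlier and counts them by month. The four-item `dates` list is only an illustrative subset, so its counts are unrelated to the totals above.

```python
from collections import Counter
from datetime import datetime

# Date strings copied from the first records printed earlier
dates = [
    '05/23/2016 05:35:00 PM',
    '03/26/2016 08:20:00 PM',
    '04/25/2016 03:05:00 PM',
    '05/28/2016 08:00:00 PM',
]

# %m/%d/%Y = month/day/four-digit year, %I = 12-hour clock, %p = AM/PM
fmt = '%m/%d/%Y %I:%M:%S %p'

parsed = datetime.strptime(dates[0], fmt)
print(parsed)  # 2016-05-23 17:35:00

# Counter accepts any iterable, so the loop can be a generator expression
crimes_by_month = Counter(datetime.strptime(d, fmt).month for d in dates)
print(crimes_by_month.most_common(1))  # [(5, 2)]
```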


### Transforming your Data Containers to Month and Location
Now let's flip your `crime_data` list into a dictionary keyed by month with a list of location values for each month, and filter down to the records for the year 2016. Remember you can use the shell to look at the `crime_data` list, such as `crime_data[1][4]` to see the location of the crime in the second item of the list (since lists start at 0).



In [22]:
crime_data = []

csvfile = open('./dataset/crime_sampler.csv', 'r')

for row in csv.reader(csvfile):
    # Keep all eight columns of each row
    crime_data.append([row[0],
                      row[1], row[2], row[3], row[4], row[5], row[6], row[7]])

crime_data.pop(0)

['Date',
 'Block',
 'Primary Type',
 'Description',
 'Location Description',
 'Arrest',
 'Domestic',
 'District']

In [23]:
from collections import defaultdict

# Create a dictionary that defaults to a list: locations_by_month
locations_by_month = defaultdict(list)

# Loop over the crime_data list
for row in crime_data:
    # Convert the first element to a date object
    date = datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p')
    
    # If the year is 2016
    if date.year == 2016:
        # Set the dictionary key to the month and append the location(fifth element) 
        # to the values list
        locations_by_month[date.month].append(row[4])
        
# Print the dictionary
print(locations_by_month)

defaultdict(<class 'list'>, {5: ['STREET', 'GAS STATION', '', 'PARKING LOT/GARAGE(NON.RESID.)', 'RESIDENCE', 'STREET', 'RESTAURANT', 'SMALL RETAIL STORE', 'STREET', 'APARTMENT', 'SIDEWALK', 'PARKING LOT/GARAGE(NON.RESID.)', 'DEPARTMENT STORE', 'PARKING LOT/GARAGE(NON.RESID.)', 'SMALL RETAIL STORE', 'RESIDENCE', 'STREET', 'RESIDENCE', 'APARTMENT', 'RESIDENCE-GARAGE', 'APARTMENT', 'ALLEY', 'HIGHWAY/EXPRESSWAY', 'SIDEWALK', 'POLICE FACILITY/VEH PARKING LOT', 'RESIDENCE', 'STREET', 'APARTMENT', 'RESIDENCE PORCH/HALLWAY', 'STREET', 'RESIDENCE', 'SMALL RETAIL STORE', 'SCHOOL, PUBLIC, BUILDING', 'SIDEWALK', 'SCHOOL, PUBLIC, BUILDING', 'STREET', 'APARTMENT', 'STREET', 'SIDEWALK', 'SMALL RETAIL STORE', 'ALLEY', 'OTHER', 'APARTMENT', 'STREET', 'RESIDENCE', 'GROCERY FOOD STORE', 'SIDEWALK', 'SCHOOL, PUBLIC, BUILDING', 'APARTMENT', 'APARTMENT', 'PARKING LOT/GARAGE(NON.RESID.)', 'RESIDENCE', 'STREET', 'APARTMENT', 'APARTMENT', 'CURRENCY EXCHANGE', 'RESIDENTIAL YARD (FRONT/BACK)', 'ALLEY', 'CTA TRAI
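
`defaultdict(list)` is convenient for this kind of grouping because a missing key gets an empty list on first access, so no existence check is needed. A minimal sketch with made-up `(month, location)` pairs, compared against the equivalent plain-dictionary version:

```python
from collections import defaultdict

# Illustrative (month, location) pairs, not taken from the dataset
pairs = [(5, 'STREET'), (3, 'RESIDENCE'), (5, 'GAS STATION')]

# defaultdict creates the empty list the first time a month is seen
locations_by_month = defaultdict(list)
for month, location in pairs:
    locations_by_month[month].append(location)

# The same grouping with a plain dict needs setdefault (or an if-check)
plain = {}
for month, location in pairs:
    plain.setdefault(month, []).append(location)

print(dict(locations_by_month))  # {5: ['STREET', 'GAS STATION'], 3: ['RESIDENCE']}
```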

### Find the Most Common Crimes by Location Type by Month in 2016
Using the `locations_by_month` dictionary from the prior exercise, you'll now determine the most common crime locations for each month. Because your dataset is so large, it's a good idea to use a `Counter` to reduce one aspect of it to a more manageable size and learn more about it.

In [25]:
# Loop over the items from locations_by_month using tuple expansion of the month and locations
for month, locations in locations_by_month.items():
    # Make a Counter of the locations
    location_count = Counter(locations)
    # Print the month
    print(month)
    # Print the five most common locations
    print(location_count.most_common(5))

5
[('STREET', 241), ('RESIDENCE', 175), ('APARTMENT', 128), ('SIDEWALK', 111), ('OTHER', 41)]
3
[('STREET', 240), ('RESIDENCE', 190), ('APARTMENT', 139), ('SIDEWALK', 99), ('OTHER', 52)]
4
[('STREET', 213), ('RESIDENCE', 171), ('APARTMENT', 152), ('SIDEWALK', 96), ('OTHER', 40)]
6
[('STREET', 245), ('RESIDENCE', 164), ('APARTMENT', 159), ('SIDEWALK', 123), ('PARKING LOT/GARAGE(NON.RESID.)', 44)]
7
[('STREET', 309), ('RESIDENCE', 177), ('APARTMENT', 166), ('SIDEWALK', 125), ('OTHER', 47)]
10
[('STREET', 248), ('RESIDENCE', 206), ('APARTMENT', 122), ('SIDEWALK', 92), ('OTHER', 62)]
12
[('STREET', 207), ('RESIDENCE', 158), ('APARTMENT', 136), ('OTHER', 47), ('SIDEWALK', 46)]
1
[('STREET', 196), ('RESIDENCE', 160), ('APARTMENT', 153), ('SIDEWALK', 72), ('PARKING LOT/GARAGE(NON.RESID.)', 43)]
9
[('STREET', 279), ('RESIDENCE', 183), ('APARTMENT', 144), ('SIDEWALK', 121), ('OTHER', 39)]
11
[('STREET', 236), ('RESIDENCE', 182), ('APARTMENT', 154), ('SIDEWALK', 75), ('OTHER', 41)]
8
[('STREET',
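
The months above are printed in the order they were first seen in the data, because dictionaries preserve insertion order. If calendar order is easier to read, the items can be sorted before looping. A small sketch, assuming a hypothetical two-month `locations_by_month`:

```python
from collections import Counter

# Hypothetical miniature version of locations_by_month
locations_by_month = {
    5: ['STREET', 'STREET', 'RESIDENCE'],
    3: ['APARTMENT', 'STREET', 'APARTMENT'],
}

top_by_month = {}
# sorted() orders the (month, locations) pairs by month number
for month, locations in sorted(locations_by_month.items()):
    top_by_month[month] = Counter(locations).most_common(1)
    print(month, top_by_month[month])
```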

## Dictionaries with Time Windows for Keys


### Reading your Data with DictReader and Establishing your Data Containers
Your data file, `crime_sampler.csv`, contains, in positional order: the date, the block where the crime occurred, the primary type of the crime, a description of the crime, a description of the location, whether an arrest was made, whether it was a domestic case, and the city district.

You'll now use a DictReader to load up a dictionary to hold your data with the district as the key and the rest of the data in a list.

In [27]:
# Create the file object: csvfile
csvfile = open('./dataset/crime_sampler.csv', 'r')

# Create a dictionary that defaults to a list: crimes_by_district
crimes_by_district = defaultdict(list)

# Loop over a DictReader of the CSV file
for row in csv.DictReader(csvfile):
    # Pop the district from each row: district
    district = row.pop('District')
    # Append the rest of the data to the list for the proper district in crimes_by_district
    crimes_by_district[district].append(row)
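
To make the shape of the result concrete, here is a minimal sketch of the same pattern on an in-memory CSV. The reduced column set and the district value `14` assigned to these rows are assumptions for illustration only.

```python
import csv
import io
from collections import defaultdict

# Hypothetical four-column sample; the district numbers are made up
sample = io.StringIO(
    "Date,Primary Type,Arrest,District\n"
    "05/23/2016 05:35:00 PM,ASSAULT,false,14\n"
    "04/25/2016 03:05:00 PM,THEFT,true,14\n"
)

crimes_by_district = defaultdict(list)
for row in csv.DictReader(sample):
    # pop() returns the value and removes the key from the row
    district = row.pop('District')
    crimes_by_district[district].append(row)

# Keys are strings, because csv returns every field as text
print(list(crimes_by_district))          # ['14']
print(len(crimes_by_district['14']))     # 2
```

Each stored row is a dictionary keyed by the header names, minus `District`, which now lives in the outer key.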

### Determine the Arrests by District by Year
Using your `crimes_by_district` dictionary from the previous exercise, you'll now determine the number of arrests in each city district for each year.
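
One detail matters for this count: the `csv` module returns every field as a string, and the records printed earlier show the arrest flag as lowercase `'true'` or `'false'`, so the comparison has to match that spelling. A minimal sketch of the idea, assuming a hypothetical hand-built `crimes_by_district` and collecting the results in a `defaultdict(Counter)` instead of printing them:

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical miniature crimes_by_district; district assignments are made up
crimes_by_district = {
    '14': [
        {'Date': '05/23/2016 05:35:00 PM', 'Arrest': 'false'},
        {'Date': '04/25/2016 03:05:00 PM', 'Arrest': 'true'},
        {'Date': '02/14/2017 09:00:00 PM', 'Arrest': 'true'},
    ],
    '1': [
        {'Date': '10/04/2016 10:20:00 AM', 'Arrest': 'true'},
    ],
}

# One Counter of years per district
arrests = defaultdict(Counter)
for district, crimes in crimes_by_district.items():
    for crime in crimes:
        # lower() guards against 'True' / 'TRUE' spellings in other files
        if crime['Arrest'].lower() == 'true':
            year = datetime.strptime(crime['Date'], '%m/%d/%Y %I:%M:%S %p').year
            arrests[district][year] += 1

for district, year_count in arrests.items():
    print(district, dict(year_count))
```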

In [33]:
# Loop over the crimes_by_district using expansion as district and crimes
for district, crimes in crimes_by_district.items():
    # Print the district
    print(district)
    
    # Create an empty Counter object: year_count
    year_count = Counter()
    
    # Loop over the crimes:
    for crime in crimes:
        # If there was an arrest (the CSV stores the flag as the string 'true' or 'false')
        if crime['Arrest'] == 'true':
            # Convert the Date to a datetime and get the year
            year = datetime.strptime(crime['Date'], '%m/%d/%Y %I:%M:%S %p').year
            # Increment the Counter for the year
            year_count[year] += 1
            
    # Print the counter once per district, after all of its crimes are counted
    print(year_count)


### Unique Crimes by City Block
You're in the home stretch!

Here, your data has been reshaped into a dictionary called `crimes_by_block` in which crimes are listed by city block. Your task in this exercise is to get a unique list of crimes that have occurred on a couple of the blocks that have been selected for you to learn more about. You might remember that you used `set()` to solve problems like this in Chapter 1.

In [41]:
# Create the file object: csvfile
csvfile = open('./dataset/crime_sampler.csv', 'r')

# Skip the header row
next(csvfile)

# Create a dictionary that defaults to a list: crimes_by_block
crimes_by_block = defaultdict(list)

# Loop over a csv reader of the file
for row in csv.reader(csvfile):
    # Take the block (second column) from each row: block
    block = row[1]
    # Append the primary crime type (third column) to the list for that block
    crimes_by_block[block].append(row[2])

In [46]:
# Create a set of unique crimes for the first block: n_state_st_crimes
n_state_st_crimes = set(crimes_by_block['001XX N STATE ST'])

# Print the set
print(n_state_st_crimes)

# Create a set of unique crimes for the second block: w_terminal_st_crimes
w_terminal_st_crimes = set(crimes_by_block['0000X W TERMINAL ST'])

# Print the set
print(w_terminal_st_crimes)

# Find the crimes that occurred on the first block but not the second: crime_differences
crime_differences = n_state_st_crimes.difference(w_terminal_st_crimes)
                        
# Print the differences
print(crime_differences)

{'BATTERY', 'ASSAULT', 'CRIMINAL DAMAGE', 'OTHER OFFENSE', 'ROBBERY', 'CRIMINAL TRESPASS', 'THEFT', 'DECEPTIVE PRACTICE'}
{'ASSAULT', 'PUBLIC PEACE VIOLATION', 'CRIMINAL DAMAGE', 'OTHER OFFENSE', 'CRIMINAL TRESPASS', 'NARCOTICS', 'THEFT', 'DECEPTIVE PRACTICE'}
{'BATTERY', 'ROBBERY'}
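
Note that `difference()` is directional: it returns what is in the first set but not in the second. The sketch below rebuilds the two sets from the output printed above and shows the reverse difference and the overlap as well; the variable names `only_w_terminal` and `common` are introduced here for illustration.

```python
# The two sets exactly as printed in the output above
n_state_st_crimes = {'BATTERY', 'ASSAULT', 'CRIMINAL DAMAGE', 'OTHER OFFENSE',
                     'ROBBERY', 'CRIMINAL TRESPASS', 'THEFT', 'DECEPTIVE PRACTICE'}
w_terminal_st_crimes = {'ASSAULT', 'PUBLIC PEACE VIOLATION', 'CRIMINAL DAMAGE',
                        'OTHER OFFENSE', 'CRIMINAL TRESPASS', 'NARCOTICS', 'THEFT',
                        'DECEPTIVE PRACTICE'}

# Crimes seen on N State St but not on W Terminal St
only_n_state = n_state_st_crimes.difference(w_terminal_st_crimes)

# Reversing the operands gives the other side
only_w_terminal = w_terminal_st_crimes.difference(n_state_st_crimes)

# Crimes seen on both blocks
common = n_state_st_crimes.intersection(w_terminal_st_crimes)

print(sorted(only_n_state))     # ['BATTERY', 'ROBBERY']
print(sorted(only_w_terminal))  # ['NARCOTICS', 'PUBLIC PEACE VIOLATION']
print(len(common))              # 6
```

Sets are unordered, so `sorted()` is used here only to make the printed result stable.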
