Crime data visualization in San Francisco

San Francisco has one of the most "open data" policies of any large city. In this lab, we are going to download about 400M of data (1,968,080 records) describing all police incidents since 2003 (I'm grabbing data on October 6, 2016).

Getting started

Download Police Department Incident Reports: Historical 2003 to May 2018 or all San Francisco police department incident since 1 January 2003. Save in "CSV for Excel" format.

We can easily figure out how many records there are:

$ wc -l Police_Department_Incident_Reports__2018_to_Present.csv 
  238457
Police_Department_Incident_Reports__2018_to_Present.csv

So 238456 not including the header row. Let's kill that first row using tail:

$ tail +2 Police_Department_Incident_Reports__2018_to_Present.csv > SFPD.csv

In Python, that would be equivalent to data[1:] (it counts from 0 not 1 like tail). You can name that data file whatever you want but I will call it SFPD.csv for these exercises.

Sniffing the data

To get an idea of what the data looks like, let's do a simple histogram of the categories and crime descriptions. Here is the category histogram:

$ python histo.py ~/data/SFPD.csv
406342 LARCENY/THEFT
279619 OTHER OFFENSES
210564 NON-CRIMINAL
172414 ASSAULT
117467 VEHICLE THEFT
113708 DRUG/NARCOTIC
100802 VANDALISM
 93082 WARRANTS
 81932 BURGLARY
 70707 SUSPICIOUS OCC
 57810 MISSING PERSON
 50477 ROBBERY
 37409 FRAUD
 22549 SECONDARY CODES
 22157 FORGERY/COUNTERFEITING
 19415 WEAPON LAWS
 16750 TRESPASS
 15932 PROSTITUTION
 10586 STOLEN PROPERTY
 10065 SEX OFFENSES, FORCIBLE
  9331 DISORDERLY CONDUCT
  9281 DRUNKENNESS
  7779 RECOVERED VEHICLE
  5147 DRIVING UNDER THE INFLUENCE
  4989 KIDNAPPING
  4062 RUNAWAY
  3956 LIQUOR LAWS
  3430 ARSON
  2688 EMBEZZLEMENT
  2377 LOITERING
  1164 SUICIDE
  1114 FAMILY OFFENSES
   880 BAD CHECKS
   707 BRIBERY
   643 EXTORTION
   365 SEX OFFENSES, NON FORCIBLE
   319 GAMBLING
    49 PORNOGRAPHY/OBSCENE MAT
    12 TREA

and here is the start of the crime description histogram:

143903 GRAND THEFT FROM LOCKED AUTO
 70123 LOST PROPERTY
 60593 BATTERY
 59892 STOLEN AUTOMOBILE
 59412 DRIVERS LICENSE, SUSPENDED OR REVOKED
 52218 WARRANT ARREST
 49375 AIDED CASE, MENTAL DISTURBED
 47935 SUSPICIOUS OCCURRENCE
 45227 PETTY THEFT FROM LOCKED AUTO
 39820 MALICIOUS MISCHIEF, VANDALISM OF VEHICLES
 38117 PETTY THEFT OF PROPERTY
 36992 MALICIOUS MISCHIEF, VANDALISM
 35278 TRAFFIC VIOLATION
 31879 THREATS AGAINST LIFE
 28655 FOUND PROPERTY
 25815 ENROUTE TO OUTSIDE JURISDICTION
 25254 GRAND THEFT OF PROPERTY
 22830 PETTY THEFT FROM A BUILDING
 21511 PETTY THEFT SHOPLIFTING
 21340 POSSESSION OF NARCOTICS PARAPHERNALIA
 21336 FOUND PERSON
 20757 GRAND THEFT FROM A BUILDING
 20068 CREDIT CARD, THEFT BY USE OF
...

Exercise: Create a helper function in csvcols.py that opens a CSV file in Excel format given a filename and returns the indicated column number:

import csv
from collections import Counter

def get_column(filename,col):
    """
    Load CSV in Excel format, return Counter created from column of data indicated by
    integer col parameter.
    """
    data = []
    with open(filename, 'rb') as f:
        reader = csv.reader(f, dialect='excel')
        ...
    data = Counter(data)
    return data

Then in histo.py, we can use it to get a particular column:

categories = get_column(sys.argv[1],col=1)

Print out the histograms as shown above for categories and descriptions to finish off the exercise. Solutions

Word clouds

A more interesting way to visualize differences in term frequency is using a so-called word cloud. For example, here is a word cloud showing the categories from 2003 to the present.

Python has a nice library you can use:

$ pip install wordcloud

Exercise: In a file called catcloud.py, once again get the categories and then create a word cloud object and display it:

from wordcloud import WordCloud
from csvcols import get_column
import matplotlib.pyplot as plt
import sys

...
wordcloud = WordCloud(width=1800,
                      height=1400,
                      max_words=10000,
                      random_state=1,
                      relative_scaling=0.25)

... get tuples with (word,count) from categories Counter ...                      
wordcloud.fit_words(wordtuples)

plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Solutions

Which neighborhood is the "worst"?

Exercise: Now, pullout the police district and do a word cloud on that in hoodcloud.py (it's ok to cut/paste):

Crimes per neighborhood

Exercise: Filter the CSV file using grep from the commandline to get just the rows from a particular precinct, such as MISSION:

$ grep MISSION ~/data/SFPD.csv > /tmp/mission.csv
$ grep RICHMOND ~/data/SFPD.csv > /tmp/richmond.csv

Run the wordcloud.py script on those files to get an idea of the types of crimes per those two neighborhoods. Here is the mission and richmond districts crime category clouds.

Which neighborhood has most car break-ins?

Exercise: Filter the SFPD.csv for GRAND THEFT FROM LOCKED AUTO and then run hoodcloud.py on the resulting csv.

Hmm..ok, so parking in the Mission is ok, but Northern, Central, and Southern precincts are bad news.

If you get stuck in any of these exercises, you can look at the code associated with this notes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly