# Crime data visualization in San Francisco

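San Francisco has one of the most open "open data" policies of any large city. In this lab, we are going to download data describing all police incidents since 2018. When I grabbed the data on August 5, 2019, it was about 85MB (238,456 records); the file keeps growing, so the sizes and counts shown below come from a later, larger download.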

## Getting started

Download [Police Department Incident Reports 2018 to present](https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783) or, if you want, all [San Francisco police department incidents since 1 January 2003](https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry).

*That turns out to be a slow link...try downloading from my S3 bucket*:

`https://msan692.s3.us-west-1.amazonaws.com/SFPD.csv`

Or, click the "Export" button and then save in "CSV for Excel" format. (It's fairly large at about 140MB, so it could take a while if you have a slow connection.) You can name that data file whatever you want, but I will call it `SFPD.csv` for these exercises and save it in `/tmp`.

You can use the command line to directly access and download the link if you want:

In [1]:
! curl 'https://data.sfgov.org/api/views/wg3w-h783/rows.csv?accessType=DOWNLOAD&bom=true&format=true' > /tmp/SFPD.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  173M    0  173M    0     0  5061k      0 --:--:--  0:00:35 --:--:-- 5212k
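If you would rather stay in Python, the same download can be sketched with the standard library. This is a minimal sketch and not part of the original lab: the `download` helper name is hypothetical, and the example URL in the comment is the S3 link given above.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: str) -> Path:
    """Stream the resource at `url` into the file `dest` and return its path."""
    dest_path = Path(dest)
    # copyfileobj copies in chunks, so the whole file never sits in memory
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


# Example (needs network access):
# download("https://msan692.s3.us-west-1.amazonaws.com/SFPD.csv", "/tmp/SFPD.csv")
```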


We can easily figure out how many records there are:

In [4]:
! wc -l /tmp/SFPD.csv

  499034 /tmp/SFPD.csv


So that's currently about 500,000 records.
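Keep in mind that `wc -l` counts physical lines, so the total includes the header row, and a quoted field containing a newline would be counted more than once. The sketch below shows the difference on a tiny made-up sample (not real SFPD data); `count_records` is a hypothetical helper name.

```python
import csv
import io


def count_records(lines) -> int:
    """Count data records in CSV text, excluding the header row."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)


# Made-up sample: a header plus two records, the second of which has a
# newline inside a quoted field and so spans two physical lines.
sample = 'Incident ID,Incident Description\n1,"Battery"\n2,"Theft,\nOther Property"\n'

print(sample.count("\n"))                  # 4 physical lines, what `wc -l` would report
print(count_records(io.StringIO(sample)))  # 2 logical records
```

For the real file, you could pass an open file handle instead, for example `open('/tmp/SFPD.csv', newline='')`.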

## Sniffing the data

Let's assume the file you downloaded is in `/tmp`:

In [5]:
import pandas as pd

df_sfpd = pd.read_csv('/tmp/SFPD.csv')
df_sfpd.head(10).T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Incident Datetime | 2021/05/14 01:51:00 AM | 2021/04/01 12:00:00 PM | 2021/05/12 02:00:00 PM | 2021/05/09 10:00:00 PM | 2021/05/11 03:00:00 PM | 2021/05/14 06:00:00 AM | 2021/05/14 12:02:00 PM | 2021/05/14 02:45:00 PM | 2021/05/14 02:24:00 AM | 2021/05/14 09:50:00 AM |
| Incident Date | 2021/05/14 | 2021/04/01 | 2021/05/12 | 2021/05/09 | 2021/05/11 | 2021/05/14 | 2021/05/14 | 2021/05/14 | 2021/05/14 | 2021/05/14 |
| Incident Time | 01:51 | 12:00 | 14:00 | 22:00 | 15:00 | 06:00 | 12:02 | 14:45 | 02:24 | 09:50 |
| Incident Year | 2021 | 2021 | 2021 | 2021 | 2021 | 2021 | 2021 | 2021 | 2021 | 2021 |
| Incident Day of Week | Friday | Thursday | Wednesday | Sunday | Tuesday | Friday | Friday | Friday | Friday | Friday |
| Report Datetime | 2021/05/14 01:57:00 AM | 2021/05/04 09:22:00 AM | 2021/05/12 03:56:00 PM | 2021/05/09 11:44:00 PM | 2021/05/11 05:20:00 PM | 2021/05/14 08:03:00 PM | 2021/05/14 01:09:00 PM | 2021/05/14 04:53:00 PM | 2021/05/14 02:30:00 AM | 2021/05/14 09:55:00 AM |
| Row ID | 103010326030 | 102792271000 | 103012006244 | 103015506244 | 103017806244 | 103033564070 | 103032227195 | 103029907055 | 103010864020 | 103021705081 |
| Incident ID | 1030103 | 1027922 | 1030120 | 1030155 | 1030178 | 1030335 | 1030322 | 1030299 | 1030108 | 1030217 |
| Incident Number | 210295348 | 216049830 | 216053944 | 216053762 | 216053778 | 210297269 | 210296186 | 210296716 | 210295376 | 210295809 |
| CAD Number | 211340138.0 | NaN | NaN | NaN | NaN | 211342146.0 | 211341484.0 | 211341786.0 | 211340171.0 | 211340846.0 |


To get a better idea of what the data looks like, let's do a simple histogram of the categories and crime descriptions.  First, here are the distinct category values, followed by counts of the ten most common categories and descriptions:

In [6]:
df_sfpd['Incident Category'].unique()

array(['Arson', 'Lost Property', 'Larceny Theft', 'Suspicious Occ',
       'Other Miscellaneous', 'Motor Vehicle Theft', 'Non-Criminal',
       'Burglary', 'Malicious Mischief', 'Assault',
       'Weapons Carrying Etc', 'Weapons Offense', 'Recovered Vehicle',
       'Warrant', 'Fraud', 'Drug Offense',
       'Offences Against The Family And Children', 'Disorderly Conduct',
       'Other Offenses', 'Miscellaneous Investigation', 'Missing Person',
       'Suspicious', 'Traffic Violation Arrest', 'Robbery', 'Other',
       'Traffic Collision', 'Drug Violation', 'Stolen Property',
       'Courtesy Report', 'Case Closure', 'Fire Report', nan, 'Vandalism',
       'Forgery And Counterfeiting', 'Sex Offense', 'Vehicle Impounded',
       'Vehicle Misplaced', 'Civil Sidewalks', 'Homicide', 'Suicide',
       'Embezzlement', 'Rape', 'Prostitution',
       'Human Trafficking (A), Commercial Sex Acts', 'Weapons Offence',
       'Motor Vehicle Theft?', 'Gambling', 'Liquor Laws',
       'Human Traffic

In [7]:
from collections import Counter
counter = Counter(df_sfpd['Incident Category'])
counter.most_common(10)

[('Larceny Theft', 150235),
 ('Other Miscellaneous', 36629),
 ('Malicious Mischief', 33244),
 ('Non-Criminal', 30669),
 ('Assault', 29969),
 ('Burglary', 28358),
 ('Motor Vehicle Theft', 23464),
 ('Recovered Vehicle', 18221),
 ('Warrant', 16050),
 ('Lost Property', 15282)]

In [8]:
from collections import Counter
counter = Counter(df_sfpd['Incident Description'])
counter.most_common(10)

[('Theft, From Locked Vehicle, >$950', 62097),
 ('Malicious Mischief, Vandalism to Property', 16362),
 ('Lost Property', 15282),
 ('Theft, Other Property, $50-$200', 14678),
 ('Battery', 14529),
 ('Vehicle, Recovered, Auto', 13279),
 ('Vehicle, Stolen, Auto', 12873),
 ('Mental Health Detention', 11969),
 ('Theft, Other Property, >$950', 11621),
 ('Suspicious Occurrence', 9069)]
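pandas can produce the same tallies without `Counter`. The sketch below uses a tiny made-up frame rather than the real file; note that `value_counts()` skips missing values by default, whereas `Counter` tallies whatever it is handed, `nan` included.

```python
import pandas as pd

# Made-up rows for illustration; the real file has far more rows and columns.
toy = pd.DataFrame({
    "Incident Category": ["Larceny Theft", "Assault", "Larceny Theft", None, "Burglary"],
})

# value_counts() sorts by descending frequency and drops missing values by default
top = toy["Incident Category"].value_counts().head(10)
print(top)

# A plain dict of category -> count, handy for anything that wants frequencies
freqs = top.to_dict()
```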

## Word clouds

A more interesting way to visualize differences in term frequency is using a so-called word cloud.  For example, here is a word cloud showing the categories from 2003 to the present.

<img src="figures/SFPD-wordcloud.png" width="400">

Python has a nice library you can use:

```bash
$ pip install wordcloud
```

**Exercise**: In a file called `catcloud.py`, once again get the categories and then create a word cloud object and display it:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import pandas as pd
import sys

df_sfpd = pd.read_csv(sys.argv[1])

# TODO: delete rows whose 'Incident Category' is nan
categories = ...  # TODO: create a Counter object on column 'Incident Category'

wordcloud = WordCloud(width=1800,
                      height=1400,
                      max_words=10000,
                      random_state=1,
                      relative_scaling=0.25)

wordcloud.fit_words(categories)

plt.imshow(wordcloud)
plt.axis("off")
plt.show()
```

### Which neighborhood is the "worst"?

**Exercise**: Now, pull out the neighborhood and do a word cloud on that in `hoodcloud.py` (it's ok to cut/paste):

<img src="figures/SFPD-hood-wordcloud.png" width="400">

### Crimes per neighborhood


**Exercise**: Use pandas to filter the data for a particular precinct or neighborhood, such as Mission or South of Market.  Modify `catcloud.py` to use a pandas query to filter for those records.  Pass the hood as an argument (`sys.argv[2]`):

```bash
$ python catcloud.py /tmp/SFPD.csv Mission
```

Run the `catcloud.py` script to get an idea of the types of crimes in each of those two neighborhoods. Here are the crime category clouds for the Mission and SOMA districts:

<table>
    <tr>
        <td><b>Mission</b></td><td><b>South of Market</b></td>
    </tr>
    <tr>
        <td><img src="figures/SFPD-mission-wordcloud.png" width="300"></td><td><img src="figures/SFPD-soma-wordcloud.png" width="300"></td>
    </tr>
 </table>
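The filtering step in this exercise might look something like the following sketch, run here on a tiny made-up frame. The `Analysis Neighborhood` column name is an assumption (it does not appear in the truncated output earlier), so check `df_sfpd.columns` for the actual header in your download.

```python
import pandas as pd
from collections import Counter

# Made-up rows; the "Analysis Neighborhood" column name is an assumption,
# so check df_sfpd.columns for the real header in your download.
toy = pd.DataFrame({
    "Incident Category": ["Larceny Theft", "Assault", "Burglary", "Larceny Theft"],
    "Analysis Neighborhood": ["Mission", "Mission", "South of Market", "Mission"],
})

hood = "Mission"  # in the script this would come from sys.argv[2]

# Boolean mask: keep only the rows whose neighborhood matches.
# Equivalent with query(): toy.query("`Analysis Neighborhood` == @hood")
in_hood = toy[toy["Analysis Neighborhood"] == hood]

categories = Counter(in_hood["Incident Category"].dropna())
print(categories.most_common())
```

The resulting `categories` counter has the same shape as the one `catcloud.py` already hands to `fit_words`, so the rest of the script would not need to change.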

### Which neighborhood has the most car thefts?

**Exercise**: Modify `hoodcloud.py` to filter for `Motor Vehicle Theft`. Pass the category as an argument (`sys.argv[2]`):

```bash
$ python hoodcloud.py /tmp/SFPD.csv 'Motor Vehicle Theft'
```

<img src="figures/SFPD-car-theft-hood-wordcloud.png" width="300">

Hmm... OK, so parking in Presidio Heights is OK, but SOMA and Bayview/Hunters Point are bad news.

If you get stuck in any of these exercises, you can look at the [code associated with these notes](https://github.com/parrt/msds692/tree/master/notes/code/sfpd).