# Crime data visualization in San Francisco

San Francisco has one of the most open "open data" policies of any large city. In this lab, we are going to download about 85M of data (238,456 records) describing all police incidents since 2018 (I grabbed the data on August 5, 2019; the dataset keeps growing, so your numbers will differ).

## Getting started

Download [Police Department Incident Reports 2018 to present](https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783) or, if you want, all [San Francisco police department incidents since 1 January 2003](https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry). Click the "Export" button and then save in "CSV for Excel" format. (It's fairly large at 140MB, so it could take a while if you have a slow connection.)

We can easily figure out how many records there are:

```bash
$ wc -l Police_Department_Incident_Reports__2018_to_Present.csv 
  388670 Police_Department_Incident_Reports__2018_to_Present.csv
```

So 388,669 not including the header row.  You can name that data file whatever you want but I will call it `SFPD.csv` for these exercises and save it in `/tmp`.
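
One caveat worth keeping in mind: `wc -l` counts physical lines, so it matches the record count only if no field contains an embedded newline. I have not checked whether this particular file has any, so treat the following as a minimal sketch on made-up CSV text that shows how the two counts can diverge:

```python
import csv
import io

# Made-up CSV text (not SFPD data): a header plus two records,
# one of which has a newline inside a quoted field.
sample = 'id,description\n1,"Theft, From Locked Vehicle"\n2,"line one\nline two"\n'

# What `wc -l` would report: the number of newline characters.
physical_lines = sample.count('\n')

# What a CSV parser sees: rows, minus one for the header.
records = sum(1 for _ in csv.reader(io.StringIO(sample))) - 1

print(physical_lines, records)
```

If the two numbers disagree on your download, trust the parser's count (for example `len(df_sfpd)` after `pd.read_csv`) over `wc -l`.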

## Sniffing the data

Let's assume the file you downloaded is `/tmp/SFPD.csv`:

```python
import pandas as pd

df_sfpd = pd.read_csv('/tmp/SFPD.csv')
df_sfpd.head(10).T
```

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Incident Datetime | 2020/08/15 08:56:00 AM | 2020/08/15 09:40:00 AM | 2018/02/24 10:00:00 PM | 2020/08/16 03:13:00 AM | 2020/08/16 03:38:00 AM | 2020/08/16 01:40:00 PM | 2020/08/16 04:18:00 PM | 2020/08/12 10:00:00 PM | 2020/08/14 02:00:00 PM | 2020/08/16 11:13:00 AM |
| Incident Date | 2020/08/15 | 2020/08/15 | 2018/02/24 | 2020/08/16 | 2020/08/16 | 2020/08/16 | 2020/08/16 | 2020/08/12 | 2020/08/14 | 2020/08/16 |
| Incident Time | 08:56 | 09:40 | 22:00 | 03:13 | 03:38 | 13:40 | 16:18 | 22:00 | 14:00 | 11:13 |
| Incident Year | 2020 | 2020 | 2018 | 2020 | 2020 | 2020 | 2020 | 2020 | 2020 | 2020 |
| Incident Day of Week | Saturday | Saturday | Saturday | Sunday | Sunday | Sunday | Sunday | Wednesday | Friday | Sunday |
| Report Datetime | 2020/08/15 08:56:00 AM | 2020/08/15 06:21:00 PM | 2018/03/02 10:13:00 AM | 2020/08/16 03:14:00 AM | 2020/08/16 04:56:00 AM | 2020/08/16 01:56:00 PM | 2020/08/16 04:18:00 PM | 2020/08/15 08:30:00 AM | 2020/08/15 12:23:00 AM | 2020/08/16 11:13:00 AM |
| Row ID | 95300907041 | 95322706244 | 64174871000 | 95319604083 | 95326228100 | 95336264020 | 95335012010 | 95300674000 | 95321406244 | 95329661030 |
| Incident ID | 953009 | 953227 | 641748 | 953196 | 953262 | 953362 | 953350 | 953006 | 953214 | 953296 |
| Incident Number | 200474239 | 206121692 | 186051531 | 200491669 | 200491738 | 200492463 | 200492792 | 200489880 | 206121551 | 200492350 |
| CAD Number | | | | 2.0229e+08 | 2.0229e+08 | 2.02292e+08 | 2.02292e+08 | 2.02281e+08 | | 2.02291e+08 |


To get a better idea of what the data looks like, let's do a simple histogram of the categories and crime descriptions. First, here are the distinct category values (output truncated), followed by the ten most common categories and descriptions:

```python
df_sfpd['Incident Category'].unique()
```

```
array(['Recovered Vehicle', 'Larceny Theft', 'Lost Property', 'Assault',
       'Malicious Mischief', 'Non-Criminal', 'Weapons Offense',
       'Missing Person', 'Other', 'Burglary',
       'Offences Against The Family And Children',
       'Miscellaneous Investigation', 'Other Miscellaneous',
       'Disorderly Conduct', 'Suspicious Occ', 'Other Offenses',
       'Robbery', 'Motor Vehicle Theft', 'Family Offense', 'Arson',
       'Case Closure', 'Suicide', 'Fraud', 'Traffic Violation Arrest',
       'Stolen Property', 'Drug Offense', 'Vehicle Misplaced',
       'Fire Report', 'Warrant', 'Forgery And Counterfeiting',
       'Courtesy Report', 'Sex Offense', 'Traffic Collision', 'Vandalism',
       'Weapons Carrying Etc', 'Embezzlement', nan, 'Vehicle Impounded',
       'Rape', 'Human Trafficking (A), Commercial Sex Acts',
       'Drug Violation', 'Motor Vehicle Theft?', 'Homicide', 'Gambling',
       'Prostitution', 'Suspicious', 'Civil Sidewalks', 'Liquor Laws',
       'Weapons Offenc
```

```python
from collections import Counter
counter = Counter(df_sfpd['Incident Category'])
counter.most_common(10)
```

```
[('Larceny Theft', 120196),
 ('Other Miscellaneous', 28791),
 ('Malicious Mischief', 24307),
 ('Non-Criminal', 23823),
 ('Assault', 22955),
 ('Burglary', 19583),
 ('Motor Vehicle Theft', 15858),
 ('Warrant', 13417),
 ('Lost Property', 13081),
 ('Recovered Vehicle', 12544)]
```

```python
from collections import Counter
counter = Counter(df_sfpd['Incident Description'])
counter.most_common(10)
```

```
[('Theft, From Locked Vehicle, >$950', 50109),
 ('Lost Property', 13081),
 ('Theft, Other Property, $50-$200', 12001),
 ('Malicious Mischief, Vandalism to Property', 11989),
 ('Battery', 11358),
 ('Theft, Other Property, >$950', 9569),
 ('Mental Health Detention', 9502),
 ('Vehicle, Recovered, Auto', 9312),
 ('Vehicle, Stolen, Auto', 9022),
 ('Theft, From Unlocked Vehicle, >$950', 7020)]
```
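
As an aside, pandas can produce the same kind of frequency table without `Counter`. The sketch below uses `Series.value_counts()` on a tiny made-up frame (`df_demo` is a stand-in I invented for illustration, not real SFPD data); on the real data you would call it on `df_sfpd['Incident Category']` instead.

```python
import pandas as pd

# Tiny made-up sample standing in for df_sfpd (not real SFPD rows).
df_demo = pd.DataFrame({
    'Incident Category': ['Larceny Theft', 'Assault', 'Larceny Theft',
                          None, 'Burglary', 'Larceny Theft', 'Assault'],
})

# value_counts() skips missing values by default and sorts by descending count,
# so head(2) gives the two most common categories.
top = df_demo['Incident Category'].value_counts().head(2)
print(top)
```

One practical difference from `Counter`: missing values are dropped automatically, so `nan` never shows up as a category.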

## Word clouds

A more interesting way to visualize differences in term frequency is using a so-called word cloud.  For example, here is a word cloud showing the categories from 2003 to the present.

<img src="figures/SFPD-wordcloud.png" width="400">

Python has a nice library you can use:

```bash
$ pip install wordcloud
```

**Exercise**: In a file called `catcloud.py`, once again get the categories and then create a word cloud object and display it:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import pandas as pd
import sys

df_sfpd = pd.read_csv(sys.argv[1])

# TODO: delete rows whose 'Incident Category' is nan
categories = ...  # TODO: create Counter object on column 'Incident Category'

wordcloud = WordCloud(width=1800,
                      height=1400,
                      max_words=10000,
                      random_state=1,
                      relative_scaling=0.25)

wordcloud.fit_words(categories)

plt.imshow(wordcloud)
plt.axis("off")
plt.show()
```
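
If the two placeholder steps are unclear, here is one possible way to build the frequency table, shown on a made-up three-category frame rather than the real file. The helper name `category_counts` and the sample rows are mine, purely for illustration; your own solution may look different.

```python
from collections import Counter
import pandas as pd

def category_counts(df, column='Incident Category'):
    """Return a Counter of the non-missing values in the given column."""
    values = df[column].dropna()  # discard NaN so it never becomes a "word"
    return Counter(values)

# Made-up rows for illustration only (not real SFPD data).
df_demo = pd.DataFrame(
    {'Incident Category': ['Assault', float('nan'), 'Fraud', 'Assault']}
)
categories = category_counts(df_demo)
print(categories.most_common())
```

A `Counter` is a `dict` subclass mapping each word to its count, which is the shape `fit_words` expects.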

### Which neighborhood is the "worst"?

**Exercise**: Now, pull out the neighborhood and do a word cloud on that in `hoodcloud.py` (it's ok to cut/paste):

<img src="figures/SFPD-hood-wordcloud.png" width="400">

### Crimes per neighborhood


**Exercise**: Use pandas to filter the data for a particular precinct or neighborhood, such as Mission or South of Market.  Modify `catcloud.py` to use a pandas query to filter for those records.  Pass the hood as an argument (`sys.argv[2]`):

```bash
$ python catcloud.py /tmp/SFPD.csv Mission
```

Run the `catcloud.py` script to get an idea of the types of crimes in those two neighborhoods. Here are the crime category clouds for the Mission and SOMA districts:

<table>
    <tr>
        <td><b>Mission</b></td><td><b>SOMA</b></td>
    </tr>
    <tr>
        <td><img src="figures/SFPD-mission-wordcloud.png" width="300"></td><td><img src="figures/SFPD-soma-wordcloud.png" width="300"></td>
    </tr>
 </table>
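
The filtering step in this exercise can be sketched with a boolean mask. I am assuming the neighborhood lives in a column named `Analysis Neighborhood`; that name does not appear in the output shown earlier, so check `df_sfpd.columns` in your own download. The helper `filter_hood` and the sample rows are hypothetical.

```python
import pandas as pd

# Assumed column name; verify it against df_sfpd.columns before relying on it.
HOOD_COLUMN = 'Analysis Neighborhood'

def filter_hood(df, hood, column=HOOD_COLUMN):
    """Return only the rows whose neighborhood column equals hood."""
    return df[df[column] == hood]

# Made-up rows for illustration only (not real SFPD data).
df_demo = pd.DataFrame({
    'Analysis Neighborhood': ['Mission', 'South of Market', 'Mission'],
    'Incident Category': ['Assault', 'Larceny Theft', 'Burglary'],
})
mission = filter_hood(df_demo, 'Mission')
print(mission['Incident Category'].tolist())
```

In `catcloud.py` the `hood` value would come from `sys.argv[2]`, and the filtered frame then feeds the same counting code as before.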

### Which neighborhood has the most car break-ins?

**Exercise**: Modify `hoodcloud.py` to filter for `Motor Vehicle Theft`. Pass the category as an argument (`sys.argv[2]`):

```bash
$ python hoodcloud.py /tmp/SFPD.csv 'Motor Vehicle Theft'
```

<img src="figures/SFPD-car-theft-hood-wordcloud.png" width="300">

Hmm... OK, so parking in the Mission is OK, but SOMA and Bayview/Hunters Point are bad news.
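
For reference, the filter-then-count idea behind this last exercise might look like the sketch below. As before, the `Analysis Neighborhood` column name is an assumption, and `hoods_for_category` plus the sample rows are invented for illustration.

```python
import pandas as pd

def hoods_for_category(df, category, hood_column='Analysis Neighborhood'):
    """Count incidents of one category per neighborhood, most frequent first."""
    rows = df[df['Incident Category'] == category]
    return rows[hood_column].value_counts()

# Made-up rows for illustration only (not real SFPD data).
df_demo = pd.DataFrame({
    'Incident Category': ['Motor Vehicle Theft', 'Assault',
                          'Motor Vehicle Theft', 'Motor Vehicle Theft'],
    'Analysis Neighborhood': ['Bayview Hunters Point', 'Mission',
                              'South of Market', 'Bayview Hunters Point'],
})
per_hood = hoods_for_category(df_demo, 'Motor Vehicle Theft')
print(per_hood.to_dict())
```

Calling `.to_dict()` on the result gives a plain word-to-count mapping that can be passed to `fit_words`.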

If you get stuck in any of these exercises, you can look at the [code associated with these notes](https://github.com/parrt/msds692/tree/master/notes/code/sfpd).