# Crime in San Francisco: 2003-2017

## Introduction 
While staying at my parent's house in December of 2019, I was woken up by a car alarm at 7am. I tried to fall back asleep, but the alarm was too loud. I walked to the window and witnessed someone breaking into and stealing from the car parked directly in front of our house. I called 911 and gave them details of the crime as it was being committed, and also mentioned that I had video of the entire incident. In addition to the 911 call, I logged the incident on an SF Police website built for reporting car break-ins. I expected to get a call from the police asking for the video or more information, but heard nothing back. 

Anecdotal evidence from family and friends still living in the city is that car break-ins have increased substantially in the past several years, and some pointed to the passage of Proposition 47 in 2014 as one influencing factor. 

Proposition 47, which classifies many non-violent crimes as misdemeanors, aims to reduce prison populations and prevent highly punitive punishments for relatively minor crimes. I wanted to know whether this proposition led to a higher incidence of thefts from vehicles. 

The San Francisco Police Department keeps <a href="https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-Historical-2003/tmnf-yvry">records of all crime reports</a>, including information such as crime location, description, date & time, and whether or not the crime was resolved. The dataset spans 15 years, giving insight into trends in the crime rate as the demographics of the city change and new laws are enacted. 

I initially set out to find whether or not the level of theft from car break-ins increased markedly relative to other crime after the passing of this proposition. I also anticipated that the rate of resolving said crimes would be lower. While that question is answered in a fairly qualitative way, I also spent time doing general exploratory data analysis to look at different aspects of crime in San Francisco. 


## Dataset overview

<b>Overview of the dataset</b>
- Over 2 million entries spanning 2003-2017
- 33 columns, including Category, Description, Address, and X & Y GPS coordinates
- Specific descriptions for theft from vehicles
    
    
    
<b>Analysis</b>

Due to the size of the dataset, I initially started out using Spark via a Docker container. Using Spark allowed quick exploration of the data and let me figure out which columns were extraneous. I also used SQL queries with the LIKE operator to probe which descriptions are considered "violent" vs "non-violent" crime. I never knew there were so many kinds of assault.
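
As a rough illustration of the LIKE-based probing described above, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for Spark SQL. The table name `incidents`, the column names, the sample rows, and the patterns are all assumptions made for the example; they are not the actual list used in the analysis.

```python
import sqlite3

# Hypothetical miniature of the incident table; real column names may differ.
rows = [
    ("ASSAULT", "AGGRAVATED ASSAULT WITH A DEADLY WEAPON"),
    ("ASSAULT", "BATTERY"),
    ("LARCENY/THEFT", "GRAND THEFT FROM LOCKED AUTO"),
    ("ROBBERY", "ROBBERY ON THE STREET, STRONGARM"),
    ("VANDALISM", "MALICIOUS MISCHIEF, VANDALISM OF VEHICLES"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (category TEXT, descript TEXT)")
conn.executemany("INSERT INTO incidents VALUES (?, ?)", rows)

# Illustrative patterns only: label a description "violent" if it matches
# any of a few keywords, otherwise "non-violent".
query = """
    SELECT descript,
           CASE WHEN descript LIKE '%ASSAULT%'
                  OR descript LIKE '%BATTERY%'
                  OR descript LIKE '%ROBBERY%'
                THEN 'violent' ELSE 'non-violent' END AS kind
    FROM incidents
    ORDER BY descript
"""
labels = dict(conn.execute(query).fetchall())
conn.close()

for descript, kind in labels.items():
    print(f"{kind:12s} {descript}")
```

In Spark, similar query text could be passed to `spark.sql(...)` after registering the DataFrame as a temporary view.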

Ultimately I eliminated 17 columns and partitioned the data out by year. The data could then be read into pandas dataframes, which I find more comfortable to use for detailed analysis and plotting. 

<b>General Process Flow:</b>

Read data in Spark -> explore via Spark and SQL -> filter data -> transform to pandas -> explore and plot
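
The filter-and-partition step of this flow might look roughly like the sketch below, written in pandas only so it runs without a Spark installation. The column names (`Descript`, `Date`, `Location`) and the date format are assumptions about the export, and the three rows are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the incident data; column names are assumptions.
df = pd.DataFrame({
    "Category": ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT"],
    "Descript": ["GRAND THEFT FROM LOCKED AUTO", "BATTERY",
                 "PETTY THEFT FROM LOCKED AUTO"],
    "Date": ["01/15/2014", "06/02/2014", "03/20/2015"],
    # Example of a redundant column (duplicates the X/Y coordinates).
    "Location": ["(37.77, -122.42)", "(37.76, -122.41)", "(37.78, -122.40)"],
})

# Drop columns that are not needed for the analysis.
df = df.drop(columns=["Location"])

# Parse the date and derive the year used for partitioning.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df["Year"] = df["Date"].dt.year

# One DataFrame per year, keyed by year.
by_year = {int(year): group.reset_index(drop=True)
           for year, group in df.groupby("Year")}

for year, frame in by_year.items():
    print(year, len(frame))
```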
   
   
   
   
   
## One GIF To Rule Them All
   
The animation below plots the number of reports over time. Note that the overlay shows the neighborhoods, not the police districts. This map was created by binning the crime GPS data and then plotting the bins over a <a href="https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods/p5b7-5n3h">shapefile of San Francisco Neighborhoods.</a>     
![SegmentLocal](data/images/crime-sf.gif "segment") 
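
The binning step can be sketched as counting points per square grid cell. This is a minimal, standard-library illustration, not the code used to produce the GIF; the grid origin, the cell size, and the sample coordinates are all assumptions.

```python
from collections import Counter

def bin_points(points, x_min, y_min, cell_size):
    """Count points per square grid cell; returns {(col, row): count}."""
    counts = Counter()
    for x, y in points:
        col = int((x - x_min) // cell_size)
        row = int((y - y_min) // cell_size)
        counts[(col, row)] += 1
    return counts

# Hypothetical (X, Y) = (longitude, latitude) pairs near San Francisco.
points = [(-122.419, 37.775), (-122.418, 37.776), (-122.405, 37.795)]

# Assumed grid: origin at the south-west corner, 0.01-degree cells.
grid = bin_points(points, x_min=-122.52, y_min=37.70, cell_size=0.01)

for cell, count in sorted(grid.items()):
    print(cell, count)
```

Each cell count can then be drawn as a colored square on top of the neighborhood shapefile, one frame per year.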
   
   
   

## General Crime Statistics
   
The top 10 crimes as classified by category using tree mapping:
    ![alttext](data/images/crimetreecat.png)
    
The top 10 crimes as classified by description using tree mapping:
    ![alttext](data/images/crimetreedesc.png)
    
    
The total number of crimes and the fraction of those that are violent:
    ![alttext](data/images/crime_in_sf.png)
    <p align="left">
    <img src="data/images/resolution.png" alt="Drawing" style="width: 400px;" align="center"/>
    </p>
    
    
    
<b>Initial Conclusions:</b>

- The total number of crimes increases slightly, but mostly due to an uptick in non-violent crime. <b>Could a portion of this be due to car break-ins? </b>

- The number of reports increased from 2003 to 2017, although there are some fluctuations. Census data show the population has increased ~ 8.7% from 2010 to 2019, while the crime rate increased by ~ 15.6% from 2010 - 2017. 
    
- Resolution of non-violent crime is trending downwards after 2014
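
For reference, the per-year quantities behind these conclusions (total reports, violent fraction, and resolution rate of non-violent reports) can be computed along the lines of the sketch below. The records are invented, and how a report is flagged as "violent" or "resolved" is an assumption left outside the example.

```python
from collections import defaultdict

# Hypothetical records: (year, is_violent, is_resolved).
records = [
    (2013, True, True),
    (2013, False, True),
    (2013, False, False),
    (2015, False, False),
    (2015, False, False),
    (2015, False, True),
    (2015, True, True),
]

def yearly_summary(records):
    """Return {year: (total, violent_fraction, nonviolent_resolution_rate)}."""
    totals = defaultdict(
        lambda: {"all": 0, "violent": 0, "nv": 0, "nv_resolved": 0})
    for year, is_violent, is_resolved in records:
        t = totals[year]
        t["all"] += 1
        if is_violent:
            t["violent"] += 1
        else:
            t["nv"] += 1
            t["nv_resolved"] += int(is_resolved)
    return {
        year: (
            t["all"],
            t["violent"] / t["all"],
            # Undefined if a year has no non-violent reports.
            t["nv_resolved"] / t["nv"] if t["nv"] else float("nan"),
        )
        for year, t in sorted(totals.items())
    }

summary = yearly_summary(records)
for year, (total, violent_frac, nv_res) in summary.items():
    print(year, total, round(violent_frac, 3), round(nv_res, 3))
```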
    
    

## Looking at car break-ins

My hypothesis is that theft from vehicles increased markedly after 2014. I used a stacked plot to show the number of thefts from vehicles alongside the number of reports not related to thefts from vehicles. I also plotted the fraction of crimes attributed to theft from vehicles from 2003-2017. 

![alttext](data/images/car_theft.png)


While the incidence of car break-ins does increase after 2014, the trend clearly started around 2012. The peak in 2006, followed by a decline and bottoming out in 2009-2011, suggests that the recession had an effect on theft from vehicles. It's likely that the gentrification and economic boom in San Francisco following the recession has a higher correlation with thefts than the passing of Proposition 47. 

<b>My hypothesis is incorrect.</b> While thefts from vehicles are increasing after 2014, there is no marked difference between the few years prior to 2014 and the years after 2014. This was roughly calculated by averaging the year-over-year difference from 2011-2014 and from 2014-2017.

<img src="data/images/year-over-year.png" alt="Drawing" style="width: 500px;" align="center"/>


- Average year-over-year difference 2011 - 2014: 0.017 or 1.7% increase
- Average year-over-year difference 2014 - 2017: 0.019 or 1.9% increase
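
The averaging described above can be sketched as follows, assuming the quantity being differenced is the yearly fraction of reports that are thefts from vehicles. The fractions in the example are placeholders chosen for illustration, not values from the dataset.

```python
def average_yoy_change(fraction_by_year, start, end):
    """Mean of consecutive-year differences between start and end (inclusive)."""
    diffs = [fraction_by_year[y + 1] - fraction_by_year[y]
             for y in range(start, end)]
    return sum(diffs) / len(diffs)

# Placeholder fractions, NOT the values from the dataset.
fraction_by_year = {
    2011: 0.10, 2012: 0.12, 2013: 0.13, 2014: 0.16,
    2015: 0.18, 2016: 0.19, 2017: 0.22,
}

before = average_yoy_change(fraction_by_year, 2011, 2014)
after = average_yoy_change(fraction_by_year, 2014, 2017)
print(f"2011-2014: {before:.3f}  2014-2017: {after:.3f}")
```

Because consecutive differences telescope, each average depends only on the first and last year of its window, so this is a coarse comparison.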

Major contributing factors, such as the economic climate, make drawing a conclusion about the ramifications of Prop 47 difficult. 

I plotted car break-ins alongside the median income for San Francisco, and you can see the trend with economic data. <b>The correlation coefficient is: , indicating a positive correlation between income and car break-ins. </b>
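
A correlation coefficient of this kind is usually the Pearson coefficient, which can be computed as in the sketch below. That Pearson is the intended measure is an assumption, and the income and break-in series are placeholders rather than the real data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder yearly series, NOT the real income or break-in figures.
median_income = [70_000, 75_000, 80_000, 90_000]
break_ins = [10_000, 12_000, 13_000, 17_000]

r = pearson_r(median_income, break_ins)
print(f"r = {r:.3f}")
```

A value near 1 indicates that the two series rise together. It does not by itself show that one causes the other.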



## Digging Deeper: Nothing happens at 4am

Do crime reports follow a uniform distribution? My intuition says no - some types of crime require criminal/victim interaction, and these are more likely to occur at certain times of day. Plotting both the hour of crime reports and the day of the week shows that crime follows a predictable pattern. 

I plotted all the years on top of each other. While that makes it hard to see the year-over-year change, it makes it very easy to see that there is a predictable pattern every year. The smeared-out regions on the tops of the bars show the slight variation from year to year. Given more time, I'd like to dig into the variation by day of week a little more and see if there is a difference based on neighborhood. 


![alttext](data/images/crime_by_day_hour.png)
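
Counts like the ones plotted above can be produced by tallying reports per hour and per weekday. The sketch below uses only the standard library; the timestamp format and the sample values are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical report timestamps; the real data may store date and time
# in separate columns and in a different format.
timestamps = [
    "2016-03-04 17:30",
    "2016-03-04 18:05",
    "2016-03-05 04:10",
    "2016-03-07 17:45",
]

parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M") for ts in timestamps]

# Tally reports by hour of day (0-23) and by weekday name.
by_hour = Counter(dt.hour for dt in parsed)
by_weekday = Counter(WEEKDAYS[dt.weekday()] for dt in parsed)

for hour in sorted(by_hour):
    print(f"{hour:02d}:00  {by_hour[hour]}")
for day in WEEKDAYS:
    print(day, by_weekday[day])
```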
    
    
    
    
## Crime By District    
    
<b> THIS SECTION IS WHAT NOT TO DO!</b>    

This is the more difficult part of the analysis and one which I wish I hadn't left until Thursday night. San Francisco has 10 police districts and 36 official neighborhoods. I wanted to explore how the crime rate changes in each neighborhood, but instead had to go by district based on the data available in the dataset.

My hypothesis: Income affects both the number of crime reports and the fraction of resolved reports.


<b>Neighborhoods vs Police Districts</b>
<p align="left">
    <img src="data/images/sfneighborhoods.png" alt="Drawing" style="width: 300px"/>
    <img src="data/images/pddistrict.png" alt="Drawing" style="width: 350px"/>
</p>
    
    
Here is the data looking at income relative to crime counts.

Red dots: [Taraval, Southern]

Blue dots: [Park, Mission]

Purple dots: [Northern, Richmond]

![alttext](data/images/income_vs_crime.png)

    
Clearly there is no trend in the crime reports vs income. I was unable to quantitatively answer this question due mostly to a lack of data collection. 

- What really matters here is crime reports per population. I was unable to find the population of each district given the time constraint, and it's an unreasonable assumption that each district has the same population

- I also need the neighborhood income/population instead of the district income/population

Many of the police districts span both high crime and low crime neighborhoods as well as high income and low income neighborhoods. The consequence of this is that it's difficult to draw correlations between crime reports and demographic variables since most districts have huge differences in demographics.
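
Had district populations been available, the normalization described above could be sketched as reports per 1,000 residents. The district names, counts, and populations below are placeholders, not real figures.

```python
def reports_per_1000(report_counts, populations):
    """Reports per 1,000 residents for districts with a known population."""
    rates = {}
    for district, count in report_counts.items():
        population = populations.get(district)
        if not population:
            continue  # skip districts with missing or zero population
        rates[district] = 1000 * count / population
    return rates

# Placeholder figures, NOT real district data.
report_counts = {"District A": 5000, "District B": 5000, "District C": 1200}
populations = {"District A": 50_000, "District B": 100_000}

rates = reports_per_1000(report_counts, populations)
for district, rate in rates.items():
    print(f"{district}: {rate:.1f} reports per 1,000 residents")
```

In this made-up example, two districts with identical report counts end up with very different rates, which is why raw counts against income can be misleading.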



## Conclusion






## Sources

- Police Crime Statistics: https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-Historical-2003/tmnf-yvry
- San Francisco Geographical Information: https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods/p5b7-5n3h
- Income Information: https://www.deptofnumbers.com/income/california/san-francisco/
- https://sfgov.org/sfplanningarchive/sites/default/files/FileCenter/Documents/8779-SFProfilesByNeighborhood_2010May.pdf

   