# Crime Rates in Different Socioeconomic Neighborhoods 

## Introduction

This analysis examines crime rates in neighborhoods of varying socioeconomic class. California is one of the most prosperous states in the US; however, this prosperity does not reach all social classes. With rapid growth around metropolitan areas such as San Francisco and Los Angeles, the gap between social classes widens every day. I want to explore whether this growing divide contributes to crime rates: is crime more prevalent in areas of less wealth or more wealth?

Politicians may be interested in this project because it could shed light on which policies might be created or changed to narrow the wealth gap in California and reduce crime rates.

## Data

In this analysis, I use two data sets: one from the Public Policy Institute of California and one from Kaggle. The Public Policy Institute of California data set gives the number of recorded crimes per 100,000 residents by county in 2017, with rates for both violent crimes and property crimes. The Kaggle data set contains information about California households from the 1990 census, including location and house value. The two data sets are combined on a folium map to see whether there is a relationship between neighborhood wealth and crime rates.

## Methodology

In [60]:
# Importing libraries
#!pip install folium
import folium  # used for every map below
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN

In [61]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,County,Violent,Murder,Rape,Robbery,Aggrevated assault,Property,Burglary,Vehicle theft,Larceny theft
0,Alameda,601.1,5.5,49.1,291.8,254.8,3868.4,418.9,763.6,2685.9
1,Alpine,1139.4,0.0,0.0,0.0,1139.4,2804.6,701.1,175.3,1928.1
2,Amador,310.4,0.0,24.3,18.9,267.2,1751.7,396.8,178.1,1176.8
3,Butte,406.7,3.1,80.4,63.1,260.1,3214.6,706.1,415.5,2093.0
4,Calaveras,466.3,4.5,67.3,38.1,356.4,1770.9,511.1,221.9,1037.9


In [62]:
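# Obtaining latitude and longitude coordinates for the counties in the data set
from geopy.geocoders import Nominatim
# Recent geopy versions require a user_agent; the name used here is a placeholder
geolocator = Nominatim(user_agent="ca-crime-analysis")
# Caution: bare county names can be ambiguous to the geocoder
# (in the head() output further down, Butte resolves to Montana)
df_crime['city_coord'] = df_crime['County'].apply(geolocator.geocode)
df_crime['Latitude'] = df_crime['city_coord'].apply(lambda x: x.latitude)
df_crime['Longitude'] = df_crime['city_coord'].apply(lambda x: x.longitude)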

In [63]:
# Dropping rows containing NaN values and resetting index
df_crime.drop([58, 59], inplace = True)
df_crime = df_crime.reset_index(drop=True)  # reset_index returns a new frame, so assign it back
df_crime.tail()

Unnamed: 0,County,Violent,Murder,Rape,Robbery,Aggrevated assault,Property,Burglary,Vehicle theft,Larceny theft,city_coord,Latitude,Longitude
53,Tulare,348.0,7.0,34.3,80.8,225.9,2479.5,499.0,467.7,1512.9,"(Tulare County, California, USA, (36.2516475, ...",36.251647,-118.852583
54,Tuolumne,381.2,3.7,88.8,37.0,251.7,2080.1,701.4,225.8,1152.9,"(Tuolumne County, California, USA, (38.056944,...",38.056944,-119.991935
55,Ventura,261.5,3.0,31.5,85.3,141.7,1867.2,305.1,187.9,1374.1,"(Ventura, Ventura County, California, USA, (34...",34.364744,-119.310582
56,Yolo,258.8,4.1,29.6,75.6,149.5,2577.1,432.0,282.0,1863.1,"(Yolo County, California, USA, (38.7184542, -1...",38.718454,-121.9059
57,Yuba,423.8,3.9,30.0,76.9,312.9,2701.8,717.2,638.9,1345.7,"(Yuba County, California, USA, (39.2839755, -1...",39.283975,-121.355682


In [64]:
# Creating a column for the total crimes per county
df_crime['Violent'] = df_crime['Violent'].str.replace(',','').astype(float)
df_crime['Property'] = df_crime['Property'].str.replace(',','').astype(float)
df_crime['Aggrevated assault'] = df_crime['Aggrevated assault'].str.replace(',','').astype(float)
df_crime['Larceny theft'] = df_crime['Larceny theft'].str.replace(',','').astype(float)
df_crime['Total Crime'] = df_crime.iloc[:, 1:10].sum(1)
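
As a quick arithmetic check on what `iloc[:, 1:10]` adds up, the sketch below uses Alameda's row from the table above. It assumes columns 1 through 9 are the nine crime columns, which is consistent with the `Total Crime` value of 8939.1 shown in the next output. Since `Violent` and `Property` are aggregates of the columns that follow them, the total is roughly twice `Violent + Property`; the ranking of counties is unaffected, but the absolute values are not rates in their own right.

```python
# Alameda's row, copied from the table above (rates per 100,000 residents).
alameda = {
    "Violent": 601.1, "Murder": 5.5, "Rape": 49.1, "Robbery": 291.8,
    "Aggrevated assault": 254.8, "Property": 3868.4, "Burglary": 418.9,
    "Vehicle theft": 763.6, "Larceny theft": 2685.9,
}

# What summing all nine columns produces (aggregates plus subcategories).
total_all_columns = sum(alameda.values())

# What summing only the two aggregate columns produces.
total_aggregates = alameda["Violent"] + alameda["Property"]

print(round(total_all_columns, 1), round(total_aggregates, 1))
```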

In [65]:
df_crime.head()

Unnamed: 0,County,Violent,Murder,Rape,Robbery,Aggrevated assault,Property,Burglary,Vehicle theft,Larceny theft,city_coord,Latitude,Longitude,Total Crime
0,Alameda,601.1,5.5,49.1,291.8,254.8,3868.4,418.9,763.6,2685.9,"(Alameda County, California, USA, (37.6090291,...",37.609029,-121.899142,8939.1
1,Alpine,1139.4,0.0,0.0,0.0,1139.4,2804.6,701.1,175.3,1928.1,"(Alpine County, California, USA, (38.5893934, ...",38.589393,-119.834501,7887.9
2,Amador,310.4,0.0,24.3,18.9,267.2,1751.7,396.8,178.1,1176.8,"(Amador County, California, USA, (38.449089, -...",38.449089,-120.591102,4124.2
3,Butte,406.7,3.1,80.4,63.1,260.1,3214.6,706.1,415.5,2093.0,"(Butte, Silver Bow County, Montana, USA, (46.0...",46.013151,-112.536509,7242.6
4,Calaveras,466.3,4.5,67.3,38.1,356.4,1770.9,511.1,221.9,1037.9,"(Calaveras County, California, USA, (38.255818...",38.255818,-120.498149,4474.4
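
The output above shows Butte geocoded to Montana rather than California. A minimal sketch of a sanity check is given below; the bounding box is an approximation chosen for illustration (it also covers parts of neighboring states), and `in_california_bbox` is a hypothetical helper, not part of the original notebook.

```python
# Approximate bounding box for California (an assumption for illustration,
# not a precise state boundary).
CA_LAT_RANGE = (32.5, 42.0)
CA_LON_RANGE = (-124.5, -114.1)


def in_california_bbox(lat, lon):
    """Return True if the point falls inside the rough California box."""
    return (CA_LAT_RANGE[0] <= lat <= CA_LAT_RANGE[1]
            and CA_LON_RANGE[0] <= lon <= CA_LON_RANGE[1])


# Coordinates copied from the df_crime.head() output above.
geocoded = {
    "Alameda": (37.609029, -121.899142),
    "Butte": (46.013151, -112.536509),
    "Calaveras": (38.255818, -120.498149),
}

# Counties whose geocoded point lies outside the box deserve a second look.
suspect = [name for name, (lat, lon) in geocoded.items()
           if not in_california_bbox(lat, lon)]
print(suspect)
```

One possible remedy for flagged rows is to geocode a more specific query string, such as the county name followed by "County, California".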


I plotted the observations of the crime data set to visualize where crime was happening in California.

In [66]:
ex_map = folium.Map(location = [df_crime.Latitude.mean(), df_crime.Longitude.mean()], zoom_start = 5)
for i in range(len(df_crime)):
    folium.Marker(location = [df_crime['Latitude'][i], df_crime['Longitude'][i]], popup = df_crime['County'][i]).add_to(ex_map)
ex_map

In [67]:
# Reading in CSV file from the Kaggle containing housing information
body = client_ae8474b1596343f2bde03ead8191bb66.get_object(Bucket='capstone-donotdelete-pr-xllqlhdbkeqrzk',Key='housing.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_house = pd.read_csv(body)
print("df_house shape: ", df_house.shape)
df_house.head()

df_house shape:  (20640, 10)


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [68]:
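# Keep only the needed columns; .copy() avoids a SettingWithCopyWarning
# when the cluster labels are added to df_house later
df_house = df_house[['longitude', 'latitude', 'median_house_value']].copy()
df_house_coords = df_house[['longitude','latitude']]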

I plotted a random sample of 2,000 of the 20,640 observations from the house data set to see how the observations are spread across California.

In [69]:
import random

In [70]:
ex_map2 = folium.Map(location = [34.0522, -118.2437], zoom_start = 5)
for i in random.sample(range(len(df_house)), 2000):
    folium.Marker(location = [df_house['latitude'][i], df_house['longitude'][i]], popup = str(df_house['median_house_value'][i])).add_to(ex_map2)  # popup text as a string
ex_map2

I used DBSCAN to cluster the 20,640 coordinates in the housing data set because several hundred locations overlapped in the initial exploratory folium map. Clustering aggregates nearby points into groups that stand in for local areas across California. Unlike the k-means algorithm, which minimizes variance rather than geodetic distance, DBSCAN is well suited to latitude-longitude data because it clusters points using two parameters: a neighborhood distance and a minimum cluster size. Epsilon is the maximum distance between two points for them to count as neighbors. With the haversine metric and the ball tree algorithm, great-circle distances between points can be calculated. The haversine metric works in radians; therefore, epsilon and the coordinates are converted to radians.
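
To make the radian conversion concrete, here is a minimal sketch of the haversine great-circle distance in plain Python. The function name `haversine_km` is my own; the Earth radius is the same constant used in the next cell. It also shows that dividing a distance in kilometers by the radius gives the angle in radians that DBSCAN's `eps` expects under this metric.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, same constant as kms_per_radian


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A 1.5 km neighborhood radius expressed as an angle in radians.
epsilon_rad = 1.5 / EARTH_RADIUS_KM

# One degree of latitude is a little over 111 km.
one_degree_km = haversine_km(37.0, -122.0, 38.0, -122.0)
print(epsilon_rad, one_degree_km)
```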

In [71]:
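kms_per_radian = 6371.0088  # mean Earth radius in km
epsilon = 1.5 / kms_per_radian  # 1.5 km neighborhood radius, in radians
# Caution: scikit-learn's haversine metric reads each row as (latitude, longitude),
# whereas df_house_coords is ordered (longitude, latitude)
db = DBSCAN(eps=epsilon, min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(df_house_coords))
cluster_labels = db.labels_
# The label set includes -1 (noise) whenever DBSCAN leaves points unclustered
num_clusters = len(set(cluster_labels))
clusters = pd.Series([df_house_coords[cluster_labels == n] for n in range(num_clusters)])
print('Number of clusters: ', num_clusters)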

Number of clusters:  179
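
scikit-learn's DBSCAN assigns the label `-1` to noise points, so `len(set(cluster_labels))` counts noise as one of the labels whenever any point is left unclustered. The grouped output later in this notebook has a `-1` row, which suggests the 179 above is 178 clusters plus the noise label. The sketch below illustrates the convention on a made-up label list; `count_clusters` is a hypothetical helper.

```python
def count_clusters(labels):
    """Count distinct cluster labels, ignoring DBSCAN's noise label (-1)."""
    distinct = set(labels)
    distinct.discard(-1)
    return len(distinct)


# Toy labels for illustration only: two noise points and clusters 0, 1, 2.
toy_labels = [-1, 0, 0, 1, 1, 1, -1, 2]

labels_including_noise = len(set(toy_labels))
clusters_only = count_clusters(toy_labels)
print(labels_including_noise, clusters_only)
```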


In [72]:
df_house['labels'] = db.labels_

After identifying the clusters, I calculated the mean coordinates and the mean of the median house values for each cluster. Each cluster mean approximates the house values of a local area in California; the label -1 groups the points that DBSCAN treated as noise.

In [73]:
# Select multiple columns with a list; tuple-style selection is not supported in recent pandas
df_coord_avg = df_house.groupby(['labels'])[['latitude', 'longitude']].mean()
df_house_avg = df_house.groupby(['labels'])['median_house_value'].mean()
df_db_house = df_coord_avg.join(df_house_avg)
df_db_house.head()

labels,latitude,longitude,median_house_value
-1,36.875625,-120.158827,173115.011474
0,37.777525,-122.197081,197404.904805
1,37.505818,-122.17183,321892.819655
2,37.6996,-121.9088,287570.0
3,37.6804,-121.7724,209772.0


In [74]:
from folium.plugins import HeatMap

I plotted the clustered housing data on a heat map to show the wealthier areas of California. The darkest areas represent locations where house values are highest.

In [75]:
map2 = folium.Map(location = [df_crime.Latitude.mean(), df_crime.Longitude.mean()], zoom_start = 6)
max_amount = float(df_db_house['median_house_value'].max())
hm2 = HeatMap( list(zip(df_db_house.latitude.values, df_db_house.longitude.values, df_db_house['median_house_value'].values)),
                   min_opacity=0.2,
                   max_val=max_amount,
                   radius=17, blur=15, 
                   max_zoom=1, 
                 )
map2.add_child(hm2)
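
`HeatMap` takes rows of `[latitude, longitude, weight]`. One way to avoid relying on `max_val` is to scale the weights to the range 0 to 1 before building the layer. The sketch below is a minimal illustration using the first three clusters from the `df_db_house.head()` output; `normalized_heat_rows` is a hypothetical helper and not part of folium.

```python
def normalized_heat_rows(latitudes, longitudes, values):
    """Build [lat, lon, weight] rows with weights scaled so the peak is 1.0."""
    peak = max(values)
    return [[lat, lon, value / peak]
            for lat, lon, value in zip(latitudes, longitudes, values)]


# Clusters 0, 1 and 2 from the df_db_house.head() output above.
lats = [37.777525, 37.505818, 37.6996]
lons = [-122.197081, -122.17183, -121.9088]
vals = [197404.904805, 321892.819655, 287570.0]

rows = normalized_heat_rows(lats, lons, vals)
print(rows)

# The rows could then be passed to the plugin, for example:
# HeatMap(rows, min_opacity=0.2, radius=17, blur=15).add_to(map2)
```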

I plotted the crime data set on the same map and scaled each circle marker's radius to the county's 'Total Crime' value. Larger circles represent counties with higher crime rates.

In [76]:
locations2 = df_crime[["Latitude","Longitude"]].values.tolist()
values2 = df_crime["Total Crime"].values.tolist()

for point in range(len(locations2)):
    text = "Total Crime: " + str(values2[point])
    # radius is in pixels; 'color' sets the outline color of the circle
    folium.CircleMarker(location=locations2[point], radius=values2[point]/500,
                    popup = text, color='#3186cc',
                    fill_color='#3186cc', fill=True).add_to(map2)

In [77]:
map2

## Results

After overlaying the housing and crime data sets on the folium map, areas with higher house values (shown by the heat map) also appear to have higher crime rates (shown by the circle markers).
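
The visual comparison could be complemented by pairing each county with its nearest housing cluster, which would give one house value per crime rate. The sketch below is only an illustration of that pairing step: it uses Alameda's coordinates and the four non-noise clusters shown in the `df_db_house.head()` output, and `haversine_km` and `nearest_cluster` are hypothetical helpers rather than part of the original analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# (label, latitude, longitude, mean median_house_value) from df_db_house.head(),
# leaving out the noise row labelled -1.
clusters = [
    (0, 37.777525, -122.197081, 197404.904805),
    (1, 37.505818, -122.17183, 321892.819655),
    (2, 37.6996, -121.9088, 287570.0),
    (3, 37.6804, -121.7724, 209772.0),
]

# Alameda's geocoded coordinates from df_crime.head().
alameda = (37.609029, -121.899142)


def nearest_cluster(lat, lon, candidates):
    """Return the candidate cluster whose centroid is closest to the point."""
    return min(candidates, key=lambda c: haversine_km(lat, lon, c[1], c[2]))


label, _, _, house_value = nearest_cluster(alameda[0], alameda[1], clusters)
print(label, house_value)
```

With a pair of values for every county, a correlation coefficient (for instance `statistics.correlation` in Python 3.10 or later) could then put a number on the relationship that the map shows only visually.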

## Discussion

On the folium map, the houses with the highest values tend to be concentrated in the major cities of California. The circle markers with the largest radii, which indicate higher crime rates, also appear to be concentrated in major cities, specifically around San Francisco. As previously discussed, the gap between social classes is widening within the US. The high crime rates within wealthier major cities may be related to this gap: individuals in lower-income brackets may be more likely to commit crimes to make ends meet.

## Conclusion

This analysis looks at crime rates in neighborhoods of varying socioeconomic class across California. California is one of the most prosperous states in the US; however, this prosperity does not reach all social classes. With the gap between social classes widening every day, politicians need to implement changes that distribute wealth more equally, which may in turn decrease crime.