# **Violent Crime Rates in California**
*María Colomer, Sara Díez and Gússem Yahia-Cheikh*

### Introduction & Objectives

#### 🤖 Project Context
This project focuses on analyzing violent crime rates across counties in the state of California, using public data from 2000 to 2013 provided by U.S. Open Data sources. The original dataset was refined to highlight only the most relevant information for understanding spatial and socioeconomic patterns in crime.

California, being one of the most populous and economically diverse states, offers a unique opportunity to explore how population size, geographic location, and economic factors may correlate with criminal activity.

### Data Loading & Initial Overview

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [14]:
df = pd.read_csv("VCC_updated.csv", encoding="ISO-8859-1")


In [15]:
df  # Preview the dataset to check that it loaded correctly

Unnamed: 0,reportyear,geoname,geotypevalue,county_name,county_fips,region_name,region_code,dof_population,AA,FR,M_and_NNM,R,VCT,JDNR,rate
0,2000,Adelanto city,296,San Bernardino,6071,Southern California,14,18130,0.005461,0.000221,0.000110,0.000772,0.006564,,6.563707
1,2000,Agoura Hills city,394,Los Angeles,6037,Southern California,14,20537,0.001169,0.000097,0.000000,0.000487,0.001753,,1.752934
2,2000,Alameda,6001,Alameda,6001,Bay Area,1,1443939,0.003780,0.000393,0.000076,0.002333,0.006582,,6.582206
3,2000,Alameda city,562,Alameda,6001,Bay Area,1,72259,0.002519,0.000125,0.000014,0.001522,0.004179,,4.179410
4,2000,Albany city,674,Alameda,6001,Bay Area,1,16444,0.002858,0.000182,0.000000,0.002311,0.005351,,5.351496
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7122,2013,Tehachapi city,78092,Kern,6029,San Joaquin Valley,10,13316,0.002131,0.000367,0.000000,0.000514,0.003013,,3.013155
7123,2013,Wildomar city,85446,Riverside,6065,Southern California,14,33416,0.000807,0.000060,0.000000,0.000329,0.001196,,1.195707
7124,2013,Angels city,2112,Calaveras,6009,Central/Southeast Sierra,3,3806,0.001877,0.000268,0.000000,0.000268,0.002414,,2.413516
7125,2013,McFarland city,44826,Kern,6029,San Joaquin Valley,10,13141,0.003581,0.000163,0.000081,0.001465,0.005290,,5.289714


The `JDNR` column (short for 'Jurisdiction Does Not Report') contains NaN values, so we drop it:

In [16]:
df.drop(["JDNR"], axis=1, inplace=True)
df.columns

Index(['reportyear', 'geoname', 'geotypevalue', 'county_name', 'county_fips',
       'region_name', 'region_code', 'dof_population', 'AA', 'FR', 'M_and_NNM',
       'R', 'VCT', 'rate'],
      dtype='object')
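
Before dropping a column outright, it can help to confirm how much of it is actually missing. The sketch below is illustrative only: it uses a small made-up frame (`toy`, not the project data) to show how the share of missing values per column can be measured, and how `dropna(axis=1, how="all")` removes only the columns in which every value is missing.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crime data (values are illustrative only)
toy = pd.DataFrame({
    "geoname": ["A city", "B city", "C city"],
    "rate": [6.5, 1.7, 4.2],
    "JDNR": [np.nan, np.nan, np.nan],
})

# Fraction of missing values in each column (1.0 means entirely missing)
missing_share = toy.isna().mean()

# Drop only the columns in which every value is missing
toy_clean = toy.dropna(axis=1, how="all")

print(missing_share)
print(list(toy_clean.columns))
```

If a column turned out to be only partly missing, dropping it would discard real information, so this check is worth running first.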

In [17]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7127 entries, 0 to 7126
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   reportyear      7127 non-null   int64  
 1   geoname         7127 non-null   object 
 2   geotypevalue    7127 non-null   int64  
 3   county_name     7127 non-null   object 
 4   county_fips     7127 non-null   int64  
 5   region_name     7127 non-null   object 
 6   region_code     7127 non-null   int64  
 7   dof_population  7127 non-null   int64  
 8   AA              7127 non-null   float64
 9   FR              7127 non-null   float64
 10  M_and_NNM       7127 non-null   float64
 11  R               7127 non-null   float64
 12  VCT             7127 non-null   float64
 13  rate            7127 non-null   float64
dtypes: float64(6), int64(5), object(3)
memory usage: 779.6+ KB


Unnamed: 0,reportyear,geotypevalue,county_fips,region_code,dof_population,AA,FR,M_and_NNM,R,VCT,rate
count,7127.0,7127.0,7127.0,7127.0,7127.0,7127.0,7127.0,7127.0,7127.0,7127.0,7127.0
mean,2006.514803,38714.311211,6056.138207,8.699453,129345.9,0.003192,0.000279,5.4e-05,0.001882,0.005407,5.406838
std,4.027136,26461.629975,29.494075,5.145514,537630.5,0.009658,0.000751,0.000495,0.015816,0.025788,25.788374
min,2000.0,296.0,6001.0,1.0,91.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2003.0,13112.0,6037.0,4.0,13049.5,0.00119,0.000109,0.0,0.000355,0.001965,1.965312
50%,2007.0,39122.0,6059.0,10.0,34928.0,0.00218,0.000205,1.6e-05,0.000756,0.003437,3.43684
75%,2010.0,60620.0,6079.0,14.0,79622.5,0.003567,0.000324,5.3e-05,0.001437,0.005396,5.396215
max,2013.0,87056.0,6115.0,14.0,10017640.0,0.319149,0.032258,0.022222,0.489362,0.795699,795.698925


### Merging Geographic Coordinates

To effectively visualize and analyze crime data across California, it's important to associate each record with geographic coordinates (latitude and longitude). This will allow us to create meaningful spatial plots and identify regional crime patterns.

The main dataset contains identifiers like `county_name` and `region_name`, but no explicit coordinate information. In this section, we geocode each unique `geoname` (city or county) to build a secondary dataset of latitude and longitude values, then merge it with the cleaned crime dataset.

This merge will enable geospatial visualizations in the next step.

We use the geopy library, following [this Kaggle notebook](https://www.kaggle.com/code/rayhanlahdji/geocoding-with-geopy-nominatim), where we learned about [Nominatim](https://nominatim.org/release-docs/latest/) and geopy's [RateLimiter](https://geopy.readthedocs.io/en/stable/).

In [18]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

In [15]:
# Get unique city or region names (non-null values) from the 'geoname' column
cities = df["geoname"].dropna().unique()

# Set up the Nominatim geolocator (with user agent to respect usage policy)
geolocator = Nominatim(user_agent="geo_plot_app")

# Apply a rate limiter to avoid being blocked for sending requests too quickly
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Initialize a list to store tuples of (city_name, latitude, longitude)
city_coords = []

# Loop through each city name and retrieve its coordinates from Nominatim
for city in cities:
    loc = geocode(f"{city}, California, USA")  # Add region/country for better accuracy
    if loc:
        city_coords.append((city, loc.latitude, loc.longitude))
    else:
        # If the city is not found, store None for coordinates
        city_coords.append((city, None, None))

# Create a DataFrame from the collected coordinate data
coord_df = pd.DataFrame(city_coords, columns=["geoname", "lat", "lon"])

# Save the coordinates to a CSV for merging and future reuse (avoids repeating API calls)
coord_df.to_csv("city_coords.csv", index=False)
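
The loop above sends names such as `Adelanto city` to Nominatim verbatim. The trailing "city" and the absence of a county may make some matches ambiguous. As a possible refinement, the sketch below builds a more specific query string. `build_query` is a hypothetical helper, not part of the original notebook, and whether it improves the matches was not tested here.

```python
def build_query(geoname: str, county_name: str) -> str:
    """Build a geocoding query string (hypothetical helper, not from the notebook)."""
    suffix = " city"
    if geoname.endswith(suffix):
        # Strip the " city" suffix and add the county to disambiguate the place
        place = geoname[: -len(suffix)]
        return f"{place}, {county_name} County, California, USA"
    # Other names are passed through unchanged, as in the original loop
    return f"{geoname}, California, USA"


print(build_query("Angels city", "Calaveras"))
print(build_query("Alameda", "Alameda"))
```

Using it would mean iterating over unique (`geoname`, `county_name`) pairs instead of `geoname` alone, while still storing the original `geoname` as the merge key.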

The coordinates are now saved in a separate file, `city_coords.csv`. The next step is to merge it with the original dataset so that we can later plot crime rates and see where they are higher or lower across the state.
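
As a rough illustration of what this merge does, the sketch below joins a tiny crime table to a coordinates table on `geoname`. The frames `crime` and `coords` are toy data (the 2001 rate is made up), and the `validate` and `indicator` arguments are optional safeguards that the notebook itself does not use.

```python
import pandas as pd

# Toy crime table: several yearly rows per place (illustrative values)
crime = pd.DataFrame({
    "reportyear": [2000, 2001, 2000],
    "geoname": ["Adelanto city", "Adelanto city", "Albany city"],
    "rate": [6.56, 6.10, 5.35],
})

# Toy coordinates table: exactly one row per place
coords = pd.DataFrame({
    "geoname": ["Adelanto city", "Albany city"],
    "lat": [34.572532, 37.895679],
    "lon": [-117.411127, -122.309319],
})

# Left merge keeps every crime row; validate raises an error if coords
# holds duplicate geonames; indicator adds a "_merge" column showing
# whether each row found a match
merged = pd.merge(crime, coords, on="geoname", how="left",
                  validate="many_to_one", indicator=True)

# Places that found no coordinates at all
unmatched = merged.loc[merged["_merge"] == "left_only", "geoname"].unique()

print(merged)
print(unmatched)
```

A left merge on a many-to-one key should leave the number of crime rows unchanged, which is a quick way to confirm that no rows were duplicated.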

In [19]:
df_coords = pd.read_csv("city_coords.csv")  # Load the coordinates dataset

In [20]:
# Check that both DataFrames use the same key, geoname
# (the notebook displays only the result of the last expression)
df[['geoname']].drop_duplicates().head()
df_coords[['geoname']].drop_duplicates().head()

Unnamed: 0,geoname
0,Adelanto city
1,Agoura Hills city
2,Alameda
3,Alameda city
4,Albany city


In [21]:
df_merged = pd.merge(df, df_coords, on="geoname", how="left")

In [22]:
print(df_merged[['geoname', 'lat', 'lon']].isnull().sum())

geoname    0
lat        0
lon        0
dtype: int64


In [23]:
# See which geonames didn't get matched
missing_coords = df_merged[df_merged['lat'].isnull()]['geoname'].unique()
print(missing_coords[:1000])  # Just show a sample

[]


In [24]:
df_merged

Unnamed: 0,reportyear,geoname,geotypevalue,county_name,county_fips,region_name,region_code,dof_population,AA,FR,M_and_NNM,R,VCT,rate,lat,lon
0,2000,Adelanto city,296,San Bernardino,6071,Southern California,14,18130,0.005461,0.000221,0.000110,0.000772,0.006564,6.563707,34.572532,-117.411127
1,2000,Agoura Hills city,394,Los Angeles,6037,Southern California,14,20537,0.001169,0.000097,0.000000,0.000487,0.001753,1.752934,34.156930,-118.757234
2,2000,Alameda,6001,Alameda,6001,Bay Area,1,1443939,0.003780,0.000393,0.000076,0.002333,0.006582,6.582206,37.609029,-121.899142
3,2000,Alameda city,562,Alameda,6001,Bay Area,1,72259,0.002519,0.000125,0.000014,0.001522,0.004179,4.179410,34.240607,-116.847337
4,2000,Albany city,674,Alameda,6001,Bay Area,1,16444,0.002858,0.000182,0.000000,0.002311,0.005351,5.351496,37.895679,-122.309319
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7122,2013,Tehachapi city,78092,Kern,6029,San Joaquin Valley,10,13316,0.002131,0.000367,0.000000,0.000514,0.003013,3.013155,35.138461,-118.463451
7123,2013,Wildomar city,85446,Riverside,6065,Southern California,14,33416,0.000807,0.000060,0.000000,0.000329,0.001196,1.195707,33.594128,-117.241271
7124,2013,Angels city,2112,Calaveras,6009,Central/Southeast Sierra,3,3806,0.001877,0.000268,0.000000,0.000268,0.002414,2.413516,34.053691,-118.242766
7125,2013,McFarland city,44826,Kern,6029,San Joaquin Valley,10,13141,0.003581,0.000163,0.000081,0.001465,0.005290,5.289714,35.679550,-119.229086
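
The null check above only shows that every `geoname` received some coordinates, not that the coordinates are plausible. One possible sanity check, sketched below, compares each place with the geocoded point of its own county and flags large distances. It uses three rows copied from the merged output. The `haversine_km` helper, the 150 km threshold, and the assumption that county rows are those where `geoname` equals `county_name` are all illustrative choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Three rows copied from the merged output above
sample = pd.DataFrame({
    "geoname": ["Alameda", "Alameda city", "Albany city"],
    "county_name": ["Alameda", "Alameda", "Alameda"],
    "lat": [37.609029, 34.240607, 37.895679],
    "lon": [-121.899142, -116.847337, -122.309319],
})


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


# Assumption: a county-level row is one whose geoname equals its county_name
county_ref = (sample[sample["geoname"] == sample["county_name"]]
              .set_index("county_name")[["lat", "lon"]]
              .rename(columns={"lat": "county_lat", "lon": "county_lon"}))

# Attach the county reference point to every row and compute the distance
checked = sample.join(county_ref, on="county_name")
checked["dist_km"] = haversine_km(checked["lat"], checked["lon"],
                                  checked["county_lat"], checked["county_lon"])

MAX_KM = 150  # hypothetical threshold; large counties may need a higher value
suspicious = checked.loc[checked["dist_km"] > MAX_KM, "geoname"].tolist()
print(suspicious)
```

On these three rows the check flags only `Alameda city`, whose geocoded point lies several hundred kilometres from the Alameda county point, which suggests that some geocoding results may need manual review before plotting.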
