# **Predicting U.S. Crime Rates**

## **Target Data Retrieval**

In this notebook, we access the [FBI Crime Database](https://crime-data-explorer.fr.cloud.gov/pages/docApi) to retrieve counts of crime incidents by state and year.

We'll also do a little bit of data pre-processing here.

---

In [1]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import time
from datetime import date

In [2]:
#api call function
#note that we pass a start year of 1960, but the data for most states starts in 1979.
def get_crime_data(state):
    """Returns a dataframe of FBI UCR crime estimates for one state, sorted by year"""
    end_year = date.today().year
    baseurl = 'https://api.usa.gov/crime/fbi/sapi/'
    endpoint = f'api/estimates/states/{state}/1960/{end_year}?api_key='
    apikey = '' ### YOUR API KEY HERE
    res = requests.get(baseurl + endpoint + apikey, timeout=30)
    res.raise_for_status() #fail loudly on a bad key or a failed request

    #drop the first column of the results, then sort chronologically
    df = pd.DataFrame(res.json()['results']).iloc[:,1:].sort_values('year')
    return df

#all the states' abbreviations. taken from https://gist.github.com/rogerallen/1583593
us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
}
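Hard-coding the key inside `get_crime_data` makes the notebook awkward to share. As a minimal sketch, the request URL could instead be assembled by a small helper that reads the key from an environment variable. The helper name `build_estimates_url` and the variable name `FBI_API_KEY` are assumptions made for illustration; the URL layout is copied from the function above.

```python
import os
from datetime import date
from urllib.parse import urlencode

BASE_URL = 'https://api.usa.gov/crime/fbi/sapi/'

def build_estimates_url(state, start_year=1960, end_year=None, api_key=None):
    """Build the state-estimates URL used by get_crime_data (hypothetical helper)."""
    if end_year is None:
        end_year = date.today().year
    if api_key is None:
        # FBI_API_KEY is an assumed environment variable name
        api_key = os.environ.get('FBI_API_KEY', '')
    path = f'api/estimates/states/{state}/{start_year}/{end_year}'
    query = urlencode({'api_key': api_key})
    return BASE_URL + path + '?' + query

# no request is sent here; this only shows the URL that would be requested
print(build_estimates_url('AL', end_year=2020, api_key='demo'))
```

Building the URL separately from sending the request also makes the string easy to inspect before any API quota is spent.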

Run the loop below to get crime data for all 50 states (+ DC).

In [3]:
#collect one dataframe per state, then concatenate once at the end
#(DataFrame.append was removed in pandas 2.0)
frames = []
for full_state, state in us_state_to_abbrev.items():
    frames.append(get_crime_data(state))
    print(f'Crime data for {full_state} retrieved =======================//',end='\r')
    time.sleep(3) #pause between requests
state_crimes = pd.concat(frames, ignore_index=True)

print('\r==========================================================[ All Done ]//')



At this point we have the crime data, but some values are missing and need processing. Inspecting the data types and null counts shows that the rape offense is split across two columns, `rape_legacy` and `rape_revised`, each populated for only some years, which suggests the counts were revised at some point.
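Besides `info()`, a direct per-column null count can make this kind of gap easier to spot. The sketch below uses a toy frame with made-up numbers, not the FBI data, and only illustrates the check.

```python
import numpy as np
import pandas as pd

# toy rows (made-up numbers) mimicking two partially filled columns
toy = pd.DataFrame({
    'year': [2011, 2012, 2013, 2014],
    'rape_legacy': [10.0, 12.0, 11.0, np.nan],
    'rape_revised': [np.nan, np.nan, 15.0, 16.0],
})

# number of missing values in each column
null_counts = toy.isna().sum()
print(null_counts)
```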

In [4]:
state_crimes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2142 entries, 0 to 2141
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   state_abbr           2142 non-null   object 
 1   year                 2142 non-null   int64  
 2   population           2142 non-null   int64  
 3   violent_crime        2142 non-null   int64  
 4   homicide             2142 non-null   int64  
 5   rape_legacy          1938 non-null   float64
 6   rape_revised         408 non-null    float64
 7   robbery              2142 non-null   int64  
 8   aggravated_assault   2142 non-null   int64  
 9   property_crime       2142 non-null   int64  
 10  burglary             2142 non-null   int64  
 11  larceny              2142 non-null   int64  
 12  motor_vehicle_theft  2142 non-null   int64  
 13  arson                2142 non-null   int64  
dtypes: float64(2), int64(11), object(1)
memory usage: 234.4+ KB


To deal with this, for years where a `rape_revised` count is non-missing, we'll use that count. Otherwise, we'll use the `rape_legacy` count. We'll store the result in a new column named `rape`, and then discard the `rape_legacy` and `rape_revised` columns.
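The rule can be checked on a toy frame before applying it to the real data. This sketch uses made-up numbers and `Series.combine_first`, which is one of several ways to express "prefer the revised count, fall back to the legacy count".

```python
import numpy as np
import pandas as pd

# made-up counts: the first two rows only have a legacy value,
# the third has both, the last only has a revised value
toy = pd.DataFrame({
    'rape_legacy': [10.0, 12.0, 11.0, np.nan],
    'rape_revised': [np.nan, np.nan, 15.0, 16.0],
})

# take rape_revised where present, otherwise rape_legacy
toy['rape'] = toy['rape_revised'].combine_first(toy['rape_legacy']).astype('int64')
print(toy['rape'].tolist())
```

In the third row both counts exist and the revised one (15) wins, which matches the rule stated above.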

In [5]:
#use the revised count where present; otherwise fall back to the legacy count
state_crimes['rape'] = state_crimes['rape_revised'].fillna(state_crimes['rape_legacy']).astype('int64')

In [6]:
#the new column counts should match all the other no-nulls columns:
state_crimes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2142 entries, 0 to 2141
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   state_abbr           2142 non-null   object 
 1   year                 2142 non-null   int64  
 2   population           2142 non-null   int64  
 3   violent_crime        2142 non-null   int64  
 4   homicide             2142 non-null   int64  
 5   rape_legacy          1938 non-null   float64
 6   rape_revised         408 non-null    float64
 7   robbery              2142 non-null   int64  
 8   aggravated_assault   2142 non-null   int64  
 9   property_crime       2142 non-null   int64  
 10  burglary             2142 non-null   int64  
 11  larceny              2142 non-null   int64  
 12  motor_vehicle_theft  2142 non-null   int64  
 13  arson                2142 non-null   int64  
 14  rape                 2142 non-null   int64  
dtypes: float64(2), int64(12), object(1)
me

In [7]:
#remove the original fields:
state_crimes.drop(columns=['rape_legacy','rape_revised'],inplace=True)

#rearrange the column names after the old fields are dropped:
fixed_cols = ['state_abbr', 'year', 'population',
       'violent_crime', 'homicide',  'rape', 'robbery',
       'aggravated_assault', 'property_crime', 'burglary', 'larceny',
       'motor_vehicle_theft', 'arson']

state_crimes = state_crimes[fixed_cols]

Lastly, we'll add a set of calculated rate fields, defined as **incidents per thousand population**. Each offense gets a per-thousand-population field, denoted by a `_1000` suffix in the field name.
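As a quick worked example of the rate definition, using made-up numbers rather than FBI data: 100 incidents in a population of 2,000,000 is 100 / 2,000 = 0.05 incidents per thousand people.

```python
import pandas as pd

# made-up populations and counts, for illustration only
toy = pd.DataFrame({
    'population': [2_000_000, 500_000],
    'homicide': [100, 10],
})

# incidents per thousand population
toy['homicide_1000'] = toy['homicide'] / (toy['population'] / 1000)
print(toy)
```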

In [8]:
for offense in state_crimes.columns[3:]:
    state_crimes[offense + '_1000'] = state_crimes[offense]/(state_crimes['population']/1000)

In [9]:
state_crimes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2142 entries, 0 to 2141
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   state_abbr                2142 non-null   object 
 1   year                      2142 non-null   int64  
 2   population                2142 non-null   int64  
 3   violent_crime             2142 non-null   int64  
 4   homicide                  2142 non-null   int64  
 5   rape                      2142 non-null   int64  
 6   robbery                   2142 non-null   int64  
 7   aggravated_assault        2142 non-null   int64  
 8   property_crime            2142 non-null   int64  
 9   burglary                  2142 non-null   int64  
 10  larceny                   2142 non-null   int64  
 11  motor_vehicle_theft       2142 non-null   int64  
 12  arson                     2142 non-null   int64  
 13  violent_crime_1000        2142 non-null   float64
 14  homicide

---

With collection, some pre-processing, and field calculations complete, save the dataset to a file:

In [10]:
state_crimes.to_csv('../data/crimes_by_state.csv',index=False)

---

Repeat the above steps with all the states' criminal offense counts combined to get a **National** dataset:

In [11]:
offense_counts = ['population'] + list(state_crimes.columns[3:13])
us_crimes = state_crimes.copy()
us_crimes = us_crimes.groupby('year')[offense_counts].sum()

In [12]:
us_crimes.head(3)

| year | population | violent_crime | homicide | rape | robbery | aggravated_assault | property_crime | burglary | larceny | motor_vehicle_theft | arson |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1979 | 220097000 | 1207653 | 21456 | 76360 | 480499 | 629338 | 11040763 | 3327712 | 6600307 | 1112744 | 97747 |
| 1980 | 225389264 | 1344053 | 23044 | 82946 | 565616 | 672447 | 12062840 | 3795130 | 7136056 | 1131654 | 140117 |
| 1981 | 229146000 | 1361239 | 22516 | 82446 | 592629 | 663648 | 12060886 | 3779610 | 7193495 | 1087781 | 124102 |


In [13]:
#avg_*_1000 is the unweighted mean of the state-level rates for each year
for offense in offense_counts[1:]:
    us_crimes['avg_'+ offense + '_1000'] = state_crimes.groupby('year')[offense + '_1000'].mean()
us_crimes.reset_index(inplace=True)
us_crimes.tail(5)

|  | year | population | violent_crime | homicide | rape | robbery | aggravated_assault | property_crime | burglary | larceny | ... | avg_violent_crime_1000 | avg_homicide_1000 | avg_rape_1000 | avg_robbery_1000 | avg_aggravated_assault_1000 | avg_property_crime_1000 | avg_burglary_1000 | avg_larceny_1000 | avg_motor_vehicle_theft_1000 | avg_arson_1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | 2016 | 323405935 | 1285606 | 17413 | 132414 | 332797 | 802982 | 7928530 | 1516405 | 5644835 | ... | 3.941718 | 0.052603 | 0.458628 | 0.905235 | 2.525252 | 25.037643 | 4.709251 | 18.041865 | 2.286527 | 0.148394 |
| 38 | 2017 | 325147121 | 1283875 | 17294 | 135666 | 320596 | 810319 | 7682988 | 1397045 | 5513000 | ... | 3.902975 | 0.052329 | 0.466933 | 0.85159 | 2.532123 | 24.157457 | 4.357037 | 17.453567 | 2.346853 | 0.137141 |
| 39 | 2018 | 326687501 | 1252399 | 16374 | 143765 | 281278 | 810982 | 7219084 | 1235013 | 5232167 | ... | 3.821438 | 0.04959 | 0.49339 | 0.725858 | 2.552601 | 22.610954 | 3.847414 | 16.470634 | 2.292906 | 0.121859 |
| 40 | 2019 | 328329953 | 1250393 | 16669 | 143224 | 268483 | 822017 | 6995235 | 1118096 | 5152267 | ... | 3.802693 | 0.052033 | 0.490759 | 0.689349 | 2.570552 | 21.568613 | 3.456026 | 15.951128 | 2.16146 | 0.11313 |
| 41 | 2020 | 329484123 | 1313105 | 21570 | 126430 | 243600 | 921505 | 6452038 | 1035314 | 4606324 | ... | 3.961042 | 0.064651 | 0.437022 | 0.626083 | 2.833286 | 19.818617 | 3.152663 | 14.289438 | 2.376517 | 0.132387 |
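Note that the `avg_*_1000` fields are means of the state-level rates, so every state counts equally regardless of its population. The toy sketch below, with made-up numbers, contrasts that with a rate computed from the summed national counts. The `pooled_homicide_1000` column is a hypothetical name used only for this comparison and is not part of the saved dataset.

```python
import pandas as pd

# two made-up states with very different populations
toy = pd.DataFrame({
    'state_abbr': ['AA', 'BB'],
    'year': [2000, 2000],
    'population': [9_000_000, 1_000_000],
    'homicide': [900, 20],
})
toy['homicide_1000'] = toy['homicide'] / (toy['population'] / 1000)

# national totals per year, as in the notebook
national = toy.groupby('year')[['population', 'homicide']].sum()

# unweighted mean of the state rates (the notebook's avg_ fields)
national['avg_homicide_1000'] = toy.groupby('year')['homicide_1000'].mean()

# rate from the summed counts (hypothetical column, shown for contrast)
national['pooled_homicide_1000'] = national['homicide'] / (national['population'] / 1000)
print(national)
```

Here the state rates are 0.1 and 0.02, so their mean is 0.06, while the pooled rate is 920 / 10,000 = 0.092. Which of the two is appropriate depends on whether states or people are the unit of interest.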


In [14]:
us_crimes.to_csv('../data/us_crimes.csv', index=False)

---