# Activity 3: Machine Learning Applications (Optional)

## Introduction
Welcome to the final activity of this project!
Here, we'll apply what we learned in Activity 2, where we determined that household size, youth concentration, and poverty correlate most strongly with crime.

Using this information, is it possible to predict the future number of crimes in a specific location?

To answer this question, we'll run a short linear regression that takes these three factors into account, then use it to predict the crime rate of a neighborhood in Seattle. Run the following cell to get started.

In [12]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model
import statsmodels.api as sm
from scipy import stats
from sklearn import metrics
from sklearn.metrics import mean_squared_error

crime = pd.read_csv("updated_crime.csv")
equity = pd.read_csv("equity.csv")
census = pd.read_csv("census.csv")
neighborhood = pd.read_csv("neighborhooddata.csv")

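# Trim the last four characters so GEOID10 holds the 11-digit tract-level ID used in the other tables
neighborhood["GEOID10"] = neighborhood["GEOID10"].map(lambda x: str(x)[:-4]).astype(int)

crime["GEOID10"] = crime["GEOID10"].map(lambda x: str(x)[:-4]).astype(int)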

equity = equity[equity['GEOID10'].isin(crime["GEOID10"].to_list())]

get_neighborhood = neighborhood.set_index("CRA_NAME")
neighborhood_dict = get_neighborhood.to_dict()["GEOID10"]

census.replace({"Community Reporting Area Names":neighborhood_dict}, inplace = True)
census.rename(columns = {"Community Reporting Area Names": "GEOID10"}, inplace = True)

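# Count reported crimes per GEOID10, then join the census and equity columns
geo = pd.DataFrame(crime["GEOID10"].value_counts())
geo.reset_index(inplace=True)
geo.columns = ['GEOID10', 'Count']
crime_data = geo.merge(census)
crime_data = crime_data.merge(equity)
# "Total Population" is stored as text with thousands separators; convert it to a number
crime_data["Total Population"] = crime_data["Total Population"].str.replace(',', '').astype(float)
# Crime rate = number of reported crimes divided by total population
crime_data["Crime Rate"] = (crime_data["Count"] / crime_data["Total Population"]).astype(float)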

Run the cell below. It loads the table from earlier, which lists the crime rate, average household size, percent of population under 18, and percent of population whose income was under 200% of the poverty level for each neighborhood. Note that this data was collected in 2010, so all crime rates and predictions are for that year.

In [13]:
chosen_columns = ["Crime Rate", 'Average Household Size',
 'Percent of Population under 18 years', 'PCT_POP_INC_UNDER_200_POVERTY', 'GEOID10']

# Keep only the columns used by the model
crime_data = crime_data[chosen_columns]
# Leave GEOID10 53033008100 out of the analysis
crime_data = crime_data.loc[crime_data["GEOID10"] != 53033008100]
crime_data.head(3)

| | Crime Rate | Average Household Size | Percent of Population under 18 years | PCT_POP_INC_UNDER_200_POVERTY | GEOID10 |
|---|---|---|---|---|---|
| 1 | 2.161681 | 1.47 | 4.2 | 0.287987 | 53033008400 |
| 2 | 3.077618 | 1.99 | 15.1 | 0.417321 | 53033001200 |
| 3 | 0.988396 | 1.87 | 10.8 | 0.195256 | 53033005900 |


### How does a linear regression work?
Below is the equation for a linear regression line.

$ y = a + b{x} $

As can be seen in <a href= "https://drive.google.com/file/d/1dOODHyyJxy3SHaLIgrNoxj1B-DVPCz_S/view?usp=sharing">this image</a>, a linear regression uses paired $x$ and $y$ values to find a linear relationship between them, which is usually visualized as a line of best fit. The regression chooses the intercept $a$ and the slope $b$ that minimize the sum of squared vertical distances between each point and the line. A $y$ value can then be predicted by plugging a value of $x$ into the regression equation.
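To make the idea concrete, here is a minimal sketch of a one-variable fit. The five data points are made up for illustration and are not taken from the crime data; `np.polyfit` is used here only because it is a compact way to get a least-squares line.

```python
import numpy as np

# Made-up (x, y) points that lie roughly along a straight line (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# With deg=1, np.polyfit returns [slope, intercept] of the least-squares line.
b, a = np.polyfit(x, y, deg=1)

# Predict y for a new x by plugging it into y = a + b*x.
x_new = 5.0
y_new = a + b * x_new

print(f"a = {a:.3f}, b = {b:.3f}, predicted y at x = {x_new}: {y_new:.3f}")
```

The fitted $a$ and $b$ play the same roles as in the equation above: $a$ is where the line crosses the $y$ axis, and $b$ is how much $y$ changes for each one-unit increase in $x$.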


Since we're using three predictors (average household size, percent of population under 18, and percent of population whose income was under 200% of the poverty level) to predict crime rate, our linear regression equation will look a little different:

$ y = a + b{x}_{1} + c{x}_{2} + d{x}_{3} $ 

${x}_{1}$, ${x}_{2}$, and ${x}_{3}$ each represent one column of the table above that affects $y$, our predicted crime rate. Every neighborhood has its own ${x}_{1}$, ${x}_{2}$, and ${x}_{3}$ values and a corresponding $y$ value. As in the single-variable example, the regression finds the values of $a$, $b$, $c$, and $d$ that best fit the given data; with three $x$ variables, the fit is no longer a line but its higher-dimensional equivalent. Our model then uses this equation to predict $y$ from a neighborhood's ${x}_{1}$, ${x}_{2}$, and ${x}_{3}$ values.
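The three-variable equation can be sketched the same way. The example below uses synthetic data generated from known coefficients (the numbers 0.5, 2.0, -1.0, and 3.0 are arbitrary assumptions, not values from the crime data) and solves the ordinary least-squares problem directly with `np.linalg.lstsq`, which is the same kind of fit that `linear_model.LinearRegression` performs by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors: 50 made-up "neighborhoods" with columns x1, x2, x3.
X = rng.uniform(0.0, 1.0, size=(50, 3))

# Arbitrary "true" coefficients a, b, c, d used to generate y (illustration only).
true_coefs = np.array([0.5, 2.0, -1.0, 3.0])

# Prepend a column of ones so the intercept a is estimated with b, c, d.
design = np.column_stack([np.ones(len(X)), X])
y = design @ true_coefs

# Ordinary least squares: find the coefficients minimizing the squared error.
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b, c, d = coefs

# Predict y for one new set of (x1, x2, x3) values.
x_new = np.array([0.2, 0.4, 0.6])
y_new = a + b * x_new[0] + c * x_new[1] + d * x_new[2]

print("a, b, c, d =", np.round(coefs, 3))
print("predicted y =", round(float(y_new), 3))
```

Because the synthetic $y$ values contain no noise, the fit recovers the generating coefficients almost exactly. Real data, like the crime table, is noisy, so the fitted coefficients are only estimates.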

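Below, we fit a linear regression to the table above. The cell first takes the natural log of the crime rate, the percent of population under 18, and the poverty percentage (average household size is left as is), so the model works with log-transformed values. We've also held out two neighborhoods, Roxhill/Westwood and Highland Park, to test the accuracy of our model. Run the cell below to see the results.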

In [14]:
# Log-transform the crime rate and two of the predictors; household size stays on its original scale
model_crime = crime_data.copy()
model_crime['Log Percent of Population under 18 years'] = np.log(model_crime["Percent of Population under 18 years"])
model_crime["LOG_PCT_POP_INC_UNDER_200_POVERTY"] = np.log(model_crime["PCT_POP_INC_UNDER_200_POVERTY"])
model_crime["Log Crime Rate"] = np.log(model_crime["Crime Rate"])
model_crime = model_crime.drop(["Percent of Population under 18 years","PCT_POP_INC_UNDER_200_POVERTY","Crime Rate"],axis = 1)

# Training set: every tract except the two held out for testing
eliminate_geos = model_crime.loc[~model_crime["GEOID10"].isin([53033011300,53033011500])]
X_train = eliminate_geos.drop(["Log Crime Rate", "GEOID10"], axis = 1)
y_train = eliminate_geos["Log Crime Rate"]

# Test set: only the two held-out tracts
only_geos = model_crime.loc[model_crime["GEOID10"].isin([53033011300,53033011500])]
X_test = only_geos.drop(["Log Crime Rate", "GEOID10"],axis = 1)
y_test = only_geos["Log Crime Rate"]

# Full feature matrix and target (all tracts)
X = model_crime.drop(["GEOID10","Log Crime Rate"],axis = 1)
y = model_crime["Log Crime Rate"]

# Fit on the training tracts, then predict the log crime rate of the held-out tracts
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

# Build a display table with neighborhood names for the two held-out tracts
show_location = neighborhood.copy()
show_location = show_location[["GEOID10", "CRA_NAME"]]
show_location = show_location.loc[show_location["GEOID10"].isin([53033011300,53033011500])]
show_location = crime_data.merge(show_location.drop_duplicates(keep="first"))
# The displayed "Crime Rate" column is put on the log scale to match the predictions
show_location["Crime Rate"] = np.log(show_location["Crime Rate"])
# Attach the log-scale predictions in reversed order, taking absolute values
show_location["Predicted Crime Rate"] = abs(predictions[::-1])
show_location


| | Crime Rate | Average Household Size | Percent of Population under 18 years | PCT_POP_INC_UNDER_200_POVERTY | GEOID10 | CRA_NAME | Predicted Crime Rate |
|---|---|---|---|---|---|---|---|
| 0 | 0.012039 | 2.28 | 19.0 | 0.131413 | 53033011500 | Roxhill/Westwood | 0.014712 |
| 1 | 0.325824 | 2.33 | 22.0 | 0.318208 | 53033011300 | Highland Park | 0.361241 |


Above, you can compare the Crime Rate to the Predicted Crime Rate of these two locations. Note that both columns are on the natural-log scale, because the cell takes the log of the crime rate before displaying it.
With a model like this and data on the three predictors we've defined, law enforcement could estimate crime rates in specific neighborhoods and allocate police resources to those areas accordingly.
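To put a number on the comparison, the sketch below copies the four log-scale values from the output table, converts them back to plain crime rates with `np.exp`, and computes a root mean squared error. It assumes the displayed predictions are the model's log-scale outputs. With only two held-out neighborhoods, the error figure is a rough indication rather than a reliable measure of accuracy.

```python
import numpy as np

# Log-scale values copied from the output table above.
names = ["Roxhill/Westwood", "Highland Park"]
actual_log = np.array([0.012039, 0.325824])
predicted_log = np.array([0.014712, 0.361241])

# Undo the natural log to get crime rates (reported crimes per resident).
actual_rate = np.exp(actual_log)
predicted_rate = np.exp(predicted_log)

# Root mean squared error between predictions and actual values, on the log scale.
rmse_log = np.sqrt(np.mean((predicted_log - actual_log) ** 2))

for name, act, pred in zip(names, actual_rate, predicted_rate):
    print(f"{name}: actual rate {act:.3f}, predicted rate {pred:.3f}")
print(f"RMSE on the log scale: {rmse_log:.4f}")
```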