# Logistic Regression Model for 2019 data

### Logistic regression predicts binary outcomes. The model is fit to the available data and, when presented with a new sample, estimates the probability that the sample belongs to a class. If the probability is above a chosen cutoff, the sample is assigned to that class; otherwise, it is assigned to the other class.
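
As a minimal sketch of the cutoff rule described above, the snippet below maps a linear score to a probability with the logistic (sigmoid) function and then assigns a class. The weights, bias, sample values, and the 0.5 cutoff are hypothetical and chosen only for illustration; they are not parameters of the model trained in this notebook.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(features, weights, bias, cutoff=0.5):
    """Return (probability, predicted class) for one sample."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    return probability, int(probability >= cutoff)

# Hypothetical weights and bias, chosen only for illustration
example_weights = [-0.8, 0.3]
example_bias = 1.0

prob, label = classify([2.0, 1.0], example_weights, example_bias)
print(f"probability={prob:.3f}, class={label}")
```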

#### For our project, we will show how machine learning can help predict the safety of cities throughout the state of North Carolina. 

In [1]:
# Import dependencies
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

In [4]:
# Read the 2019 crime data from the CSV file
df = pd.read_csv(Path('NC_Crime_data_2019.csv'))
df.head()

Unnamed: 0,City,Population,Murder_nonnegligent_manslaughter,Rape,Robbery,Aggravated_assault,Violent_crime_total,Burglary,Larceny_theft,Motor_vehicle_theft,Arson,Property_crime_total,total_crime,crime_index,is_safe
0,Aberdeen,7892,0,3,7,31,41,54,229,18,0,301,342,4.33,1
1,Ahoskie,4772,2,1,20,33,56,68,174,11,2,253,309,6.47,0
2,Albemarle,16134,1,4,23,118,146,179,667,36,14,882,1028,6.37,0
3,Apex,56276,0,13,3,37,53,79,432,16,0,527,580,1.03,1
4,Asheville,93641,6,59,163,467,695,833,4552,538,20,5923,6618,7.06,0


In [5]:
# Identifying the data types of each column
dtypes_2019 = df.dtypes
print(dtypes_2019)

City                                 object
Population                            int64
Murder_nonnegligent_manslaughter      int64
Rape                                  int64
Robbery                               int64
Aggravated_assault                    int64
Violent_crime_total                   int64
Burglary                              int64
Larceny_theft                         int64
Motor_vehicle_theft                   int64
Arson                                 int64
Property_crime_total                  int64
total_crime                           int64
crime_index                         float64
is_safe                               int64
dtype: object


In [6]:
# Cleaning the DataFrame
# Drop Violent_crime_total and Property_crime_total: they are subtotals of other columns, so keeping them would double count the data
# Also drop City, a text label that the model cannot use as a numeric feature
cleaned_df = df.drop(['Violent_crime_total','Property_crime_total', 'City'], axis=1)
cleaned_df.head()

Unnamed: 0,Population,Murder_nonnegligent_manslaughter,Rape,Robbery,Aggravated_assault,Burglary,Larceny_theft,Motor_vehicle_theft,Arson,total_crime,crime_index,is_safe
0,7892,0,3,7,31,54,229,18,0,342,4.33,1
1,4772,2,1,20,33,68,174,11,2,309,6.47,0
2,16134,1,4,23,118,179,667,36,14,1028,6.37,0
3,56276,0,13,3,37,79,432,16,0,580,1.03,1
4,93641,6,59,163,467,833,4552,538,20,6618,7.06,0


In [7]:
# Separating the features from the target before splitting into train and test sets
# Creating our features: every column except the target, is_safe
X = cleaned_df.drop(['is_safe'], axis=1)
X.head()

Unnamed: 0,Population,Murder_nonnegligent_manslaughter,Rape,Robbery,Aggravated_assault,Burglary,Larceny_theft,Motor_vehicle_theft,Arson,total_crime,crime_index
0,7892,0,3,7,31,54,229,18,0,342,4.33
1,4772,2,1,20,33,68,174,11,2,309,6.47
2,16134,1,4,23,118,179,667,36,14,1028,6.37
3,56276,0,13,3,37,79,432,16,0,580,1.03
4,93641,6,59,163,467,833,4552,538,20,6618,7.06


In [8]:
# Creating our target
y = cleaned_df['is_safe']
y.head()

0    1
1    0
2    0
3    1
4    0
Name: is_safe, dtype: int64

In [9]:
# Summary statistics for each feature in the DataFrame
X.describe()

Unnamed: 0,Population,Murder_nonnegligent_manslaughter,Rape,Robbery,Aggravated_assault,Burglary,Larceny_theft,Motor_vehicle_theft,Arson,total_crime,crime_index
count,176.0,176.0,176.0,176.0,176.0,176.0,176.0,176.0,176.0,176.0,176.0
mean,26361.755682,2.136364,9.625,32.625,94.1875,144.767045,635.511364,60.448864,3.971591,977.0625,3.945795
std,87545.76407,9.175971,31.293381,165.459071,399.6388,487.998533,2390.46429,277.642499,14.643742,3746.912983,2.589198
min,119.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,8.0,0.28
25%,2588.25,0.0,1.0,1.0,4.0,12.0,43.75,3.0,0.0,75.25,1.97
50%,5275.0,0.0,2.5,3.0,13.0,31.0,130.0,9.0,1.0,178.5,3.62
75%,16143.75,1.0,6.0,11.0,42.25,95.5,377.0,23.0,2.0,581.75,5.1775
max,944260.0,103.0,317.0,1975.0,4587.0,5426.0,28304.0,3340.0,151.0,44052.0,17.64


In [10]:
# Count how many cities fall into each class (1 = safe, 0 = unsafe)
y.value_counts()

1    129
0     47
Name: is_safe, dtype: int64
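
The classes are imbalanced: 129 of the 176 cities are labeled safe. A quick sketch of the majority-class baseline, using only the counts printed above, gives a reference point for judging any accuracy score reported later. The variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
class_counts = {1: 129, 0: 47}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = class_counts[majority_class] / total
print(f"Always predicting {majority_class} gives accuracy {baseline_accuracy:.3f}")
```

A useful model should score clearly above this baseline on the test set.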

In [11]:
# Using the train_test_split function to split the data into a training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

In [12]:
# Instantiate the Logistic Regression Model
classifier = LogisticRegression(solver='lbfgs', random_state=1)
classifier

LogisticRegression(random_state=1)

In [13]:
# The same model with its default parameters written out explicitly
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
   intercept_scaling=1, max_iter=100, multi_class='auto', penalty='l2',
   random_state=1, solver='lbfgs', warm_start=False)

LogisticRegression(random_state=1)

In [14]:
# Train the model
classifier.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(random_state=1)
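
The convergence warning above suggests either raising max_iter or scaling the data. One possible approach, sketched here on synthetic stand-in data rather than the notebook's DataFrame, is to chain StandardScaler and LogisticRegression in a pipeline. The names X_demo, y_demo, and scaled_classifier, the max_iter value, and the labeling rule are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales
rng = np.random.default_rng(1)
X_demo = np.column_stack([
    rng.uniform(100, 1_000_000, size=200),  # population-like scale
    rng.uniform(0, 20, size=200),           # index-like scale
])
y_demo = (X_demo[:, 1] < 5).astype(int)     # hypothetical labeling rule

# Scale each feature to zero mean and unit variance before fitting
scaled_classifier = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', random_state=1, max_iter=200),
)
scaled_classifier.fit(X_demo, y_demo)
print(scaled_classifier.score(X_demo, y_demo))
```

In the notebook, the same pipeline could be fit with X_train and y_train in place of the synthetic arrays, so that the scaler learns its statistics from the training set only.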

In [16]:
# Validate the Logistic Regression Model
predictions = classifier.predict(X_test)
pd.DataFrame({"Prediction": predictions, "Actual": y_test})

Unnamed: 0,Prediction,Actual
153,1,1
87,1,1
66,1,1
13,1,1
21,0,0
1,0,0
146,1,1
16,1,1
98,1,1
33,1,1


### Our model reaches an accuracy score of 0.977 on the test set: it correctly classified 43 of the 44 held-out cities as "safe" (1) or "unsafe" (0).

### Once the FBI releases new data, we can retrain our model and use it to help determine whether a given city in North Carolina is considered "safe".

In [18]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

0.9772727272727273
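
Accuracy is the fraction of test samples predicted correctly, but it does not show which class the errors fall in. A confusion matrix is a common companion metric. The sketch below counts the four outcomes by hand for hypothetical labels; actual_demo and predicted_demo are made up for illustration and are not the notebook's predictions.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = safe)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts

# Hypothetical labels for illustration only
actual_demo = [1, 1, 0, 1, 0, 1, 0, 1]
predicted_demo = [1, 1, 0, 1, 1, 1, 0, 1]

counts = confusion_counts(actual_demo, predicted_demo)
accuracy = (counts["tp"] + counts["tn"]) / len(actual_demo)
print(counts, accuracy)
```

In the notebook itself, sklearn.metrics.confusion_matrix(y_test, predictions) would report the same kind of breakdown for the real test set.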