<center> <h2> DS 3000 - Fall 2020</h2> </center>
<center> <h3> DS Report </h3> </center>


<center> <h3>Cops and Cams: Classifying Police Body Camera Usage By Victim Characteristics</h3> </center>
<center><h4>by Michael Ruberto and Jesse Steinberg</h4></center>

<hr style="height:2px; border:none; color:black; background-color:black;">

#### Executive Summary:

<p style='text-indent: 2em'>In this project we studied the adoption of body cameras within US police departments to see whether it is possible to accurately predict if a body camera was used in a specific shooting. We compared a neighbors-based classifier to two linear classifiers on this task and examined whether there were any significant differences in body camera likelihood across victim demographics. After cleaning a dataset compiled by the Washington Post, we used a model-based feature selection approach to select the most important features. We then trained each classifier with cross-validation for a baseline comparison, before proceeding to hyperparameter tuning via grid search. Although we found k-Nearest Neighbors to be the best classifier for this task and dataset, it still did not perform much better than random chance. This, coupled with results showing a lack of significant differences across most victim demographics, led to the conclusion that there were no underlying patterns in the dataset with which to determine body camera usage.</p>


<hr style="height:2px; border:none; color:black; background-color:black;">

## Outline
1. <a href='#1'>INTRODUCTION</a>
2. <a href='#2'>METHOD</a>
3. <a href='#3'>RESULTS</a>
4. <a href='#4'>DISCUSSION</a>

<a id="1"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 1. INTRODUCTION

<p style='text-indent: 2em;'>We are studying the adoption of body cameras within US Police Departments over the past several years. Specifically,
we would like to see if it is possible to correctly predict if a body camera was in use during specific shootings by a
police officer. For several years, and especially in the past few months, trust in the police has been waning.
One of the measures that police departments reportedly put in place was widespread use of body cameras to increase the
accountability of their officers. The insights from this project could show whether police departments are taking
calls for increased accountability seriously, and if they are keeping their word that officers would use body cameras.
</p>

<p style='text-indent: 2em;'>If police departments are truly increasing body camera adoption over time, in a correct and uniform manner, there should be no correlation between body camera use and any recorded feature of a shooting (such as victim demographics or location), with the exception of date. If our machine learning model can predict body camera use at a rate better than random chance (around 50%) when these other features are considered, that could indicate a trend or pattern in when cameras are not used, which would be indicative of a major societal problem.</p>

<p style='text-indent: 2em;'>Our primary question is: Can we predict, with better than random chance, whether a body camera was used in a given shooting? With regard to the machine learning algorithm, we ask: Will a linear classification algorithm separate the two classes as well as a neighbors-based classifier? </p>

<p style='text-indent: 2em;'>Additionally, which features, if any, will be most important in the prediction? It is possible that victim demographics play a role in whether an officer uses a body camera during a shooting. Given the present social climate, we plan to perform several hypothesis tests to investigate this subject. Specifically, is there any significant difference in body camera usage between racial groups? Between genders? Between age groups? We will test each of these with Chi-Squared tests for Homogeneity.
</p>
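As a rough illustration of the kind of test described above, the following minimal sketch runs a chi-squared test on a 2x2 contingency table with SciPy's `chi2_contingency`. The counts are hypothetical and were invented purely for illustration; they are not taken from the dataset. SciPy documents this function as a test of independence, but the statistic and degrees of freedom are the same as for a test of homogeneity. The significance level `alpha = 0.05` is also an assumption.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts, invented for illustration only (NOT from the dataset).
# Rows: two demographic groups; columns: [camera on, camera off].
observed = np.array([
    [30, 170],
    [45, 155],
])

# correction=False disables Yates' continuity correction for the 2x2 case
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

alpha = 0.05  # assumed significance level
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The `expected` array holds the counts we would expect in each cell if both groups shared the same body camera usage rate.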

<a id="2"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 2. METHOD

### 2.1. Data Acquisition

<p style='text-indent: 2em;'>Our dataset comes from the <a href='https://github.com/washingtonpost/data-police-shootings'>Washington Post</a>, who, <a href='https://www.washingtonpost.com/graphics/investigations/police-shootings-database/'>starting in 2015</a>, began to track comprehensive data on fatal shootings by a police officer in the United States. The majority of data in this set was compiled from news stories, social media, and official police reports. The dataset currently has $5862$ samples/rows <b>(as of 12/07/2020)</b>. Unfortunately, this number increases every week. Our target variable is a boolean class representing if a body camera was used during the shooting. Our $11$ features are date, manner of death, if the victim was armed, victim age, victim race, victim gender, the state where it occurred, whether the victim exhibited signs of mental illness, whether the victim attempted to flee, as well as the latitude and longitude of the incident.</p> 

<p style='text-indent: 2em;'>Additionally, the dataset contains five columns which we will exclude: id, name, city, threat_level, and is_geocoding_exact. We deemed the threat_level column, as described on the GitHub page, too vague to draw conclusions from, because its values of "attack", "other", and "undetermined" may each include incidents where an officer was threatened. The id column is simply a unique number for each incident and thus plays no role in the incident itself. The city value is too specific to be meaningfully represented in a numeric fashion. The is_geocoding_exact column is a boolean variable which indicates whether the longitude and latitude values are exact. As of the time of this writing, only 8 samples are False, so we filter these samples out and otherwise ignore the column. Finally, because names are personally identifiable, we believed it would be unethical to use them as a feature in this project. As such, each of these five columns will be ignored.</p>

<b>Dataset:</b> https://github.com/washingtonpost/data-police-shootings

In [77]:
import pandas as pd

file = "https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv"
raw_data = pd.read_csv(file)

In [78]:
raw_data

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,longitude,latitude,is_geocoding_exact
0,3,Tim Elliot,2015-01-02,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False,-123.122,47.247,True
1,4,Lewis Lee Lembke,2015-01-02,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False,-122.892,45.487,True
2,5,John Paul Quintero,2015-01-03,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False,-97.281,37.695,True
3,8,Matthew Hoffman,2015-01-04,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False,-122.422,37.763,True
4,9,Michael Rodriguez,2015-01-04,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False,-104.692,40.384,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5860,6401,Mark Brewer,2020-12-06,shot,screwdriver,28.0,M,,St. Louis,MO,False,other,Not fleeing,False,-90.260,38.586,True
5861,6411,Donald Edwin Saunders,2020-12-06,shot,gun,37.0,M,B,Dayton,OH,False,attack,Not fleeing,False,-84.138,39.772,True
5862,6408,,2020-12-08,shot,hammer,,M,,Las Vegas,NV,False,attack,Not fleeing,False,-115.286,36.096,True
5863,6409,,2020-12-08,shot,undetermined,,M,,Gates,OR,False,undetermined,,False,-122.417,44.756,True


-----
### 2.2. Data Analysis

<p style='text-indent: 2em'>Our outcome variable is body_camera and our feature variables are date, manner_of_death, armed, age, gender, race, state, signs_of_mental_illness, flee, longitude, and latitude. The dependent variable for each of our hypotheses is our outcome variable, body_camera, and the independent variable is age, gender, or race, depending on the test. For our machine learning algorithm, we expect date to be an important predictor: if body camera use has increased over time, there should be a strong, positive correlation between date and a positive body camera classification. If any of the other features are determined to be important, it could indicate bias within the police system. That is, since officers are always supposed to have body cameras on, a feature that predicts a lack of body camera use would suggest that officers are actively choosing not to use a camera in certain situations. We anticipate features such as age, gender, race, and possibly state as being especially significant.</p>

<p style='text-indent: 2em'>This is a supervised machine learning task, specifically binary classification, because we have a set of feature variables which we use to predict a known target variable whose domain is limited to True or False. For this task, we plan to compare the effectiveness of a neighbors-based classifier against linear classifiers on our dataset. We will use scikit-learn's KNeighborsClassifier to represent a neighbors-based algorithm, and the RidgeClassifier and LinearSVC models to represent linear classifiers. Support Vector Machine algorithms like scikit-learn's LinearSVC perform well on linearly separable classes, so high performance from LinearSVC would show a strong separation between body camera classes. Ridge algorithms like RidgeClassifier work well with high-dimensional data when many of the variables are important, so RidgeClassifier should perform better if many of our features correlate strongly with a positive bodycam classification. By using each of our algorithms in this way, we can gain a deeper insight into our data.</p>
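The comparison described above can be sketched as follows. This is a minimal example that assumes synthetic stand-in data from `make_classification` rather than the shootings dataset, and default hyperparameters apart from a larger `max_iter` for `LinearSVC`. It only shows the shape of a cross-validated baseline comparison and does not reproduce our results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption for illustration, not the real dataset)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RidgeClassifier": RidgeClassifier(),
    "LinearSVC": LinearSVC(max_iter=10000),
}

# Mean accuracy over 5 cross-validation folds for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```

Because every model is scored on the same folds with the same metric, the mean accuracies are directly comparable.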

<a id="3"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 3. RESULTS

### 3.1. Data Wrangling

<i><h3>Data Cleaning </h3></i>

<h4>Drop Unused Columns & Inexact Geolocations</h4>
<li>
    As explained in <b>Section 2.1</b>, we will not be using these columns as features in this project. We drop them and store the resulting DataFrame.
</li>

In [79]:
drop_set = ['id', 'name', 'city', 'threat_level', 'is_geocoding_exact']

# Filter to only true geocodings before dropping
data = raw_data[raw_data['is_geocoding_exact']].drop(drop_set, axis=1)

In [80]:
data

Unnamed: 0,date,manner_of_death,armed,age,gender,race,state,signs_of_mental_illness,flee,body_camera,longitude,latitude
0,2015-01-02,shot,gun,53.0,M,A,WA,True,Not fleeing,False,-123.122,47.247
1,2015-01-02,shot,gun,47.0,M,W,OR,False,Not fleeing,False,-122.892,45.487
2,2015-01-03,shot and Tasered,unarmed,23.0,M,H,KS,False,Not fleeing,False,-97.281,37.695
3,2015-01-04,shot,toy weapon,32.0,M,W,CA,True,Not fleeing,False,-122.422,37.763
4,2015-01-04,shot,nail gun,39.0,M,H,CO,False,Not fleeing,False,-104.692,40.384
...,...,...,...,...,...,...,...,...,...,...,...,...
5860,2020-12-06,shot,screwdriver,28.0,M,,MO,False,Not fleeing,False,-90.260,38.586
5861,2020-12-06,shot,gun,37.0,M,B,OH,False,Not fleeing,False,-84.138,39.772
5862,2020-12-08,shot,hammer,,M,,NV,False,Not fleeing,False,-115.286,36.096
5863,2020-12-08,shot,undetermined,,M,,OR,False,,False,-122.417,44.756


<h4>Drop NA Values</h4>
<li>
    Dropping NA values ensures that every sample has complete information. We cannot simply fill in entries for features like age, race, gender, etc. because there is no real default value (and to assign a default value would be ethically questionable).
</li>

In [81]:
data = data.dropna(how='any').reset_index(drop=True)

In [82]:
data

Unnamed: 0,date,manner_of_death,armed,age,gender,race,state,signs_of_mental_illness,flee,body_camera,longitude,latitude
0,2015-01-02,shot,gun,53.0,M,A,WA,True,Not fleeing,False,-123.122,47.247
1,2015-01-02,shot,gun,47.0,M,W,OR,False,Not fleeing,False,-122.892,45.487
2,2015-01-03,shot and Tasered,unarmed,23.0,M,H,KS,False,Not fleeing,False,-97.281,37.695
3,2015-01-04,shot,toy weapon,32.0,M,W,CA,True,Not fleeing,False,-122.422,37.763
4,2015-01-04,shot,nail gun,39.0,M,H,CO,False,Not fleeing,False,-104.692,40.384
...,...,...,...,...,...,...,...,...,...,...,...,...
4525,2020-12-01,shot,gun,28.0,M,W,NC,False,Not fleeing,False,-81.948,35.330
4526,2020-12-02,shot,gun and vehicle,20.0,M,B,FL,False,Car,False,-82.671,27.752
4527,2020-12-06,shot,vehicle,28.0,M,H,CA,False,Car,False,-117.904,33.988
4528,2020-12-06,shot,gun,37.0,M,B,OH,False,Not fleeing,False,-84.138,39.772


<i><h3>Data Formatting</h3></i>

<h4>Vectorize Features</h4>
<li>
    Sklearn machine learning algorithms require inputs in numeric formats, so we must convert our features into a numeric representation.
</li>

In [83]:
def vectorize_col(df: pd.DataFrame, col_name: str):
    """Replace each unique value in the given column with an integer code, in place.

    Returns two dicts (value -> code, code -> value) so codes can be converted back for displaying.
    """
    # Codes are assigned in order of first appearance
    str_to_num = {key: code for code, key in enumerate(df[col_name].unique())}
    num_to_str = {code: key for key, code in str_to_num.items()}

    # Assign the mapped column back rather than calling replace(inplace=True) on a
    # column selection, which may not modify df under pandas copy-on-write
    df[col_name] = df[col_name].map(str_to_num)
    return str_to_num, num_to_str

In [84]:
import numpy as np

# Convert dates to integer day counts (days since 1970-01-01)
data['date'] = data['date'].map(lambda date: np.datetime64(date).astype('int64'))

# Vectorize manner_of_death
mod_str_to_num, mod_num_to_str = vectorize_col(data, 'manner_of_death')

# Vectorize gender:
gender_str_to_num, gender_num_to_str = vectorize_col(data, 'gender')

# Vectorize race:
race_str_to_num, race_num_to_str = vectorize_col(data, 'race')

# Vectorize state:
state_str_to_num, state_num_to_str = vectorize_col(data, 'state')

In [85]:
data

Unnamed: 0,date,manner_of_death,armed,age,gender,race,state,signs_of_mental_illness,flee,body_camera,longitude,latitude
0,16437,0,gun,53.0,0,0,0,True,Not fleeing,False,-123.122,47.247
1,16437,0,gun,47.0,0,1,1,False,Not fleeing,False,-122.892,45.487
2,16438,1,unarmed,23.0,0,2,2,False,Not fleeing,False,-97.281,37.695
3,16439,0,toy weapon,32.0,0,1,3,True,Not fleeing,False,-122.422,37.763
4,16439,0,nail gun,39.0,0,2,4,False,Not fleeing,False,-104.692,40.384
...,...,...,...,...,...,...,...,...,...,...,...,...
4525,18597,0,gun,28.0,0,1,33,False,Not fleeing,False,-81.948,35.330
4526,18598,0,gun and vehicle,20.0,0,3,31,False,Car,False,-82.671,27.752
4527,18602,0,vehicle,28.0,0,2,3,False,Car,False,-117.904,33.988
4528,18602,0,gun,37.0,0,3,10,False,Not fleeing,False,-84.138,39.772
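The integer encoding above can also be sketched with pandas' built-in `factorize`, which likewise assigns codes in order of first appearance. The toy DataFrame below is hypothetical. The sketch also shows one-hot encoding via `get_dummies` as a common alternative, since integer codes imply an ordering between categories (e.g. that state 2 lies "between" states 1 and 3) that distance-based and linear models may pick up on.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({"state": ["WA", "OR", "KS", "WA", "OR"]})

# factorize assigns integer codes in order of first appearance
codes, uniques = pd.factorize(toy["state"])
toy["state_code"] = codes

# Two-way lookup tables, analogous to the dicts returned by vectorize_col
num_to_str = dict(enumerate(uniques))
str_to_num = {label: code for code, label in num_to_str.items()}

# Alternative: one indicator column per category, with no implied ordering
one_hot = pd.get_dummies(toy["state"])
print(toy)
print(one_hot)
```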


<h4>Convert armed to Boolean</h4>
<li>
    There are many unique string values within the "armed" feature. Many have significant overlap (e.g. "gun and vehicle", "vehicle and gun", and "gun and car"). Because of this, we feel the data would be better represented as a simpler boolean. Armed will be marked True only if the victim is listed as anything other than "unarmed" or "undetermined". While this groups together any item the victim was "armed" with in an overly simplistic way (e.g. "pen" and "gun" will both result in a True armed value), contextually speaking, the most important consideration is that the officer saw the victim as armed.
</li>

In [86]:
# Unique entries prior to converting to a boolean variable
# Not necessary, we just wanted to show how much overlapping occurs
data['armed'].unique()

array(['gun', 'unarmed', 'toy weapon', 'nail gun', 'knife', 'shovel',
       'vehicle', 'hammer', 'hatchet', 'sword', 'machete', 'box cutter',
       'undetermined', 'metal object', 'screwdriver', 'lawn mower blade',
       'flagpole', 'guns and explosives', 'cordless drill', 'metal pole',
       'Taser', 'metal pipe', 'metal hand tool', 'blunt object',
       'metal stick', 'sharp object', 'meat cleaver', 'carjack', 'chain',
       "contractor's level", 'unknown weapon', 'stapler', 'crossbow',
       'bean-bag gun', 'baseball bat and fireplace poker',
       'straight edge razor', 'gun and knife', 'ax', 'brick',
       'baseball bat', 'hand torch', 'chain saw', 'garden tool',
       'scissors', 'pole', 'pick-axe', 'flashlight', 'spear', 'chair',
       'pitchfork', 'hatchet and gun', 'rock', 'piece of wood', 'bayonet',
       'glass shard', 'motorcycle', 'pepper spray', 'metal rake', 'baton',
       'crowbar', 'oar', 'machete and gun', 'air conditioner',
       'pole and knife', 'beer

In [87]:
data['armed'] = data['armed'].map(lambda weapon: weapon.strip().lower() != 'unarmed' 
                                  and weapon.strip().lower() != 'undetermined')

<h4>Convert flee to Boolean</h4>
<li>
    Like armed, flee is a string variable with some unwieldy variation. Sometimes it details a method of flight, others simply whether the victim attempted to flee. Thus we simplified this to a boolean for consistency.
</li>

In [88]:
# "not fleeing" becomes false. ALL other values become true.
data['flee'] = data['flee'].map(lambda flee: flee.strip().lower() != 'not fleeing')

In [89]:
data

Unnamed: 0,date,manner_of_death,armed,age,gender,race,state,signs_of_mental_illness,flee,body_camera,longitude,latitude
0,16437,0,True,53.0,0,0,0,True,False,False,-123.122,47.247
1,16437,0,True,47.0,0,1,1,False,False,False,-122.892,45.487
2,16438,1,False,23.0,0,2,2,False,False,False,-97.281,37.695
3,16439,0,True,32.0,0,1,3,True,False,False,-122.422,37.763
4,16439,0,True,39.0,0,2,4,False,False,False,-104.692,40.384
...,...,...,...,...,...,...,...,...,...,...,...,...
4525,18597,0,True,28.0,0,1,33,False,False,False,-81.948,35.330
4526,18598,0,True,20.0,0,3,31,False,True,False,-82.671,27.752
4527,18602,0,True,28.0,0,2,3,False,True,False,-117.904,33.988
4528,18602,0,True,37.0,0,3,10,False,False,False,-84.138,39.772


<h4>Scale Date</h4>
<li>
    Scale dates to the range 0 to 1 (min-max scaling) so that the date feature has a standardized range.
</li>

In [90]:
from sklearn import preprocessing

data['date'] = preprocessing.MinMaxScaler().fit_transform(data['date'].values.reshape(-1,1))

In [91]:
data

Unnamed: 0,date,manner_of_death,armed,age,gender,race,state,signs_of_mental_illness,flee,body_camera,longitude,latitude
0,0.000000,0,True,53.0,0,0,0,True,False,False,-123.122,47.247
1,0.000000,0,True,47.0,0,1,1,False,False,False,-122.892,45.487
2,0.000461,1,False,23.0,0,2,2,False,False,False,-97.281,37.695
3,0.000923,0,True,32.0,0,1,3,True,False,False,-122.422,37.763
4,0.000923,0,True,39.0,0,2,4,False,False,False,-104.692,40.384
...,...,...,...,...,...,...,...,...,...,...,...,...
4525,0.996770,0,True,28.0,0,1,33,False,False,False,-81.948,35.330
4526,0.997231,0,True,20.0,0,3,31,False,True,False,-82.671,27.752
4527,0.999077,0,True,28.0,0,2,3,False,True,False,-117.904,33.988
4528,0.999077,0,True,37.0,0,3,10,False,False,False,-84.138,39.772
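For reference, min-max scaling maps each value $x$ to $(x - x_{min}) / (x_{max} - x_{min})$. The sketch below checks this formula against `MinMaxScaler` on a handful of hypothetical day counts, not the full date column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical day counts (days since 1970-01-01), for illustration only
days = np.array([16437, 16438, 16439, 17000, 18602], dtype=float)

# Manual min-max scaling
manual = (days - days.min()) / (days.max() - days.min())

# Same transformation via scikit-learn (expects a 2D column)
scaled = MinMaxScaler().fit_transform(days.reshape(-1, 1)).ravel()

print(np.allclose(manual, scaled))
```

The smallest value maps to 0 and the largest to 1, so later dates are closer to 1.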


<h4>Equalize Sample Sizes</h4>
<li>
    There are more False samples than True (for body_camera). Randomly sample from the negative class to create two equal distributions. This will prevent the ML algorithms from being biased towards the majority class.
</li>
   

In [92]:
class_counts = data['body_camera'].value_counts()
sample_size = min(class_counts[True], class_counts[False])

# Take two equal samples. Recombine into a single dataframe.
pos_samples = data[data['body_camera']].sample(n=sample_size, random_state=8675309)
neg_samples = data[data['body_camera'] == False].sample(n=sample_size, random_state=8675309)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
equalized = pd.concat([pos_samples, neg_samples])

In [93]:
equalized.reset_index(drop=True)

Unnamed: 0,date,manner_of_death,armed,age,gender,race,state,signs_of_mental_illness,flee,body_camera,longitude,latitude
0,0.222427,0,True,53.0,0,1,25,True,False,True,-116.316,43.553
1,0.841717,0,True,57.0,0,3,27,False,False,True,-95.947,41.292
2,0.179972,0,True,32.0,0,3,32,False,True,True,-90.033,35.108
3,0.867559,0,True,23.0,0,1,25,False,True,True,-112.433,42.853
4,0.906322,1,True,57.0,0,3,0,False,True,True,-122.352,47.614
...,...,...,...,...,...,...,...,...,...,...,...,...
1193,0.058145,0,True,58.0,0,1,35,True,False,False,-76.008,43.266
1194,0.694970,0,True,17.0,0,3,14,False,True,False,-87.626,41.854
1195,0.156899,0,False,36.0,0,3,35,False,True,False,-73.879,40.924
1196,0.127365,0,False,27.0,0,4,10,False,False,False,-81.436,41.042
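The undersampling step above can also be written with `groupby(...).sample(...)`, available in pandas 1.1 and later. This is a minimal sketch on a hypothetical toy frame with 3 positive and 7 negative samples; the `age` column is only filler.

```python
import pandas as pd

# Hypothetical imbalanced data: 3 positive samples, 7 negative samples
toy = pd.DataFrame({
    "body_camera": [True] * 3 + [False] * 7,
    "age": range(20, 30),
})

# Size of the smaller class
n_minority = int(toy["body_camera"].value_counts().min())

# Draw the same number of rows from each class
balanced = (
    toy.groupby("body_camera")
    .sample(n=n_minority, random_state=0)
    .reset_index(drop=True)
)
print(balanced["body_camera"].value_counts())
```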


<h4>Select Features</h4>
<li>
    We elect for a Model-Based Feature Selection approach. We are unsure how many features will be important. Rather than arbitrarily limiting this number, as we would need to for a univariate or iterative approach, a model-based approach does not restrict the feature count <i>a priori</i>.
</li>
<li>We do our initial train/test split here to prevent data leakage. X_test_selected will later be used for final model evaluation, and X_train_selected will be used for cross-validation and hyperparameter tuning.
</li>

In [94]:
from sklearn import feature_selection, linear_model, tree, model_selection

# Split dataset
features = equalized.drop('body_camera', axis=1)
target = equalized['body_camera']

X_train, X_test, y_train, y_test = model_selection.train_test_split(features, target, random_state=8675309)

# Select important features
selector = feature_selection.SelectFromModel(tree.DecisionTreeClassifier())
selected = selector.fit(X_train, y_train).get_support()
selected_cols = [feature for feature, supported in zip(X_train.columns, selected) if supported]

# Select only the features deemed important
X_train_selected = X_train[selected_cols].reset_index(drop=True)
X_test_selected = X_test[selected_cols].reset_index(drop=True)

In [95]:
X_train_selected

Unnamed: 0,date,age,longitude,latitude
0,0.623443,46.0,-83.236,42.362
1,0.748962,17.0,-77.107,39.053
2,0.143516,45.0,-121.914,37.255
3,0.706045,31.0,-85.582,42.295
4,0.062298,34.0,-117.154,32.802
...,...,...,...,...
893,0.254269,32.0,-93.170,44.992
894,0.797416,28.0,-81.334,33.839
895,0.985695,56.0,-111.775,34.785
896,0.759114,37.0,-82.414,27.721


In [96]:
y_train

3029     True
3538    False
780     False
3369    False
333      True
        ...  
1346     True
3724    False
4500     True
3589    False
2493     True
Name: body_camera, Length: 898, dtype: bool

<h4>Discretize Age</h4>
<li>
    Age is given as a continuous floating point number. In context, when someone sees a person for the first time, they can guess a general age range, but not an exact number. For a police officer, this means they might see a teenager or a middle-aged person, not exactly a "24-year-old". This may affect outcomes. We discretize age with one-hot encoding to give a more realistic representation.
</li>

In [97]:
from sklearn.preprocessing import KBinsDiscretizer

# Bin in roughly 10 year ranges
bins = int((equalized['age'].max() - equalized['age'].min()) // 10)  

# Fit the bins, then transform
binner = KBinsDiscretizer(n_bins=bins, encode='onehot-dense', strategy='uniform')
binner.fit(equalized['age'].values.reshape(-1,1))

# Only binning one feature, we can assume only a single array element
age_bins = []
for x in range(len(binner.bin_edges_[0]) - 1):
    left_edge = binner.bin_edges_[0][x]
    right_edge = binner.bin_edges_[0][x+1]
    age_bins += [f'age_{left_edge}, {right_edge}']
    
# Store onehot encoded ages in a table with bin edges as the column labels. Merge with the dataset.
bin_train_ages = pd.DataFrame(binner.transform(X_train_selected['age'].values.reshape(-1, 1)), columns=age_bins)
X_train_selected = pd.merge(X_train_selected.drop('age', axis=1), bin_train_ages, left_index=True, right_index=True)

bin_test_ages = pd.DataFrame(binner.transform(X_test_selected['age'].values.reshape(-1, 1)), columns=age_bins)
X_test_selected = pd.merge(X_test_selected.drop('age', axis=1), bin_test_ages, left_index=True, right_index=True)

In [98]:
X_train_selected

Unnamed: 0,date,longitude,latitude,"age_6.0, 16.25","age_16.25, 26.5","age_26.5, 36.75","age_36.75, 47.0","age_47.0, 57.25","age_57.25, 67.5","age_67.5, 77.75","age_77.75, 88.0"
0,0.623443,-83.236,42.362,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.748962,-77.107,39.053,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.143516,-121.914,37.255,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.706045,-85.582,42.295,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.062298,-117.154,32.802,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
893,0.254269,-93.170,44.992,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
894,0.797416,-81.334,33.839,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
895,0.985695,-111.775,34.785,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
896,0.759114,-82.414,27.721,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
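The binning above can be sketched on a small, hypothetical set of ages. With `strategy='uniform'`, the bin edges are evenly spaced between the minimum and maximum age, which is why an age range of 6 to 88 split into 8 bins gives edges 10.25 years apart.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages for illustration, spanning 6 to 88
ages = np.array([6, 17, 24, 31, 45, 52, 60, 88], dtype=float).reshape(-1, 1)

# Roughly 10-year bins: (88 - 6) // 10 gives 8 bins
n_bins = int((ages.max() - ages.min()) // 10)

binner = KBinsDiscretizer(n_bins=n_bins, encode="onehot-dense", strategy="uniform")
one_hot = binner.fit_transform(ages)

# Evenly spaced edges between the minimum and maximum age
edges = binner.bin_edges_[0]
print(edges)
print(one_hot)
```

Each row of `one_hot` contains a single 1 marking the bin that the age falls into.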


### 3.2. Data Exploration

<h4>Bodycam Use By State</h4>
<li>
    We use the plotly express library to visualize the percentage of incidents per state in which a body camera was used.
</li>

In [99]:
def build_heatmap(df: pd.DataFrame):
    """Given a DataFrame of vectorized incidents, calculate body camera usage counts per state."""
    state_counts = df['state'].value_counts()
    pos_class_counts = df[df['body_camera']]['state'].value_counts()
    neg_class_counts = df[df['body_camera'] == False]['state'].value_counts()

    # Collect one row per state, then build the DataFrame once
    # (DataFrame.append was removed in pandas 2.0)
    rows = []
    for state in df['state'].unique():
        rows.append({'state': state_num_to_str[state],
                     'bodycam_on': pos_class_counts.get(state, 0),
                     'bodycam_off': neg_class_counts.get(state, 0),
                     'total': state_counts[state]})
    return pd.DataFrame(rows, columns=['state', 'bodycam_on', 'bodycam_off', 'total'])

In [100]:
# Calculate state by state statistics to graph
heatmap_df = build_heatmap(data)
heatmap_df['percent_cam_use'] = heatmap_df.apply(lambda state: state['bodycam_on'] / state['total'], axis=1)
heatmap_df.head()

Unnamed: 0,state,bodycam_on,bodycam_off,total,percent_cam_use
0,WA,14,107,121,0.115702
1,OR,6,63,69,0.086957
2,KS,5,42,47,0.106383
3,CA,115,549,664,0.173193
4,CO,19,143,162,0.117284
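The same per-state table can be sketched with a single `groupby` aggregation. The toy frame below is hypothetical and uses state abbreviations directly rather than the integer codes from our vectorized data.

```python
import pandas as pd

# Hypothetical incidents for illustration only
toy = pd.DataFrame({
    "state": ["WA", "WA", "OR", "OR", "OR", "KS"],
    "body_camera": [True, False, False, False, True, False],
})

# For a boolean column: sum = number of True values, mean = fraction of True values
per_state = (
    toy.groupby("state")["body_camera"]
    .agg(bodycam_on="sum", total="count", percent_cam_use="mean")
    .reset_index()
)
per_state["bodycam_off"] = per_state["total"] - per_state["bodycam_on"]
print(per_state)
```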


In [101]:
from plotly import express as px

heat = px.choropleth(heatmap_df, locations='state', locationmode='USA-states', color='percent_cam_use', scope='usa',
                    hover_name='state', hover_data=['bodycam_on', 'bodycam_off', 'total'],
                    title='Percent Bodycamera Usage by US Police Departments in Fatal Shootings, by State',
                    labels=dict(percent_cam_use='% Bodycam Use'))
heat.update_traces(marker_line_width=1.5, marker_line_color='black')
heat.show()

<img src='https://raw.githubusercontent.com/jsteinberg4/cops-and-cams/main/visualizations/choropleth.png'>

<p style='text-indent: 2em'>This choropleth shows the percentage of shooting incidents in each state in which a body camera was used. As of <b>12/10/2020</b>, the majority of the country is colored purple, indicating a usage rate of around $20\%$ or lower. Even the best performing state, Vermont, shows only a $50\%$ usage rate, and that figure is likely not representative of the state as a whole, considering that it is based on a sample size of only $6$.</p>

<h4>Pie Charts & Bar Graph</h4>
<li>
    We visualize the features describing personal characteristics using pie charts or bar graphs. These characteristic traits are age, gender, and race. For each feature, we create an appropriate graph describing the breakdown for instances where a body camera was used versus those where a camera was not.
</li>

<h5>Age Breakdown</h5>

In [102]:
def partition_cams(df_train, y_train, df_test, y_test):
    """Incorporates train and test dataframes into two new dataframes which
        separate positive and negative classes, while maintaining age bins.
    """
    # Stack train rows, then test rows. Labels are matched to rows by position,
    # because the feature frames and the label Series may have different indexes.
    combined = pd.concat([df_train, df_test], ignore_index=True)
    labels = np.concatenate([np.asarray(y_train), np.asarray(y_test)]).astype(bool)

    # Boolean masks replace the row-by-row DataFrame.append calls
    # (DataFrame.append was removed in pandas 2.0)
    cams_on = combined[labels].reset_index(drop=True)
    cams_off = combined[~labels].reset_index(drop=True)
    return cams_on, cams_off

In [103]:
# Here we use both X_train_selected and X_test_selected to be representative of the whole dataset
# (and maintain discretization)
cams_on, cams_off = partition_cams(X_train_selected, y_train, X_test_selected, y_test)
cams_on.drop(['date', 'longitude', 'latitude'], axis=1, inplace=True) # Not needed for age charting
cams_off.drop(['date', 'longitude', 'latitude'], axis=1, inplace=True) # Not needed for age charting

# Relabel the age bins for readability in graphs
cams_on_counts = {col.replace('age_', '').replace(',', ' to'): 
                  cams_on[col].value_counts().get(1, 0) for col in cams_on}
cams_off_counts = {col.replace('age_', '').replace(',', ' to'): 
                   cams_off[col].value_counts().get(1, 0) for col in cams_off}

In [104]:
# Construct two pie charts
age_on_pie = px.pie(values=list(cams_on_counts.values()), names=cams_on_counts.keys(), 
       title='Age Breakdown: Bodycameras On', template='presentation', width=800)
age_on_pie.update_traces(opacity=0.75, marker_line_width=1.5, marker_line_color='black')
age_on_pie.show()

age_off_pie = px.pie(values=list(cams_off_counts.values()), names=cams_off_counts.keys(), 
                     title='Age Breakdown: Bodycameras Off', template='presentation', width=800)
age_off_pie.update_traces(opacity=0.75, marker_line_width=1.5, marker_line_color='black')
age_off_pie.show()

<img src='https://raw.githubusercontent.com/jsteinberg4/cops-and-cams/main/visualizations/age_cam_on.png'>
<img src='https://raw.githubusercontent.com/jsteinberg4/cops-and-cams/main/visualizations/age_cam_off.png'>
<p style='text-indent: 2em'>These pie charts represent the age breakdown for all incidents with and without a body camera, respectively. While not exactly equal, the relative proportions between graphs (i.e. the same age group in the two pie charts) appear fairly close. This does not seem to indicate a significant difference in body camera likelihood across age groups, but we will study this further in our hypothesis tests.</p>

<h5>Race Breakdown</h5>

In [105]:
# split to cam on and off dataframes for race
cams_on, cams_off = partition_cams(X_train, y_train, X_test, y_test)

# Full names for the dataset's single-letter race codes (assumed mapping)
race_labels = {'W': 'White', 'B': 'Black', 'H': 'Hispanic', 'A': 'Asian', 'N': 'Native', 'O': 'Other'}

# Setup race counts tables. Each count is labeled from its own race code, so the
# labels cannot drift from the counts if the sort order differs between charts.
on_counts = cams_on['race'].value_counts().rename(race_num_to_str).rename(race_labels)
off_counts = cams_off['race'].value_counts().rename(race_num_to_str).rename(race_labels)

# Build Body-cam On Chart
cam_on_pie = px.pie(values=on_counts.values, names=on_counts.index, title='Bodycameras Enabled by Race',
                    template='presentation')
cam_on_pie.update_traces(opacity=0.75, marker_line_width=1.5, marker_line_color='black')
cam_on_pie.show()

# Build Body-Cam off Chart
cam_off_pie = px.pie(values=off_counts.values, names=off_counts.index, title='Bodycameras Disabled by Race',
                     template='presentation')
cam_off_pie.update_traces(opacity=0.75, marker_line_width=1.5, marker_line_color='black')
cam_off_pie.show()

<img src='https://raw.githubusercontent.com/jsteinberg4/cops-and-cams/main/visualizations/race_cam_on.png'>
<img src='https://raw.githubusercontent.com/jsteinberg4/cops-and-cams/main/visualizations/race_cam_off.png'>
<p style='text-indent: 2em'>These pie charts represent the race breakdown for all incidents with and without a body camera, respectively. While not exactly equal, the relative proportions between graphs (i.e. the same racial group in the two pie charts) appear fairly close. The two exceptions are the blue and orange slices, representing the White and Black populations. Apart from these two groups, this does not seem to indicate a significant difference in body camera likelihood across racial groups, but we will study this further in our hypothesis tests.</p>

<h5>Gender Breakdown</h5>

In [106]:
gender_cams_on = cams_on['gender']
gender_cams_off = cams_off['gender']

# Fill gender counts into a table & relabel pos/neg classes for readability
# (built in one step because DataFrame.append was removed in pandas 2.0)
gender_cam_counts = pd.DataFrame({'Bodycam On': gender_cams_on.value_counts(),
                                  'No Bodycam': gender_cams_off.value_counts()}).T
gender_cam_counts = gender_cam_counts.rename(gender_num_to_str, axis=1)

# Graph gender counts as bar graph
bar = px.bar(gender_cam_counts, barmode='group', title='Bodycamera Usage by Gender', template='presentation',
             labels=dict(variable='Gender', index="Camera Usage"))
bar.update_yaxes(title='Number of Uses', tickangle=-40)
bar.update_traces(opacity=0.6, marker_line_width=1.5, marker_line_color='black')
bar.show()

<img src='https://raw.githubusercontent.com/jsteinberg4/cops-and-cams/main/visualizations/gender_bar.png'>
<p style='text-indent: 2em'>This bar graph shows the number of men and women involved in instances in which a body camera was or was not used, respectively. While it appears that many more men are involved in shootings overall, at least in this dataset, the relative frequency comparing within genders is about equal in terms of body camera use. This does not seem to indicate a significant difference, but we will investigate further in our hypothesis tests.

### 3.3. Model Construction

<h4>Testable Hypotheses</h4>
<ol>
    <li>Is there any significant difference in body camera usage by police officers for victims of each...
        <ol>
            <li>
                ...racial group?
            </li>
            <li>
                ...gender?
            </li>
            <li>
                ...age group?
            </li>
        </ol>
    </li>
</ol>
<p>We plan to test these with Chi-Squared tests for Homogeneity, with an $H_0$ that there is no significant difference between groups and an $H_a$ that there is a significant difference.</p>
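
As a rough illustration of what a chi-square test of homogeneity computes, the sketch below builds the expected counts from the row and column totals of a small table and sums $(O - E)^2 / E$ over all cells. The counts and the helper names `homogeneity_statistic` and `per_column_statistics` are hypothetical and are not taken from the Washington Post dataset. For comparison, it also computes a separate two-cell goodness-of-fit statistic for each column, which is what `scipy.stats.chisquare` returns when it is handed a two-row table with its default `axis=0`.

```python
def homogeneity_statistic(observed):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if every column had the same row proportions
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


def per_column_statistics(observed):
    """One goodness-of-fit statistic per column, expecting an even split within the column."""
    stats_per_column = []
    for col in zip(*observed):
        expected = sum(col) / len(col)
        stats_per_column.append(sum((count - expected) ** 2 / expected for count in col))
    return stats_per_column


# Hypothetical counts: rows are camera on / camera off, columns are three groups
example_counts = [[30, 20, 50],
                  [20, 30, 50]]

table_stat, table_dof = homogeneity_statistic(example_counts)
column_stats = per_column_statistics(example_counts)

print(f'Whole-table statistic: {table_stat:.2f} on {table_dof} degrees of freedom')
print(f'Per-column statistics: {column_stats}')
```

The whole-table statistic is a single number compared against a chi-square distribution with $(r-1)(c-1)$ degrees of freedom, whereas the per-column version yields one statistic per group, each with one degree of freedom. In SciPy, `scipy.stats.chi2_contingency` performs the whole-table calculation.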

In [107]:
from scipy import stats

<h5>A: Racial Group</h5>

In [108]:
# Build a list of dataframes, one for each distinct race
races = [equalized[equalized['race'] == race_str_to_num[race_key]] for race_key in race_str_to_num]

# Compute observed counts for race
race_observed = [[], []]
race_rows = []  # One value_counts Series per race (DataFrame.append was removed in pandas 2.0)
for race_df in races:
    race_observed[0] += [len(race_df[race_df['body_camera']])]
    race_observed[1] += [len(race_df[race_df['body_camera'] == False])]

    # Create race descriptives
    desc_series = pd.Series(race_df['body_camera'].value_counts(), 
                            name=race_num_to_str[race_df['race'].iloc[0]])
    race_rows.append(desc_series)

# Note: stats.chisquare works along axis 0, so a 2-row table yields one statistic and one
# p-value per race (camera vs. no camera within that race), not a single whole-table test
race_stat, race_pvalue = stats.chisquare(race_observed)
race_descriptives = pd.DataFrame(race_rows).T.agg(["count", "mean", "std", "sem"])

In [109]:
race_descriptives

Unnamed: 0,A,W,H,B,O,N
count,2.0,2.0,2.0,2.0,2.0,2.0
mean,13.5,274.5,110.0,188.0,4.5,8.5
std,0.707107,40.305087,2.828427,32.526912,0.707107,4.949747
sem,0.5,28.5,2.0,23.0,0.5,3.5


In [110]:
print(f'Statistic: {race_stat}\nP-value: {race_pvalue}')
# guaranteed 2 rows, equal num of columns per row
print(f'Degrees of Freedom: {len(race_observed[0]) - 1}') 

Statistic: [0.03703704 5.91803279 0.07272727 5.62765957 0.11111111 2.88235294]
P-value: [0.84738966 0.01498668 0.78740649 0.01767922 0.73888268 0.08955507]
Degrees of Freedom: 5


<h5>B: Gender</h5>

In [111]:
# Chi-Square observed counts
obs_men_cam = len(equalized[(equalized['gender'] == gender_str_to_num['M']) & (equalized['body_camera'])]) 
obs_men_no_cam = len(equalized[(equalized['gender'] == gender_str_to_num['M']) & (equalized['body_camera'] == False)]) 
obs_women_cam = len(equalized[(equalized['gender'] == gender_str_to_num['F']) & (equalized['body_camera'])])
obs_women_no_cam = len(equalized[(equalized['gender'] == gender_str_to_num['F']) & (equalized['body_camera'] == False)])

# Arrange observed counts (rows: camera / no camera, columns: men / women)
gender_observed = [[obs_men_cam, obs_women_cam], 
                   [obs_men_no_cam, obs_women_no_cam]]

gender_stat, gender_pvalue = stats.chisquare(gender_observed)

In [112]:
gender_descriptives = equalized[['gender', 'body_camera']].groupby("gender")
gender_descriptives = gender_descriptives.agg(["count", "mean", "std", "sem"])
gender_descriptives.rename(gender_num_to_str, inplace=True)
gender_descriptives

Unnamed: 0_level_0,body_camera,body_camera,body_camera,body_camera
Unnamed: 0_level_1,count,mean,std,sem
gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
M,1145,0.500437,0.500218,0.014783
F,53,0.490566,0.504695,0.069325


In [113]:
print(f'Statistic: {gender_stat}\nP-value: {gender_pvalue}')
print(f'Degrees of Freedom: {(len(gender_observed) - 1) * (len(gender_observed[0]) - 1)}')

Statistic: [0.00087336 0.01886792]
P-value: [0.97642378 0.8907458 ]
Degrees of Freedom: 1


<h5>C: Age Group</h5>

In [114]:
# Get a list of names of the age bin features
age_cols = [col for col in X_train_selected.columns if col.lower().startswith('age')]
age_counts = {bin: [0,0] for bin in age_cols}

age_records = []  # One (age bin, body_camera) record per incident

def count_ages(X, y):
    """Counts occurrences of each age group and whether there was a body camera, for building
        descriptive statistics.
    """
    for bin in age_counts:
        # Count cameras
        for age, label in zip(X[bin], y):
            if age == 1:
                age_records.append({'age': bin, 'body_camera': int(label)})
                if label:
                    age_counts[bin][1] += 1
                else:
                    age_counts[bin][0] += 1


# Count instances of bodycameras per age group in the train and test sets
count_ages(X_train_selected, y_train)
count_ages(X_test_selected, y_test)

# Build the descriptives table once (DataFrame.append was removed in pandas 2.0)
age_descriptives = pd.DataFrame(age_records, columns=['age', 'body_camera'])

age_observed = [[], # Bodycam NO counts (index 0 of each age_counts entry)
                []] # Bodycam YES counts (index 1 of each age_counts entry)
for bin in age_counts:
    age_observed[0] += [age_counts[bin][0]]
    age_observed[1] += [age_counts[bin][1]]

age_stat, age_pvalue = stats.chisquare(age_observed)

In [115]:
age_descriptives.groupby('age').count()

Unnamed: 0_level_0,body_camera
age,Unnamed: 1_level_1
"age_16.25, 26.5",289
"age_26.5, 36.75",413
"age_36.75, 47.0",244
"age_47.0, 57.25",156
"age_57.25, 67.5",61
"age_6.0, 16.25",17
"age_67.5, 77.75",12
"age_77.75, 88.0",6


In [116]:
print(f'Statistic: {age_stat}\n\nP-value: {age_pvalue}')
print(f'Degrees of Freedom: {(len(age_observed) - 1) * (len(age_observed[0])-1)}')

Statistic: [2.88235294e+00 1.00000000e+00 2.42130751e-03 2.36065574e+00
 2.56410256e-02 1.63934426e-02 3.33333333e-01 0.00000000e+00]

P-value: [0.08955507 0.31731051 0.96075451 0.12442988 0.87278012 0.89811979
 0.56370286 1.        ]
Degrees of Freedom: 7


<h4>Model Question: <i>Neighbors vs. Linear</i></h4>

In [117]:
from sklearn import linear_model, svm, neighbors, model_selection

estimators = {
    "k-Nearest Neighbors": neighbors.KNeighborsClassifier(),
    "Ridge Classifier": linear_model.RidgeClassifier(random_state=8675309),
    "Linear Support Vector Machine": svm.LinearSVC(max_iter=1_000_000, random_state=8675309)
}

# Split X_train_selected into a set for cross-validation and a validation set for part 3.4
X_cv, X_validation, y_cv, y_validation = model_selection.train_test_split(X_train_selected, y_train, 
                                                                          random_state=8675309)

# Evaluate each estimator using cross validation
for name in estimators:
    cv_ = model_selection.KFold(shuffle=True, random_state=8675309)
    cv_scores_dict = model_selection.cross_validate(estimators[name], X_cv, y=y_cv, cv=cv_, return_estimator=True, )
    cv_scores_arr = cv_scores_dict['test_score']
    estimators.update({name: cv_scores_dict['estimator'][0]})
    
    # Results
    print(f'{name}:')
    print(f'\tMean Cross-Validation Score: {cv_scores_arr.mean() * 100: .2f}%')
    print(f'\tStandard Deviation of Cross-Validation Scores: {cv_scores_arr.std() * 100: .2f}%')

k-Nearest Neighbors:
	Mean Cross-Validation Score:  58.99%
	Standard Deviation of Cross-Validation Scores:  4.02%
Ridge Classifier:
	Mean Cross-Validation Score:  56.91%
	Standard Deviation of Cross-Validation Scores:  2.77%
Linear Support Vector Machine:
	Mean Cross-Validation Score:  56.32%
	Standard Deviation of Cross-Validation Scores:  3.26%


### 3.4. Model Evaluation

<i><b>Note:</b> As mentioned previously, this is a live dataset which is updated at least every few days. Any specific values mentioned in our analysis may not accurately reflect what you see when you run the code. We used a consistent random seed to attempt to mitigate this effect, but as more data is added values will likely change. All values were accurate as of <b>12/10/2020</b>.</i>

<h5>Hypothesis Test Interpretations</h5>

<p style='text-indent: 2em'>For our hypothesis tests, we ran three Chi-Square tests for Homogeneity to see if there was any significant difference in the likelihood of an incident having a body camera turned on. We conducted these tests across populations grouped by race, gender, and age group. Our $H_0$ was that there would be no significant difference between categories. The tests revealed no significant difference in police body camera usage across genders, $\chi^2(1) = [0.000873,~0.019]$, $p > 0.05$. The same was true across age groups, $\chi^2(7) = [2.88,~1.000,~0.0024,~2.36,~0.026,~0.016,~0.333,~0.000]$, $p > 0.05$.</p> 

<p style='text-indent: 2em'>Out of all race groups, two were found to be significant with $p < 0.05$: the populations of victims reported as "White" and "Black", $\chi^2(5) = [0.037,~5.92,~0.073,~5.63,~0.111,~2.88]$. Note that the chi-square statistics for "White" and "Black" are the second and fourth statistics in the given list, respectively.
</p>
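
The per-race statistics reported above can be cross-checked against the descriptives table. For a column holding two counts $a$ and $b$, a goodness-of-fit test against an even split gives $(a-b)^2/(a+b)$, which equals the sample variance of the two counts divided by their mean. The sketch below assumes that this is how each list entry was produced. It recomputes the "White" and "Black" statistics from the `mean` and `std` rows and converts them to p-values with the one-degree-of-freedom identity $p = \operatorname{erfc}\left(\sqrt{\chi^2/2}\right)$. The helper names are hypothetical.

```python
import math

# Mean and standard deviation of the two counts (camera / no camera) per race,
# copied from the race_descriptives output
descriptives = {
    'W': {'mean': 274.5, 'std': 40.305087},
    'B': {'mean': 188.0, 'std': 32.526912},
}


def two_cell_chi_square(mean, std):
    """Chi-square statistic for two counts tested against an even split."""
    return std ** 2 / mean


def p_value_one_dof(statistic):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


results = {}
for race, row in descriptives.items():
    statistic = two_cell_chi_square(row['mean'], row['std'])
    results[race] = (statistic, p_value_one_dof(statistic))
    print(f'{race}: statistic = {statistic:.3f}, p = {results[race][1]:.4f}')
```

Both values agree with the reported statistics and p-values to the printed precision, which suggests that each list entry is a separate one-degree-of-freedom test within that group rather than one test across all groups.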

<i><b>Note:</b> Test statistics rounded to 3 significant digits for ease of readability</i>.

<h5>Model Performance Metrics</h5>
<ul>
<li>
    We will evaluate the models based on Accuracy and F-1 score. We are able to avoid the accuracy paradox by using a balanced subset of our original dataset.
</li>
</ul>
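
To make the two metrics concrete, the following sketch computes accuracy and F-1 directly from the four confusion-matrix counts. The labels are hypothetical and the helper `accuracy_and_f1` is illustrative only; the evaluation in this report relies on scikit-learn's `metrics` module.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F-1) for binary labels, treating 1/True as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1


# Hypothetical labels: 1 = body camera on, 0 = no body camera
example_true = [1, 1, 1, 0, 0, 0, 1, 0]
example_pred = [1, 0, 1, 0, 1, 0, 1, 0]

example_accuracy, example_f1 = accuracy_and_f1(example_true, example_pred)
print(f'Accuracy: {example_accuracy:.2f}, F-1: {example_f1:.2f}')
```

Because F-1 ignores true negatives, it can differ from accuracy even on balanced data, which is why both are reported.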

In [118]:
from sklearn import metrics

# Iteratively evaluate each estimator
for name in estimators:
    # Predict on the validation set
    pred = estimators[name].predict(X_validation)
    
    # Results
    print(f'{name}:')
    # Accuracy
    print(f'\tAccuracy: {metrics.accuracy_score(y_validation, pred) * 100: 0.2f}%')
    # F-1 Score
    print(f'\tF-1 Score: {metrics.f1_score(y_validation, pred): .3f}')

k-Nearest Neighbors:
	Accuracy:  57.78%
	F-1 Score:  0.566
Ridge Classifier:
	Accuracy:  59.11%
	F-1 Score:  0.516
Linear Support Vector Machine:
	Accuracy:  60.44%
	F-1 Score:  0.534


<h5>Model Evaluation</h5> 
<p style='text-indent: 2em'>The KNN algorithm $(\text{F-1}\approx 0.566, \text{Accuracy}=57.78\%)$ had the highest mean cross-validation accuracy ($58.99\%$), along with the greatest variability in scores. On the validation set, however, it had the lowest accuracy of the three models at $57.78\%$. KNN still has the highest F-1 score, indicating the strongest balance of precision and recall. Overall, there do not appear to be signs of overfitting or underfitting because the cross-validation and validation accuracies are roughly equal.</p>

<p style='text-indent: 2em'>Both the RidgeClassifier $(\text{F-1}\approx0.516, \text{Accuracy}=59.11\% )$ and LinearSVC $(\text{F-1}\approx 0.534, \text{Accuracy}=60.44\%)$ algorithms performed slightly better on the validation set than in cross-validation, scoring roughly 1 standard deviation above their mean cross-validation scores. These estimators also do not appear to be overfitting or underfitting. Both have F-1 scores slightly lower than KNN's, with Ridge's slightly lower than LinearSVC's.
</p>
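
The "roughly 1 standard deviation" comparison can be checked with simple arithmetic on the reported scores. The sketch below expresses each validation accuracy as a distance from the mean cross-validation score, in units of that model's cross-validation standard deviation. The dictionary layout and variable names are illustrative.

```python
# Mean and standard deviation of the cross-validation scores, and validation accuracy,
# as reported in Sections 3.3 and 3.4 (all in percent)
reported = {
    'k-Nearest Neighbors': {'cv_mean': 58.99, 'cv_std': 4.02, 'validation': 57.78},
    'Ridge Classifier': {'cv_mean': 56.91, 'cv_std': 2.77, 'validation': 59.11},
    'Linear Support Vector Machine': {'cv_mean': 56.32, 'cv_std': 3.26, 'validation': 60.44},
}

gaps = {}
for name, scores in reported.items():
    # Signed gap between validation accuracy and the cross-validation mean, in CV standard deviations
    gaps[name] = (scores['validation'] - scores['cv_mean']) / scores['cv_std']
    print(f'{name}: {gaps[name]:+.2f} standard deviations from the cross-validation mean')
```

All three gaps are within about 1.3 standard deviations of the cross-validation mean, which is consistent with the reading that none of the models shows a clear sign of overfitting.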

<p style='text-indent: 2em'>For the remainder of this project, we will continue to use all 3 algorithms. Originally, we intended to drop the lowest performing linear model for our Neighbors vs. Linear comparison. After running our model evaluations on different random states of the dataset, RidgeClassifier and LinearSVC appeared to be so equal in performance that we do not feel comfortable dropping either model.</p>

### 3.5. Model Optimization

<p style='text-indent: 2em'>We adjust hyperparameters in an attempt to raise overall estimator accuracy beyond the roughly 56-60% range observed so far. Our models did not appear to overfit, so we simply try a variety of parameters that vary the complexity of our estimators and test which yield the greatest performance.
</p>
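
A grid search fits one model for every combination of candidate values and every cross-validation fold, so the cost of a search can be sized in advance. The sketch below enumerates the k-Nearest Neighbors grid used in this section with the standard library only. The helper `expand_grid` is hypothetical, and the fold count assumes `KFold`'s default of five splits.

```python
from itertools import product

# Candidate values mirroring the k-Nearest Neighbors grid searched in this section
knn_grid = {
    'n_neighbors': list(range(3, 16, 2)),  # same values as np.arange(3, 16, 2)
    'metric': ['minkowski', 'euclidean', 'manhattan'],
}


def expand_grid(grid):
    """Yield one dict per combination of candidate values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


n_folds = 5  # assumed: KFold's default number of splits
knn_candidates = list(expand_grid(knn_grid))
total_fits = len(knn_candidates) * n_folds
print(f'{len(knn_candidates)} candidates x {n_folds} folds = {total_fits} model fits')
```

In scikit-learn's `KNeighborsClassifier`, `'minkowski'` with the default `p=2` is the Euclidean distance, so two of the three metric options should behave identically. The grid effectively compares Euclidean against Manhattan distance.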

In [119]:
# Reset estimators dictionary
opt_estimators = {
    "k-Nearest Neighbors": neighbors.KNeighborsClassifier(),
    "Ridge Classifier": linear_model.RidgeClassifier(random_state=8675309),
    "Linear Support Vector Machine": svm.LinearSVC(random_state=8675309)
}

# Parameters to tune
hyper_parameters = {
    "k-Nearest Neighbors": {
        'n_neighbors': np.arange(3, 16, 2),  # Odd numbers 3-15
        'metric': ['minkowski', 'euclidean', 'manhattan']
    },
    "Ridge Classifier": {
        'alpha': np.arange(1, 100, 3),
        'solver': ['svd', 'cholesky', 'lsqr', 'sag', 'auto']
    },
    "Linear Support Vector Machine": {
        'dual': [True, False],
        'C': np.arange(0.1, 100, 5)
    }
}

In [120]:
import warnings
from sklearn import exceptions

tuned_estimators = {} # Mapping from estimator name to GridSearch tuned estimators

def tune():
    # Suppress ConvergenceWarning while fitting. sklearn.utils.testing is no longer importable
    # in recent scikit-learn releases, so the standard-library warnings module is used instead.
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', category=exceptions.ConvergenceWarning)

        # Perform grid search on each estimator
        for name in opt_estimators:
            cv_ = model_selection.KFold(shuffle=True, random_state=8675309)
            grid = model_selection.GridSearchCV(opt_estimators[name], hyper_parameters[name], cv=cv_)

            # Fit to X_train_selected. Since we're doing GridSearch, the validation set becomes one of the folds
            grid.fit(X_train_selected, y_train)

            # Results
            print(name, ':')
            print("\tBest parameters: ", grid.best_params_)
            print("\tBest cross-validation score: ", grid.best_score_)

            # Save tuned model
            tuned_estimators.update({name: grid})

In [121]:
tune()

k-Nearest Neighbors :
	Best parameters:  {'metric': 'manhattan', 'n_neighbors': 5}
	Best cross-validation score:  0.5968466790813161
Ridge Classifier :
	Best parameters:  {'alpha': 91, 'solver': 'lsqr'}
	Best cross-validation score:  0.5679143389199256
Linear Support Vector Machine :
	Best parameters:  {'C': 10.1, 'dual': False}
	Best cross-validation score:  0.5623215394165115


### 3.6. Model Testing

In [122]:
# Evaluate each tuned estimator on the held-out test set
for name in tuned_estimators:
    print(name, ':')
    print(f"\tTest set score: {tuned_estimators[name].score(X_test_selected, y_test) * 100: .2f}%")

k-Nearest Neighbors :
	Test set score:  62.33%
Ridge Classifier :
	Test set score:  52.00%
Linear Support Vector Machine :
	Test set score:  51.67%


<a id="4"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 4. DISCUSSION

<p style='text-indent: 2em'>In this project we compared the scikit-learn implementation of the k-Nearest Neighbors classification algorithm to two linear algorithms on our dataset. These linear algorithms were the RidgeClassifier and the Linear Support Vector Machine classifier, or LinearSVC in scikit-learn. In cross-validation, both before and after hyperparameter tuning, and on the test set, the KNN classifier showed the best accuracy, although it trailed both linear models slightly on the validation set. This answers our initial question, suggesting that a neighbors-based classifier performs better than a linear algorithm for this dataset. Based on these findings, KNN is the algorithm which should be used for this predictive model. However, despite being the most accurate, KNN ($62.33\%$ on the test set) was still only slightly more accurate than a blind guess ($50\%$). This implies that the features in our dataset cannot accurately predict whether an officer was using a body camera in a given police shooting incident, answering our primary research question. 
</p>

<p style='text-indent: 2em'>We were unable to reject the null hypothesis in our hypothesis tests, with two exceptions: the tests showed a significant difference in police body camera usage only for White and Black victims, the two largest race groups in the set. In general, then, victim demographics do not seem to correlate with whether an officer was using a body camera. This reinforces the answer to our research question that these features are not accurate predictors of police body camera usage.
</p>

<p style='text-indent: 2em'>Our findings in this project seem to be positive from a societal perspective. They show no obvious trends in victim demographics contributing to unequal body camera usage. However, body cameras are not yet as widespread as many agree they should be. This is evidenced by our choropleth in <b>Section 3.2</b>, which shows that even the state with the highest usage of body cameras, Vermont, only saw $50\%$ use in this dataset as of <b>December 10, 2020</b>. In the future, there is room for deeper investigation into which parts of the country are leading or trailing in terms of body camera usage. Additionally, there may be many factors which influence the adoption rate of body cameras in an area (e.g. wealth, size of police department, crime rate, etc.) which may warrant further investigation. Finally, it is important for body camera usage to be reported; at the time of writing, this Washington Post dataset appeared to be the only readily available dataset which contained comprehensive information about body camera usage spanning the entire United States. Even so, many states were underrepresented in this dataset. In all, though this report sheds important light on the usage of body cameras by United States police officers, there is still a lot of work waiting to be done.
</p>

<a id="5"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

### CONTRIBUTIONS

<p style='text-indent: 2em'>This entire project was a collaborative effort between <b>Michael Ruberto</b> and <b>Jesse Steinberg</b>. Each member contributed equally to every section during synchronous, remote work sessions over Zoom.
</p>