# <p style="color:darkblue">Fatal Police Shooting</p>

<hr>


In this Jupyter notebook, we analyze fatal police shootings in the US using supervised machine learning algorithms. The input features are extracted from a dataset obtained from Kaggle, and the output indicates whether the individual exhibited signs of mental illness (positive or negative). We follow a step-by-step process, from dataset preprocessing through model validation.

<a href="https://colab.research.google.com/github/lauradefaria/Machine_Learning_and_Data_Analysis/blob/main/Fatal-Police-Shooting/Fatal_Police_Shootings_English.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Reading Data


First, we import all the libraries used throughout the project.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.impute import KNNImputer

Let's read the dataset that will be used for this work. The dataset can be downloaded from Kaggle or accessed from the repository of this class.

In [None]:
df_path = "https://raw.githubusercontent.com/lauradefaria/Machine_Learning_and_Data_Analysis/main/Fatal-Police-Shooting/PS-Dataset/PoliceKillingsUS.csv"

In [None]:
df = pd.read_csv(df_path, encoding='cp1252', index_col=0)
print("Dataset size : ", df.shape)

Dataset size :  (2535, 13)


# 2. Data Analysis

Initially, a brief analysis will be conducted on the attributes of the dataset to assess their relevance to the problem at hand:
* ***name:*** Name of the deceased
* ***date:*** Date of the occurrence
* ***manner_of_death:*** Manner in which the victim was killed
* ***armed:*** Whether the victim was holding an object (or unarmed)
* ***age:*** Age of the victim
* ***gender:*** Gender of the victim (Male/Female)
* ***race:*** Race of the victim (Asian/White/Native American/Black/Hispanic)
* ***city:*** City where the occurrence took place
* ***state:*** State in the US where the occurrence took place
* ***signs_of_mental_illness:*** Indication of whether the victim exhibited signs of mental illness
* ***threat_level:*** Whether the victim posed a threat to the officer or not
* ***flee:*** Whether the victim was fleeing and, if so, how (e.g. Not fleeing/Car)
* ***body_camera:*** Whether the officer was using a body camera or not

In [None]:
df

id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
3,Tim Elliot,02/01/15,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False
4,Lewis Lee Lembke,02/01/15,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False
5,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False
8,Matthew Hoffman,04/01/15,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False
9,Michael Rodriguez,04/01/15,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2822,Rodney E. Jacobs,28/07/17,shot,gun,31.0,M,,Kansas City,MO,False,attack,Not fleeing,False
2813,TK TK,28/07/17,shot,vehicle,,M,,Albuquerque,NM,False,attack,Car,False
2818,Dennis W. Robinson,29/07/17,shot,gun,48.0,M,,Melba,ID,False,attack,Car,False
2817,Isaiah Tucker,31/07/17,shot,vehicle,28.0,M,B,Oshkosh,WI,False,attack,Car,True


# 3. Data Preprocessing

Data preprocessing plays a crucial role in the analysis process, as raw data is often dirty, inconsistent, and not ready to be used directly in machine learning algorithms or other analytical techniques. A generalized step-by-step process would be:

* Data cleaning: Identification and handling of any issues in the data, such as missing data, duplicated values, noise, or outliers.

* Class balancing (optional): It may be necessary to apply class balancing techniques, such as oversampling (increasing the minority class) or undersampling (reducing the majority class).

* Data transformation: Next, data transformation techniques are applied, which may include normalization or standardization of values, encoding categorical variables into numerical formats, or creating new features from existing ones.

* Feature selection: If the dataset contains many attributes, feature selection is performed to reduce dimensionality and improve the efficiency and accuracy of models. This can be done using techniques such as correlation analysis, principal component analysis (PCA), or feature selection algorithms.

* Data splitting: The data is divided into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.
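
As a minimal sketch of the transformation and splitting steps listed above, the snippet below works on a small made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not rows taken from the dataset). It one-hot encodes a categorical column with `pd.get_dummies`, splits the rows with scikit-learn's `train_test_split`, and standardizes the numeric column with a `StandardScaler` fitted on the training part only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "age": [53.0, 47.0, 23.0, 32.0, 39.0, 18.0, 61.0, 28.0],
    "gender": ["M", "M", "M", "F", "M", "F", "M", "M"],
    "signs_of_mental_illness": [True, False, False, True, False, True, False, True],
})

# Data transformation: encode the categorical column as 0/1 indicator columns
X = pd.get_dummies(toy[["age", "gender"]], columns=["gender"], dtype=int)
y = toy["signs_of_mental_illness"]

# Data splitting: hold out 25% of the rows, keeping the class ratio (stratify)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardization: fit the scaler on the training set only to avoid leakage
scaler = StandardScaler()
X_train = X_train.assign(age=scaler.fit_transform(X_train[["age"]]).ravel())
X_test = X_test.assign(age=scaler.transform(X_test[["age"]]).ravel())

print(X_train.shape, X_test.shape)
```

Fitting the scaler after the split is a design choice: the test rows then play no part in computing the mean and standard deviation used for scaling.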


A quick analysis of the raw data from the provided CSV file was conducted, and it was decided that some attribute removals would be beneficial during the loading process:

* Removing ***'name'***: It was deemed unnecessary since using it as an input parameter or target would be detrimental to the model, given that each object has a unique label (each individual's name is unique).
* Removing ***'date'***: As with 'name', the raw date acts more as a record identifier than as a predictive feature, so it is excluded.
* Removing ***'city'***: It becomes unnecessary since there is already an attribute related to the state.

In [None]:
# Drop the three attributes listed above ('city' included)
df = df.drop(columns=["name", "date", "city"])

## 3.1 Removing 'NaN' values


There are several possibilities for dealing with missing data, and below I will mention some of them along with their benefits:

* **Instance deletion**: Useful when the amount of missing data is small compared to the dataset size. Deletion can preserve the integrity of the remaining data, but it may lead to a reduced dataset and loss of information.

* **Attribute deletion:** Variables with a large number of NaN values are removed. This approach is good when the attribute is not essential for the analysis, but it may result in the loss of important information.

* **Filling with fixed values**: Using mean, median, or mode. This approach can be detrimental when there is a high number of NaN values, as it can introduce bias into the data.

* **Statistical imputation**: Missing values are estimated using statistical techniques (e.g., KNN or Linear Regression). This provides more accurate estimates and is more suitable when the remaining attributes are informative about the missing one.

* **Implementing Machine Learning algorithms**: Utilizing predictive models such as Decision Trees. Appropriate when missing data is related to complex patterns.
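
The first three strategies above can be sketched in a few lines of pandas. The example uses a made-up frame (the `toy` data is an assumption for illustration only) and is not meant as the notebook's actual cleaning step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; values are made up for illustration
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "race": ["W", "B", None, "H"],
})

# Instance deletion: drop every row that has a missing 'race'
deleted = toy.dropna(subset=["race"])

# Attribute deletion: drop the column altogether
no_race = toy.drop(columns=["race"])

# Filling with a fixed value: replace missing ages with the median age
filled = toy.assign(age=toy["age"].fillna(toy["age"].median()))

print(len(deleted), list(no_race.columns), filled["age"].tolist())
```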

In [None]:
nan_counts = df.isna().sum()
print(nan_counts)

manner_of_death              0
armed                        9
age                         77
gender                       0
race                       195
state                        0
signs_of_mental_illness      0
threat_level                 0
flee                        65
body_camera                  0
dtype: int64


The NaN counts per attribute show 195 missing values for ***'race'***, 65 for ***'flee'***, and 9 for ***'armed'*** (plus 77 for ***'age'***, which is handled separately). Since these are few relative to the 2,535 instances in the dataset, the affected instances are removed to preserve the integrity of the remaining data.

In [None]:
# Treat empty strings as missing, then drop rows lacking any of these attributes
cols = ["race", "flee", "armed"]
df[cols] = df[cols].replace("", np.nan)
df = df.dropna(subset=cols)

For the attribute ***'age'***, statistical imputation with the KNN method is used. KNN was preferred over linear regression, which works best when the variables are linearly correlated, and over simple fixed-value filling (mean, median, or mode), because it takes the structure and relationships in the data into account and can therefore provide more accurate estimates.

In [None]:
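# With 'age' as the only input column, KNNImputer has no other features from
# which to compute neighbour distances, so missing ages fall back to the column mean.
imputer = KNNImputer(n_neighbors=5, weights="distance")

df["age"] = imputer.fit_transform(df[["age"]]).ravel()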

nan_counts = df.isna().sum()
print(nan_counts)
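
One detail worth checking is which columns the imputer receives: `KNNImputer` looks for neighbours using the other columns of the array passed to it. When the age column is passed alone, no distance can be computed for a row whose age is missing, and scikit-learn falls back to the mean of the column. The sketch below, on made-up numbers, contrasts this with a call that also passes a second numeric column (the hypothetical `x` stands in for any encoded feature; it is not a column of the dataset).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: column 0 is a hypothetical numeric feature x, column 1 is age
data = np.array([
    [0.0, 20.0],
    [1.0, 30.0],
    [10.0, 60.0],
    [0.9, np.nan],   # age to be imputed
])

# Age alone: no neighbour can be found, so the column mean is used
age_only = KNNImputer(n_neighbors=1).fit_transform(data[:, [1]])

# Age plus x: the nearest row by x (x = 1.0) donates its age
with_x = KNNImputer(n_neighbors=1).fit_transform(data)

print(age_only[3, 0], with_x[3, 1])
```

In this toy case the first call fills the gap with the mean of the three known ages, while the second returns the age of the closest row. In practice the extra columns would need to be numeric (for example, one-hot encoded) and on comparable scales before being passed to the imputer.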

## 3.2 Inconsistencies and Duplicates Removal

This step removes inconsistencies from the data, which increases the reliability and quality of the results and improves the performance of the models:

* It is inconsistent to state that the individual's age is less than 0.0
* Duplicate rows in the DataFrame are also discarded

In [None]:
correct = df["age"] >= 0.00
df = df[correct]

# drop_duplicates returns a new DataFrame, so the result must be assigned back
df = df.drop_duplicates()
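
A small sketch of both checks on made-up rows (the `toy` frame is an assumption for illustration). It also shows that `drop_duplicates` returns a new DataFrame and leaves the original untouched unless `inplace=True` is passed, so the result has to be kept.

```python
import pandas as pd

# Made-up rows: the last one duplicates the first, and one age is negative
toy = pd.DataFrame({
    "age": [53.0, -1.0, 23.0, 53.0],
    "state": ["WA", "OR", "KS", "WA"],
})

# Keep only rows with a plausible (non-negative) age
valid = toy[toy["age"] >= 0]

# drop_duplicates returns a new DataFrame; 'valid' itself is not modified
deduped = valid.drop_duplicates()

print(len(toy), len(valid), len(deduped))
```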

## 3.3 Class Balancing

## 3.4 Data Transformation

### 3.4.1 One-Hot-Encoding

### 3.4.2 Normalization

### 3.4.3 Boxplot

## 3.5 Sampling

## 3.6 Dimensionality Reduction

# 4. Supervised Learning

## 4.1 DecisionTree

## 4.2 KNN

## 4.3 RandomForest

## 4.4 Naive Bayes