# Analyzing Fatal Police Shootings in the US from 2015 to 2020


## Table of Contents

- [Introduction](#intro)
- [Data Wrangling](#data)

<a id='intro'></a>
## Introduction
The year 2020 will live long in memory. It was filled with major events all over the world, from the Australian bushfires and Prince Harry and Meghan Markle quitting the British royal family to the death toll from COVID-19. The United States of America (USA) was the worst hit by COVID-19 according to [numbers from Johns Hopkins University](https://coronavirus.jhu.edu/map.html).

Although [social distancing](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/advice-for-public) was one of the many recommended ways to slow and stop the spread of COVID-19, people all over the world defied the advice to protest over one issue or another. One of those issues was the [killing of George Floyd, an African American, by police officers](https://www.nytimes.com/2020/05/31/us/george-floyd-investigation.html) in the United States. The protests over the killing reverberated across the world, as only months earlier an African American woman named [Breonna Taylor was fatally shot by police in her apartment](https://www.nytimes.com/article/breonna-taylor-police.html).

Her killing, which might have been preventable, is what inspired this analysis. However, it's important to point out that only fatal shootings by police that occurred in the line of duty are captured in the dataset.

In [1]:
# Importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import urllib.request
%matplotlib inline

<a id='data'></a>
## Data Wrangling
The dataset will be downloaded from [the Washington Post's page on GitHub](https://github.com/washingtonpost/data-police-shootings). It logs fatal police **shootings** that have happened in the United States from 2015 to the present. Since the database deals only with shootings, cases like that of George Floyd, who wasn't shot, will not be in the database.

In [2]:
# Downloading the dataset
url = 'https://github.com/washingtonpost/data-police-shootings/releases/download/v0.1/fatal-police-shootings-data.csv'
file_name = 'fatal_police_shootings_data.csv'
urllib.request.urlretrieve(url, file_name)

('fatal_police_shootings_data.csv', <http.client.HTTPMessage at 0x230195495e0>)
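Re-running the cell above fetches the file again every time. A minimal sketch of one way to avoid that, assuming it is acceptable to reuse an existing local copy. The helper name `download_if_missing` is hypothetical and not part of the original notebook:

```python
import os
import urllib.request


def download_if_missing(url, file_name):
    """Download `url` to `file_name` unless the file already exists.

    Returns a (file_name, downloaded) tuple, where `downloaded` tells
    whether a network request was actually made.
    """
    if os.path.exists(file_name):
        # Reuse the local copy and skip the network request
        return file_name, False
    urllib.request.urlretrieve(url, file_name)
    return file_name, True
```

With the `url` and `file_name` variables defined in the cell above, the call would be `download_if_missing(url, file_name)`. Note that a cached copy will not pick up rows added to the database after it was downloaded.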

The file has been downloaded. Let's begin our analysis by viewing the top 5 rows.

In [3]:
# Read dataset into a pandas dataframe
df = pd.read_csv('fatal_police_shootings_data.csv')
df.head()

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,longitude,latitude,is_geocoding_exact
0,3,Tim Elliot,2015-01-02,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False,-123.122,47.247,True
1,4,Lewis Lee Lembke,2015-01-02,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False,-122.892,45.487,True
2,5,John Paul Quintero,2015-01-03,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False,-97.281,37.695,True
3,8,Matthew Hoffman,2015-01-04,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False,-122.422,37.763,True
4,9,Michael Rodriguez,2015-01-04,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False,-104.692,40.384,True


So far so good. From the cell above, we notice that the `id` values are not sequential, but let's not dwell on that; we'll address it during the cleaning process.

Now, let's get some info about the dataset.

### Assessing Data

#### Programmatic Assessment

In [4]:
# View some basic information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6032 entries, 0 to 6031
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       6032 non-null   int64  
 1   name                     5802 non-null   object 
 2   date                     6032 non-null   object 
 3   manner_of_death          6032 non-null   object 
 4   armed                    5822 non-null   object 
 5   age                      5756 non-null   float64
 6   gender                   6031 non-null   object 
 7   race                     5410 non-null   object 
 8   city                     6032 non-null   object 
 9   state                    6032 non-null   object 
 10  signs_of_mental_illness  6032 non-null   bool   
 11  threat_level             6032 non-null   object 
 12  flee                     5688 non-null   object 
 13  body_camera              6032 non-null   bool   
 14  longitude               

From the output of cell 4, we now know the following:

- There are 17 columns in the dataset.
- There are 6032 rows.
- Some data are missing in some columns.
- There are 4 different datatypes, boolean (3 columns), integer (1 column), float (3 columns) and string object (10 columns).
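The points above were read off the `df.info()` printout by eye. As a rough sketch, the same summary can be computed directly. The small `sample` frame below is illustrative only (a few values borrowed from the preview, with two gaps added on purpose) and stands in for the real `df`:

```python
import pandas as pd

# Tiny illustrative frame, NOT the real dataset; the None values are
# deliberately introduced to show how missing data is counted.
sample = pd.DataFrame({
    "id": [3, 4, 5],
    "name": ["Tim Elliot", None, "John Paul Quintero"],
    "age": [53.0, 47.0, None],
    "body_camera": [False, False, False],
})

# How many columns there are of each dtype
dtype_counts = sample.dtypes.astype(str).value_counts()

# How many values are missing in each column
missing_per_column = sample.isna().sum()

print(dtype_counts)
print(missing_per_column)
```

Running the same two expressions on `df` would give the per-dtype column counts and the number of missing values per column without counting rows of the printout manually.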

Now onto the next step, let's check the data in each column.

### Data Overview
The details about the different columns were taken from the Washington Post's GitHub page.
    
- `id`: A unique identifier for each victim.


- `name`: The name of the victim


- `date`: The date of the fatal shooting in YYYY-MM-DD format


- `manner_of_death`: 
    - `shot`
    - `shot and Tasered`
    
    
- `armed`: Indicates that the victim was armed with some sort of implement that a police officer believed could inflict harm
    - `undetermined`: It is not known whether or not the victim had a weapon
    - `unknown`: The victim was armed, but it is not known what the object was
    - `unarmed`: The victim was not armed
    
    
- `age`: The age of the victim


- `gender`: The gender of the victim. The Post identifies victims by the gender they identify with if reports indicate that it differs from their biological sex.
    - `M`: Male
    - `F`: Female
    - `None`: Unknown
    
    
- `race`:
    - `W`: White, non-Hispanic
    - `B`: Black, non-Hispanic
    - `A`: Asian
    - `N`: Native American
    - `H`: Hispanic
    - `O`: Other
    - `None`: unknown
    
    
- `city`: the municipality where the fatal shooting took place. Note that in some cases this field may contain a county name if a more specific municipality is unavailable or unknown.


- `state`: two-letter postal code abbreviation


- `signs_of_mental_illness`: News reports have indicated the victim had a history of mental health issues, expressed suicidal intentions or was experiencing mental distress at the time of the shooting.


- `threat_level`: The threat_level column was used to flag incidents for the story by Amy Brittain in October 2015. http://www.washingtonpost.com/sf/investigative/2015/10/24/on-duty-under-fire/ As described in the story, the general criteria for the attack label was that there was the most direct and immediate threat to life. That would include incidents where officers or others were shot at, threatened with a gun, attacked with other weapons or physical force, etc. The attack category is meant to flag the highest level of threat. The other and undetermined categories represent all remaining cases. Other includes many incidents where officers or others faced significant threats.


- `flee`: News reports have indicated the victim was moving away from officers
    - `Foot`
    - `Car`
    - `Not fleeing`
    
    The threat column and the fleeing column are not necessarily related. For example, there is an incident in which the suspect is fleeing and at the same time turns to fire a gun at the officer. Also, attacks represent a status immediately before fatal shots by police, while fleeing could begin slightly earlier and involve a chase.
    
    
- `body_camera`: News reports have indicated an officer was wearing a body camera and it may have recorded some portion of the incident.


- `latitude` and `longitude`: the location of the shooting expressed as WGS84 coordinates, geocoded from addresses. The coordinates are rounded to 3 decimal places, meaning they have a precision of about 80-100 meters within the contiguous U.S.


- `is_geocoding_exact`: reflects the accuracy of the coordinates. true means that the coordinates are for the location of the shooting (within approximately 100 meters), while false means that coordinates are for the centroid of a larger region, such as the city or county where the shooting happened.



**Now, let's try to take each individual column and assess it**.

In [5]:
# Number of unique ids
df.id.nunique()

6032

From the result above, there are no repeated ids in the dataset: the number of unique ids equals the number of rows (6032).

In [6]:
# Let's check info about the name column
df.name.describe()

count                5802
unique               5785
top       Michael Johnson
freq                    3
Name: name, dtype: object

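The most common name, Michael Johnson, appears three times. Since there are no repeated ids in the dataset, these are separate records rather than one row entered several times, and they may well be different people who share a name.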
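To inspect records that share a name, one option is to filter on `duplicated(keep=False)`, which flags every occurrence rather than only the later ones. A small sketch follows; the ids and city labels in `sample` are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Illustrative rows only: the ids and city labels are hypothetical
sample = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "name": ["Michael Johnson", "Tim Elliot", "Michael Johnson", None],
    "city": ["City A", "City B", "City C", "City D"],
})

# Drop rows without a name first; otherwise several missing names
# would be flagged as duplicates of one another.
named = sample.dropna(subset=["name"])

# keep=False marks every row whose name occurs more than once
repeated = named[named["name"].duplicated(keep=False)]

print(repeated)
```

Comparing the other columns of the flagged rows (date, city, age) is a simple way to judge whether repeated names refer to the same incident or to different people.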

In [7]:
df.date.describe()

count           6032
unique          2068
top       2018-04-01
freq               9
Name: date, dtype: object

From the above cell, fatal shootings happened on 2,068 distinct dates. The date reported as having the most fatal shootings is 1 April 2018 (`2018-04-01`), when nine (9) occurred.
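Because `date` is stored as a string, `describe()` treats it as a categorical value, and when several dates tie for the highest count it reports only one of them as `top`. A minimal sketch of parsing the dates and listing every tied date, using the five dates shown in the preview rows rather than the full dataset:

```python
import pandas as pd

# The five dates from the preview rows above
dates = pd.Series(
    ["2015-01-02", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-04"],
    name="date",
)

# Parse the YYYY-MM-DD strings into real datetimes
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# Number of distinct days and number of shootings on each day
n_days = parsed.nunique()
per_day = parsed.value_counts()

# Keep every day that reaches the maximum, not just the first one
max_count = per_day.max()
busiest_days = per_day[per_day == max_count].index

print(n_days, max_count, list(busiest_days))
```

In this small sample two days tie with two shootings each, which is exactly the situation a single `top` value would hide.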

Onto the next column, the manner of death.

In [8]:
df.manner_of_death.describe()

count     6032
unique       2
top       shot
freq      5730
Name: manner_of_death, dtype: object

In [9]:
df.manner_of_death.unique()

array(['shot', 'shot and Tasered'], dtype=object)

This isn't surprising since there are only two manners of death. However, 302 people (6032 minus the 5730 who were shot only) died after being both shot and Tasered.
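The size of the smaller category follows from the two numbers printed above. A short worked calculation, using only the counts reported by `describe()`:

```python
# Counts reported by df.manner_of_death.describe() above
total_rows = 6032
shot_only = 5730  # frequency of the top value, 'shot'

# With only two categories, the remainder must be 'shot and Tasered'
shot_and_tasered = total_rows - shot_only
share_pct = round(100 * shot_and_tasered / total_rows, 1)

print(shot_and_tasered, share_pct)  # prints: 302 5.0
```

On the real frame, `df.manner_of_death.value_counts()` would list both counts directly.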

In [10]:
df.armed.describe()

count     5822
unique      96
top        gun
freq      3444
Name: armed, dtype: object

There are 96 distinct values in the `armed` column. Not all of them are weapons, since labels such as `unarmed` are counted too. Unsurprisingly, the most common value is `gun`, which appears 3444 times.

We should also note that the count is 5822, meaning the `armed` value is missing for 210 of the 6032 rows.
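A rough sketch of how the missing values and the non-weapon labels could be separated from the actual weapon types. The series below reuses the `armed` values from the preview rows plus one artificial missing entry, and the set `not_a_weapon` is an assumption about which labels should be excluded:

```python
import pandas as pd

# 'armed' values from the preview rows, plus one artificial missing entry
armed = pd.Series(
    ["gun", "gun", "unarmed", "toy weapon", "nail gun", None],
    name="armed",
)

# Rows where nothing was recorded
n_missing = int(armed.isna().sum())

# Labels that do not name a weapon (assumed list, adjust as needed)
not_a_weapon = {"unarmed", "undetermined"}

# Distinct weapon types among the rows that do name a weapon
known = armed.dropna()
weapon_labels = known[~known.isin(not_a_weapon)]
n_weapon_types = weapon_labels.nunique()

print(n_missing, armed.nunique(), n_weapon_types)
```

Here `armed.nunique()` counts four distinct labels, but only three of them are weapon types, which is the same distinction that applies to the 96 values in the full column.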

In [11]:
# Details about the age of the victims
df.age.describe()

count    5756.000000
mean       37.167477
std        13.058384
min         6.000000
25%        27.000000
50%        35.000000
75%        46.000000
max        91.000000
Name: age, dtype: float64

In [12]:
df.age.nunique()

78

Wow, the youngest victim was 6 years old and the oldest was 91. The mean age of the victims was about 37.2 years and the median was 35 years. There are 78 distinct ages in the data.

**Onto the genders of the victims.**

In [13]:
df.gender.describe()

count     6031
unique       2
top          M
freq      5765
Name: gender, dtype: object

In [14]:
df.gender.unique()

array(['M', 'F', nan], dtype=object)

There are two distinct gender values, `M` and `F`, and one row has no gender recorded. Unsurprisingly, there are far more male victims (5765) than female victims (266).

In [15]:
df.race.describe()

count     5410
unique       6
top          W
freq      2746
Name: race, dtype: object

In [16]:
df.race.unique()

array(['A', 'W', 'H', 'B', 'O', nan, 'N'], dtype=object)