# Introduction

Since Jan. 1, 2015, [The Washington Post](https://www.washingtonpost.com/) has been compiling a database of every fatal shooting in the US by a police officer in the line of duty. 

<center><img src=https://i.imgur.com/sX3K62b.png></center>

While there are many challenges regarding data collection and reporting, The Washington Post has been tracking more than a dozen details about each killing. These include the race, age and gender of the deceased, whether the person was armed, and whether the victim was experiencing a mental-health crisis. The Washington Post has gathered this supplemental information from law enforcement websites, local news reports, social media, and by monitoring independent databases such as "Killed by police" and "Fatal Encounters". The Post has also conducted additional reporting in many cases.

There are 4 additional datasets: US census data on poverty rate, high school graduation rate, median household income, and racial demographics. [Source of census data](https://factfinder.census.gov/faces/nav/jsf/pages/community_facts.xhtml).

### Upgrade Plotly

Run the cell below if you are working with Google Colab.

In [1]:
%pip install --upgrade plotly

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting plotly
  Downloading plotly-5.11.0-py2.py3-none-any.whl (15.3 MB)
Installing collected packages: plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 5.5.0
    Uninstalling plotly-5.5.0:
      Successfully uninstalled plotly-5.5.0
Successfully installed plotly-5.11.0


## Import Statements

In [2]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

# This might be helpful:
from collections import Counter

## Notebook Presentation

In [3]:
pd.options.display.float_format = '{:,.2f}'.format

## Load the Data

In [4]:
df_hh_income = pd.read_csv('Median_Household_Income_2015.csv', encoding="windows-1252")
df_pct_poverty = pd.read_csv('Pct_People_Below_Poverty_Level.csv', encoding="windows-1252")
df_pct_completed_hs = pd.read_csv('Pct_Over_25_Completed_High_School.csv', encoding="windows-1252")
df_share_race_city = pd.read_csv('Share_of_Race_By_City.csv', encoding="windows-1252")
df_fatalities = pd.read_csv('Deaths_by_Police_US.csv', encoding="windows-1252")

# Preliminary Data Exploration

* What is the shape of the DataFrames? 
* How many rows and columns do they have?
* What are the column names?
* Are there any NaN values or duplicates?

In [24]:
# Income df
print(f"Shape: {df_hh_income.shape}")
print(f"Columns: {df_hh_income.columns}")
print(f"Is nan value?: {df_hh_income.isna().values.any()}")
print(f"Is duplicated values? {df_hh_income.duplicated().values.any()}")
print(df_hh_income.head(3))

Shape: (29322, 3)
Columns: Index(['Geographic Area', 'City', 'Median Income'], dtype='object')
Is nan value?: True
Is duplicated values? False
  Geographic Area             City Median Income
0              AL       Abanda CDP         11207
1              AL   Abbeville city         25615
2              AL  Adamsville city         42575


In [25]:
# Poverty percent df
print(f"Shape: {df_pct_poverty.shape}")
print(f"Columns: {df_pct_poverty.columns}")
print(f"Is nan value?: {df_pct_poverty.isna().values.any()}")
print(f"Is duplicated values? {df_pct_poverty.duplicated().values.any()}")
print(df_pct_poverty.head(3))

Shape: (29329, 3)
Columns: Index(['Geographic Area', 'City', 'poverty_rate'], dtype='object')
Is nan value?: False
Is duplicated values? False
  Geographic Area             City poverty_rate
0              AL       Abanda CDP         78.8
1              AL   Abbeville city         29.1
2              AL  Adamsville city         25.5


In [27]:
# Education (high school) percent df
print(f"Shape: {df_pct_completed_hs.shape}")
print(f"Columns: {df_pct_completed_hs.columns}")
print(f"Is nan value?: {df_pct_completed_hs.isna().values.any()}")
print(f"Is duplicated values? {df_pct_completed_hs.duplicated().values.any()}")
print(df_pct_completed_hs.head(3))

Shape: (29329, 3)
Columns: Index(['Geographic Area', 'City', 'percent_completed_hs'], dtype='object')
Is nan value?: False
Is duplicated values? False
  Geographic Area             City percent_completed_hs
0              AL       Abanda CDP                 21.2
1              AL   Abbeville city                 69.1
2              AL  Adamsville city                 78.9


In [28]:
# Race share percent df
print(f"Shape: {df_share_race_city.shape}")
print(f"Columns: {df_share_race_city.columns}")
print(f"Is nan value?: {df_share_race_city.isna().values.any()}")
print(f"Is duplicated values? {df_share_race_city.duplicated().values.any()}")
print(df_share_race_city.head(3))

Shape: (29268, 7)
Columns: Index(['Geographic area', 'City', 'share_white', 'share_black',
       'share_native_american', 'share_asian', 'share_hispanic'],
      dtype='object')
Is nan value?: False
Is duplicated values? False
  Geographic area             City share_white share_black  \
0              AL       Abanda CDP        67.2        30.2   
1              AL   Abbeville city        54.4        41.4   
2              AL  Adamsville city        52.3        44.9   

  share_native_american share_asian share_hispanic  
0                     0           0            1.6  
1                   0.1           1            3.1  
2                   0.5         0.3            2.3  


In [29]:
# Fatalities df
print(f"Shape: {df_fatalities.shape}")
print(f"Columns: {df_fatalities.columns}")
print(f"Is nan value?: {df_fatalities.isna().values.any()}")
print(f"Is duplicated values? {df_fatalities.duplicated().values.any()}")
print(df_fatalities.head(3))

Shape: (2535, 14)
Columns: Index(['id', 'name', 'date', 'manner_of_death', 'armed', 'age', 'gender',
       'race', 'city', 'state', 'signs_of_mental_illness', 'threat_level',
       'flee', 'body_camera'],
      dtype='object')
Is nan value?: True
Is duplicated values? False
   id                name      date   manner_of_death    armed   age gender  \
0   3          Tim Elliot  02/01/15              shot      gun 53.00      M   
1   4    Lewis Lee Lembke  02/01/15              shot      gun 47.00      M   
2   5  John Paul Quintero  03/01/15  shot and Tasered  unarmed 23.00      M   

  race     city state  signs_of_mental_illness threat_level         flee  \
0    A  Shelton    WA                     True       attack  Not fleeing   
1    W    Aloha    OR                    False       attack  Not fleeing   
2    H  Wichita    KS                    False        other  Not fleeing   

   body_camera  
0        False  
1        False  
2        False  


## Data Cleaning - Check for Missing Values and Duplicates

Consider how to deal with the NaN values. Perhaps substituting 0 is appropriate. 
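
Whether 0 is appropriate depends on the column. The sketch below uses made-up values (not rows from the CSV files) to show the effect on a numeric column such as `age`: filling with 0 treats the gap as a real value and pulls the mean down, while leaving NaN in place lets pandas skip it. For a categorical column such as `race`, a label like `"Unknown"` (a hypothetical choice, not something the dataset defines) may be easier to interpret than 0.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from Deaths_by_Police_US.csv.
toy = pd.DataFrame({
    "age": [20.0, 40.0, np.nan, 60.0],
    "race": ["W", None, "B", "H"],
})

# Filling with 0 treats the missing age as a real age of 0.
mean_zero_filled = toy["age"].fillna(0).mean()

# Leaving NaN in place: pandas skips missing values by default.
mean_nan_skipped = toy["age"].mean()

# For a categorical column, a label keeps the missing rows visible in counts.
race_filled = toy["race"].fillna("Unknown")

print(mean_zero_filled, mean_nan_skipped)
print(race_filled.tolist())
```

If you do fill with 0, remember to exclude those zeros again in any later step that averages or plots the column.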

In [32]:
# Income df
print(f"Where is nan value located: \n{df_hh_income.isna().sum()}")
# Show the first rows that contain at least one NaN
print(f"nan values: {df_hh_income[df_hh_income.isna().any(axis=1)].head()}")
df_hh_income.fillna(0, inplace=True)
# Re-check the per-column NaN counts after filling
print(f"Where is nan value located: \n{df_hh_income.isna().sum()}")

Where is nan value located: 
Geographic Area     0
City                0
Median Income      51
dtype: int64
nan values:       Geographic Area                  City Median Income
29119              WY            Albany CDP           NaN
29121              WY            Alcova CDP           NaN
29123              WY  Alpine Northeast CDP           NaN
29126              WY    Antelope Hills CDP           NaN
29129              WY         Arlington CDP           NaN
Where is nan value located: 
Geographic Area    0
City               0
Median Income      0
dtype: int64


In [34]:
# Fatalities df
print(f"Where is nan value located: \n{df_fatalities.isna().sum()}")
# Show the first rows that contain at least one NaN
print(f"nan values: {df_fatalities[df_fatalities.isna().any(axis=1)].head()}")
# Note: this also sets missing values in armed, age, race and flee to 0
df_fatalities.fillna(0, inplace=True)
print(f"is nan value: \n{df_fatalities.isna().values.any()}")

Where is nan value located: 
id                           0
name                         0
date                         0
manner_of_death              0
armed                        9
age                         77
gender                       0
race                       195
city                         0
state                        0
signs_of_mental_illness      0
threat_level                 0
flee                        65
body_camera                  0
dtype: int64
nan values:       id                name      date   manner_of_death    armed   age  \
59   110    William Campbell  25/01/15              shot      gun 59.00   
124  584   Alejandro Salazar  20/02/15              shot      gun   NaN   
241  244  John Marcell Allen  30/03/15              shot      gun 54.00   
266  534          Mark Smith  09/04/15  shot and Tasered  vehicle 54.00   
340  433          Joseph Roy  07/05/15              shot    knife 72.00   

    gender race           city state  signs_of_mental_illness

# Chart the Poverty Rate in each US State

Create a bar chart that ranks the poverty rate from highest to lowest by US state. Which state has the highest poverty rate? Which state has the lowest poverty rate?

In [57]:
print(df_pct_poverty.head(2))
df_pct_poverty.info()

  Geographic Area            City poverty_rate
0              AL      Abanda CDP         78.8
1              AL  Abbeville city         29.1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29329 entries, 0 to 29328
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Geographic Area  29329 non-null  object
 1   City             29329 non-null  object
 2   poverty_rate     29329 non-null  object
dtypes: object(3)
memory usage: 687.5+ KB


In [61]:
df_pct_poverty.poverty_rate = df_pct_poverty.poverty_rate.astype(str).str.replace("-", "0")
df_pct_poverty.poverty_rate = pd.to_numeric(df_pct_poverty.poverty_rate)
df_pct_by_area = df_pct_poverty.groupby("Geographic Area").agg({"poverty_rate": pd.Series.mean})
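
Replacing `-` with `0` counts every place without a reported figure as having a rate of zero, which lowers that state's mean. One alternative, sketched here on made-up values, is to let `pd.to_numeric(..., errors="coerce")` turn unparseable entries into NaN so that `mean()` skips them. Applying it to the real data may change the averages shown in this notebook's outputs.

```python
import pandas as pd

# Toy data for illustration only; "-" marks a place with no reported rate.
toy = pd.DataFrame({
    "Geographic Area": ["AL", "AL", "AK", "AK"],
    "poverty_rate": ["10.0", "-", "20.0", "30.0"],
})

# Approach used in the cell above: "-" becomes 0 and counts towards the mean.
zero_filled = pd.to_numeric(toy["poverty_rate"].str.replace("-", "0", regex=False))

# Alternative: unparseable entries become NaN and are skipped by mean().
coerced = pd.to_numeric(toy["poverty_rate"], errors="coerce")

by_state = pd.DataFrame({
    "mean_zero_filled": zero_filled.groupby(toy["Geographic Area"]).mean(),
    "mean_nan_skipped": coerced.groupby(toy["Geographic Area"]).mean(),
})
print(by_state)
```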

In [73]:
df_pct_by_area.sort_values("poverty_rate", ascending=False, inplace=True)
df_pct_by_area.head(3)

                 poverty_rate
Geographic Area              
MS                      26.88
AZ                      25.27
GA                      23.66


In [83]:
# plt.figure() only creates an empty Matplotlib figure and does not affect Plotly, so it is omitted
fig = px.bar(
    df_pct_by_area,
    x=df_pct_by_area.index,
    y="poverty_rate",
    color="poverty_rate",
    title="Poverty rate from highest to lowest by US state"
)

fig.update_layout(
    yaxis_title="Percent",
    xaxis_title="States"
)
fig.show()

# Chart the High School Graduation Rate by US State

Show the High School Graduation Rate in ascending order of US States. Which state has the lowest high school graduation rate? Which state has the highest?

In [75]:
print(df_pct_completed_hs.head(3))
df_pct_completed_hs.info()

  Geographic Area             City percent_completed_hs
0              AL       Abanda CDP                 21.2
1              AL   Abbeville city                 69.1
2              AL  Adamsville city                 78.9
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29329 entries, 0 to 29328
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Geographic Area       29329 non-null  object
 1   City                  29329 non-null  object
 2   percent_completed_hs  29329 non-null  object
dtypes: object(3)
memory usage: 687.5+ KB


In [77]:
df_pct_completed_hs.percent_completed_hs = df_pct_completed_hs.percent_completed_hs.astype(str).str.replace("-", "0")

df_pct_completed_hs.percent_completed_hs = pd.to_numeric(df_pct_completed_hs.percent_completed_hs)

In [79]:
df_pct_hs_state = df_pct_completed_hs.groupby("Geographic Area").agg({"percent_completed_hs": pd.Series.mean})
df_pct_hs_state.sort_values("percent_completed_hs", inplace=True)
df_pct_hs_state.head(3)

                 percent_completed_hs
Geographic Area                      
TX                              74.09
MS                              78.47
GA                              78.63


In [82]:
# plt.figure() only creates an empty Matplotlib figure and does not affect Plotly, so it is omitted
fig = px.bar(
    df_pct_hs_state,
    x=df_pct_hs_state.index,
    y="percent_completed_hs",
    color="percent_completed_hs",
    title="High School Graduation Rate in ascending order of US States"
)

fig.update_layout(
    yaxis_title="Percent",
    xaxis_title="States"
)

fig.show()

# Visualise the Relationship between Poverty Rates and High School Graduation Rates

#### Create a line chart with two y-axes to show if the rates of poverty and high school graduation move together.

#### Now use a Seaborn `.jointplot()` with a Kernel Density Estimate (KDE) and/or scatter plot to visualise the same relationship

#### Use Seaborn's `.lmplot()` or `.regplot()` to show a linear regression between the poverty rate and the high school graduation rate.
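
All three charts need the two state-level averages side by side. Here is a minimal sketch of that preparation step, with made-up numbers and hypothetical frame names (`poverty_by_state`, `hs_by_state`) standing in for `df_pct_by_area` and `df_pct_hs_state` from the earlier cells:

```python
import pandas as pd

# Made-up state averages, indexed by state like the grouped frames above.
poverty_by_state = pd.DataFrame(
    {"poverty_rate": [26.0, 20.0, 12.0]}, index=["MS", "TX", "NH"]
)
hs_by_state = pd.DataFrame(
    {"percent_completed_hs": [78.0, 80.0, 90.0]}, index=["MS", "TX", "NH"]
)

# Join on the shared state index so each row holds both measures.
merged = poverty_by_state.join(hs_by_state, how="inner").sort_values("poverty_rate")

# Pearson correlation: a negative value means graduation falls as poverty rises.
corr = merged["poverty_rate"].corr(merged["percent_completed_hs"])

print(merged)
print(round(corr, 2))
```

With the real frames, `merged` can be passed to `sns.jointplot(data=merged, x="poverty_rate", y="percent_completed_hs", kind="kde")` or to `sns.regplot(data=merged, x="poverty_rate", y="percent_completed_hs")`. For the two-axis line chart, Matplotlib's `ax.twinx()` provides the second y-axis.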

# Create a Bar Chart with Subsections Showing the Racial Makeup of Each US State

Visualise the share of the White, Black, Hispanic, Asian and Native American population in each US state using a bar chart with subsections.
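
One possible way to prepare the data, sketched on made-up rows shaped like `df_share_race_city`: convert the share columns to numbers, average them per state, and reshape to long format so each bar can be split by race. The `"(X)"` entry is a made-up non-numeric placeholder, and the per-state figure is an unweighted mean of city shares, which is a simplifying assumption because it ignores city population.

```python
import pandas as pd

share_cols = ["share_white", "share_black", "share_native_american",
              "share_asian", "share_hispanic"]

# Made-up rows in the shape of Share_of_Race_By_City.csv (values as strings).
toy = pd.DataFrame({
    "Geographic area": ["AL", "AL", "AK"],
    "City": ["A city", "B city", "C city"],
    "share_white": ["60", "40", "50"],
    "share_black": ["30", "50", "(X)"],
    "share_native_american": ["1", "1", "30"],
    "share_asian": ["2", "2", "5"],
    "share_hispanic": ["7", "7", "15"],
})

# Non-numeric placeholders become NaN and are skipped by mean().
toy[share_cols] = toy[share_cols].apply(pd.to_numeric, errors="coerce")

# Unweighted mean of the city shares within each state.
by_state = toy.groupby("Geographic area")[share_cols].mean()

# Long format: one row per (state, race) pair.
long = by_state.reset_index().melt(
    id_vars="Geographic area", var_name="race", value_name="share"
)
print(long)
```

A call such as `px.bar(long, x="Geographic area", y="share", color="race")` then draws one coloured segment per race in each state's bar.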

# Create a Donut Chart of People Killed by Race

Hint: Use `.value_counts()`
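
A short sketch of the hint on made-up race codes: `.value_counts()` returns the counts in descending order and drops missing values by default.

```python
import pandas as pd

# Made-up race codes in the style of the `race` column.
race = pd.Series(["W", "B", "W", "H", "W", "B", None], name="race")

# Missing values are excluded by default; results are sorted descending.
counts = race.value_counts()

# Share of each race among the rows with a known race, in percent.
shares = (counts / counts.sum() * 100).round(1)

print(counts.to_dict())
print(shares.to_dict())
```

In this notebook the earlier `fillna(0)` cell turns missing races into `0`, so those rows would show up as their own slice unless you filter them out first. A donut can then be drawn with `px.pie(names=counts.index, values=counts.values, hole=0.4)`, where the `hole` value is an arbitrary choice.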

# Create a Chart Comparing the Total Number of Deaths of Men and Women

Use `df_fatalities` to illustrate how many more men are killed compared to women. 

# Create a Box Plot Showing the Age and Manner of Death

Break out the data by gender using `df_fatalities`. Is there a difference between men and women in the manner of death? 
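
Before plotting, it may help to drop rows whose age is unknown. The sketch below uses made-up rows and assumes that an age of 0 stands for a missing value, as it would after the earlier `fillna(0)` cell.

```python
import pandas as pd

# Made-up rows; age 0.0 represents a missing age that was filled with 0.
toy = pd.DataFrame({
    "age": [0.0, 25.0, 35.0, 45.0, 30.0, 50.0],
    "gender": ["M", "M", "M", "M", "F", "F"],
    "manner_of_death": ["shot", "shot", "shot", "shot and Tasered", "shot", "shot"],
})

# Keep only rows with a known age so the boxes are not dragged towards 0.
known_age = toy[toy["age"] > 0]

# Median age per gender and manner of death (the centre line of each box).
median_age = known_age.groupby(["gender", "manner_of_death"])["age"].median()
print(median_age)
```

The filtered frame can be passed to `px.box(known_age, x="manner_of_death", y="age", color="gender")` to get one box per gender within each manner of death.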

# Were People Armed? 

In what percentage of police killings were people armed? Create a chart that shows what kind of weapon (if any) the deceased was carrying. How many of the people killed by police were armed with guns versus unarmed?
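
A possible starting point, on made-up values for the `armed` column. Whether entries such as `"undetermined"` count as armed is a judgment call; this sketch treats them as not confirmed armed, which is an assumption rather than a rule from the dataset.

```python
import pandas as pd

# Made-up values in the style of the `armed` column.
armed = pd.Series(
    ["gun", "knife", "unarmed", "gun", "undetermined", "vehicle", "gun", "unarmed"]
)

# Number of deaths per weapon category, largest first.
weapon_counts = armed.value_counts()

# Assumption: "unarmed" and "undetermined" are not counted as armed.
is_armed = ~armed.isin(["unarmed", "undetermined"])
pct_armed = is_armed.mean() * 100

print(weapon_counts.to_dict())
print(pct_armed)
```

`weapon_counts["gun"]` and `weapon_counts["unarmed"]` give the two figures asked for in the last question, and `weapon_counts` itself can be passed to a bar chart.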

# How Old Were the People Killed?

Work out what percentage of people killed were under 25 years old.  

Create a histogram and KDE plot that shows the distribution of ages of the people killed by police. 

Create a separate KDE plot for each race. Is there a difference between the distributions?
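
For the percentage, unknown ages need to be excluded first, otherwise they distort the result. A minimal sketch on made-up ages, assuming that both NaN and 0 mean "age unknown" (0 being what the earlier `fillna(0)` cell produces):

```python
import numpy as np
import pandas as pd

# Made-up ages; NaN and 0.0 both stand for an unknown age.
ages = pd.Series([17.0, 22.0, 24.0, 25.0, 31.0, 47.0, np.nan, 0.0])

# Comparisons with NaN are False, so this drops NaN and 0 in one step.
known = ages[ages > 0]

# Share of people with a known age who were under 25, in percent.
pct_under_25 = (known < 25).mean() * 100
print(pct_under_25)
```

The same filtered ages can feed `sns.histplot(known, kde=True)`. For the per-race curves, `sns.kdeplot(data=df, x="age", hue="race", common_norm=False)` is one option, where `df` is a hypothetical name for the filtered fatalities frame.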

# Race of People Killed

Create a chart that shows the total number of people killed by race. 

# Mental Illness and Police Killings

What percentage of people killed by police showed signs of mental illness?
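
The dataset records this as the boolean column `signs_of_mental_illness`. Because `True` counts as 1 and `False` as 0, the mean of a boolean column is the share of `True` values, as this sketch on made-up values shows:

```python
import pandas as pd

# Made-up flags in the style of the `signs_of_mental_illness` column.
signs = pd.Series([True, False, False, False], name="signs_of_mental_illness")

# Mean of a boolean Series = fraction of True values.
pct_with_signs = signs.mean() * 100
print(pct_with_signs)
```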

# In Which Cities Do the Most Police Killings Take Place?

Create a chart ranking the top 10 cities with the most police killings. Which cities are the most dangerous?  
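
City names are not unique across states, so counting on `city` alone can merge different places. A sketch on made-up rows that groups on city and state together:

```python
import pandas as pd

# Made-up rows; "Springfield" appears in two different states.
toy = pd.DataFrame({
    "city": ["Springfield", "Springfield", "Springfield", "Houston", "Houston", "Phoenix"],
    "state": ["IL", "MO", "IL", "TX", "TX", "AZ"],
})

# Count deaths per (city, state) pair and keep the ten largest.
top = (
    toy.groupby(["city", "state"]).size()
    .sort_values(ascending=False)
    .head(10)
    .rename("deaths")
    .reset_index()
)
print(top)
```

Keep in mind that raw counts favour large cities; a per-capita rate would need population figures, which are not part of the datasets loaded here.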

# Rate of Death by Race

Find the share of each race in the top 10 cities. Contrast this with the top 10 cities of police killings to work out the rate at which people are killed by race for each city. 
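
One way to frame the comparison is to divide each group's share of deaths in a city by its share of that city's population. The sketch below uses made-up numbers throughout. With the real data you would also need to map race codes such as `W` and `B` to the `share_*` columns, and to match city names, since the census files use names like `Abbeville city`; both steps are left out here.

```python
import pandas as pd

# Made-up deaths: one row per person killed.
toy = pd.DataFrame({
    "city": ["Houston", "Houston", "Houston", "Houston", "Phoenix", "Phoenix"],
    "race": ["B", "B", "H", "W", "W", "H"],
})

# Share of deaths by race within each city, in percent (rows sum to 100).
deaths_share = pd.crosstab(toy["city"], toy["race"], normalize="index") * 100

# Made-up population shares (percent) for the same cities and race codes.
population_share = pd.DataFrame(
    {"B": [25.0, 10.0], "H": [50.0, 40.0], "W": [25.0, 50.0]},
    index=["Houston", "Phoenix"],
)

# Ratio above 1: the group's share of deaths exceeds its population share.
ratio = deaths_share / population_share
print(ratio)
```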

# Create a Choropleth Map of Police Killings by US State

Which states are the most dangerous? Compare your map with your previous chart. Are these the same states with high degrees of poverty? 
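
The map needs one row per state with a count. A minimal sketch on made-up state codes:

```python
import pandas as pd

# Made-up two-letter state codes in the style of the `state` column.
states = pd.Series(["CA", "TX", "CA", "FL", "CA", "TX"], name="state")

# One row per state with the number of deaths.
deaths_by_state = (
    states.value_counts()
    .rename_axis("state")
    .reset_index(name="deaths")
)
print(deaths_by_state)
```

The result can be passed to `px.choropleth(deaths_by_state, locations="state", locationmode="USA-states", color="deaths", scope="usa")`. As with cities, raw counts tend to highlight the most populous states.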

# Number of Police Killings Over Time

Analyse the Number of Police Killings over Time. Is there a trend in the data? 
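
The `date` column is stored as text. The sample rows printed earlier (for example `25/01/15`) suggest a day/month/year layout, so this sketch parses with an explicit format and counts deaths per month. It assumes the whole column follows the same layout.

```python
import pandas as pd

# Date strings copied from the sample rows shown in the outputs above.
dates = pd.Series(["02/01/15", "25/01/15", "20/02/15", "30/03/15", "09/04/15", "07/05/15"])

# An explicit format avoids day/month being swapped during parsing.
parsed = pd.to_datetime(dates, format="%d/%m/%y")

# Number of deaths per calendar month.
monthly = parsed.groupby(parsed.dt.to_period("M")).size()
print(monthly)
```

Plotting `monthly` as a line, optionally with a rolling mean, is one way to look for a trend.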

# Epilogue

Now that you have analysed the data yourself, read [The Washington Post's analysis here](https://www.washingtonpost.com/graphics/investigations/police-shootings-database/).