# Capstone Project

## Background

We are glad you have reached the capstone project of the "Data Science Fundamentals" course. You will put everything
you have learned so far about data science to work. The outcome of this module can serve as a portfolio item.

Unlike previous projects, this time you are free to choose one of three suggested datasets to explore. Because you have to solve the provided problem, there is no list of predefined questions to answer - be creative and explore any dimensions of the data you deem worth analyzing.

Although this might seem scary, this is what data science looks like in industry. Often, it's your responsibility not only to give answers using the data, but also to raise questions. The more creatively you look at this project, the better. Good luck!

----

## Requirements

Whichever problem you choose to analyze, the general requirements are as follows:

#### Exploratory Data Analysis
* Describe the data with basic statistical parameters - mean, median, quantiles, etc. Use parameters that give you the most important statistical insights of the data.
* Group the data and analyze the groups - use pandas aggregate methods.
* Work with features - handle missing data if needed, use pandas date APIs.
* Manipulate datasets - use joins if needed.
* Visualize the data - you can use line, scatter, histogram plots, density plots, regplots, etc.

#### Statistical hypothesis testing
* Use at least one statistical significance test.
* Report p-values.
* Use visualizations.

#### Modeling
* Visualize data with dimensionality reduction algorithms.
* Perform cluster analysis.
* Use a linear model to explain relationships and predict new values.

#### Presentation
* Present the project - the data, methods and results.

## Problems

#### COVID-19 crisis 

<div><img width="400px" height="auto" src="https://images.unsplash.com/photo-1574515944794-d6dedc7150de?ixlib=rb-1.2.1&ixid=MXwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHw%3D&auto=format&fit=crop&w=1532&q=80" /></div>

The world is still struggling with one of the most rapidly spreading pandemics. Many people say that data is the best weapon we can use in this "Corona Fight".

Imagine that you are one of the best data scientists in your country. The president of your country has asked you to analyze the COVID-19 patient-level data of South Korea and prepare your homeland for the next wave of the pandemic. You, as the lead data scientist of your country, **have to create and justify a plan for fighting the pandemic in your country** by analyzing the provided data. You must extract the most important insights using the data science techniques you have learned and present them to the leader of your country.

https://www.kaggle.com/kimjihoo/coronavirusdataset/

#### 2016 US presidential elections

<div><img width="400px" height="auto" src="https://images.unsplash.com/photo-1583340806569-6da3d5ea9911?ixlib=rb-1.2.1&ixid=MXwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHw%3D&auto=format&fit=crop&w=1315&q=80" /></div>

In 2016, Donald Trump lost the popular vote, yet he won the electoral vote, securing 4 years in the Oval Office. This was a shock to Democratic supporters all around the world.

Imagine you travel back in time to 2016. As soon as you step out of your time capsule, the Democratic Party hires you. They want you, the best data scientist across time and space, **to explain what happened and what should have been done differently**. They want you to **prepare them for the 2020 presidential elections**.

The Party has some tips for you - inspect the voters. Who are Trump supporters? What characterizes them? Who are our supporters? Where should we focus next? Are there any pro-Trump states? Cities?

The Democrats were kind enough to share [a Kaggle dataset](https://www.kaggle.com/benhamner/2016-us-election) with you on the 2016 U.S. elections. Use the data to help the Democrats.

#### Fatal Police Shootings in the United States

<div><img width="400px" height="auto" src="https://images.unsplash.com/photo-1606352466047-7cef02b312bb?ixlib=rb-1.2.1&ixid=MXwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHw%3D&auto=format&fit=crop&w=1662&q=80" /></div>

[Police brutality in the United States](https://en.wikipedia.org/wiki/Police_brutality_in_the_United_States) has been a nationwide issue since the 20th century. Public safety of U.S. citizens is a typical argument to justify the controversially high number of fatal shootings.

You are a contractor to the United States Department of Justice. **You have been given a case to investigate fatal police shootings throughout the United States of America, provide a list of issues, and propose a plan on how to tackle these issues**.

The department offered some tips - public opinion indicates that there is something systematically fishy about police actions against civilians, some states differ from others, some cities are different from others, racial equality is still an unanswered question, there is some talk about huge spending on police, and there are rumors about mental health issues among those getting shot. Government is all about prioritizing - use the data to list issues with police activity and propose a plan for which issues to tackle first and how.

You are given one dataset to start with. Try to search for more datasets to enrich your data analysis.

Here's the dataset:

* [Fatal Police Shootings in the U.S. '15 - '17](https://www.kaggle.com/washingtonpost/police-shootings).

## Evaluation Criteria

- Code quality
- Fulfillment of the idea
- Adherence to the requirements
- Delivery of the presentation

#### Statistical hypothesis testing
- Correct statistical test method is used, based on the situation.
- Reasoning on chosen statistical significance level.

#### Modeling
- Both PCA and T-SNE algorithms are used.

In [1]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

In [2]:
!pip install kneed



# Introduction

This project explores a dataset of fatal police shootings in the United States between 2015 and 2017. The aim is to investigate the fatal shootings to see whether there is something systematically wrong with police actions, list the issues discovered, and propose a plan to tackle them.

Each instance in the dataset contains Name, Date, Manner_of_death, Armed, Age, Gender, Race, City, State, Signs_of_mental_illness, Threat_level, Flee, and Body_camera.

In [3]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [4]:
import os


from kneed import KneeLocator
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
import seaborn as sns
from scipy.stats import norm
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

In [5]:
# os.chdir('drive/My Drive/Turing')

In [6]:
pol_shooting = pd.read_csv('database.csv')
pop_2016 = pd.read_csv('raw_data_2016_yesss.csv')

FileNotFoundError: ignored

In [None]:
pol_shooting['date'] = pd.to_datetime(pol_shooting['date'])
pol_shooting['year'] = pol_shooting.date.dt.year
pol_shooting['month'] = pol_shooting.date.dt.month_name()
pol_shooting['day'] = pol_shooting.date.dt.day_name()

In [None]:
days = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
months = [ 'January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 
          'September', 'October', 'November', 'December']

In [None]:
pol_shooting [(pol_shooting ['year']==2015) | (pol_shooting ['year']==2016)].groupby(['month'])\
                                                        ['month'].count().reindex(months).plot()                                                  
plt.xlabel('Month')
plt.ylabel('Count')
plt.title('Shooting by month in 2015 and 2016')
plt.show()

In [None]:
pol_shooting [(pol_shooting ['year']==2015) | (pol_shooting ['year']==2016)].groupby\
                                        (['day'])['day'].count().reindex(days).plot()                                       
plt.xlabel('Day')
plt.ylabel('Count')
plt.title('Shooting by day in 2015 and 2016')
plt.show()

In [None]:
shooting_by_state = pol_shooting.groupby(['state'])['state'].count().to_frame().rename(columns=\
                                      {'state': 'count'}).reset_index().sort_values('count', ascending= False)


In [None]:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,5))

ax1.bar(shooting_by_state['state'][0:10], shooting_by_state['count'][0:10])
ax1.title.set_text('Top 10 shooting cases by state')
ax1.set_xlabel('State')
ax1.set_ylabel('Shooting count')

ax2.bar(shooting_by_state['state'][-10:], shooting_by_state['count'][-10:])
ax2.title.set_text('Bottom 10 shooting cases by state')
ax2.set_xlabel('State')
ax2.set_ylabel('Shooting count')
plt.show()

In [None]:
pop_2016['Location'] = pop_2016['Location'].replace(us_state_abbrev)

In [None]:
shooting_pop = pd.merge(shooting_by_state, pop_2016[['Location', 'Total']], left_on='state', \
                        right_on='Location', how='left').rename(columns= {'Total': 'Population 2016'})

In [None]:
pop_state = shooting_pop[['state','Population 2016']].sort_values(by='Population 2016', ascending=False)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,5))

ax1.bar(pop_state['state'][0:10], pop_state['Population 2016'][0:10])
ax1.title.set_text('Top 10 states by population')
ax1.set_xlabel('State')
ax1.set_ylabel('Population count')

ax2.bar(pop_state['state'][-10:], pop_state['Population 2016'][-10:])
ax2.title.set_text('Bottom 10 states by population')
ax2.set_xlabel('State')
ax2.set_ylabel('Population count')
plt.show()

From the population chart, it appears that states with large populations have similarly high numbers of shooting incidents, but that is only partially true. States such as Arizona (AZ) and Oklahoma (OK) have very high numbers of shooting cases even though they are not among the 10 most populous states. Similarly, low-population states such as DC and AK have a high number of shooting cases for their size.


In [None]:
shooting_pop['shooting_per_1M'] = round((shooting_pop['count']/shooting_pop['Population 2016'])*1000000)

In [None]:
shooting_per_1M = shooting_pop.sort_values(by='shooting_per_1M', ascending=False)

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,5))

ax1.bar(shooting_per_1M['state'][0:10], shooting_per_1M['shooting_per_1M'][0:10])
ax1.title.set_text('Top 10 shooting per 1M population')
ax1.set_xlabel('State')
ax1.set_ylabel('Shootings per 1M population')

ax2.bar(shooting_per_1M['state'][-10:], shooting_per_1M['shooting_per_1M'][-10:])
ax2.title.set_text('Bottom 10 shooting per 1M population')
ax2.set_xlabel('State')
ax2.set_ylabel('Shootings per 1M population')
plt.show()

The chart of shootings per 1M population shows which states have the highest concentration of police shootings. Even though high numbers of police shootings were recorded in highly populated states, the rate of shootings per 1M population is higher in smaller states. None of the states in the top 10 by shootings per 1M population is among the 10 most populous states.
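
The arithmetic behind the per-capita comparison can be sketched with a toy example. The state names and numbers below are invented for illustration only and are not taken from the dataset; the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical counts and populations (NOT real data), chosen so that the
# state with fewer incidents has the higher rate.
toy = pd.DataFrame({
    'state': ['Large', 'Small'],
    'count': [300, 30],
    'population': [30_000_000, 1_000_000],
})

# Rate per 1M residents = incidents / population * 1,000,000
toy['shooting_per_1M'] = toy['count'] / toy['population'] * 1_000_000

# Sorting by rate instead of raw count reverses the ranking.
print(toy.sort_values('shooting_per_1M', ascending=False))
```

Here the smaller state has a tenth of the incidents but three times the rate, which is the same pattern the per-1M chart shows.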

In [None]:
plt.hist(pol_shooting['age'], edgecolor = 'black')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age distribution of shooting victims')
plt.show()

In [None]:
sns.boxplot(y=pol_shooting.age, x=pol_shooting.gender)
plt.xlabel('Gender')
plt.ylabel('Age')
plt.title('Age distribution by gender')
plt.show()

In [None]:
shooting_race = pol_shooting.groupby(['race'])['race'].count().sort_values(ascending=False)\
                                                      .to_frame().rename(columns={'race': 'count'}).reset_index()

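# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may silently ignore.
shooting_race['race'] = shooting_race['race'].replace({'W': 'White', 'B': 'Black', 'H': 'Hispanic',
                                                       'A': 'Asian', 'O': 'Others', 'N': 'Others'})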

shooting_race_clean= shooting_race.groupby('race')['count'].sum().to_frame().reset_index()\
                                                          .sort_values(by='count', ascending=False)
                                                          
shooting_race_clean.plot.bar('race', 'count')
plt.xlabel('Race')
plt.ylabel('Count')
plt.title('Shooting incidents by race')
plt.show()

In [None]:
pop_race = pd.melt(pop_2016[1:].drop(['Location', 'Total'], axis=1)).groupby('variable')\
                                            ['value'].sum().sort_values(ascending=False)

pop_race_clean= pop_race.to_frame().reset_index().replace({ 'Multiple Races': 'Others', \
                                                           'American Indian/Alaska Native': 'Others', \
                                                           'Native Hawaiian/Other Pacific Islander': 'Others'})\
                                                           .groupby('variable')['value'].sum().to_frame().reset_index()\
                                                           .rename(columns={'variable': 'race', 'value': 'population'})

In [None]:
pop_race_clean.sort_values(by = 'population', ascending=False).plot.bar('race', 'population')
plt.xlabel('Race')
plt.ylabel('Count')
plt.title('Population by race')
plt.show()

Comparing the chart of shooting cases by race with the chart of population by race, the white population is more than three times the Hispanic population and about five times the Black population, yet the number of shooting incidents involving white people is only about twice the number involving Black people.

In [None]:
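# Merge shooting counts with population by race before computing proportions
# (the frame must exist before columns are added to it).
shooting_race_pop = pd.merge(shooting_race_clean, pop_race_clean)

shooting_race_pop['pop_proportion'] = (shooting_race_pop['population']\
                                       /shooting_race_pop['population'].sum())*100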

shooting_race_pop['shooting_proportion'] = (shooting_race_pop['count']\
                                       /shooting_race_pop['count'].sum())*100
                                       
shooting_race_pop.sort_values(by='population', ascending=False, inplace=True)

In [None]:
labels= list(shooting_race_pop.race)
fig, ax = plt.subplots(figsize=(10,5))
ind=np.arange(len(labels))
width=0.35

ax.bar(ind, shooting_race_pop['pop_proportion'], width, label= 'Population')
ax.bar(ind+width, shooting_race_pop['shooting_proportion'], width, label='Shooting')
ax.title.set_text('Population and Shooting proportions')
ax.set_xticks(ind + width / 2)  # centre each tick between the paired bars
ax.set_xticklabels(labels)
ax.set_xlabel('Race')
ax.set_ylabel('Proportion')
ax.legend()
plt.show()


In [None]:
shooting_race_pop = pd.merge(shooting_race_clean, pop_race_clean)

shooting_race_pop['shooting_per_1M_per_race'] = round((shooting_race_pop['count']\
                                                       /shooting_race_pop['population'])*1000000)
shooting_race_pop = shooting_race_pop.sort_values(by='shooting_per_1M_per_race', ascending=False)
shooting_race_pop.plot.bar('race', 'shooting_per_1M_per_race')
plt.xlabel('Race')
plt.ylabel('Shootings per 1M population')
plt.title('Shooting per 1M population by race')
plt.show()

In the chart of shootings per 1M people by race, Black people jump to the top, with about twice as many incidents per 1M as the next group, Hispanic, and almost three times as many as white people.

We use a one-proportion z-test to verify a USA Today report claiming that Black people account for about 23% of those shot and killed by the police (https://eu.usatoday.com/story/opinion/2020/07/03/police-black-killings-homicide-rates-race-injustice-column/3235072001/).


*   H0: The percentage of black people shot dead by police is 23%
*   H1: The percentage of black people shot dead by police is more than 23%
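
For reference, the test statistic computed by `proportion_test` in this notebook corresponds to the standard one-proportion z statistic, where $\hat{p}$ is the observed proportion (`p_hat`), $p_0$ is the hypothesized proportion (`p_o`) and $n$ is the sample size:

$$
z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0\,(1 - p_0)}{n}}}
$$

For the one-sided alternative H1, the null hypothesis is rejected at significance level $\alpha$ when $z > z_{1-\alpha}$, the $(1-\alpha)$ quantile of the standard normal distribution, which is what `get_z_score` returns.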





In [None]:
shooting_race_clean[shooting_race_clean['race']=='Black']['count']

In [None]:
def proportion_test(p_hat, p_o, n):
  num = p_hat - p_o
  den = np.sqrt((p_o * (1-p_o))/n)
  return num / den

In [None]:
def get_z_score(alpha):
  return norm.ppf(1-alpha)

In [None]:
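# 95% confidence level, i.e. significance level alpha = 0.05
n = shooting_race_clean['count'].sum()
# Take the single matching value explicitly; int() on a Series is deprecated.
shooting_black = int(shooting_race_clean.loc[shooting_race_clean['race']=='Black', 'count'].iloc[0])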
p_hat = shooting_black / n
p_o = 0.23
alpha = 0.05

In [None]:
test_score = proportion_test(p_hat, p_o, n)
print('test_score: ', test_score)

z_score = get_z_score(alpha)
print('z_score: ', z_score)

Since the calculated test statistic is greater than the critical z-value, we reject the null hypothesis that 23% of the victims of police shootings are Black, in favor of the alternative that the share is higher.
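
The requirements ask for p-values to be reported, and the same statistic can be turned into one. The following is a minimal sketch using only the Python standard library; the function name and the counts passed to it are hypothetical placeholders, not the notebook's actual figures. With SciPy already imported, `norm.sf(z)` gives the same upper-tail probability.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_proportion_test(successes, n, p_0):
    """Return (z, p_value) for H0: p = p_0 against H1: p > p_0."""
    p_hat = successes / n
    z = (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)
    # Upper-tail probability of the standard normal distribution.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical counts for illustration only (NOT the dataset's values).
z, p_value = one_sided_proportion_test(successes=260, n=1000, p_0=0.23)
print(f'z = {z:.3f}, p-value = {p_value:.4f}')
```

A p-value below the chosen `alpha` leads to the same decision as comparing the statistic with the critical value, but it also conveys how strong the evidence is.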

In [None]:
pol_shooting['armed'].value_counts().shape[0]

There are 64 distinct categories of weapons recorded for victims of police shootings.

In [None]:
pol_shooting['armed'].value_counts()[0:15].plot.bar()
plt.xlabel('Arms')
plt.ylabel('Count')
plt.title('Top 15 arms from victims of police shooting')
plt.show()

The report also claims that the majority of the victims were armed, which is evident in the chart above. Using a one-proportion z-test at the 5% significance level, we test whether more than half of the victims were carrying a gun.


*   H0: The percentage of victims shot dead by police who were carrying a gun is 50%
*   H1: The percentage of victims shot dead by police who were carrying a gun is more than 50%




In [None]:
# confidence = 95
n = pol_shooting['armed'].value_counts().sum()
shooting_armed = pol_shooting[pol_shooting['armed']=='gun'].shape[0]
p_hat = shooting_armed / n
p_o = 0.5
alpha = 0.05

In [None]:
test_score = proportion_test(p_hat, p_o, n)
print('test_score: ', test_score)

z_score = get_z_score(alpha)
print('z_score: ', z_score)

Since the calculated test statistic is greater than the critical z-value, we reject the null hypothesis that 50% of the victims shot dead by police were carrying a gun, in favor of the alternative that the share is higher.

In [None]:
pol_shooting['threat_level'].value_counts().plot.bar()
plt.xlabel('Threat level')
plt.ylabel('Count')
plt.title('Victim threat level')
plt.show()

In [None]:
pol_shooting['flee'].value_counts().plot.bar()
plt.xlabel('Flee')
plt.ylabel('Count')
plt.title('Victims fleeing by count')
plt.show()

From the last three charts, it makes sense that the threat level of the victims is mostly high, considering that a large percentage of victims were found with a gun. It is also important to note that most of the victims were not fleeing.

In [None]:
pol_shooting['signs_of_mental_illness'].value_counts().plot.bar()
plt.ylabel('Count')
plt.xlabel('Signs of mental illness')
plt.title('Shooting victims mental state')
plt.show()

In [None]:
pol_shooting.info()

In [None]:
pol_shooting.fillna(pol_shooting.mode().iloc[0], inplace=True)

In [None]:
X = pol_shooting.drop(['id', 'name', 'city', 'date', 'year','age', 'signs_of_mental_illness'], axis=1)
y =  pol_shooting['signs_of_mental_illness']

In [None]:
ohe = OneHotEncoder(handle_unknown='ignore')

In [None]:
X = ohe.fit_transform(X).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
X = pd.DataFrame(X, columns=ohe.get_feature_names_out())
X['age'] = pol_shooting.age

In [None]:
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

In [None]:
pca= PCA()
X = pca.fit_transform(X_scaled)

In [None]:
explained_variance = pca.explained_variance_ratio_
cumsum_explained_variance = np.cumsum(explained_variance)

In [None]:
plt.plot(range(pca.n_components_), cumsum_explained_variance)
ax = plt.gca()
ax.axhline(y=0.80, color='red')
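
The red line marks 80% cumulative explained variance. Rather than reading the component count off the plot, it can be derived from the cumulative sum. The sketch below uses a made-up variance array (not this notebook's PCA output) to show the idea; scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` to select the count by explained variance.

```python
import numpy as np

# Hypothetical explained-variance ratios for illustration (they sum to 1).
explained_variance_example = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
cumulative = np.cumsum(explained_variance_example)

threshold = 0.80
# argmax returns the first index where the condition is True;
# add 1 to convert the zero-based index into a component count.
n_components_needed = int(np.argmax(cumulative >= threshold)) + 1
print('Components needed:', n_components_needed)
```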

In [None]:
pca = PCA(n_components=100)
X = pca.fit_transform(X_scaled)

In [None]:
pc_df = pd.DataFrame(data= X, columns= list(range(1, 101)))

In [None]:
plt.figure(figsize = (7,7));
sns.scatterplot(x = pc_df[1],y = pc_df[2],hue =y);
plt.xlabel('Pc1')
plt.ylabel('Pc2')
plt.title('Victim clusters based on mental state')
plt.show()

In [None]:
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(X)
    distortions.append(kmeanModel.inertia_)

In [None]:
plt.plot(K, distortions, 'bx-');
plt.xlabel('K');
plt.ylabel('Distortion');
plt.title('The Elbow Method showing the optimal k');

In [None]:
knee = KneeLocator(range(1, 10), distortions, curve="convex", direction="decreasing")

print(knee.elbow)

In [None]:
kmeans = KMeans(n_clusters=4)
clusters = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

In [None]:
unique_clusters = np.unique(clusters)

for i in unique_clusters:
    plt.scatter(X[clusters == i , 0] , X[clusters == i , 1] , label = i);
plt.scatter(centroids[:,0] , centroids[:,1] , s = 50, color = 'black');
plt.legend();
plt.title('Cluster plot for shooting victims with PCA');

In [None]:
print('Silhouette score with PCA: ', silhouette_score(X, clusters))

The silhouette score indicates that the clusters overlap, i.e. there is no clear separation among the clusters.
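
To give a feel for how the silhouette score behaves, here is a small sketch on synthetic data (not the shooting dataset). The helper name, the centres and the spread values are arbitrary assumptions, chosen only to contrast well-separated clusters with overlapping ones.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def kmeans_silhouette(cluster_std):
    """Fit 4-means on synthetic 2-D blobs and return the silhouette score."""
    # Four true centres; cluster_std controls how much the blobs overlap.
    points, _ = make_blobs(
        n_samples=400,
        centers=[[-5, -5], [-5, 5], [5, -5], [5, 5]],
        cluster_std=cluster_std,
        random_state=0,
    )
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(points)
    return silhouette_score(points, labels)


tight_score = kmeans_silhouette(cluster_std=0.5)   # well-separated blobs
loose_score = kmeans_silhouette(cluster_std=6.0)   # heavily overlapping blobs
print('Separated:', tight_score, 'Overlapping:', loose_score)
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 indicate overlap, which is the situation described above.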

A model to predict whether a victim showed signs of mental illness or not

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

In [None]:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

In [None]:
print('Model accuracy: ' + str(accuracy_score(y_test, y_pred)))
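
Accuracy on its own can be hard to interpret when one class is much more common than the other. A quick sanity check is to compare it with the accuracy of always predicting the majority class. The labels below are a hypothetical 25/75 split used purely for illustration, not the notebook's actual `y_test`.

```python
import numpy as np

# Hypothetical test labels: 25% True, 75% False (NOT the real split).
y_test_example = np.array([True] * 25 + [False] * 75)

# Always predicting the most common class yields the baseline accuracy.
positive_share = y_test_example.mean()
baseline_accuracy = max(positive_share, 1 - positive_share)
print('Majority-class baseline accuracy:', baseline_accuracy)
```

If the logistic regression's accuracy is not clearly above this baseline, the model has learned little beyond the class frequencies.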

Issues:


*   There is not enough data on the victims (e.g. occupation, family, neighbourhood, education) and nothing on the police officers (e.g. race, age, past history) to give more insight into the shootings.
*   Black people have a significantly higher chance of being killed by the police than people of any other race.
*   Most of the victims had a weapon on them (guns being the most common).



Proposed solution:


*   More and richer data should be collected on police shootings.
*   Unconscious racial bias should be a major focus of police training.
*   Gun laws should be reviewed, as over 50% of victims were in possession of guns.
  



# Conclusion

The results from this project show that there is some form of racial bias in the police system in the United States. They also show that access to guns might contribute to the number of incidents of police brutality.