# Compound Brutality Analysis
## Logan Chang and Robert Yu
## 12/22/20

In this notebook, we look at the prevalence of police brutality by demographic group. We group the subjects by gender, race, and age band, look for trends in police brutality across those groups, and offer some of our own insights.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.options.display.max_rows = 4000
import math
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/chicago-pd-police-brutality/TRR_FINAL_WITH_PB.csv
/kaggle/input/clean-cpp-data/df_trr_by_beat_clean.csv
/kaggle/input/clean-cpp-data/df_subject_id.csv
/kaggle/input/clean-cpp-data/df_trr_id_final-2.csv
/kaggle/input/clean-cpp-data/df_beat_clean.csv


In [2]:
#load dataframes
df_subjects = pd.read_csv('/kaggle/input/clean-cpp-data/df_subject_id.csv')
df_trr = pd.read_csv('/kaggle/input/chicago-pd-police-brutality/TRR_FINAL_WITH_PB.csv')

Let's try to combine all the TRR (Tactical Response Report) incident data by subject.

We will mainly be looking at police brutality incidents as described in previous notebooks.
To reiterate, we measured police brutality by the presence of 3 metrics:
1. Police force was used
2. An injury (officer or subject reported) occurred
3. The level of police force was judged to be greater than the level of resistance faced

We add a new binary column that records whether police brutality occurred during a TRR incident according to our metrics (1 means all three metrics are present, 0 means at least one is absent):


In [3]:
#add binary switch of police brutality to trr dataframe
def pb_present(row):
    # 1 only when all three criteria are met, otherwise 0
    if row['pb_1'] == 1 and row['pb_2'] == 1 and row['pb_3'] == 1:
        return 1
    return 0
df_trr['pb'] = df_trr.apply(pb_present, axis = 1)
df_trr.head()

Unnamed: 0,trr_id,sr_no,se_no,beat,party_fired_first,taser,firearm_used,trr_year,weapon_discharge_yn,list_of_subcats,...,injured,alleged_injury,subject_no,event_id,race,pb_1,pb_2,max_officer_action,pb_3,pb
0,4,1.0,1.0,1322,,0,0,2004,0,"4.2,",...,1,1,1.0,1,HISPANIC,1,1,4.0,1,1
1,5,1.0,1.0,1322,,0,0,2004,0,"4.2,",...,1,1,1.0,1,HISPANIC,1,1,4.0,1,1
2,7,2.0,2.0,1131,,0,0,2004,1,"4.1,4.2,",...,0,0,2.0,2,BLACK,1,0,4.0,1,0
3,8,2.0,2.0,1131,,0,0,2004,0,"4.2,",...,0,0,2.0,2,BLACK,1,0,4.0,1,0
4,9,3.0,3.0,1112,,0,0,2004,0,"4.2,3.3,",...,0,0,3.0,3,BLACK,1,0,4.0,1,0
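
The same flag can also be built without a row-wise `apply`. The following is a minimal sketch on invented toy data: the `toy_trr` frame and its values are hypothetical, and only the column names `pb_1`, `pb_2`, `pb_3`, and `pb` come from the notebook.

```python
import pandas as pd

# Toy stand-in for df_trr; the values are invented for illustration only.
toy_trr = pd.DataFrame({
    "pb_1": [1, 1, 1, 0],
    "pb_2": [1, 0, 1, 1],
    "pb_3": [1, 1, 0, 1],
})

# An incident is flagged only when all three criteria equal 1.
toy_trr["pb"] = (toy_trr[["pb_1", "pb_2", "pb_3"]] == 1).all(axis=1).astype(int)
print(toy_trr)
```

Only the first toy row meets all three criteria, so it is the only one flagged.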


Transfer police brutality incidents to each subject by matching subject ID and subject number:

In [4]:
#create tracker of subject number -> experienced police brutality
subject_pb = {}
for index, row in df_trr.iterrows():
    if np.isnan(row['subject_no']):
        continue
    subject_num = int(row['subject_no'])
    # a subject is flagged (1) if any of their TRR incidents was flagged
    subject_pb[subject_num] = max(subject_pb.get(subject_num, 0), row['pb'])
# print(subject_pb)

In [5]:
#add data from tracker to subjects dataframe
def add_pb(row):
    # subjects with no matching TRR rows get NaN
    return subject_pb.get(row['subject_ID'], np.nan)
df_subjects['pb'] = df_subjects.apply(add_pb, axis = 1)
# df_subjects.head()
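
The two cells above amount to "flag a subject if any of their incidents was flagged". One possible pandas formulation, sketched on invented toy frames (`toy_trr` and `toy_subjects` are hypothetical stand-ins for `df_trr` and `df_subjects`), uses `groupby(...).max()` followed by `Series.map`:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_trr and df_subjects; all values are invented.
toy_trr = pd.DataFrame({
    "subject_no": [1.0, 1.0, 2.0, np.nan, 3.0],
    "pb":         [0,   1,   0,   1,      0],
})
toy_subjects = pd.DataFrame({"subject_ID": [1.0, 2.0, 3.0, 4.0]})

# groupby drops NaN keys by default; max() is 1 if any incident was flagged.
pb_by_subject = toy_trr.groupby("subject_no")["pb"].max()

# Subjects with no matching incident get NaN, as in the loop version.
toy_subjects["pb"] = toy_subjects["subject_ID"].map(pb_by_subject)
print(toy_subjects)
```

Subject 4 has no incident rows in the toy data, so its `pb` value stays missing and would be removed by a later `dropna`.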

Next, we add the subcategories of reported crimes for each subject:

In [6]:
#create tracker of each subjects list of sub categories of reported crime committed
subjects_subcats = {}
for index, row in df_trr.iterrows():
    if np.isnan(row['subject_no']):
        continue
    key = int(row['subject_no'])
    subcats = str(row['list_of_subcats']).split(",")[:-1]
    for subcat in subcats:
        try:
            subjects_subcats[key].add(subcat)
        except KeyError:
            subjects_subcats[key] = {subcat}
# print(subjects_subcats)

In [7]:
#add tracker to subjects dataframe
def add_subcats(row):
    # join the subject's subcategories into one string ('' if none were recorded)
    return ", ".join(sorted(subjects_subcats.get(row['subject_ID'], set())))
df_subjects['subcats'] = df_subjects.apply(add_subcats, axis = 1)
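
To make the parsing step concrete, here is a small standalone sketch of how trailing-comma strings such as `"4.1,4.2,"` can be collected into one set per subject and then formatted. The `rows` list is invented, and `subcats_by_subject` and `formatted` are hypothetical names, not part of the notebook.

```python
from collections import defaultdict

# Hypothetical (subject_no, list_of_subcats) pairs mimicking the TRR rows.
rows = [(1, "4.2,"), (2, "4.1,4.2,"), (2, "4.2,"), (3, "4.2,3.3,")]

subcats_by_subject = defaultdict(set)
for subject_no, raw in rows:
    # Skip the empty piece left behind by the trailing comma.
    subcats_by_subject[subject_no].update(p for p in raw.split(",") if p)

# Sort before joining so the text does not depend on set ordering.
formatted = {k: ", ".join(sorted(v)) for k, v in subcats_by_subject.items()}
print(formatted)
```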

In [8]:
#take a peek at what we've added
df_subjects.head()

Unnamed: 0,subject_ID,list_of_trr_id,gender,race,age,pb,subcats
0,1.0,45,MALE,HISPANIC,38.0,1.0,4.2
1,2.0,78,MALE,BLACK,25.0,0.0,"4.1, 4.2"
2,3.0,9,MALE,BLACK,24.0,0.0,"3.3, 4.2"
3,4.0,10,MALE,BLACK,21.0,0.0,3.3
4,5.0,1112,MALE,BLACK,21.0,1.0,"3.3, 4.2"


Drop any subjects with null data:

In [9]:
#drop any subjects with null data
#note: DataFrame.size counts cells (rows x columns), not rows
print(df_subjects.size)
df_subjects.dropna(inplace = True)
print(df_subjects.size)

313747
281792
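
The two numbers printed above are cell counts, because `DataFrame.size` multiplies rows by columns. A minimal sketch on an invented frame (`toy` is hypothetical) shows the difference between `size` and `len` before and after `dropna`:

```python
import numpy as np
import pandas as pd

# Invented toy frame: 4 rows x 3 columns, one row has a missing age.
toy = pd.DataFrame({
    "gender": ["MALE", "FEMALE", "MALE", "MALE"],
    "race": ["BLACK", "WHITE", "HISPANIC", "BLACK"],
    "age": [25.0, np.nan, 38.0, 21.0],
})

print(toy.size, len(toy))          # cells vs rows before dropping
cleaned = toy.dropna()             # removes every row containing a NaN
print(cleaned.size, len(cleaned))  # cells vs rows after dropping
```

Using `len(df)` or `df.shape[0]` gives the number of subjects directly.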


Next, we look at how police brutality is distributed across every combination of age, race, and gender.

First, we create the following age bands:

* <20 (Minor)
* 20-29
* 30-39
* 40-49
* 50-59
* 60-69
* 70+

In [10]:
#create age bands and apply to subjects dataframe
def band_age(age):
    if age <20:
        return '<20'
    elif age<30:
        return '20-29'
    elif age<40:
        return '30-39'
    elif age<50:
        return '40-49'
    elif age<60:
        return '50-59'
    elif age<70:
        return '60-69'
    else:
        return '70+'
df_subjects['age_band'] = df_subjects.apply(lambda row: band_age(row['age']), axis = 1)
df_subjects.head()

Unnamed: 0,subject_ID,list_of_trr_id,gender,race,age,pb,subcats,age_band
0,1.0,45,MALE,HISPANIC,38.0,1.0,4.2,30-39
1,2.0,78,MALE,BLACK,25.0,0.0,"4.1, 4.2",20-29
2,3.0,9,MALE,BLACK,24.0,0.0,"3.3, 4.2",20-29
3,4.0,10,MALE,BLACK,21.0,0.0,3.3,20-29
4,5.0,1112,MALE,BLACK,21.0,1.0,"3.3, 4.2",20-29
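
The same banding can be expressed with `pd.cut`. This is a sketch rather than the notebook's method: the `ages` series is invented, and the bin edges are chosen to mirror the `age < 20`, `age < 30`, ... chain in `band_age`.

```python
import pandas as pd

# Invented ages, including values that sit exactly on band boundaries.
ages = pd.Series([17, 19.5, 20, 29, 30, 45, 59, 60, 69, 70, 83])

labels = ["<20", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
edges = [-float("inf"), 20, 30, 40, 50, 60, 70, float("inf")]

# right=False makes each bin [low, high), so 20 falls in '20-29', not '<20'.
bands = pd.cut(ages, bins=edges, labels=labels, right=False)
print(bands.tolist())
```

One difference worth knowing: `pd.cut` leaves a missing age as missing, whereas `band_age` would send a NaN age to the `'70+'` branch because every `<` comparison with NaN is false. Here that does not matter, since rows with null data were dropped beforehand.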


Now, we can count the police brutality subjects in each age, race, and gender combination (rates relative to the number of reported crimes follow further below):

In [11]:
#create counter
age_order = ['<20','20-29', '30-39', '40-49', '50-59', '60-69', '70+']
compound_counts = {}
for gender in df_subjects.gender.unique():
    for race in df_subjects.race.unique():
        for age_range in age_order:
            compound_counts[(gender, race, age_range)] = 0
#fill counter and sort
df_pb = df_subjects.loc[df_subjects['pb'] == 1]
for index, row in df_pb.iterrows():
    label = (row['gender'], row['race'], row['age_band'])
    compound_counts[label] += 1
compound_counts = dict(sorted(compound_counts.items(), key=lambda item: item[1], reverse = True))
#print results
print('NUMBER OF POLICE BRUTALITY INCIDENTS AMONG DEMOGRAPHIC GROUPS (in descending order):')
for (gender, race, age), count in compound_counts.items():
    print(race + " "+gender + "'S AGE " + age+': '+str(count))

NUMBER OF POLICE BRUTALITY INCIDENTS AMONG DEMOGRAPHIC GROUPS (in descending order):
BLACK MALE'S AGE 20-29: 2312
BLACK MALE'S AGE 30-39: 1097
BLACK MALE'S AGE 40-49: 588
BLACK MALE'S AGE <20: 522
HISPANIC MALE'S AGE 20-29: 443
BLACK FEMALE'S AGE 20-29: 401
WHITE MALE'S AGE 20-29: 271
HISPANIC MALE'S AGE 30-39: 226
BLACK MALE'S AGE 50-59: 226
BLACK FEMALE'S AGE 30-39: 194
WHITE MALE'S AGE 30-39: 172
BLACK FEMALE'S AGE 40-49: 125
BLACK FEMALE'S AGE <20: 111
WHITE MALE'S AGE 40-49: 105
HISPANIC MALE'S AGE <20: 96
HISPANIC MALE'S AGE 40-49: 91
HISPANIC FEMALE'S AGE 20-29: 69
WHITE FEMALE'S AGE 20-29: 58
WHITE MALE'S AGE <20: 55
WHITE MALE'S AGE 50-59: 53
BLACK FEMALE'S AGE 50-59: 35
WHITE FEMALE'S AGE 30-39: 34
HISPANIC MALE'S AGE 50-59: 30
HISPANIC FEMALE'S AGE 30-39: 27
BLACK MALE'S AGE 70+: 23
WHITE FEMALE'S AGE 40-49: 22
BLACK MALE'S AGE 60-69: 20
HISPANIC FEMALE'S AGE <20: 19
ASIAN/PACIFIC ISLANDER MALE'S AGE 20-29: 17
WHITE MALE'S AGE 60-69: 17
HISPANIC FEMALE'S AGE 40-49: 16
ASIAN/
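
For comparison, the counting step can be sketched with a `groupby` instead of a hand-built dictionary. The `toy_subjects` frame below is invented; only the column names match the notebook.

```python
import pandas as pd

# Invented subject-level rows; pb == 1 marks a flagged subject.
toy_subjects = pd.DataFrame({
    "gender":   ["MALE", "MALE", "FEMALE", "MALE", "MALE", "FEMALE"],
    "race":     ["BLACK", "BLACK", "BLACK", "WHITE", "BLACK", "WHITE"],
    "age_band": ["20-29", "20-29", "20-29", "30-39", "30-39", "20-29"],
    "pb":       [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
})

# Keep flagged subjects, count them per combination, largest group first.
counts = (
    toy_subjects[toy_subjects["pb"] == 1]
    .groupby(["gender", "race", "age_band"])
    .size()
    .sort_values(ascending=False)
)
print(counts)
```

Unlike the dictionary version, combinations with no flagged subjects simply do not appear in `counts`, because nothing initialises them to 0.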

Next, we group the data by these variables and express each group's count as a percentage of all police brutality incidents:

In [12]:
#create pivot table of each group's brutality count as a percentage of all brutality incidents
tot_pb = len(df_trr.loc[df_trr['pb'] == 1])
# print(tot_pb)
groupings = df_subjects.groupby(['gender','race','age_band']).agg({'pb': 'sum'})
#note: the numerator counts flagged subjects, while tot_pb counts flagged TRR rows
percentages = (groupings / tot_pb) * 100
#note: reindex returns a new object; the result is not assigned, so the table keeps its default sort order
percentages.reindex(['<20','20-29', '30-39', '40-49', '50-59', '60-69', '70+'], level = 'age_band')
percentages.rename(columns = {'pb': "Percent of Total Brutality Incidents"},inplace = True)
percentages

gender,race,age_band,Percent of Total Brutality Incidents
FEMALE,ASIAN/PACIFIC ISLANDER,20-29,0.008222
FEMALE,ASIAN/PACIFIC ISLANDER,30-39,0.041112
FEMALE,ASIAN/PACIFIC ISLANDER,40-49,0.008222
FEMALE,ASIAN/PACIFIC ISLANDER,50-59,0.0
FEMALE,ASIAN/PACIFIC ISLANDER,60-69,0.008222
FEMALE,ASIAN/PACIFIC ISLANDER,70+,0.0
FEMALE,ASIAN/PACIFIC ISLANDER,<20,0.008222
FEMALE,BLACK,20-29,3.297155
FEMALE,BLACK,30-39,1.595132
FEMALE,BLACK,40-49,1.027791
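
In the cell above, the numerator is a count of flagged subjects and the denominator is a count of flagged TRR rows, so the percentages need not add up to 100. As a hedged illustration of an alternative (not the notebook's choice), the sketch below divides by the number of flagged subjects instead; `toy_subjects` and its values are invented.

```python
import pandas as pd

# Invented subject-level flags, grouped by race only to keep the example short.
toy_subjects = pd.DataFrame({
    "race": ["BLACK", "BLACK", "WHITE", "HISPANIC", "BLACK"],
    "pb":   [1.0, 1.0, 1.0, 0.0, 0.0],
})
flagged_by_race = toy_subjects.groupby("race")["pb"].sum()

# Dividing by the total number of flagged subjects makes the shares sum to 100.
share = flagged_by_race / flagged_by_race.sum() * 100
print(share.round(2))
```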


Observations:
* Just as we have seen in our other analyses, Black men (particularly those under 40) experience police brutality more than any other demographic group.
* Native Americans, Asians, and senior women (those older than 60) sit at the bottom of the list. Both this and the previous observation are consistent with the sheer volume of confrontations each demographic group has with police.
* Black citizens younger than 50 account for close to 50% of all police brutality incidents, while White citizens of all ages account for less than 10%.

A more meaningful comparison is the rate of police brutality within each demographic group (i.e. # of police brutality incidents/# of reported crimes for the same demographic group):

*Note: We remove demographic groups with fewer than 20 reported crimes, as we deemed this too small a sample size.*

In [13]:
# examine the percentage of police brutality incidents to reported crimes by demographic groups
print('PERCENTAGE OF REPORTED CRIMES THAT ARE POLICE BRUTALITY INCIDENTS BY DEMOGRAPHIC GROUPS (in descending order):')
compound_pcts = {}
for label, count in compound_counts.items():
    gender, race, age = label
    num_crimes = len(df_subjects.loc[(df_subjects['gender'] == gender) & (df_subjects['race'] == race) & (df_subjects['age_band'] == age)])
    # skip groups with fewer than 20 reported crimes so their raw counts are not printed as percentages
    if num_crimes >= 20:
        compound_pcts[label] = (count/num_crimes)*100
compound_pcts = dict(sorted(compound_pcts.items(), key=lambda item: item[1], reverse = True))
for (gender, race, age), pct in compound_pcts.items():
    print(race + " "+gender + "'S AGE " + age+': '+str(pct)+'%')

PERCENTAGE OF REPORTED CRIMES THAT ARE POLICE BRUTALITY INCIDENTS BY DEMOGRAPHIC GROUPS (in descending order):
WHITE MALE'S AGE <20: 26.82926829268293%
WHITE MALE'S AGE 60-69: 26.153846153846157%
BLACK FEMALE'S AGE 60-69: 26.08695652173913%
WHITE MALE'S AGE 70+: 25.0%
HISPANIC FEMALE'S AGE <20: 22.093023255813954%
ASIAN/PACIFIC ISLANDER MALE'S AGE 30-39: 22.0%
HISPANIC FEMALE'S AGE 20-29: 21.904761904761905%
WHITE MALE'S AGE 30-39: 21.526908635794744%
HISPANIC FEMALE'S AGE 40-49: 21.333333333333336%
BLACK FEMALE'S AGE 40-49: 20.973154362416107%
HISPANIC MALE'S AGE 50-59: 20.689655172413794%
WHITE FEMALE'S AGE 30-39: 20.0%
WHITE FEMALE'S AGE 20-29: 19.93127147766323%
WHITE MALE'S AGE 50-59: 19.850187265917604%
BLACK MALE'S AGE 70+: 19.65811965811966%
WHITE MALE'S AGE 40-49: 19.553072625698324%
BLACK MALE'S AGE 30-39: 19.26251097453907%
BLACK MALE'S AGE 50-59: 19.250425894378196%
BLACK MALE'S AGE 40-49: 19.016817593790428%
HISPANIC FEMALE'S AGE 30-39: 19.014084507042252%
HISPANIC MALE'S 
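
The rate calculation with a minimum sample size can also be sketched with a single grouped aggregation. Everything below is illustrative: `toy_subjects`, the `group` column, and `MIN_SUBJECTS` are hypothetical, and the threshold is lowered from the notebook's 20 so that it fits the toy data.

```python
import pandas as pd

# Invented data: group A has 5 subjects, group B only 2.
toy_subjects = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 2,
    "pb":    [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0],
})
MIN_SUBJECTS = 3  # the notebook uses 20; lowered here to fit the toy data

# One pass gives both the flagged count and the group size.
stats = toy_subjects.groupby("group")["pb"].agg(flagged="sum", total="count")

# Drop small groups first, then compute the rate for the remaining ones.
kept = stats[stats["total"] >= MIN_SUBJECTS].copy()
kept["rate_pct"] = 100 * kept["flagged"] / kept["total"]
print(kept.sort_values("rate_pct", ascending=False))
```

Group B would show a 100% rate from only two subjects, which is the kind of result the threshold is meant to keep out of the ranking.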

Observations:
* White men under 20, White men aged 60-69, and Black women aged 60-69 are victims of police brutality in more than 26% of the cases in which they are reported committing a crime.
* Surprisingly, Black and Hispanic citizens often have lower percentages than their White counterparts of the same age. This would suggest that White citizens have a greater likelihood of being victims of police brutality when confronted by police than Black and Hispanic citizens. However, the difference is marginal, often less than 2 percentage points.
* This was an interesting statistic to look at, as we aimed to measure whether police show a bias in how they treat subjects of certain demographics. A higher percentage would suggest that police treat the given demographic group with more force. However, the results seem to be distorted by demographic groups with small sample sizes, as illustrated by the prevalence of 60-69 year olds near the top of the list relative to their total number of reported crimes.
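
One way to see how much small samples can distort these percentages is to attach an uncertainty range to each rate. The sketch below uses the Wilson score interval, a standard approximation for proportions; it is not part of the original analysis, and the function name `wilson_interval` and the counts passed to it are hypothetical.

```python
import math

def wilson_interval(flagged, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = flagged / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return 100 * (centre - half), 100 * (centre + half)

# Hypothetical counts: the same observed rate (25%) at two sample sizes.
small_low, small_high = wilson_interval(5, 20)
large_low, large_high = wilson_interval(500, 2000)
print(f"n=20:   {small_low:.1f}% to {small_high:.1f}%")
print(f"n=2000: {large_low:.1f}% to {large_high:.1f}%")
```

With 20 subjects the interval spans tens of percentage points, while with 2000 it is only a few points wide, so a ranking that mixes small and large groups should be read with that spread in mind.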

In [14]:
#save dataframe to CSV
df_subjects.to_csv('subjects_with_PB.csv',index=False)