# Chicago Police Data Analysis

- authors: Ela Bandari, Elanor Boyle-Stanley, Micah Kwok

# Summary 

In this project we investigate whether there is an association between the number of complaints a police officer in the Chicago Police Department receives per year and the officer's annual salary. To better understand this relationship we constructed and examined three linear regression models: 1) a simple linear regression model with number of complaints per year as the response variable and annual salary as the predictor variable, 2) a multivariable linear regression model with number of complaints per year as the response variable and annual salary and gender of the officer as the predictor variables, and 3) a multivariable linear regression model with number of complaints per year as the response variable and annual salary, gender, race and age of the officer as the predictor variables. We included demographic variables in two of our regression models because they can confound the association between number of complaints per year and annual salary. Based on an examination of the regression coefficients and their p-values, we conclude that there is a small but statistically significant association between the number of complaints an officer receives per year and their annual salary.

# Introduction

In the workforce we tend to associate higher salaries with higher levels of performance, since people who are good at their job tend to get paid more. With this study we wanted to answer an inferential question: is a police staff member's salary associated with the number of complaints they receive per year?

# Methods

## Dataset
The original datasets and documents are sourced from the Chicago Police Department (CPD), the Civilian Office of Police Accountability (COPA), the Independent Police Review Authority (IPRA), or the City of Chicago. However, we are building on data that has been cleaned and matched in a [repository](https://github.com/invinst/chicago-police-data) maintained by the [Invisible Institute](https://invisible.institute/introduction). We have used a subset of the data from the years 2005 to 2015.

## Analysis
Simple and multivariable linear regression models were used to determine whether there was an association between the number of complaints an officer receives per year and the officer's salary. The variables used from the original dataset were number of complaints, salary, and officer demographics (i.e. gender, race and age). The R programming language [CITE R] and the following R packages were used to conduct these analyses: tidyverse, docopt... The code used to perform the analysis and create this report can be found here: https://github.com/UBC-MDS/CPD

[citations required for R packages]
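As a rough illustration of the modelling approach described above, the sketch below fits the three model formulas with base R's `lm()` on simulated stand-in data. The column names mirror the terms that appear in the results tables, but the data values, the sample size, and the set of race levels are assumptions made for illustration only. This is not the project's actual analysis script.

```r
# Simulated stand-in data: values are made up and are NOT the CPD data.
set.seed(1)
n <- 500
officers <- data.frame(
  salary_scaled = runif(n, min = 3.7, max = 12.2),  # assumed: salary in units of $10,000
  gender = factor(sample(c("FEMALE", "MALE"), n, replace = TRUE)),
  race = factor(sample(c("BLACK", "HISPANIC", "WHITE"), n, replace = TRUE)),
  approx_age = round(runif(n, min = 22, max = 65))
)
# Hypothetical response, generated only so the models have something to fit.
officers$complaints_per_year <- pmax(
  0,
  1 + 0.25 * (officers$gender == "MALE") -
    0.02 * (officers$approx_age - 40) + rnorm(n, sd = 0.5)
)

# Model 1: salary only
fit_salary <- lm(complaints_per_year ~ salary_scaled, data = officers)
# Model 2: salary and gender
fit_salary_gender <- lm(complaints_per_year ~ salary_scaled + gender, data = officers)
# Model 3: salary plus all demographic terms
fit_demographics <- lm(
  complaints_per_year ~ salary_scaled + gender + race + approx_age,
  data = officers
)

# Coefficient table: estimate, standard error, t statistic, p-value
coef_table <- summary(fit_demographics)$coefficients
print(round(coef_table, 4))
```

Categorical predictors such as `gender` and `race` are expanded into indicator terms relative to a baseline level, which is why the results tables list rows such as `genderMALE` rather than a single `gender` row.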

# Results and Discussion

To better answer whether a police staff member's salary is associated with their number of complaints per year, we started with a high-level view of the Chicago Police Data and looked at the total complaints per year. Total complaints per year have been falling from 2007 onwards, but part of this decline can be attributed to declining total staff numbers. If we instead look at the complaints per year per officer, the overall trend has not really changed from year to year.
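The difference between total complaints and complaints per officer can be sketched with a small normalisation step. The yearly totals below are invented purely for illustration (they are not CPD figures), and the column names are hypothetical.

```r
# Hypothetical yearly totals: made-up numbers for illustration only.
yearly <- data.frame(
  year = c(2013, 2014, 2015),
  total_complaints = c(6000, 5500, 5000),
  total_officers = c(12000, 11000, 10000)
)

# Normalise by headcount so that years with fewer staff remain comparable.
yearly$complaints_per_officer <- yearly$total_complaints / yearly$total_officers
print(yearly)
```

In this toy example the total falls every year while the per-officer rate stays flat, which is the kind of pattern described above.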

As the majority of complaints filed from 2005-2015 were against those ranked as a Police Officer, we decided to focus our study on this group of staff. 

![officer breakdown](../eda/images/complaints_by_rank.png)

Taking a closer look at this group of Police Officers, we found that their salary ranged from \$36,984 to \$122,100 with a mean of \$73,773. In terms of age, the average police officer in 2015 was about 41 years old, and ages were approximately normally distributed. These officers had about 12 years of service on average, but this distribution was harder to characterize, with a number of officers having less than seven years of service. In the gender and race bar charts we can see that the majority of police officers are male and white. By looking at just one year, 2015, we avoid double-counting officers and get an understanding of the distributions within the Police Officer sub-group. We chose 2015 as it was relatively recent and represented the other years quite well.

With a better understanding of our group of interest, we generated a correlation matrix to see if there were any strong correlations between salary, age, years of service and complaints per year. While our initial inferential question focuses on salary and complaints per year, knowing that there may be more variables accounting for a change in complaints per year led us to add these other variables. From this matrix we saw a relatively small negative coefficient between salary and complaints per year and proceeded with our analysis.

![correlation matrix](../eda/images/corr_matrix.png)
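A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses simulated values and hypothetical column names for the four variables mentioned above. It only illustrates the mechanics and does not reproduce the figure.

```r
# Simulated stand-in data: values are made up and are NOT the CPD data.
set.seed(2)
n <- 300
approx_age <- round(runif(n, min = 22, max = 65))
years_of_service <- pmax(0, approx_age - 25 + rnorm(n, sd = 3))
salary <- 45000 + 1500 * years_of_service + rnorm(n, sd = 4000)
complaints_per_year <- pmax(0, 1.5 - 0.02 * approx_age + rnorm(n, sd = 0.4))

officers <- data.frame(salary, approx_age, years_of_service, complaints_per_year)

# Pairwise Pearson correlations between all numeric columns
corr_matrix <- cor(officers, method = "pearson")
print(round(corr_matrix, 2))
```

Strong correlations among the predictors themselves (for example between age and years of service) are worth noting at this stage, since they affect how individual regression coefficients can be interpreted later.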


Upon completing our EDA, we moved on to constructing our linear models. 

### Simple Linear Regression Model
Our first linear model is a simple linear model described by the following equation:

$$ \texttt{complaints_per_year}_i = \beta_0 + \beta_1 \times \texttt{salary_scaled}_i + \varepsilon_i $$
 
The coefficients of this model and their p-values, loaded with `readRDS('../results/salary_reg.rds')`, are presented below:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `(Intercept)` | 1.20266217 | 0.02370954 | 50.72482 | 0.0 |
| `salary_scaled` | -0.09258197 | 0.003185082 | -29.06737 | 5.161322e-185 |


The fitted regression line is shown in the graph below:

![salary_reg](../results/salary_reg.png)

### Interpretation of Simple Regression Model 
Our simple regression model indicates that a \$10,000 increase in annual salary (one unit of `salary_scaled`) is associated with a decrease of about 0.093 complaints received by a police officer per year (estimate -0.09258197).
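As a worked example of reading the fitted line, assume that `salary_scaled` is the annual salary divided by \$10,000 (inferred from the "in \$10,000" unit noted above). Plugging the mean Police Officer salary of \$73,773 reported earlier into the simple model with rounded coefficients gives:

$$ \widehat{\texttt{complaints_per_year}} \approx 1.2027 - 0.0926 \times 7.3773 \approx 0.52 $$

Under the same assumption, two officers whose salaries differ by \$10,000 differ in predicted complaints by the slope alone:

$$ \Delta \widehat{\texttt{complaints_per_year}} = \beta_1 \times 1 \approx -0.0926 $$

In other words, the higher-paid officer is predicted to receive roughly 0.09 fewer complaints per year.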

### Multivariable Regression Models
In addition to the simple regression model, we also considered multivariable linear regression models that take into account confounding effects of demographic factors.

Our second model includes a single demographic term, gender, which seemed to have explanatory power according to our EDA. It is described by the equation below:

$$ \texttt{complaints_per_year}_i = \beta_0 + \beta_1 \times \texttt{salary_scaled}_i + \beta_2 \times \texttt{gender}_i + \varepsilon_i $$

The coefficients of this model and their p-values, loaded with `readRDS('../results/salary_gender_reg.rds')`, are presented below:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `(Intercept)` | 1.00206754 | 0.02414633 | 41.49979 | 0.0 |
| `salary_scaled` | -0.09277928 | 0.00316361 | -29.32703 | 2.775135e-188 |
| `genderMALE` | 0.27118799 | 0.007211993 | 37.60236 | 2.348148e-307 |


Our third and final model includes all demographic terms (i.e. gender, race and age) and is described by the equation below:

$$ \texttt{complaints_per_year}_i = \beta_0 + \beta_1 \times \texttt{salary_scaled}_i + \beta_2 \times \texttt{gender}_i + \beta_3 \times \texttt{race}_i +  \beta_4 \times \texttt{approx_age}_i + \varepsilon_i $$

The coefficients of this model and their p-values, loaded with `readRDS('../results/salary_demographics_reg.rds')`, are presented below:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `(Intercept)` | 0.95097889 | 0.0299785255 | 31.722003 | 8.778627e-220 |
| `salary_scaled` | 0.03387924 | 0.0038812607 | 8.728927 | 2.608162e-18 |
| `genderMALE` | 0.2505616 | 0.0072013319 | 34.793786 | 1.011562e-263 |
| `raceBLACK` | 0.15221156 | 0.01991035 | 7.644846 | 2.109832e-14 |
| `raceHISPANIC` | 0.09516874 | 0.0201034636 | 4.733947 | 2.20485e-06 |
| `raceNATIVE AMERICAN/ALASKAN NATIVE` | 0.16923847 | 0.0546886065 | 3.094584 | 0.001971417 |
| `raceWHITE` | 0.10474518 | 0.0194400383 | 5.388116 | 7.135482e-08 |
| `approx_age` | -0.0234391 | 0.0004257525 | -55.053341 | 0.0 |
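Note that the `salary_scaled` estimate is negative in the first two models but positive once age is included. One way such a sign change can arise is when a covariate is related to both the predictor and the response. The sketch below demonstrates the mechanism on simulated data. The data-generating assumptions (salary rising with age, complaints falling with age) are invented for illustration and are not claims about the CPD data.

```r
# Simulated illustration of confounding: NOT the CPD data.
set.seed(3)
n <- 2000
approx_age <- runif(n, min = 22, max = 65)

# Assumption for illustration: salary tends to rise with age.
salary_scaled <- 3 + 0.1 * approx_age + rnorm(n, sd = 0.5)

# Assumption for illustration: complaints fall with age, and rise slightly
# with salary when age is held fixed.
complaints_per_year <- 2 - 0.03 * approx_age + 0.1 * salary_scaled +
  rnorm(n, sd = 0.3)

unadjusted <- lm(complaints_per_year ~ salary_scaled)
adjusted <- lm(complaints_per_year ~ salary_scaled + approx_age)

slope_unadjusted <- unname(coef(unadjusted)["salary_scaled"])
slope_adjusted <- unname(coef(adjusted)["salary_scaled"])

# The unadjusted slope absorbs the age effect and comes out negative;
# the age-adjusted slope recovers the positive within-age relationship.
print(c(unadjusted = slope_unadjusted, adjusted = slope_adjusted))
```

Whether this mechanism explains the change seen in the tables above would need to be checked against the real data, for example by examining how salary and age are related among the officers studied.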


We will be evaluating the performance of these models in the next iteration of our project to see which model best describes our data.
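One possible way to carry out such a comparison, shown here only as a sketch and not necessarily the approach the project will take, is to compare the nested models with F-tests via `anova()` and with information criteria via `AIC()`. The data are simulated, and race is omitted to keep the example short.

```r
# Simulated stand-in data: values are made up and are NOT the CPD data.
set.seed(4)
n <- 400
officers <- data.frame(
  salary_scaled = runif(n, min = 3.7, max = 12.2),
  gender = factor(sample(c("FEMALE", "MALE"), n, replace = TRUE)),
  approx_age = round(runif(n, min = 22, max = 65))
)
officers$complaints_per_year <- 1.5 + 0.25 * (officers$gender == "MALE") -
  0.02 * officers$approx_age + rnorm(n, sd = 0.4)

# Build the nested sequence of models by adding one term at a time.
fit_salary <- lm(complaints_per_year ~ salary_scaled, data = officers)
fit_salary_gender <- update(fit_salary, . ~ . + gender)
fit_salary_gender_age <- update(fit_salary_gender, . ~ . + approx_age)

# Sequential F-tests: does each added term reduce residual variance?
nested_f_tests <- anova(fit_salary, fit_salary_gender, fit_salary_gender_age)
print(nested_f_tests)

# AIC penalises extra parameters; lower values indicate a better trade-off.
aic_table <- AIC(fit_salary, fit_salary_gender, fit_salary_gender_age)
print(aic_table)
```

Both tools assume the usual linear-model conditions hold, so their output should be read with caution where those assumptions are in doubt.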

We also acknowledge that our data does not satisfy all the assumptions of linear regression; we will therefore interpret the results of our analysis with caution.

# References
[to do]