# AME project 3


# Racial differences in police use of force

The goal of this project is to investigate whether there are racial differences in police use of force. We analyse this question using three binary response models: the Linear Probability Model (LPM), the Probit model, and the Logit model.
Binary response models are relevant when the dependent variable $y$ has two possible outcomes, 
e.g., $y=1$ if a police encounter resulted in any use of force by the police officer(s), and $y=0$ if it did not.


## Theory

The binary response model assumes that the data generating process is 
$$
\begin{aligned}
y_i^* &= \mathbf{x}_i \boldsymbol{\beta} + u_i, \\ 
y_i   &= \mathbf{1}(y_i^* > 0), 
\end{aligned}
$$
where the $u_i$ are IID with CDF $G$. 

In the lectures, it is shown that
$$ p_i \equiv \Pr(y_i = 1| \mathbf{x}_i) = G(\mathbf{x}_i \boldsymbol{\beta}). $$ 
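
A sketch of the argument, under two assumptions that are not spelled out above: $u_i$ is independent of $\mathbf{x}_i$, and its distribution is continuous and symmetric around zero, so that $1 - G(-z) = G(z)$ (this holds for both the standard normal and the standard logistic distribution):
$$
\begin{aligned}
\Pr(y_i = 1| \mathbf{x}_i) &= \Pr(y_i^* > 0 | \mathbf{x}_i) \\
                           &= \Pr(u_i > -\mathbf{x}_i \boldsymbol{\beta} | \mathbf{x}_i) \\
                           &= 1 - G(-\mathbf{x}_i \boldsymbol{\beta}) \\
                           &= G(\mathbf{x}_i \boldsymbol{\beta}).
\end{aligned}
$$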

Since $y_i$ (conditional on $\mathbf{x}_i$) is Bernoulli-distributed with parameter $p_i$, its log-likelihood contribution is 
$$
\ell_i(\boldsymbol{\theta}) 
               = \mathbf{1}(y_i = 1) \log[ G(\mathbf{x}_i \boldsymbol{\beta}) ]
               + \mathbf{1}(y_i = 0) \log[1 - G(\mathbf{x}_i \boldsymbol{\beta})],
$$
where $\boldsymbol{\theta} = \boldsymbol{\beta}$ is the parameter vector to be estimated.

Estimation is then conducted by maximum likelihood, 
$$ \hat{\boldsymbol{\theta}} = \arg\max_\theta \frac{1}{N} \sum_{i=1}^N \ell_i (\theta). $$ 
This fits the usual $M$-estimation framework: define $q(\theta, y_i, x_i) = -\ell_i(\theta)$ and minimize $Q(\theta) = N^{-1} \sum_i q(\theta, y_i, x_i)$. 
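
The criterion above can be sketched in a few lines. This is only an illustration on simulated data: `q_probit`, `Q`, and the `_sim` variables are hypothetical names, and the code is not the `probit.q` or `est.estimate` implementation used later, whose internals are not shown here. It assumes a Probit link and uses `scipy.optimize.minimize` with BFGS as the minimizer.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def q_probit(theta, y, x):
    """Per-observation criterion q = -log-likelihood (hypothetical helper)."""
    p = norm.cdf(x @ theta)
    # Clip probabilities away from 0 and 1 so the logs stay finite
    p = np.clip(p, 1e-8, 1.0 - 1e-8)
    ll = (y == 1) * np.log(p) + (y == 0) * np.log(1.0 - p)
    return -ll

def Q(theta, y, x):
    """Sample criterion: the average of q over observations."""
    return np.mean(q_probit(theta, y, x))

# Simulate data from the latent-variable model with a known beta
rng = np.random.default_rng(0)
N_sim = 5000
x_sim = np.column_stack([np.ones(N_sim), rng.normal(size=N_sim)])
beta_true = np.array([-0.5, 1.0])
y_sim = (x_sim @ beta_true + rng.normal(size=N_sim) > 0).astype(int)

# Minimize Q starting from zeros; theta_hat should land close to beta_true
res = minimize(Q, np.zeros(2), args=(y_sim, x_sim), method='BFGS')
theta_hat = res.x
print(theta_hat)
```

Clipping the probabilities is one simple way to keep the logarithms finite when $G(\mathbf{x}_i \boldsymbol{\beta})$ is numerically 0 or 1; other safeguards are possible.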

We will consider two models: 
1. Probit: when $G$ is the standard normal CDF 
2. Logit: when $G$ is the standard logistic CDF. 

We compare both to OLS, which is called the Linear Probability Model (LPM) when $y_i$ is binary, as it is here. 

## Setup 

In [1]:
from sys import path

import numpy as np
import pandas as pd 
from scipy.stats import norm
import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set_theme()

%load_ext autoreload
%autoreload 2

import estimation as est 
import LinearModel as lm
import probit
import logit

## Data




In [2]:
dat = pd.read_csv('ppcs_cc.csv')
print(f'The data contains {dat.shape[0]} rows (encounters) and {dat.shape[1]} columns (variables).')

The data contains 3799 rows (encounters) and 19 columns (variables).


In [3]:
#Summary statistics for variables  
print(dat.describe())

#Possible values of each variable
print('Possible values of each variable:')
for col in dat.columns:
    print(f'{col}: {dat[col].unique()}')


            sblack        shisp       swhite       sother        smale  \
count  3799.000000  3799.000000  3799.000000  3799.000000  3799.000000   
mean      0.110555     0.101606     0.739142     0.048697     0.529613   
std       0.313622     0.302169     0.439160     0.215262     0.499188   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.000000     0.000000     0.000000     0.000000     0.000000   
50%       0.000000     0.000000     1.000000     0.000000     1.000000   
75%       0.000000     0.000000     1.000000     0.000000     1.000000   
max       1.000000     1.000000     1.000000     1.000000     1.000000   

              sage        sempl      sincome         spop      daytime  \
count  3799.000000  3799.000000  3799.000000  3799.000000  3799.000000   
mean     41.010003     0.695446     2.164780     1.362727     0.666491   
std      16.146916     0.460279     0.848262     0.765598     0.471529   
min      16.000000     0.000000     1

In [22]:
#Exclude the variables with zero variance
data = dat.drop(columns=['osplit', 'year'])

#Transform the variables 'sincome', 'spop', 'inctype_lin' to dummy variables
data = pd.get_dummies(data, columns=['inctype_lin', 'sincome', 'spop'], drop_first=True, dtype=int)

#Adjust names of new dummies
data.rename(columns={'inctype_lin_2': 'traffic_stop', 'sincome_2': 'sincome_20-50', 'sincome_3': 'sincome_above50', 'spop_2': 'spop_100-499', 'spop_3':'spop_500-999', 'spop_4': 'spop_above1000' }, inplace=True)

print(f'The data now contains {data.shape[0]} rows (encounters) and {data.shape[1]} columns (variables).')

#Summary statistics for variables  
print(data.describe())

print('Possible values of each variable:')
for col in data.columns:
    print(f'{col}: {data[col].unique()}')


The data now contains 3799 rows (encounters) and 20 columns (variables).
            sblack        shisp       swhite       sother        smale  \
count  3799.000000  3799.000000  3799.000000  3799.000000  3799.000000   
mean      0.110555     0.101606     0.739142     0.048697     0.529613   
std       0.313622     0.302169     0.439160     0.215262     0.499188   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.000000     0.000000     0.000000     0.000000     0.000000   
50%       0.000000     0.000000     1.000000     0.000000     1.000000   
75%       0.000000     0.000000     1.000000     0.000000     1.000000   
max       1.000000     1.000000     1.000000     1.000000     1.000000   

              sage        sempl      daytime    omajblack     omajhisp  \
count  3799.000000  3799.000000  3799.000000  3799.000000  3799.000000   
mean     41.010003     0.695446     0.666491     0.060805     0.023954   
std      16.146916     0.460279     0.

In [30]:
# Declare labels (NOTE: we cannot include all variables, since the Probit and Logit models otherwise fail to converge)
y_lab = 'anyuseofforce_coded' # dependent variable
x_lab = ['const', 
         'swhite', 'shisp','sother', #sblack is the reference category
         'smale', 'sempl', 'sbehavior',
      #   'sage', 'sagesq',
      #   'sincome_20-50', 'sincome_above50', #sincome_0-20 is the reference category
      #   'spop_100-499', 'spop_500-999', 'spop_above1000', #spop_0-100 is the reference category
         'omajwhite', #'omajhisp', 'omajother', 
         'daytime', 'traffic_stop'
        ]

In [31]:
N = data.shape[0]

# create extra variables 
data['const'] = np.ones((N,))
#data['sagesq'] = data.sage * data.sage

# reorder columns 
data = data[[y_lab] + x_lab].copy()

data.head(5)

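| | anyuseofforce_coded | const | swhite | shisp | sother | smale | sempl | sbehavior | omajwhite | daytime | traffic_stop |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1.0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1.0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 1.0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 4 | 0 | 1.0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |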


In [32]:
y = data[y_lab].values
x = data[x_lab].values
K = x.shape[1]

# Question 1: Estimate model using LPM
We model police use of force using the LPM, which we estimate by OLS with the provided `lm` module and print as a table. Because the dependent variable is binary, the error variance depends on $\mathbf{x}_i$ by construction, so we use heteroscedasticity-robust standard errors. 
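
As an illustration of what robust standard errors involve, the sandwich formula can be sketched as follows. This is a minimal sketch on simulated data using the HC0 (White) variant; `ols_robust` is a hypothetical helper, and which variant `lm.estimate(..., robust_se=True)` actually implements is not shown here.

```python
import numpy as np

def ols_robust(y, x):
    """OLS coefficients with HC0 (White) standard errors (hypothetical helper)."""
    xtx_inv = np.linalg.inv(x.T @ x)
    b_hat = xtx_inv @ (x.T @ y)
    resid = y - x @ b_hat
    # "Meat" of the sandwich: sum_i resid_i^2 * x_i' x_i
    meat = (x * resid[:, None] ** 2).T @ x
    cov = xtx_inv @ meat @ xtx_inv
    se = np.sqrt(np.diag(cov))
    return b_hat, se

# Simulated LPM: P(y = 1 | d) = 0.1 + 0.2 * d for a binary regressor d
rng = np.random.default_rng(1)
N_sim = 2000
x_sim = np.column_stack([np.ones(N_sim), rng.integers(0, 2, size=N_sim)])
p_sim = 0.1 + 0.2 * x_sim[:, 1]
y_sim = (rng.uniform(size=N_sim) < p_sim).astype(float)

b_hat, se = ols_robust(y_sim, x_sim)
print(b_hat, se)
```

With a constant and a single dummy, the LPM is saturated: the intercept equals the share of $y=1$ in the reference group and the slope is the difference in shares between the two groups.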

In [33]:
ols_results = lm.estimate(y, x, robust_se=True)
ols_tab = lm.print_table((y_lab, x_lab), ols_results, title='LPM results')
ols_tab

LPM results
Dependent variable: anyuseofforce_coded

R2 = 0.030
sigma2 = nan


| | b_hat | se | t |
|---|---|---|---|
| const | 0.0318 | 0.0162 | 1.9655 |
| swhite | -0.0038 | 0.0045 | -0.8432 |
| shisp | 0.0076 | 0.0075 | 1.0146 |
| sother | -0.0027 | 0.0068 | -0.3877 |
| smale | 0.0042 | 0.0022 | 1.8663 |
| sempl | -0.0042 | 0.0029 | -1.4578 |
| sbehavior | 0.0355 | 0.0123 | 2.892 |
| omajwhite | 0.0039 | 0.0031 | 1.262 |
| daytime | -0.0021 | 0.0029 | -0.7475 |
| traffic_stop | -0.0295 | 0.0151 | -1.9493 |


# The Probit model

The Probit model has the link function 
$$G^{\text{probit}}(\mathbf{x}_i \boldsymbol{\beta}) 
    =\Phi(\mathbf{x}_i \boldsymbol{\beta})
    \equiv \int_{-\infty}^{\mathbf{x}_i \boldsymbol{\beta}}\phi\left(z\right) \, \mathrm{d} z$$

where $\phi\left(z\right)= (2 \pi)^{-\frac12}\exp\left(-\frac{z^{2}}{2}\right)$ is the standard normal pdf. As starting values, we can use scaled OLS estimates, $\boldsymbol{\theta}_0 = 2.5\hat{\boldsymbol{\beta}}^{OLS}$ (as will become clear later). 
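
The scaling rule can be sketched as below. `starting_values_sketch` is a hypothetical stand-in and not the code of `probit.starting_values`, which is not shown here; the values that function returns in the next cell are, however, consistent with 2.5 times the LPM coefficients reported above (e.g. $2.5 \times 0.0318 \approx 0.079$ for the constant).

```python
import numpy as np

def starting_values_sketch(y, x, scale=2.5):
    """Scaled OLS coefficients as starting values (hypothetical helper)."""
    b_ols = np.linalg.lstsq(x, y, rcond=None)[0]
    return scale * b_ols

# Tiny worked example: the OLS fit is y = -0.1 + 0.4 * x1
x_demo = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])
theta0_demo = starting_values_sketch(y_demo, x_demo)
print(theta0_demo)  # approximately [-0.25, 1.0]
```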

In [34]:
theta0 = probit.starting_values(y, x)
theta0

array([ 0.07942711, -0.00944949,  0.01893092, -0.00662781,  0.01041014,
       -0.01051959,  0.08882608,  0.00968295, -0.00537087, -0.07375052])

In [35]:
probit_results = est.estimate(probit.q, theta0, y, x)

Optimization terminated successfully.
         Current function value: 0.023693
         Iterations: 94
         Function evaluations: 1045
         Gradient evaluations: 95


In [36]:
probit_tab = est.print_table(x_lab, probit_results, title=f'Probit, y = {y_lab}')
probit_tab

Optimizer succeded after 94 iter. (1045 func. evals.). Final criterion:  0.02369.
Probit, y = anyuseofforce_coded


| | theta | se | t |
|---|---|---|---|
| const | -2.4895 | 0.8284 | -3.005 |
| swhite | -0.3214 | 0.3321 | -0.9677 |
| shisp | 0.2518 | 0.3875 | 0.6499 |
| sother | -0.1615 | 0.5244 | -0.3079 |
| smale | 0.5217 | 0.3497 | 1.4918 |
| sempl | -0.3252 | 0.2181 | -1.4908 |
| sbehavior | 1.0718 | 0.247 | 4.3391 |
| omajwhite | 0.4799 | 0.7749 | 0.6193 |
| daytime | -0.2045 | 0.2624 | -0.7791 |
| traffic_stop | -0.7643 | 0.2673 | -2.8597 |


In [None]:
probit.G(x @ probit_results['theta']).mean()

0.005027919096513242

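What is the interpretation of $\overline{G(\mathbf{x}\boldsymbol{\beta})}$?

Answer: It is the average predicted probability of police use of force across the encounters in the sample. Here it is about 0.005, i.e. the Probit model predicts use of force in roughly 0.5% of encounters on average.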

# The Logit model

For the Logit model, the link function is 

$$G^{\text{logit}}( \mathbf{x}_i \boldsymbol{\beta} ) = \Lambda(\mathbf{x}_i \boldsymbol{\beta}) \equiv  \frac{\exp(\mathbf{x}_i \boldsymbol{\beta})}{1+\exp(\mathbf{x}_i \boldsymbol{\beta})}= \frac{1}{1+\exp(-\mathbf{x}_i \boldsymbol{\beta})} \tag{2}$$
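
The two algebraically equivalent forms of $\Lambda$ behave differently in floating point: $\exp(z)$ overflows for large positive $z$. One way to evaluate the link safely is to switch form depending on the sign of $z$. The sketch below illustrates this; `G_logit` is a hypothetical helper and not necessarily how `logit.G` is written.

```python
import numpy as np

def G_logit(z):
    """Logistic CDF, evaluated without overflow (hypothetical helper)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0 use 1 / (1 + exp(-z)); exp(-z) cannot overflow here
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0 use exp(z) / (1 + exp(z)); exp(z) cannot overflow here
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z_demo = np.array([-2.0, 0.0, 2.0])
form_a = np.exp(z_demo) / (1.0 + np.exp(z_demo))
form_b = 1.0 / (1.0 + np.exp(-z_demo))
print(G_logit(z_demo), form_a, form_b)
```

`scipy.special.expit` computes the same function and is a reasonable alternative to a hand-written version.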

In [38]:
theta0 = logit.starting_values(y, x)
theta0 

array([ 0.12708338, -0.01511918,  0.03028947, -0.01060449,  0.01665622,
       -0.01683135,  0.14212173,  0.01549271, -0.00859339, -0.11800083])

In [39]:
logit_results = est.estimate(logit.q, theta0, y, x)

Optimization terminated successfully.
         Current function value: 0.024060
         Iterations: 106
         Function evaluations: 1177
         Gradient evaluations: 107


In [40]:
logit_tab = est.print_table(x_lab, logit_results, title=f'Logit, y = {y_lab}')
logit_tab

Optimizer succeded after 106 iter. (1177 func. evals.). Final criterion:  0.02406.
Logit, y = anyuseofforce_coded


| | theta | se | t |
|---|---|---|---|
| const | -4.8524 | 1.7828 | -2.7217 |
| swhite | -0.8197 | 0.837 | -0.9793 |
| shisp | 0.5725 | 0.8857 | 0.6464 |
| sother | -0.4807 | 1.3501 | -0.3561 |
| smale | 1.1402 | 0.7959 | 1.4327 |
| sempl | -0.7822 | 0.5241 | -1.4924 |
| sbehavior | 2.5689 | 0.5703 | 4.5045 |
| omajwhite | 0.9362 | 1.5807 | 0.5923 |
| daytime | -0.5391 | 0.6151 | -0.8764 |
| traffic_stop | -1.7334 | 0.6005 | -2.8864 |
