# Money and death

We return to the death penalty.

In [1]:
# Array library.
import numpy as np

# A Numpy random number generator.
rng = np.random.default_rng()

# Data frames library.
import pandas as pd

# Set up plotting
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
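The generator above is created without a seed, so each run of the notebook gives slightly different simulated values. If you want repeatable results, for example while debugging, one option is to pass a seed. This is a minimal sketch; the seed value 2002 is an arbitrary choice for illustration, not something the notebook or its tests require.

```python
import numpy as np

# Two generators created with the same (arbitrary, illustrative) seed.
rng_a = np.random.default_rng(2002)
rng_b = np.random.default_rng(2002)

values = np.array([10, 20, 30, 40, 50])

# Same seed means the same sequence of random draws,
# so both generators give the same permutation.
perm_a = rng_a.permutation(values)
perm_b = rng_b.permutation(values)
print(np.array_equal(perm_a, perm_b))
```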

In [2]:
# Load the OKpy test library and tests.
from client.api.notebook import Notebook
ok = Notebook('money_death_arrays.ok')

In this case, we are going to analyze whether people with higher incomes are
more likely to favor the death penalty.

To do this, we are going to analyze the results from a sample of the US
[General Social Survey](http://www.gss.norc.org) from 2002.

Make sure the data file [GSS2002.csv](https://lisds.github.io/textbook/data/GSS2002.csv) is in the same directory as this notebook. It should already be there if you opened this page from a JupyterHub system.

First we will get the information we need from the data file, using Pandas.

In [3]:
# Just run this cell
# Read the data into a data frame.
gss = pd.read_csv('GSS2002.csv')
# Select columns of interest.
money_death = gss.loc[:, ['Income', 'DeathPenalty']].dropna()
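# Recode income from strings to numbers.
def recode_income(value):
    """ Return numeric income for income band string `value`
    """
    # The lowest band has no lower limit in its label; use 500.
    if value == 'under 1000':
        return 500
    # Other bands are split into 'low-high'; return the midpoint.
    low_str, high_str = value.split('-')
    low, high = int(low_str), int(high_str)
    return np.mean([low, high])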

# Recode income and make it into an array.
income = np.array(money_death['Income'].apply(recode_income))
death = money_death['DeathPenalty']
# Income values for people who favor the death penalty.
favor_income = income[death == 'Favor']
# Income values for people who oppose the death penalty.
oppose_income = income[death == 'Oppose']
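To see what the recoding and the group selection in the cell above do, here is a small sketch on made-up values. The income labels and opinions below are hypothetical, chosen only to match the `'low-high'` pattern the function expects; they are not taken from the GSS file.

```python
import numpy as np

def recode_income(value):
    # Same logic as the recoding function in the cell above.
    if value == 'under 1000':
        return 500
    low_str, high_str = value.split('-')
    low, high = int(low_str), int(high_str)
    return np.mean([low, high])

# Hypothetical income labels and opinions, for illustration only.
example_labels = ['under 1000', '1000-2999', '25000-29999']
example_opinions = np.array(['Favor', 'Oppose', 'Favor'])

# Each band becomes a single number: 500, or the midpoint of the band.
income_toy = np.array([recode_income(label) for label in example_labels])
print(income_toy)

# A Boolean array selects the incomes for one group.
favor_toy = income_toy[example_opinions == 'Favor']
print(favor_toy)
```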

In [4]:
# Show the first 10 values for income in the "Favor" group
favor_income[:10]

In [5]:
# Show the first 10 values for income in the "Oppose" group
oppose_income[:10]

Calculate the difference in mean income between the groups (favor minus
oppose). This is the difference we observe.

In [6]:
actual_diff = np.mean(favor_income) - np.mean(oppose_income)
actual_diff

In [7]:
_ = ok.grade('q_8_actual_diff')

We want to know whether this difference in income is compatible with random
sampling. That is, we want to know whether a difference this large is
plausible, if the incomes are in fact random samples from the same
population.

To estimate how much the mean difference can vary under such random
sampling, we simulate the sampling by pooling the income values from the
two groups and then permuting them.

First, we get the number of respondents in favor of the death penalty.

In [8]:
n_favor = len(favor_income)
n_favor

In [9]:
_ = ok.grade('q_9_n_favor')

Then we pool the income values for the in-favor and oppose groups.

In [10]:
# Concatenate the in-favor and opposed incomes.
pooled = np.concatenate([favor_income, oppose_income])
# Show the first 10 values before shuffling.
pooled[:10]

To do the random sampling we permute the pooled values, so the resulting
`shuffled` vector is a random mixture of the two groups.

In [11]:
shuffled = rng.permutation(pooled)
# Show the first 10 values after shuffling.
shuffled[:10]
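Note that `rng.permutation` returns a shuffled copy and leaves its input in the original order. A minimal sketch with a made-up array (the `toy_` names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng()

toy_pooled = np.array([1, 2, 3, 4, 5, 6])
toy_shuffled = rng.permutation(toy_pooled)

# The original array keeps its order.
print(toy_pooled)
# The shuffled copy has the same values, in random order.
print(toy_shuffled)
```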

Treat the first `n_favor` observations from this shuffled vector as our
simulated in-favor group. The rest are our simulated oppose group.

In [12]:
fake_favor = shuffled[:n_favor]
fake_oppose = shuffled[n_favor:]
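The two slices in the cell above divide the shuffled vector so that every value lands in exactly one group. A small sketch with made-up numbers (`n_first` is a hypothetical group size) shows the pattern:

```python
import numpy as np

toy_shuffled = np.array([30, 10, 50, 20, 40])
n_first = 2  # Hypothetical size of the first group.

# Everything up to, but not including, position n_first.
first_group = toy_shuffled[:n_first]
# Everything from position n_first to the end.
second_group = toy_shuffled[n_first:]

print(first_group, second_group)
# The group sizes add up to the size of the whole vector.
print(len(first_group) + len(second_group) == len(toy_shuffled))
```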

Calculate the difference in means for this simulation.

In [13]:
fake_diff = np.mean(fake_favor) - np.mean(fake_oppose)
fake_diff

Now it is your turn. Do this simulation 10000 times, to build up the
distribution of differences compatible with random sampling.

In [14]:
fake_differences = np.zeros(10000)
for i in np.arange(10000):
    #- Permute the pooled incomes
    shuffled = rng.permutation(pooled)
    #- Make a fake favor sample
    fake_favor = shuffled[:n_favor]
    #- Make a fake opposed sample
    fake_oppose = shuffled[n_favor:]
    #- Calculate the mean difference for the fake samples
    fake_diff = np.mean(fake_favor) - np.mean(fake_oppose)
    #- Put the mean difference into the fake_differences array.
    fake_differences[i] = fake_diff
# Show the first 10 fake differences
fake_differences[:10]

In [15]:
_ = ok.grade('q_10_fake_differences')

When you have that working, plot a histogram of the differences.

In [16]:
plt.hist(fake_differences);

You can get an idea of where the actual difference we saw sits on this
histogram, and therefore how likely that difference is, assuming the incomes
come from the same underlying population of incomes.
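One way to make this comparison easier to see is to mark the observed difference on the histogram. The sketch below uses made-up numbers rather than the notebook's values: `toy_differences` and `toy_actual` are hypothetical stand-ins for `fake_differences` and `actual_diff`.

```python
import matplotlib
# Non-interactive backend so the sketch also runs outside a notebook;
# omit this line when working in a notebook.
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng()

# Made-up simulated differences and a made-up observed difference.
toy_differences = rng.normal(0, 1000, size=1000)
toy_actual = 2500

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(toy_differences)
# Vertical line showing where the observed difference sits.
ax.axvline(toy_actual, color='red', label='Observed difference')
ax.legend()
```

With the notebook's own variables, the same two calls would be `plt.hist(fake_differences)` followed by `plt.axvline(actual_diff, color='red')`.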

To be more specific, show the proportion of the differences you calculated that
were greater than or equal to the actual difference.

In [17]:
p_fake_ge_actual = np.count_nonzero(fake_differences >= actual_diff) / len(fake_differences)
p_fake_ge_actual
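The calculation above counts the `True` values in a Boolean array and divides by the number of simulations. Taking the mean of the Boolean array gives the same proportion, because `True` counts as 1 and `False` as 0. A small sketch with made-up values (the `toy_` names are hypothetical):

```python
import numpy as np

# Made-up differences and a made-up observed difference.
toy_differences = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
toy_actual = 3.0

# True where the simulated difference is at least as large as the observed.
is_ge = toy_differences >= toy_actual

p_by_count = np.count_nonzero(is_ge) / len(toy_differences)
p_by_mean = np.mean(is_ge)
print(p_by_count, p_by_mean)  # 0.375 0.375
```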

In [18]:
_ = ok.grade('q_11_p_fake_ge_actual')

This proportion estimates the probability of seeing a difference at least
this large if the incomes all come from the same underlying population.
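As a recap, the steps in this notebook can be collected into a single function. This is only a sketch under stated assumptions: the function name `permutation_p_value`, its parameters, and the toy incomes are all made up for illustration, and the function is not part of the notebook's tests.

```python
import numpy as np

def permutation_p_value(group_a, group_b, n_iters=10000, rng=None):
    """ Proportion of shuffled mean differences >= the observed difference
    """
    if rng is None:
        rng = np.random.default_rng()
    # Observed difference in means (first group minus second).
    observed = np.mean(group_a) - np.mean(group_b)
    # Pool the two groups, as if they came from one population.
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    fake_differences = np.zeros(n_iters)
    for i in range(n_iters):
        # Shuffle, split into two fake groups, and store the difference.
        shuffled = rng.permutation(pooled)
        fake_differences[i] = np.mean(shuffled[:n_a]) - np.mean(shuffled[n_a:])
    return np.mean(fake_differences >= observed)

# Made-up incomes, for illustration only.
toy_a = np.array([52000., 61000., 47000., 58000., 66000.])
toy_b = np.array([31000., 45000., 39000., 42000., 36000.])

p = permutation_p_value(toy_a, toy_b, n_iters=2000)
print(p)
```

In this toy example every value in `toy_a` is larger than every value in `toy_b`, so very few shuffles give a difference as large as the observed one, and the proportion is small.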

## Done

In [19]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]