In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.optimize import curve_fit, minimize

# Question 1

Risk is defined as

$$\text{Risk} = \hat{P}(\text{failure}) = \frac{N_{\text{failure}}}{N_{\text{failure}}+N_{\text{success}}}$$

where $\hat{P}$ is an estimator for this probability. The relative risk between two groups is defined as

$$\text{Relative Risk} = \frac{\text{Risk}_{\text{control}}}{\text{Risk}_{\text{experimental}}}$$

In [2]:
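def compute_risks(arr):
    # arr is a 2x2 table: rows = (control, experimental), columns = (failure, success)
    risk_con = arr[0, 0] / np.sum(arr[0])
    risk_exp = arr[1, 0] / np.sum(arr[1])
    # Relative risk: control risk divided by experimental risk
    rr = risk_con / risk_exp
    return risk_con, risk_exp, rr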

def setup_tab(arr, risks):
    # Collect the counts, the two risks, and the relative risk into one display table
    di = {'Failure': arr[:, 0],
          'Success': arr[:, 1],
          'Risk': [risks[0], risks[1]],
          'Relative Risk': [risks[2], np.nan]}

    return pd.DataFrame(di, index=['Control', 'Experimental'])

Define the tables (rows: control, experimental; columns: failure, success):

In [3]:
males = np.array([[80,120],[20,80]])
females = np.array([[40,160],[10,90]])
tot = males+females

Compute all relative risks:

In [4]:
risks_males = compute_risks(males)
risks_females = compute_risks(females)
risks_tot = compute_risks(tot)

**Men**

In [5]:
setup_tab(males, risks_males)

| | Failure | Success | Risk | Relative Risk |
|---|---|---|---|---|
| Control | 80 | 120 | 0.4 | 2.0 |
| Experimental | 20 | 80 | 0.2 | NaN |


**Women**

In [6]:
setup_tab(females, risks_females)

| | Failure | Success | Risk | Relative Risk |
|---|---|---|---|---|
| Control | 40 | 160 | 0.2 | 2.0 |
| Experimental | 10 | 90 | 0.1 | NaN |


**Total**

In [7]:
setup_tab(tot, risks_tot)

| | Failure | Success | Risk | Relative Risk |
|---|---|---|---|---|
| Control | 120 | 280 | 0.3 | 2.0 |
| Experimental | 30 | 170 | 0.15 | NaN |


The relative risk is 2.0 in each stratum and in the pooled table, so the stratification variable (gender) is

* **not** a confounder because it doesn't satisfy condition (i): men make up half of the control group (200 of 400) and half of the experimental group (100 of 200), so gender is not associated with group assignment.
* **not** an interactive variable (with respect to relative risk) because the relative risk is the same in each group.
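
Condition (i) can be checked directly from the counts by comparing the share of men in each group. The following is a minimal sketch using the Question 1 tables; the helper name `share_of_men` is hypothetical and not part of the notebook's own functions.

```python
import numpy as np

# Question 1 tables: rows = (control, experimental), columns = (failure, success)
males = np.array([[80, 120], [20, 80]])
females = np.array([[40, 160], [10, 90]])

def share_of_men(males, females):
    """Fraction of each group (control, experimental) that is male (hypothetical helper)."""
    n_men = males.sum(axis=1)      # men per group
    n_women = females.sum(axis=1)  # women per group
    return n_men / (n_men + n_women)

shares = share_of_men(males, females)
# Equal shares mean gender is not associated with group assignment
print(shares)
```

If the two entries differ noticeably, gender is unevenly distributed between the groups and condition (i) would be satisfied.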

# Question 2

Set up the tables as in Question 1:

In [8]:
males = np.array([[80,120],[20,80]])
females = np.array([[80,20],[80,120]])
tot = males+females

Get risks:

In [9]:
risks_males = compute_risks(males)
risks_females = compute_risks(females)
risks_tot = compute_risks(tot)

**Men**

In [10]:
setup_tab(males, risks_males)

| | Failure | Success | Risk | Relative Risk |
|---|---|---|---|---|
| Control | 80 | 120 | 0.4 | 2.0 |
| Experimental | 20 | 80 | 0.2 | NaN |


**Women**

In [11]:
setup_tab(females, risks_females)

| | Failure | Success | Risk | Relative Risk |
|---|---|---|---|---|
| Control | 80 | 20 | 0.8 | 2.0 |
| Experimental | 80 | 120 | 0.4 | NaN |


**Total**

In [12]:
setup_tab(tot, risks_tot)

| | Failure | Success | Risk | Relative Risk |
|---|---|---|---|---|
| Control | 160 | 140 | 0.533333 | 1.6 |
| Experimental | 100 | 200 | 0.333333 | NaN |


The third variable (gender) in this situation is a **confounder** because it satisfies (i) (there are 300 men and 300 women overall, *but* 200 of the 300 people in the experimental group are women and 200 of the 300 people in the control group are men) and (ii) (within both the control and the experimental group, the risk for women is twice the risk for men). As a result, the pooled relative risk (1.6) differs from the relative risk of 2.0 found in each stratum.
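
One common way to remove the effect of a confounder is to combine the strata with Mantel-Haenszel weights instead of simply adding the tables. This is a swapped-in technique rather than something the question asks for, and `mantel_haenszel_rr` is a hypothetical helper written for this sketch. It keeps the convention used above of control risk divided by experimental risk.

```python
import numpy as np

# Question 2 tables: rows = (control, experimental), columns = (failure, success)
males = np.array([[80, 120], [20, 80]])
females = np.array([[80, 20], [80, 120]])

def mantel_haenszel_rr(strata):
    """Mantel-Haenszel pooled relative risk (control / experimental). Hypothetical helper."""
    num = 0.0
    den = 0.0
    for tab in strata:
        n_con, n_exp = tab.sum(axis=1)    # group sizes within this stratum
        n_tot = n_con + n_exp
        num += tab[0, 0] * n_exp / n_tot  # control failures, weighted
        den += tab[1, 0] * n_con / n_tot  # experimental failures, weighted
    return num / den

# Crude relative risk from the pooled table, for comparison
tot = males + females
crude_rr = (tot[0, 0] / tot[0].sum()) / (tot[1, 0] / tot[1].sum())
adjusted_rr = mantel_haenszel_rr([males, females])
print(crude_rr, adjusted_rr)
```

A gap between the crude and the stratum-adjusted value is the usual numerical symptom of confounding.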

# Question 3

Set up the tables as in Question 1:

In [13]:
males = np.array([[40,160],[10,190]])
females = np.array([[10,90],[5,95]])
tot = males+females

Get risks:

In [14]:
risks_males = compute_risks(males)
risks_females = compute_risks(females)
risks_tot = compute_risks(tot)

**Men**

In [15]:
setup_tab(males, risks_males)

| | Failure | Success | Risk | Relative Risk |
|---|---|---|---|---|
| Control | 40 | 160 | 0.2 | 4.0 |
| Experimental | 10 | 190 | 0.05 | NaN |


**Women**

In [16]:
setup_tab(females, risks_females)

| | Failure | Success | Risk | Relative Risk |
|---|---|---|---|---|
| Control | 10 | 90 | 0.1 | 2.0 |
| Experimental | 5 | 95 | 0.05 | NaN |


**Total**

In [17]:
setup_tab(tot, risks_tot)

| | Failure | Success | Risk | Relative Risk |
|---|---|---|---|---|
| Control | 50 | 250 | 0.166667 | 3.333333 |
| Experimental | 15 | 285 | 0.05 | NaN |


The third variable (gender) in this situation is **not** a confounder because it doesn't satisfy condition (i): men make up the same fraction (two thirds) of both the control and the experimental group. It **is**, however, an interactive variable because the relative risk differs between the strata (4.0 for men, 2.0 for women).
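
The pooled relative risk of about 3.33 lies between the two stratum values because each pooled risk is a weighted average of the stratum risks, with the share of men in that group as the weight. A small sketch with the Question 3 counts (the names `risks`, `w_m`, and `pooled_risk` are illustrative only):

```python
import numpy as np

# Question 3 tables: rows = (control, experimental), columns = (failure, success)
males = np.array([[40, 160], [10, 190]])
females = np.array([[10, 90], [5, 95]])

def risks(tab):
    """Risk of failure in the control and experimental rows of a 2x2 table."""
    return tab[:, 0] / tab.sum(axis=1)

risk_m, risk_f = risks(males), risks(females)
rr_m = risk_m[0] / risk_m[1]  # stratum-specific relative risk, men
rr_f = risk_f[0] / risk_f[1]  # stratum-specific relative risk, women

# Share of men in each group; identical shares mean no confounding by gender
w_m = males.sum(axis=1) / (males + females).sum(axis=1)

# Pooled risk per group as a weighted average of the stratum risks
pooled_risk = w_m * risk_m + (1 - w_m) * risk_f
rr_pooled = pooled_risk[0] / pooled_risk[1]
print(rr_m, rr_f, rr_pooled)
```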

# Question 4

For each year of additional education, the expected increase in income is $13000.

The outlier point likely led to a significant overestimate of the regression slope. The size of this overestimate can be quantified under the assumption that the remaining data follow a true linear relationship:

In [18]:
def true_income(x,a,b):
    return a*x+b

# Generate incomes (in thousands of dollars) that follow the 13 thousand per year increase exactly
years = np.random.uniform(size=100)*12
income = true_income(years, 13, 30)

# Add the outlier data point: 10 years of education, income of 3000 (thousand dollars)
years = np.append(years, 10)
income = np.append(income, 3e3)

# Refit and see what happens to the slope parameter
(a, _), _ = curve_fit(true_income, years, income, p0=(13, 30))
a

22.580709926555766

This shows that if the data really did follow a 13000/year increase **without** the outlier data point, then **including** the outlier raises the fitted slope to approximately 22600/year in this run (the exact value depends on the random draw of `years`, since no seed is set). A data set of only 100 points is therefore extremely sensitive to such a large outlier.
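
The size of the shift can also be written in closed form. For data lying exactly on a line with slope $a$, adding one point with residual $r$ at horizontal distance $d$ from the mean of $x$ changes the least-squares slope by $\frac{n}{n+1}\, d\, r \,/\, \left(S_{xx} + \frac{n}{n+1} d^2\right)$. The sketch below checks this against a direct fit; it uses `np.polyfit` instead of `curve_fit`, and the seed is an arbitrary choice made only for reproducibility, so the number differs slightly from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, assumed here for reproducibility

a_true, b_true = 13.0, 30.0           # slope and intercept, in thousands of dollars
years = rng.uniform(0, 12, size=100)
income = a_true * years + b_true      # noise-free data on the true line

x_out, y_out = 10.0, 3e3              # the outlier point

# Direct least-squares fits without and with the outlier
slope_clean = np.polyfit(years, income, 1)[0]
slope_outlier = np.polyfit(np.append(years, x_out), np.append(income, y_out), 1)[0]

# Closed-form shift of the slope caused by the single added point
n = years.size
d = x_out - years.mean()                    # horizontal distance from the mean
resid = y_out - (a_true * x_out + b_true)   # vertical distance from the true line
sxx = np.sum((years - years.mean()) ** 2)
shift = (n / (n + 1)) * d * resid / (sxx + (n / (n + 1)) * d ** 2)

print(slope_clean, slope_outlier, a_true + shift)
```

The formula makes the sensitivity explicit: the shift grows with the outlier's residual and its distance from the mean of `years`, and shrinks as the spread of the other points increases.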

# Question 5

* For each increase of 1000 dollars, life satisfaction increases by 0.2 points.

* While age and years of education (over a large range of ages) are likely not correlated, it is likely that **age and income**, and **years of education and income**, form two highly correlated pairs. As such, in the context of multiple regression, income is not needed as an additional variable because it doesn't provide any significant *independent* information.
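
One way to make the redundancy argument concrete is a variance inflation factor (VIF), which measures how well one predictor is explained by the others. The sketch below uses purely synthetic data: the sample size, coefficients, and noise level are made-up assumptions rather than values from the question, and `vif` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Synthetic, made-up data: age and education drawn independently,
# income constructed to depend on both plus noise
n = 500
age = rng.uniform(20, 65, size=n)
education = rng.uniform(8, 20, size=n)
income = 1.0 * age + 3.0 * education + rng.normal(0, 5, size=n)

def vif(target, others):
    """Variance inflation factor of `target` given the other predictors."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    r2 = 1 - resid.var() / target.var()  # R^2 of target regressed on the others
    return 1 / (1 - r2)

corr_age_edu = np.corrcoef(age, education)[0, 1]
vif_income = vif(income, [age, education])
print(corr_age_edu, vif_income)
```

Under these assumptions age and education are nearly uncorrelated while the VIF of income is large, which is the pattern the argument above describes.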