In [24]:
import numpy as np 
import pandas as pd 
from scipy.stats import chi2_contingency

## Causal Inference Modeling 

Problem Statement: What is the strength of the correlation between RTO and productivity at work? Is there a causal relationship between RTO and workplace productivity?

Let X represent RTO status and Y represent worker productivity, where
 - X = 1 represents workers who go into the office for work, and 
 - Y = 1 represents workers who are "productive."


##### Question 1: How does the share of "productive" workers differ between those who RTO and those who do not?

Solve by calculating P(Y=1|X=1) − P(Y=1|X=0): subtract the proportion of "productive" non-RTO workers from the proportion of "productive" RTO workers.
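As a small illustration of this quantity, the sketch below computes it on a hand-made table. The `toy` DataFrame and its values are hypothetical and are not part of the dataset used later; the only assumption carried over is that `x` and `y` are coded 0/1.

```python
import pandas as pd

# Hypothetical toy data (not from the notebook): x = RTO status, y = "productive" flag.
toy = pd.DataFrame({
    "x": [1, 1, 1, 1, 0, 0, 0, 0],
    "y": [1, 0, 0, 0, 1, 1, 0, 0],
})

# For a 0/1 outcome, the group mean of y is the empirical P(Y=1 | X=x).
p_y_given_x = toy.groupby("x")["y"].mean()

# P(Y=1 | X=1) - P(Y=1 | X=0)
difference = p_y_given_x.loc[1] - p_y_given_x.loc[0]
print(p_y_given_x.to_dict(), difference)
```

A negative value means the RTO group has the lower share of "productive" workers; a positive value means the opposite.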

##### Create a function to estimate the difference in productivity by RTO status.

In [9]:
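def estimate_uplift(df):
    """Estimate the difference in mean y between the x == 1 and x == 0 groups."""
    no_rto = df[df.x == 0]
    rto = df[df.x == 1]
    # Difference in the share of productive workers: P(Y=1|X=1) - P(Y=1|X=0)
    delta = rto.y.mean() - no_rto.y.mean()
    # 1.96 * standard error of the difference, i.e. the half-width of an
    # approximate 95% confidence interval (the key name below is kept as-is)
    delta_err = 1.96 * np.sqrt(
        rto.y.var() / rto.shape[0] +
        no_rto.y.var() / no_rto.shape[0])
    return {"estimated_effect": delta, "standard_error": delta_err}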

##### Create a synthetic dataset.

In [7]:
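p_z = 0.5                      # P(Z=1); z influences both x and y
p_x_z = [0.9, 0.1]             # P(X=1 | Z=z) for z = 0, 1
p_y_xz = [0.2, 0.4, 0.6, 0.8]  # P(Y=1 | X=x, Z=z), indexed by x + 2*z
# Draw z, then x given z, then y given x and z (500 workers)
z = np.random.binomial(n=1, p=p_z, size=500)
p_x = np.choose(z, p_x_z)
x = np.random.binomial(n=1, p=p_x, size=500)
p_y = np.choose(x+2*z, p_y_xz)
y = np.random.binomial(n=1, p=p_y, size=500)
# Only x and y are kept; z is not included in the DataFrame
data = pd.DataFrame({"x":x, "y":y})
data.head()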

   x  y
0  1  0
1  0  1
2  1  0
3  0  0
4  1  0


##### Estimate causal relationship between RTO & workplace productivity.

In [8]:
estimate_uplift(data)

{'estimated_effect': -0.10784172546760751, 'standard_error': 0.087250515533598}
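The value reported under "standard_error" is computed in `estimate_uplift` as 1.96 times the standard error of the difference, so it can be read as the half-width of an approximate 95% confidence interval. The sketch below turns the output above into that interval; the variable names are illustrative only.

```python
# Output of estimate_uplift(data), copied from the cell above.
result = {"estimated_effect": -0.10784172546760751,
          "standard_error": 0.087250515533598}

# In estimate_uplift this value is 1.96 * SE, so it is the half-width of an
# approximate 95% confidence interval (normal approximation).
half_width = result["standard_error"]
standard_error = half_width / 1.96

lower = result["estimated_effect"] - half_width
upper = result["estimated_effect"] + half_width
print(f"SE = {standard_error:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

If both ends of the interval have the same sign, the interval excludes zero.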

##### Preliminary Results

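The negative "estimated_effect" indicates that, in this sample, RTO workers are less likely to be "productive" than non-RTO workers, by roughly 11 percentage points. Because the reported "standard_error" value (about 0.087) is 1.96 times the standard error, the approximate 95% interval of -0.108 ± 0.087 excludes zero. This is a difference between observed groups, so on its own it shows an association rather than a causal effect; the synthetic data are generated with a variable z that influences both x and y, which can distort the comparison. To check whether the association is statistically significant, we'll run a chi-square contingency test (i.e., chi-square test of independence/association).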
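Because the dataset is synthetic, the sampled estimate can be compared with what the generating parameters imply. The sketch below assumes the parameter values from the synthetic-dataset cell (`p_z`, `p_x_z`, `p_y_xz`); the helper `p_y_given_x` is a hypothetical name introduced here, and the calculation is plain probability arithmetic rather than a fitted model.

```python
import numpy as np

# Parameters copied from the synthetic-dataset cell.
p_z = 0.5
p_x_z = np.array([0.9, 0.1])             # P(X=1 | Z=z) for z = 0, 1
p_y_xz = np.array([0.2, 0.4, 0.6, 0.8])  # P(Y=1 | X=x, Z=z), indexed by x + 2*z

p_z_vals = np.array([1 - p_z, p_z])      # P(Z=z) for z = 0, 1


def p_y_given_x(x):
    """P(Y=1 | X=x), obtained by averaging over z (hypothetical helper)."""
    p_x_given_z = p_x_z if x == 1 else 1 - p_x_z
    joint = p_x_given_z * p_z_vals        # P(X=x, Z=z)
    p_z_given_x = joint / joint.sum()     # Bayes' rule: P(Z=z | X=x)
    p_y = np.array([p_y_xz[x + 2 * z] for z in (0, 1)])
    return float((p_y * p_z_given_x).sum())


# Difference that estimate_uplift targets: P(Y=1|X=1) - P(Y=1|X=0)
naive_difference = p_y_given_x(1) - p_y_given_x(0)

# Difference within each level of z: P(Y=1|X=1,Z=z) - P(Y=1|X=0,Z=z)
within_z = [float(p_y_xz[1 + 2 * z] - p_y_xz[0 + 2 * z]) for z in (0, 1)]

print("implied overall difference:", round(naive_difference, 4))
print("implied difference within z=0, z=1:", [round(d, 4) for d in within_z])
```

Under these parameters the overall difference and the within-z differences have opposite signs, which is worth keeping in mind when reading the sampled estimate as an effect of RTO.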

##### Chi-Square Contingency

Perform a chi-square test to determine whether there exists a significant association between two categorical variables, RTO status and workplace productivity.  

In [1]:
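def chi2(df):
    """Print the p-value of a test of independence between x and y."""
    # Build the 2x2 table of counts: rows are x (RTO status), columns are y
    contingency_table = (
        df
        .assign(placeholder=1)
        .pivot_table(index="x", columns="y", values="placeholder", aggfunc="sum")
        .values
    )
    # lambda_="log-likelihood" selects the log-likelihood ratio statistic
    # (G-test) instead of Pearson's chi-squared statistic
    _, p, _, _ = chi2_contingency(contingency_table, lambda_="log-likelihood")
    print("p-value:", p)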


In [26]:
chi2(data)

p-value: 0.02002396205089891
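To make the inputs and outputs of the test more concrete, the sketch below builds a contingency table with `pd.crosstab` and inspects the expected frequencies that `chi2_contingency` returns. The counts in `toy` are hypothetical and unrelated to the result above. It also runs the default Pearson statistic next to the `lambda_="log-likelihood"` variant used in `chi2`, since the two can give slightly different p-values.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts (not the notebook's data): 50 RTO and 50 non-RTO workers.
toy = pd.DataFrame({
    "x": [1] * 50 + [0] * 50,
    "y": [1] * 20 + [0] * 30 + [1] * 30 + [0] * 20,
})

# Rows are x (RTO status), columns are y (productive or not).
table = pd.crosstab(toy["x"], toy["y"])
print(table)

# Log-likelihood ratio statistic (G-test), as in chi2() above.
g_stat, g_p, dof, expected = chi2_contingency(table, lambda_="log-likelihood")

# Default Pearson chi-squared statistic, for comparison.
pearson_stat, pearson_p, _, _ = chi2_contingency(table)

print("degrees of freedom:", dof)
print("expected counts under independence:")
print(expected)
print("G-test p-value:", g_p)
print("Pearson p-value:", pearson_p)
```

The expected counts are what each cell would hold if RTO status and productivity were independent; the test measures how far the observed table departs from them.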


##### Conclusion

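With a p-value of about 0.02 (< 0.05), we reject the null hypothesis that RTO status and productivity are independent at the 5% significance level. The test supports a statistically significant association in this sample, so a worker's probability of being "productive" can be estimated conditional on their RTO status. It does not, by itself, establish that RTO causes the difference.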