In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, LogisticRegression
import scipy.spatial.distance as dist

#### Homework Reflection 5

1. Draw a diagram for the following negative feedback loop:

Sweating causes body temperature to decrease. High body temperature causes sweating. 

A negative feedback loop means that one thing increases another while the second thing decreases the first.

Remember that we are using directed acyclic graphs where two things cannot directly cause each other.

<font color='seagreen'> A negative feedback loop is inherently cyclic, so it cannot be drawn directly as a directed acyclic graph. The only way to draw it without a cycle is to represent body heat as two separate nodes, which gives the following:

![Negative Feedback Loop DAG](2743.jpg)
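
One way to read the two-node version is as the loop unrolled in time: temperature now drives sweating, and sweating drives temperature at the next step. The sketch below simulates that reading. The set point, gains, and noise level are made-up values for illustration, not physiological constants.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_feedback(steps=50, start_temp=39.0, set_point=37.0,
                      sweat_gain=0.8, cooling_per_sweat=0.5, noise_sd=0.05):
    """Unroll the loop in time: temp[t] -> sweat[t] -> temp[t + 1]."""
    temp = np.empty(steps + 1)
    sweat = np.empty(steps)
    temp[0] = start_temp
    for t in range(steps):
        # High body temperature causes sweating (sweat cannot be negative).
        sweat[t] = max(0.0, sweat_gain * (temp[t] - set_point))
        # Sweating lowers the NEXT temperature reading, plus a little noise.
        temp[t + 1] = temp[t] - cooling_per_sweat * sweat[t] + rng.normal(0, noise_sd)
    return temp, sweat

temp, sweat = simulate_feedback()
print(f"start: {temp[0]:.2f}, end: {temp[-1]:.2f}")
```

Because each arrow points forward in time, no node ever causes itself, and the graph stays acyclic.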

2. Describe an example of a positive feedback loop. This means that one thing increases another while the second thing also increases the first.

<font color='seagreen'> An example of a positive feedback loop is fruit ripening. In a basket of apples, when one apple begins to ripen, it releases ethylene gas through its skin. Exposure to this gas makes nearby apples begin to ripen, which releases still more ethylene. This in turn causes the rest of the apples in the basket to ripen.

3. Draw a diagram for the following situation: 

Lightning storms frighten away deer and bears, decreasing their population, and cause flowers to grow, increasing their population.  
Bears eat deer, decreasing their population.  
Deer eat flowers, decreasing their population.

![DAG2](54910.jpg)

Write a dataset that simulates this situation. (Show the code.) Include noise/randomness in all cases.

In [5]:
np.random.seed(42)

n = 1000 # sample size 
lightning = np.random.binomial(1, 0.3, n)  # 1 if storm, 0 otherwise

# Bears decrease with lightning, plus noise
bears = 50 - 10 * lightning + np.random.normal(0, 3, n)

# Deer decrease with lightning and bears, plus noise
deer = 200 - 20 * lightning - 0.5 * bears + np.random.normal(0, 15, n)

# Flowers increase with lightning and decrease with deer, plus noise
flowers = 500 + 120 * lightning - 0.8 * deer + np.random.normal(0, 30, n)

df = pd.DataFrame({
    'lightning': lightning,
    'bears': bears,
    'deer': deer,
    'flowers': flowers
})

df.head()

| | lightning | bears | deer | flowers |
|---|---|---|---|---|
| 0 | 0 | 50.533103 | 153.638687 | 403.324563 |
| 1 | 1 | 35.993967 | 160.756433 | 471.901897 |
| 2 | 1 | 41.140594 | 136.858898 | 474.416856 |
| 3 | 0 | 51.831757 | 185.484961 | 320.350699 |
| 4 | 0 | 51.679371 | 175.396911 | 345.066384 |


Identify a backdoor path with one or more confounders for the relationship between deer and flowers.

<font color='seagreen'>

1. Deer <-- Lightning --> Flowers  

2. Deer <-- Bear <-- Lightning --> Flowers
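
Both backdoor paths pass through lightning, so adjusting for lightning should block them. The sketch below checks this on data generated with the same equations as the simulation above. The larger sample size, the `default_rng` generator, and the `ols_slope` helper are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Same structural equations as the simulated dataset.
lightning = rng.binomial(1, 0.3, n)
bears = 50 - 10 * lightning + rng.normal(0, 3, n)
deer = 200 - 20 * lightning - 0.5 * bears + rng.normal(0, 15, n)
flowers = 500 + 120 * lightning - 0.8 * deer + rng.normal(0, 30, n)

def ols_slope(y, *regressors):
    """Return the OLS coefficient on the first regressor (intercept included)."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

naive = ols_slope(flowers, deer)                 # backdoor paths left open
adjusted = ols_slope(flowers, deer, lightning)   # lightning blocks both paths

print(f"true effect of deer on flowers: -0.8")
print(f"naive estimate:    {naive:.2f}")
print(f"adjusted estimate: {adjusted:.2f}")
```

The naive slope mixes the effect of deer with the effect of lightning on flowers, while the adjusted slope should land close to the coefficient used to generate the data.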

4. Draw a diagram for a situation of your own invention. The diagram should include at least four nodes, one confounder, and one collider. Be sure that it is acyclic (no loops). Which node would you say is most like a treatment (X)? Which node is most like an outcome (Y)?

![Exercise on MH](65804.jpg)

<font color='seagreen'> This DAG looks at the effect of exercise (X) on mental health (Y) where exercising would improve mental health. The confounder (stress) affects both exercise and mental health by decreasing both. Sleep quality, the collider, is affected by both exercise and mental health where exercising would improve sleep quality but poor mental health worsens sleep quality.

#### Homework Reflection 6

1. What is a potential problem with computing the Marginal Treatment Effect simply by comparing each untreated item to its counterfactual and taking the maximum difference? (Hint: think of statistics here. Consider that only the most extreme item ends up being used to estimate the MTE. That's not necessarily a bad thing; the MTE is supposed to come from the untreated item that will produce the maximum effect. But there is nevertheless a problem.)  
Possible answer: We are likely to find the item with the most extreme difference, which may be high simply due to randomness. (Please explain/justify this answer, or give a different one if you can think of one.)

<font color='seagreen'> The approach in question estimates the treatment effect for each untreated unit by comparing it to a counterfactual prediction, then takes the unit with the maximum estimated effect as the estimate of the MTE. While this aligns with the idea behind the MTE of identifying the unit that would benefit the most from treatment, it introduces a statistical problem: the unit with the largest observed effect is likely to be an outlier, not because it truly has the largest effect, but because of random noise in the data. This is extreme value bias: many estimates are computed, each with some random variation, and the largest value is often inflated by that variation. Essentially, the maximum captures whatever random fluctuation happens to make that unit look extreme. Thus, finding the unit with the highest causal effect often turns into finding the unit with the highest noisy estimate, leading to an overestimate of the true maximum effect.
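
A small simulation can make the inflation visible. In this sketch every unit has the same true effect (an assumption made purely for illustration), so any gap between the maximum estimate and the true value is noise. The effect size, noise level, and number of repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 5.0   # assumed: every unit has the same true effect
noise_sd = 3.0      # assumed estimation noise per unit

results = {}
for n in (10, 100, 1000, 10000):
    # Average the maximum estimate over 200 simulated studies of size n.
    maxima = [np.max(true_effect + rng.normal(0, noise_sd, n)) for _ in range(200)]
    results[n] = float(np.mean(maxima))
    print(f"n={n:>5}: mean max estimate = {results[n]:.2f} (true max = {true_effect})")
```

The maximum keeps growing as more units are compared, even though the true maximum never changes, which is why a less extreme summary is a safer estimate.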

2. Propose a solution that remedies this problem and write some code that implements your solution. It's very important here that you clearly explain what your solution will do.  
Possible answer: maybe we could take the 90th percentile of the treatment effect and use it as a proxy for the Marginal Treatment Effect.  
(Either code this answer or choose a different one.)

In [7]:
np.random.seed(42)
n = 1000

# Underlying true effects with some noise
true_effects = np.random.normal(5, 2, n)  # true per-unit effects (vary across units)
estimation_noise = np.random.normal(0, 3, n)
estimated_effects = true_effects + estimation_noise

# Finding the 90th percentile of the estimated effects as MTE
mte_90th_percentile = np.percentile(estimated_effects, 90)

summary = {
    'True MTE (max true effects)': np.max(true_effects),
    'Estimated MTE (max estimate)': np.max(estimated_effects),
    'Proposed Solution (90th percentile)': mte_90th_percentile
}

summary_df = pd.DataFrame([summary])
summary_df

| | True MTE (max true effects) | Estimated MTE (max estimate) | Proposed Solution (90th percentile) |
|---|---|---|---|
| 0 | 12.705463 | 17.954193 | 9.728453 |


#### Homework Reflection 7

1. Create a linear regression model involving a confounder that is left out of the model. Show whether the true correlation between $X$ and $Y$ is overestimated, underestimated, or neither. Explain in words why this is the case for the given coefficients you have chosen.

In [16]:
np.random.seed(42)
n = 1000

# Confounder Z
Z = np.random.normal(0, 1, n)

# Treatment X influenced by Z
X = 2 * Z + np.random.normal(0, 1, n)

# Outcome Y influenced by X and Z
Y = 3 * X - 1.5 * Z + np.random.normal(0, 1, n)

data = pd.DataFrame({
    'X': X,
    'Y': Y,
    'Z': Z
})

# Model 1: with confounder Z
X_Z = sm.add_constant(data[['X', 'Z']])
model_with_Z = sm.OLS(data['Y'], X_Z).fit()

# Model 2: without confounder Z
X_only = sm.add_constant(data[['X']])
model_without_Z = sm.OLS(data['Y'], X_only).fit()

# Coefficients
coefficients = {
    'With Z': model_with_Z.params,
    'Without Z': model_without_Z.params
}
coefficients_df = pd.DataFrame(coefficients)
coefficients_df

| | With Z | Without Z |
|---|---|---|
| X | 2.989823 | 2.403799 |
| Z | -1.457839 | NaN |
| const | 0.006134 | 0.04212 |


<font color='seagreen'> In this model, Z positively influences X, whereas Z negatively influences Y. The coefficient of X is 2.99 when Z is included and 2.40 when Z is left out. The negative influence of Z on Y gets incorrectly absorbed into the estimate of X's effect, leading to an underestimate of the true causal effect of X on Y.

On the other hand, if Z positively influenced both X and Y, the coefficient without Z would be greater than the coefficient with Z, causing an overestimate. Overall, the direction of the bias depends on the signs of Z's effects on X and on Y. If the two effects have opposite signs (one positive, one negative), the bias is downward, so a positive effect of X is underestimated. If the two effects have the same sign (both positive or both negative), the bias is upward, so a positive effect of X is overestimated.

This happens because when a confounder is left out, the regression model mistakes the confounder's effect for part of the treatment effect. Whether the estimate is too big or too small depends on the direction of the confounder's effect on both the treatment and the outcome.
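
The size of the bias can be worked out with the standard omitted-variable-bias formula, using the coefficients chosen in the simulation. These are population values, so the sample estimate of 2.40 should match only approximately. Here $\operatorname{Cov}(X,Z) = 2\operatorname{Var}(Z) = 2$ and $\operatorname{Var}(X) = 2^2 \cdot 1 + 1 = 5$:

$$
\hat{\beta}_X^{\text{without } Z} \;\approx\; \beta_X + \beta_Z \, \frac{\operatorname{Cov}(X, Z)}{\operatorname{Var}(X)} \;=\; 3 + (-1.5) \cdot \frac{2}{5} \;=\; 3 - 0.6 \;=\; 2.4
$$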

2. Perform a linear regression analysis in which one of the coefficients is zero, e.g.  

W = [noise]  
X = [noise]  
Y = 2 * X + [noise]  

And compute the p-value of a coefficient - in this case, the coefficient of W.  
(This is the likelihood that the estimated coefficient would be as high or low as it is, given that the actual coefficient is zero.)  
If the p-value is less than 0.05, this ordinarily means that we judge the coefficient to be nonzero (incorrectly, in this case.)  
Run the analysis 1000 times and report the best (smallest) p-value.  
If the p-value is less than 0.05, does this mean the coefficient actually is nonzero? What is the problem with repeating the analysis?

In [20]:
np.random.seed(42)
n = 1000

# Simulate data
W = np.random.normal(0, 1, n)
X = np.random.normal(0, 1, n)
Y = 2 * X + np.random.normal(0, 1, n)

data = pd.DataFrame({
    'W': W,
    'X': X,
    'Y': Y
})

# Model including W, an irrelevant regressor whose true coefficient is zero
X_W = sm.add_constant(data[['X', 'W']])
model_with_W = sm.OLS(data['Y'], X_W).fit()

# Compute p-value of coefficient of W
p_value_W = model_with_W.pvalues['W']
p_value_W

np.float64(0.4933752408637956)

In [21]:
# Run the analysis 1000 times
p_values = []
for _ in range(1000):
    W = np.random.normal(0, 1, n)
    X = np.random.normal(0, 1, n)
    Y = 2 * X + np.random.normal(0, 1, n)

    data = pd.DataFrame({
        'W': W,
        'X': X,
        'Y': Y
    })

    X_W = sm.add_constant(data[['X', 'W']])
    model_with_W = sm.OLS(data['Y'], X_W).fit()
    p_values.append(model_with_W.pvalues['W'])

smallest_p_value = min(p_values)
smallest_p_value

np.float64(0.00019365846399438825)

<font color='seagreen'> After running the analysis 1000 times, the smallest p-value observed is 0.00019, which is less than 0.05. Taken at face value, this would mean W is statistically significant, even though W has no true effect on Y. So a p-value below 0.05 does not mean the coefficient actually is nonzero: a low p-value can happen by chance, especially when the analysis is run as many as 1000 times. Running the regression once gave a p-value of 0.49, indicating that W is statistically insignificant. However, when the true coefficient is zero, each run still has about a 5% chance of producing a p-value below 0.05, so across 1000 runs it is virtually certain that at least one misleadingly low p-value appears. Reporting only the smallest one is a multiple-comparisons problem.
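
The arithmetic behind this can be sketched directly. The calculation assumes the 1000 runs are independent and that the null hypothesis holds in every run. The Bonferroni threshold is one common correction, shown here for illustration; the assignment does not ask for it.

```python
alpha = 0.05
n_tests = 1000

# Chance that at least one of n_tests independent null tests gives p < alpha.
p_any_false_positive = 1 - (1 - alpha) ** n_tests

# Expected number of runs with p < alpha when W truly has no effect.
expected_false_positives = alpha * n_tests

# Bonferroni-corrected threshold for judging the single smallest p-value.
bonferroni_alpha = alpha / n_tests

observed_min_p = 0.00019365846399438825  # smallest p-value from the 1000 runs
print(f"P(at least one p < {alpha}): {p_any_false_positive:.6f}")
print(f"expected false positives:   {expected_false_positives:.0f}")
print(f"Bonferroni threshold:       {bonferroni_alpha:.5f}")
print(f"significant after correction? {observed_min_p < bonferroni_alpha}")
```

Under the corrected threshold, the smallest p-value no longer counts as significant, which agrees with how the data were generated.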

#### Homework Reflection 8

Include the code you used to solve the two coding quiz problems and write about the obstacles/challenges/insights you encountered while solving them.

In [None]:
# ATE with inverse probability weighting

hw81 = pd.read_csv('homework_8.1.csv', index_col=0)

X = hw81['X']
Y = hw81['Y']
Z = pd.DataFrame(hw81['Z'])

# Predicting propensity scores using predict_proba
model = LogisticRegression()
model.fit(Z, X)
prob = model.predict_proba(Z)[:, 1]

# Calculating weights using inverse probability weighting
weights = np.where(X == 1, 1/prob, 1/(1-prob))
hw81['weights'] = weights

# Calculating ATE using weighted average
X_0 = hw81[hw81['X'] == 0]
X_1 = hw81[hw81['X'] == 1]
avg_treated = np.sum(X_1['Y'] * X_1['weights'])/np.sum(X_1['weights'])
avg_untreated = np.sum(X_0['Y'] * X_0['weights'])/np.sum(X_0['weights'])
ate = avg_treated - avg_untreated
ate

np.float64(2.2743411898510133)

In [None]:
# Matching using Mahalanobis distance

hw82 = pd.read_csv('homework_8.2.csv', index_col=0)

# Separating treated and untreated groups
treated = hw82[hw82['X'] == 1].reset_index(drop=True)
untreated = hw82[hw82['X'] == 0].reset_index(drop=True)

# Calculating the covariance matrix and its inverse using Z1 and Z2
matrix = np.vstack((hw82['Z1'], hw82['Z2']))
cov = np.cov(matrix)
cov_inv = np.linalg.inv(cov)

# For each treated unit, find the nearest untreated unit by Mahalanobis distance
matches = []
for i, row in treated.iterrows():
    treated_z = row[['Z1', 'Z2']].values
    distances = untreated[['Z1', 'Z2']].apply(lambda x: dist.mahalanobis(treated_z, x.values, cov_inv), axis=1)
    nearest_index = distances.idxmin()
    matches.append((i, nearest_index))

matched_pairs = pd.DataFrame([{'treated_index': t_idx, 'control_index': c_idx, 'treated_Y': treated.loc[t_idx,'Y'], 'control_Y': untreated.loc[c_idx, 'Y']} for t_idx, c_idx in matches]) 

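# Calculate the Average Treatment Effect (ATE) as the mean outcome difference across matched pairs
# (each treated unit is matched to a control, so strictly this estimates the effect on the treated)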
matched_pairs['TE'] = matched_pairs['treated_Y'] - matched_pairs['control_Y']
print(f'ATE: {matched_pairs["TE"].mean()}')

# Finding the worst match based on Mahalanobis distance
matched_pairs['distance'] = matched_pairs.apply(lambda row: dist.mahalanobis(treated.loc[row['treated_index'], ['Z1', 'Z2']].values, untreated.loc[row['control_index'], ['Z1', 'Z2']].values, cov_inv), axis=1)
worst_match = matched_pairs.loc[matched_pairs['distance'].idxmax()]
worst_treated = treated.loc[worst_match['treated_index'], ['Z1', 'Z2']]
worst_treated

ATE: 3.4376789979126094


Z1    2.696224
Z2    0.538155
Name: 241, dtype: float64

<font color='seagreen'> The biggest challenge for me in solving these two coding quiz problems was figuring out how to take the theory we learned in lectures and readings and turn it into code. It required a lot of trial and error as well as searching the Internet for resources. After doing these two problems, I feel I have a better understanding of the concepts we learned this week and how to apply them to real data. I was also able to brush up on some numpy and pandas skills that I haven't used very often, which reminded me how robust these two packages are. Overall, I think this coding quiz was a really great exercise and test of our knowledge. Although it was challenging, it really helped me pinpoint the holes in my knowledge of these concepts.