In [8]:
# Crash Course in Causal Inference: Multiple-Choice Quiz
# Optimized for Google Colab

from IPython.display import Markdown, display

quiz_data = {
  "questions": [
    {
      "questionNumber": 1,
      "question": "What is the primary causal question investigated in the NHEFS worked example?",
      "answerOptions": [
        {"text": "Does exercise level affect weight change?", "rationale": "While the dataset contains exercise and weight variables, this is a secondary question. The primary focus of the analysis was on the effect of smoking cessation.", "isCorrect": False},
        {"text": "Does quitting smoking (qsmk) cause weight gain (wt82_71)?", "rationale": "The script explicitly states this is the main causal question of interest, where 'qsmk' is the treatment and 'wt82_71' is the outcome.", "isCorrect": True},
        {"text": "Does income affect the rate of death in the follow-up period?", "rationale": "The dataset includes income and death, but the worked example specifically focuses on the relationship between smoking cessation and weight change.", "isCorrect": False},
        {"text": "Does a higher baseline weight (wt71) increase the likelihood of quitting smoking?", "rationale": "This question addresses the relationship between a confounder (wt71) and the treatment (qsmk), but it is not the main causal effect being measured.", "isCorrect": False}
      ],
      "hint": "Focus on the variables designated as the treatment and the outcome in the initial problem statement."
    },
    {
      "questionNumber": 2,
      "question": "In the context of causal inference, what is a **confounder**?",
      "answerOptions": [
        {"text": "A variable that is only affected by the treatment and then affects the outcome.", "rationale": "This describes a mediator variable, which is a step in the causal pathway from treatment to outcome, not a confounder.", "isCorrect": False},
        {"text": "A variable that causes both the treatment and the outcome.", "rationale": "This is the precise definition of a confounder, as it opens a 'backdoor path' creating a spurious association between treatment and outcome.", "isCorrect": True},
        {"text": "A variable whose value is missing for a significant number of observations.", "rationale": "This describes missing data, which is a data quality issue, but not the causal role of a confounder.", "isCorrect": False},
        {"text": "The primary variable of interest that the researcher is manipulating.", "rationale": "This describes the treatment or exposure variable, not the definition of a confounder.", "isCorrect": False}
      ],
      "hint": "Think about what a variable must do to create a spurious, non-causal association between the treatment and the outcome."
    },
    {
      "questionNumber": 3,
      "question": "Why is **Age** considered a major confounder in the analysis of quitting smoking on weight gain?",
      "answerOptions": [
        {"text": "Because Age directly determines the treatment, meaning older people always quit smoking.", "rationale": "While age is associated with treatment, it does not determine it, and this statement is too strong and factually incorrect.", "isCorrect": False},
        {"text": "Because Age is related to both the probability of quitting smoking and the metabolic rate (which affects weight gain).", "rationale": "The script specifically points out that Age affects 'the smoking urge' (treatment) and 'metabolism' (outcome), fitting the definition of a confounder.", "isCorrect": True},
        {"text": "Because it is required to calculate the Adjusted R-squared in the regression model.", "rationale": "While age is used in the regression, this describes a statistical requirement, not its role as a causal variable (a confounder).", "isCorrect": False},
        {"text": "Because the naive comparison showed quitters are younger, which balances the groups.", "rationale": "The naive comparison showed quitters are *older* (46.17 vs. 42.79), confirming it's a source of imbalance, not balance.", "isCorrect": False}
      ],
      "hint": "Recall how age affects both the likelihood of engaging in the behavior (smoking) and the body's natural processes (metabolism)."
    },
    {
      "questionNumber": 4,
      "question": "The **naive effect** (simple difference in means) showed that quitters gained **2.54 kg** more than non-quitters. Why is this estimate considered **biased**?",
      "answerOptions": [
        {"text": "It fails to account for influential outliers with extreme weight changes.", "rationale": "While outliers can affect mean estimates, the primary reason for bias in this context is confounding, not specifically outliers.", "isCorrect": False},
        {"text": "It only measures association and does not control for the systematic differences (confounders) between the quitters and non-quitters.", "rationale": "The naive estimate is biased because it includes both the true causal effect and the spurious effect introduced by backdoor paths (confounders).", "isCorrect": True},
        {"text": "It uses the mean instead of the median, which is less robust to skewed weight distributions.", "rationale": "The choice between mean and median is about central tendency robustness, not the structural issue of confounding bias.", "isCorrect": False},
        {"text": "It is based on OLS regression, which assumes linearity and normality of residuals.", "rationale": "The naive estimate is the simple difference of group means, calculated *before* running any regression model.", "isCorrect": False}
      ],
      "hint": "Consider the core problem that causal inference methods (like regression and IPW) are designed to solve."
    },
    {
      "questionNumber": 5,
      "question": "In a Causal Diagram (DAG), how is a **backdoor path** from Treatment to Outcome blocked to achieve an unbiased causal estimate?",
      "answerOptions": [
        {"text": "By observing or statistically adjusting for the outcome variable in the analysis.", "rationale": "The outcome is the variable being measured; adjusting for it will bias the causal effect (M-bias or collider bias).", "isCorrect": False},
        {"text": "By setting the treatment variable to a fixed value for all observations in the dataset.", "rationale": "Fixing the treatment is the counterfactual step, but the path is blocked by controlling for the *confounder* node, not the treatment node.", "isCorrect": False},
        {"text": "By conditioning on (adjusting for) the variables that constitute the backdoor path (the common causes or confounders).", "rationale": "Adjusting for a common cause (confounder) closes the spurious path $T \\leftarrow C \\rightarrow O$, isolating the true causal effect $T \\rightarrow O$.", "isCorrect": True},
        {"text": "By removing any observations that have missing values for the confounders.", "rationale": "Removing missing data addresses data quality, but does not inherently close the causal path via the remaining observations.", "isCorrect": False}
      ],
      "hint": "The solution requires actively doing something to the common cause node in the DAG to prevent the spurious flow of information."
    },
    {
      "questionNumber": 6,
      "question": "What is the primary role of **Regression Adjustment (OLS)** in estimating the causal effect of quitting smoking?",
      "answerOptions": [
        {"text": "To calculate the propensity score for each individual in the dataset.", "rationale": "Propensity scores are calculated using Logistic Regression (Logit), not Ordinary Least Squares (OLS) regression.", "isCorrect": False},
        {"text": "To compare people who are different in the confounders and who both quit smoking.", "rationale": "The goal is to compare groups that are *similar* in confounders but *different* in treatment (quit vs. not quit).", "isCorrect": False},
        {"text": "To statistically 'hold constant' the values of all measured confounders while estimating the difference in outcome between treated and control groups.", "rationale": "This describes the mechanism of OLS: the coefficient for the treatment variable ($qsmk$) represents the effect when all other variables (confounders) are equal.", "isCorrect": True},
        {"text": "To visually represent the distribution of the treatment variable across all confounders.", "rationale": "Visualization (like the Love plot) helps check balance, but this is not the function of the regression model itself.", "isCorrect": False}
      ],
      "hint": "Think about what the multiple regression equation does to the coefficient of the treatment variable when many other variables are included."
    },
    {
      "questionNumber": 7,
      "question": "The **Propensity Score (PS)** is calculated using a logistic regression with the formula $P(\\text{Treatment}=1 | \\text{Confounders})$. What does an individual's PS of **0.85** signify?",
      "answerOptions": [
        {"text": "They had an 85% chance of gaining 2.54 kg of weight.", "rationale": "The PS is the probability of receiving the treatment (quitting), not the probability of the outcome (weight gain).", "isCorrect": False},
        {"text": "They had an 85% chance of *quitting smoking*, given their baseline characteristics (confounders).", "rationale": "The PS is the estimated probability of the treatment being received, conditional on the observed confounders.", "isCorrect": True},
        {"text": "They are a highly influential observation and should be removed from the dataset.", "rationale": "A PS of 0.85 indicates a relatively common case (expected to quit), not necessarily an influential one. Cases with PS close to 0 or 1 get high IPW weights.", "isCorrect": False},
        {"text": "They are in the control group (Did Not Quit) and should have a high IPW weight.", "rationale": "A high PS suggests they are likely to quit (treated). If they did *not* quit (control), then they would have a very high IPW weight.", "isCorrect": False}
      ],
      "hint": "The PS is a probability, and the dependent variable in the logistic regression is the treatment status ($qsmk$)."
    },
    {
      "questionNumber": 8,
      "question": "What is the key objective of the **Inverse Probability Weighting (IPW)** method?",
      "answerOptions": [
        {"text": "To create a dataset where the outcome distribution is normal (Gaussian).", "rationale": "IPW aims to balance confounders, not to transform the distribution of the outcome variable.", "isCorrect": False},
        {"text": "To create a 'pseudo-population' where the treatment assignment is independent of the confounders (i.e., randomly assigned).", "rationale": "IPW achieves balance by making the weighted distribution of confounders the same in both the treated and control groups, mimicking randomization.", "isCorrect": True},
        {"text": "To directly model the relationship between the treatment and the outcome using weighted ordinary least squares (OLS).", "rationale": "While IPW *can* use weighted OLS for inference, its primary objective is the *reweighting* step (creating the pseudo-population) to eliminate confounding bias.", "isCorrect": False},
        {"text": "To identify and remove all influential observations with extreme propensity scores.", "rationale": "IPW actually *upweights* these extreme cases, it doesn't remove them. Truncation is a separate step to handle extreme weights.", "isCorrect": False}
      ],
      "hint": "The goal of weighting is to adjust the balance of the confounders, making the non-randomized data behave more like a randomized experiment."
    },
    {
      "questionNumber": 9,
      "question": "An individual in the **Treated group (quit smoking)** has a very low propensity score ($\\text{PS} = 0.05$). What does this imply about their IPW weight?",
      "answerOptions": [
        {"text": "Their weight will be very low, $1 - \\text{PS} = 0.95$, as they were expected to quit.", "rationale": "This is the weight formula for the *control* group. A low weight is for common cases, not rare ones.", "isCorrect": False},
        {"text": "Their weight will be very high, $1 / \\text{PS} = 20$, because they were an 'unlikely quitter who quit'.", "rationale": "For the treated group, the weight is $1/\\text{PS}$. A low $\\text{PS}$ means they represent many similar people who did not quit, so they get a high weight.", "isCorrect": True},
        {"text": "Their weight will be 0.05, as the weight is simply the propensity score.", "rationale": "The weight is the *inverse* of the probability of receiving the treatment they actually got.", "isCorrect": False},
        {"text": "Their weight will be moderate, close to the mean PS of the dataset.", "rationale": "Cases that behave contrary to expectation (low PS but treated, or high PS but control) get high IPW weights to correct the imbalance.", "isCorrect": False}
      ],
      "hint": "Recall the IPW formula: $W = 1 / P(\\text{Actual Treatment} | \\text{Confounders})$. How does a small denominator affect the resulting fraction?"
    },
    {
      "questionNumber": 10,
      "question": "What is the primary metric used to evaluate **balance** of the confounders *before* and *after* applying IPW weights?",
      "answerOptions": [
        {"text": "R-squared value of the propensity score model.", "rationale": "R-squared measures the explanatory power of the PS model, but not the direct balance of confounder means between groups.", "isCorrect": False},
        {"text": "p-value from a t-test comparing the means of confounders.", "rationale": "p-values are heavily influenced by sample size; the Standardized Mean Difference (SMD) is the preferred size-independent metric.", "isCorrect": False},
        {"text": "Cook's distance for influential observations.", "rationale": "Cook's distance measures the influence of individual data points on the regression results, not the overall balance of the treated and control groups.", "isCorrect": False},
        {"text": "The **Standardized Mean Difference (SMD)**, with a target typically below 0.1.", "rationale": "SMD is the standard metric used in matching and weighting to check if confounder distributions are similar across treatment groups, regardless of sample size.", "isCorrect": True}
      ],
      "hint": "This metric is used to measure the *size* of the difference in group means, scaled by the standard deviation, and is typically plotted on a 'Love Plot'."
    },
    {
      "questionNumber": 11,
      "question": "The analysis showed that the **Naive Effect** was $2.54 \\text{ kg}$, while the **Adjusted Regression Effect** was $3.41 \\text{ kg}$. What does this difference primarily indicate?",
      "answerOptions": [
        {"text": "The OLS regression model suffered from severe heteroscedasticity.", "rationale": "Heteroscedasticity is a violation of regression assumptions and does not explain the difference in the primary effect estimate.", "isCorrect": False},
        {"text": "The confounding bias was *negative*, causing the naive estimate to overestimate the true causal effect.", "rationale": "The bias is $\\text{Naive} - \\text{Adjusted} = 2.54 - 3.41 = -0.87 \\text{ kg}$. A *negative* bias means the naive estimate *underestimated* the true effect.", "isCorrect": False},
        {"text": "The confounding bias was *positive*, causing the naive estimate to underestimate the true causal effect.", "rationale": "The adjusted effect is higher ($3.41 \\text{ kg}$) than the naive effect ($2.54 \\text{ kg}$). This means the naive estimate was biased *downwards* (underestimated the true effect) by $0.87 \\text{ kg}$.", "isCorrect": False},
        {"text": "A substantial **confounding bias** of $0.87 \\text{ kg}$ was present in the naive estimate.", "rationale": "The difference between the naive and adjusted estimate ($\\lvert 2.54 - 3.41 \\rvert = 0.87 \\text{ kg}$) is the estimate of the total confounding bias removed by adjustment.", "isCorrect": True}
      ],
      "hint": "The bias is calculated as $\\text{Bias} = \\text{Naive Estimate} - \\text{Adjusted Estimate}$. The magnitude of the difference shows the extent of the confounding problem."
    },
    {
      "questionNumber": 12,
      "question": "After successfully calculating the IPW weights, how is the causal effect (ATE) estimated using the weighted means approach?",
      "answerOptions": [
        {"text": "By running a simple OLS regression of $\\text{Outcome} \\sim \\text{Treatment}$ without any weights.", "rationale": "This is the naive, unadjusted approach, which does not use the weights calculated in the IPW step.", "isCorrect": False},
        {"text": "By taking the difference between the weighted mean outcome of the treated group and the weighted mean outcome of the control group.", "rationale": "The difference between the two weighted means gives the IPW estimate of the ATE, as the weighting has balanced the confounders.", "isCorrect": True},
        {"text": "By summing the IPW weights of the treated group and subtracting the sum of the weights of the control group.", "rationale": "Summing weights is not a measure of the outcome effect. The sum of the IPW weights across the whole population should equal the sample size.", "isCorrect": False},
        {"text": "By comparing the weighted median outcome instead of the weighted mean outcome.", "rationale": "IPW is designed to estimate the Average Treatment Effect (ATE), which is based on the mean, not the median.", "isCorrect": False}
      ],
      "hint": "The ATE is the average difference in outcomes. With IPW, you just need to ensure the means are calculated on the balanced, pseudo-population."
    },
    {
      "questionNumber": 13,
      "question": "The **Regression Adjustment** and **IPW** methods yielded very similar causal estimates (approx. $3.4 \\text{ kg}$). What is the main implication of this agreement?",
      "answerOptions": [
        {"text": "It proves that the non-linear relationship between confounders and the outcome is negligible.", "rationale": "The agreement only suggests both models correctly handled confounding; it does not explicitly rule out non-linear relationships.", "isCorrect": False},
        {"text": "It suggests the regression model was likely correctly specified, increasing confidence in the final causal estimate.", "rationale": "When different methods that rely on different assumptions (e.g., OLS needs correct outcome modeling, IPW needs correct PS modeling) agree, it strengthens the finding.", "isCorrect": True},
        {"text": "It means the causal effect is not statistically significant and should be discarded.", "rationale": "The similarity of results does not determine statistical significance, and the reported p-value suggests the effect is highly significant.", "isCorrect": False},
        {"text": "It indicates the initial step of dropping missing data was performed correctly.", "rationale": "This agreement is about the *estimate* derived from the available data, not the missing data handling procedure.", "isCorrect": False}
      ],
      "hint": "Think about what it means when two different statistical 'routes' lead to the same destination. What does that tell you about the path?"
    },
    {
      "questionNumber": 14,
      "question": "The final causal estimate was $\\mathbf{3.41 \\text{ kg}}$ with a $\\mathbf{95\\% \\text{ CI}}$ of $[\\mathbf{2.80}, \\mathbf{4.02}]$. What does this confidence interval mean?",
      "answerOptions": [
        {"text": "95% of the individuals in the sample gained weight between 2.80 kg and 4.02 kg.", "rationale": "The CI is a statement about the population mean, not the range of individual data points in the sample.", "isCorrect": False},
        {"text": "There is a 95% probability that the true average causal effect of quitting smoking in the population is between 2.80 kg and 4.02 kg.", "rationale": "The CI means that if you repeated the study many times, 95% of the CIs constructed would contain the true population effect.", "isCorrect": True},
        {"text": "It suggests the result is not statistically significant because the interval contains the mean (3.41 kg).", "rationale": "The result is statistically significant because the interval **does not contain zero**, indicating a real effect.", "isCorrect": False},
        {"text": "The confounding bias is guaranteed to be less than 4.02 kg.", "rationale": "The CI is for the causal effect itself, not an estimate of the remaining confounding bias.", "isCorrect": False}
      ],
      "hint": "A confidence interval is an estimated range of values which is likely to include an unknown population parameter, not a range for individual observations."
    },
    {
      "questionNumber": 15,
      "question": "In the OLS Regression Diagnostics, what would a distinct 'cone' shape in the **Scale-Location Plot** (Residuals vs. Fitted Values) primarily suggest?",
      "answerOptions": [
        {"text": "Violation of the assumption of linearity, suggesting a need for polynomial terms.", "rationale": "Non-linearity usually appears as a curved pattern in the Residuals vs. Fitted Values plot, not the Scale-Location plot.", "isCorrect": False},
        {"text": "Violation of the assumption of normally distributed residuals.", "rationale": "Normality is best checked using the Q-Q Plot and the residuals histogram.", "isCorrect": False},
        {"text": "Violation of the assumption of **homoscedasticity** (constant variance of errors).", "rationale": "The Scale-Location plot checks for homoscedasticity, and a 'cone' shape (variance increasing or decreasing with fitted values) indicates heteroscedasticity.", "isCorrect": True},
        {"text": "The presence of highly influential observations (high leverage points).", "rationale": "Influential observations are best assessed using metrics like Cook's distance, not primarily the Scale-Location plot.", "isCorrect": False}
      ],
      "hint": "This plot helps determine if the spread of the error terms is consistent across all predicted values of the outcome."
    }
  ]
}

def format_quiz_for_colab(quiz_data):
    """Formats the quiz data into a readable Markdown string for Google Colab."""
    output = []
    output.append("# üìä Crash Course in Causal Inference: Multiple-Choice Quiz\n")
    output.append("*Test your understanding of causal inference concepts from the NHEFS worked example*\n")
    output.append("---\n")

    for q_data in quiz_data['questions']:
        q_num = q_data['questionNumber']
        question = q_data['question']
        hint = q_data.get('hint', '')

        output.append(f"## Question {q_num}\n")
        output.append(f"**{question}**\n")

        if hint:
            output.append(f"üí° *Hint: {hint}*\n")

        # Options
        options = q_data['answerOptions']
        option_letters = ['A', 'B', 'C', 'D']
        correct_letter = ""

        output.append("\n### Answer Options:\n")
        for i, option in enumerate(options):
            letter = option_letters[i]
            output.append(f"**{letter}.** {option['text']}\n")
            if option['isCorrect']:
                correct_letter = letter

        output.append("\n<details>")
        output.append("<summary><b>üëâ Click here to reveal the answer</b></summary>\n")

        # Answer
        output.append(f"### ‚úÖ Correct Answer: **{correct_letter}**\n")

        # Explanation
        output.append("### üìñ Detailed Explanations:\n")
        for i, option in enumerate(options):
            letter = option_letters[i]
            if option['isCorrect']:
                output.append(f"**{letter}. ‚úÖ CORRECT:** {option['rationale']}\n")
            else:
                output.append(f"**{letter}. ‚ùå Incorrect:** {option['rationale']}\n")

        output.append("\n</details>\n")
        output.append("\n---\n")

    return "\n".join(output)

# Generate and display the formatted quiz
formatted_quiz = format_quiz_for_colab(quiz_data)
display(Markdown(formatted_quiz))

# Optional: Print instructions
print("\n" + "="*70)
print("üìã INSTRUCTIONS:")
print("="*70)
print("1. Read each question carefully")
print("2. Consider all answer options before selecting")
print("3. Click 'Click here to reveal the answer' to see the solution")
print("4. Review the detailed explanations for all options")
print("5. The quiz covers 15 key concepts from causal inference")
print("="*70)

# üìä Crash Course in Causal Inference: Multiple-Choice Quiz

*Test your understanding of causal inference concepts from the NHEFS worked example*

---

## Question 1

**What is the primary causal question investigated in the NHEFS worked example?**

üí° *Hint: Focus on the variables designated as the treatment and the outcome in the initial problem statement.*


### Answer Options:

**A.** Does exercise level affect weight change?

**B.** Does quitting smoking (qsmk) cause weight gain (wt82_71)?

**C.** Does income affect the rate of death in the follow-up period?

**D.** Does a higher baseline weight (wt71) increase the likelihood of quitting smoking?


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **B**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** While the dataset contains exercise and weight variables, this is a secondary question. The primary focus of the analysis was on the effect of smoking cessation.

**B. ‚úÖ CORRECT:** The script explicitly states this is the main causal question of interest, where 'qsmk' is the treatment and 'wt82_71' is the outcome.

**C. ‚ùå Incorrect:** The dataset includes income and death, but the worked example specifically focuses on the relationship between smoking cessation and weight change.

**D. ‚ùå Incorrect:** This question addresses the relationship between a confounder (wt71) and the treatment (qsmk), but it is not the main causal effect being measured.


</details>


---

## Question 2

**In the context of causal inference, what is a **confounder**?**

üí° *Hint: Think about what a variable must do to create a spurious, non-causal association between the treatment and the outcome.*


### Answer Options:

**A.** A variable that is only affected by the treatment and then affects the outcome.

**B.** A variable that causes both the treatment and the outcome.

**C.** A variable whose value is missing for a significant number of observations.

**D.** The primary variable of interest that the researcher is manipulating.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **B**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** This describes a mediator variable, which is a step in the causal pathway from treatment to outcome, not a confounder.

**B. ‚úÖ CORRECT:** This is the precise definition of a confounder, as it opens a 'backdoor path' creating a spurious association between treatment and outcome.

**C. ‚ùå Incorrect:** This describes missing data, which is a data quality issue, but not the causal role of a confounder.

**D. ‚ùå Incorrect:** This describes the treatment or exposure variable, not the definition of a confounder.


</details>


---

## Question 3

**Why is **Age** considered a major confounder in the analysis of quitting smoking on weight gain?**

üí° *Hint: Recall how age affects both the likelihood of engaging in the behavior (smoking) and the body's natural processes (metabolism).*


### Answer Options:

**A.** Because Age directly determines the treatment, meaning older people always quit smoking.

**B.** Because Age is related to both the probability of quitting smoking and the metabolic rate (which affects weight gain).

**C.** Because it is required to calculate the Adjusted R-squared in the regression model.

**D.** Because the naive comparison showed quitters are younger, which balances the groups.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **B**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** While age is associated with treatment, it does not determine it, and this statement is too strong and factually incorrect.

**B. ‚úÖ CORRECT:** The script specifically points out that Age affects 'the smoking urge' (treatment) and 'metabolism' (outcome), fitting the definition of a confounder.

**C. ‚ùå Incorrect:** While age is used in the regression, this describes a statistical requirement, not its role as a causal variable (a confounder).

**D. ‚ùå Incorrect:** The naive comparison showed quitters are *older* (46.17 vs. 42.79), confirming it's a source of imbalance, not balance.


</details>


---

## Question 4

**The **naive effect** (simple difference in means) showed that quitters gained **2.54 kg** more than non-quitters. Why is this estimate considered **biased**?**

üí° *Hint: Consider the core problem that causal inference methods (like regression and IPW) are designed to solve.*


### Answer Options:

**A.** It fails to account for influential outliers with extreme weight changes.

**B.** It only measures association and does not control for the systematic differences (confounders) between the quitters and non-quitters.

**C.** It uses the mean instead of the median, which is less robust to skewed weight distributions.

**D.** It is based on OLS regression, which assumes linearity and normality of residuals.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **B**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** While outliers can affect mean estimates, the primary reason for bias in this context is confounding, not specifically outliers.

**B. ‚úÖ CORRECT:** The naive estimate is biased because it includes both the true causal effect and the spurious effect introduced by backdoor paths (confounders).

**C. ‚ùå Incorrect:** The choice between mean and median is about central tendency robustness, not the structural issue of confounding bias.

**D. ‚ùå Incorrect:** The naive estimate is the simple difference of group means, calculated *before* running any regression model.


</details>


---

## Question 5

**In a Causal Diagram (DAG), how is a **backdoor path** from Treatment to Outcome blocked to achieve an unbiased causal estimate?**

üí° *Hint: The solution requires actively doing something to the common cause node in the DAG to prevent the spurious flow of information.*


### Answer Options:

**A.** By observing or statistically adjusting for the outcome variable in the analysis.

**B.** By setting the treatment variable to a fixed value for all observations in the dataset.

**C.** By conditioning on (adjusting for) the variables that constitute the backdoor path (the common causes or confounders).

**D.** By removing any observations that have missing values for the confounders.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **C**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** The outcome is the variable being measured; adjusting for it will bias the causal effect (M-bias or collider bias).

**B. ‚ùå Incorrect:** Fixing the treatment is the counterfactual step, but the path is blocked by controlling for the *confounder* node, not the treatment node.

**C. ‚úÖ CORRECT:** Adjusting for a common cause (confounder) closes the spurious path $T \leftarrow C \rightarrow O$, isolating the true causal effect $T \rightarrow O$.

**D. ‚ùå Incorrect:** Removing missing data addresses data quality, but does not inherently close the causal path via the remaining observations.


</details>


---

## Question 6

**What is the primary role of **Regression Adjustment (OLS)** in estimating the causal effect of quitting smoking?**

üí° *Hint: Think about what the multiple regression equation does to the coefficient of the treatment variable when many other variables are included.*


### Answer Options:

**A.** To calculate the propensity score for each individual in the dataset.

**B.** To compare people who are different in the confounders and who both quit smoking.

**C.** To statistically 'hold constant' the values of all measured confounders while estimating the difference in outcome between treated and control groups.

**D.** To visually represent the distribution of the treatment variable across all confounders.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **C**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** Propensity scores are calculated using Logistic Regression (Logit), not Ordinary Least Squares (OLS) regression.

**B. ‚ùå Incorrect:** The goal is to compare groups that are *similar* in confounders but *different* in treatment (quit vs. not quit).

**C. ‚úÖ CORRECT:** This describes the mechanism of OLS: the coefficient for the treatment variable ($qsmk$) represents the effect when all other variables (confounders) are equal.

**D. ‚ùå Incorrect:** Visualization (like the Love plot) helps check balance, but this is not the function of the regression model itself.


</details>


---

## Question 7

**The **Propensity Score (PS)** is calculated using a logistic regression with the formula $P(\text{Treatment}=1 | \text{Confounders})$. What does an individual's PS of **0.85** signify?**

üí° *Hint: The PS is a probability, and the dependent variable in the logistic regression is the treatment status ($qsmk$).*


### Answer Options:

**A.** They had an 85% chance of gaining 2.54 kg of weight.

**B.** They had an 85% chance of *quitting smoking*, given their baseline characteristics (confounders).

**C.** They are a highly influential observation and should be removed from the dataset.

**D.** They are in the control group (Did Not Quit) and should have a high IPW weight.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **B**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** The PS is the probability of receiving the treatment (quitting), not the probability of the outcome (weight gain).

**B. ‚úÖ CORRECT:** The PS is the estimated probability of the treatment being received, conditional on the observed confounders.

**C. ‚ùå Incorrect:** A PS of 0.85 indicates a relatively common case (expected to quit), not necessarily an influential one. Cases with PS close to 0 or 1 get high IPW weights.

**D. ‚ùå Incorrect:** A high PS suggests they are likely to quit (treated). If they did *not* quit (control), then they would have a very high IPW weight.


</details>


---

## Question 8

**What is the key objective of the **Inverse Probability Weighting (IPW)** method?**

üí° *Hint: The goal of weighting is to adjust the balance of the confounders, making the non-randomized data behave more like a randomized experiment.*


### Answer Options:

**A.** To create a dataset where the outcome distribution is normal (Gaussian).

**B.** To create a 'pseudo-population' where the treatment assignment is independent of the confounders (i.e., randomly assigned).

**C.** To directly model the relationship between the treatment and the outcome using weighted ordinary least squares (OLS).

**D.** To identify and remove all influential observations with extreme propensity scores.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **B**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** IPW aims to balance confounders, not to transform the distribution of the outcome variable.

**B. ‚úÖ CORRECT:** IPW achieves balance by making the weighted distribution of confounders the same in both the treated and control groups, mimicking randomization.

**C. ‚ùå Incorrect:** While IPW *can* use weighted OLS for inference, its primary objective is the *reweighting* step (creating the pseudo-population) to eliminate confounding bias.

**D. ‚ùå Incorrect:** IPW actually *upweights* these extreme cases, it doesn't remove them. Truncation is a separate step to handle extreme weights.


</details>


---

## Question 9

**An individual in the **Treated group (quit smoking)** has a very low propensity score ($\text{PS} = 0.05$). What does this imply about their IPW weight?**

üí° *Hint: Recall the IPW formula: $W = 1 / P(\text{Actual Treatment} | \text{Confounders})$. How does a small denominator affect the resulting fraction?*


### Answer Options:

**A.** Their weight will be very low, $1 - \text{PS} = 0.95$, as they were expected to quit.

**B.** Their weight will be very high, $1 / \text{PS} = 20$, because they were an 'unlikely quitter who quit'.

**C.** Their weight will be 0.05, as the weight is simply the propensity score.

**D.** Their weight will be moderate, close to the mean PS of the dataset.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **B**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** This is the weight formula for the *control* group. A low weight is for common cases, not rare ones.

**B. ‚úÖ CORRECT:** For the treated group, the weight is $1/\text{PS}$. A low $\text{PS}$ means they represent many similar people who did not quit, so they get a high weight.

**C. ‚ùå Incorrect:** The weight is the *inverse* of the probability of receiving the treatment they actually got.

**D. ‚ùå Incorrect:** Cases that behave contrary to expectation (low PS but treated, or high PS but control) get high IPW weights to correct the imbalance.


</details>


---

## Question 10

**What is the primary metric used to evaluate **balance** of the confounders *before* and *after* applying IPW weights?**

üí° *Hint: This metric is used to measure the *size* of the difference in group means, scaled by the standard deviation, and is typically plotted on a 'Love Plot'.*


### Answer Options:

**A.** R-squared value of the propensity score model.

**B.** p-value from a t-test comparing the means of confounders.

**C.** Cook's distance for influential observations.

**D.** The **Standardized Mean Difference (SMD)**, with a target typically below 0.1.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **D**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** R-squared measures the explanatory power of the PS model, but not the direct balance of confounder means between groups.

**B. ‚ùå Incorrect:** p-values are heavily influenced by sample size; the Standardized Mean Difference (SMD) is the preferred size-independent metric.

**C. ‚ùå Incorrect:** Cook's distance measures the influence of individual data points on the regression results, not the overall balance of the treated and control groups.

**D. ‚úÖ CORRECT:** SMD is the standard metric used in matching and weighting to check if confounder distributions are similar across treatment groups, regardless of sample size.


</details>


---

## Question 11

**The analysis showed that the **Naive Effect** was $2.54 \text{ kg}$, while the **Adjusted Regression Effect** was $3.41 \text{ kg}$. What does this difference primarily indicate?**

üí° *Hint: The bias is calculated as $\text{Bias} = \text{Naive Estimate} - \text{Adjusted Estimate}$. The magnitude of the difference shows the extent of the confounding problem.*


### Answer Options:

**A.** The OLS regression model suffered from severe heteroscedasticity.

**B.** The confounding bias was *negative*, causing the naive estimate to overestimate the true causal effect.

**C.** The confounding bias was *positive*, causing the naive estimate to underestimate the true causal effect.

**D.** A substantial **confounding bias** of $0.87 \text{ kg}$ was present in the naive estimate.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **D**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** Heteroscedasticity is a violation of regression assumptions and does not explain the difference in the primary effect estimate.

**B. ‚ùå Incorrect:** The bias is $\text{Naive} - \text{Adjusted} = 2.54 - 3.41 = -0.87 \text{ kg}$. A *negative* bias means the naive estimate *underestimated* the true effect.

**C. ‚ùå Incorrect:** The adjusted effect is higher ($3.41 \text{ kg}$) than the naive effect ($2.54 \text{ kg}$). This means the naive estimate was biased *downwards* (underestimated the true effect) by $0.87 \text{ kg}$.

**D. ‚úÖ CORRECT:** The difference between the naive and adjusted estimate ($\lvert 2.54 - 3.41 \rvert = 0.87 \text{ kg}$) is the estimate of the total confounding bias removed by adjustment.


</details>


---

## Question 12

**After successfully calculating the IPW weights, how is the causal effect (ATE) estimated using the weighted means approach?**

üí° *Hint: The ATE is the average difference in outcomes. With IPW, you just need to ensure the means are calculated on the balanced, pseudo-population.*


### Answer Options:

**A.** By running a simple OLS regression of $\text{Outcome} \sim \text{Treatment}$ without any weights.

**B.** By taking the difference between the weighted mean outcome of the treated group and the weighted mean outcome of the control group.

**C.** By summing the IPW weights of the treated group and subtracting the sum of the weights of the control group.

**D.** By comparing the weighted median outcome instead of the weighted mean outcome.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **B**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** This is the naive, unadjusted approach, which does not use the weights calculated in the IPW step.

**B. ‚úÖ CORRECT:** The difference between the two weighted means gives the IPW estimate of the ATE, as the weighting has balanced the confounders.

**C. ‚ùå Incorrect:** Summing weights is not a measure of the outcome effect. The sum of the IPW weights across the whole population should equal the sample size.

**D. ‚ùå Incorrect:** IPW is designed to estimate the Average Treatment Effect (ATE), which is based on the mean, not the median.


</details>


---

## Question 13

**The **Regression Adjustment** and **IPW** methods yielded very similar causal estimates (approx. $3.4 \text{ kg}$). What is the main implication of this agreement?**

üí° *Hint: Think about what it means when two different statistical 'routes' lead to the same destination. What does that tell you about the path?*


### Answer Options:

**A.** It proves that the non-linear relationship between confounders and the outcome is negligible.

**B.** It suggests the regression model was likely correctly specified, increasing confidence in the final causal estimate.

**C.** It means the causal effect is not statistically significant and should be discarded.

**D.** It indicates the initial step of dropping missing data was performed correctly.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **B**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** The agreement only suggests both models correctly handled confounding; it does not explicitly rule out non-linear relationships.

**B. ‚úÖ CORRECT:** When different methods that rely on different assumptions (e.g., OLS needs correct outcome modeling, IPW needs correct PS modeling) agree, it strengthens the finding.

**C. ‚ùå Incorrect:** The similarity of results does not determine statistical significance, and the reported p-value suggests the effect is highly significant.

**D. ‚ùå Incorrect:** This agreement is about the *estimate* derived from the available data, not the missing data handling procedure.


</details>


---

## Question 14

**The final causal estimate was $\mathbf{3.41 \text{ kg}}$ with a $\mathbf{95\% \text{ CI}}$ of $[\mathbf{2.80}, \mathbf{4.02}]$. What does this confidence interval mean?**

üí° *Hint: A confidence interval is an estimated range of values which is likely to include an unknown population parameter, not a range for individual observations.*


### Answer Options:

**A.** 95% of the individuals in the sample gained weight between 2.80 kg and 4.02 kg.

**B.** There is a 95% probability that the true average causal effect of quitting smoking in the population is between 2.80 kg and 4.02 kg.

**C.** It suggests the result is not statistically significant because the interval contains the mean (3.41 kg).

**D.** The confounding bias is guaranteed to be less than 4.02 kg.


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **B**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** The CI is a statement about the population mean, not the range of individual data points in the sample.

**B. ‚úÖ CORRECT:** The CI means that if you repeated the study many times, 95% of the CIs constructed would contain the true population effect.

**C. ‚ùå Incorrect:** The result is statistically significant because the interval **does not contain zero**, indicating a real effect.

**D. ‚ùå Incorrect:** The CI is for the causal effect itself, not an estimate of the remaining confounding bias.


</details>


---

## Question 15

**In the OLS Regression Diagnostics, what would a distinct 'cone' shape in the **Scale-Location Plot** (Residuals vs. Fitted Values) primarily suggest?**

üí° *Hint: This plot helps determine if the spread of the error terms is consistent across all predicted values of the outcome.*


### Answer Options:

**A.** Violation of the assumption of linearity, suggesting a need for polynomial terms.

**B.** Violation of the assumption of normally distributed residuals.

**C.** Violation of the assumption of **homoscedasticity** (constant variance of errors).

**D.** The presence of highly influential observations (high leverage points).


<details>
<summary><b>üëâ Click here to reveal the answer</b></summary>

### ‚úÖ Correct Answer: **C**

### üìñ Detailed Explanations:

**A. ‚ùå Incorrect:** Non-linearity usually appears as a curved pattern in the Residuals vs. Fitted Values plot, not the Scale-Location plot.

**B. ‚ùå Incorrect:** Normality is best checked using the Q-Q Plot and the residuals histogram.

**C. ‚úÖ CORRECT:** The Scale-Location plot checks for homoscedasticity, and a 'cone' shape (variance increasing or decreasing with fitted values) indicates heteroscedasticity.

**D. ‚ùå Incorrect:** Influential observations are best assessed using metrics like Cook's distance, not primarily the Scale-Location plot.


</details>


---



üìã INSTRUCTIONS:
1. Read each question carefully
2. Consider all answer options before selecting
3. Click 'Click here to reveal the answer' to see the solution
4. Review the detailed explanations for all options
5. The quiz covers 15 key concepts from causal inference
