**Worked Example 1: Causal Analysis in a Healthcare Dataset**

*Objective*

Estimate the effect of Aspirin (a specific medication) on patient health outcomes.

*Dataset Description*

The healthcare dataset contains information about patients, including their age, gender, medical condition, medication administered, and test results.

*Steps for Analysis*

**Step 1: Data Preparation**

Theory:

Data preparation covers cleaning, handling missing values, and encoding categorical variables.

Clean data is a prerequisite for accurate analysis, which matters particularly in healthcare settings.


In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/Tanvivalkunde/ADSA/main/healthcare_dataset.csv')

# Assume missing values handling and other cleaning processes are done

# Encoding categorical variables (e.g., Gender, Medical Condition)
encoder = OneHotEncoder()
categorical_variables = ['Gender', 'Medical Condition']
encoded_vars = encoder.fit_transform(df[categorical_variables]).toarray()
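
The cleaning that the cell above assumes has already happened might look roughly like the following. This is a minimal sketch on a small made-up frame: the column names mirror the ones used above, but the toy values, the helper name `basic_clean`, and the imputation choices (median for `Age`, an explicit `'Unknown'` label for categoricals) are assumptions, not something the notebook specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the healthcare dataset
toy = pd.DataFrame({
    'Age': [34, np.nan, 58, 41],
    'Gender': ['Female', 'Male', None, 'Male'],
    'Medical Condition': ['Asthma', 'Diabetes', 'Asthma', None],
})

def basic_clean(frame, numeric_cols, categorical_cols):
    """Impute numeric columns with the median and label missing categories."""
    cleaned = frame.copy()
    for col in numeric_cols:
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    for col in categorical_cols:
        cleaned[col] = cleaned[col].fillna('Unknown')
    return cleaned

toy_clean = basic_clean(toy, ['Age'], ['Gender', 'Medical Condition'])

# One-hot encode the categorical columns (a pandas alternative to OneHotEncoder)
toy_encoded = pd.get_dummies(toy_clean, columns=['Gender', 'Medical Condition'])
```

Whether median imputation is appropriate depends on why values are missing; dropping incomplete rows is another common choice.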


**Step 2: Define Treatment and Outcome**

Theory:

Treatment is the intervention or exposure of interest, here represented by the use of Aspirin.

The outcome is the variable we want to study the effect on, such as the patient's health condition, measured through 'Test Results'.

In [3]:
# Define the treatment: 1 if the patient received Aspirin, 0 otherwise
df['treatment'] = np.where(df['Medication'] == 'Aspirin', 1, 0)

# Encode the outcome as binary: 1 if 'Test Results' equals 'Positive', 0 otherwise.
# This depends on how 'Test Results' is recorded: if the label 'Positive' never
# occurs in the column, every outcome is 0 and the estimated effect is trivially 0.
df['outcome'] = np.where(df['Test Results'] == 'Positive', 1, 0)
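
Because the binary encoding silently produces all zeros when the chosen label is absent, it can help to check the observed labels first. The sketch below is one way to do that; the helper name `encode_binary_outcome` and the example label values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def encode_binary_outcome(labels, positive_label):
    """Return a 0/1 Series, refusing to proceed if the positive label is absent."""
    observed = set(labels.dropna().unique())
    if positive_label not in observed:
        raise ValueError(
            f"{positive_label!r} not found; observed labels: {sorted(observed)}"
        )
    return (labels == positive_label).astype(int)

# Hypothetical label values, for illustration only
example_results = pd.Series(['Normal', 'Abnormal', 'Normal', 'Inconclusive'])
example_outcome = encode_binary_outcome(example_results, 'Abnormal')
```

Calling `df['Test Results'].value_counts()` before encoding serves the same purpose interactively.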


**Step 3: Propensity Score Estimation**

Theory:

The propensity score is the probability that a unit receives the treatment, given its observed characteristics.

Logistic regression is commonly used to estimate propensity scores.

In [4]:
from sklearn.linear_model import LogisticRegression

# Combining encoded categorical variables with other covariates
covariates = np.concatenate([encoded_vars, df[['Age']]], axis=1)

# Logistic Regression to estimate propensity scores
model = LogisticRegression()
model.fit(covariates, df['treatment'])
df['propensity_score'] = model.predict_proba(covariates)[:, 1]
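
Matching on propensity scores only makes sense where the treated and control score ranges overlap. The following is a rough sketch of an overlap check; the function name `overlap_summary` and the demo numbers are invented for illustration and do not come from the dataset.

```python
import numpy as np

def overlap_summary(scores, treatment):
    """Report the score range shared by both groups and the share of units inside it."""
    scores = np.asarray(scores, dtype=float)
    treatment = np.asarray(treatment).astype(bool)
    treated_scores, control_scores = scores[treatment], scores[~treatment]
    # The common support runs from the larger minimum to the smaller maximum
    low = max(treated_scores.min(), control_scores.min())
    high = min(treated_scores.max(), control_scores.max())
    in_support = (scores >= low) & (scores <= high)
    return {
        'common_support': (low, high),
        'share_in_support': in_support.mean(),
    }

# Hypothetical scores: three controls followed by three treated units
demo_scores = [0.10, 0.30, 0.50, 0.40, 0.60, 0.90]
demo_treatment = [0, 0, 0, 1, 1, 1]
summary = overlap_summary(demo_scores, demo_treatment)
```

With the notebook's data, the call would presumably be `overlap_summary(df['propensity_score'], df['treatment'])`.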


**Step 4: Matching**

Theory:

Matching involves pairing each treated unit with one or more control units with similar propensity scores.

This step aims to balance the distribution of observed characteristics between treated and control groups.

In [5]:
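# Greedy 1:1 matching within a caliper, without replacement.
# Each treated unit is paired with the first remaining control (in row order)
# whose propensity score lies within the caliper; this is not necessarily the closest one.
def match_units(treated_df, control_df, caliper=0.05):
    matched = []
    for i, row in treated_df.iterrows():
        # Controls still available whose score is within the caliper of this treated unit
        control_pool = control_df[np.abs(control_df['propensity_score'] - row['propensity_score']) < caliper]
        if not control_pool.empty:
            match = control_pool.iloc[0]
            matched.append(match)
            # Remove the matched control so it cannot be reused
            control_df = control_df.drop(match.name)
    return pd.DataFrame(matched)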

treated = df[df['treatment'] == 1]
control = df[df['treatment'] == 0]
matched_control = match_units(treated, control)
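
The matching cell above takes the first control that falls inside the caliper. A variant that picks the closest available control, and also records which treated unit each control was paired with, could be sketched as follows. The function name `match_nearest`, the returned column names, and the demo frame are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def match_nearest(treated_df, control_df, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on 'propensity_score', without replacement."""
    available = control_df.copy()
    pairs = []
    for treated_idx, row in treated_df.iterrows():
        if available.empty:
            break
        # Absolute score distance from this treated unit to every remaining control
        distance = (available['propensity_score'] - row['propensity_score']).abs()
        best_idx = distance.idxmin()
        # Accept the closest control only if it lies within the caliper
        if distance.loc[best_idx] < caliper:
            pairs.append((treated_idx, best_idx))
            available = available.drop(best_idx)
    return pd.DataFrame(pairs, columns=['treated_index', 'control_index'])

# Hypothetical data: two treated units, three controls
demo = pd.DataFrame({
    'treatment': [1, 1, 0, 0, 0],
    'propensity_score': [0.30, 0.80, 0.33, 0.31, 0.60],
})
pairs = match_nearest(demo[demo['treatment'] == 1], demo[demo['treatment'] == 0])
```

Returning index pairs rather than control rows keeps track of which treated units went unmatched, so they can be excluded from the later comparison.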


**Step 5: Estimate Treatment Effect**
    
Theory:

The treatment effect is estimated by comparing the outcomes between matched treated and control groups.

This comparison aims to reveal the causal effect of the treatment.

In [6]:
# Compare mean outcomes between the treated group and the matched controls.
# Note: `treated` holds every treated unit, including any that found no match within the caliper.
treated_outcomes = treated['outcome']
control_outcomes = matched_control['outcome']
effect_estimate = treated_outcomes.mean() - control_outcomes.mean()

print("Estimated Treatment Effect:", effect_estimate)


Estimated Treatment Effect: 0.0
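
The cell above averages the outcome over every treated unit, whether or not it found a match. One alternative is to average the outcome difference over matched pairs only, which estimates the effect on the matched treated units. The sketch below assumes the pairing is available as two aligned lists of row labels; the helper name `att_from_pairs` and the demo values are hypothetical.

```python
import pandas as pd

def att_from_pairs(outcome, treated_index, control_index):
    """Mean outcome difference across matched (treated, control) pairs."""
    treated_y = outcome.loc[list(treated_index)].to_numpy()
    control_y = outcome.loc[list(control_index)].to_numpy()
    return float((treated_y - control_y).mean())

# Hypothetical binary outcomes indexed by row label
demo_outcome = pd.Series([1, 0, 1, 0, 0, 1], index=[10, 11, 12, 13, 14, 15])
# Hypothetical pairs: treated rows 10, 11, 12 matched to controls 13, 14, 15
att = att_from_pairs(demo_outcome, [10, 11, 12], [13, 14, 15])
```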


**Worked Example 2: Causal Analysis in Educational Data**

*Objective*

Estimate the effect of student engagement (measured through the frequency of raised hands) on academic performance.


*Dataset Description*

The dataset includes various features related to students in a school setting, such as their gender, grade level, engagement metrics (like raisedhands), and their academic performance (Class).

*Steps for Analysis*

**Step 1: Data Preparation**

Theory:

Data preparation ensures that the dataset is clean and that the variables are correctly formatted for analysis.

Encoding categorical variables into a numeric format is necessary for many machine learning models.

In [7]:
import numpy as np
import pandas as pd

# Load the dataset
edu_df = pd.read_csv('https://raw.githubusercontent.com/Tanvivalkunde/ADSA/main/xAPI-Edu-Data.csv')

# Encode gender as integer category codes
edu_df['gender_code'] = edu_df['gender'].astype('category').cat.codes


**Step 2: Define Treatment and Outcome**

Theory:


Treatment in this context is defined as high student engagement, operationalized as a raised-hands count at or above a chosen threshold.

The outcome is the academic performance of the students, represented by the Class variable.

In [8]:
# Define high engagement as treatment: raised hands at or above a threshold
threshold = 50
edu_df['treatment'] = np.where(edu_df['raisedhands'] >= threshold, 1, 0)

# Encode 'Class' as integer codes. cat.codes numbers the categories in sorted
# (alphabetical) order, so labels such as 'L', 'M', 'H' become H=0, L=1, M=2,
# which is not the low-to-high performance order.
edu_df['Class_code'] = edu_df['Class'].astype('category').cat.codes
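
If the outcome codes are meant to be ordinal, the category order has to be stated explicitly. The following sketch contrasts the default alphabetical coding with an explicit low-to-high order; the labels 'L', 'M', 'H' and their intended order are assumptions based on the example values mentioned above.

```python
import pandas as pd

# Hypothetical labels; the real column may use different values
demo_class = pd.Series(['M', 'L', 'H', 'M'])

# Default behaviour: categories are sorted alphabetically, so H=0, L=1, M=2
alphabetical_codes = demo_class.astype('category').cat.codes

# Explicit ordering: L=0, M=1, H=2
ordered_type = pd.CategoricalDtype(categories=['L', 'M', 'H'], ordered=True)
ordinal_codes = demo_class.astype(ordered_type).cat.codes
```

With the explicit order, a higher code corresponds to a higher performance class, so a difference in mean codes has a consistent direction.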


**Step 3: Propensity Score Estimation**

Theory:

Propensity scores estimate the likelihood of receiving the treatment based on observed characteristics.

Logistic regression is used to estimate these scores.

In [9]:
from sklearn.linear_model import LogisticRegression

# Use logistic regression to estimate propensity scores
covariates = ['gender_code']  # Add other relevant covariates
model = LogisticRegression()
model.fit(edu_df[covariates], edu_df['treatment'])
edu_df['propensity_score'] = model.predict_proba(edu_df[covariates])[:, 1]


**Step 4: Matching**

Theory:


Matching creates pairs or groups of treated and untreated units with similar propensity scores.

This step aims to balance the comparison groups in terms of observed characteristics.

In [10]:
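# Greedy 1:1 matching within a caliper, without replacement.
# Each treated unit is paired with the first remaining control (in row order)
# whose propensity score lies within the caliper; this is not necessarily the closest one.
def match_units(treated_df, control_df, caliper=0.05):
    matched = []
    for i, row in treated_df.iterrows():
        # Controls still available whose score is within the caliper of this treated unit
        control_pool = control_df[np.abs(control_df['propensity_score'] - row['propensity_score']) < caliper]
        if not control_pool.empty:
            match = control_pool.iloc[0]
            matched.append(match)
            # Remove the matched control so it cannot be reused
            control_df = control_df.drop(match.name)
    return pd.DataFrame(matched)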

treated = edu_df[edu_df['treatment'] == 1]
control = edu_df[edu_df['treatment'] == 0]
matched_control = match_units(treated, control)
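
Whether matching actually balanced the groups can be checked per covariate with a standardized mean difference. This is a minimal sketch under the usual pooled-standard-deviation definition; the function name and the demo values are invented for illustration.

```python
import numpy as np

def standardized_mean_difference(treated_values, control_values):
    """Difference in means divided by the pooled standard deviation."""
    treated_values = np.asarray(treated_values, dtype=float)
    control_values = np.asarray(control_values, dtype=float)
    pooled_sd = np.sqrt((treated_values.var(ddof=1) + control_values.var(ddof=1)) / 2)
    if pooled_sd == 0:
        # No variation in either group: the ratio is undefined
        return float('nan')
    return float((treated_values.mean() - control_values.mean()) / pooled_sd)

# Hypothetical binary covariate values in the two groups
smd = standardized_mean_difference([1, 0, 1, 1], [0, 0, 1, 1])
```

Applied to the notebook's data, the inputs would presumably be a covariate such as `gender_code` taken from the treated and matched-control frames. Absolute values below roughly 0.1 are often read as acceptable balance, though that cutoff is a convention rather than a rule.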


**Step 5: Estimate Treatment Effect**

Theory:

The treatment effect is measured as the difference in academic performance between the matched groups.

This effect helps us understand the impact of engagement on academic outcomes.

In [11]:
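# Compare mean outcome codes between the treated group and the matched controls.
# Note: `treated` holds every treated unit, including any that found no match within the caliper.
treated_outcomes = treated['Class_code']
control_outcomes = matched_control['Class_code']
effect_estimate = treated_outcomes.mean() - control_outcomes.mean()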

print("Estimated Treatment Effect:", effect_estimate)


Estimated Treatment Effect: -0.36019988242210466
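
A difference in mean category codes is hard to interpret for a multi-class outcome. One alternative is to compare the share of each class within the treated and control groups. The sketch below uses made-up labels and a hypothetical helper name, `outcome_shares`.

```python
import pandas as pd

def outcome_shares(labels, group):
    """Share of each outcome label within each group (each row sums to 1)."""
    return pd.crosstab(group, labels, normalize='index')

# Hypothetical data: four treated and four control students
demo = pd.DataFrame({
    'treatment': [1, 1, 1, 1, 0, 0, 0, 0],
    'Class': ['H', 'H', 'M', 'L', 'M', 'L', 'L', 'L'],
})
shares = outcome_shares(demo['Class'], demo['treatment'])
```

Each row of `shares` is one group, so differences between rows show which classes become more or less common under treatment.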


LICENSE
MIT License

Copyright (c) 2022 Tanvi Manohar Valkunde

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.