# **Causal Inference Example using DoWhy**

## **Importing Necessary Libraries**

In [1]:
# install libraries
!pip install dowhy numpy pandas matplotlib seaborn


In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import dowhy
from dowhy import CausalModel

import warnings
warnings.filterwarnings('ignore')

## **Dataset**
Here we generate a dataset that simulates user behavior over a 12-month period. Each column of the DataFrame is described below:



*   **user_id:** This column represents the unique identifier for each user. Each user is assigned a unique ID.
*   **signup_month:** This column indicates the month in which a user signed up. If the value is 0, it means the user did not sign up during any month. Otherwise, the value represents the month in which the user signed up.
*   **month:** This column represents the month of the observation. It ranges from 1 to 12, indicating the months in a year.
*   **spend:** This column represents the simulated spending of users. Values are drawn from a Poisson distribution with mean 500, then reduced by 10 times the month number, so spending declines as the year goes on. For users who signed up, spending in every month after the signup month is increased by 100. This simulates the treatment effect on spending.
*   **treatment:** This column indicates whether a user is in the treatment group or not. If the user signed up (signup_month is greater than 0), they are considered in the treatment group. Otherwise, they are not in the treatment group.

The dataset simulates a scenario where users can sign up during different months, and their spending behavior is tracked over a year. The treatment group consists of users who signed up, and these users experience a treatment effect in terms of increased spending.

Here's a quick example to help illustrate the dataset:

    User A: user_id = 0, signup_month = 3, month = 1, spend = 490, treatment = True
    User B: user_id = 1, signup_month = 0, month = 1, spend = 510, treatment = False
    User C: user_id = 2, signup_month = 0, month = 2, spend = 490, treatment = False
    User D: user_id = 3, signup_month = 2, month = 3, spend = 570, treatment = True

In this example, User A signed up in month 3 and is in the treatment group, but is observed in month 1, before signup, so the spend does not yet include the uplift. Users B and C never signed up, so they are not in the treatment group. User D signed up in month 2 and is observed in month 3, after signup, so the spend includes the treatment effect (roughly 500 - 30 + 100 = 570). The spending values are influenced by both the treatment effect and the month of observation.

This dataset is used to analyze the causal impact of signing up on user spending behavior, taking into account the treatment effect and other factors.








In [None]:
np.random.seed(42)

num_users = 10000
num_months = 12

# signup_month == 0 means the customer did not sign up; otherwise it is a month in 1..11
signup_months = (
    np.random.choice(np.arange(1, num_months), num_users)
    * np.random.randint(0, 2, size=num_users)
)
df = pd.DataFrame({
    'user_id': np.repeat(np.arange(num_users), num_months),
    'signup_month': np.repeat(signup_months, num_months),
    'month': np.tile(np.arange(1, num_months + 1), num_users),  # months run from 1 to 12
    'spend': np.random.poisson(500, num_users * num_months),  # Poisson with mean 500
})
# A customer is in the treatment group if and only if they signed up
df["treatment"] = df["signup_month"] > 0
# Simulating an effect of month (monotonically decreasing--customers buy less later in the year)
df["spend"] = df["spend"] - df["month"] * 10
# Simulating a simple treatment effect of 100 in every month after the signup month
after_signup = (df["signup_month"] < df["month"]) & (df["treatment"])
df.loc[after_signup, "spend"] += 100
df
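
As a quick sanity check on the recipe described above, the sketch below rebuilds a smaller dataset with the same rules and compares treated and untreated rows within each calendar month. It is an illustration only: the seed, the reduced number of users, and the names `check`, `group`, and `gap` are assumptions that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: a separate seed, not the notebook's
num_users, num_months = 2000, 12

# Same recipe as the notebook: signup month in 1..11, or 0 for "never signed up"
signup = rng.integers(1, num_months, num_users) * rng.integers(0, 2, num_users)
check = pd.DataFrame({
    "user_id": np.repeat(np.arange(num_users), num_months),
    "signup_month": np.repeat(signup, num_months),
    "month": np.tile(np.arange(1, num_months + 1), num_users),
})
check["spend"] = rng.poisson(500, len(check)) - 10 * check["month"]

# Rows observed strictly after the user's signup month receive the +100 uplift
treated_rows = (check["signup_month"] > 0) & (check["signup_month"] < check["month"])
check.loc[treated_rows, "spend"] += 100

# Compare the two groups within the same month so the month trend cancels out
check["group"] = np.where(treated_rows, "after_signup", "other")
by_month = check.groupby(["month", "group"])["spend"].mean().unstack()
gap = (by_month["after_signup"] - by_month["other"]).dropna()
print(gap.round(1))
```

Month 1 drops out because no user can be past their signup month yet. For the remaining months the gap should sit close to the simulated effect of 100.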

## **Causal Graph**
In this cell, we define a causal graph using the DOT language. A causal graph represents the assumed relationships between variables: an arrow from one node to another means the first may directly cause the second.

The graph relates the treatment (signing up in month i), the month of signup, spending before treatment, and spending after treatment. The "Z" node represents an unobserved variable that influences the treatment; in this graph it has no arrow into the outcome.

This graph and each of its variables are explained as follows:

**Scenario:** We have generated synthetic data to simulate user behavior over several months. The data includes information about user sign-up, months, spending, and treatment status (whether they signed up). We are interested in understanding the causal relationship between signing up and post-signup spending behavior, while considering the influence of the month of sign-up.

**Causal Graph:** The causal graph we've defined captures the relationships between the variables in our scenario. Let's break down how the graph applies to our example:



*   **treatment:** This represents whether a user signed up in a specific month. It's connected to "post_spends" to indicate that signing up might influence post-signup spending.
*   **pre_spends:** This represents spending behavior before the user signs up. It's connected to "treatment" because spending behavior before sign-up might influence the decision to sign up.
*   **post_spends:** This represents spending behavior after the user signs up. It's connected to both "treatment" (indicating the treatment effect) and "signup_month" (indicating that the month of sign-up might influence post-signup spending).
*   **Z:** This represents an unobserved (latent) variable. It is connected only to "treatment", meaning that unobserved factors may influence the decision to sign up; the graph assumes they do not affect the outcome directly.
*   **signup_month:** This represents the month in which the user signed up. It's connected to both "post_spends" (indicating the influence on post-signup spending) and "treatment" (indicating that the month of sign-up might influence the decision to sign up).

**Example Scenario from the Graph:**
Let's consider a specific user from our dataset:


*   User ID: 123
*   Signup Month: 2
*   Month: 4
*   Spending Before Signup: $480

*   Spending After Signup: $580

Using the causal graph, we can reason through the connections:

*   The user signed up in month 2 ("treatment" = True).
*   The user's spending behavior before signing up was $480 ("pre_spends" = $480).
*   The user's spending behavior after signing up was $580 ("post_spends" = $580).
*   The month of sign-up was February ("signup_month" = 2).

From this example, we can interpret that the user signed up in February, which potentially influenced their post-signup spending behavior to increase by $100 (treatment effect). The causal graph helps us visually represent and reason about these relationships.








In [None]:
i = 3

causal_graph = """digraph {
treatment[label="Program Signup in month i"];
pre_spends;
post_spends;
Z->treatment;
pre_spends -> treatment;
treatment->post_spends;
signup_month->post_spends;
signup_month->treatment;
}"""
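
To double-check which arrows the DOT string actually encodes, the edges can be pulled out with a small regular expression. This is a minimal sketch using only the standard library; the `parents` dictionary is a hypothetical helper, not part of DoWhy.

```python
import re

causal_graph = """digraph {
treatment[label="Program Signup in month i"];
pre_spends;
post_spends;
Z->treatment;
pre_spends -> treatment;
treatment->post_spends;
signup_month->post_spends;
signup_month->treatment;
}"""

# Every "a -> b" statement in the DOT text is one directed edge
edges = re.findall(r"(\w+)\s*->\s*(\w+)", causal_graph)

# Collect the direct causes (parents) of each node
parents = {}
for source, target in edges:
    parents.setdefault(target, []).append(source)

for node, causes in parents.items():
    print(node, "<-", sorted(causes))
```

The listing shows that "treatment" has three parents (Z, pre_spends, signup_month), while "post_spends" has two (treatment, signup_month).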

## **Preparing Data for causal analysis**
In this cell, we filter and aggregate the data into a new DataFrame that summarizes each user's spending around a given month i (here i = 3).

The resulting DataFrame df_i_signupmonth contains one row per user, restricted to users who either never signed up or signed up in month i. For each user it holds the average spend over the months before i (pre_spends) and over the months after i (post_spends); month i itself is excluded from both averages.

The output shows a subset of the resulting DataFrame. For example, let's consider the first row:

*   User ID: 0
*   Signup Month: 0
*   Pre-spends: 504.5 (average spending before month i)
*   Post-spends: 422.777778 (average spending after month i)

This row provides insights into the spending behavior of a user who didn't sign up (signup_month = 0), showing their average spending before and after month i.

The DataFrame is structured in a way that allows you to analyze the spending behavior of users based on their signup status and the month of signup. This information could be further used for causal analysis, hypothesis testing, or other forms of data-driven investigation.

In [None]:
df_i_signupmonth = (
    df[df.signup_month.isin([0, i])]
    .groupby(["user_id", "signup_month", "treatment"])
    .apply(
        lambda x: pd.Series(
            {
                "pre_spends": x.loc[x.month < i, "spend"].mean(),
                "post_spends": x.loc[x.month > i, "spend"].mean(),
            }
        )
    )
    .reset_index()
)
print(df_i_signupmonth)
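
The pre/post aggregation is easier to follow on a tiny example. The sketch below uses hypothetical toy data and an alternative formulation (two filtered group-bys instead of `apply`); it is meant to show the arithmetic, not to replace the cell above.

```python
import pandas as pd

i = 3  # same reference month as in the notebook

# Hypothetical toy data: two users observed in months 1..5
toy = pd.DataFrame({
    "user_id": [0] * 5 + [1] * 5,
    "month": [1, 2, 3, 4, 5] * 2,
    "spend": [10, 20, 30, 40, 50, 100, 100, 100, 200, 200],
})

# Average spend strictly before and strictly after month i; month i is skipped
pre = toy[toy["month"] < i].groupby("user_id")["spend"].mean().rename("pre_spends")
post = toy[toy["month"] > i].groupby("user_id")["spend"].mean().rename("post_spends")
summary = pd.concat([pre, post], axis=1).reset_index()
print(summary)
```

For user 0, pre_spends is the mean of 10 and 20 (15.0) and post_spends is the mean of 40 and 50 (45.0); the month-3 value of 30 is not used.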

## **Graph Explanation:**
The graph displayed in the output visualizes the causal relationships specified in our causal model. The nodes were described above; the arrows are:

*   Arrows from **pre_spends** and **Z** to **treatment**: Spending behavior before signing up and the latent variable might influence the treatment variable (whether a user signs up).
*   Arrow from **signup_month** to **post_spends**: The month of sign-up might directly influence post-signup spending behavior.
*   Arrow from **signup_month** to **treatment**: This represents the relationship between the month of sign-up and the decision to sign up.
*   Arrow from **treatment** to **post_spends**: This represents the causal effect of signing up on post-signup spending behavior.

The graph visualization helps us visually understand the assumptions and relationships we've encoded in our causal model. It provides insights into the variables that might influence each other and the directions of causality we've specified.

In [None]:
model = dowhy.CausalModel(data=df_i_signupmonth,
                          graph=causal_graph.replace("\n", " "),
                          treatment="treatment",
                          outcome="post_spends")
model.view_model()

## **Estimands**

Estimands in causality refer to the specific quantities or parameters that researchers aim to estimate when conducting causal inference or causal analysis. In other words, an estimand defines what exactly you want to measure or quantify in order to answer a causal question. Estimands play a crucial role in the design of causal studies and the interpretation of their results.

Estimands help clarify the causal question at hand and guide the choice of appropriate methods for estimating causal effects. They typically involve a comparison between different groups, treatments, interventions, or time points, and they define the specific outcome or effect that is of interest.

It's important to define the estimand clearly before conducting any causal analysis, as different choices of estimands can lead to different study designs, analysis methods, and interpretations of results. Additionally, addressing issues like selection bias, confounding, and other sources of bias is crucial for obtaining valid estimates and making accurate causal inferences.

In this cell, we use the identify_effect method of the CausalModel to identify the estimands for the causal effect based on the causal graph we've specified.

**Output Explanation:**
The output displays the identified estimands for causal effects in our scenario. Estimands are represented as mathematical expressions that describe how to compute the causal effect. Each estimand has a specific name and assumptions associated with it.

**Estimand 1 - Backdoor:**
*   **Estimand name:** Backdoor
*   **Estimand expression:** This expression represents the causal effect of treatment (signup) on post_spends by conditioning on signup_month to block the backdoor path.
*   **Assumption 1 - Unconfoundedness:** This assumption states that, once we condition on signup_month, no unobserved confounder (U) affects both treatment and post_spends. If it holds, conditioning on signup_month is sufficient to make the treatment effect identifiable.

**Estimand 2 - Instrumental Variable (IV):**
*   **Estimand name:** IV
*   **Estimand expression:** This expression represents the causal effect of treatment on post_spends using pre_spends and Z as instrumental variables.
*   **Assumption 1 - As-if-random:** This assumption states that the instrumental variables (pre_spends and Z) are unrelated to any unobserved confounders U.
*   **Assumption 2 - Exclusion:** This assumption states that the instrumental variables (pre_spends and Z) affect the outcome only through the treatment variable (treatment), and not directly.

**Estimand 3 - Frontdoor (Not Applicable):**
*   **Estimand name:** Frontdoor
*   **Estimand expression:** This section states "No such variable(s) found!" indicating that the frontdoor estimand is not applicable in our scenario. Frontdoor estimands are used when there's a mediator variable that fully mediates the effect of the treatment on the outcome.

In [None]:
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
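
The idea behind the backdoor estimand can be illustrated on a toy problem. The sketch below is not the notebook's data: it assumes a hypothetical binary confounder `w` that raises both the chance of treatment `t` and the outcome `y`, with a true effect of 2. It then compares the naive difference in means with the backdoor-adjusted estimate, which averages the within-stratum differences weighted by the share of each stratum.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20000

# Hypothetical data-generating process with a single observed confounder w
w = rng.binomial(1, 0.5, n)
t = rng.binomial(1, 0.2 + 0.6 * w)        # w makes treatment more likely
y = 2 * t + 3 * w + rng.normal(size=n)    # true effect of t on y is 2
toy = pd.DataFrame({"w": w, "t": t, "y": y})

# Naive comparison ignores w and is biased upwards
naive = toy.loc[toy["t"] == 1, "y"].mean() - toy.loc[toy["t"] == 0, "y"].mean()

# Backdoor adjustment: difference within each stratum of w, weighted by P(w)
strata = toy.groupby(["w", "t"])["y"].mean().unstack("t")
effect_by_w = strata[1] - strata[0]
weights = toy["w"].value_counts(normalize=True)
adjusted = (effect_by_w * weights).sum()

print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

Under these assumptions the naive estimate lands well above 2, while the adjusted estimate is close to the true effect.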

## **Causal Estimate**
Here we are estimating the causal effect using the identified estimand, the chosen estimation method, and the target units. Let's break down the output and its implications:

**Output Explanation:**
*   **Identified Estimand:** The estimand you previously identified using the identify_effect method is displayed again. It reminds you of the assumptions and the type of estimand you are dealing with.
*   **Realized Estimand:** This section provides information about the specific estimand that was realized based on the chosen estimation method and the target units.
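  *   **b:** **post_spends~treatment+signup_month:** This formula-style notation summarizes the realized estimand: **post_spends** is the outcome, **treatment** is the treatment, and **signup_month** is the backdoor adjustment variable. With propensity score matching, the adjustment variable is used to model the probability of treatment and to match units, rather than as a regressor in an outcome model.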
  *   **Target units: att:** This indicates that the treatment effect being estimated is the Average Treatment Effect on the Treated (ATT). In other words, it's measuring the difference in the outcome (**post_spends**) for treated units (users who signed up) compared to what their outcome would have been if they hadn't signed up.
*   **Estimate:** This section provides the estimated value of the causal effect.
  *   **Mean value:** The estimated ATT is approximately 86.27. This means that, on average, users who signed up experienced an increase of around $86.27 in their post-signup spending compared to what their spending would have been if they hadn't signed up. For reference, the treatment effect built into the simulated data is 100.

In summary, this cell calculates and presents the estimated causal effect of signing up on post-signup spending behavior using propensity score matching, with the treated users as the target units (ATT). The result indicates the average increase in spending due to signing up for the treatment group, based on the causal model and assumptions we've specified.


In [None]:
estimate = model.estimate_effect(identified_estimand,
                                 method_name='backdoor.propensity_score_matching',
                                 target_units='att')
print(estimate)
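
To make the matching step more concrete, here is a minimal sketch of propensity score matching for the ATT, written directly with scikit-learn on hypothetical toy data. It is an illustration of the general technique under stated assumptions (one continuous confounder `x`, a true effect of 100, one nearest neighbor with replacement), not a reproduction of DoWhy's internal implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 4000

# Hypothetical data: x drives both treatment uptake and the outcome
x = rng.normal(size=n)
t = rng.binomial(1, 1 / (1 + np.exp(-x)))
y = 100 * t + 20 * x + rng.normal(scale=5, size=n)

# Step 1: estimate the propensity score P(t = 1 | x)
features = x.reshape(-1, 1)
ps = LogisticRegression().fit(features, t).predict_proba(features)[:, 1]

# Step 2: for each treated unit, find the control unit with the closest score
treated = np.flatnonzero(t == 1)
control = np.flatnonzero(t == 0)
matcher = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
_, idx = matcher.kneighbors(ps[treated].reshape(-1, 1))
matched_control = control[idx.ravel()]

# Step 3: ATT is the mean outcome gap between treated units and their matches
att = np.mean(y[treated] - y[matched_control])
naive = y[treated].mean() - y[control].mean()
print(f"naive: {naive:.1f}, matched ATT: {att:.1f}")
```

In this toy setting the naive difference overstates the effect because treated units tend to have higher `x`, whereas the matched estimate should fall near 100.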

## **Refutation Testing**
Refutation tests, also known as falsification tests, are an essential concept in causal inference and the scientific method. They involve attempting to disprove or refute a causal hypothesis by examining the implications of that hypothesis and comparing them to observed data. The idea is that if the hypothesis cannot be falsified based on the observed data, it gains credibility as a potential explanation for the phenomenon under study.

Refutation tests serve as a critical step in the scientific process for ensuring that causal hypotheses are rigorously tested against empirical evidence. If a hypothesis passes multiple refutation tests and consistently aligns with a variety of observations and experimental results, it gains stronger support as a plausible explanation for the observed phenomenon. However, even a hypothesis that survives multiple tests should remain open to future testing and potential revision as new evidence emerges.

In this cell, we are using the **refute_estimate** method of the **CausalModel** to perform a refutation test on the estimated causal effect. Refutation tests are used to assess the robustness and validity of the estimated causal effect. Let's break down the output and its implications:

**Output Explanation:**
*   **Refutation: Use a Placebo Treatment:** This section indicates that we're using a placebo treatment refuter to assess the estimated causal effect's validity. A placebo treatment refuter replaces the real treatment with a placebo (here, a random permutation of the treatment column) that should not have any true causal effect. This tests whether the analysis correctly reports no effect in a situation where we know there should be none.
*   **Estimated Effect:** The estimated causal effect that we obtained earlier is displayed here (86.27).
*   **New Effect:** This is the effect estimated after introducing the placebo treatment. In this case, the new effect is -3.61.
*   **p-value:** The p-value associated with the refutation test is 0.23. It tests the null hypothesis that the placebo effect is zero. A large p-value (conventionally above 0.05) means the new effect (-3.61) is not statistically distinguishable from zero, which is the outcome we hope for with a placebo.

**Interpretation:**
In this refutation test, we introduced a placebo treatment that should not have any true causal effect. The new effect after introducing the placebo treatment is close to zero (slightly negative), which indicates that the analysis correctly reports a lack of effect in this scenario. Because the p-value of 0.23 is above the usual 0.05 threshold, the placebo effect is consistent with the null hypothesis of no causal effect.

In [None]:
refutation = model.refute_estimate(identified_estimand, estimate, method_name='placebo_treatment_refuter',
                     placebo_type='permute', num_simulations=20)
print(refutation)
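
The logic of the permutation-based placebo test can be sketched in a few lines of NumPy. The example below uses hypothetical toy data and a simple difference in means as the estimator, which is an assumption made for brevity; the notebook itself re-runs propensity score matching inside DoWhy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical randomized data with a true effect of 100
t = rng.integers(0, 2, n)
y = 100 * t + rng.normal(scale=20, size=n)

def diff_in_means(treatment, outcome):
    """Simple effect estimator: mean outcome of treated minus untreated."""
    return outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

original = diff_in_means(t, y)

# Placebo: shuffle the treatment labels so they carry no causal information,
# then re-estimate; repeat 20 times as in the notebook's num_simulations
placebo = np.array([diff_in_means(rng.permutation(t), y) for _ in range(20)])

print(f"original effect: {original:.1f}")
print(f"mean placebo effect: {placebo.mean():.1f}")
```

The original estimate should be near 100, while the placebo estimates scatter around zero. A placebo effect far from zero would suggest that the estimator picks up something other than the treatment.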

## **Conclusion of Causal Inference:**
Now let's tie everything together and interpret the results based on the entire example and the causal graph we've assumed:

**1. Original Estimated Causal Effect:**
*   You estimated the Average Treatment Effect on the Treated (ATT) using the identified estimand and the propensity score matching method.
*   The estimated ATT was approximately 86.27.
*   This suggests that users who signed up experienced an average increase in post-signup spending of around $86.27 compared to what their spending would have been if they hadn't signed up.
*   This estimation was based on the causal graph assumptions you've specified.

**2. Refutation Test - Placebo Treatment:**


*   In the refutation test, you introduced a placebo treatment that should have no true causal effect.
*   The new effect, after introducing the placebo treatment, was approximately -3.61.
*   The p-value associated with the refutation test was 0.23.
*   The p-value indicates that the observed new effect is consistent with the null hypothesis of no causal effect.
*   In other words, the analysis correctly identified that the placebo treatment had no effect, as expected.

**Interpretation and Implications:**


*   The original estimated causal effect suggested a positive impact of signing up on post-signup spending behavior.
*   The refutation test with the placebo treatment supports the robustness of your analysis. It showed that the method reports no effect in a situation where there should be none.
*   The results of the refutation test support the validity of your original estimation. It indicates that your analysis approach is able to distinguish between real causal effects and situations where there should be no effect.
*   This adds confidence to the initial result, suggesting that the observed increase in post-signup spending among users who signed up is indeed likely due to the treatment (signing up).

Overall, based on the causal graph, the original estimate, and the refutation test, we can reasonably conclude that signing up has a positive causal effect on post-signup spending behavior among users. The placebo test provides some assurance that the estimate is not an artifact of the method or of chance; it does not by itself rule out unobserved confounding, which still rests on the assumptions encoded in the graph.

This interpretation is based on the provided information; actual conclusions may vary depending on the specifics of your dataset, the quality of your causal assumptions, and other factors. Causal inference is a complex process that requires careful consideration of the data, assumptions, and analysis methods.

# **Correlation**

**Correlation Analysis**

After the causal analysis, we run a correlation analysis on the same dataset to explore the relationships between variables. However, keep in mind that correlation does not imply causation. Correlations can reveal associations between variables, but they do not tell us which variable, if either, causes the other.

This code calculates the correlation coefficient between the signup_month and post_spends columns and prints the result. The correlation coefficient will be a value between -1 and 1, indicating the strength and direction of the linear relationship between the two variables. However, remember that a correlation doesn't necessarily imply causation, and it's important to interpret the results carefully.

In [None]:
correlation = df_i_signupmonth["signup_month"].corr(df_i_signupmonth["post_spends"])
print("Correlation between signup_month and post_spends:", correlation)
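
For readers who want to see what the coefficient measures, the sketch below computes the Pearson correlation by hand (covariance divided by the product of the standard deviations) and compares it with `Series.corr`. The two short series are hypothetical values chosen for easy arithmetic.

```python
import pandas as pd

# Hypothetical toy values
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Sample covariance and sample standard deviations (both use n - 1)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
r_manual = cov_xy / (x.std() * y.std())

# pandas computes the same Pearson coefficient by default
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

Both values come out at approximately 0.8, a fairly strong positive linear association. Nothing in this calculation says which variable, if either, drives the other.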

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix
correlation_matrix = df_i_signupmonth.corr()

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

## **Interpretation of Correlation Analysis**
Correlation analysis focuses on quantifying the degree of association between two variables. It measures how changes in one variable are related to changes in another variable without necessarily implying a causal relationship. Correlation does not provide information about causality; it only indicates the strength and direction of a linear relationship.

In the context of our example:


*   You can use correlation analysis to measure how closely related **signup_month** and **post_spends** are. For instance, we calculated the Pearson correlation coefficient between these two variables.
*   A positive correlation coefficient indicates that as **signup_month** increases, **post_spends** tends to increase (and vice versa).
*   A negative correlation coefficient indicates an inverse relationship: as **signup_month** increases, **post_spends** tends to decrease.

In conclusion, causal learning and correlation analysis are both valuable tools, but they serve different purposes. **Causal learning is about understanding cause and effect, while correlation analysis is about measuring the strength of association**.

# **Explainable AI**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.inspection import partial_dependence
from sklearn.metrics import mean_squared_error

In [None]:
# Split the data into features (X) and target (y)
X = df_i_signupmonth.drop(columns=["post_spends"])
y = df_i_signupmonth["post_spends"]

In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Fit a linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [None]:
# Calculate feature importances for decision tree
tree_model = DecisionTreeRegressor()
tree_model.fit(X_train, y_train)
feature_importances = tree_model.feature_importances_

In [None]:
# Plot feature importances
plt.figure(figsize=(10, 6))
plt.bar(X_train.columns, feature_importances)
plt.title("Feature Importances (Decision Tree)")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
linear_predictions = linear_model.predict(X_test)
linear_mse = mean_squared_error(y_test, linear_predictions)
print("Linear Regression Mean Squared Error:", linear_mse)