<a href="https://colab.research.google.com/github/platypus2000jp/Colaboratory/blob/main/Gemini%E3%81%A7%E5%9B%A0%E6%9E%9C%E6%8E%A8%E8%AB%96AI%E3%81%AE%E6%A7%8B%E7%AF%89.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Build an AI for causal inference.

## Install the required libraries

### Subtask:
Install the libraries needed for causal inference (e.g., DoWhy, CausalImpact, scikit-learn).


**Reasoning**:
Install the necessary libraries for causal inference as specified in the instructions.



In [2]:
%pip install dowhy causalimpact scikit-learn

Collecting dowhy
  Downloading dowhy-0.13-py3-none-any.whl.metadata (18 kB)
Collecting causalimpact
  Downloading causalimpact-0.2.6-py3-none-any.whl.metadata (3.8 kB)
Collecting causal-learn>=0.1.3.0 (from dowhy)
  Downloading causal_learn-0.1.4.3-py3-none-any.whl.metadata (4.6 kB)
Collecting cvxpy<1.5 (from dowhy)
  Downloading cvxpy-1.4.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.8 kB)
Collecting scipy<=1.15.3 (from dowhy)
  Downloading scipy-1.15.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting momentchi2 (from causal-learn>=0.1.3.0->dowhy)
  Downloading momentchi2-0.1.8-py3-none-any.whl.metadata (6.1 kB)
Collecting ecos>=2 (from cvxpy<1.5->dowhy)
  Downloading ecos-2.0.14-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.0 kB)
Collecting pybind11 (from cvxpy<1.5->dowhy)
  Do
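
The pip log above resolves specific versions (for example dowhy 0.13 and causalimpact 0.2.6), and it is cut off before the final "Successfully installed" line. A minimal sketch for confirming what ended up installed, using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names exactly as passed to pip in the install cell.
version_report = installed_versions(["dowhy", "causalimpact", "scikit-learn"])
for name, version in version_report.items():
    print(f"{name}: {version or 'not installed'}")
```

Recording these versions next to the results makes it easier to reproduce the estimate later, since estimator behavior can differ between releases.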

## Load the data

### Subtask:
Load the data to be analyzed into the Colab environment.


**Reasoning**:
Load the California housing training data into a pandas DataFrame and display the first few rows along with data types and column names to verify successful loading and understand the data structure.



In [3]:
import pandas as pd

df = pd.read_csv('/content/sample_data/california_housing_train.csv')
display(df.head())
df.info()  # info() prints the summary itself and returns None, so it needs no display() wrapper

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
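
The summary above confirms 17000 non-null float64 values in every column. The same check can be made explicit in code so that a later change to the input file fails loudly. This is a minimal sketch, not part of the original notebook: `check_frame` is a hypothetical helper, and the five rows displayed above are inlined so the example runs without the Colab sample file.

```python
import io

import pandas as pd

# First five rows as displayed above, inlined so the sketch runs without the Colab file.
SAMPLE_CSV = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
"""


def check_frame(frame, required_columns):
    """Hypothetical helper: report required columns that are absent, contain nulls, or are non-numeric."""
    present = [c for c in required_columns if c in frame.columns]
    null_counts = frame[present].isna().sum()
    return {
        "missing_columns": [c for c in required_columns if c not in frame.columns],
        "columns_with_nulls": null_counts[null_counts > 0].index.tolist(),
        "non_numeric": [c for c in present if not pd.api.types.is_numeric_dtype(frame[c])],
    }


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
frame_report = check_frame(sample, ["median_income", "median_house_value", "households"])
print(frame_report)
```

On the full dataset, the same call would be `check_frame(df, [...])` with the columns the analysis depends on.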



## Define the causal model

### Subtask:
Identify the treatment (intervention), outcome, and covariates, and define a model such as a causal graph.


**Reasoning**:
Determine the outcome, treatment, and covariates for the causal inference task based on the available columns in the dataframe and consider the causal relationships to define the causal graph structure.



In [4]:
outcome = 'median_house_value'
treatment = 'median_income'
covariates = ['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households']

print(f"Outcome variable: {outcome}")
print(f"Treatment variable: {treatment}")
print(f"Covariates: {covariates}")

# Assumed causal graph structure:
# 'median_income' (treatment) is assumed to influence 'median_house_value' (outcome).
# The covariates 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', and 'households'
# are treated as common causes that may influence both 'median_income' and 'median_house_value'.
# For example, areas with more people and households might differ in both housing characteristics and income levels.
# Note that 'longitude' and 'latitude' are present in the data but are not adjusted for here.
# The simplified causal graph is:
# Covariates -> Treatment
# Covariates -> Outcome
# Treatment -> Outcome

Outcome variable: median_house_value
Treatment variable: median_income
Covariates: ['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households']
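
The simplified structure described in the comments (Covariates -> Treatment, Covariates -> Outcome, Treatment -> Outcome) can be written out as an explicit edge list. The sketch below is illustrative only: `build_edges` and `to_dot` are hypothetical helpers, and it uses plain Python rather than any graph library. DoWhy's `CausalModel` also takes a `graph` argument; whether a DOT string like this one is accepted as-is by the installed version is an assumption worth checking against its documentation.

```python
outcome = "median_house_value"
treatment = "median_income"
covariates = ["housing_median_age", "total_rooms", "total_bedrooms", "population", "households"]


def build_edges(treatment, outcome, covariates):
    """Edges of the simplified graph: every covariate points at both treatment and outcome."""
    edges = [(treatment, outcome)]
    for covariate in covariates:
        edges.append((covariate, treatment))
        edges.append((covariate, outcome))
    return edges


def to_dot(edges):
    """Render the edge list as a Graphviz DOT string."""
    lines = ["digraph {"]
    lines += [f'  "{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


edges = build_edges(treatment, outcome, covariates)
dot_graph = to_dot(edges)
print(dot_graph)

# In this graph the parents of the treatment are exactly the covariates.
# Adjusting for all parents of the treatment blocks every backdoor path,
# provided the graph itself is correct (which is the real assumption here).
parents_of_treatment = sorted(src for src, dst in edges if dst == treatment)
print(parents_of_treatment)
```

Writing the graph down this way makes the modeling assumptions reviewable: any variable that is missing from the edge list is implicitly assumed not to be a common cause.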


## Estimate the causal effect

### Subtask:
Estimate the causal effect of the treatment using an appropriate causal inference method.


**Reasoning**:
Import the CausalModel class from dowhy, instantiate it with the identified treatment, outcome, and covariates, identify the causal effect, estimate the causal effect using linear regression, and display the estimation result.
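
To see why the estimation step adjusts for common causes, it can help to run the same idea on synthetic data where the true effect is known. The sketch below is not the notebook's method and does not use DoWhy: it simulates one confounder with made-up coefficients (all values here are assumptions chosen for illustration) and compares a naive regression with one that includes the confounder, using NumPy least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_effect = 2.0  # assumed ground truth for the simulation

# One confounder drives both the treatment and the outcome.
confounder = rng.normal(size=n)
treatment_values = 1.5 * confounder + rng.normal(size=n)
outcome_values = true_effect * treatment_values + 3.0 * confounder + rng.normal(size=n)


def ols_coefficients(y, *regressors):
    """Ordinary least squares with an intercept; returns [intercept, coef_1, coef_2, ...]."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients


# Naive estimate: regress the outcome on the treatment only.
naive_estimate = ols_coefficients(outcome_values, treatment_values)[1]

# Backdoor-adjusted estimate: include the confounder as a regressor.
adjusted_estimate = ols_coefficients(outcome_values, treatment_values, confounder)[1]

print(f"true effect:       {true_effect:.2f}")
print(f"naive estimate:    {naive_estimate:.2f}")
print(f"adjusted estimate: {adjusted_estimate:.2f}")
```

The naive coefficient absorbs part of the confounder's influence and overstates the effect, while the adjusted coefficient lands close to the simulated value. The same logic only carries over to real data if the adjustment set actually contains the common causes.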



In [5]:
from dowhy import CausalModel

# 1. Setup the causal model
model = CausalModel(data=df,
                    treatment=treatment,
                    outcome=outcome,
                    common_causes=covariates)

# 2. Identify the causal effect
identified_estimand = model.identify_effect()
print(identified_estimand)

# 3. Estimate the causal effect
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.linear_regression")

# 4. Display the estimation result
print(estimate)



Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
       d
────────────────(E[median_house_value|total_bedrooms,housing_median_age,households,population,total_rooms])
d[median_income]
Estimand assumption 1, Unconfoundedness: If U→{median_income} and U→median_house_value then P(median_house_value|median_income,total_bedrooms,housing_median_age,households,population,total_rooms,U) = P(median_house_value|median_income,total_bedrooms,housing_median_age,households,population,total_rooms)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

### Estimand : 4
Estimand name: general_adjustment
Estimand expression:
       d                                 



## Evaluate and interpret the results

### Subtask:
Assess the robustness of the estimated causal effect and interpret the results.


**Reasoning**:
Evaluate the robustness of the estimated causal effect and interpret the results based on the provided instructions.



In [6]:
# 1. Check for statistical significance.
# The 'estimate' object from dowhy often contains information about statistical significance,
# such as p-values or confidence intervals, if the estimation method supports it.
# We need to access these attributes from the estimate object.
# For linear regression, standard errors and p-values are usually available.
print("Estimated Causal Effect Details:")
print(estimate)

# Check if confidence intervals are available
if hasattr(estimate, 'get_confidence_intervals'):
    confidence_intervals = estimate.get_confidence_intervals()
    print(f"\nConfidence Intervals (95%): {confidence_intervals}")

# Check if p-value is available
if hasattr(estimate, 'get_p_value'):
    p_value = estimate.get_p_value()
    print(f"P-value: {p_value}")
    if p_value < 0.05: # Using a common significance level of 0.05
        print("The estimated causal effect is statistically significant at the 0.05 level.")
    else:
        print("The estimated causal effect is not statistically significant at the 0.05 level.")

# 2. Interpret the magnitude of the estimated causal effect.
estimated_effect_value = estimate.value
print(f"\nEstimated Causal Effect (ATE): {estimated_effect_value:.2f}")
print(f"Interpretation: A one-unit increase in 'median_income' (the dataset records income in units of roughly $10,000) is estimated to increase 'median_house_value' by approximately ${estimated_effect_value:.2f}, assuming all other controlled factors remain constant.")

# 3. Briefly touch upon the assumptions of the causal inference method (backdoor criterion with linear regression).
print("\nAssumptions of the Backdoor Criterion with Linear Regression:")
print("- **No unmeasured confounders:** All common causes of 'median_income' and 'median_house_value' are included in the covariates.")
print("- **Ignorability/Conditional Independence:** Treatment assignment ('median_income') is independent of the outcome ('median_house_value') given the observed covariates.")
print("- **Sufficiency of Covariates:** The chosen covariates block all backdoor paths between 'median_income' and 'median_house_value'.")
print("- **Linearity:** The relationship between the outcome and the treatment and covariates is linear.")
print("- **No multicollinearity:** Covariates are not perfectly correlated with each other or the treatment.")
print("Violation of these assumptions could lead to biased estimates.")

# 4. Discuss the validity of the model, including whether covariates were appropriately adjusted.
print("\nModel Validity Discussion:")
print(f"- We adjusted for the following covariates: {covariates}.")
print("- These covariates were chosen because they are plausible common causes of both median income and median house value.")
print("- However, there might be other potential confounding factors not included in the dataset (e.g., neighborhood quality, school district ratings, proximity to amenities) that could affect both income and housing value.")
print("- If important confounders are omitted, the 'no unmeasured confounders' assumption is violated, and the estimated effect could be biased.")
print("- The appropriateness of the linear model also depends on the true relationship between the variables.")

# 5. Summarize the trustworthiness of the estimated causal effect and potential implications.
print("\nSummary of Trustworthiness and Implications:")
print(f"- The estimated causal effect of median income on median house value is approximately ${estimated_effect_value:.2f}.")
if hasattr(estimate, 'get_p_value') and estimate.get_p_value() < 0.05:
    print("- The effect is statistically significant, suggesting it is unlikely due to random chance.")
else:
    print("- Based on the p-value (or lack thereof), we cannot definitively say the effect is statistically significant at the 0.05 level.")
if hasattr(estimate, 'get_confidence_intervals'):
    print(f"- The 95% confidence interval ({confidence_intervals}) gives a range of plausible values for the true causal effect.")
print("- The trustworthiness of this estimate relies heavily on the validity of the assumptions, particularly that all significant confounders have been included and that the relationships are reasonably linear.")
print("- If the assumptions hold, this result suggests that policies or interventions aimed at increasing median income in an area could potentially lead to an increase in median house values.")
print("- However, given the potential for unmeasured confounders and the simplifying assumptions of the linear model, caution is needed in interpreting this as a definitive causal link.")
print("- Further analysis, potentially using different causal inference methods or including more comprehensive data, would be necessary to strengthen the confidence in this finding.")

Estimated Causal Effect Details:
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
       d
────────────────(E[median_house_value|total_bedrooms,housing_median_age,households,population,total_rooms])
d[median_income]
Estimand assumption 1, Unconfoundedness: If U→{median_income} and U→median_house_value then P(median_house_value|median_income,total_bedrooms,housing_median_age,households,population,total_rooms,U) = P(median_house_value|median_income,total_bedrooms,housing_median_age,households,population,total_rooms)

## Realized estimand
b: median_house_value~median_income+total_bedrooms+housing_median_age+households+population+total_rooms
Target units: ate

## Estimate
Mean valu

## Summary:

### Data Analysis Key Findings

*   The necessary libraries for causal inference (`dowhy`, `causalimpact`, `scikit-learn`) were successfully installed.
*   The `california_housing_train.csv` dataset, containing 17000 entries and 9 columns with no missing values, was successfully loaded.
*   `median_house_value` was defined as the outcome variable, `median_income` as the treatment variable, and `housing_median_age`, `total_rooms`, `total_bedrooms`, `population`, and `households` as covariates.
*   A causal model was set up using `dowhy` to estimate the effect of `median_income` on `median_house_value`, controlling for the specified covariates, using the backdoor criterion with linear regression.
*   The estimated Average Treatment Effect (ATE) of `median_income` on `median_house_value` is approximately \$48033.46 per one-unit increase in `median_income`.
*   The 95% confidence interval for the estimated effect is approximately [47297.07, 48769.85].
*   The analysis identified and discussed the key assumptions of the method, including the critical assumption of no unmeasured confounders.
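
The reported point estimate and 95% confidence interval can be cross-checked with a little arithmetic. The sketch below assumes a symmetric interval built from a normal (large-sample) approximation; the library may compute its interval differently, so the implied standard error is an approximation rather than a value taken from the notebook output.

```python
from statistics import NormalDist

# Values reported in the findings above.
ate = 48033.46
ci_lower, ci_upper = 47297.07, 48769.85

# Two-sided 95% critical value under the normal approximation (about 1.96).
z_crit = NormalDist().inv_cdf(0.975)

midpoint = (ci_lower + ci_upper) / 2              # should match the point estimate
std_error = (ci_upper - ci_lower) / (2 * z_crit)  # implied standard error
z_stat = ate / std_error                          # distance from zero in standard errors

print(f"interval midpoint:      {midpoint:.2f}")
print(f"implied standard error: {std_error:.2f}")
print(f"implied z statistic:    {z_stat:.1f}")
print(f"interval excludes zero: {ci_lower > 0}")
```

The midpoint coincides with the reported estimate and the interval lies far from zero, so the estimate is precise in the statistical sense. Precision says nothing about confounding bias, which has to be assessed separately.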

### Insights or Next Steps

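*   The estimated effect is large, and its 95% confidence interval excludes zero, which indicates a strong positive relationship between median income and median house value after controlling for the included covariates. Reading it as causal still depends on the no-unmeasured-confounders assumption.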
*   To increase confidence in the causal claim, future steps should involve sensitivity analysis to assess the impact of potential unmeasured confounders and exploring alternative causal inference methods or models.
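
Two common robustness checks of the kind suggested here are a placebo treatment test and a random common cause test. DoWhy offers refutation through `model.refute_estimate(...)`; the sketch below instead re-implements the two ideas by hand on synthetic data so that it runs without the library. The data-generating coefficients and the helper `adjusted_effect` are assumptions made for illustration, not results from the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Synthetic data with one observed confounder and a true effect of 2.0 (assumed).
confounder = rng.normal(size=n)
treatment_values = 1.5 * confounder + rng.normal(size=n)
outcome_values = 2.0 * treatment_values + 3.0 * confounder + rng.normal(size=n)


def adjusted_effect(treatment_col, outcome_col, *adjustment_cols):
    """Coefficient on the treatment from an OLS fit with an intercept and adjustment columns."""
    design = np.column_stack([np.ones(len(outcome_col)), treatment_col, *adjustment_cols])
    coefficients, *_ = np.linalg.lstsq(design, outcome_col, rcond=None)
    return coefficients[1]


original_effect = adjusted_effect(treatment_values, outcome_values, confounder)

# Placebo treatment: shuffling the treatment breaks any real link to the outcome,
# so a sound estimator should now report an effect close to zero.
placebo_effect = adjusted_effect(rng.permutation(treatment_values), outcome_values, confounder)

# Random common cause: adding an independent noise covariate should barely move the estimate.
noise_covariate = rng.normal(size=n)
effect_with_noise = adjusted_effect(treatment_values, outcome_values, confounder, noise_covariate)

print(f"original estimate:        {original_effect:.3f}")
print(f"placebo treatment:        {placebo_effect:.3f}")
print(f"with random common cause: {effect_with_noise:.3f}")
```

Passing both checks does not prove the estimate is causal. Failing either one, for instance a placebo effect far from zero, is a clear sign that the estimation setup needs another look.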
