# Abstract
Causal inference in machine learning goes beyond identifying patterns and correlations: it aims to determine causal relationships from observational data. We focus on understanding the foundational concepts and methodologies for preparing data for causal analysis. We address common challenges, such as confounding variables and selection bias, and the importance of experimental design principles in observational studies. Through a combination of theoretical insights and practical Python code demonstrations, we will learn how to prepare datasets for causal analysis in a way that supports the integrity and reliability of the findings.

# Using the World Happiness Dataset for Causal Inference

https://www.kaggle.com/datasets/nikbearbrown/the-economics-of-happiness-simple-data-20152019?resource=download

We'll focus on demonstrating how to explore and prepare this specific dataset for causal analysis, highlighting novel approaches and bold strategies.

This dataset presumably contains world happiness scores and related factors from 2015 to 2019, with imputations for missing values. Our objective will be to identify potential causal relationships between the happiness scores (dependent variable) and various predictors (independent variables such as GDP per capita, social support, etc.).

In the social sciences and economics, understanding the determinants of happiness is pivotal. Using the "World Happiness" dataset as a practical example, we aim to investigate which factors may drive happiness across different countries and years.

**Data Preparation: A Novel Approach**

Data preparation involves several steps tailored to enhance causal inference from the World Happiness dataset.


# ***Handling Missing Data with Imputation***

**The dataset has undergone imputation by Prof Nick Brown**; however, understanding the imputation technique and its implications on causal inference is crucial.

# **Approach I: Linear Regression with Simple Imputation**

We apply principles of experimental design, such as stratification and covariate adjustment, to strengthen our causal inference.

**We will fit a linear model with "GDP per Capita" as the independent variable and "Happiness Score" as the dependent variable, including potential confounders in the model.**

***Oh no!***

It appears there are missing values in our dataset that would prevent the linear regression model from being trained.

This is why handling missing values is an essential part of preparing data for causal inference and analysis. Missing data can introduce bias and undermine the validity of our causal conclusions.

Let's address the missing values by imputing them. Imputation replaces missing values with substitute values, such as the mean or median of the non-missing values in the column. We'll perform mean imputation for simplicity, which is suitable for this demonstration but ***may not always be the best choice for every scenario***.
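To see why mean imputation is not always the best choice, the following minimal sketch applies it by hand to a made-up column (the values are illustrative assumptions, not taken from the dataset). It mirrors what a mean-strategy imputer does and shows one known side effect: the imputed column has a smaller spread than the observed values.

```python
import numpy as np

# Toy column with two missing entries (values are made up for illustration)
gdp = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 8.0])

# Mean of the observed (non-missing) values only
observed = gdp[~np.isnan(gdp)]
col_mean = observed.mean()

# Mean imputation: every missing entry receives the same substitute value
gdp_imputed = np.where(np.isnan(gdp), col_mean, gdp)

std_observed = observed.std(ddof=1)
std_imputed = gdp_imputed.std(ddof=1)

print("Column mean used for imputation:", col_mean)
print("Imputed column:", gdp_imputed)
print("Std of observed values:", std_observed)
print("Std after imputation:", std_imputed)
```

Because every imputed entry sits exactly at the mean, the variance shrinks and relationships with other variables can be weakened, which is one reason to treat mean imputation as a convenient baseline only.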


```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer

# Load the dataset
file_path = 'https://raw.githubusercontent.com/nikbearbrown/INFO_7390_Art_and_Science_of_Data/main/CSV/TEH_World_Happiness_2015_2019_Imputed.csv'  # Update this to the actual file path
df = pd.read_csv(file_path)

# Selecting independent and dependent variables
X = df[['GDP per capita', 'Social support', 'Healthy life', 'Freedom']]
y = df['Happiness Score']

# Handling missing values with mean imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Splitting the imputed dataset into training and testing sets
X_train_imputed, X_test_imputed, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)

# Initializing and fitting the linear regression model
model = LinearRegression()
model.fit(X_train_imputed, y_train)

# Predicting on the testing set
y_pred_imputed = model.predict(X_test_imputed)

# Calculating model performance using Mean Squared Error (MSE)
mse_imputed = mean_squared_error(y_test, y_pred_imputed)

# Outputting the MSE, coefficients, and intercept
print("Mean Squared Error:", mse_imputed)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```

```
Mean Squared Error: 0.310344652797569
Coefficients: [1.17004867 0.52218426 1.18506192 1.90623347]
Intercept: 2.216329277565546
```


After addressing the missing values through mean imputation and fitting our linear regression model, we estimated the relationship between "GDP per Capita" and the "Happiness Score" while adjusting for potential confounders such as "Social Support", "Healthy Life", and "Freedom".

Here are the key findings:

- **Mean Squared Error (MSE)**: 0.3103 on the held-out test set, indicating the average squared difference between the predicted and actual values of the "Happiness Score".
- **Model Coefficients**: The coefficients for "GDP per Capita", "Social Support", "Healthy Life", and "Freedom" are 1.1700, 0.5222, 1.1851, and 1.9062, respectively. Each coefficient is the estimated change in the "Happiness Score" associated with a one-unit change in that predictor, holding the other included variables constant. Reading a coefficient as a causal effect additionally requires that no important confounder has been left out of the model.
- **Model Intercept**: 2.2163, representing the predicted "Happiness Score" when all predictor variables are 0.
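The role of the adjustment variables can be illustrated with a small simulation. The sketch below uses an entirely hypothetical data-generating process (the variable names, the effect size of 1.0, and the other constants are assumptions chosen for illustration) in which a confounder drives both the treatment and the outcome. It compares a regression that omits the confounder with one that includes it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process (all numbers are assumptions):
# the confounder influences both the treatment and the outcome.
confounder = rng.normal(size=n)
treatment = 0.8 * confounder + rng.normal(size=n)
true_effect = 1.0
outcome = true_effect * treatment + 2.0 * confounder + rng.normal(size=n)


def ols(columns, y):
    """Least-squares coefficients; position 0 is the intercept."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Omitting the confounder attributes part of its influence to the treatment
naive_estimate = ols([treatment], outcome)[1]

# Covariate adjustment: include the confounder as an additional regressor
adjusted_estimate = ols([treatment, confounder], outcome)[1]

print("True effect:        ", true_effect)
print("Naive estimate:     ", round(naive_estimate, 3))
print("Adjusted estimate:  ", round(adjusted_estimate, 3))
```

In this simulated setting the naive estimate is pulled well away from the true effect, while the adjusted estimate lands close to it. With real data the true effect is unknown, so the adjustment is only as good as the set of confounders that has been measured.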

These results highlight the importance of controlling for confounders in causal analysis.

- Each of the included variables contributes to explaining variation in the "Happiness Score", underscoring the complexity of causal relationships in observational data.
- Addressing selection bias and applying experimental design principles in observational studies, such as stratification and covariate adjustment, further strengthens the reliability of causal inferences.

**Identifying and Controlling for Time-Varying Confounders**

Given the panel structure of the dataset (repeated observations over time), we must also consider time-varying confounders: variables that change over time and can affect both the treatment and the outcome. We introduce Fixed Effects models, which control for unobserved heterogeneity that is constant within a country (country effects) and for shocks common to all countries in a given year (year effects). Confounders that vary within a country over time are not absorbed by these effects and still need to be measured and included as covariates.
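What country fixed effects buy us in a panel can be sketched with the "within" transformation, which subtracts each country's own mean from its observations. The example below is a minimal simulation under assumed values (50 hypothetical countries, 5 years, a true slope of 0.5); it is not run on the World Happiness data. The unobserved country effect is constant over time and correlated with the regressor, which is the situation fixed effects are designed for.

```python
import numpy as np

rng = np.random.default_rng(1)
n_countries, n_years = 50, 5

# Panel index: each hypothetical country is observed in every year
country = np.repeat(np.arange(n_countries), n_years)

# Unobserved, time-invariant country effect (assumed for illustration)
country_effect = rng.normal(scale=2.0, size=n_countries)

# The regressor is correlated with the unobserved country effect
x = country_effect[country] + rng.normal(size=country.size)
true_slope = 0.5
y = true_slope * x + country_effect[country] + rng.normal(scale=0.5, size=country.size)


def group_demean(values, groups):
    """Subtract each group's mean from its members (within transformation)."""
    sums = np.bincount(groups, weights=values)
    counts = np.bincount(groups)
    return values - (sums / counts)[groups]


# Pooled regression ignores the country effect
x_c, y_c = x - x.mean(), y - y.mean()
pooled_slope = (x_c @ y_c) / (x_c @ x_c)

# Within estimator: regress country-demeaned y on country-demeaned x
x_w, y_w = group_demean(x, country), group_demean(y, country)
within_slope = (x_w @ y_w) / (x_w @ x_w)

print("True slope:  ", true_slope)
print("Pooled slope:", round(pooled_slope, 3))
print("Within slope:", round(within_slope, 3))
```

Demeaning removes anything that is constant within a country, so the within slope is close to the assumed true value while the pooled slope is not. The same demeaning would not remove a confounder that changes from year to year within a country.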

# **Approach II: OLS Regression with Detailed Data Preparation**


### Data Loading and Preprocessing
To account for the categorical nature of 'Country' and 'Year', these columns are converted to categorical data types, which helps in treating them as fixed effects in the subsequent analysis.

### Handling Categorical Variables
Dummy variables are generated for both 'Country' and 'Year' to incorporate these as fixed effects in the model. This is crucial for causal inference as it controls for unobserved heterogeneity across countries and years that might influence the happiness scores. The first category is dropped for each to avoid perfect multicollinearity (the dummy variable trap): with an intercept in the model, a full set of dummies sums to the intercept column, so the coefficients could not be uniquely estimated.
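The effect of dropping the first category can be checked on a toy example. The sketch below builds dummies for three made-up countries with and without `drop_first=True`, adds an intercept column, and compares the number of columns with the rank of each design matrix.

```python
import numpy as np
import pandas as pd

# Toy panel with three made-up countries
toy = pd.DataFrame({"Country": ["A", "A", "B", "B", "C", "C"]})

# Full set of dummies versus dropping the first category
full = pd.get_dummies(toy["Country"], dtype=float)
reduced = pd.get_dummies(toy["Country"], drop_first=True, dtype=float)

intercept = np.ones((len(toy), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("Full dummies:    columns =", X_full.shape[1], "rank =", rank_full)
print("drop_first=True: columns =", X_reduced.shape[1], "rank =", rank_reduced)
```

With the full set, the matrix has more columns than its rank, so one column is redundant. After dropping the first category, the rank equals the number of columns, and each remaining dummy coefficient is read relative to the dropped (reference) country.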

### Feature Engineering
The model includes 'GDP per capita' and 'Social support' as continuous predictor variables, reflecting the economic and social factors' contributions to happiness. The inclusion of country and year dummies allows the model to adjust for the fixed effects of these categories, aiming to isolate the impact of the GDP per capita and social support on the happiness score.

### Model Specification
An intercept term is added explicitly to the model because `statsmodels`' `OLS` does not include one by default. It accounts for the baseline level of happiness for the reference country and year (the dropped dummy categories) when the continuous predictors are zero.

### Missing Values Handling
The code explicitly replaces infinite values with NaN and removes any rows with NaN values from both the predictor and outcome variables. This step ensures that the model does not fail due to undefined or missing values. Note that dropping rows (listwise deletion) can itself bias the estimates if the values are not missing completely at random, so the number of dropped rows is worth checking.
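One way to keep predictors and outcome aligned while cleaning is to handle them in a single table. The following sketch uses a small made-up frame (the column names follow the dataset, the values are illustrative assumptions) and reports how many rows are lost.

```python
import numpy as np
import pandas as pd

# Made-up predictors and outcome containing an infinity and two missing values
X_toy = pd.DataFrame({
    "GDP per capita": [1.0, np.inf, 0.8, 1.2],
    "Social support": [1.1, 0.9, np.nan, 1.3],
})
y_toy = pd.Series([6.5, 5.0, 4.8, np.nan], name="Happiness Score")

# Combine first so that a row is dropped from X and y together
combined = pd.concat([X_toy, y_toy], axis=1).replace([np.inf, -np.inf], np.nan)
clean = combined.dropna()

X_clean = clean[X_toy.columns]
y_clean = clean[y_toy.name]

rows_dropped = len(combined) - len(clean)
print("Rows dropped:", rows_dropped)
print("Remaining index:", list(clean.index))
```

Cleaning the combined table guarantees that `X_clean` and `y_clean` share the same index, and the dropped-row count gives a quick sense of how much data the deletion costs.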

### Model Fitting and Summary
The `statsmodels` library's OLS (Ordinary Least Squares) method is used to fit the model. This statistical technique is employed to estimate the unknown parameters in a linear regression model, aiming to minimize the sum of squared differences between observed and predicted values. The summary of the model provides detailed output, including coefficients for each predictor, indicating their impact on the Happiness Score. It also includes statistical tests and metrics that assess the model's overall fit and the significance of individual predictors.
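The main quantities in an OLS summary can be reproduced by hand, which helps when reading the output. The sketch below uses simulated data (the coefficients 2.0, 1.5, and -0.7 are assumptions) and plain NumPy rather than `statsmodels`, computing the coefficients as $\hat\beta = (X^\top X)^{-1} X^\top y$, their standard errors from $\hat\sigma^2 (X^\top X)^{-1}$, the t-statistics, and $R^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Simulated predictors and outcome (true coefficients are assumptions)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(size=n)

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones(n), x1, x2])

# Coefficients: (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

# Residual variance uses n minus the number of estimated parameters
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = (residuals @ residuals) / dof

# Standard errors, t-statistics, and R-squared
std_err = np.sqrt(sigma2 * np.diag(XtX_inv))
t_stats = beta / std_err
total_ss = (y - y.mean()) @ (y - y.mean())
r_squared = 1 - (residuals @ residuals) / total_ss

for name, b, se, t in zip(["intercept", "x1", "x2"], beta, std_err, t_stats):
    print(f"{name:>9}: coef={b:.3f}  std err={se:.3f}  t={t:.2f}")
print("R-squared:", round(r_squared, 3))
```

These correspond to the `coef`, `std err`, `t`, and `R-squared` entries of a nonrobust OLS summary. A high R-squared in a model with many country dummies mostly reflects the dummies absorbing between-country differences, so it should not be read as evidence of a causal effect on its own.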

By controlling for country and year effects and focusing on economic and social predictors, this approach attempts to uncover the causal relationships underlying the data. This methodology exemplifies how to prepare and analyze data for causal inference in machine learning, emphasizing the importance of careful data handling, model specification, and interpretation of results in uncovering causal relationships from observational data.


```python
import statsmodels.api as sm
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/nikbearbrown/INFO_7390_Art_and_Science_of_Data/main/CSV/TEH_World_Happiness_2015_2019_Imputed.csv')

# Ensure 'Country' and 'Year' are treated as categorical variables
df['Country'] = df['Country'].astype('category')
df['Year'] = df['Year'].astype('category')

# Generate dummy variables for 'Country' and 'Year', dropping the first to avoid multicollinearity.
# dtype=float keeps the design matrix numeric (recent pandas versions return booleans by default).
country_dummies = pd.get_dummies(df['Country'], drop_first=True, dtype=float)
year_dummies = pd.get_dummies(df['Year'], drop_first=True, dtype=float)

# Combine all relevant features along with the country and year dummies
X = pd.concat([df[['GDP per capita', 'Social support']], country_dummies, year_dummies], axis=1)

# Adding an intercept term for the OLS model
X['intercept'] = 1.0

Y = df['Happiness Score']

# Replace infinities with NaN, then drop any rows with NaN values
X = X.replace([np.inf, -np.inf], np.nan).dropna()
Y = Y.replace([np.inf, -np.inf], np.nan).dropna()

# Keep only the rows that survive in both X and Y so the two stay aligned
common_index = X.index.intersection(Y.index)
X, Y = X.loc[common_index], Y.loc[common_index]

# Now, fit the model with the cleaned and aligned data
fe_model = sm.OLS(Y, X).fit()

print(fe_model.summary())
```

```
                            OLS Regression Results                            
Dep. Variable:        Happiness Score   R-squared:                       0.967
Model:                            OLS   Adj. R-squared:                  0.957
Method:                 Least Squares   F-statistic:                     99.48
Date:                Sat, 23 Mar 2024   Prob (F-statistic):               0.00
Time:                        23:26:32   Log-Likelihood:                 130.79
No. Observations:                 773   AIC:                             88.41
Df Residuals:                     598   BIC:                             902.2
Df Model:                         174                                         
Covariance Type:            nonrobust                                         
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
GDP per capita          
```

### Key Differences

- **Imputation Method:** The first approach uses simple mean imputation, while the second drops rows with missing values (listwise deletion). Neither employs a more advanced technique such as scikit-learn's `IterativeImputer`.
- **Categorical Variable Handling:** The second approach more thoroughly accounts for categorical variables through the creation of dummy variables, allowing for a more nuanced control of fixed effects in the analysis.
- **Statistical Detail:** The use of `statsmodels.api.OLS` in the second approach offers a deeper statistical insight into the model, including the significance of individual predictors and model diagnostics, which is particularly valuable for causal inference and understanding the dynamics between variables.

Each approach has its strengths and is suited to different stages of data analysis and preparation. The choice between them would depend on the specific goals of the analysis, the nature of the data, and the level of detail required for the causal inference being undertaken.


