# Experiment 4: Estimation of latent variables

The goal of this experiment is to try to recover information about latent variables in a causal graph.

The experiment has the following steps:
1. Define a causal graph (a directed acyclic graph, DAG) relating five hypothetical variables. Some of them are assumed to be unobserved (latent).
1. Define **linear** causal rules between those variables.
1. Generate a data set following the rules.
1. Try to recover the distribution of the latent variables.

## 1. Define a causal graph

Here I define a causal graph whose relations may cause problems if not adequately treated. Nodes in gray will not be visible in the dataset.

In [2]:
import numpy as np
import pandas as pd
import pandas_profiling as pp  # note: this package has since been renamed to ydata_profiling
import plotly_express as px

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

import networkx as nx
from nxpd import draw

# np.random.seed(seed=42) # Test set is not representative
np.random.seed(seed=22)

In [None]:
G = nx.DiGraph()
G.graph['dpi'] = 120
G.add_nodes_from(['X', ('A',{'color':'gray'}), 'B', ('C',{'color':'gray'}), 'Y'])
G.add_edges_from([('A','X')], label='ax')
G.add_edges_from([('A','B')], label='ab')
G.add_edges_from([('C','B')], label='cb')
G.add_edges_from([('C','Y')], label='cy')
G.add_edges_from([('X','Y')], label='xy')
draw(G, show='ipynb')

The problem presented to the ML practitioner will be:

**Can you estimate the value of the latent variables A and C?**

## 2-3. Define causal rules between variables and create data set

Let's define some linear rules to later generate a data set:

In [1]:
n = 10000

In [2]:
sigma_A = 2
mu_A = -2
A = sigma_A * np.random.randn(n,1) + mu_A

sigma_C = 8
mu_C = 3
C = sigma_C * np.random.randn(n,1) + mu_C

In [56]:
B = 5*A - 2*C + np.random.randn(n,1)/10
X = -3*A + np.random.randn(n,1)/10
Y = X + 2*C + np.random.randn(n,1)
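
As a quick sanity check on these rules, the means of the observed variables can be worked out by hand: E[X] = -3·(-2) = 6, E[Y] = 6 + 2·3 = 12 and E[B] = 5·(-2) - 2·3 = -16. The sketch below re-creates the same rules using only the standard library (an assumption made so that it runs without the notebook's dependencies; the seed is arbitrary) and compares sample means with those values.

```python
import random
from statistics import fmean

random.seed(22)  # arbitrary seed, used only for reproducibility
n = 10_000

# Same generating rules as the cells above, written with the standard library.
A = [random.gauss(-2, 2) for _ in range(n)]  # mu_A = -2, sigma_A = 2
C = [random.gauss(3, 8) for _ in range(n)]   # mu_C = 3,  sigma_C = 8
B = [5 * a - 2 * c + random.gauss(0, 0.1) for a, c in zip(A, C)]
X = [-3 * a + random.gauss(0, 0.1) for a in A]
Y = [x + 2 * c + random.gauss(0, 1) for x, c in zip(X, C)]

# Means implied by the linear rules (all noise terms have zero mean).
expected = {"B": -16.0, "X": 6.0, "Y": 12.0}
observed = {"B": fmean(B), "X": fmean(X), "Y": fmean(Y)}

for name in expected:
    print(name, "expected", expected[name], "observed", round(observed[name], 2))
```

The sample means should land close to the hand-computed ones; the first rows shown by `df.head()` are individual draws and can sit far from them.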

In [57]:
df_data = np.concatenate((A,B,C,X,Y), axis=1)
df = pd.DataFrame(data=df_data, columns=['A','B','C','X','Y'])
df.head()

|   | A | B | C | X | Y |
|---|---|---|---|---|---|
| 0 | -0.784381 | -19.459063 | 7.847726 | 2.253039 | 18.95065 |
| 1 | -1.167287 | -25.755908 | 9.970554 | 3.465935 | 21.947194 |
| 2 | -0.061676 | -24.310245 | 11.937561 | 0.230676 | 23.29706 |
| 3 | -0.256623 | -13.950437 | 6.378393 | 0.712639 | 12.985815 |
| 4 | -2.069758 | -17.965845 | 3.748819 | 6.102475 | 13.040906 |


## Prepare training set

In [66]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='B'), df[['B']], test_size=0.3, random_state=42)

In [67]:
y_train = np.squeeze(y_train)
y_test  = np.squeeze(y_test)

## Explore data set

In [61]:
pp.ProfileReport(df, style={'full_width':True})



In [63]:
px.scatter(data_frame=df, x='X', y='Y')

In [64]:
px.scatter(data_frame=df, x='B', y='Y')

## 4. Estimate latent variables

I will set up a system of equations to solve the problem, assuming linear relationships. The unknown (noise) terms that Pearl adds to each equation are not written out below:

B = ab·A + cb·C  
Y = cy·C + xy·X  
X = ax·A

As we do not have data for A and C, we can eliminate them by substituting A = X/ax and C = (Y - xy·X)/cy into the first equation:

B = X · (ab/ax - cb·xy/cy) + Y · cb/cy
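
The elimination can be checked numerically. The sketch below is only an illustration: it draws random nonzero coefficients and latent values (hypothetical test inputs, not the notebook's data), evaluates the three structural equations without noise, and compares B with the reduced expression in X and Y. The helper name `reduced_form_B` is made up for this example.

```python
import random

random.seed(0)  # arbitrary seed


def reduced_form_B(X, Y, ab, cb, ax, cy, xy):
    """B expressed only through the observed X and Y."""
    return X * (ab / ax - cb * xy / cy) + Y * cb / cy


max_err = 0.0
for _ in range(1000):
    # Random coefficients bounded away from zero, with random sign.
    ab, cb, ax, cy, xy = (
        random.choice([-1, 1]) * random.uniform(0.5, 3.0) for _ in range(5)
    )
    A, C = random.gauss(0, 1), random.gauss(0, 1)

    # Structural equations, without noise terms.
    X = ax * A
    Y = cy * C + xy * X
    B = ab * A + cb * C

    max_err = max(max_err, abs(B - reduced_form_B(X, Y, ab, cb, ax, cy, xy)))

print("largest absolute difference:", max_err)
```

The difference should be at the level of floating-point rounding, which supports the algebra above as long as ax and cy are nonzero.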

As in SEMs, we need to fix some parameters to make the system identifiable.  
I already know (from Experiment 1) that:  
xy = 1


I will set:  
cb = 1  
ab = 1

So I only need to solve:  

B = X · (1/ax - 1/cy) + Y · 1/cy


In [75]:
linear_model = LinearRegression(fit_intercept=False)
linear_model.fit(X_train[['X', 'Y']], y_train)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None,
         normalize=False)

In [76]:
linear_model.coef_

array([-0.67143249, -0.99490002])
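
With xy = cb = ab = 1, the coefficient on Y equals 1/cy and the coefficient on X equals 1/ax - 1/cy, so both parameters can be read off by inverting those two relations. A minimal sketch of that arithmetic, using the coefficient values printed above (the variable names `coef_X` and `coef_Y` are introduced here only for illustration):

```python
# Coefficients reported by linear_model.coef_, in the column order ['X', 'Y'].
coef_X, coef_Y = -0.67143249, -0.99490002

# coef_Y = 1/cy            ->  cy = 1 / coef_Y
# coef_X = 1/ax - 1/cy     ->  1/ax = coef_X + coef_Y
cy = 1 / coef_Y
ax = 1 / (coef_X + coef_Y)

print("cy =", round(cy, 3))
print("ax =", round(ax, 3))
```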

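So we can conclude that:

cy ≈ -1  
ax ≈ -0.6

For comparison, the rules used to generate the data were:

B = 5*A - 2*C + np.random.randn(n,1)/10  
X = -3*A + np.random.randn(n,1)/10  
Y = X + 2*C + np.random.randn(n,1)

Because I fixed ab = 1 and cb = 1, the latent variables can only be recovered up to scale: the estimated A corresponds to 5·A and the estimated C to -2·C. In those units, ax = -3/5 = -0.6 and cy = 2/(-2) = -1, which agrees with the regression result.

I STOP HERE. CHECK TO WHAT EXTENT ALL OF THIS MAKES SENSE :P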
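
One possible way to carry the result through to the latent variables themselves is sketched below. It is not part of the original notebook: it re-simulates the data with the standard library, replaces `LinearRegression` with a hand-written two-variable least-squares fit without intercept, and then inverts X = ax·A and Y = cy·C + X. The names `A_hat`, `C_hat` and `pearson` are hypothetical, and the estimates are in the normalised scale implied by fixing ab = cb = 1.

```python
import random
from statistics import fmean

random.seed(22)  # arbitrary seed
n = 10_000

# Simulate data with the notebook's rules; A and C are kept only for the final check.
A = [random.gauss(-2, 2) for _ in range(n)]
C = [random.gauss(3, 8) for _ in range(n)]
B = [5 * a - 2 * c + random.gauss(0, 0.1) for a, c in zip(A, C)]
X = [-3 * a + random.gauss(0, 0.1) for a in A]
Y = [x + 2 * c + random.gauss(0, 1) for x, c in zip(X, C)]

# Least squares for B ~ b_x*X + b_y*Y without intercept (2x2 normal equations).
sxx = sum(x * x for x in X)
syy = sum(y * y for y in Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxb = sum(x * b for x, b in zip(X, B))
syb = sum(y * b for y, b in zip(Y, B))
det = sxx * syy - sxy ** 2
b_x = (sxb * syy - syb * sxy) / det
b_y = (syb * sxx - sxb * sxy) / det

# Parameters under the normalisation xy = ab = cb = 1.
cy = 1 / b_y
ax = 1 / (b_x + b_y)

# Latent estimates in the normalised scale: A_hat tracks 5*A, C_hat tracks -2*C.
A_hat = [x / ax for x in X]
C_hat = [(y - x) / cy for x, y in zip(X, Y)]


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    mu, mv = fmean(u), fmean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


print("ax =", round(ax, 3), "cy =", round(cy, 3))
print("corr(A_hat, A) =", round(pearson(A_hat, A), 4))
print("corr(C_hat, C) =", round(pearson(C_hat, C), 4))
```

The correlation with C is expected to be negative because cb was fixed to 1 while the true coefficient is -2. The noise in Y also slightly attenuates the coefficient on Y, so cy is recovered only approximately.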