# Estimating distributions (part 1)
The goal of this notebook is to explore a first approach to approximating $p(y|x)$ and $p(x|y)$ on a tabular dataset where $x$ is discrete-valued, $x\in\mathbb{D}^k$, and the target $y$ is boolean, $y\in\{0,1\}$.

## Imports

In [None]:
import numpy as np
import pandas as pd

## Load data set

In [None]:
df = pd.read_csv("sample_data/tennis.csv", delimiter=",", header=0)
df

In [None]:
df = df.drop("Day", axis=1)
df

In [None]:
X_names = df.columns.to_list()[:-1]
X_names

In [None]:
X = df.iloc[:, 0:-1]
X

In [None]:
Y_name = df.columns.to_list()[-1]
Y_name

In [None]:
Y = df.iloc[:, -1]
Y

## Build the table of observations
Take $x$ to be the random variable Outlook and count the observations based on the data set.

In [None]:
N = X["Outlook"].size
xvalues = np.unique(X["Outlook"].values).tolist()
yvalues = np.unique(Y.values).tolist()
dimx = len(xvalues)
dimy = len(yvalues)

In [None]:
obs = pd.DataFrame(0, columns=yvalues, index=xvalues)
for i in range(N):
    xi = X["Outlook"][i]
    yi = Y[i]
    # Use a single .loc call; chained indexing (obs[yi][xi]) may write to a copy
    obs.loc[xi, yi] += 1
obs
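
As a cross-check on the loop above, the same contingency table can be sketched with `pd.crosstab`. The `toy` frame below is a small hypothetical stand-in for the notebook's data, not the contents of `tennis.csv`.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration, not tennis.csv)
toy = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Rain", "Overcast", "Rain", "Sunny"],
    "Tennis":  ["No",    "Yes",   "Yes",  "Yes",      "No",   "No"],
})

# Rows are the values of x (Outlook), columns are the values of y (Tennis)
counts = pd.crosstab(toy["Outlook"], toy["Tennis"])
print(counts)
```

Combinations that never occur (here Overcast with No) appear as zero counts, just as in the loop-based table.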

## Approximate the joint distribution $p(x,y)$
Take $x$ to be Outlook and approximate the joint distribution based on the table of observations.

In [None]:
m = obs.sum(axis=1)
m

In [None]:
l = obs.sum(axis=0)
l

In [None]:
obs["m"] = m
obs.loc["l"] = l
obs

In [None]:
joint_proba = pd.DataFrame(0.0, columns=yvalues, index=xvalues)  # float dtype, since probabilities are assigned below
for x in xvalues:
    joint_proba.loc[x] = obs[yvalues].loc[x] / N
joint_proba
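
A minimal sketch of the same estimate, assuming a hypothetical `toy` frame in place of the real data: `pd.crosstab` with `normalize="all"` divides every count by the total number of rows, and the marginals follow by summing rows or columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration, not tennis.csv)
toy = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Rain", "Overcast", "Rain", "Sunny"],
    "Tennis":  ["No",    "Yes",   "Yes",  "Yes",      "No",   "No"],
})

# Empirical joint p(x, y): each cell is count / N
joint = pd.crosstab(toy["Outlook"], toy["Tennis"], normalize="all")

# Marginals p(x) and p(y) are the row and column sums of the joint
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

print(joint)
print("total mass:", joint.to_numpy().sum())
```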

## Approximate $p(y|x)$
Take $x$ to be Outlook and estimate the conditional probability of $y$ given $x$. Then sample 10 values of $y$ given that $x$ equals Sunny.

In [None]:
p_y_x = pd.DataFrame(0.0, columns=yvalues, index=xvalues)  # float dtype, since probabilities are assigned below
for x in xvalues:
    p_y_x.loc[x] = obs[yvalues].loc[x] / obs["m"].loc[x]
p_y_x

In [None]:
np.random.choice(yvalues, size=10, p=p_y_x.loc["Sunny"])
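
The conditional table and the sampling step can also be sketched in a few lines. This assumes a hypothetical `toy` frame and uses NumPy's `default_rng` with a fixed seed so the draw is reproducible; neither is part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration, not tennis.csv)
toy = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Rain", "Overcast", "Rain", "Sunny"],
    "Tennis":  ["No",    "Yes",   "Yes",  "Yes",      "No",   "No"],
})

# normalize="index" divides each row by its row total, giving p(y | x)
p_y_given_x = pd.crosstab(toy["Outlook"], toy["Tennis"], normalize="index")

# Draw 10 values of y conditioned on x = Sunny
rng = np.random.default_rng(0)
samples = rng.choice(
    p_y_given_x.columns.to_numpy(),
    size=10,
    p=p_y_given_x.loc["Sunny"].to_numpy(),
)
print(samples)
```

Each row of `p_y_given_x` sums to 1, which is what `choice` requires of its `p` argument.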

## Approximate $p(x|y)$
Take $x$ to be Outlook and approximate the conditional distribution based on the table of observations. Then sample 10 values of Outlook given that $y$ equals Yes.

In [None]:
p_x_y = pd.DataFrame(0, columns=yvalues, index=xvalues)
for y in yvalues:
    p_x_y[y] = obs[y] / obs[y].loc["l"]
p_x_y

In [None]:
np.random.choice(xvalues, size=10, p=p_x_y["Yes"])
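
The two conditional tables are linked by Bayes' rule, $p(x|y) = p(y|x)\,p(x) / p(y)$. The sketch below checks this numerically on a hypothetical `toy` frame (an assumption for illustration, not the notebook's data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration, not tennis.csv)
toy = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Rain", "Overcast", "Rain", "Sunny"],
    "Tennis":  ["No",    "Yes",   "Yes",  "Yes",      "No",   "No"],
})

joint = pd.crosstab(toy["Outlook"], toy["Tennis"], normalize="all")
p_x = joint.sum(axis=1)  # marginal over Outlook
p_y = joint.sum(axis=0)  # marginal over Tennis

# Direct estimates: divide the joint by the relevant marginal
p_x_given_y = joint.div(p_y, axis=1)  # each column divided by p(y)
p_y_given_x = joint.div(p_x, axis=0)  # each row divided by p(x)

# Bayes' rule: p(x | y) = p(y | x) * p(x) / p(y)
bayes = p_y_given_x.mul(p_x, axis=0).div(p_y, axis=1)
print(np.allclose(bayes, p_x_given_y))
```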

## Approximate $p(y,o,h,w,t)$
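*$p(y,o,h,w,t) = p(y) \cdot p(o|y) \cdot p(h|y,o) \cdot p(w|y,o) \cdot p(t|y,o,h,w)$*

Here $y$ is Tennis, $o$ is Outlook, $h$ is Humidity, $w$ is Wind and $t$ is Temp. Note that the exact chain rule would use $p(w|y,o,h)$ for the fourth factor; writing it as $p(w|y,o)$ assumes that wind is conditionally independent of humidity given $y$ and $o$.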

### $p(y)$

We use `groupby` to select the columns whose joint probability we want. To compute it, we take the size of each group with `size()` and divide by `N`, the number of observations in the data set.

In [None]:
p_y = pd.DataFrame(0, columns=[], index=[])
p_y['P'] = df.groupby(['Tennis']).size() / N
p_y

### $p(o|y)$
*$p(o|y) = p(o,y) / p(y)$* 

To obtain the conditional probability, we rearrange the chain rule: $p(x|y) = p(x,y) / p(y)$

In [None]:
# p(o,y)
joint_o_y = pd.DataFrame(0, columns=[], index=[])
joint_o_y['P'] = df.groupby(['Outlook', 'Tennis']).size() / N
joint_o_y

In [None]:
# p(o|y)
cond_proba_o_y = joint_o_y / p_y
cond_proba_o_y
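
The division above relies on pandas aligning a two-level index with a one-level index. A minimal sketch that makes the alignment explicit uses `Series.div` with the `level` argument; the `toy` frame is hypothetical and only serves as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration, not tennis.csv)
toy = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Rain", "Overcast", "Rain", "Sunny"],
    "Tennis":  ["No",    "Yes",   "Yes",  "Yes",      "No",   "No"],
})
n = len(toy)

p_y = toy.groupby("Tennis").size() / n                     # index: Tennis
joint_o_y = toy.groupby(["Outlook", "Tennis"]).size() / n  # index: (Outlook, Tennis)

# Broadcast p(y) over the Tennis level of the joint: p(o | y) = p(o, y) / p(y)
cond_o_y = joint_o_y.div(p_y, level="Tennis")

# Sanity check: for each value of y, p(o | y) sums to 1 over o
totals = cond_o_y.groupby(level="Tennis").sum()
print(cond_o_y)
print(totals)
```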

### $p(h|y,o)$
*$p(h|y,o) = p(h,y,o) / p(y,o)$* 

In [None]:
# p(h,y,o)
joint_h_y_o = pd.DataFrame(0, columns=[], index=[])
joint_h_y_o['P'] = df.groupby(['Humidity', 'Outlook', 'Tennis']).size() / N

joint_h_y_o

In [None]:
#p(y,o)
joint_y_o = joint_h_y_o.groupby(['Outlook','Tennis']).sum()
joint_y_o

In [None]:
#p(h|y,o) 
cond_proba_h_y_o = joint_h_y_o / joint_y_o
cond_proba_h_y_o

### $p(w|y,o)$
*$p(w|y,o) = p(w,y,o) / p(y,o)$* 

In [None]:
# p(w,y,o)
joint_w_y_o = pd.DataFrame(0, columns=[], index=[])
joint_w_y_o['P'] = df.groupby(['Wind', 'Outlook', 'Tennis']).size() / N

joint_w_y_o

In [None]:
#p(w|y,o) 
cond_proba_w_y_o = joint_w_y_o / joint_y_o
cond_proba_w_y_o

### $p(t|y,o,h,w)$
*$p(t|y,o,h,w) = p(t,y,o,h,w) / p(y,o,h,w)$* 

In [None]:
# p(t,y,o,h,w)
joint_t_y_o_h_w = pd.DataFrame(0, columns=[], index=[])
joint_t_y_o_h_w['P'] = df.groupby(['Temp', 'Outlook', 'Tennis', 'Humidity', 'Wind']).size() / N

joint_t_y_o_h_w

In [None]:
# p(y,o,h,w)
joint_y_o_h_w = joint_t_y_o_h_w.groupby(['Outlook', 'Tennis', 'Humidity', 'Wind']).sum()
joint_y_o_h_w

In [None]:
#p(t|y,o,h,w)
cond_proba_t_y_o_h_w = joint_t_y_o_h_w / joint_y_o_h_w
cond_proba_t_y_o_h_w

### $p(y,o,h,w,t)$

In [None]:
joint_y_o_h_w_t = p_y * cond_proba_o_y * cond_proba_h_y_o * cond_proba_w_y_o * cond_proba_t_y_o_h_w
joint_y_o_h_w_t
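
The product above conditions wind on $(y,o)$ only, whereas the exact chain rule conditions it on $(y,o,h)$. The sketch below compares the two on a hypothetical four-column `toy` frame (Temp is left out to keep it short). It computes every factor from group counts with `transform("sum")`, so all factors share one row index and no index alignment is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration, not tennis.csv)
toy = pd.DataFrame({
    "Tennis":   ["No",    "No",     "No",     "Yes",  "Yes",    "Yes"],
    "Outlook":  ["Sunny", "Sunny",  "Sunny",  "Rain", "Rain",   "Overcast"],
    "Humidity": ["High",  "High",   "Normal", "High", "Normal", "Normal"],
    "Wind":     ["Weak",  "Strong", "Weak",   "Weak", "Weak",   "Strong"],
})

# One row per observed combination, with its count n
cols = ["Tennis", "Outlook", "Humidity", "Wind"]
counts = toy.groupby(cols).size().reset_index(name="n")
total = counts["n"].sum()

def group_total(keys):
    """Count of observations sharing the given key columns, per row of counts."""
    return counts.groupby(keys)["n"].transform("sum")

n_y = group_total(["Tennis"])
n_yo = group_total(["Tennis", "Outlook"])
n_yoh = group_total(["Tennis", "Outlook", "Humidity"])
n_yow = group_total(["Tennis", "Outlook", "Wind"])

empirical = counts["n"] / total

# Exact chain rule: p(y) p(o|y) p(h|y,o) p(w|y,o,h)
exact = (n_y / total) * (n_yo / n_y) * (n_yoh / n_yo) * (counts["n"] / n_yoh)

# Factorization with p(w|y,o) in place of p(w|y,o,h)
approx = (n_y / total) * (n_yo / n_y) * (n_yoh / n_yo) * (n_yow / n_yo)

print(pd.DataFrame({"empirical": empirical, "exact": exact, "approx": approx}))
```

On this toy data the exact factorization reproduces the empirical joint, while the version with $p(w|y,o)$ does not, because wind and humidity are not conditionally independent here.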

## Sampling
Draw 20 samples of $(y,o,h,w,t)$ from $p(y,o,h,w,t)$.

In [None]:
index_list =  [','.join(map(str, item)) for item in joint_y_o_h_w_t.index]
np.random.choice(index_list, size=20, p=joint_y_o_h_w_t['P'].tolist())
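
An alternative sketch, assuming a hypothetical `toy` frame: instead of joining the index tuples into strings, sample integer positions into the joint's index and convert the selected entries back to a table. The probabilities are renormalized first, since `choice` rejects a `p` that does not sum to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration, not tennis.csv)
toy = pd.DataFrame({
    "Tennis":   ["No",    "No",     "No",     "Yes",  "Yes",    "Yes"],
    "Outlook":  ["Sunny", "Sunny",  "Sunny",  "Rain", "Rain",   "Overcast"],
    "Humidity": ["High",  "High",   "Normal", "High", "Normal", "Normal"],
    "Wind":     ["Weak",  "Strong", "Weak",   "Weak", "Weak",   "Strong"],
})

# Empirical joint over all columns, indexed by a MultiIndex of combinations
joint = toy.groupby(list(toy.columns)).size() / len(toy)

# Renormalize to guard against floating-point round-off
p = joint.to_numpy(dtype=float)
p = p / p.sum()

# Sample positions, then look up the corresponding combinations
rng = np.random.default_rng(0)
positions = rng.choice(len(joint), size=20, p=p)
samples = joint.index[positions].to_frame(index=False)
print(samples)
```

Keeping the samples as a DataFrame preserves one column per variable, which is easier to work with than comma-joined strings.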

## Observations

The resulting joint probability $p(y,o,h,w,t)$ is uniform over the combinations of $(y,o,h,w,t)$ that appear in the data set. This is because each combination occurs at most once in the data provided, so every possible case has probability $0$ or $1/N \approx 0.071429$.