# Estimating distributions (part 1)
The goal of this notebook is to explore a first approach to approximating $p(y|x)$ and $p(x|y)$ on a tabular dataset where $x$ is discrete-valued, $x\in\mathbb{D}^k$, and the target $y$ is boolean, $y\in\{0,1\}$.

## Imports

In [1]:
import numpy as np
import pandas as pd

## Load data set

In [2]:
df = pd.read_csv("sample_data/tennis.csv", delimiter=",", header=0)
df

Unnamed: 0,Day,Outlook,Temp,Humidity,Wind,Tennis
0,D1,Sunny,Hot,High,Weak,No
1,D2,Sunny,Hot,High,Strong,No
2,D3,Overcast,Hot,High,Weak,Yes
3,D4,Rain,Mild,High,Weak,Yes
4,D5,Rain,Cool,Normal,Weak,Yes
5,D6,Rain,Cool,Normal,Strong,No
6,D7,Overcast,Cool,Normal,Strong,Yes
7,D8,Sunny,Mild,High,Weak,No
8,D9,Sunny,Cool,Normal,Weak,Yes
9,D10,Rain,Mild,Normal,Weak,Yes


In [3]:
df = df.drop("Day", axis=1)
df

Unnamed: 0,Outlook,Temp,Humidity,Wind,Tennis
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rain,Mild,Normal,Weak,Yes


In [4]:
X_names = df.columns.to_list()[:-1]
X_names

['Outlook', 'Temp', 'Humidity', 'Wind']

In [5]:
X = df.iloc[:, 0:-1]
X

Unnamed: 0,Outlook,Temp,Humidity,Wind
0,Sunny,Hot,High,Weak
1,Sunny,Hot,High,Strong
2,Overcast,Hot,High,Weak
3,Rain,Mild,High,Weak
4,Rain,Cool,Normal,Weak
5,Rain,Cool,Normal,Strong
6,Overcast,Cool,Normal,Strong
7,Sunny,Mild,High,Weak
8,Sunny,Cool,Normal,Weak
9,Rain,Mild,Normal,Weak


In [6]:
Y_name = df.columns.to_list()[-1]
Y_name

'Tennis'

In [7]:
Y = df.iloc[:, -1]
Y

0      No
1      No
2     Yes
3     Yes
4     Yes
5      No
6     Yes
7      No
8     Yes
9     Yes
10    Yes
11    Yes
12    Yes
13     No
Name: Tennis, dtype: object

## Build the table of observations
Take $x$ to be the random variable Outlook and count the observations based on the data set.

In [8]:
N = X["Outlook"].size
xvalues = np.unique(X["Outlook"].values).tolist()
yvalues = np.unique(Y.values).tolist()
dimx = len(xvalues)
dimy = len(yvalues)

In [9]:
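# Start from an all-zero table: one row per Outlook value, one column per label
obs = pd.DataFrame(0, columns=yvalues, index=xvalues)
for i in range(N):
    xi = X["Outlook"].iloc[i]
    yi = Y.iloc[i]
    # Use .loc rather than chained indexing so the update is applied to obs itself
    obs.loc[xi, yi] += 1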
obs

Unnamed: 0,No,Yes
Overcast,0,4
Rain,2,3
Sunny,3,2
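
The counting loop above can also be expressed with `pd.crosstab`. The following is a minimal sketch, not the notebook's own method. The `toy` frame is a stand-in rebuilt from the counts in the table above (it is not read from `tennis.csv`), and the names `toy` and `counts` are illustrative.

```python
import pandas as pd

# Stand-in data rebuilt from the observation table above
# (Overcast: 0 No / 4 Yes, Rain: 2 No / 3 Yes, Sunny: 3 No / 2 Yes).
toy = pd.DataFrame({
    "Outlook": ["Overcast"] * 4 + ["Rain"] * 5 + ["Sunny"] * 5,
    "Tennis": ["Yes"] * 4 + ["No"] * 2 + ["Yes"] * 3 + ["No"] * 3 + ["Yes"] * 2,
})

# Cross-tabulate: one row per Outlook value, one column per Tennis value.
# Combinations that never occur (Overcast, No) are filled with 0.
counts = pd.crosstab(toy["Outlook"], toy["Tennis"])
print(counts)
```

Unlike the explicit loop, `pd.crosstab` needs no pre-allocated table and no per-row indexing.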


## Approximate the joint distribution $p(x,y)$
Take $x$ to be Outlook and approximate the joint distribution based on the table of observations.

In [10]:
m = obs.sum(axis=1)
m

Overcast    4
Rain        5
Sunny       5
dtype: int64

In [11]:
l = obs.sum(axis=0)
l

No     5
Yes    9
dtype: int64

In [12]:
obs["m"] = m
obs.loc["l"] = l
obs

Unnamed: 0,No,Yes,m
Overcast,0.0,4.0,4.0
Rain,2.0,3.0,5.0
Sunny,3.0,2.0,5.0
l,5.0,9.0,


In [13]:
# Initialise with floats so the probabilities can be stored without a dtype change
joint_proba = pd.DataFrame(0.0, columns=yvalues, index=xvalues)
for x in xvalues:
    joint_proba.loc[x] = obs[yvalues].loc[x] / N
joint_proba

Unnamed: 0,No,Yes
Overcast,0.0,0.285714
Rain,0.142857,0.214286
Sunny,0.214286,0.142857
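
As a quick check on the estimate, the joint table should sum to 1, and summing it along each axis should recover the marginals. A minimal sketch, assuming the counts from the observation table are typed in by hand (the names `counts`, `joint`, `p_x` and `p_y` are illustrative):

```python
import pandas as pd

# Counts copied from the observation table (rows: Outlook, columns: Tennis).
counts = pd.DataFrame(
    {"No": [0, 2, 3], "Yes": [4, 3, 2]},
    index=["Overcast", "Rain", "Sunny"],
)

n_obs = counts.to_numpy().sum()   # total number of observations
joint = counts / n_obs            # p(x, y) = count(x, y) / N

p_x = joint.sum(axis=1)           # marginal p(x): sum over y
p_y = joint.sum(axis=0)           # marginal p(y): sum over x

print(joint)
print("total mass:", joint.to_numpy().sum())
```

Dividing the whole frame by `n_obs` replaces the per-row loop with a single vectorised operation.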


## Approximate $p(y|x)$
Take $x$ to be Outlook and estimate the conditional probability of $y$ given $x$. Then, sample 10 values of $y$ given $x$ equal to Sunny.

In [14]:
# Initialise with floats so the probabilities can be stored without a dtype change
p_y_x = pd.DataFrame(0.0, columns=yvalues, index=xvalues)
for x in xvalues:
    p_y_x.loc[x] = obs[yvalues].loc[x] / obs["m"].loc[x]
p_y_x

Unnamed: 0,No,Yes
Overcast,0.0,1.0
Rain,0.4,0.6
Sunny,0.6,0.4


In [15]:
np.random.choice(yvalues, size=10, p=p_y_x.loc["Sunny"])

array(['No', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'No'],
      dtype='<U3')
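
The same conditional table can be obtained by dividing each row of the counts by its row total, and the sampling step can be made reproducible with a seeded generator. This is a sketch under stated assumptions: the counts are typed in by hand from the observation table, and the seed `0` is arbitrary.

```python
import numpy as np
import pandas as pd

# Counts copied from the observation table (rows: Outlook, columns: Tennis).
counts = pd.DataFrame(
    {"No": [0, 2, 3], "Yes": [4, 3, 2]},
    index=["Overcast", "Rain", "Sunny"],
)

# p(y | x) = count(x, y) / count(x): divide every row by its own total.
p_y_given_x = counts.div(counts.sum(axis=1), axis=0)

# Seeded generator so that repeated runs draw the same samples.
rng = np.random.default_rng(0)
samples = rng.choice(
    p_y_given_x.columns.to_numpy(),
    size=10,
    p=p_y_given_x.loc["Sunny"].to_numpy(),
)
print(p_y_given_x)
print(samples)
```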

## Approximate $p(x|y)$
Take $x$ to be Outlook and approximate the conditional distribution based on the table of observations. Then, sample 10 values of Outlook for $y$ equal to Yes.

In [16]:
p_x_y = pd.DataFrame(0, columns=yvalues, index=xvalues)
for y in yvalues:
    p_x_y[y] = obs[y] / obs[y].loc["l"]
p_x_y

Unnamed: 0,No,Yes
Overcast,0.0,0.444444
Rain,0.4,0.333333
Sunny,0.6,0.222222


In [17]:
np.random.choice(xvalues, size=10, p=p_x_y["Yes"])

array(['Sunny', 'Overcast', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain',
       'Overcast', 'Overcast', 'Overcast'], dtype='<U8')
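
The two conditional tables are linked by Bayes' rule, $p(x|y) = p(y|x)\,p(x) / p(y)$. The sketch below checks this numerically. It assumes the counts are typed in by hand from the observation table, and all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Counts copied from the observation table (rows: Outlook, columns: Tennis).
counts = pd.DataFrame(
    {"No": [0, 2, 3], "Yes": [4, 3, 2]},
    index=["Overcast", "Rain", "Sunny"],
)
n_obs = counts.to_numpy().sum()

p_x = counts.sum(axis=1) / n_obs                       # marginal p(x)
p_y = counts.sum(axis=0) / n_obs                       # marginal p(y)
p_y_given_x = counts.div(counts.sum(axis=1), axis=0)   # rows sum to 1

# Direct estimate: divide every column by its own total.
p_x_given_y = counts.div(counts.sum(axis=0), axis=1)   # columns sum to 1

# Bayes' rule: scale each row by p(x), then each column by 1 / p(y).
bayes = p_y_given_x.mul(p_x, axis=0).div(p_y, axis=1)

print(np.allclose(p_x_given_y.to_numpy(), bayes.to_numpy()))
```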

## Approximate $p(y,o,h,w,t)$
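Here $y$ is Tennis, $o$ is Outlook, $h$ is Humidity, $w$ is Wind, and $t$ is Temp. The joint distribution is factored as

*$p(y,o,h,w,t) = p(y) \cdot p(o|y) \cdot p(h|y,o) \cdot p(w|y,o) \cdot p(t|y,o,h,w)$*

The exact chain rule would use $p(w|y,o,h)$ as the fourth factor; using $p(w|y,o)$ instead assumes that Wind is conditionally independent of Humidity given $y$ and $o$.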

### $p(y)$

We use the `groupby` function to select the columns whose joint probability we want. To compute it, we take the group sizes with `size()` and divide them by the number of observations in the data set, N.

In [18]:
p_y = pd.DataFrame(0, columns=[], index=[])
p_y['P'] = df.groupby(['Tennis']).size() / N
p_y

Tennis,P
No,0.357143
Yes,0.642857


### $p(o|y)$
*$p(o|y) = p(o,y) / p(y)$* 

The conditional probability follows from the product rule $p(x,y) = p(x|y)\,p(y)$, rearranged as $p(x|y) = p(x,y) / p(y)$.

In [19]:
# p(o,y)
joint_o_y = pd.DataFrame(0, columns=[], index=[])
joint_o_y['P'] = df.groupby(['Outlook', 'Tennis']).size() / N
joint_o_y

Outlook,Tennis,P
Overcast,Yes,0.285714
Rain,No,0.142857
Rain,Yes,0.214286
Sunny,No,0.214286
Sunny,Yes,0.142857


In [20]:
# p(o|y)
cond_proba_o_y = joint_o_y / p_y
cond_proba_o_y

Outlook,Tennis,P
Overcast,Yes,0.444444
Rain,No,0.4
Rain,Yes,0.333333
Sunny,No,0.6
Sunny,Yes,0.222222
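
The division `joint_o_y / p_y` relies on pandas matching the `Tennis` index of `p_y` against the `Tennis` level of the joint table. A sketch that makes this alignment explicit with `Series.div(..., level=...)` follows. It is an alternative formulation, not the notebook's code, and it uses a stand-in `toy` frame rebuilt from the Outlook/Tennis counts rather than `tennis.csv`.

```python
import pandas as pd

# Stand-in data rebuilt from the Outlook/Tennis counts shown earlier.
toy = pd.DataFrame({
    "Outlook": ["Overcast"] * 4 + ["Rain"] * 5 + ["Sunny"] * 5,
    "Tennis": ["Yes"] * 4 + ["No"] * 2 + ["Yes"] * 3 + ["No"] * 3 + ["Yes"] * 2,
})
n_obs = len(toy)

# p(o, y): a Series indexed by the (Outlook, Tennis) pairs that occur.
joint_o_y = toy.groupby(["Outlook", "Tennis"]).size() / n_obs

# p(y): a Series indexed by Tennis only.
p_y = toy.groupby("Tennis").size() / n_obs

# p(o | y): broadcast p(y) across the Tennis level of the MultiIndex.
cond_o_y = joint_o_y.div(p_y, level="Tennis")
print(cond_o_y)
```

Within each value of Tennis, the resulting probabilities sum to 1, which is a convenient sanity check for any conditional table.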


### $p(h|y,o)$
*$p(h|y,o) = p(h,y,o) / p(y,o)$* 

In [21]:
# p(h,y,o)
joint_h_y_o = pd.DataFrame(0, columns=[], index=[])
joint_h_y_o['P'] = df.groupby(['Humidity', 'Outlook', 'Tennis']).size() / N

joint_h_y_o

Humidity,Outlook,Tennis,P
High,Overcast,Yes,0.142857
High,Rain,No,0.071429
High,Rain,Yes,0.071429
High,Sunny,No,0.214286
Normal,Overcast,Yes,0.142857
Normal,Rain,No,0.071429
Normal,Rain,Yes,0.142857
Normal,Sunny,Yes,0.142857


In [22]:
#p(y,o)
joint_y_o = joint_h_y_o.groupby(['Outlook','Tennis']).sum()
joint_y_o

Outlook,Tennis,P
Overcast,Yes,0.285714
Rain,No,0.142857
Rain,Yes,0.214286
Sunny,No,0.214286
Sunny,Yes,0.142857


In [23]:
#p(h|y,o) 
cond_proba_h_y_o = joint_h_y_o / joint_y_o
cond_proba_h_y_o

Outlook,Tennis,Humidity,P
Overcast,Yes,High,0.5
Overcast,Yes,Normal,0.5
Rain,No,High,0.5
Rain,No,Normal,0.5
Rain,Yes,High,0.333333
Rain,Yes,Normal,0.666667
Sunny,No,High,1.0
Sunny,Yes,Normal,1.0


### $p(w|y,o)$
*$p(w|y,o) = p(w,y,o) / p(y,o)$* 

In [24]:
# p(w,y,o)
joint_w_y_o = pd.DataFrame(0, columns=[], index=[])
joint_w_y_o['P'] = df.groupby(['Wind', 'Outlook', 'Tennis']).size() / N

joint_w_y_o

Wind,Outlook,Tennis,P
Strong,Overcast,Yes,0.142857
Strong,Rain,No,0.142857
Strong,Sunny,No,0.071429
Strong,Sunny,Yes,0.071429
Weak,Overcast,Yes,0.142857
Weak,Rain,Yes,0.214286
Weak,Sunny,No,0.142857
Weak,Sunny,Yes,0.071429


In [25]:
#p(w|y,o) 
cond_proba_w_y_o = joint_w_y_o / joint_y_o
cond_proba_w_y_o

Outlook,Tennis,Wind,P
Overcast,Yes,Strong,0.5
Overcast,Yes,Weak,0.5
Rain,No,Strong,1.0
Rain,Yes,Weak,1.0
Sunny,No,Strong,0.333333
Sunny,No,Weak,0.666667
Sunny,Yes,Strong,0.5
Sunny,Yes,Weak,0.5


### $p(t|y,o,h,w)$
*$p(t|y,o,h,w) = p(t,y,o,h,w) / p(y,o,h,w)$* 

In [26]:
# p(t,y,o,h,w)
joint_t_y_o_h_w = pd.DataFrame(0, columns=[], index=[])
joint_t_y_o_h_w['P'] = df.groupby(['Temp', 'Outlook', 'Tennis', 'Humidity', 'Wind']).size() / N

joint_t_y_o_h_w

Temp,Outlook,Tennis,Humidity,Wind,P
Cool,Overcast,Yes,Normal,Strong,0.071429
Cool,Rain,No,Normal,Strong,0.071429
Cool,Rain,Yes,Normal,Weak,0.071429
Cool,Sunny,Yes,Normal,Weak,0.071429
Hot,Overcast,Yes,High,Weak,0.071429
Hot,Overcast,Yes,Normal,Weak,0.071429
Hot,Sunny,No,High,Strong,0.071429
Hot,Sunny,No,High,Weak,0.071429
Mild,Overcast,Yes,High,Strong,0.071429
Mild,Rain,No,High,Strong,0.071429


In [27]:
# p(y,o,h,w)
joint_y_o_h_w = joint_t_y_o_h_w.groupby(['Outlook', 'Tennis', 'Humidity', 'Wind']).sum()
joint_y_o_h_w

Outlook,Tennis,Humidity,Wind,P
Overcast,Yes,High,Strong,0.071429
Overcast,Yes,High,Weak,0.071429
Overcast,Yes,Normal,Strong,0.071429
Overcast,Yes,Normal,Weak,0.071429
Rain,No,High,Strong,0.071429
Rain,No,Normal,Strong,0.071429
Rain,Yes,High,Weak,0.071429
Rain,Yes,Normal,Weak,0.142857
Sunny,No,High,Strong,0.071429
Sunny,No,High,Weak,0.142857


In [28]:
#p(t|y,o,h,w)
cond_proba_t_y_o_h_w = joint_t_y_o_h_w / joint_y_o_h_w
cond_proba_t_y_o_h_w

Outlook,Tennis,Humidity,Wind,Temp,P
Overcast,Yes,High,Strong,Mild,1.0
Overcast,Yes,High,Weak,Hot,1.0
Overcast,Yes,Normal,Strong,Cool,1.0
Overcast,Yes,Normal,Weak,Hot,1.0
Rain,No,High,Strong,Mild,1.0
Rain,No,Normal,Strong,Cool,1.0
Rain,Yes,High,Weak,Mild,1.0
Rain,Yes,Normal,Weak,Cool,0.5
Rain,Yes,Normal,Weak,Mild,0.5
Sunny,No,High,Strong,Hot,1.0


### $p(y,o,h,w,t)$

In [29]:
joint_y_o_h_w_t = p_y * cond_proba_o_y * cond_proba_h_y_o * cond_proba_w_y_o * cond_proba_t_y_o_h_w
joint_y_o_h_w_t

Outlook,Tennis,Humidity,Wind,Temp,P
Overcast,Yes,High,Strong,Mild,0.071429
Overcast,Yes,High,Weak,Hot,0.071429
Overcast,Yes,Normal,Strong,Cool,0.071429
Overcast,Yes,Normal,Weak,Hot,0.071429
Rain,No,High,Strong,Mild,0.071429
Rain,No,Normal,Strong,Cool,0.071429
Rain,Yes,High,Weak,Mild,0.071429
Rain,Yes,Normal,Weak,Cool,0.071429
Rain,Yes,Normal,Weak,Mild,0.071429
Sunny,No,High,Strong,Hot,0.071429


## Sampling
Draw 20 samples from $p(y,o,h,w,t)$. Each sample is printed as Outlook,Tennis,Humidity,Wind,Temp.

In [30]:
index_list =  [','.join(map(str, item)) for item in joint_y_o_h_w_t.index]
np.random.choice(index_list, size=20, p=joint_y_o_h_w_t['P'].tolist())

array(['Rain,No,High,Strong,Mild', 'Sunny,No,High,Weak,Hot',
       'Sunny,Yes,Normal,Weak,Cool', 'Sunny,No,High,Weak,Mild',
       'Rain,Yes,Normal,Weak,Mild', 'Rain,No,Normal,Strong,Cool',
       'Sunny,Yes,Normal,Weak,Cool', 'Overcast,Yes,High,Weak,Hot',
       'Rain,Yes,High,Weak,Mild', 'Sunny,Yes,Normal,Strong,Mild',
       'Sunny,No,High,Weak,Hot', 'Overcast,Yes,High,Weak,Hot',
       'Sunny,Yes,Normal,Strong,Mild', 'Rain,Yes,Normal,Weak,Cool',
       'Sunny,No,High,Weak,Hot', 'Overcast,Yes,Normal,Weak,Hot',
       'Sunny,No,High,Weak,Hot', 'Overcast,Yes,High,Strong,Mild',
       'Sunny,No,High,Weak,Mild', 'Overcast,Yes,High,Strong,Mild'],
      dtype='<U31')
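
Joining the index tuples into strings works, but the samples then have to be split again before further use. A possible alternative, sketched here on a smaller two-variable joint table, is to sample row positions and look the tuples up afterwards. The `joint` Series below is a stand-in built from the $p(o,y)$ values computed earlier, and the seed `0` is arbitrary.

```python
import numpy as np
import pandas as pd

# Stand-in joint table p(o, y), written as exact fractions of N = 14.
joint = pd.Series(
    [4 / 14, 2 / 14, 3 / 14, 3 / 14, 2 / 14],
    index=pd.MultiIndex.from_tuples(
        [("Overcast", "Yes"), ("Rain", "No"), ("Rain", "Yes"),
         ("Sunny", "No"), ("Sunny", "Yes")],
        names=["Outlook", "Tennis"],
    ),
    name="P",
)

rng = np.random.default_rng(0)

# Sample integer row positions according to the joint probabilities,
# then map the positions back to their (Outlook, Tennis) tuples.
positions = rng.choice(len(joint), size=20, p=joint.to_numpy())
samples = joint.index[positions].to_frame(index=False)
print(samples.head())
```

The result is an ordinary DataFrame with one column per variable, so the samples can be grouped or counted like the original data.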

## Observations

The estimated joint distribution $p(y,o,h,w,t)$ is uniform over the combinations of $(y,o,h,w,t)$ that appear in the data set. This is because each combination occurs at most once in the data provided, so every possible combination has probability either 0 or $1/14 \approx 0.071429$.
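
To put this in perspective, the arithmetic can be sketched as follows. The level lists are the distinct values visible in the tables above, and the count of zero-probability cells assumes, as noted, that all 14 rows are distinct combinations.

```python
from math import prod

# Distinct values of each variable, as they appear in the tables above.
levels = {
    "Outlook": ["Overcast", "Rain", "Sunny"],
    "Temp": ["Cool", "Hot", "Mild"],
    "Humidity": ["High", "Normal"],
    "Wind": ["Strong", "Weak"],
    "Tennis": ["No", "Yes"],
}

n_obs = 14                                        # rows in the data set
n_cells = prod(len(v) for v in levels.values())   # possible combinations
n_zero = n_cells - n_obs                          # combinations never observed
p_each = 1 / n_obs                                # mass of each observed combination

print(n_cells, n_zero, round(p_each, 6))          # 72 58 0.071429
```

With 72 possible combinations and only 14 observations, most of the joint table is necessarily estimated as zero.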