# Estimating distributions (part 1)
This notebook explores a first approach to approximating $p(y|x)$ and $p(x|y)$ on a tabular dataset where the features are discrete-valued, $x\in\mathbb{D}^k$, and the target is binary, $y\in\{0,1\}$ (stored here as No/Yes).

## Imports

In [3]:
import numpy as np
import pandas as pd

## Load data set

In [4]:
df = pd.read_csv("sample_data/tennis.csv", delimiter=",", header=0)
df

,Day,Outlook,Temp,Humidity,Wind,Tennis
0,D1,Sunny,Hot,High,Weak,No
1,D2,Sunny,Hot,High,Strong,No
2,D3,Overcast,Hot,High,Weak,Yes
3,D4,Rain,Mild,High,Weak,Yes
4,D5,Rain,Cool,Normal,Weak,Yes
5,D6,Rain,Cool,Normal,Strong,No
6,D7,Overcast,Cool,Normal,Strong,Yes
7,D8,Sunny,Mild,High,Weak,No
8,D9,Sunny,Cool,Normal,Weak,Yes
9,D10,Rain,Mild,Normal,Weak,Yes


In [5]:
df = df.drop("Day", axis=1)
df

,Outlook,Temp,Humidity,Wind,Tennis
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rain,Mild,Normal,Weak,Yes


In [6]:
X_names = df.columns.to_list()[:-1]
X_names

['Outlook', 'Temp', 'Humidity', 'Wind']

In [7]:
X = df.iloc[:, 0:-1]
X

,Outlook,Temp,Humidity,Wind
0,Sunny,Hot,High,Weak
1,Sunny,Hot,High,Strong
2,Overcast,Hot,High,Weak
3,Rain,Mild,High,Weak
4,Rain,Cool,Normal,Weak
5,Rain,Cool,Normal,Strong
6,Overcast,Cool,Normal,Strong
7,Sunny,Mild,High,Weak
8,Sunny,Cool,Normal,Weak
9,Rain,Mild,Normal,Weak


In [8]:
Y_name = df.columns.to_list()[-1]
Y_name

'Tennis'

In [9]:
Y = df.iloc[:, -1]
Y

0      No
1      No
2     Yes
3     Yes
4     Yes
5      No
6     Yes
7      No
8     Yes
9     Yes
10    Yes
11    Yes
12    Yes
13     No
Name: Tennis, dtype: object

## Build the table of observations
Take $x$ to be the random variable Outlook and count how often each (Outlook, Tennis) pair occurs in the data set.

In [10]:
N = X["Outlook"].size
xvalues = np.unique(X["Outlook"].values).tolist()
yvalues = np.unique(Y.values).tolist()
dimx = len(xvalues)
dimy = len(yvalues)

In [11]:
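# Count table: one row per Outlook value, one column per Tennis value.
obs = pd.DataFrame(0, columns=yvalues, index=xvalues)
for i in range(N):
    xi = X["Outlook"].iloc[i]
    yi = Y.iloc[i]
    # Use a single .loc call; chained indexing (obs[yi][xi] += 1) may update a copy.
    obs.loc[xi, yi] += 1
obs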

,No,Yes
Overcast,0,4
Rain,2,3
Sunny,3,2
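
The counting loop above can also be expressed with `pd.crosstab`. The following is a minimal sketch, not the notebook's own method: the `toy` frame is a hypothetical stand-in rebuilt from the counts in the table above (row order is arbitrary), so that the example runs without `tennis.csv`.

```python
import pandas as pd

# (Outlook, Tennis) pairs rebuilt from the counts in the table above.
# This is an illustrative stand-in for the real data; only the counts matter.
counts = {
    ("Overcast", "Yes"): 4,
    ("Rain", "No"): 2,
    ("Rain", "Yes"): 3,
    ("Sunny", "No"): 3,
    ("Sunny", "Yes"): 2,
}
pairs = [pair for pair, n in counts.items() for _ in range(n)]
toy = pd.DataFrame(pairs, columns=["Outlook", "Tennis"])

# One call replaces the explicit loop; combinations that never occur get 0.
obs_ct = pd.crosstab(toy["Outlook"], toy["Tennis"])
print(obs_ct)
```

With the real data, the same call would be `pd.crosstab(X["Outlook"], Y)`.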


## Approximate the joint distribution $p(x,y)$
Take $x$ to be Outlook and approximate the joint distribution by dividing each cell of the table of observations by the total number of observations $N$.
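
As a worked equation, writing $n_{x,y}$ for the number of observations with Outlook equal to $x$ and Tennis equal to $y$ (this symbol is introduced here for illustration only), the empirical estimate is

$$\hat{p}(x,y) = \frac{n_{x,y}}{N}, \qquad \sum_{x}\sum_{y} \hat{p}(x,y) = 1 .$$

For example, with $N = 14$ and $n_{\text{Sunny},\text{No}} = 3$, this gives $\hat{p}(\text{Sunny},\text{No}) = 3/14 \approx 0.214$.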

In [12]:
m = obs.sum(axis=1)
m

Overcast    4
Rain        5
Sunny       5
dtype: int64

In [13]:
l = obs.sum(axis=0)
l

No     5
Yes    9
dtype: int64

In [14]:
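obs["m"] = m        # row totals: number of observations per Outlook value
obs.loc["l"] = l    # column totals: number of observations per Tennis value (the "m" cell stays NaN)
obs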

,No,Yes,m
Overcast,0.0,4.0,4.0
Rain,2.0,3.0,5.0
Sunny,3.0,2.0,5.0
l,5.0,9.0,


In [15]:
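# Initialise with floats so that probabilities are not written into integer columns.
joint_proba = pd.DataFrame(0.0, columns=yvalues, index=xvalues)
for x in xvalues:
    # obs[yvalues] leaves out the "m" totals column.
    joint_proba.loc[x] = obs[yvalues].loc[x] / N
joint_proba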

,No,Yes
Overcast,0.0,0.285714
Rain,0.142857,0.214286
Sunny,0.214286,0.142857
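
A quick sanity check on the joint table is that its entries sum to 1 and that summing over one variable recovers the marginal of the other. The sketch below assumes the counts shown earlier and rebuilds them by hand so that it is self-contained; the names `joint`, `p_x`, and `p_y` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Observation counts copied from the table of observations (without the totals).
obs = pd.DataFrame(
    {"No": [0, 2, 3], "Yes": [4, 3, 2]},
    index=["Overcast", "Rain", "Sunny"],
)
N = obs.to_numpy().sum()

joint = obs / N            # vectorised version of the row-by-row loop
p_x = joint.sum(axis=1)    # marginal over Outlook: sum across Tennis values
p_y = joint.sum(axis=0)    # marginal over Tennis: sum across Outlook values

# A valid joint distribution sums to 1.
assert np.isclose(joint.to_numpy().sum(), 1.0)
print(p_x)
print(p_y)
```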


## Approximate $p(y|x)$
Take $x$ to be Outlook and estimate the conditional probability of $y$ given $x$ by dividing each row of counts by its row total. Then sample 10 values of $y$ given that $x$ equals Sunny.

In [16]:
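# Initialise with floats so that probabilities are not written into integer columns.
p_y_x = pd.DataFrame(0.0, columns=yvalues, index=xvalues)
for x in xvalues:
    # Normalise each row by its total m, so that every row sums to 1.
    p_y_x.loc[x] = obs[yvalues].loc[x] / obs["m"].loc[x]
p_y_x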

,No,Yes
Overcast,0.0,1.0
Rain,0.4,0.6
Sunny,0.6,0.4
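
The row-wise normalisation can also be written without a loop. This is a minimal sketch that assumes the counts shown earlier, rebuilt by hand so the block runs on its own; it uses `DataFrame.div` with `axis=0` to divide each row by its own total.

```python
import numpy as np
import pandas as pd

# Observation counts copied from the table of observations (without the totals).
obs = pd.DataFrame(
    {"No": [0, 2, 3], "Yes": [4, 3, 2]},
    index=["Overcast", "Rain", "Sunny"],
)

# Divide every row by its row total: each row becomes p(y | x) for one Outlook value.
p_y_x = obs.div(obs.sum(axis=1), axis=0)

# Each conditional distribution must sum to 1.
assert np.allclose(p_y_x.sum(axis=1).to_numpy(), 1.0)
print(p_y_x)
```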


In [22]:
np.random.choice(yvalues, size=10, p=p_y_x.loc["Sunny"])

array(['Yes', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'No'],
      dtype='<U3')
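
Because the draw above is random, rerunning the cell gives different samples. One way to make it reproducible is a seeded `np.random.default_rng` generator; in this sketch the seed value and the names `rng`, `sample`, and `big` are arbitrary choices for illustration, and the probabilities are copied from the Sunny row of the table above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

yvalues = ["No", "Yes"]
p_sunny = [0.6, 0.4]  # p(y | x = Sunny), copied from the table above

# Ten draws, as in the cell above.
sample = rng.choice(yvalues, size=10, p=p_sunny)
print(sample)

# With many draws, the empirical frequency of "No" should approach 0.6.
big = rng.choice(yvalues, size=10_000, p=p_sunny)
print((big == "No").mean())
```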

## Approximate $p(x|y)$
Take $x$ to be Outlook and approximate the conditional distribution of $x$ given $y$ by dividing each column of counts by its column total. Then sample 10 values of Outlook given that $y$ equals Yes.

In [18]:
p_x_y = pd.DataFrame(0.0, columns=yvalues, index=xvalues)
for y in yvalues:
    # Divide each column by its total (row "l"); index alignment drops the "l" row itself.
    p_x_y[y] = obs[y] / obs[y].loc["l"]
p_x_y

,No,Yes
Overcast,0.0,0.444444
Rain,0.4,0.333333
Sunny,0.6,0.222222
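
The two conditional tables are tied together by Bayes' rule, $p(x|y) = p(y|x)\,p(x)/p(y)$. The sketch below checks this numerically, assuming the counts shown earlier (rebuilt by hand so the block is self-contained); the name `bayes` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Observation counts copied from the table of observations (without the totals).
obs = pd.DataFrame(
    {"No": [0, 2, 3], "Yes": [4, 3, 2]},
    index=["Overcast", "Rain", "Sunny"],
)
N = obs.to_numpy().sum()

p_x = obs.sum(axis=1) / N                  # marginal over Outlook
p_y = obs.sum(axis=0) / N                  # marginal over Tennis
p_y_x = obs.div(obs.sum(axis=1), axis=0)   # rows sum to 1
p_x_y = obs.div(obs.sum(axis=0), axis=1)   # columns sum to 1

# Bayes' rule: multiply each row of p(y|x) by p(x), then divide each column by p(y).
bayes = p_y_x.mul(p_x, axis=0).div(p_y, axis=1)

assert np.allclose(bayes.to_numpy(), p_x_y.to_numpy())
print(bayes)
```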


In [19]:
np.random.choice(xvalues, size=10, p=p_x_y["Yes"])

array(['Overcast', 'Rain', 'Overcast', 'Overcast', 'Overcast', 'Rain',
       'Rain', 'Overcast', 'Rain', 'Rain'], dtype='<U8')