### Discrete distribution: contingency table

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


#### Let's consider a dataset with categorical variables

In [2]:
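# draw a sample of size N: first two components of Dirichlet(10, 5, 3) draws,
# scaled by 10 and rounded up to integers
N = 100
rng = np.random.default_rng(seed=1234)
data = pd.DataFrame(
    np.ceil(rng.dirichlet((10, 5, 3), N)[:, :2] * 10).astype('int'),
    columns=['X', 'Y'],
)
# map the integer values to letter categories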
cats = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
data.X = data.X.map({x:cats[i] for i, x in enumerate(np.unique(data.X))})
data.Y = data.Y.map({y:cats[j].lower() for j, y in enumerate(np.unique(data.Y))})
# show
data.head()

|   | X | Y |
|---|---|---|
| 0 | A | d |
| 1 | D | b |
| 2 | E | b |
| 3 | C | c |
| 4 | D | c |


### Cardinality

- The ***cardinality*** of a categorical variable is the number of distinct outcomes it can take.
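
As a small illustration of this definition, the sketch below counts the distinct categories of a toy series in two equivalent ways. The values are hypothetical and are not the notebook's dataset. Both approaches count only the categories that actually appear in the sample, so they can undercount the true cardinality when some outcome is never observed.

```python
import numpy as np
import pandas as pd

# toy categorical sample (hypothetical values, not the notebook's data)
s = pd.Series(['A', 'C', 'B', 'A', 'C', 'A'])

# number of distinct observed categories, computed two ways
card_np = np.unique(s).shape[0]   # NumPy: size of the sorted unique array
card_pd = s.nunique()             # pandas: same count, ignoring missing values

# per-category counts, i.e. a 1D contingency table
counts = s.value_counts().sort_index()
print(card_np, card_pd)
print(counts)
```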

In [3]:
cardX = np.unique(data.X).shape[0]
'categories of X: %s, cardinality of X is |X| = %1d' %(np.unique(data.X), cardX)

"categories of X: ['A' 'B' 'C' 'D' 'E'], cardinality of X is |X| = 5"

In [4]:
cardY = np.unique(data.Y).shape[0]
'categories of Y: %s, cardinality of Y is |Y| = %1d' %(np.unique(data.Y), cardY)

"categories of Y: ['a' 'b' 'c' 'd' 'e' 'f'], cardinality of Y is |Y| = 6"

### Contingency table

A contingency table displays the count (frequency) of observations for each combination of values of a set of categorical variables. It is the multivariate generalization of the frequency histogram; equivalently, a frequency histogram is a 1D contingency table.

Here's a basic example of a 2x2 contingency table:

```
           |      y_1      |      y_2      |    marginal X
---------------------------------------------------------------
     x_1   |      n11      |      n12      |   n11+n12 = n1.
     x_2   |      n21      |      n22      |   n21+n22 = n2.
----------------------------------------------------------------
marginal Y | n11+n21 = n.1 | n12+n22 = n.2 | n11+n12+n21+n22 = N
```

In this table:

- $x_1$ and $x_2$ represent the possible outcomes of a random variable $X$.
- $y_1$ and $y_2$ represent the possible outcomes of a random variable $Y$.
- $n_{ij}$ represents the count (frequency) of joint observations of $X=x_i$ and $Y=y_j$.

The marginal count in each row $n_{i.}$ represents the total number of observations for $X=x_i$, while the marginal count in each column $n_{.j}$ represents the total number of observations for $Y=y_j$. The count in the bottom right cell $N$ is the total number of observations in the entire dataset.
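
To make the notation concrete, here is a minimal sketch with made-up counts for a 2x2 table. The numbers in `n` are assumptions chosen for illustration only. It shows that the row sums give $n_{i.}$, the column sums give $n_{.j}$, and both sets of marginals add up to $N$.

```python
import numpy as np

# hypothetical joint counts n_ij for a 2x2 table (illustrative values)
n = np.array([[12, 8],
              [5, 25]])

row_marginals = n.sum(axis=1)   # n_i. : total observations with X = x_i
col_marginals = n.sum(axis=0)   # n_.j : total observations with Y = y_j
N = n.sum()                     # total number of observations

# both sets of marginals must sum to the same total N
assert row_marginals.sum() == N == col_marginals.sum()
print(row_marginals, col_marginals, N)
```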


### Compute the contingency table: a $\left(|X|,|Y|\right)$ matrix

#### 1. factorize categorical variables

In [5]:
# map each category to its rank in sorted order (np.unique already returns sorted values)
data['x'] = data.X.map({x: i for i, x in enumerate(np.unique(data.X))})
data['y'] = data.Y.map({y: j for j, y in enumerate(np.unique(data.Y))})
data.head()

|   | X | Y | x | y |
|---|---|---|---|---|
| 0 | A | d | 0 | 3 |
| 1 | D | b | 3 | 1 |
| 2 | E | b | 4 | 1 |
| 3 | C | c | 2 | 2 |
| 4 | D | c | 3 | 2 |
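
The same integer encoding can also be obtained with `pd.factorize`. The sketch below uses a toy series (hypothetical values) and `sort=True`, so that codes follow the sorted category order, as in the dictionary-based mapping above.

```python
import pandas as pd

# toy categorical sample (hypothetical values, not the notebook's data)
s = pd.Series(['C', 'A', 'B', 'A'])

# codes: integer label per observation; uniques: the sorted categories
codes, uniques = pd.factorize(s, sort=True)
print(codes)          # one integer code per observation
print(list(uniques))  # category for each code
```

Keeping `uniques` around is useful for labeling the rows and columns of the table later, since `uniques[code]` recovers the original category.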


#### 2. initialize a matrix to store the contingency table (add 1 row and 1 col to store the marginal counts)

In [6]:
ct = np.zeros((cardX +1, cardY +1))
ct

array([[0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.]])

#### 3. compute the joint counts

In [7]:
# increment the cell of each observed (x, y) pair
for _, obs in data.iterrows():
    ct[obs.x, obs.y] += 1
ct, np.sum(ct)

(array([[ 0.,  0.,  3.,  3.,  1.,  0.,  0.],
        [ 0.,  1.,  3., 12.,  7.,  1.,  0.],
        [ 1.,  5., 12., 12.,  4.,  0.,  0.],
        [ 0.,  9., 17.,  0.,  0.,  0.,  0.],
        [ 1.,  5.,  3.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.]]),
 100.0)

#### 4. compute the marginal counts

In [8]:
# compute the marginal X
marginal_x = np.sum(ct[:-1, :-1], axis = 1)
# compute the marginal Y
marginal_y = np.sum(ct[:-1, :-1], axis = 0)
# add to contingency table
ct[:-1, -1] = marginal_x
ct[-1, :-1] = marginal_y
# show
ct

array([[ 0.,  0.,  3.,  3.,  1.,  0.,  7.],
       [ 0.,  1.,  3., 12.,  7.,  1., 24.],
       [ 1.,  5., 12., 12.,  4.,  0., 34.],
       [ 0.,  9., 17.,  0.,  0.,  0., 26.],
       [ 1.,  5.,  3.,  0.,  0.,  0.,  9.],
       [ 2., 20., 38., 27., 12.,  1.,  0.]])

#### 5. compute the total counts

In [9]:
ct[-1, -1] = np.sum(ct[:-1, :-1])
# show
ct

array([[  0.,   0.,   3.,   3.,   1.,   0.,   7.],
       [  0.,   1.,   3.,  12.,   7.,   1.,  24.],
       [  1.,   5.,  12.,  12.,   4.,   0.,  34.],
       [  0.,   9.,  17.,   0.,   0.,   0.,  26.],
       [  1.,   5.,   3.,   0.,   0.,   0.,   9.],
       [  2.,  20.,  38.,  27.,  12.,   1., 100.]])
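
As a cross-check on the manual steps above, pandas can build the same kind of table in one call with `pd.crosstab`. The sketch below runs on a small toy frame (hypothetical values, not the notebook's data); `margins=True` appends the marginal row and column, labeled `All` by default.

```python
import pandas as pd

# toy dataset with two categorical columns (hypothetical values)
toy = pd.DataFrame({'X': ['A', 'A', 'B', 'B', 'B', 'C'],
                    'Y': ['a', 'b', 'a', 'a', 'b', 'b']})

# joint counts with marginal row/column appended under the label 'All'
table = pd.crosstab(toy.X, toy.Y, margins=True)
print(table)
```

The bottom-right cell `table.loc['All', 'All']` plays the role of `ct[-1, -1]` in the manual construction.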

#### Define a 2D contingency table class

In [24]:
class ContingencyTable2D:
    """Joint counts of two categorical pandas Series X and Y."""

    def __init__(self, X, Y):

        # cardinalities (number of observed categories)
        self.cardX = np.unique(X).shape[0]
        self.cardY = np.unique(Y).shape[0]

        # factorize: map each category to its rank in sorted order
        X_ = X.map({x: i for i, x in enumerate(np.unique(X))})
        Y_ = Y.map({y: j for j, y in enumerate(np.unique(Y))})

        # joint counts
        self.counts = np.zeros((self.cardX, self.cardY))
        for x, y in zip(X_, Y_):
            self.counts[x, y] += 1

        # total count
        self.n = np.sum(self.counts)

    def mrgX(self):
        """Marginal counts of X (row sums)."""
        return np.sum(self.counts, axis=1)

    def mrgY(self):
        """Marginal counts of Y (column sums)."""
        return np.sum(self.counts, axis=0)


In [25]:
ct = ContingencyTable2D(data.X, data.Y)
ct.counts

array([[ 0.,  0.,  3.,  3.,  1.,  0.],
       [ 0.,  1.,  3., 12.,  7.,  1.],
       [ 1.,  5., 12., 12.,  4.,  0.],
       [ 0.,  9., 17.,  0.,  0.,  0.],
       [ 1.,  5.,  3.,  0.,  0.,  0.]])

In [26]:
ct.mrgX()

array([ 7., 24., 34., 26.,  9.])

In [27]:
ct.mrgY()

array([ 2., 20., 38., 27., 12.,  1.])

In [28]:
ct.n

100.0
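
The explicit Python loop used to fill the joint counts can be replaced by a vectorized update. The following is a sketch of one possible alternative, not part of the class above: the helper name `joint_counts` and the toy arrays are assumptions made for illustration. It relies on `np.unique(..., return_inverse=True)` to factorize and on `np.add.at` to accumulate repeated index pairs correctly.

```python
import numpy as np

def joint_counts(X, Y):
    """Hypothetical helper: return sorted categories of X and Y and their joint counts."""
    # inverse indices give, for each observation, the rank of its category
    x_cats, x_idx = np.unique(X, return_inverse=True)
    y_cats, y_idx = np.unique(Y, return_inverse=True)
    counts = np.zeros((x_cats.size, y_cats.size), dtype=int)
    # unbuffered add: repeated (x, y) pairs are each counted
    np.add.at(counts, (x_idx, y_idx), 1)
    return x_cats, y_cats, counts

# toy data (hypothetical values, not the notebook's dataset)
X = np.array(['A', 'A', 'B', 'B', 'B', 'C'])
Y = np.array(['a', 'b', 'a', 'a', 'b', 'b'])
x_cats, y_cats, counts = joint_counts(X, Y)
print(counts)
```

A plain `counts[x_idx, y_idx] += 1` would not work here, because fancy-indexed in-place addition applies each repeated index pair only once; `np.add.at` avoids that pitfall.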