# Imports

In [1]:
import pyreadr
import pandas as pd

## Convert data files to csv

### Decathlon

In [2]:
# Load RData file
rdata = pyreadr.read_r("data/decathlon.RData")
rdata

OrderedDict([('X',
                            100m  Long.jump  Shot.put  High.jump   400m  110m.hurdle  \
              rownames                                                                 
              SEBRLE       11.04       7.58     14.83       2.07  49.81        14.69   
              CLAY         10.76       7.40     14.26       1.86  49.37        14.05   
              KARPOV       11.02       7.30     14.77       2.04  48.37        14.09   
              BERNARD      11.02       7.23     14.25       1.92  48.93        14.99   
              YURKOV       11.34       7.09     15.19       2.10  50.42        15.31   
              WARNERS      11.11       7.60     14.31       1.98  48.68        14.23   
              ZSIVOCZKY    11.13       7.30     13.48       2.01  48.62        14.17   
              McMULLEN     10.83       7.31     13.76       2.13  49.91        14.38   
              MARTINEAU    11.64       6.81     14.57       1.95  50.14        14.93   
             

In [3]:
# Access the data frame from the loaded RData file
data_frame = rdata["X"]
data_frame

rownames,100m,Long.jump,Shot.put,High.jump,400m,110m.hurdle,Discus,Pole.vault,Javeline,1500m
SEBRLE,11.04,7.58,14.83,2.07,49.81,14.69,43.75,5.02,63.19,291.7
CLAY,10.76,7.4,14.26,1.86,49.37,14.05,50.72,4.92,60.15,301.5
KARPOV,11.02,7.3,14.77,2.04,48.37,14.09,48.95,4.92,50.31,300.2
BERNARD,11.02,7.23,14.25,1.92,48.93,14.99,40.87,5.32,62.77,280.1
YURKOV,11.34,7.09,15.19,2.1,50.42,15.31,46.26,4.72,63.44,276.4
WARNERS,11.11,7.6,14.31,1.98,48.68,14.23,41.1,4.92,51.77,278.1
ZSIVOCZKY,11.13,7.3,13.48,2.01,48.62,14.17,45.67,4.42,55.37,268.0
McMULLEN,10.83,7.31,13.76,2.13,49.91,14.38,44.41,4.42,56.37,285.1
MARTINEAU,11.64,6.81,14.57,1.95,50.14,14.93,47.6,4.92,52.33,262.1
HERNU,11.37,7.56,14.41,1.86,51.1,15.06,44.99,4.82,57.19,285.1


In [4]:
# Convert the data frame to a CSV file
data_frame.to_csv("decathlon.csv", index=True)

### Train and test datafiles

In [5]:
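# Load the three training sets (A, B, C) from space-delimited files;
# the column names are supplied explicitly
train = {
    letter: pd.read_csv(
        f"data/train{letter}", delimiter=" ", names=["x1", "x2", "y"]
    )
    for letter in ["A", "B", "C"]
}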
train["A"]

,x1,x2,y
0,12.138367,6.455699,1
1,10.622594,11.083096,0
2,11.777052,8.121582,1
3,10.960882,12.226554,0
4,11.296539,10.211002,0
...,...,...,...
95,10.188695,6.171622,1
96,11.024072,4.082187,1
97,11.090619,10.954867,0
98,9.840793,4.902898,1


In [6]:
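# Load the three test sets (A, B, C) from space-delimited files;
# the column names are supplied explicitly
test = {
    letter: pd.read_csv(
        f"data/test{letter}", delimiter=" ", names=["x1", "x2", "y"]
    )
    for letter in ["A", "B", "C"]
}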
test["A"]

,x1,x2,y
0,12.356261,6.053231,1
1,11.842674,4.832690,1
2,12.080976,5.739803,1
3,9.435296,8.878012,0
4,11.023415,6.173500,1
...,...,...,...
95,12.132122,10.539799,0
96,10.948403,6.676466,1
97,8.728051,5.950333,1
98,9.376795,9.968496,0


# 1. Classification

## 1.1 Linear discriminant analysis

### a- Write the maximum likelihood estimator of the model parameters

$$\begin{align}
L_{(x_i, y_i)_i}(\pi, \mu_0, \mu_1, \Sigma)
    & = \log p(x_1, \dots, x_n, y_1, \dots, y_n \mid \pi, \mu_0, \mu_1, \Sigma) \\
    & = \sum_{i=1}^n \log p(x_i, y_i \mid \pi, \mu_0, \mu_1, \Sigma) \\
    & = \sum_{i=1}^n \log \big[ p(x_i \mid y_i, \mu_0, \mu_1, \Sigma) \cdot p(y_i \mid \pi) \big] \\
    & = \sum_{i=1}^n \log p(x_i \mid y_i, \mu_0, \mu_1, \Sigma) + \sum_{i=1}^n \log p(y_i \mid \pi)
\end{align}$$

Note that differentiating (4) with respect to $\pi$ makes the left-hand sum vanish, since it does not depend on $\pi$.  
Likewise, differentiating with respect to $\mu_0$, $\mu_1$ or $\Sigma$ makes the right-hand sum vanish.  
We can therefore compute $\hat{\pi}_{MLE}$ separately from $\hat{\mu}_{0,MLE}$, $\hat{\mu}_{1,MLE}$ and $\hat{\Sigma}_{MLE}$.
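
The separation described above can be checked numerically. The following is a minimal sketch, not the notebook's own implementation: the helper names (`bernoulli_log_lik`, `gaussian_log_pdf_2d`, `joint_log_lik`) and the parameter values `mu0`, `mu1`, `sigma` are arbitrary assumptions chosen for illustration, and the five samples are the first rows of the `train["A"]` output shown earlier. It shows that changing $\pi$ only moves the Bernoulli term of the joint log-likelihood.

```python
import math

def bernoulli_log_lik(ys, pi):
    """Sum of log p(y_i | pi) for labels y_i in {0, 1}."""
    return sum(y * math.log(pi) + (1 - y) * math.log(1 - pi) for y in ys)

def gaussian_log_pdf_2d(x, mu, sigma):
    """Log density of a 2D Gaussian with mean mu and 2x2 covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T sigma^{-1} (x - mu), using the closed-form 2x2 inverse.
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def gaussian_log_lik(xs, ys, mu0, mu1, sigma):
    """Sum of log p(x_i | y_i, mu0, mu1, sigma): the class label selects the mean."""
    return sum(
        gaussian_log_pdf_2d(x, mu1 if y == 1 else mu0, sigma)
        for x, y in zip(xs, ys)
    )

def joint_log_lik(xs, ys, pi, mu0, mu1, sigma):
    """Joint log-likelihood, written as the sum of the two terms of the last line."""
    return gaussian_log_lik(xs, ys, mu0, mu1, sigma) + bernoulli_log_lik(ys, pi)

# First five rows of train["A"] as displayed above.
xs = [
    (12.138367, 6.455699),
    (10.622594, 11.083096),
    (11.777052, 8.121582),
    (10.960882, 12.226554),
    (11.296539, 10.211002),
]
ys = [1, 0, 1, 0, 0]

# Arbitrary illustrative parameter values (assumptions, not estimates).
mu0, mu1 = (11.0, 11.0), (12.0, 7.0)
sigma = ((1.0, 0.2), (0.2, 1.5))

# Moving pi from 0.3 to 0.6 changes the joint log-likelihood by exactly
# the change in the Bernoulli term; the Gaussian term is untouched.
delta_joint = joint_log_lik(xs, ys, 0.6, mu0, mu1, sigma) - joint_log_lik(xs, ys, 0.3, mu0, mu1, sigma)
delta_bernoulli = bernoulli_log_lik(ys, 0.6) - bernoulli_log_lik(ys, 0.3)
print(delta_joint, delta_bernoulli)
```

The two printed values agree up to floating-point rounding, which is why the two groups of parameters can be optimized independently.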

**Let us start by computing $\hat{\pi}_{MLE}$**

$$\begin{align}
L_{(y_i)_i}(\pi) & = \log p(y_1, \dots, y_n \mid \pi) \\
    & = \sum_{i=1}^n \log p(y_i \mid \pi) \\
    & = \sum_{i=1}^n \log \big[ \pi^{y_i} \cdot (1-\pi)^{1-y_i} \big] \\
    & = \sum_{i=1}^n \big[ y_i \cdot \log(\pi) + (1-y_i) \cdot \log(1-\pi) \big]
\end{align}$$

Differentiate this quantity with respect to $\pi$:

$$\begin{align}
\nabla_\pi(L_{(y_i)_i}(\pi)) & = \sum_{i=1}^n \left( \frac{y_i}{\pi} - \frac{1}{1-\pi} + \frac{y_i}{1-\pi} \right) \\
    & = - \frac{n}{1-\pi} + \sum_{i=1}^n \frac{y_i}{\pi(1-\pi)}
\end{align}$$

Because the logarithm is strictly increasing on $(0, +\infty)$, maximizing the likelihood is equivalent to maximizing the log-likelihood over $\pi \in (0, 1)$.  
The log-likelihood is concave in $\pi$, so any stationary point is a maximizer. Let us find it.  

Let $\pi \in (0, 1)$ be such that $\nabla_\pi(L_{(y_i)_i}(\pi)) = 0$.
Then
$$ \sum_{i=1}^n \frac{y_i}{\pi(1-\pi)} = \frac{n}{1-\pi} $$
i.e.
$$ \pi = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}$$

Hence $\hat{\pi}_{MLE} = \bar{y}$
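
As a quick sanity check of this result, here is a minimal sketch in plain Python. The helper names `bernoulli_log_lik` and `bernoulli_score` are hypothetical, and the labels are those of the first five rows of the `train["A"]` output shown earlier, used only as a toy sample. It evaluates the derivative formula at $\bar{y}$ and compares $\bar{y}$ with the best value found on a grid.

```python
import math

# Labels of the first five rows of train["A"] as displayed above (toy sample).
ys = [1, 0, 1, 0, 0]

def bernoulli_log_lik(ys, pi):
    """Log-likelihood: sum of y_i * log(pi) + (1 - y_i) * log(1 - pi)."""
    return sum(y * math.log(pi) + (1 - y) * math.log(1 - pi) for y in ys)

def bernoulli_score(ys, pi):
    """Derivative of the log-likelihood: -n / (1 - pi) + sum(y_i) / (pi * (1 - pi))."""
    n = len(ys)
    return -n / (1 - pi) + sum(ys) / (pi * (1 - pi))

# Closed-form estimator: the empirical mean of the labels.
pi_hat = sum(ys) / len(ys)

# Brute-force check on a grid of candidate values in (0, 1).
grid = [k / 100 for k in range(1, 100)]
best_on_grid = max(grid, key=lambda p: bernoulli_log_lik(ys, p))

print("pi_hat:", pi_hat)
print("best value on the grid:", best_on_grid)
print("derivative at pi_hat:", bernoulli_score(ys, pi_hat))
```

The grid search is only there to illustrate the maximization; in practice the closed form $\bar{y}$ is all that is needed.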


**Now let us compute $\hat{\mu}_0$, $\hat{\mu}_1$ and $\hat{\Sigma}$**

$$\begin{align}
L_{(x_i)_i}(y_i, \mu_0, \mu_1, \Sigma)
    & = \log p(x_1, \dots, x_n \mid y_1, \dots, y_n, \mu_0, \mu_1, \Sigma) \\
    & = \sum_{i=1}^n \log p(x_i \mid y_i, \mu_0, \mu_1, \Sigma) \\
    & = \sum_{i=1}^n \Big[ \mathbb{1}_{y_i = 0} \cdot \log \Big( \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\big( -\tfrac{1}{2} (x_i - \mu_0)^T \Sigma^{-1} (x_i - \mu_0) \big) \Big) \\
    & \qquad\quad + \mathbb{1}_{y_i = 1} \cdot \log \Big( \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\big( -\tfrac{1}{2} (x_i - \mu_1)^T \Sigma^{-1} (x_i - \mu_1) \big) \Big) \Big]
\end{align}$$

Here $d = 2$ is the dimension of the $x_i$, and the $\pi$ inside the normalizing constant is the number $3.14159\dots$, not the class prior.

From this form, we can compute the gradient of the log-likelihood with respect to $\mu_0$.

$$
\nabla_{\mu_0}(L_{(x_i)_i}(y_i, \mu_0, \mu_1, \Sigma))
    = \sum_{i=1}^{n} \mathbb{1}_{y_i = 0} \cdot \Sigma^{-1} (x_i - \mu_0)
$$
We can then maximize the log-likelihood.  
Let $\mu_0 \in \mathbb{R}^2$ be such that $\nabla_{\mu_0}(L_{(x_i)_i}(y_i, \mu_0, \mu_1, \Sigma)) = 0$. Then
$$
\sum_{i=1}^{n} \mathbb{1}_{y_i = 0} \cdot ( \Sigma^{-1} x_i - \Sigma^{-1} \mu_0 ) = 0
$$
Introduce the notation $n_0 = \mathrm{card}\{i \in \{1, \dots, n\} : y_i = 0\}$, the number of samples in class 0,  
and $\bar{x}_0 = \frac{1}{n_0} \sum_{i=1}^{n} \mathbb{1}_{y_i = 0} \cdot x_i$, the mean of the class 0 samples.  

We then have
$$ n_0 \cdot \Sigma^{-1} \mu_0 = \Sigma^{-1} \cdot \sum_{i=1}^{n} \mathbb{1}_{y_i = 0} \cdot x_i = n_0 \cdot \Sigma^{-1} \bar{x}_0
$$
and, multiplying on the left by $\Sigma$,
$$ \mu_0 = \hat{\mu}_0 = \bar{x}_0
$$
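
The result says that $\hat{\mu}_0$ is the average of the class 0 samples. The sketch below illustrates this in plain Python under stated assumptions: `class_mean` is a hypothetical helper, and the data are the first five rows of the `train["A"]` output shown earlier, used as a toy sample. It also checks that the class 0 residuals sum to zero, which makes the gradient vanish for any invertible $\Sigma$.

```python
# First five rows of train["A"] as displayed above (toy sample).
xs = [
    (12.138367, 6.455699),
    (10.622594, 11.083096),
    (11.777052, 8.121582),
    (10.960882, 12.226554),
    (11.296539, 10.211002),
]
ys = [1, 0, 1, 0, 0]

def class_mean(xs, ys, label):
    """Average of the samples whose label equals `label`, one value per coordinate."""
    rows = [x for x, y in zip(xs, ys) if y == label]
    n = len(rows)
    return tuple(sum(row[j] for row in rows) / n for j in range(len(rows[0])))

mu0_hat = class_mean(xs, ys, 0)

# Sum over class 0 of (x_i - mu0_hat): it should be zero in each coordinate,
# so Sigma^{-1} applied to it is zero as well.
residual_sum = tuple(
    sum(x[j] - mu0_hat[j] for x, y in zip(xs, ys) if y == 0)
    for j in range(2)
)

print("mu0_hat:", mu0_hat)
print("sum of class 0 residuals:", residual_sum)
```

With the notebook's data frames, the same quantity could presumably be obtained with a grouped mean in pandas; the plain Python version is used here only to keep the sketch self-contained.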
