### Base Implementation
#### 1. 
Hint: compare with the class distributions of the full dataset, **without splitting**

In [1]:
import numpy as np
from utils import ClassEncoder
from datasets import get_iris_dataset
X_full_iris, y_full_iris = get_iris_dataset()
print(X_full_iris.shape)
print(y_full_iris.shape)

(150, 4)
(150, 1)


In [2]:
def print_distribution(_y):
    encoder = ClassEncoder()
    encoded_y = encoder.fit_transform(_y) # convert class to number (encode)
    print('Distribución de clases:')
    distribution = np.bincount(encoded_y.flatten())/len(encoded_y)
    for class_name, value in zip(encoder.names, distribution):
        print(f'{class_name}: {value:.4f}')

print_distribution(y_full_iris)


Distribución de clases:
setosa: 0.3333
versicolor: 0.3333
virginica: 0.3333


The Iris dataset is balanced: each of the three classes accounts for one third of the samples. This is a desirable property. With a strong imbalance, for example 90% of the samples in one of the three classes, a model can reach high accuracy by mostly predicting the majority class while performing poorly on the minority classes, so accuracy alone would give a misleading picture of how well it generalizes.
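
To make the 90% example concrete, here is a minimal sketch using hypothetical labels (not the Iris data) and plain NumPy instead of `ClassEncoder`. It shows that a rule which always predicts the majority class reaches 90% accuracy while never recognizing the other two classes.

```python
import numpy as np

# Hypothetical labels: 90% class 0, 5% class 1, 5% class 2 (illustrative only)
y_true = np.array([0] * 90 + [1] * 5 + [2] * 5)

# Class distribution computed directly from the label counts
classes, counts = np.unique(y_true, return_counts=True)
distribution = counts / counts.sum()

# A "classifier" that ignores the features and always predicts the majority class
majority_class = classes[np.argmax(counts)]
y_pred = np.full_like(y_true, majority_class)

overall_accuracy = np.mean(y_pred == y_true)
# Per-class recall: fraction of each true class that is predicted correctly
per_class_recall = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])

print(distribution)
print(overall_accuracy)
print(per_class_recall)
```

On a balanced dataset such as Iris the same trivial rule would only reach about one third accuracy, which is why overall accuracy is a more informative metric there.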

### 1) QDA trained with uniform priors, and with one class at prior 0.9 and the others at 0.05 (3 combinations)

In [3]:
from utils import split_transpose, QDA, accuracy

def priori_test(dataset):
    X_full, y_full = dataset
    a_priori_A = [1/3, 1/3, 1/3]
    a_priori_B_1 = [0.9, 0.05, 0.05]
    a_priori_B_2 = [0.05, 0.9, 0.05]
    a_priori_B_3 = [0.05, 0.05, 0.9]
    
    a_priori_list = [a_priori_A, a_priori_B_1, a_priori_B_2, a_priori_B_3]
    # Hold out 40% of the samples for testing, with a fixed seed for reproducibility
    train_x, train_y, test_x, test_y = split_transpose(X_full, y_full, test_sz=0.4, random_state=6543)
    
    for i,_a_priori in enumerate(a_priori_list):
        model = QDA()
        model.fit(train_x, train_y, _a_priori)
        print('A prioris:')
        print(",".join([f' {class_name}:{p:.3f} ' for class_name, p in zip(model.encoder.names, _a_priori)]))
        train_acc = accuracy(train_y, model.predict(train_x))
        test_acc = accuracy(test_y, model.predict(test_x))
        print(f"[Model {i}] Train (apparent) error is {1-train_acc:.4f} while test error is {1-test_acc:.4f}")

priori_test(get_iris_dataset())

A prioris:
 setosa:0.333 , versicolor:0.333 , virginica:0.333 
[Model 0] Train (apparent) error is 0.0222 while test error is 0.0167
A prioris:
 setosa:0.900 , versicolor:0.050 , virginica:0.050 
[Model 1] Train (apparent) error is 0.0222 while test error is 0.0167
A prioris:
 setosa:0.050 , versicolor:0.900 , virginica:0.050 
[Model 2] Train (apparent) error is 0.0333 while test error is 0.0000
A prioris:
 setosa:0.050 , versicolor:0.050 , virginica:0.900 
[Model 3] Train (apparent) error is 0.0333 while test error is 0.0500


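From these results, the following observations can be made. The train split has 90 samples and the test split 60, so every difference below amounts to at most three samples:
- Models 0 and 1 have identical train and test errors, so strongly favoring setosa in the priors does not change any prediction on this split.
- Model 2 has a slightly higher train error (0.0333) but a test error of 0. A test error below the train error is not a sign of overfitting; with only 60 test samples it is most likely an artifact of this particular split.
- Model 3 has the highest test error (0.0500) together with a train error of 0.0333, which suggests that a prior favoring virginica hurts the model's performance, although the effect is small.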
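
One way to see why the priors change so few predictions: in QDA the score of class $k$ is $\log \pi_k - \tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k)$, so the prior only adds a constant per class. The sketch below uses a hypothetical two-class, one-feature problem (the function name `qda_log_posterior` and all numbers are assumptions for illustration, not the `utils.QDA` implementation). A point near the boundary flips class when the prior is skewed, while a point far from the boundary does not.

```python
import numpy as np

def qda_log_posterior(x, means, covs, priors):
    """Unnormalized log posterior of each class for a single point x of shape (p,)."""
    scores = []
    for mu, cov, prior in zip(means, covs, priors):
        diff = x - mu
        inv_cov = np.linalg.inv(cov)
        # Gaussian log-likelihood up to a constant shared by all classes
        log_likelihood = -0.5 * np.log(np.linalg.det(cov)) - 0.5 * diff @ inv_cov @ diff
        scores.append(np.log(prior) + log_likelihood)
    return np.array(scores)

# Hypothetical class parameters (illustrative numbers only)
means = [np.array([0.0]), np.array([2.0])]
covs = [np.array([[1.0]]), np.array([[1.0]])]

x_near = np.array([1.2])  # close to the decision boundary, slightly nearer class 1
x_far = np.array([4.0])   # far from the boundary, clearly class 1

pred_uniform = np.argmax(qda_log_posterior(x_near, means, covs, [0.5, 0.5]))
pred_skewed = np.argmax(qda_log_posterior(x_near, means, covs, [0.9, 0.1]))
pred_far = np.argmax(qda_log_posterior(x_far, means, covs, [0.9, 0.1]))

print(pred_uniform, pred_skewed, pred_far)
```

When classes are well separated, as the low error rates suggest for Iris, few samples sit close enough to a boundary for a skewed prior to flip them.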

## 2) Repeat point 1 for the penguins dataset

In [4]:
from datasets import get_penguins
X_full_penguin, y_full_penguin = get_penguins()

print_distribution(y_full_penguin)

Distribución de clases:
Adelie: 0.4415
Chinstrap: 0.1988
Gentoo: 0.3596


As the distribution shows, this dataset is not balanced: Adelie accounts for about 44% of the samples, while Chinstrap accounts for only about 20%.

In [5]:
priori_test(get_penguins())

A prioris:
 Adelie:0.333 , Chinstrap:0.333 , Gentoo:0.333 
[Model 0] Train (apparent) error is 0.0098 while test error is 0.0073
A prioris:
 Adelie:0.900 , Chinstrap:0.050 , Gentoo:0.050 
[Model 1] Train (apparent) error is 0.0195 while test error is 0.0219
A prioris:
 Adelie:0.050 , Chinstrap:0.900 , Gentoo:0.050 
[Model 2] Train (apparent) error is 0.0098 while test error is 0.0219
A prioris:
 Adelie:0.050 , Chinstrap:0.050 , Gentoo:0.900 
[Model 3] Train (apparent) error is 0.0098 while test error is 0.0073


### 3) Implement LDA

In [11]:
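from utils import BaseBayesianClassifier, inv, det

class LDA(BaseBayesianClassifier):

  def _fit_params(self, X, y):
    # X has shape (n_features, n_samples); a single covariance is shared by all classes.
    # NOTE: np.cov(X) measures spread around the global mean (total covariance).
    # Textbook LDA uses the pooled within-class covariance instead; the results
    # reported below were obtained with this total-covariance version.
    self.inv_cov = inv(np.cov(X, bias=True))
    # One mean vector of shape (n_features, 1) per encoded class index
    self.means = [X[:,y.flatten()==idx].mean(axis=1, keepdims=True) for idx in range(len(self.log_a_priori))]

  def _predict_log_conditional(self, x, class_idx):
    # log P(x | G=class_idx) up to an additive constant: the Gaussian log-density
    # with the shared covariance, so the log-determinant term is the same for every class
    unbiased_x =  x - self.means[class_idx]
    return 0.5*np.log(det(self.inv_cov)) -0.5 * unbiased_x.T @ self.inv_cov @ unbiased_x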
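
For reference, textbook LDA estimates the shared covariance from deviations around each sample's own class mean (the pooled within-class covariance), whereas `np.cov(X, bias=True)` measures deviations around the global mean and therefore also absorbs the separation between classes. The sketch below illustrates the difference on hypothetical synthetic data. The helper name `pooled_within_class_cov` is an assumption for illustration, and the layout (features in rows, integer-encoded labels) is assumed to match the notebook's.

```python
import numpy as np

def pooled_within_class_cov(X, y):
    """Pooled (biased) within-class covariance.

    X: array of shape (n_features, n_samples); y: integer labels, one per sample.
    """
    labels = y.flatten()
    centered = np.empty_like(X, dtype=float)
    for k in np.unique(labels):
        mask = labels == k
        # Subtract each class's own mean, not the global mean
        centered[:, mask] = X[:, mask] - X[:, mask].mean(axis=1, keepdims=True)
    return centered @ centered.T / X.shape[1]

# Hypothetical data: two unit-variance classes, 10 units apart along the first feature
rng = np.random.default_rng(0)
X = np.hstack([
    rng.normal(0.0, 1.0, size=(2, 200)),
    rng.normal(0.0, 1.0, size=(2, 200)) + np.array([[10.0], [0.0]]),
])
y = np.array([[0] * 200 + [1] * 200])

within = pooled_within_class_cov(X, y)
total = np.cov(X, bias=True)

print(within[0, 0])  # close to the true per-class variance of 1
print(total[0, 0])   # much larger, because it includes the gap between class means
```

Swapping this estimate into `_fit_params` would be one way to check whether the covariance choice explains part of the gap between LDA and QDA reported below; that experiment is not part of the original notebook.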

Compare LDA vs QDA on both datasets (without supplying multiple priors, i.e. letting the priors be estimated automatically).


In [18]:
for dataset_name,dataset in zip(['iris', 'penguins'],[get_iris_dataset(), get_penguins()]):
    for model_name, curr_model in zip(['QDA', 'LDA'],[QDA, LDA]):
        model = curr_model()
        x_full, y_full = dataset
        train_x, train_y, test_x, test_y = split_transpose(x_full, y_full, 0.4, 6543)
        model.fit(train_x, train_y)
        train_acc = accuracy(train_y, model.predict(train_x))
        test_acc = accuracy(test_y, model.predict(test_x))
        print(f"[Dataset={dataset_name}][Model={model_name}] train err {1-train_acc:.4f}, test err {1-test_acc:.4f}")
        
    

[Dataset=iris][Model=QDA] train err 0.0111, test err 0.0167
[Dataset=iris][Model=LDA] train err 0.1222, test err 0.2000
[Dataset=penguins][Model=QDA] train err 0.0146, test err 0.0146
[Dataset=penguins][Model=LDA] train err 0.0195, test err 0.0219


QDA performs better on both datasets. On penguins the gap is small (test error 0.0146 vs 0.0219), but on iris it is large (test error 0.0167 vs 0.2000).

4. Use 2 (two) other *random seed* values to obtain different train and test splits, and repeat the comparison from the previous point. Do the previous conclusions hold?

In [41]:
import pandas as pd

rows = []
for dataset_name, dataset in zip(['iris', 'penguins'], [get_iris_dataset(), get_penguins()]):
    x_full, y_full = dataset
    for model_name, curr_model in zip(['QDA', 'LDA'], [QDA, LDA]):
        for seed in [6543, 5501, 125]:
            model = curr_model()
            train_x, train_y, test_x, test_y = split_transpose(x_full, y_full, test_sz=0.4, random_state=seed)
            model.fit(train_x, train_y)
            train_acc = accuracy(train_y, model.predict(train_x))
            test_acc = accuracy(test_y, model.predict(test_x))
            rows.append({
                'Dataset': dataset_name,
                'Model': model_name,
                'seed': seed,
                'Error (train)': 1-train_acc,
                'Error (test)': 1-test_acc,
            })

# Build the DataFrame once instead of concatenating inside the loop
df = pd.DataFrame(rows)
print(df)
    

     Dataset Model  seed  Error (train)  Error (test)
0       iris   QDA  6543       0.011111      0.016667
1       iris   QDA  5501       0.022222      0.016667
2       iris   QDA   125       0.022222      0.016667
3       iris   LDA  6543       0.122222      0.200000
4       iris   LDA  5501       0.155556      0.150000
5       iris   LDA   125       0.155556      0.116667
6   penguins   QDA  6543       0.014634      0.014599
7   penguins   QDA  5501       0.014634      0.007299
8   penguins   QDA   125       0.009756      0.014599
9   penguins   LDA  6543       0.019512      0.021898
10  penguins   LDA  5501       0.019512      0.021898
11  penguins   LDA   125       0.019512      0.021898
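
To answer the question of whether the conclusions hold, it helps to aggregate over the seeds. The sketch below re-enters the test errors from the table above by hand (the variable names `results` and `summary` are illustrative) and summarizes them per dataset and model.

```python
import pandas as pd

# Test errors copied from the table above (one row per dataset, model, and seed)
results = pd.DataFrame({
    "Dataset": ["iris"] * 6 + ["penguins"] * 6,
    "Model": (["QDA"] * 3 + ["LDA"] * 3) * 2,
    "seed": [6543, 5501, 125] * 4,
    "Error (test)": [
        0.016667, 0.016667, 0.016667,   # iris, QDA
        0.200000, 0.150000, 0.116667,   # iris, LDA
        0.014599, 0.007299, 0.014599,   # penguins, QDA
        0.021898, 0.021898, 0.021898,   # penguins, LDA
    ],
})

# Mean and range of the test error over the three seeds
summary = results.groupby(["Dataset", "Model"])["Error (test)"].agg(["mean", "min", "max"])
print(summary)
```

For all three seeds, QDA's test error stays below LDA's on both datasets, so the earlier conclusion holds: the gap is large on iris and small on penguins.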


## 5 Tensorized QDA vs QDA

In [33]:
from utils import TensorizedQDA

x_full, y_full = get_iris_dataset()
train_x, train_y, test_x, test_y = split_transpose(x_full, y_full,test_sz=0.4, random_state=6543)

tqda = TensorizedQDA()
tqda.fit(train_x, train_y)
train_acc = accuracy(train_y, tqda.predict(train_x))
test_acc = accuracy(test_y, tqda.predict(test_x))

print(f"Train (apparent) error is {1-train_acc:.4f} while test error is {1-test_acc:.4f}")

Train (apparent) error is 0.0111 while test error is 0.0167


In [34]:
%%timeit

tqda.predict(test_x)

912 μs ± 5.85 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [39]:
qda = QDA()
qda.fit(train_x, train_y)
train_acc = accuracy(train_y, qda.predict(train_x))
test_acc = accuracy(test_y, qda.predict(test_x))
print(f"Train (apparent) error is {1-train_acc:.4f} while test error is {1-test_acc:.4f}")

Train (apparent) error is 0.0111 while test error is 0.0167


In [40]:
%%timeit

qda.predict(test_x)

3.1 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
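
The timings above (about 0.9 ms for `TensorizedQDA` versus about 3.1 ms for `QDA`) are consistent with replacing Python-level loops by a single batched array operation. The internals of `TensorizedQDA` are not shown in this notebook, so the following is only a sketch of the general idea, with hypothetical random parameters: it computes the quadratic form $(x-\mu_k)^\top \Sigma_k^{-1} (x-\mu_k)$ for every class and sample, first with explicit loops and then with one `np.einsum` call, and checks that both agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_features, n_samples = 3, 4, 60

# Hypothetical class parameters and test points (features in rows, as in the notebook)
means = rng.normal(size=(n_classes, n_features, 1))
A = rng.normal(size=(n_classes, n_features, n_features))
inv_covs = A @ A.transpose(0, 2, 1) + np.eye(n_features)  # symmetric positive definite
X = rng.normal(size=(n_features, n_samples))

# Loop version: one class and one sample at a time
quad_loop = np.empty((n_classes, n_samples))
for k in range(n_classes):
    for i in range(n_samples):
        d = X[:, [i]] - means[k]  # shape (n_features, 1)
        quad_loop[k, i] = (d.T @ inv_covs[k] @ d).item()

# Batched version: all classes and samples in a single einsum call
diff = X[None, :, :] - means  # shape (n_classes, n_features, n_samples)
quad_batched = np.einsum("kin,kij,kjn->kn", diff, inv_covs, diff)

print(np.allclose(quad_loop, quad_batched))
```

Note that the `QDA` timing has a large spread (± 1 ms), so the exact speedup ratio from these two cells should be read as approximate.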
