# Adaptation to new classes
project 7 by Guanzhao Wang, Haochen Wu, Yukai Wang

### Abstract outline:
The goal of this project is to combine unsupervised and supervised methods. The baselines train the model in a purely supervised way on labeled categories. The unsupervised part follows. The primary goal is to train a model that can classify all the categories, both labeled and unlabeled.

A classification model is first trained on the labeled categories of the Fashion-MNIST dataset. New categories are then added without labels. Six unsupervised clustering methods are implemented and compared: Kmeans, Kmeans with PCA, Kmeans with Auto Encoder, Gaussian Mixture, Gaussian Mixture with PCA, and Gaussian Mixture with Auto Encoder. The unlabeled categories are labeled by the clustering method and concatenated with the labeled categories. The classification model is then trained again on the full dataset, which contains the pre-labeled categories and the new categories labeled by the unsupervised method. The classification results of all eight methods (the six clustering approaches plus the two baselines) are compared, and a conclusion is given at the end.
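
The relabel-and-merge step described above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the helper name `merge_pseudo_labels`, the toy samples, and the convention that cluster `c` becomes class `K + c` are all assumptions made for the example.

```python
K = 7  # number of classes that keep their ground-truth labels (assumed split)


def merge_pseudo_labels(labeled, unlabeled_samples, cluster_ids, num_known=K):
    """Concatenate labeled pairs with pseudo-labeled pairs (hypothetical helper).

    labeled:           list of (sample, label) with labels in [0, num_known)
    unlabeled_samples: list of samples from the new categories
    cluster_ids:       one cluster index per unlabeled sample (from any clusterer)
    """
    if len(unlabeled_samples) != len(cluster_ids):
        raise ValueError("one cluster id is required per unlabeled sample")
    # Assumed convention: cluster c is treated as new class num_known + c.
    pseudo = [(s, num_known + c) for s, c in zip(unlabeled_samples, cluster_ids)]
    return labeled + pseudo


labeled = [("img_a", 0), ("img_b", 6)]
merged = merge_pseudo_labels(labeled, ["img_c", "img_d", "img_e"], [2, 0, 1])
print(merged)
```

Cluster indices are arbitrary, so a pseudo-label produced this way need not coincide with the original class index of the dataset. Any evaluation against the true labels has to account for that.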

### Teammates:
- Guanzhao Wang:

- Haochen Wu:

- Yukai Wang:

### Library
Several library functions are implemented. They are imported at the top of the next code block. The external libraries are imported at the bottom of the same block.

Our own code library is implemented in the <em>mylibs</em> subdirectory. The files in the library are listed below in the order in which they are used to train models:
- <strong>dataloader</strong>: 
>Include customFashionMNIST class for loading custom Fashion-MNIST dataset. Dataloader functions: <br><code>getTrainValidateLoaders(include_labels=range(10), transform=None, batch_size=64, split = 0.9, num_workers=1, mode=7, USE_GPU=False)</code>
     <br><code>getTestLoaders(include_labels=range(10), transform=None, batch_size=64, num_workers=1, USE_GPU=False)</code>
- <strong>transform</strong>: 
>Define transform used for dataloader. 
- <strong>model</strong>: 
>Include several Neural Networks. A normal CNN: <code>Net()</code>, <br>A ResNet model: <code>CustomFashionResNet()</code>, <br>clustering model: <code>Autoencoder()</code>
- <strong>loss</strong>: 
>Include loss functions. <br>crossEntropyLoss for training: <code>loss_function()</code>; <br>loss for autoencoder: <code>autoencoder_loss()</code>
- <strong>train</strong>: 
>Train functions for models and autoencoder: <br><code>train(train_val_loaders, net, loss_function, optimizer, USE_GPU, checkpoint_path)</code> and <br><code>autoencoder_train(train_loader, net, loss_function, optimizer, USE_GPU)</code>
- <strong>eval</strong>: 
>Include validation function: <code>validate(val_loader, net, loss, USE_GPU)</code>
- <strong>clustering</strong>: 
>Unsupervised method for labeling data. <code>label_data(unlabeled_data, labels, mode=0, USE_GPU=False)</code>
- <strong>report</strong>: 
>Two report functions. <br>Training report: <code>report_epoch_summary(eval_metrics)</code>; <br>Summary report: <code>report_summary(mode_metrics, mode_description)</code>

External libraries included:
- <strong>numpy</strong>:
- <strong>matplotlib</strong>: 
><br><code>matplotlib.pyplot</code>
- <strong>torch</strong>:
- <strong>torchvision</strong>:
- <strong>sklearn</strong>: 
>used in clustering: <code>KMeans</code>, <code>GaussianMixture</code>, and <code>confusion_matrix</code>
- <strong>pl_bolts.models.autoencoders</strong>: 
><code>pip install lightning-bolts</code>

In [1]:
from mylibs.dataloader import getTrainValidateLoaders, getTestLoaders
from mylibs.train import train
from mylibs.eval import validate
from mylibs.model import Net, CustomFashionResNet
from mylibs.loss import loss_function
from mylibs.report import report_epoch_summary, report_summary
from mylibs.transform import transform_aug

import torch
from torchvision import transforms
import os



In [2]:
USE_GPU = True
BATCH_SIZE = 64
EPOCH = 30
NUM_WORKERS = 2
K = 7

In [3]:
device = torch.device("cuda" if USE_GPU else "cpu")

transform = transform_aug

mode_description = {0: "clustering: kmeans",
                    1: "clustering: kmeans with PCA",
                    2: "clustering: kmeans with Auto Encoder",
                    3: "clustering: Gaussian Mixture",
                    4: "clustering: Gaussian Mixture with PCA",
                    5: "clustering: Gaussian Mixture with Auto Encoder",
                    6: "use only labeled data",
                    7: "use full FashionMNIST data",
                   }

mode_description_short = {0: "Kmeans",
                          1: "Kmeans with PCA",
                          2: "Kmeans with Auto Encoder",
                          3: "Gaussian Mixture",
                          4: "Gaussian Mixture with PCA",
                          5: "Gaussian Mixture with Auto Encoder",
                          6: "Labeled data only",
                          7: "Full FashionMNIST",
                         }

mode_metrics = {}

In [4]:
def whole_flow(mode, useResnet):
    global mode_metrics
    print(f"Getting train and validate dataloaders for mode {mode}: {mode_description[mode]}")
    train_val_loaders = getTrainValidateLoaders(include_labels=range(K), transform=transform, batch_size=BATCH_SIZE, split=0.9, num_workers=NUM_WORKERS, mode=mode, USE_GPU=USE_GPU)
    if useResnet:
        model = CustomFashionResNet(color_scale = 1, num_classes = 10).to(device)
    else:
        model = Net().to(device)
    optimizer = torch.optim.Adadelta(model.parameters(), lr=0.01)
    eval_metrics = []

    model_name = "ResNet" if useResnet else "Net"
    checkpoint_path = f"./checkpoint/mode_{mode}/{model_name}"
    os.makedirs(checkpoint_path, exist_ok=True)
    
    print(f"Start Training... {model_name}")
    # scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, EPOCH+1):
        eval_metric = train(train_val_loaders, model, loss_function, optimizer, USE_GPU, f"{checkpoint_path}/epoch_{epoch}.pt")
        eval_metrics.append(eval_metric)
        print(f"Epoch: {epoch}")
        print(f"\tTrain      - Loss: {eval_metric['train']['loss']:.4f} Accuracy: {eval_metric['train']['acc']:.4f} F1_score: {eval_metric['train']['f1']:.4f}")
        print(f"\tValidation - Loss: {eval_metric['val']['loss']:.4f} Accuracy: {eval_metric['val']['acc']:.4f} F1_score: {eval_metric['val']['f1']:.4f}")

    report_epoch_summary(eval_metrics)
    
    all_val_f1 = [x['val']['f1'] for x in eval_metrics]
    best_epoch = all_val_f1.index(max(all_val_f1)) + 1
    print(f"Loading model at epoch {best_epoch} for best validation f1")
    # Load the checkpoint of the best epoch, not the last epoch of the loop.
    checkpoint = torch.load(f"{checkpoint_path}/epoch_{best_epoch}.pt")
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    
    print("Preparing test loaders")
    labeled_test_loader, unlabeled_test_loader, test_loader = getTestLoaders(include_labels=range(K), transform=transform, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, USE_GPU=USE_GPU)
    
    model.eval()
    eval_metric = validate(labeled_test_loader, model, loss_function, USE_GPU)
    print(f"Result on labelled test set  : Loss: {eval_metric['loss']:.4f} Accuracy: {eval_metric['acc']:.4f} F1_score: {eval_metric['f1']:.4f}")

    eval_metric = validate(unlabeled_test_loader, model, loss_function, USE_GPU)
    print(f"Result on unlabelled test set: Loss: {eval_metric['loss']:.4f} Accuracy: {eval_metric['acc']:.4f} F1_score: {eval_metric['f1']:.4f}")

    eval_metric = validate(test_loader, model, loss_function, USE_GPU)
    print(f"Result on full test set      : Loss: {eval_metric['loss']:.4f} Accuracy: {eval_metric['acc']:.4f} F1_score: {eval_metric['f1']:.4f}")
    
    mode_metrics[mode] = eval_metrics
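
`whole_flow` computes the best epoch as the one with the highest validation F1 score. A minimal standalone sketch of that selection rule follows, using the validation F1 values of the first five epochs from the Baseline #1 log. The names `val_f1_log` and `checkpoint_file` are stand-ins introduced for illustration.

```python
# Validation F1 per epoch (epochs 1-5 of the Baseline #1 run).
val_f1_log = [0.7197, 0.7551, 0.7885, 0.8047, 0.8137]

# list.index returns the first position of the maximum; epochs are 1-based.
best_epoch = val_f1_log.index(max(val_f1_log)) + 1
checkpoint_file = f"epoch_{best_epoch}.pt"
print(best_epoch, checkpoint_file)
```

If several epochs tie for the maximum, this rule keeps the earliest one.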


## Baseline #1, use only first K-class labelled data to train

In [None]:
whole_flow(6, False)

Getting train and validate dataloaders for mode 6: use only labeled data
Start Training... Net
Epoch: 1
	Train      - Loss: 1.0801 Accuracy: 0.6504 F1_score: 0.6379
	Validation - Loss: 0.7584 Accuracy: 0.7276 F1_score: 0.7197
Epoch: 2
	Train      - Loss: 0.6806 Accuracy: 0.7534 F1_score: 0.7476
	Validation - Loss: 0.6314 Accuracy: 0.7617 F1_score: 0.7551
Epoch: 3
	Train      - Loss: 0.5903 Accuracy: 0.7885 F1_score: 0.7843
	Validation - Loss: 0.5689 Accuracy: 0.7917 F1_score: 0.7885
Epoch: 4
	Train      - Loss: 0.5390 Accuracy: 0.8072 F1_score: 0.8045
	Validation - Loss: 0.5229 Accuracy: 0.8083 F1_score: 0.8047
Epoch: 5
	Train      - Loss: 0.5040 Accuracy: 0.8205 F1_score: 0.8183
	Validation - Loss: 0.5074 Accuracy: 0.8174 F1_score: 0.8137


## Baseline #2, use full Fashion-MNIST dataset to train

In [None]:
whole_flow(7, False)

# Label new categories by clustering

- label the dataset under different clustering methods
- compare accuracy on the test set
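
As a rough illustration of the shape these clustering variants share, the sketch below optionally reduces flattened images with PCA and then clusters them with KMeans or a Gaussian mixture from scikit-learn. It is a sketch under stated assumptions, not the code in `mylibs.clustering`: the function name `cluster_unlabeled`, the random stand-in data, the number of PCA components, and the three clusters (one per class outside the first K = 7) are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def cluster_unlabeled(images, n_clusters=3, method="kmeans", n_components=16, seed=0):
    """Return one cluster index per image (hypothetical helper)."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    # Optional dimensionality reduction before clustering.
    if n_components is not None:
        flat = PCA(n_components=n_components, random_state=seed).fit_transform(flat)
    if method == "kmeans":
        model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    elif method == "gmm":
        model = GaussianMixture(n_components=n_clusters, random_state=seed)
    else:
        raise ValueError(f"unknown method: {method}")
    return model.fit_predict(flat)


# Random stand-in for 28x28 grayscale images; real data would come from the dataloader.
rng = np.random.default_rng(0)
fake_images = rng.random((200, 28, 28))
kmeans_ids = cluster_unlabeled(fake_images, method="kmeans")
gmm_ids = cluster_unlabeled(fake_images, method="gmm", n_components=8)
print(kmeans_ids.shape, gmm_ids.shape)
```

Passing `n_components=None` skips PCA and clusters the raw pixels. An autoencoder variant would replace the PCA step with the latent features of a trained encoder.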

## Approach #1, use KMeans to compute label for unlabelled data

In [None]:
whole_flow(0, False)

## Approach #2, use KMeans with PCA to compute label for unlabelled data

In [None]:
whole_flow(1, False)

## Approach #3, use KMeans with Auto Encoder to compute label for unlabelled data

In [None]:
whole_flow(2, False)

## Approach #4, use Gaussian Mixture to compute label for unlabelled data

In [None]:
whole_flow(3, False)

## Approach #5, use Gaussian Mixture with PCA to compute label for unlabelled data

In [None]:
whole_flow(4, False)

## Approach #6, use Gaussian Mixture with Auto Encoder to compute label for unlabelled data

In [None]:
whole_flow(5, False)

In [None]:
report_summary(mode_metrics, mode_description_short)

## Conclusion

- overview of all the methods
- eight methods are compared
- final method for best accuracy