## Considerations for Classification
This section records the main considerations for our classification problem.

#### Small, imbalanced dataset
Background reading on imbalanced data:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html

- The dataset is quite small, so we cannot afford to split it into three separate pieces for training, validation and testing. Instead we use a 90%/10% split into a training set and a held-out test set. Within the 90% training portion, we use k-fold cross-validation to estimate how the model performs on unseen data before evaluating it on the held-out set (the truly unseen data).
- The dataset is also imbalanced. The remaining points describe how we plan to deal with this.
- Performance metric: use a metric other than accuracy (the F1 score, for example) together with confusion matrices to inspect the outputs by hand and check whether the model has truly learned something or is simply reflecting the underlying class distribution (i.e. predicting the most common class to get a good score, which we do not want).
- Resampling: we can oversample or undersample the data (add samples to the smaller classes or remove samples from the larger ones) to achieve a better balance. Because we have little data, oversampling is probably the better option, although it may lead to over-fitting. If we oversample, it should be done AFTER the cross-validation partitioning, and only on the training part of each fold, so that copies of a sample never end up in both the training and validation parts.
- Penalised model: we can penalise the model more heavily when it misclassifies the minority classes. It should be investigated how well this works for a problem with many minority classes. The penalty enters through the cost function. For a PyTorch neural network, one candidate is the `weight` argument of `torch.nn.CrossEntropyLoss`, which takes one weight per class and therefore also covers the case of several minority classes.
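To illustrate why accuracy alone is misleading on imbalanced data, here is a minimal sketch that builds a confusion matrix and a macro-averaged F1 score with NumPy only. The helper names `confusion_matrix` and `macro_f1`, and the toy labels, are assumptions made for this example and are not part of the project code.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores, so every class counts equally."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    # Classes with no true and no predicted samples get an F1 of 0
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return f1.mean()

# Toy data (hypothetical): 8 samples of class 0, 2 of class 1,
# and a "model" that always predicts the majority class
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.zeros(10, dtype=int)

cm = confusion_matrix(y_true, y_pred, n_classes=2)
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy, macro_f1(cm))
```

In this toy case the accuracy is 0.8 even though the minority class is never predicted, while the macro F1 drops to about 0.44 because the second class contributes an F1 of zero.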

#### Current Pipeline
1. Pull lightcurves for ZTF sources from ALeRCE and get their classes from TNS
2. Interpolate the data and merge all lightcurves into a single array
3. One-hot encode the labels
4. Create training and test sets (proportions 90%/10%)
5. Use K-fold cross-validation on the training set for model training
6. Within each fold, over-sample the minority classes in the training part only
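Steps 4 to 6 can be sketched with NumPy index arrays as follows. This is only an illustration under stated assumptions: the labels are made-up integer class indices, `oversample` is a hypothetical helper, `k = 5` is an arbitrary choice, and the folds are not stratified. With very rare classes a stratified splitter (for example scikit-learn's `StratifiedKFold`) would probably be a safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample(indices, labels, rng):
    """Resample each class (with replacement) up to the size of the largest class."""
    classes, counts = np.unique(labels[indices], return_counts=True)
    target = counts.max()
    out = []
    for c, n in zip(classes, counts):
        members = indices[labels[indices] == c]
        extra = rng.choice(members, size=target - n, replace=True)
        out.append(np.concatenate([members, extra]))
    return rng.permutation(np.concatenate(out))

# Hypothetical integer class labels for 100 sources (70 / 20 / 10)
labels = np.array([0] * 70 + [1] * 20 + [2] * 10)

# Step 4: 90% / 10% train/test split on shuffled indices
perm = rng.permutation(len(labels))
n_test = len(labels) // 10
test_idx, train_idx = perm[:n_test], perm[n_test:]

# Steps 5-6: k folds over the training set; only the training part is oversampled
k = 5
folds = np.array_split(train_idx, k)
for i in range(k):
    val_idx = folds[i]
    fit_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    fit_idx = oversample(fit_idx, labels, rng)
    # Train on fit_idx here, then evaluate on val_idx (untouched by the oversampling)
    print(i, np.bincount(labels[fit_idx]), len(val_idx))
```

Because the oversampling happens after the fold is split off, no copy of a validation sample can appear in the data the model is fitted on.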

On implementing K-fold cross-validation in PyTorch:

https://www.machinecurve.com/index.php/2021/02/03/how-to-use-k-fold-cross-validation-with-pytorch/#summary-and-code-example-k-fold-cross-validation-with-pytorch

#### Input to RNN
- The RNN receives no explicit time information, so we feed it an array containing the lightcurve magnitude data points for each source. Ideally, the implicit "timestep" between consecutive points should be the same everywhere, so that the RNN sees one data point per source every X minutes/hours/days in all cases. Resampling the lightcurves onto such a regular grid is therefore an important pre-processing step for our problem.
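A minimal sketch of this pre-processing step, assuming linear interpolation with `np.interp` onto a regular grid. The function name `resample_lightcurve`, the one-day step and the toy lightcurve are assumptions for illustration; the interpolation scheme actually used in the pipeline may differ.

```python
import numpy as np

def resample_lightcurve(t, mag, step, n_steps):
    """Linearly interpolate (t, mag) onto n_steps points spaced `step` apart."""
    order = np.argsort(t)
    t = np.asarray(t, dtype=float)[order]
    mag = np.asarray(mag, dtype=float)[order]
    grid = t[0] + step * np.arange(n_steps)
    # np.interp holds the last observed value for grid points beyond t[-1]
    return grid, np.interp(grid, t, mag)

# Hypothetical irregularly sampled lightcurve (times in days, magnitudes)
t = np.array([0.0, 1.0, 4.0, 5.0])
mag = np.array([18.0, 17.0, 17.6, 18.0])

grid, mag_uniform = resample_lightcurve(t, mag, step=1.0, n_steps=6)
print(grid)
print(mag_uniform)
```

Using the same `step` and `n_steps` for every source gives arrays of equal length, which can then be stacked into the single array mentioned in the pipeline.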

## Import Packages and Data

In [None]:
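import numpy as np
import pickle
import matplotlib.pyplot as plt
import torch
from torch.utils.data import Dataset, DataLoader

# Encode Labels
# Create Dataset Class
# Create DataLoader

class LightcurveDataset(Dataset):
    """Pairs each interpolated lightcurve with its encoded class label."""

    def __init__(self, lightcurves, labels):
        # Store both arrays as tensors so that DataLoader can batch them
        self.lightcurves = torch.as_tensor(lightcurves, dtype=torch.float32)
        self.labels = torch.as_tensor(labels)

    def __len__(self):
        # Number of sources in the dataset
        return len(self.lightcurves)

    def __getitem__(self, idx):
        # Return one (lightcurve, label) pair
        return self.lightcurves[idx], self.labels[idx]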
        


## Model Definition

In [None]:
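import torch.nn as nn

class ENID(nn.Module):
    """Placeholder for the RNN classifier; the layers are still to be defined."""

    def __init__(self):
        super().__init__()
        # TODO: define the recurrent and output layers here

    def forward(self, x):
        # TODO: apply the layers; for now the input is returned unchanged
        return x

Net = ENID()
print(Net)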

## Training
#### Cost Function & Optimizer
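As a possible starting point, the sketch below combines a class-weighted cross-entropy loss with an optimizer. It relies on the `weight` argument of `torch.nn.CrossEntropyLoss`, which takes one weight per class. The class counts, the inverse-frequency weighting, the `nn.Linear` stand-in model and the choice of Adam with `lr=1e-3` are all assumptions for illustration, not decisions made in this notebook. Targets are given as class indices here; one-hot labels can be converted with `argmax`.

```python
import torch
import torch.nn as nn

# Hypothetical class counts in the training set (3 classes)
class_counts = torch.tensor([70.0, 20.0, 10.0])

# Inverse-frequency weights, scaled so that a balanced set would give all ones
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

# Placeholder model standing in for the RNN classifier (16 inputs, 3 classes)
model = nn.Linear(16, 3)

# Misclassifying a rare class now costs more than misclassifying a common one
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative optimisation step on random data
x = torch.randn(8, 16)
y = torch.randint(0, 3, (8,))  # class indices, not one-hot vectors
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(class_weights, loss.item())
```

If oversampling is already applied inside each fold, the weights should probably be computed from the resampled training data, or left uniform, so that the minority classes are not compensated for twice.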

#### Training Loop