The workflow for a data science project typically follows these steps:

1. Get and explore the data
2. Build a model 
3. Train the model
4. Save and predict

## 1. Get and Explore the Data
The first step can take quite some time: data quality often needs to be checked, and correlations in the data should be explored and visualized.

This step can be a full project on its own: you clean the data, make sure you can access it properly, and create visualizations and hypotheses to gain insight into the data, which can be shown in a dashboard.

Insight into the data is an essential ingredient for deciding on a model.

## 2. Build a model
Based on domain knowledge and a first exploration of the data, a model can be selected.

Sometimes, the relation between features and outcome is very obvious. You might have features that
correlate very strongly with the outcome variable, and a domain expert confirms that the correlations make sense.

If this is the case, you can often build a simple model. If you expect non-linear and complex interactions between the features,
you could use a model that can handle non-linear data, like an SVM with a kernel, or a random forest.

If you have enough data (as a rule of thumb, at least 1000 observations) you can consider a neural network architecture.
If the expected complexity of the data is low, you can use a relatively small network.
If you have lots and lots of data with a high complexity, you should consider increasing the complexity of your model too.

How to build a model, and which models suit which data types and situations, is the subject of the whole course.

## 3. Train the model
Once you have created a model, it hasn't learned anything yet. The model must be trained to learn the right connections, a bit like a baby that has to learn what works and what doesn't.

In this notebook, I will introduce you to PyTorch. Another high-level library is TensorFlow, which is used a lot too.
While the interface is comparable, the TensorFlow syntax is a bit more high-level. While this can be an advantage,
it also has a downside: as soon as you need to dive a bit deeper into the architecture itself, it is much harder to
add something new with TensorFlow than with PyTorch.

## 4. Save and predict
Finally, you will want to use the trained model to predict new observations.

# Load the data
We will use the Fashion MNIST dataset. You will find this dataset a lot in machine learning tutorials. It consists of small (28x28) images of clothing.

In [None]:
from pathlib import Path
import mads_datasets
from mads_datasets import DatasetFactoryProvider, DatasetType
import warnings
from tomlserializer import TOMLSerializer
warnings.simplefilter("ignore", UserWarning)
fashionfactory = DatasetFactoryProvider.create_factory(DatasetType.FASHION)

mads_datasets.__version__

In [None]:
datasets = fashionfactory.create_dataset()

We now have a collection of `Dataset` objects, one per split (such as `"train"` and `"valid"`). Each `Dataset` implements at minimum a `__getitem__` and a `__len__` method.
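
To make that protocol concrete, here is a minimal sketch of a class that supports indexing and `len()` in the same way. `ToyDataset` and its contents are hypothetical and only illustrate the two dunder methods; the real datasets return an image tensor and a label.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset: stores (features, label) pairs."""

    def __init__(self, features, labels):
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = features
        self.labels = labels

    def __len__(self):
        # called by len(dataset)
        return len(self.labels)

    def __getitem__(self, idx):
        # called by dataset[idx]; returns one (features, label) tuple
        return self.features[idx], self.labels[idx]


toy = ToyDataset(features=[[0.0, 0.1], [0.5, 0.9], [1.0, 0.3]], labels=[0, 1, 1])
n_items = len(toy)  # uses __len__
first = toy[0]      # uses __getitem__
print(n_items, first)  # 3 ([0.0, 0.1], 0)
```

Anything that offers these two methods can be indexed and measured like a list, which is all a dataloader needs to draw samples from it.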

In [None]:
datasets["train"]

To get the data, we can use the `__getitem__` method by calling an index, just like you would do with a list or array.

In [None]:
x = datasets["train"][0]
type(x), type(x[0]), type(x[1]), x[0].max(), x[0].min(), x[0].dtype

This is equivalent to the following, although no one would write it that way: we implement the dunder method to make life easier, not more complex.


In [None]:
x = datasets["train"].__getitem__(0)

`x` is a tuple. We can check the length:

In [None]:
len(x)

We can get the 0th item, which is the image (a tensor). The other item is the label (an int).

In [None]:
img = x[0]
img.shape

You can see the image has a channel-first convention: it is a 28x28 pixel image, and it has 1 channel (grey). Look into the official documentation if you want to know more about datasets and how to build your own: [docs](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)

Ok, we want to batch this into a dataloader. From the documentation:

> The Dataset retrieves our dataset’s features and labels one sample at a time. While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval

Why is the length of the dataloader different from the dataset? We had 60000 items before...
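
One way to reason about this: a dataloader counts batches, not individual items. Below is a small sketch of the arithmetic, assuming the 60000 training items mentioned above and the batch size of 64 used in the next cell. `drop_last` refers to the `DataLoader` option of the same name, which defaults to `False`.

```python
import math

n_items = 60000
batch_size = 64

# with the last, partially filled batch kept (the DataLoader default)
n_batches = math.ceil(n_items / batch_size)
# with the incomplete last batch dropped (drop_last=True)
n_full_batches = n_items // batch_size
# number of items in the final, smaller batch
last_batch_size = n_items - n_full_batches * batch_size

print(n_batches, n_full_batches, last_batch_size)  # 938 937 32
```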

In [None]:
# we can either use PyTorch's DataLoader
from torch.utils.data import DataLoader
trainloader = DataLoader(datasets["train"], batch_size=64, shuffle=True)
testloader = DataLoader(datasets["valid"], batch_size=64, shuffle=True)
len(trainloader)

In [None]:
%timeit X, y = next(iter(trainloader))

In [None]:
from mltrainer.preprocessors import BasePreprocessor
preprocessor = BasePreprocessor()

In [None]:
# or the BaseDatastreamer from the datasetfactory. Check out which one is faster

streamers = fashionfactory.create_datastreamer(batchsize=64, preprocessor=preprocessor)
train = streamers["train"]
valid = streamers["valid"]
trainstreamer = train.stream()
validstreamer = valid.stream()

In [None]:
X, y = next(iter(trainstreamer))
X.shape

In [None]:
%timeit X, y = next(iter(trainstreamer))

In [None]:
len(train), len(valid)

In [None]:
X, y = next(iter(trainstreamer))

In [None]:
X.shape, y.shape

In [None]:
type(X[0])

So, what do we see here? The shape of our data has four dimensions:

- 64: this is the batch size. Every batch has 64 observations; in this case 64 images.
- 1: this is the channel dimension. Color images typically have 3 channels. Our images are grayscale, and thus have 1 channel. Images can also have more channels (e.g. infrared).
- (28, 28): this is the actual image, with dimensions 28x28.

Let's visualize the first image in the batch:

In [None]:
img = X[0]
img.shape

In [None]:
import matplotlib.pyplot as plt
plt.imshow(img.squeeze(), cmap="gray")

# Create a model

In [None]:
import torch
# pick the best available device; plain strings keep later checks like device == "mps" simple
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = "mps"
    print("using mps")
elif torch.cuda.is_available():
    device = "cuda:0"
    print("using cuda")
else:
    device = "cpu"
    print("using cpu")

In [None]:
from torch import nn
from loguru import logger

logger.info(f"Using {device} device")

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.units_1 = 512
        self.units_2 = 256
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, self.units_1),
            nn.ReLU(),
            nn.Linear(self.units_1, self.units_2),
            nn.ReLU(),
            nn.Linear(self.units_2, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to("cpu")
from torchinfo import summary
summary(model, input_size=(1, 28, 28))

Hopefully, you recognize the setup from the `linearmodel` notebook.

- We `Flatten` the image. That means we transform our (64, 1, 28, 28) data into (64, 784) shaped data: every image is flattened into a one-dimensional vector.
- We have a stack of linear layers. These are essentially dot products. Our vector of 784 (28*28) elements is transformed into 512 elements, then into 256 elements, and finally into 10 elements because we have 10 classes.
- In between the linear transformations you can see the activation functions, here a `ReLU`.
- The `forward` method is what runs when you call the model on a batch, during training as well as prediction. This gives you control over the flow of information: it is easy to create some parallel flow of data if you want to do something like that.
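
The flattening step can be illustrated without PyTorch. The sketch below uses plain nested lists as a stand-in for one 28x28 image; `nn.Flatten` does the equivalent on tensors, for every image in the batch at once.

```python
height, width = 28, 28

# a fake "image": nested lists where every pixel stores its own position
image = [[row * width + col for col in range(width)] for row in range(height)]

# flatten row by row into a single vector of 784 values
flat = [pixel for row in image for pixel in row]

print(len(image), len(image[0]))  # 28 28
print(len(flat))                  # 784
# pixel (row=1, col=0) ends up right after the 28 pixels of row 0
print(flat[28] == image[1][0])    # True
```

No pixel values change; only the layout does, which is what allows a linear layer with 784 inputs to consume the image.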

# Optimizer

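We need an optimizer. We will dive into this in later lessons.

For now, it is enough to know this:

Your model makes a prediction. But how does the model know if it is right, or wrong?
And, more specifically: how does the model know which weights it needs to modify in order to improve its predictions?
The loss function answers the first question by expressing how wrong a prediction is as a single number; the optimizer answers the second by using the gradients of that loss to adjust the weights.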
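
As a first intuition for "how wrong is a prediction", the sketch below computes the cross-entropy loss for a single example by hand: a softmax over the raw outputs (logits), followed by the negative log of the probability assigned to the correct class. The numbers are made up for illustration and use three classes instead of ten. `torch.nn.CrossEntropyLoss`, used in the next cell, additionally averages this value over the batch by default.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    # subtract the max for numerical stability; this does not change the result
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target])


logits = [2.0, 0.5, -1.0]  # made-up model output for 3 classes

loss_if_class_0 = cross_entropy(logits, target=0)  # model favours class 0: low loss
loss_if_class_2 = cross_entropy(logits, target=2)  # model is confidently wrong: high loss
print(loss_if_class_0 < loss_if_class_2)  # True
```

The loss is small when the model puts most probability on the correct class and grows quickly when it is confidently wrong.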

In [None]:
import torch.optim as optim
loss_fn = torch.nn.CrossEntropyLoss()

In [None]:
X, y = next(iter(trainstreamer))
model.to(device)
next(model.parameters()).device  # check which device the parameters live on


In [None]:
yhat = model(X.to(device)) # make a prediction

In [None]:
loss_fn(yhat, y.to(device)) # calculate the loss

# Learn the weights

Specify a directory for logs

In [None]:
log_dir=Path("demo").absolute()
log_dir

Get a metric to know how well we are doing

In [None]:
from mltrainer import metrics
accuracy = metrics.Accuracy()

Create settings that tell the trainer:
- what an epoch is (train_steps)
- how many epochs we want to train (epochs)
- what metrics we want to use, both on the train and validation set (metrics)
- how to report (reporttypes) and where to store those reports (logdir)

In [None]:
from mltrainer import TrainerSettings, ReportTypes, Trainer

settings = TrainerSettings(
    epochs=3,
    metrics=[accuracy],
    logdir=log_dir,
    train_steps=len(train),
    valid_steps=len(valid),
    reporttypes=[ReportTypes.TENSORBOARD, ReportTypes.TOML],
)
settings

With the settings in place, we can set up the training loop:
- which model to train
- using which settings
- with a loss function and optimizer for training
- two data streams: one for training, one for validation
- a scheduler to adjust the learning rate when the model stops improving
- the device to train on for hardware acceleration (CPU, GPU, MPS)
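
Conceptually, a training loop repeats the same few steps: predict, compute the loss, compute gradients, and update the weights. The sketch below shows this for a made-up one-weight model `y = w * x` in plain Python. It illustrates the idea under those assumptions and is not how `Trainer` is implemented; in PyTorch the gradient comes from autograd and the update is done by the optimizer.

```python
# made-up data following y = 2 * x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0    # the single weight we want to learn
lr = 0.05  # learning rate
losses = []

for epoch in range(20):
    # 1. predict
    preds = [w * x for x in xs]
    # 2. mean squared error loss
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    losses.append(loss)
    # 3. gradient of the loss with respect to w
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. update the weight in the direction that lowers the loss
    w = w - lr * grad

print(losses[0] > losses[-1])  # True: the loss went down
print(round(w, 3))             # 2.0
```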

In [None]:
trainer = Trainer(
    model=model,
    settings=settings,
    loss_fn=loss_fn,
    optimizer=optim.Adam,
    traindataloader=trainstreamer,
    validdataloader=validstreamer,
    scheduler=optim.lr_scheduler.ReduceLROnPlateau,
    device=device,
    )

Now, let's train the model!

In [None]:
trainer.loop()

Check the `model.toml` and `settings.toml` files that have been created in the `demo/` directory we specified in the settings as `logdir`.

# Save the model

The latest model is available at `trainer.model`; you can also just use the `model` variable, which is the same model.

If you have a look at `settings.earlystop_kwargs`, you can see that `save` is `False` by default. If you change this to `True`, the trainer keeps track of the best model so far and saves it along the way. Because this takes additional time, and in a learning setting like ours we typically don't need to keep the model for later use, we don't need it here.

However, in a real-life setting you probably want the best model!
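
The idea behind early stopping with a `patience` setting can be sketched in a few lines. The `EarlyStopper` class below is hypothetical and written for illustration only; it is not the mltrainer implementation, which may differ in details. It remembers the best validation loss so far and signals a stop once the loss has failed to improve for `patience` consecutive epochs.

```python
class EarlyStopper:
    """Hypothetical helper: stop when the loss stops improving."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.counter = 0

    def step(self, epoch, val_loss):
        """Register a validation loss; return True if training should stop."""
        if val_loss < self.best_loss:
            # improvement: remember it (this is where a checkpoint would be saved)
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


stopper = EarlyStopper(patience=2)
val_losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.5]  # made-up numbers
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)  # 4 2 0.6
```

Note the trade-off a low patience brings: in this made-up run the loop stops at epoch 4 and never sees the better loss at epoch 5.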

In [None]:
settings.earlystop_kwargs

You can save it manually:

In [None]:
modeldir = Path("demo/saved_model")
modelpath = modeldir / "trained_model"
# create the directory if needed, then always save the model
modeldir.mkdir(parents=True, exist_ok=True)
torch.save(model, modelpath)
logger.info(f"Model saved to {modelpath}")

If you had set `save` to `True` in `earlystop_kwargs`, like this:

In [None]:
settings = TrainerSettings(
    epochs=3,
    metrics=[accuracy],
    logdir=log_dir,
    train_steps=len(train),
    valid_steps=len(valid),
    reporttypes=[ReportTypes.TENSORBOARD],
    earlystop_kwargs={'save': True, 'verbose': True, 'patience': 10}
)

The trainer would have saved checkpoints of the last best model. You can obtain the location of the checkpoint with `trainer.early_stopping.path`:

In [None]:
trainer.early_stopping.path

# Load the model

In [None]:
# note: I would expect the loaded model to run on mps, but that doesn't work as expected, so we fall back to cpu
if str(device) == "mps":
    device = "cpu"
print(f"using {device}")
loaded_model = torch.load(modelpath, map_location=device, weights_only=False)

In [None]:
# show that all parameters are on the same device, and use acceleration if available
for param in loaded_model.parameters():
    print(param.device)

Get a batch $X$, $y$ and make a prediction $\hat{y}$

In [None]:
X, y = next(iter(testloader))

In [None]:
yhat = loaded_model(X.to(device))
loss_fn(yhat, y.to(device))

Check the accuracy:
- for every example we have 10 numbers, one per class
- the location with the highest value is the prediction
- we can get the index with `argmax` over dimension 1
- we compare that index with the true label
- summing the matches gives us a count of all the correct predictions
- dividing that count by the number of examples in the batch gives the accuracy as a fraction; multiplying by 100 turns it into a percentage
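
The same steps in plain Python, for a made-up batch of four examples with three classes instead of ten. The values are invented purely to illustrate the argmax-and-compare logic.

```python
# made-up model outputs: one row of scores per example
scores = [
    [0.1, 2.3, -0.5],  # highest score at index 1
    [1.7, 0.2, 0.3],   # highest score at index 0
    [0.0, 0.4, 0.9],   # highest score at index 2
    [0.8, 0.1, 0.5],   # highest score at index 0
]
labels = [1, 0, 1, 0]  # true labels

# argmax per row: the index of the highest score is the predicted class
predictions = [row.index(max(row)) for row in scores]
# count how often prediction and label agree
correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = correct / len(labels)

print(predictions)     # [1, 0, 2, 0]
print(accuracy * 100)  # 75.0
```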

In [None]:
acc = (yhat.argmax(dim=1) == y.to(device)).sum() / len(y)
acc.item() * 100

Note that this is the accuracy for a single batch!
Get another batch by rerunning the `next(iter(testloader))` cell above, and calculate the accuracy again.