# 🏞️ Part 2: Data Exploration
With the basics out of the way, we can now jump into image recognition. Before we begin, we are going to load some useful Python libraries into our environment. You don't need to have a precise understanding of what each library does, but here is a rough summary.

| Library | Description |
| --- | --- |
| `torch` | This is the core `PyTorch` library doing most of the heavy lifting |
| `torchvision` | This is how we will extract the data (if needed), and apply some cleaning to images |
| `random` | This library provides random number generators. This is important if we want to sample images from our dataset | 
| `matplotlib.pyplot` | This is a very popular plotting library |
| `time` | This has functions which allow us to keep track of how long the code takes to run |
| `IPython` | This library allows us to access some of the options of these notebooks |
| `numpy` | This is a very popular data manipulation library |
| `helper` | A user-defined module (see the `./helper` subfolder for more details) of useful functions and classes |

In [None]:
# Import relevant libraries
import torch 
import torchvision
from torch import nn 
from torchvision import transforms
from torch.utils import data
import random
import matplotlib.pyplot as plt
import time
from IPython import display
import numpy as np
from helper import helper

random.seed(2021) # We set a seed to ensure our samples will be the same every time we run the code.
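To see what the seed at the end of the cell above buys us, here is a minimal sketch. The function name `sample_indices` and its parameters are made up for illustration, and it uses a private `random.Random` generator rather than the global `random.seed` call so that it does not disturb the notebook's own random state.

```python
import random

def sample_indices(seed, n_items=60_000, k=5):
    """Draw k distinct indices from range(n_items) using a fixed seed."""
    rng = random.Random(seed)  # private generator, the global random state is untouched
    return rng.sample(range(n_items), k)

first = sample_indices(2021)
second = sample_indices(2021)

print(first == second)  # True: same seed, same draw
print(first)            # the indices themselves depend on the seed
```

Changing the seed would (almost certainly) give a different set of indices, which is why we fix it once at the top of the notebook.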

## ⚗️ The Data Science Pipeline
*This section will be repeated in both Part 2 and Part 3*

> What on earth is data science?! -- George Washington (probably not)

Seriously though, nowadays, in such a data-rich world, data science has become the new buzzword, the new cool kid on the block. But what exactly is it? Unfortunately, no one can really pin down a [rigorous definition](https://hdsr.mitpress.mit.edu/pub/jhy4g6eg/release/7) of data science. At a high level:

> Data science is the systematic extraction of novel insight from data.

Good enough! With this definition, most practitioners can somewhat agree on a pipeline or flow. Here are the steps:
1. Identify your problem (What are you trying to do?)
2. Obtain your data (What resource do we have to work with?)
3. Explore your data (What does our data actually look like?)
4. Prepare your data (How do we clean/wrangle our data to make it ingestible?)
5. Model your data (How do we automate the process of drawing out insights?)
6. Evaluate your model (How good are our predictions?)
7. Deploy your model (How can the wider-user base access these insights?)

The 7th step is out of scope for this workshop, but we will be exploring the other steps to varying degrees:
* Steps 1-4 will be explored in Part 2.
* Steps 5-6 will be explored in Part 3 and Part 4.


## Step 1: Identify Your Problem 
![cc](../images/confused_cat.jpg)

**Figure:** A day-to-day snapshot of a data scientist at work. ([source](https://s.keepmeme.com/files/en_posts/20200925/confused-cat-looking-at-computer-with-a-lot-of-question-marks-meme-861f3efff59aedea603e35b8c3c059f0.jpg))

### The problem:
* Your boss comes up to you and gives you a stack of unlabelled black & white photos
* You get told you need to identify what item of clothing each photo represents (eg. t-shirt)
* You get a stack of 70,000 labelled pictures to give you an idea of the task

*What do you do?!* Sure, you can label them by hand if there are only 100 or 1000 unlabelled images. But what if there are 1,000,000? This manual labelling is not tenable in the long term.

Why use machine learning to automate image recognition?
* **Scalable** -- provided you have a reasonable model and enough computational resources, getting a computer to label images is ***a lot*** easier.
* **Consistent** -- the output of the model is going to be more consistent than any crack team of labelers you can assemble (we are but humans after all).

## Step 2: Obtain Your Data
Alright, you decided you probably won't manually label \* phew \*. What next? Lucky for you, your boss provided you with some initial information:
* There are 70,000 labelled images.
* Each image is an item of clothing.
* Each image is a 28x28 sized image (784 pixels in total).
* Each pixel is black and white, and has a value between 0 and 255 indicating the brightness of the pixel
* There are a total of 10 types of items of clothing.

| Label | Type of Clothing |
| --- | --- |
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat | 
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle Boot |
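Since the data stores each label as a number, a small lookup is handy whenever we want to read a label as a word. This is only a sketch: the names `FASHION_MNIST_CLASSES` and `label_to_name` are our own and are not part of `torchvision` or the `helper` module.

```python
# Class names copied from the table above; the list index is the numeric label.
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle Boot",
]

def label_to_name(label):
    """Translate a numeric label (0-9) into its clothing type."""
    return FASHION_MNIST_CLASSES[int(label)]

print(label_to_name(9))  # Ankle Boot
```

The `int(...)` call means the function should also accept a label that arrives as a single-element tensor rather than a plain Python integer.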

For example, the following is an image of a boot:

![Boot](../images/boot.png)

In fact, the dataset we are working with is called the [Fashion MNIST](https://www.kaggle.com/zalando-research/fashionmnist). In the cell below, we will try to extract the data and split them into the train and test sets (`train_iter` and `test_iter` respectively).

***Note:*** These are black and white photos, but they are rendered with a greyscale palette because I forgot to change it at the start and now it's too late. That's a window into the problem of legacy code my friends! 

---
## <font color='#F89536'> **Discussion:** </font> 

Why do we need to split the labelled data into train/test sets? (Don't worry, we will go through this in Step 4).

---

---
We will discuss the concept of **batch size** a bit more in Part 4. For now, let's just take the default values.

⚠️⚠️⚠️ If you clicked ![Binder](https://mybinder.org/badge_logo.svg) to get into the notebook, safely ignore the following. 

If you have opted to run this with your own Anaconda installation, we recommend the following parameters:
* `batch_size = 256`
* `n_workers = 4`

⚠️⚠️⚠️

---

In [None]:
# First define the function without running it
def load_data_fashion_mnist(batch_size, n_workers):
    """Download the Fashion-MNIST dataset and then load it into memory."""
    trans = [transforms.ToTensor()]
    trans = transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root="../data",
                                                    train=True,
                                                    transform=trans,
                                                    download=True)
    mnist_test = torchvision.datasets.FashionMNIST(root="../data",
                                                   train=False,
                                                   transform=trans,
                                                   download=True)
    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
                            num_workers=n_workers),
            data.DataLoader(mnist_test, batch_size, shuffle=False,
                            num_workers=n_workers))

# Then execute the function here
batch_size = 512  # Set to 256 on your own device
n_workers = 0      # Set to 4 on your own device
train_iter, test_iter = load_data_fashion_mnist(batch_size=batch_size, n_workers = n_workers)

(Sometimes you may get a warning when running this cell. This can be safely ignored).
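To get a feel for what the batch size does, here is the arithmetic as a rough sketch. It assumes the 60,000 training images mentioned in Step 4 and `batch_size = 512`, and that the final, smaller mini-batch is kept (the `DataLoader` default). The generator `batch_sizes` is a made-up helper, not something `PyTorch` provides.

```python
import math

def batch_sizes(n_examples, batch_size):
    """Yield the size of each mini-batch when n_examples are split in order."""
    for start in range(0, n_examples, batch_size):
        yield min(batch_size, n_examples - start)

sizes = list(batch_sizes(60_000, 512))

print(len(sizes))                  # 118 mini-batches
print(math.ceil(60_000 / 512))     # 118, the same count by formula
print(sizes[-1])                   # 96 images in the final mini-batch
```

So not every mini-batch is guaranteed to hold a full `batch_size` images: the last one only gets whatever is left over.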

#### 🎉🎉🎉 Congratulations! You have just read in your train and test data! 🎉🎉🎉

The format of `train_iter` and `test_iter` is a bit strange (they are what's called an iterable object -- we will discuss this in detail in the appendix if you are interested), so we have written a function to extract a single example. There is no need to fully understand each step, but there are comments if you wish to.

In [None]:
def get_single_sample(data):
    """ Extract a single example from train_iter or test_iter"""
    
    # First we extract a few random batches (say 3), each of which will have up to your set batch_size (eg. 512 as set above)
    sampled_batches = random.sample(list(data), 3)
    print("The number of mini-batches we have extracted is:", len(sampled_batches))

    # Second we select a single batch to look at, let's say the 3rd one
    batch_no = 2
    ## 0 denotes the predictors
    ## 1 denotes the labels
    predictor = sampled_batches[batch_no][0]
    label = sampled_batches[batch_no][1]
    print("Out of the", len(sampled_batches), "mini-batches, we have selected number", batch_no + 1)
    print("The number of images in the mini-batch we selected is:", len(predictor), "and the number of labels is:", len(label), "(these two values should be equal).")

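    # Third, we select a single example in the batch, let's say the 100th one
    # (the final mini-batch can be smaller than batch_size, so we clamp the index to stay in range)
    example_no = min(99, len(predictor) - 1)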
    single_predictor = predictor[example_no]
    single_label = label[example_no]
    return (single_predictor, single_label)

In [None]:
single_predictor, single_label = get_single_sample(train_iter)

print("The shape of the predictor", single_predictor.shape)
print("The shape of the label", single_label.shape)

The `.shape` attribute shows us the **dimensions** of our arrays. From the single example we have extracted:
* The predictor has shape `[1, 28, 28]`: this is a 3-dimensional tensor
    * The `1` represents the number of channels (black and white images have just 1; colour images have 3, one each for red, green and blue).
    * The `28` and `28` represent the dimensions of the image (28x28)
* The label has shape `[]`: this is a scalar and represents the type of clothing (the target label we are trying to predict)

***Note:*** The dimensions put a *limit* on the indices you can use. Consider the `[1,28,28]` example. Since there's only one channel, you can only ever use the index `0` for the first dimension. Likewise, you can go from `0` to `27` in the 2nd and 3rd dimensions.
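The index limits in the note can be tried out with a small sketch. To keep it runnable without `PyTorch`, it uses plain nested lists as a stand-in for a tensor, and `shape_of` is a made-up helper. A real tensor would be indexed as `t[0, 27, 27]` rather than `t[0][27][27]`, but the limits work the same way.

```python
def shape_of(nested):
    """Return the dimensions of a regular, non-empty nested list, outermost first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims

# One channel, 28 rows, 28 columns, every pixel set to 0.0
image = [[[0.0 for _ in range(28)] for _ in range(28)] for _ in range(1)]

print(shape_of(image))    # [1, 28, 28]
print(image[0][27][27])   # 0.0, the last valid pixel

try:
    image[1][0][0]        # there is no second channel, so index 1 is out of range
except IndexError as err:
    print("IndexError:", err)
```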

Now let's try to visualise it!

In [None]:
plt.imshow(single_predictor[0]) # print only the one channel of BW
plt.show()

print("This shows an image of: ", int(single_label))

---
## <font color='#F89536'> **Discussion:** </font> 
Does the image match the label? (See table above).

---

You may get the following error if you run the above cell to show the image:

```
"OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/."
```

If so, this may be a fix. Uncomment the cell below and run it. Caution: There may be [side-effects](https://github.com/dmlc/xgboost/issues/1715) to this fix.

In [None]:
# import os
# os.environ['KMP_DUPLICATE_LIB_OK']='True'

## Step 3: Explore Your Data
Often, you are not the one who has collected the data. Before doing any fancy modelling, the most important thing is to *understand* our data. Here are some questions we can ask ourselves:
* What is the format of the predictors we have?
* What is the format of the predictors we need for our model?
* How many output classes do we have? Or in the case of regression, what is the distribution of the output label?  
etc.

Here we will focus on a single question: What is the **class distribution** of the train/test sets?
* The class distribution is just how many examples of each class we have -- for example, are there more boots than there are t-shirts?
* The class distribution in the train and test sets should ideally be **similar** to ensure we are training and evaluating on representative subsets of the data
* The class distribution in the labelled data should ideally be **similar** to any unseen examples (for example, if there are no boots in the training data, and we encounter a boot for the first time in unseen data, the model will perform very poorly on boots)
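One way to compare class distributions is to turn the counts into proportions and look at the largest gap between the two sets. The sketch below uses tiny made-up label lists purely for illustration (they are not the real Fashion MNIST labels), and `class_proportions` is a hypothetical helper.

```python
from collections import Counter

def class_proportions(labels, n_classes=10):
    """Return the fraction of examples belonging to each class, 0..n_classes-1."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(n_classes)]

# Made-up labels: 6 examples per class in "train", 1 per class in "test"
train_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 6
test_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

train_props = class_proportions(train_labels)
test_props = class_proportions(test_labels)

# The largest per-class difference; a value near 0 means the distributions match
largest_gap = max(abs(a - b) for a, b in zip(train_props, test_props))
print(largest_gap)

# A skewed set for contrast: nine t-shirts and one ankle boot
skewed_props = class_proportions([0] * 9 + [9])
print(skewed_props[0], skewed_props[9])
```

Comparing proportions rather than raw counts matters here because the train and test sets have different sizes.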

---
## <font color='#F89536'> **Discussion:** </font> 
Why do we prefer the train and test set to have similar class distributions?

---

In [None]:
def data_explore(data_iter):
    class_count = helper.Accumulator(10)
    for X, y in data_iter:
        # minlength=10 guarantees one count per class, even if a batch happens to be missing some classes
        current_counter = torch.bincount(y, minlength=10)
        # Unpack the 10 per-class counts into the accumulator
        class_count.add(*current_counter)
    for i in range(10):
        print("Class", i, "has", int(class_count[i]), "images")
    # Note sometimes we can omit the return clause: https://stackoverflow.com/questions/13307158/most-pythonic-way-of-function-with-no-return
        
print("Train Data:")
data_explore(train_iter)
print("Test Data:")
data_explore(test_iter)

![](../images/thanos.jpg)  
([source](https://i.kym-cdn.com/entries/icons/original/000/027/257/perfectly-balanced-as-all-things-should-be.jpg))

In reality, your data won't be this perfect, so it's always worth checking and understanding class balances!

## Step 4: Prepare Your Data
We alluded to this earlier: we have 70,000 labelled examples. Perfect! Do we throw them all into the training/fitting process of the model?

The answer is **NO**. 

---
## <font color='#F89536'> **Discussion:** </font> 
Why is that though? Doesn't more data = better model?

---

![yesbutno](../images/yesbutno.jpg)

The reason lies in a concept known as **overfitting**: a model can fit the data it was trained on very closely and still perform poorly on new data. Remember, our goal is to make sure the model works well on hitherto unseen data (ie. unlabelled data). If we just throw all the data we have into the training process, then we won't have access to an independent dataset to evaluate the model. Later on, we will see how the test data can be used to assess overfitting. For now, just remember that it is vital that we have a (representative) subset of the labelled data set aside for evaluation. For the purposes of this example, 60,000 images will be used to train and 10,000 images will be used to test:
* `Train` -- this is the data we use to fit the model (n=60,000)
* `Validate` -- this is the data used to tune hyperparameters (we will not be using a validate set today)
* `Test` -- this is the data we use to determine the model performance (n=10,000)
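The idea of setting data aside can be sketched as a shuffle followed by slicing. Note this is an illustration only: Fashion MNIST already ships with a fixed train/test split, so the loader earlier in this notebook does not do this. The function `train_validate_test_split` and its parameters are made up for the example.

```python
import random

def train_validate_test_split(examples, n_validate, n_test, seed=2021):
    """Shuffle the examples, then carve off the test and validate sets."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(examples)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    validate = shuffled[n_test:n_test + n_validate]
    train = shuffled[n_test + n_validate:]
    return train, validate, test

# Stand-in for 70,000 labelled images: we just use their index numbers
examples = list(range(70_000))
train, validate, test = train_validate_test_split(examples, n_validate=0, n_test=10_000)

print(len(train), len(validate), len(test))  # 60000 0 10000
```

Each example lands in exactly one of the three sets, which is what keeps the test set independent of the training data.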

**Hyperparameter** = a parameter the user (you!) chooses beforehand (as opposed to parameters the model learns from the data, such as the weights of a neural network)

e.g. the batch size, the number of layers in a neural network, the width of each layer etc.

*How do you choose?*  
The *art* of choosing the correct hyperparameters is called **hyperparameter tuning** and a detailed discussion is beyond the scope of this workshop.

The simplest implementation of tuning is **grid search**, and it's predicated on the idea that you have a limited list of hyperparameter values to *try*. For example, let's say we are deciding how many layers to have in our neural network. We can train on a few 'settings' (eg. `n_hidden_layers = {1,2,3,4,5}`). By evaluating these models against the `Validate` set, we can find the hyperparameter which gives the optimal result.

| `n_hidden_layers` | Validate Accuracy |
| --- | --- |
| 1 | 0.645 | 
| 2 | 0.823 | 
| 3 | 0.855 | 
| 4 | 0.899 | 
| 5 | 0.878 |

If we get these results, we choose `n_hidden_layers = 4`. In reality we might have many of these parameters which make up a 'grid', and we 'search' through this parameter space until we find one with the best accuracy on the validate set.
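The search itself is just a loop, as the rough sketch below shows. To keep it short, `evaluate_on_validate` is a hypothetical stand-in that looks up the accuracies from the table above; a real version would train a model with that setting and score it on the validate set.

```python
# Validate accuracies copied from the table above, standing in for real training runs
VALIDATE_ACCURACY = {1: 0.645, 2: 0.823, 3: 0.855, 4: 0.899, 5: 0.878}

def evaluate_on_validate(n_hidden_layers):
    """Hypothetical stand-in for 'train a model, then score it on the validate set'."""
    return VALIDATE_ACCURACY[n_hidden_layers]

def grid_search(candidates, evaluate):
    """Try every candidate and return (best_candidate, best_score)."""
    best_candidate, best_score = None, float("-inf")
    for candidate in candidates:
        score = evaluate(candidate)
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score

best, score = grid_search([1, 2, 3, 4, 5], evaluate_on_validate)
print(best, score)  # 4 0.899
```

With several hyperparameters, the candidates would be every combination of values (for instance generated with `itertools.product`), which is where the 'grid' in the name comes from.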

***Note:*** The validate set has to be separate from the test set as overzealous parameter tuning can overfit the model. So if we had used the test set to tune hyperparameters, we may never truly know if our model has overfit (until our boss comes back to us with all the bad performance reports!). The test set should ONLY ever be used for evaluating the final accuracy of the model and no more.

---
## <font color=Red> **[BONUS] Discussion:** </font> 
What are some shortcomings of grid search?

---