Deep learning with Python and PyTorch.

Data - the actual input to a neural network

    acquire data

    preprocess data

    how to iterate over your data

This consumes roughly 90% of your time and energy when working on a model.
    Training can take a long time, but it requires no work from the programmer.

To begin, we will use a toy data set.

In [1]:
pip install torchvision

Note: you may need to restart the kernel to use updated packages.


In [1]:
import torch
import torchvision
from torchvision import transforms, datasets

torchvision is a package of datasets, model architectures, and image transforms for computer vision. MNIST is a dataset of 28x28 grayscale images of handwritten digits 0-9, and it is what we will be using for this tutorial. torchvision can download it for free.

Vision is a good task to benchmark on, and it is the area most people are working in.

    It attracts the most business interest and offers the most low-hanging fruit.

Using torchvision's ready-made vision data is basically cheating. In real deep learning projects, most of your time is going to be spent on getting data, preparing it, and formatting it to work with a neural network.

In the future we can introduce original data sets.

Let's define two different data sets, typically called training and testing. They must be kept separate in order to validate the model: you have to test on data that the model has never seen before. This is how you detect overfitting to the training data and gain confidence that the model will perform well in production.
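MNIST already ships with a separate training and testing set, but for your own data you have to make the split yourself. The sketch below shows one way to do that with torch.utils.data.random_split. The toy tensors, the 80/20 ratio, and the seed are assumptions chosen for illustration only.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical toy data: 100 samples with 4 features each, and binary labels.
features = torch.randn(100, 4)
labels = torch.randint(0, 2, (100,))
dataset = TensorDataset(features, labels)

# Hold out 20 of the 100 samples; the seeded generator makes the split repeatable.
generator = torch.Generator().manual_seed(0)
train_split, test_split = random_split(dataset, [80, 20], generator=generator)

# No sample index appears in both subsets.
overlap = set(train_split.indices) & set(test_split.indices)
print(len(train_split), len(test_split), len(overlap))  # 80 20 0
```

The model should only ever be fitted on train_split; test_split is kept aside for the final check.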


In [2]:
train = datasets.MNIST("", train=True, download=True,
                      transform = transforms.Compose([transforms.ToTensor()]))

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to MNIST\raw\train-images-idx3-ubyte.gz


Extracting MNIST\raw\train-images-idx3-ubyte.gz to MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to MNIST\raw\train-labels-idx1-ubyte.gz


Extracting MNIST\raw\train-labels-idx1-ubyte.gz to MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to MNIST\raw\t10k-images-idx3-ubyte.gz


Extracting MNIST\raw\t10k-images-idx3-ubyte.gz to MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to MNIST\raw\t10k-labels-idx1-ubyte.gz


Extracting MNIST\raw\t10k-labels-idx1-ubyte.gz to MNIST\raw
Processing...
Done!


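MNIST parameters:

    The first parameter is the root directory where the data is stored locally; "" means the current working directory.
    train selects the training set (True) or the testing set (False), and download=True fetches the files if they are not already there.
    The transform is needed because the raw data is not natively a tensor; transforms.ToTensor() converts each image to one.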
    
You can write your own dataset class and use this same syntax. Once you do, it becomes obvious how tedious loading and iterating over a data set by hand can be.
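As a rough idea of what writing your own dataset involves, the sketch below subclasses torch.utils.data.Dataset. The class name SquaresDataset and its contents are hypothetical; a real dataset would read files or records inside __getitem__ instead of computing numbers.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical dataset: sample i is the pair (i, i squared)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of samples available.
        return self.size

    def __getitem__(self, index):
        # Return one (input, label) pair as tensors.
        x = torch.tensor([float(index)])
        y = torch.tensor([float(index ** 2)])
        return x, y


squares = SquaresDataset(5)
x, y = squares[3]
print(len(squares), x.item(), y.item())  # 5 3.0 9.0
```

Any object with these two methods can be handed to a DataLoader in the same way as the MNIST datasets.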

In [7]:
test = datasets.MNIST("",train=False, download=True,
                     transform = transforms.Compose([transforms.ToTensor()]))

Batch size is how many samples are passed to the model at one time. The upper limit depends on how much memory the system has available.

Deep learning starts to shine when you have millions and millions of samples in your data. At some point, there will be more samples than you can fit in memory. This is the first reason you want to control the batch size. The model is optimized in increments, one batch at a time (10 samples per batch in this tutorial); each increment is basically a gradient descent step. Most people pick a multiple of 8 for their batch size. There is no strict requirement for this, but everyone tends to do it. Like the number of neurons per layer, the best value is found by trial and error.
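A small sketch of how a batch size carves up a data set. The 25 toy samples are an assumption chosen so the last batch comes out smaller than the rest.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 25 samples with one feature each, plus dummy labels.
samples = torch.arange(25, dtype=torch.float32).unsqueeze(1)
targets = torch.zeros(25, dtype=torch.long)
loader = DataLoader(TensorDataset(samples, targets), batch_size=10)

# Each iteration yields one batch; only that batch has to fit in memory at once.
batch_sizes = [batch_x.shape[0] for batch_x, batch_y in loader]
print(batch_sizes)  # [10, 10, 5]
```

The final batch holds the 5 leftover samples; DataLoader also has a drop_last option if a partial batch is unwanted.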

The second reason is that we hope the model will generalize. If you pass your entire data set at once as the model starts to optimize its weights, the machine might learn some general patterns, but it will also settle on some arbitrary weights that only fit that data. With batches, an optimization that keeps being reinforced from one batch to the next is more likely to be a genuine generalization, while optimizations that only show up occasionally are more likely to be overfitting.
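The idea of optimizing in increments, one batch at a time, can be sketched with a tiny training loop. Everything here is an assumption for illustration: the toy data (y = 2x + 1), the single linear layer, the learning rate, and the number of passes over the data. It is not the model this tutorial builds.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy regression data: y = 2x + 1 with no noise.
inputs = torch.randn(100, 1)
targets = 2 * inputs + 1
loader = DataLoader(TensorDataset(inputs, targets), batch_size=10, shuffle=True)

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

with torch.no_grad():
    initial_loss = loss_fn(model(inputs), targets).item()

for epoch in range(5):
    for batch_x, batch_y in loader:
        optimizer.zero_grad()                    # clear gradients from the previous batch
        loss = loss_fn(model(batch_x), batch_y)  # loss on these 10 samples only
        loss.backward()                          # gradients for this batch
        optimizer.step()                         # one weight update per batch

with torch.no_grad():
    final_loss = loss_fn(model(inputs), targets).item()

print(initial_loss > final_loss)
```

Each pass over the 100 samples makes 10 small weight updates instead of one large one.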

There is usually a good batch size somewhere between 8 and 64, regardless of how much memory the system has. Sometimes you can use a bigger batch size, and a bigger size affects how quickly you can train through your data.

The shuffle parameter controls whether the samples are shuffled. The purpose is to do everything we can to help the neural network learn general principles rather than the specifics of the training data. A neural network will always take the quickest route to a minimal loss, so the programmer needs to guard against overfitting.
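A quick sketch of what shuffling changes and what it leaves alone. The 20 toy values and the seed are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: the integers 0..19, so the ordering is easy to inspect.
values = torch.arange(20)
dataset = TensorDataset(values)

generator = torch.Generator().manual_seed(0)
loader = DataLoader(dataset, batch_size=5, shuffle=True, generator=generator)

# Two full passes over the data; the order is reshuffled at the start of each pass.
epoch_one = torch.cat([batch[0] for batch in loader]).tolist()
epoch_two = torch.cat([batch[0] for batch in loader]).tolist()

print(sorted(epoch_one) == list(range(20)))  # every sample is still visited exactly once
print(epoch_one != epoch_two)                # but in a different order on each pass
```

Because the order changes on every pass, the network cannot rely on samples arriving in a fixed sequence.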


In [8]:
trainset = torch.utils.data.DataLoader(train,batch_size=10,shuffle=True)
testset = torch.utils.data.DataLoader(test,batch_size=10,shuffle=True)
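Each batch from such a loader is a pair: a tensor of images and a tensor of labels. The sketch below uses random tensors as a hypothetical stand-in for MNIST so that it runs without a download; the names fake_images, fake_labels, and fake_loader are made up for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for MNIST: 30 random 1x28x28 "images" with labels 0-9.
fake_images = torch.rand(30, 1, 28, 28)
fake_labels = torch.randint(0, 10, (30,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels),
                         batch_size=10, shuffle=True)

# Pull one batch and unpack it into inputs and labels.
images, labels = next(iter(fake_loader))
print(images.shape)  # torch.Size([10, 1, 28, 28])
print(labels.shape)  # torch.Size([10])
```

The leading dimension is the batch size; the remaining dimensions of images are channel, height, and width.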

Let's iterate over this data.

In [9]:
for data in trainset:
    print(data)
    break

[tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]],


        [[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]],


        [[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]],


        ...,


        [[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0