# Training a ConvNet in PyTorch

In this notebook, you'll learn how to use the powerful PyTorch framework to specify a conv net architecture and train it on the human action recognition dataset. 


In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data import DataLoader,sampler,Dataset
import torchvision.datasets as dset
import torchvision.transforms as T
import timeit
from PIL import Image
import os
import numpy as np
import scipy.io
from torchsummary import summary

## What's this PyTorch business?

* When using a framework like PyTorch or TensorFlow you can harness the power of the GPU for your own custom neural network architectures without having to write CUDA code directly.
* This notebook will walk you through much of what you need to do to train models using PyTorch. If you want to learn more or need further clarification on topics that aren't fully explained here, here are two good PyTorch tutorials: (1) http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html (2) http://pytorch.org/tutorials/beginner/pytorch_with_examples.html
* You don't need a GPU for this homework, but using one can make your code run faster.


## Load Datasets

In this part, we will load the action recognition dataset for the neural network. To load data from our custom dataset, we need to write a custom Dataset class and wrap it in a DataLoader. If you put hw6_data.mat and the valClips, trainClips, and testClips folders under ./data/, you do not need to change anything in this part.

First, load the labels of the dataset. If your hw6_data.mat file is somewhere else, change the path below.

In [2]:
label_mat=scipy.io.loadmat('./data/hw6_data.mat')
label_train=label_mat['trLb']
print(len(label_train))
label_val=label_mat['valLb']
print(len(label_val))

7770
2230


### Dataset class

torch.utils.data.Dataset is an abstract class representing a dataset. The custom dataset should inherit Dataset and override the following methods:

    __len__ so that len(dataset) returns the size of the dataset.
    __getitem__ to support indexing, so that dataset[i] returns the i-th sample

Let's create a dataset class for our action recognition dataset. We will read images in __getitem__. This is memory efficient because the images are not all stored in memory at once but read as required.

Each sample of our dataset will be a dict {'image':image,'img_path':img_path,'Label':Label}. Our dataset will take an optional argument transform so that any required processing can be applied to the sample.

In [3]:

class ActionDataset(Dataset):
    """Action dataset."""

    def __init__(self,  root_dir,labels=[], transform=None):
        """
        Args:
            root_dir (string): Directory with all the images.
            labels (list): labels of the images.
            transform (callable, optional): Optional transform to be applied on a sample.
        """
        self.root_dir = root_dir
        self.transform = transform
        self.length=len(os.listdir(self.root_dir))
        self.labels=labels
    def __len__(self):
        return self.length*3

    def __getitem__(self, idx):
        
        folder=idx//3+1
        imidx=idx%3+1
        folder=format(folder,'05d')
        imgname=str(imidx)+'.jpg'
        img_path = os.path.join(self.root_dir,
                                folder,imgname)
        image = Image.open(img_path)
        if len(self.labels)!=0:
            Label=self.labels[idx//3][0]-1
        if self.transform:
            image = self.transform(image)
        if len(self.labels)!=0:
            sample={'image':image,'img_path':img_path,'Label':Label}
        else:
            sample={'image':image,'img_path':img_path}
        return sample
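
To make the index arithmetic in __getitem__ concrete: each clip folder holds three frames, so a flat index is split into a clip number and a frame number. The helper below is a standalone sketch of that mapping; the name locate_frame and its frames_per_clip parameter are illustrative and not part of the assignment code.

```python
def locate_frame(idx, frames_per_clip=3):
    """Map a flat dataset index to (folder name, image file name)."""
    clip_number = idx // frames_per_clip + 1    # folders are numbered from 00001
    frame_number = idx % frames_per_clip + 1    # frames are 1.jpg, 2.jpg, 3.jpg
    return format(clip_number, '05d'), str(frame_number) + '.jpg'

print(locate_frame(0))   # ('00001', '1.jpg')
print(locate_frame(9))   # ('00004', '1.jpg')
```

This is why the dataset length is three times the number of clip folders, and why the label lookup uses the clip number rather than the raw index.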
  

Iterate over the dataset with a for loop.

In [4]:
image_dataset=ActionDataset(root_dir='./data/trainClips/',\
                            labels=label_train,transform=T.ToTensor())

#iterating through the dataset
for i in range(10):
    sample=image_dataset[i]
    print(sample['image'].shape)
    print(sample['Label'])
    print(sample['img_path'])
     
   

torch.Size([3, 64, 64])
0.0
./data/trainClips/00001/1.jpg
torch.Size([3, 64, 64])
0.0
./data/trainClips/00001/2.jpg
torch.Size([3, 64, 64])
0.0
./data/trainClips/00001/3.jpg
torch.Size([3, 64, 64])
0.0
./data/trainClips/00002/1.jpg
torch.Size([3, 64, 64])
0.0
./data/trainClips/00002/2.jpg
torch.Size([3, 64, 64])
0.0
./data/trainClips/00002/3.jpg
torch.Size([3, 64, 64])
0.0
./data/trainClips/00003/1.jpg
torch.Size([3, 64, 64])
0.0
./data/trainClips/00003/2.jpg
torch.Size([3, 64, 64])
0.0
./data/trainClips/00003/3.jpg
torch.Size([3, 64, 64])
0.0
./data/trainClips/00004/1.jpg


We can iterate over the created dataset with a 'for' loop as before. However, we are losing a lot of features by using a simple for loop to iterate over the data. In particular, we are missing out on:

* Batching the data
* Shuffling the data
* Loading the data in parallel using multiprocessing workers

torch.utils.data.DataLoader is an iterator which provides all these features. 

In [5]:
image_dataloader = DataLoader(image_dataset, batch_size=4,
                        shuffle=True, num_workers=4)


for i,sample in enumerate(image_dataloader):
    print(i,sample['image'].shape,sample['img_path'],sample['Label'])
    if i>20: 
        break

(0, torch.Size([4, 3, 64, 64]), ['./data/trainClips/04572/3.jpg', './data/trainClips/05974/1.jpg', './data/trainClips/05135/3.jpg', './data/trainClips/02513/2.jpg'], tensor([5., 7., 6., 2.], dtype=torch.float64))
(1, torch.Size([4, 3, 64, 64]), ['./data/trainClips/03542/2.jpg', './data/trainClips/07541/1.jpg', './data/trainClips/05055/3.jpg', './data/trainClips/07433/3.jpg'], tensor([4., 9., 6., 9.], dtype=torch.float64))
(2, torch.Size([4, 3, 64, 64]), ['./data/trainClips/05161/1.jpg', './data/trainClips/04730/1.jpg', './data/trainClips/02546/1.jpg', './data/trainClips/02503/1.jpg'], tensor([6., 5., 2., 2.], dtype=torch.float64))
(3, torch.Size([4, 3, 64, 64]), ['./data/trainClips/03306/1.jpg', './data/trainClips/03569/3.jpg', './data/trainClips/06700/1.jpg', './data/trainClips/06639/2.jpg'], tensor([3., 4., 8., 8.], dtype=torch.float64))
(4, torch.Size([4, 3, 64, 64]), ['./data/trainClips/03515/2.jpg', './data/trainClips/05719/3.jpg', './data/trainClips/02484/1.jpg', './data/trainCli
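
The batching behaviour can be checked without the image files. The sketch below uses a hypothetical in-memory ToyDataset (random tensors and made-up paths) in place of ActionDataset; it shows how the default collation stacks tensors along a new batch dimension and gathers strings into a list. num_workers=0 keeps everything in the main process.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in for ActionDataset: random images, made-up paths."""
    def __init__(self, n=10):
        self.images = torch.randn(n, 3, 64, 64)
        self.labels = torch.arange(n) % 10

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return {'image': self.images[idx],
                'img_path': 'toy/%05d.jpg' % idx,
                'Label': self.labels[idx]}

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False, num_workers=0)
batch = next(iter(loader))
print(batch['image'].shape)   # torch.Size([4, 3, 64, 64])
print(batch['Label'])         # the four labels stacked into one tensor
print(batch['img_path'])      # a plain Python list of four strings
```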

Dataloaders for the training, validation, and testing sets.

In [6]:
image_dataset_train=ActionDataset(root_dir='./data/trainClips/',labels=label_train,transform=T.ToTensor())

image_dataloader_train = DataLoader(image_dataset_train, batch_size=32,
                        shuffle=True, num_workers=4)
image_dataset_val=ActionDataset(root_dir='./data/valClips/',labels=label_val,transform=T.ToTensor())

image_dataloader_val = DataLoader(image_dataset_val, batch_size=32,
                        shuffle=False, num_workers=4)
image_dataset_test=ActionDataset(root_dir='./data/testClips/',labels=[],transform=T.ToTensor())

image_dataloader_test = DataLoader(image_dataset_test, batch_size=32,
                        shuffle=False, num_workers=4)

In [7]:
dtype = torch.FloatTensor # the CPU datatype
# Constant to control how frequently we print train loss
print_every = 100
# This is a little utility that we'll use to reset the model
# if we want to re-initialize all our parameters
def reset(m):
    if hasattr(m, 'reset_parameters'):
        m.reset_parameters()

## Example Model

### Some assorted tidbits

Let's start by looking at a simple model. First, note that PyTorch operates on Tensors, which are n-dimensional arrays functionally analogous to numpy's ndarrays, with the additional feature that they can be used for computations on GPUs.

We'll provide you with a Flatten function, which we explain here. Remember that our image data (and more relevantly, our intermediate feature maps) are initially N x C x H x W, where:
* N is the number of datapoints
* C is the number of image channels. 
* H is the height of the intermediate feature map in pixels
* W is the width of the intermediate feature map in pixels

This is the right way to represent the data when we are doing something like a 2D convolution, that needs spatial understanding of where the intermediate features are relative to each other. When we input  data into fully connected affine layers, however, we want each datapoint to be represented by a single vector -- it's no longer useful to segregate the different channels, rows, and columns of the data. So, we use a "Flatten" operation to collapse the C x H x W values per representation into a single long vector. The Flatten function below first reads in the N, C, H, and W values from a given batch of data, and then returns a "view" of that data. "View" is analogous to numpy's "reshape" method: it reshapes x's dimensions to be N x ??, where ?? is allowed to be anything (in this case, it will be C x H x W, but we don't need to specify that explicitly). 

In [8]:
class Flatten(nn.Module):
    def forward(self, x):
        N, C, H, W = x.size() # read in N, C, H, W
        return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image
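
A quick check of what view does, on a small tensor with known contents. Recent PyTorch releases also ship a built-in nn.Flatten module, which the last line compares against; whether it is available depends on your installed version.

```python
import torch
import torch.nn as nn

# N=2 images, C=3 channels, H=4, W=4
x = torch.arange(2 * 3 * 4 * 4, dtype=torch.float32).view(2, 3, 4, 4)
flat = x.view(x.size(0), -1)   # keep N, collapse C*H*W into one axis

print(flat.shape)                          # torch.Size([2, 48])
print(flat.data_ptr() == x.data_ptr())     # True: a view shares memory with x
print(torch.equal(flat, nn.Flatten()(x)))  # True
```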

### The example model itself

The first step to training your own model is defining its architecture.

Here's an example of a convolutional neural network defined in PyTorch -- try to understand what each line is doing, remembering that each layer is composed upon the previous layer. We haven't trained anything yet - that'll come next - for now, we want you to understand how everything gets set up. nn.Sequential is a container which applies each layer one after the other.

In this example, you see 2D convolutional layers (Conv2d), ReLU activations, and fully-connected layers (Linear). You also see the Cross-Entropy loss function, and the Adam optimizer being used. 

Make sure you understand why the parameters of the Linear layer are 10092 and 10.


In [9]:
# Here's where we define the architecture of the model... 
simple_model = nn.Sequential(
                nn.Conv2d(3, 12, kernel_size=7, stride=2), # 12 filters, 29 x 29 output: 12 * 29 * 29 = 10092
                nn.ReLU(inplace=True),
                Flatten(), # see above for explanation
                nn.Linear(10092, 10), # affine layer
              )

# Set the type of all data in this model to be FloatTensor 
simple_model.type(dtype)

loss_fn = nn.CrossEntropyLoss().type(dtype)
optimizer = optim.Adam(simple_model.parameters(), lr=1e-2) # lr sets the learning rate of the optimizer

PyTorch supports many other layer types, loss functions, and optimizers - you will experiment with these next. Here's the official API documentation for these (if any of the parameters used above were unclear, this resource will also be helpful). 

* Layers: http://pytorch.org/docs/nn.html
* Activations: http://pytorch.org/docs/nn.html#non-linear-activations
* Loss functions: http://pytorch.org/docs/nn.html#loss-functions
* Optimizers: http://pytorch.org/docs/optim.html#algorithms

## Training a specific model

In this section, we're going to specify a model for you to construct. The goal here isn't to get good performance (that'll be next), but instead to get comfortable with understanding the PyTorch documentation and configuring your own model. 

Using the code provided above as guidance, and using the following PyTorch documentation, specify a model with the following architecture:

* 7x7 Convolutional Layer with 8 filters and stride of 1
* ReLU Activation Layer
* 2x2 Max Pooling layer with a stride of 2
* 7x7 Convolutional Layer with 16 filters and stride of 1
* ReLU Activation Layer
* 2x2 Max Pooling layer with a stride of 2
* Flatten the feature map
* ReLU Activation Layer
* Affine layer to map the flattened features to 10 outputs (you need to figure out the input size here)
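
The input size of the final affine layer follows from the usual output-size formula for convolution and pooling layers, out = floor((in + 2 * padding - kernel) / stride) + 1, applied to the 64 x 64 input images. The sketch below traces the architecture listed above; the helper name out_size is illustrative.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 64                         # input images are 64 x 64
size = out_size(size, 7, 1)       # 7x7 conv, stride 1 -> 58
size = out_size(size, 2, 2)       # 2x2 max pool       -> 29
size = out_size(size, 7, 1)       # 7x7 conv, stride 1 -> 23
size = out_size(size, 2, 2)       # 2x2 max pool       -> 11
flat_features = 16 * size * size  # 16 channels after the second conv
print(size, flat_features)        # 11 1936
```

The ReLU and Flatten layers do not change the number of values, so only the convolution and pooling layers appear in the trace.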


In [10]:
fixed_model_base = nn.Sequential( 
    #########1st TODO  (10 points)###################
    nn.Conv2d(3, 8, kernel_size=7, stride=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(8, 16, kernel_size=7, stride=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
    Flatten(),
    nn.ReLU(inplace=True),
    nn.Linear(1936, 10)
    ####################################
            )
fixed_model = fixed_model_base.type(dtype)



To make sure you're doing the right thing, use the following tool to check the dimensionality of your output (it should be 32 x 10, since our batches have size 32 and the output of the final affine layer should be 10, corresponding to our 10 classes):

In [11]:
## Now we're going to feed a random batch into the model you defined and make sure the output is the right size
x = torch.randn(32, 3, 64, 64).type(dtype)
x_var = Variable(x.type(dtype)) # Construct a PyTorch Variable out of your input data
ans = fixed_model(x_var)        # Feed it through the model! 

# Check to make sure what comes out of your model
# is the right dimensionality... this should be True
# if you've done everything correctly
print(np.array(ans.size()))
np.array_equal(np.array(ans.size()), np.array([32, 10]))   


[32 10]


True

### Train the model.

Now that you've seen how to define a model and do a single forward pass of some data through it, let's  walk through how you'd actually train one whole epoch over your training data (using the fixed_model_base we provided above).

Make sure you understand how each PyTorch function used below corresponds to what you implemented in your custom neural network implementation.

Note that because we are not resetting the weights anywhere below, if you run the cell multiple times, you are effectively training multiple epochs (so your performance should improve).

First, set up an RMSprop optimizer (using a 1e-4 learning rate) and a cross-entropy loss function:

In [12]:
################ 2nd TODO  (5 points)##################
optimizer = torch.optim.RMSprop(fixed_model_base.parameters(), lr = 0.0001)
loss_fn = nn.CrossEntropyLoss()

In [13]:
# This sets the model in "training" mode. 
# This is relevant for some layers that may have different behavior
# in training mode vs testing mode, such as Dropout and BatchNorm. 
fixed_model.train()

# Load one batch at a time.
for t, sample in enumerate(image_dataloader_train):
    x_var = Variable(sample['image'])
    #print(type(x_var.data))
    #print(x_var.shape)
    y_var = Variable(sample['Label']).long()

    # This is the forward pass: predict the scores for each class, for each x in the batch.
    scores = fixed_model(x_var)
    
    # Use the correct y values and the predicted y values to compute the loss.
    loss = loss_fn(scores, y_var)
    
    if (t + 1) % print_every == 0:
        print('t = %d, loss = %.4f' % (t + 1, loss.item()))

    # Zero out all of the gradients for the variables which the optimizer will update.
    optimizer.zero_grad()
    
    # This is the backwards pass: compute the gradient of the loss with respect to each 
    # parameter of the model.
    loss.backward()
    
    # Actually update the parameters of the model using the gradients computed by the backwards pass.
    optimizer.step()
   



t = 100, loss = 1.9794
t = 200, loss = 1.6585
t = 300, loss = 1.7687
t = 400, loss = 1.3849
t = 500, loss = 1.3847
t = 600, loss = 1.3364
t = 700, loss = 1.3302


Now you've seen how the training process works in PyTorch. To save you writing boilerplate code, we're providing the following helper functions to help you train for multiple epochs and check the accuracy of your model:

In [14]:
def train(model, loss_fn, optimizer, dataloader, num_epochs = 1):
    for epoch in range(num_epochs):
        print('Starting epoch %d / %d' % (epoch + 1, num_epochs))
        model.train()
        for t, sample in enumerate(dataloader):
            x_var = Variable(sample['image'])
            y_var = Variable(sample['Label'].long())

            scores = model(x_var)
            
            loss = loss_fn(scores, y_var)
            if (t + 1) % print_every == 0:
                print('t = %d, loss = %.4f' % (t + 1, loss.item()))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def check_accuracy(model, loader):
    '''
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')  
    '''
    num_correct = 0
    num_samples = 0
    model.eval() # Put the model in test mode (the opposite of model.train(), essentially)
    for t, sample in enumerate(loader):
        x_var = Variable(sample['image'])
        y_var = sample['Label']
        #y_var=y_var.cpu()
        scores = model(x_var)
        _, preds = scores.data.max(1)#scores.data.cpu().max(1)
        #print(preds)
        #print(y_var)
        num_correct += (preds.numpy() == y_var.numpy()).sum()
        num_samples += preds.size(0)
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
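
The counting step inside check_accuracy can be tried on a hand-made example. This is a minimal sketch with made-up scores and labels; argmax returns the same indices as the second output of max(1), and torch.no_grad() is a common way to skip gradient tracking during evaluation.

```python
import torch

scores = torch.tensor([[0.1, 0.9],
                       [0.8, 0.2],
                       [0.3, 0.7]])   # 3 samples, 2 classes (made-up values)
labels = torch.tensor([1, 0, 0])

with torch.no_grad():                 # no gradients are needed for evaluation
    preds = scores.argmax(dim=1)      # index of the highest score per row

num_correct = (preds == labels).sum().item()
acc = num_correct / labels.size(0)
print('Got %d / %d correct (%.2f)' % (num_correct, labels.size(0), 100 * acc))
# Got 2 / 3 correct (66.67)
```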
    
    



### Check the accuracy of the model.

Let's see the train and check_accuracy code in action -- feel free to use these methods when evaluating the models you develop below.

You should get a training loss of around 1.0-1.2, and a validation accuracy of around 50-60%. As mentioned above, if you re-run the cells, you'll be training more epochs, so your performance will improve past these numbers.

But don't worry about getting these numbers better -- this was just practice before you tackle designing your own model.

In [18]:
torch.random.manual_seed(12345)
fixed_model.cpu()
fixed_model.apply(reset) 
fixed_model.train() 
train(fixed_model, loss_fn, optimizer,image_dataloader_train, num_epochs=4) 
check_accuracy(fixed_model, image_dataloader_train)# check accuracy on the training set


Starting epoch 1 / 4




t = 100, loss = 2.2922
t = 200, loss = 2.2640
t = 300, loss = 2.1622
t = 400, loss = 1.8757
t = 500, loss = 1.7532
t = 600, loss = 1.5534
t = 700, loss = 1.6626
Starting epoch 2 / 4
t = 100, loss = 1.8409
t = 200, loss = 1.3446
t = 300, loss = 0.9724
t = 400, loss = 1.0138
t = 500, loss = 1.0166
t = 600, loss = 0.8713
t = 700, loss = 0.9027
Starting epoch 3 / 4
t = 100, loss = 0.8842
t = 200, loss = 0.9790
t = 300, loss = 0.5493
t = 400, loss = 1.0489
t = 500, loss = 0.8538
t = 600, loss = 0.8192
t = 700, loss = 0.6244
Starting epoch 4 / 4
t = 100, loss = 0.5160
t = 200, loss = 0.4333
t = 300, loss = 0.5515
t = 400, loss = 0.9081
t = 500, loss = 0.8962
t = 600, loss = 0.5640
t = 700, loss = 0.5809
Got 19116 / 23310 correct (82.01)


### Don't forget the validation set!

And note that you can use the check_accuracy function to evaluate on the validation set, by passing **image_dataloader_val** as the second argument to check_accuracy. The accuracy on the validation set should be around 40-50%.

In [19]:
check_accuracy(fixed_model, image_dataloader_val)#check accuracy on the validation set

Got 3529 / 6690 correct (52.75)


## Train a better model for action recognition!

Now it's your job to experiment with architectures, hyperparameters, loss functions, and optimizers to train a model that achieves better accuracy on the action recognition **validation** set. You can use the check_accuracy and train functions from above.

### Things you should try:
- **Filter size**: Above we used 7x7; this makes pretty pictures but smaller filters may be more efficient
- **Number of filters**: Do more or fewer do better?
- **Pooling vs Strided Convolution**: Do you use max pooling or just strided convolutions?
- **Batch normalization**: Try adding spatial batch normalization after convolution layers and vanilla batch normalization after affine layers. Do your networks train faster?
- **Network architecture**: The network above has two layers of trainable parameters. Can you do better with a deep network? Good architectures to try include:
    - [conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
    - [conv-relu-conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
    - [batchnorm-relu-conv]xN -> [affine]xM -> [softmax or SVM]
- **Global Average Pooling**: Instead of flattening and then having multiple affine layers, perform convolutions until your image gets small (7x7 or so) and then perform an average pooling operation to get to a 1x1 image picture (1, 1 , Filter#), which is then reshaped into a (Filter#) vector. This is used in [Google's Inception Network](https://arxiv.org/abs/1512.00567) (See Table 1 for their architecture).
- **Regularization**: Add l2 weight regularization, or perhaps use Dropout.
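
The global average pooling idea from the list above can be sketched as follows. This is a minimal illustration on a random feature map, assuming 64 channels at 7x7 resolution; the name gap_head is made up for the example, and nn.Flatten requires a reasonably recent PyTorch.

```python
import torch
import torch.nn as nn

gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # (N, C, H, W) -> (N, C, 1, 1)
    nn.Flatten(),              # (N, C, 1, 1) -> (N, C)
    nn.Linear(64, 10),         # one score per class
)

features = torch.randn(32, 64, 7, 7)   # stand-in for the last conv feature map
scores = gap_head(features)
print(scores.shape)                    # torch.Size([32, 10])
```

Compared with flattening a 64 x 7 x 7 map, the affine layer here has 64 inputs instead of 3136, which reduces the number of parameters in the classifier.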

### Tips for training
For each network architecture that you try, you should tune the learning rate and regularization strength. When doing this there are a couple important things to keep in mind:

- If the parameters are working well, you should see improvement within a few hundred iterations
- Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.
- Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.
- You should use the validation set for hyperparameter search, and save your test set for evaluating your architecture on the best parameters as selected by the validation set.
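
One common way to run the coarse stage (an assumption here, not something the assignment prescribes) is to sample learning rates uniformly in the exponent, so that every order of magnitude is covered equally. The function name, the ranges, and the "promising" value of about 1e-3 below are all hypothetical.

```python
import random

def sample_learning_rates(low_exp, high_exp, n, seed=0):
    """Draw n learning rates log-uniformly from [10**low_exp, 10**high_exp]."""
    rng = random.Random(seed)
    return [10 ** rng.uniform(low_exp, high_exp) for _ in range(n)]

# Coarse stage: wide range, each candidate trained for only a few hundred iterations.
coarse = sample_learning_rates(-5, -1, n=8)

# Fine stage: narrower range around a value that looked promising (here about 1e-3).
fine = sample_learning_rates(-3.5, -2.5, n=8, seed=1)

for lr in sorted(coarse):
    print('%.2e' % lr)
```

Each sampled value would then be passed as lr to the optimizer, and the candidates compared on the validation set.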

### Going above and beyond
If you are feeling adventurous there are many other features you can implement to try and improve your performance. You are **not required** to implement any of these; however they would be good things to try.

- Alternative update steps: For the assignment we implemented SGD+momentum, RMSprop, and Adam; you could try alternatives like AdaGrad or AdaDelta.
- Alternative activation functions such as leaky ReLU, parametric ReLU, ELU, or MaxOut.
- Model ensembles
- Data augmentation
- New Architectures
  - [ResNets](https://arxiv.org/abs/1512.03385) where the input from the previous layer is added to the output.
  - [DenseNets](https://arxiv.org/abs/1608.06993) where inputs into previous layers are concatenated together.
  - [This blog has an in-depth overview](https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32)
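
To illustrate the difference between the two ideas above, here is a simplified sketch: a residual block adds its input to its output, while a dense connection concatenates along the channel axis. The ResidualBlock class below is a stripped-down illustration (no batch normalization) and not the exact block from either paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = relu(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        # padding=1 keeps H and W unchanged so that x and F(x) can be added
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = torch.relu(self.conv1(x))
        out = self.conv2(out)
        return torch.relu(out + x)             # ResNet-style: add the input

x = torch.randn(4, 16, 32, 32)
res_out = ResidualBlock(16)(x)
dense_out = torch.cat([x, res_out], dim=1)     # DenseNet-style: concatenate channels
print(res_out.shape)     # torch.Size([4, 16, 32, 32])
print(dense_out.shape)   # torch.Size([4, 32, 32, 32])
```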

If you do decide to implement something extra, clearly describe it in the "Extra Credit Description" cell below.

### What we expect
At the very least, you should be able to train a ConvNet that gets at least 55% accuracy on the validation set. This is just a lower bound - if you are careful it should be possible to get accuracies much higher than that! Extra credit points will be awarded for particularly high-scoring models or unique approaches.

You should use the space below to experiment and train your network. 



In [20]:
###########3rd TODO (20 points, must submit the results to Kaggle) ##############
# Train your model here, and make sure the output of this cell is the accuracy of your best model on the 
# train, val, and test sets. Here's some code to get you started. The output of this cell should be the training
# and validation accuracy on your best model (measured by validation accuracy).

image_dataloader_aug = DataLoader(image_dataset_train+image_dataset_val, batch_size=64,
                        shuffle=True, num_workers=4)


model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=0),
    nn.BatchNorm2d(8), 
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, padding=1),

    nn.Conv2d(8, 16, kernel_size=3, stride=1, padding=0),
    nn.BatchNorm2d(16),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, padding=1),

    nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=0),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, padding=1),

    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=0),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, padding=1),

    Flatten(),
    nn.Dropout(0.1),
    nn.Linear(1024, 10)    
    )



loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters()) # no lr given, so Adam's default learning rate (1e-3) is used

torch.random.manual_seed(1332)
model.cpu()
model.apply(reset) 
model.train() 

train(model, loss_fn, optimizer, image_dataloader_aug, num_epochs=1) 
check_accuracy(model, image_dataloader_train)# check accuracy on the training set

check_accuracy(model, image_dataloader_val)

Starting epoch 1 / 1




t = 100, loss = 0.6221
t = 200, loss = 0.2417
t = 300, loss = 0.2470
t = 400, loss = 0.0629
Got 22770 / 23310 correct (97.68)
Got 6451 / 6690 correct (96.43)


### Describe what you did 

In the cell below you should write an explanation of what you did, any additional features that you implemented, and any visualizations or graphs that you make in the process of training and evaluating your network.

In [31]:
print(model)
# pip install torchsummary
from torchsummary import summary
summary(model, (3, 64, 64))

Sequential(
  (0): Conv2d(3, 8, kernel_size=(3, 3), stride=(1, 1))
  (1): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace)
  (3): MaxPool2d(kernel_size=2, stride=2, padding=1, dilation=1, ceil_mode=False)
  (4): Conv2d(8, 16, kernel_size=(3, 3), stride=(1, 1))
  (5): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (6): ReLU(inplace)
  (7): MaxPool2d(kernel_size=2, stride=2, padding=1, dilation=1, ceil_mode=False)
  (8): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
  (9): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (10): ReLU(inplace)
  (11): MaxPool2d(kernel_size=2, stride=2, padding=1, dilation=1, ceil_mode=False)
  (12): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
  (13): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (14): ReLU(inplace)
  (15): MaxPool2d(kernel_size=2, stride=2, padding=1, dilation=1, ceil_mode=False

Tell us here!
########### 4th TODO (5 points) ##############
### Starting from the model given previously, the following changes were incorporated.

#### 1. Decreased kernel-size from 7x7 to 3x3 for Conv2d Layers : 
Since the images are small, a smaller filter size helped capture local features better, which improved performance.
#### 2. Decreased stride from 2 to 1 :
Smaller stride works better.
#### 3. Added more layers to the network
#### 4. Conv2d -> BatchNorm -> ReLU -> MaxPool2d
Max pooling works better than average pooling
#### 5. Dropout : 
partly helped to reduce overfitting (not much)
#### 6. Data Augmentation : Trained on train + validation dataset : 
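This increased the accuracy on the test set, since the model sees more data (train plus validation) than before. Note that the validation accuracy reported above is then measured on data the model was trained on, so it is no longer a held-out estimate.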
#### 7. Increased the batch-size to 64

###############################################
Other failed approaches mentioned at the end of the notebook

### Test the model and submit to Kaggle
Test the model on the test set and save the results as a .csv file. 
Please submit the results.csv file generated by predict_on_test() to Kaggle (https://www.kaggle.com/c/cse512springhw3) to see how well your network performs on the test set. 
#######5th TODO (submit the result to Kaggle, the highest 3 entries get extra 10 points )###############
### Kaggle Score : 0.69133

In [21]:
def predict_on_test(model, loader):
    '''
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')  
    '''
    num_correct = 0
    num_samples = 0
    model.eval() # Put the model in test mode (the opposite of model.train(), essentially)
    results=open('results.csv','w')
    count=0
    results.write('Id'+','+'Class'+'\n')
    for t, sample in enumerate(loader):
        x_var = Variable(sample['image'])
        scores = model(x_var)
        _, preds = scores.data.max(1)
        for i in range(len(preds)):
            results.write(str(count)+','+str(preds[i].item())+'\n')  # .item() writes a plain integer instead of "tensor(...)"
            count+=1
    results.close()
    return count
    
count=predict_on_test(model, image_dataloader_test)
print(count)

9810


### GPU! (This part is optional, 0 points)

If you have access to a GPU, you can run the code on it, which is much faster.

Now, we're going to switch the dtype of the model and our data to the GPU-friendly tensors, and see what happens... everything is the same, except we are casting our model and input tensors as this new dtype instead of the old one.

If this returns false, or otherwise fails in a not-graceful way (i.e., with some error message), you may not have an NVIDIA GPU available on your machine. 

In [None]:
# Verify that CUDA is properly configured and you have a GPU available

torch.cuda.is_available()
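
If you want the same code to run with or without a GPU, one common pattern is to pick a device once and move both the model and the data to it. The sketch below assumes a small stand-in model (not fixed_model) and falls back to the CPU when CUDA is not available.

```python
import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10)).to(device)  # stand-in model
x = torch.randn(4, 3, 64, 64, device=device)   # create the batch on the same device

out = model(x)
print(out.shape, out.device.type)
```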

In [None]:
import copy
gpu_dtype = torch.cuda.FloatTensor

fixed_model_gpu = copy.deepcopy(fixed_model_base)#.type(gpu_dtype)
fixed_model_gpu.cuda()
x_gpu = torch.randn(4, 3, 64, 64).cuda()#.type(gpu_dtype)
x_var_gpu = Variable(x_gpu)#type(gpu_dtype)) # Construct a PyTorch Variable out of your input data
ans = fixed_model_gpu(x_var_gpu)        # Feed it through the model! 

# Check to make sure what comes out of your model
# is the right dimensionality... this should be True
# if you've done everything correctly
np.array_equal(np.array(ans.size()), np.array([4, 10]))


Run the following cell to evaluate the performance of the forward pass running on the CPU:

In [None]:
%%timeit 
ans = fixed_model(x_var)

... and now the GPU:

In [None]:
%%timeit 
torch.cuda.synchronize() # Make sure there are no pending GPU computations
ans = fixed_model_gpu(x_var_gpu)        # Feed it through the model! 
torch.cuda.synchronize() # Make sure there are no pending GPU computations

You should observe that even a simple forward pass like this is significantly faster on the GPU. So for the rest of the assignment (and when you go train your models in assignment 3 and your project!), you should use the GPU datatype for your model and your tensors: as a reminder that is *torch.cuda.FloatTensor* (in our notebook here as *gpu_dtype*)

Let's move the loss function and training variables to the GPU with '.cuda()'.

In [None]:
loss_fn = nn.CrossEntropyLoss().cuda()
optimizer = optim.RMSprop(fixed_model_gpu.parameters(), lr=1e-4)

In [None]:
def train(model, loss_fn, optimizer, dataloader, num_epochs = 1):
    for epoch in range(num_epochs):
        print('Starting epoch %d / %d' % (epoch + 1, num_epochs))
        model.train()
        for t, sample in enumerate(dataloader):
            x_var = Variable(sample['image'].cuda())
            y_var = Variable(sample['Label'].cuda().long())

            scores = model(x_var)
            
            loss = loss_fn(scores, y_var)
            if (t + 1) % print_every == 0:
                print('t = %d, loss = %.4f' % (t + 1, loss.item()))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def check_accuracy(model, loader):
    '''
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')  
    '''
    num_correct = 0
    num_samples = 0
    model.eval() # Put the model in test mode (the opposite of model.train(), essentially)
    for t, sample in enumerate(loader):
        x_var = Variable(sample['image'].cuda())
        y_var = sample['Label'].cuda()
        y_var=y_var.cpu()
        scores = model(x_var)
        _, preds = scores.data.cpu().max(1)
        #print(preds)
        #print(y_var)
        num_correct += (preds.numpy() == y_var.numpy()).sum()
        num_samples += preds.size(0)
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))

Run on GPU!

In [None]:
torch.cuda.random.manual_seed(12345)

fixed_model_gpu.apply(reset) 
fixed_model_gpu.train() 
train(fixed_model_gpu, loss_fn, optimizer,image_dataloader_train, num_epochs=1) 
check_accuracy(fixed_model_gpu, image_dataloader_train)# check accuracy on the training set


### 3D Convolution on video clips (25 points+10 extra points)
3D convolution is designed for videos; it has one more dimension than 2D convolution. You can find the documentation for 3D convolution here: http://pytorch.org/docs/master/nn.html#torch.nn.Conv3d. In our dataset, each clip is a video of 3 frames. Let's classify each clip, rather than each image, using 3D convolution.
We provide the data loader, train_3d, and check_accuracy_3d.
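`nn.Conv3d` expects input shaped (N, C, D, H, W): batch, channels, depth (here the frames), height, width. The `ActionClipDataset` in this notebook transposes each clip to (frames, channels, height, width), so a batch arrives as (N, D, C, H, W). Because both axes have size 3 in this dataset, no shape error is raised either way. The following sketch, on random data, shows how the two axes could be swapped with `permute` if the documented layout is wanted; whether this changes accuracy is not measured in this notebook.

```python
import torch
import torch.nn as nn

# Conv3d expects input shaped (N, C, D, H, W).
conv = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=(1, 3, 3), stride=1)

# A fake batch laid out as (N, D, C, H, W), i.e. frames before channels.
batch_frames_first = torch.randn(4, 3, 3, 64, 64)

# Swap the frame and channel axes to obtain (N, C, D, H, W).
batch_channels_first = batch_frames_first.permute(0, 2, 1, 3, 4).contiguous()

out = conv(batch_channels_first)
print(out.shape)  # torch.Size([4, 8, 3, 62, 62])
```

A kernel of size (1, 3, 3) looks at one frame at a time, so the depth stays at 3 while height and width shrink from 64 to 62.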

In [22]:
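class ActionClipDataset(Dataset):
    """Action clip dataset: each sample is a folder holding 3 frames."""

    def __init__(self,  root_dir,labels=[], transform=None):
        """
        Args:
            root_dir (string): Directory with one sub-folder per clip.
            labels (sequence, optional): 1-based class labels, one row per clip;
                leave empty for the unlabeled test set.
            transform (callable, optional): If set, the clip is converted to a
                (frames, channels, height, width) tensor; the callable itself
                is not applied.
        """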
        
        self.root_dir = root_dir
        self.transform = transform
        self.length=len(os.listdir(self.root_dir))
        self.labels=labels

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        
        folder=idx+1
        folder=format(folder,'05d')
        clip=[]
        if len(self.labels)!=0:
            Label=self.labels[idx][0]-1
        for i in range(3):
            imidx=i+1
            imgname=str(imidx)+'.jpg'
            img_path = os.path.join(self.root_dir,
                                    folder,imgname)
            image = Image.open(img_path)
            image=np.array(image)
            clip.append(image)
        if self.transform:
            clip=np.asarray(clip)
            clip=np.transpose(clip, (0,3,1,2))
            clip = torch.from_numpy(np.asarray(clip))
        if len(self.labels)!=0:
            sample={'clip':clip,'Label':Label,'folder':folder}
        else:
            sample={'clip':clip,'folder':folder}
        return sample

clip_dataset=ActionClipDataset(root_dir='./data/trainClips/',\
                               labels=label_train,transform=T.ToTensor())
for i in range(10):
    sample=clip_dataset[i]
    print(sample['clip'].shape)
    print(sample['Label'])
    print(sample['folder'])

torch.Size([3, 3, 64, 64])
0.0
00001
torch.Size([3, 3, 64, 64])
0.0
00002
torch.Size([3, 3, 64, 64])
0.0
00003
torch.Size([3, 3, 64, 64])
0.0
00004
torch.Size([3, 3, 64, 64])
0.0
00005
torch.Size([3, 3, 64, 64])
0.0
00006
torch.Size([3, 3, 64, 64])
0.0
00007
torch.Size([3, 3, 64, 64])
0.0
00008
torch.Size([3, 3, 64, 64])
0.0
00009
torch.Size([3, 3, 64, 64])
0.0
00010


In [23]:
clip_dataloader = DataLoader(clip_dataset, batch_size=4,
                        shuffle=True, num_workers=4)


for i,sample in enumerate(clip_dataloader):
    print(i,sample['clip'].shape,sample['folder'],sample['Label'])
    if i>20: 
        break

(0, torch.Size([4, 3, 3, 64, 64]), ['07581', '05352', '03804', '05857'], tensor([9., 6., 4., 7.], dtype=torch.float64))
(1, torch.Size([4, 3, 3, 64, 64]), ['05671', '00418', '06187', '01096'], tensor([7., 0., 7., 1.], dtype=torch.float64))
(2, torch.Size([4, 3, 3, 64, 64]), ['03360', '05474', '00758', '04664'], tensor([3., 6., 0., 5.], dtype=torch.float64))
(3, torch.Size([4, 3, 3, 64, 64]), ['05260', '00347', '04708', '04490'], tensor([6., 0., 5., 5.], dtype=torch.float64))
(4, torch.Size([4, 3, 3, 64, 64]), ['03060', '06416', '06578', '02359'], tensor([3., 7., 8., 2.], dtype=torch.float64))
(5, torch.Size([4, 3, 3, 64, 64]), ['05792', '05851', '00529', '04173'], tensor([7., 7., 0., 4.], dtype=torch.float64))
(6, torch.Size([4, 3, 3, 64, 64]), ['04267', '05919', '00541', '01197'], tensor([5., 7., 0., 1.], dtype=torch.float64))
(7, torch.Size([4, 3, 3, 64, 64]), ['07507', '06897', '02637', '03536'], tensor([9., 8., 2., 4.], dtype=torch.float64))
(8, torch.Size([4, 3, 3, 64, 64]), ['001

In [75]:
clip_dataset_train=ActionClipDataset(root_dir='./data/trainClips/',labels=label_train,transform=T.ToTensor())

clip_dataloader_train = DataLoader(clip_dataset_train, batch_size=16,
                        shuffle=True, num_workers=4)
clip_dataset_val=ActionClipDataset(root_dir='./data/valClips/',labels=label_val,transform=T.ToTensor())

clip_dataloader_val = DataLoader(clip_dataset_val, batch_size=16,
                        shuffle=True, num_workers=4)
clip_dataset_test=ActionClipDataset(root_dir='./data/testClips/',labels=[],transform=T.ToTensor())

clip_dataloader_test = DataLoader(clip_dataset_test, batch_size=16,
                        shuffle=False, num_workers=4)
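The default collate function of `DataLoader` batches each dict field separately: tensors are stacked along a new first axis, numbers become a tensor, and strings are gathered into a list. The sketch below illustrates this with a hypothetical `FakeClipDataset` that returns random clips in the same dict format, with `num_workers=0` so it runs anywhere.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class FakeClipDataset(Dataset):
    """Hypothetical stand-in returning random clips in the notebook's dict format."""

    def __init__(self, num_clips=10):
        self.num_clips = num_clips

    def __len__(self):
        return self.num_clips

    def __getitem__(self, idx):
        return {
            'clip': torch.randn(3, 3, 64, 64),  # 3 frames of 3x64x64
            'Label': float(idx % 10),
            'folder': format(idx + 1, '05d'),
        }

loader = DataLoader(FakeClipDataset(), batch_size=4, shuffle=False, num_workers=0)
batch = next(iter(loader))
print(batch['clip'].shape)   # torch.Size([4, 3, 3, 64, 64])
print(batch['folder'])       # ['00001', '00002', '00003', '00004']
print(batch['Label'].dtype)  # torch.float64
```

The float64 label dtype is why the training loop casts labels with `.long()` before passing them to the loss.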

Write the Flatten module for 3D convolution feature maps.

In [67]:
class Flatten3d(nn.Module):
    def forward(self, x):
        ###############6th TODO (5 points)###################
        N, C, D, H, W = x.size() 
        return x.view(N, -1)
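A quick shape check, sketched on random data, shows what the module does: the batch axis is kept and the remaining four axes are merged into one. In PyTorch 1.2 and later, the built-in `nn.Flatten()` should give the same result with its default arguments.

```python
import torch
import torch.nn as nn

class Flatten3d(nn.Module):
    def forward(self, x):
        # Keep the batch dimension and merge C, D, H, W into one axis.
        return x.view(x.size(0), -1)

x = torch.randn(2, 4, 3, 5, 5)  # (N, C, D, H, W)
flat = Flatten3d()(x)
print(flat.shape)  # torch.Size([2, 300])
```

The second dimension is 4 * 3 * 5 * 5 = 300, which is the value the first `nn.Linear` after the flatten must use as its input size.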

Design a network using 3D convolution on videos for video classification.

In [76]:
### Not the best model; the best model is listed at the end of the notebook. This was the last experiment I tried.


fixed_model_3d = nn.Sequential( # You fill this in!
    ###############7th TODO (20 points)#########################
    nn.Conv3d(3, 8, kernel_size=(1,3,3), stride=1),
    nn.BatchNorm3d(8),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(3, 2, 2), stride=(1,2,2), padding=1),
    
    nn.Conv3d(8, 16, kernel_size=(1,3,3), stride=1),
    nn.BatchNorm3d(16),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(3, 2, 2), stride=(1,2,2), padding=1),

    nn.Conv3d(16, 32, kernel_size=(1,3,3), stride=1),
    nn.BatchNorm3d(32),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(3, 2, 2), stride=(1,2,2), padding=1),       

    nn.Conv3d(32, 64, kernel_size=(1,3,3), stride=1),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(3, 2, 2), stride=(1,2,2), padding=1),  
    nn.Dropout(0.5),

    nn.Conv3d(64, 64, kernel_size=(1,3,3), stride=1),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(3, 2, 2), stride=(1,2,2), padding=1),
    Flatten3d(),
    nn.Linear(768,64),
    nn.ReLU(inplace=True),
    nn.Dropout(0.5),
    nn.Linear(64,10)
)

fixed_model_3d = fixed_model_3d.type(dtype)
x = torch.randn(32,3, 3, 64, 64).type(dtype)
x_var = Variable(x).type(dtype) # Construct a PyTorch Variable out of your input data
ans = fixed_model_3d(x_var) 
np.array_equal(np.array(ans.size()), np.array([32, 10]))


True
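The input size of the first `nn.Linear` (768 here) has to match the flattened output of the convolutional stack. Rather than computing it by hand, one option is to push a dummy sample through the feature layers and count the elements. The sketch below rebuilds the five conv blocks of the network above; `conv_block` and `infer_flat_features` are hypothetical helper names, and the dropout layer is left out because it does not change shapes.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """Conv -> BatchNorm -> ReLU -> MaxPool, as used in the network above."""
    return [
        nn.Conv3d(c_in, c_out, kernel_size=(1, 3, 3), stride=1),
        nn.BatchNorm3d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=(3, 2, 2), stride=(1, 2, 2), padding=1),
    ]

def infer_flat_features(features, input_shape):
    """Run one dummy sample through `features` and count the output elements."""
    was_training = features.training
    features.eval()  # avoid updating batch-norm statistics with fake data
    with torch.no_grad():
        out = features(torch.zeros(1, *input_shape))
    features.train(was_training)
    return out.reshape(1, -1).size(1)

layers = []
for c_in, c_out in [(3, 8), (8, 16), (16, 32), (32, 64), (64, 64)]:
    layers += conv_block(c_in, c_out)
features = nn.Sequential(*layers)

n_flat = infer_flat_features(features, (3, 3, 64, 64))
print(n_flat)  # 768
```

The result is 64 channels * 3 * 2 * 2 = 768, which agrees with `nn.Linear(768,64)`.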

### Describe what you did (5 points)

In the cell below you should write an explanation of what you did, any additional features that you implemented, and any visualizations or graphs that you make in the process of training and evaluating your network.

### 8th TODO Tell us here:
#### 1. kernel sizes : 
Smaller kernel sizes work better.
#### 2.  strides: 
Smaller stride works better.
#### 3. Convolution with and without padding
#### 4. Max pooling :
Max pooling works better than average pooling.
#### 5. Batch normalization 



In [77]:
print(fixed_model_3d)
# pip install torchsummary
from torchsummary import summary
# summary() prints the table itself, so no extra print is needed
summary(fixed_model_3d, (3, 3, 64, 64))

Sequential(
  (0): Conv3d(3, 8, kernel_size=(1, 3, 3), stride=(1, 1, 1))
  (1): BatchNorm3d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace)
  (3): MaxPool3d(kernel_size=(3, 2, 2), stride=(1, 2, 2), padding=1, dilation=1, ceil_mode=False)
  (4): Conv3d(8, 16, kernel_size=(1, 3, 3), stride=(1, 1, 1))
  (5): BatchNorm3d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (6): ReLU(inplace)
  (7): MaxPool3d(kernel_size=(3, 2, 2), stride=(1, 2, 2), padding=1, dilation=1, ceil_mode=False)
  (8): Conv3d(16, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1))
  (9): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (10): ReLU(inplace)
  (11): MaxPool3d(kernel_size=(3, 2, 2), stride=(1, 2, 2), padding=1, dilation=1, ceil_mode=False)
  (12): Conv3d(32, 64, kernel_size=(1, 3, 3), stride=(1, 1, 1))
  (13): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (14): ReLU(inplace)
  (15): M

In [78]:
loss_fn = nn.CrossEntropyLoss().type(dtype)
optimizer = optim.Adam(fixed_model_3d.parameters())


In [79]:
def train_3d(model, loss_fn, optimizer,dataloader,num_epochs = 1):
    for epoch in range(num_epochs):
        print('Starting epoch %d / %d' % (epoch + 1, num_epochs))
        model.train()
        for t, sample in enumerate(dataloader):
            x_var = Variable(sample['clip'].type(dtype))
            y_var = Variable(sample['Label'].type(dtype).long())

            scores = model(x_var)
            
            loss = loss_fn(scores, y_var)
            if (t + 1) % print_every == 0:
                print('t = %d, loss = %.4f' % (t + 1, loss.item()))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def check_accuracy_3d(model, loader):
    '''
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')  
    '''
    num_correct = 0
    num_samples = 0
    model.eval() # Put the model in test mode (the opposite of model.train(), essentially)
    for t, sample in enumerate(loader):
        x_var = Variable(sample['clip'].type(dtype))
        y_var = sample['Label'].type(dtype)
        y_var=y_var.cpu()
        scores = model(x_var)
        _, preds = scores.data.cpu().max(1)
        #print(preds)
        #print(y_var)
        num_correct += (preds.numpy() == y_var.numpy()).sum()
        num_samples += preds.size(0)
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))

In [80]:
torch.cuda.random.manual_seed(1332)
fixed_model_3d.apply(reset) 
fixed_model_3d.train() 
train_3d(fixed_model_3d, loss_fn, optimizer,clip_dataloader_train, num_epochs=25) 
fixed_model_3d.eval() 
check_accuracy_3d(fixed_model_3d, clip_dataloader_val)

Starting epoch 1 / 25




t = 100, loss = 1.6114
t = 200, loss = 1.8699
t = 300, loss = 1.2448
t = 400, loss = 1.5927
Starting epoch 2 / 25
t = 100, loss = 1.0678
t = 200, loss = 0.9919
t = 300, loss = 0.7828
t = 400, loss = 1.2686
Starting epoch 3 / 25
t = 100, loss = 1.1394
t = 200, loss = 0.7921
t = 300, loss = 0.6804
t = 400, loss = 0.9715
Starting epoch 4 / 25
t = 100, loss = 0.6548
t = 200, loss = 0.6611
t = 300, loss = 1.1438
t = 400, loss = 0.5637
Starting epoch 5 / 25
t = 100, loss = 0.2799
t = 200, loss = 0.4259
t = 300, loss = 0.4275
t = 400, loss = 0.2818
Starting epoch 6 / 25
t = 100, loss = 0.3485
t = 200, loss = 0.9103
t = 300, loss = 0.5154
t = 400, loss = 0.3972
Starting epoch 7 / 25
t = 100, loss = 0.2575
t = 200, loss = 0.3915
t = 300, loss = 0.7013
t = 400, loss = 0.6936
Starting epoch 8 / 25
t = 100, loss = 0.1945
t = 200, loss = 0.8446
t = 300, loss = 0.3518
t = 400, loss = 0.7200
Starting epoch 9 / 25
t = 100, loss = 0.4520
t = 200, loss = 0.4028
t = 300, loss = 0.1995
t = 400, loss = 0.1

Test your 3D convolution model on the validation set. You don't need to submit the result of this part to Kaggle.

Test your model on the test set. predict_on_test_3d() will generate a file named 'results_3d.csv'. Please submit the CSV file to Kaggle: https://www.kaggle.com/c/cse512springhw3video
The top 3 entries get 10 extra points.


In [81]:
def predict_on_test_3d(model, loader):
    '''Write one Id,Class row per test clip to results_3d.csv and return the row count.'''
    model.eval() # Put the model in test mode (the opposite of model.train(), essentially)
    count=0
    # the with-block closes the file even if the loop raises
    with open('results_3d.csv','w') as results:
        results.write('Id'+','+'Class'+'\n')
        for t, sample in enumerate(loader):
            x_var = Variable(sample['clip'].type(dtype))
            scores = model(x_var)
            _, preds = scores.data.max(1)
            for i in range(len(preds)):
                # int() unwraps a 0-dim tensor so the file holds plain class indices
                results.write(str(count)+','+str(int(preds[i]))+'\n')
                count+=1
    return count
    
count=predict_on_test_3d(fixed_model_3d, clip_dataloader_test)
print(count)

3270
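The file-writing part can be separated from the model so that it is easy to check on its own. The sketch below uses the standard `csv` module; `write_predictions` is a hypothetical helper, and the `Id,Class` header and ids starting at 0 simply mirror `predict_on_test_3d()` above.

```python
import csv
import os
import tempfile

def write_predictions(preds, path):
    """Write one Id,Class row per prediction; ids start at 0."""
    with open(path, 'w') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow(['Id', 'Class'])
        for idx, label in enumerate(preds):
            writer.writerow([idx, int(label)])  # int() also unwraps 0-dim tensors
    return len(preds)

# Toy usage with plain integers standing in for model predictions.
tmp_path = os.path.join(tempfile.mkdtemp(), 'results_3d.csv')
n_rows = write_predictions([3, 0, 7], tmp_path)
with open(tmp_path) as f:
    contents = f.read()
print(contents)
```

Reading the file back before submitting is a cheap way to catch rows such as `tensor(3)` that a grader would reject.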


## Images - Other Attempts 

### 1. Along the lines of VGG: since the images are only 64x64 and the dataset is limited, a bigger network overfitted very badly. Hence, I chose to decrease both the number of hidden layers and the number of hidden units in the hidden layers.
```
nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.MaxPool2d(2, stride = 2),
nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.MaxPool2d(2, stride = 2),
nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.MaxPool2d(2, stride = 2),
nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.MaxPool2d(2, stride = 2),
Flatten(),
nn.Linear(8192, 1024),
nn.Linear(1024, 10),
nn.LogSoftmax()
```

### 2. Decreasing just the number of hidden units still resulted in overfitting.
```
    nn.Conv2d(3,32, kernel_size=5, stride=1, padding=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(32,32, kernel_size=5, stride=1, padding=2),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2, padding=0),
    nn.Conv2d(32,64, kernel_size=3, stride=1, padding=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(64,64, kernel_size=3, stride=1, padding=2),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2, padding=0),
    nn.Conv2d(64,64, kernel_size=3, stride=1, padding=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(64,64, kernel_size=3, stride=1, padding=2),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2, padding=0),
    Flatten(),
    nn.Linear(7744,1024),
    nn.Dropout(0.2),
    nn.Linear(1024,10)
```
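The input size of `nn.Linear(7744,1024)` follows from the usual layer arithmetic, out = floor((in + 2 * padding - kernel) / stride) + 1, applied layer by layer to the 64x64 input. The sketch below traces it; `out_size` is a hypothetical helper. Note that `padding=2` with a 3x3 kernel grows the feature map by 2 pixels per convolution.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor mode, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 64
# Each stage: two convolutions followed by a 2x2 max pool with stride 2.
stages = [
    [(5, 1, 2), (5, 1, 2)],  # (kernel, stride, padding) of the two convs
    [(3, 1, 2), (3, 1, 2)],
    [(3, 1, 2), (3, 1, 2)],
]
for convs in stages:
    for kernel, stride, padding in convs:
        size = out_size(size, kernel, stride, padding)
    size = out_size(size, kernel=2, stride=2)

flat_features = 64 * size * size  # 64 channels leave the last conv
print(size, flat_features)  # 11 7744
```

The spatial size goes 64 -> 32 -> 18 -> 11 across the three stages, giving 64 * 11 * 11 = 7744 features.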



### 3. With too many parameters to learn and a limited dataset, the model overfitted the training set, reaching 95.35% training accuracy. Neither dropout nor batch norm overcame the overfitting. Hence, I chose to decrease the number of hidden units.
```
fixed_model_base2 = nn.Sequential( 
    nn.Conv2d(3, 128, kernel_size=10, stride=3, padding=0),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=1),
    nn.BatchNorm2d(128),
    nn.Dropout2d(0.5),
    nn.Conv2d(128, 64, kernel_size=5, stride=2, padding=0),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride = 1),
    nn.BatchNorm2d(64),
    nn.Dropout2d(0.5),
    nn.Conv2d(64, 50, kernel_size=3, stride=1, padding=0),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(2, stride=1),
    nn.BatchNorm2d(50),
    Flatten(),
    nn.Linear(200, 100),
    nn.Linear(100, 10),
    nn.LogSoftmax()
)
fixed_model2 = fixed_model_base2.type(dtype)
optimizer = torch.optim.RMSprop(fixed_model_base2.parameters(), lr = 0.0001)
loss_fn = nn.CrossEntropyLoss()
torch.random.manual_seed(54321)
fixed_model2.cpu()
fixed_model2.apply(reset) 
fixed_model2.train() 
train(fixed_model2, loss_fn, optimizer,image_dataloader_train, num_epochs=3) 
check_accuracy(fixed_model2, image_dataloader_train)
```
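To see where the parameters of such a model sit, one can count them per layer. The sketch below is a minimal illustration, assuming only standard `torch.nn` layers; `count_parameters` is a hypothetical helper, and the three layers are copied from the model above.

```python
import torch.nn as nn

def count_parameters(module):
    """Number of trainable scalar parameters in `module`."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

first_conv = nn.Conv2d(3, 128, kernel_size=10, stride=3, padding=0)
second_conv = nn.Conv2d(128, 64, kernel_size=5, stride=2, padding=0)
first_fc = nn.Linear(200, 100)

for name, layer in [('conv1', first_conv), ('conv2', second_conv), ('fc1', first_fc)]:
    print(name, count_parameters(layer))
# conv1 38528
# conv2 204864
# fc1 20100
```

In this configuration the second convolution holds most of the weights (128 * 64 * 5 * 5 plus 64 biases), so reducing its channel counts is the most direct way to shrink the model.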


## Videos - Best Attempt
### 1.  Kaggle Score - 0.66207
```
fixed_model_3d = nn.Sequential( # You fill this in!
    ###############7th TODO (20 points)#########################
    nn.Conv3d(3, 32, kernel_size=(1,3,3), stride=1),
    nn.BatchNorm3d(32),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1,2,2), stride=(1,2,2)),
    nn.Conv3d(32, 16, kernel_size=(1,3,3), stride=1),
    nn.BatchNorm3d(16),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1,2,2), stride=(1,2,2)),
    Flatten3d(),
    nn.Dropout(0.5),
    nn.ReLU(inplace=True),
    nn.Linear(9408,10)
        
)

fixed_model_3d = fixed_model_3d.type(dtype)
x = torch.randn(32,3, 3, 64, 64).type(dtype)
x_var = Variable(x).type(dtype) # Construct a PyTorch Variable out of your input data
ans = fixed_model_3d(x_var) 
np.array_equal(np.array(ans.size()), np.array([32, 10]))
loss_fn = nn.CrossEntropyLoss().type(dtype)

optimizer = optim.RMSprop(fixed_model_3d.parameters(), lr=1e-4)
torch.cuda.random.manual_seed(12345)
fixed_model_3d.apply(reset) 
fixed_model_3d.train() 
train_3d(fixed_model_3d, loss_fn, optimizer,clip_dataloader_train, num_epochs=12) 
fixed_model_3d.eval() 
check_accuracy_3d(fixed_model_3d, clip_dataloader_val)

t = 100, loss = 1.0798
t = 200, loss = 1.0568
t = 300, loss = 1.2343
t = 400, loss = 0.7497
Starting epoch 2 / 12
t = 100, loss = 0.5015
t = 200, loss = 0.9561
t = 300, loss = 0.4418
t = 400, loss = 0.2362
Starting epoch 3 / 12
t = 100, loss = 0.3162
t = 200, loss = 0.4173
t = 300, loss = 0.4172
t = 400, loss = 0.3901
Starting epoch 4 / 12
t = 100, loss = 0.1592
t = 200, loss = 0.0814
t = 300, loss = 0.2312
t = 400, loss = 0.2133
Starting epoch 5 / 12
t = 100, loss = 0.0910
t = 200, loss = 0.1133
t = 300, loss = 0.1800
t = 400, loss = 0.1902
Starting epoch 6 / 12
t = 100, loss = 0.1297
t = 200, loss = 0.0796
t = 300, loss = 0.1849
t = 400, loss = 0.2262
Starting epoch 7 / 12
t = 100, loss = 0.1923
t = 200, loss = 0.0568
t = 300, loss = 0.0981
t = 400, loss = 0.1433
Starting epoch 8 / 12
t = 100, loss = 0.2054
t = 200, loss = 0.0542
t = 300, loss = 0.0111
t = 400, loss = 0.0210
Starting epoch 9 / 12
t = 100, loss = 0.1155
t = 200, loss = 0.0603
t = 300, loss = 0.0813
t = 400, loss = 0.0884
Starting epoch 10 / 12
t = 100, loss = 0.0473
t = 200, loss = 0.0115
t = 300, loss = 0.0281
t = 400, loss = 0.1106
Starting epoch 11 / 12
t = 100, loss = 0.1383
t = 200, loss = 0.1403
t = 300, loss = 0.0517
t = 400, loss = 0.0847
Starting epoch 12 / 12
t = 100, loss = 0.0265
t = 200, loss = 0.0155
t = 300, loss = 0.0134
t = 400, loss = 0.1849
Got 1530 / 2230 correct (68.61)
```

In [3]:
class q1_model1(nn.Module):

    def __init__(self):
        super(q1_model1, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5, stride=2) # 64x64x1 -> 30x30x16
        self.bn1 = nn.BatchNorm2d(16)
        self.relu1 = nn.ReLU(inplace=True) 
        self.pool1 = nn.MaxPool2d(kernel_size=2 , stride=2) # 30x30x16 -> 15x15x16
        #self.dropout1 = nn.Dropout(p=0.5)
        
        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, stride=1) # 15x15x16 -> 13x13x8
        self.bn2 = nn.BatchNorm2d(8)
        self.relu2 = nn.ReLU(inplace=True) 
        self.pool2 = nn.MaxPool2d(kernel_size=2 , stride=2) # 13x13x8 -> 6x6x8
        self.dropout1 = nn.Dropout(p=0.5)
        self.fc1 = nn.Linear(6*6*8, 15) 

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)
        x = self.pool1(x)
        
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu2(x)
        x = self.pool2(x)
        x = self.flatten(x)
        x = self.dropout1(x) 
        x = self.fc1(x)
        return x

    def flatten(self, x):
        N, C, H, W = x.size()
        return x.view(N, -1)

model1 = q1_model1()
print(model1)

q1_model1(
  (conv1): Conv2d(1, 16, kernel_size=(5, 5), stride=(2, 2))
  (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu1): ReLU(inplace)
  (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(16, 8, kernel_size=(3, 3), stride=(1, 1))
  (bn2): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu2): ReLU(inplace)
  (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (dropout1): Dropout(p=0.5)
  (fc1): Linear(in_features=288, out_features=15, bias=True)
)


In [4]:
summary(model1, (1,64,64))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 16, 30, 30]             416
       BatchNorm2d-2           [-1, 16, 30, 30]              32
              ReLU-3           [-1, 16, 30, 30]               0
         MaxPool2d-4           [-1, 16, 15, 15]               0
            Conv2d-5            [-1, 8, 13, 13]           1,160
       BatchNorm2d-6            [-1, 8, 13, 13]              16
              ReLU-7            [-1, 8, 13, 13]               0
         MaxPool2d-8              [-1, 8, 6, 6]               0
           Dropout-9                  [-1, 288]               0
           Linear-10                   [-1, 15]           4,335
Total params: 5,959
Trainable params: 5,959
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 0.39
Params size (MB): 0.02
Estimated Total