# Writeup

Our goal is to exceed roughly 95% accuracy on the test set. Looking at the nature of the problem, we want to reason from the ground up: what abstractions does this task require?

Traffic signs have fairly simple outer geometries (triangles with acute angles, squares with right angles, and circles), so a few convolutions should be able to capture the outer shape. When a sign is photographed at an angle, however, those shapes skew into others (parallelograms, ellipses).

On top of that, the network must also represent the inner content, such as numbers and other symbols, since signs are categorized by details like their specific speed limit.

![100](unnamed.png)

Here is a sample of images from the 100 km/h speed limit category.

### Timeline 

Before building any large architecture, we wanted a baseline from a simple neural network. The author also wanted to avoid copying existing solutions (treating this as exploratory experimentation rather than going straight for the SOTA) and to reason from basic concepts. With two convolution layers and two fully connected layers, each with widths around 64, accuracy hovered in the 80s.

From here, the network was built up. Max pooling with stride 2 was added to help with blurred sign images (similar to Gaussian blurring). With performance still under 90%, the network was deepened with two residual layers to preserve signal and gradients, and the images were converted to grayscale. This network did not perform well either, and it was hypothesized that residual layers are better suited to deeper architectures.

More fully connected layers were added, and performance reached just above 90%. Adding more convolutional layers nudged it up to ~92%. The stack of fully connected layers seemed to be slowing down training more than helping, so instead a longer sequence of convolutional layers was placed up front, the fully connected layers at the end were reduced to two, and the residual layers were removed. Too many convolutional layers decreased performance, however, so fewer than 10 seemed to be the right depth.

The validation and training losses started to diverge, a sign of overfitting, so batch normalization was added after the convolutions to normalize activations, and dropout was added to make the network more robust. The default dropout rate of 0.5 was a bit much, so it was reduced to around one quarter. Accuracy went up to 94%, close to the target, then oscillated between 93% and 94%, which suggested the learning rate was too high to settle into a minimum.

To reach a deeper minimum, control logic was implemented to lower the learning rate once certain thresholds were crossed. After training the network for 256 epochs, several models reached the goal of >95%, scoring around 97% on the provided test set.

The loss curves for one of these >95% models are plotted below:

![f1](Figure_1.png)

and the validation accuracy trace: 

![f2](Figure_2.png)

## Next steps

With the goal reached, we wanted to experiment with more complicated architectures. Disclaimer: the experiments here did not exactly succeed, but they are included to document what else was tried. I saw that others were getting above human performance, whereas some years ago getting close to human performance would have been impressive enough, so I wondered what else could be done.

To be robust to scale, a multi-scale architecture was attempted: the convolutional layers were stacked in sequence, and the outputs of the earlier layers were also concatenated into the input of the final fully connected layer, so that the network can weigh finer or broader details in the final classification. Training was extremely slow in these cases, as the concatenation left the final layers with a large number of parameters to juggle. Performance plateaued around 70%. These runs used only 4 conv layers, so the work by Sermanet and LeCun may have employed more. After a couple of days of trying this with limited compute and slow turnaround, I decided to shelve it.
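
To get a feel for why the concatenation is expensive, here is a rough back-of-the-envelope sketch. It assumes the multi-scale variant reused the layer sizes of the baseline model shown later in this writeup (four unpadded 3x3 convolutions with 64 channels, 2x2 pooling after the first two, a 32x32 input, and a 64-unit first fully connected layer). The actual multi-scale configuration is not recorded here, so the numbers are illustrative only.

```python
def conv_out(size, kernel=3):
    """Spatial size after an unpadded, stride-1 convolution."""
    return size - kernel + 1


def flat_features(channels, size):
    """Number of values when a (channels, size, size) map is flattened."""
    return channels * size * size


# Assumed stack: four 3x3 convs with 64 channels on a 32x32 input,
# with 2x2 max pooling after the first two convs only.
size = 32
stage_features = []
for pooled in (True, True, False, False):
    size = conv_out(size)
    if pooled:
        size //= 2
    stage_features.append(flat_features(64, size))

single_scale_inputs = stage_features[-1]   # only the last feature map
multi_scale_inputs = sum(stage_features)   # every stage concatenated

hidden = 64  # assumed width of the first fully connected layer
single_scale_params = single_scale_inputs * hidden + hidden
multi_scale_params = multi_scale_inputs * hidden + hidden

print(stage_features)
print(single_scale_params, multi_scale_params)
```

Under these assumptions, most of the extra weights come from the earliest and largest feature map, which may be part of why training slowed down so much.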

The question was now what other tricks could keep training time reasonable yet push performance above 97%. I suspected the answer lay in the data itself. Fewer than 100,000 training samples seems low compared to other current datasets, so data augmentation was used to add random distortions to the data in each batch. After a couple of days of training, accuracy unfortunately leveled off around 87%.

Furthermore, after inspecting the training images, I found that some folders had far fewer examples than others, indicating a class imbalance. If we wanted less bias and better accuracy on edge cases, we would need to better represent the minority classes.

To start, more samples of the under-represented classes were copied over (still with random distortion on each load), giving the data loader a more equitable chance of selecting those categories for training. These were classes 0, 6, 19, 24, 27, 29, 32, 37, 41, and 42. Indeed, these signs are both less frequent and, in many cases, more visually complicated (perhaps the simplest symbols were reserved for the most frequent signs). Each of these under-represented classes was multiplied by at least 4.

An example of one of the underrepresented sign categories is:

![hard](hard.png)

The shapes are more nuanced and complicated (e.g. the truck and car may be classified the same if not trained enough). If a class is not selected often enough during training, its features may not be learned well enough to differentiate it from a similar-looking category.

Allowing these minority categories to be selected more often in training can help the network categorize them during validation.
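
As a minimal sketch of how a copy multiplier per class could be chosen, the snippet below computes how many copies of each class are needed to reach a target count. The function name, the target, and the per-class counts are hypothetical and are not the real dataset statistics.

```python
import math


def replication_factors(class_counts, target):
    """Copies of each class needed to reach at least `target` samples."""
    return {
        label: max(1, math.ceil(target / count))
        for label, count in class_counts.items()
    }


# Hypothetical per-class image counts (NOT the real dataset statistics).
counts = {0: 200, 1: 2000, 6: 400, 19: 180}

factors = replication_factors(counts, target=800)
balanced = {label: counts[label] * factors[label] for label in counts}

print(factors)
print(balanced)
```

Classes that already exceed the target keep a factor of 1, so only the minority folders are duplicated.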

### Code Snippets

#### The Model Architecture

Before augmentation and grayscaling.

Link to csv: https://www.kaggle.com/submissions/13022721/13022721.raw

Edit for Oct 31: pth "baby_network.pth" is attached (also sent via an email from mll469@nyu.edu as model_236.pth). 

In [None]:
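import torch
import torch.nn as nn
import torch.nn.functional as F

nclasses = 43  # number of traffic sign classes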

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=3)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3)
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3)
        self.conv_bn = nn.BatchNorm2d(64)
        self.conv_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(256, 64)
        self.fc2 = nn.Linear(64, 50)
        self.finallayer = nn.Linear(50, nclasses)

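    def forward(self, x):
        # Two conv + dropout + 2x2 max-pool stages
        x = F.relu(F.max_pool2d(self.conv_drop(self.conv1(x)), 2))
        x = F.relu(F.max_pool2d(self.conv_drop(self.conv2(x)), 2))
        # Two conv + batch-norm stages (the same BatchNorm2d module is reused)
        x = F.relu(self.conv_bn(self.conv3(x)))
        x = F.relu(self.conv_bn(self.conv4(x)))
        # Flatten: for a 32x32 input, 64 channels of 2x2 features = 256 values
        x = x.view(-1, 256)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training, p=0.2)
        x = F.relu(self.fc2(x))
        x = F.dropout(x, training=self.training, p=0.2)
        x = self.finallayer(x)
        # Log-probabilities over classes; dim=1 is the class dimension
        return F.log_softmax(x, dim=1)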
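
Because the model returns log-probabilities, it pairs naturally with a negative log likelihood loss. The pure-Python sketch below (toy scores, no PyTorch, helper names chosen here for illustration) shows that NLL on log-softmax outputs equals cross entropy on the raw scores, and that applying log-softmax a second time changes nothing.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def nll(log_probs, target):
    """Negative log likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Cross entropy = log-softmax followed by NLL."""
    return nll(log_softmax(logits), target)


logits = [2.0, -1.0, 0.5]  # hypothetical scores for three classes
target = 0

log_probs = log_softmax(logits)
loss_nll = nll(log_probs, target)
loss_ce = cross_entropy(logits, target)
# Feeding log-probabilities into cross entropy applies log-softmax twice,
# which is a no-op because the probabilities already sum to 1.
loss_ce_on_log_probs = cross_entropy(log_probs, target)

print(loss_nll, loss_ce, loss_ce_on_log_probs)
```

All three printed losses agree up to floating-point error.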

#### Data Augmentation

With probability 0.5 (the `RandomApply` default), rotate, translate, zoom, and shear the samples by random amounts on load.

In [None]:
import PIL.Image
from torchvision import transforms

data_transforms = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.Grayscale(),
    transforms.RandomApply([
        transforms.RandomRotation(45, resample=PIL.Image.BICUBIC),
        transforms.RandomAffine(0, translate=(0.1, 0.1),
                                resample=PIL.Image.BICUBIC),
        transforms.RandomAffine(0, scale=(0.9, 1.1),
                                resample=PIL.Image.BICUBIC),
        transforms.RandomAffine(0, shear=10, 
                                resample=PIL.Image.BICUBIC)
    ]),
    transforms.ToTensor(),
    # transforms.Normalize((0.3337, 0.3064, 0.3171), ( 0.2672, 0.2564, 0.2629))
    transforms.Normalize((0.5, ), ( 0.5,))
])

#### CLI command to make more copies of a specific class for underrepresented folders

Copies and renames 000xx_0xxxx to 000xx_1xxxx, doubling the sample size (which the data loader then randomly augments). Can be done again to quadruple, etc.

`mmv -c \*_0\* \#1_1\#2`

Requires the mmv package on Linux.
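
Where `mmv` is not available, a rough Python equivalent might look like the sketch below. The function name and the `.ppm` demo file names are hypothetical; it is run here on a temporary folder rather than the real class folders.

```python
import shutil
import tempfile
from pathlib import Path


def duplicate_samples(folder):
    """Copy every '<prefix>_0<rest>' file to '<prefix>_1<rest>'."""
    copied = []
    # The directory listing is materialized before copying, so the new
    # files are not visited again in the same pass.
    for path in sorted(Path(folder).iterdir()):
        prefix, sep, rest = path.name.partition("_0")
        if not path.is_file() or not sep:
            continue
        target = path.with_name(f"{prefix}_1{rest}")
        if not target.exists():
            shutil.copy2(path, target)
            copied.append(target.name)
    return copied


# Demo on a throwaway folder with two hypothetical sample files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("00003_00000.ppm", "00003_00001.ppm"):
        (Path(tmp) / name).write_bytes(b"placeholder")
    new_files = duplicate_samples(tmp)
    all_files = sorted(p.name for p in Path(tmp).iterdir())

print(new_files)
print(all_files)
```

As with the `mmv` command, this doubles the folder once; quadrupling would need a second pass with a different replacement digit.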

#### Parse logs to make graph

In [None]:
import sys
from matplotlib import pyplot as plt

def parse(filename):
    training_count = 0
    training_avg_loss = 0
    training_avg_plot = []
    val_plot = []
    val_acc = []
    with open(filename, "r") as f:
        for line in f:
            if is_training_line(line):
                training_count += 1
                training_avg_loss = (get_training_loss(line) + training_avg_loss * (training_count - 1)) / training_count
            if is_val_line(line):
                training_avg_plot.append(training_avg_loss)
                val_plot.append(get_val_loss(line))
                val_acc.append(get_val_acc(line))
                training_avg_loss = 0
                training_count = 0
        print(training_avg_plot, val_plot)
        return training_avg_plot, val_plot, val_acc      
            
def is_training_line(line):
    return line.startswith("Tr")

def is_val_line(line):
    return line.startswith("Va")

def get_training_loss(line):
    print(line.split("Loss: "))
    return float(line.split("Loss: ")[1])

def get_val_loss(line):
    return float(line.split("loss: ")[1].split(",")[0])

def get_val_acc(line):
    return float(line.split("(")[1].split("%")[0])

train, val, val_acc = parse(sys.argv[1])

# tr, = plt.plot(train, label="train")
# va, = plt.plot(val, label="val")
# plt.legend(handles=[tr, va])
# plt.xlabel("epoch")
# plt.ylabel("loss")

vac, = plt.plot(val_acc, label="val acc")
plt.legend(handles=[vac])
plt.xlabel("epoch")
plt.ylabel("percentage acc")

plt.show()
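
For reference, the parser only works on log lines of a particular shape. The two sample lines below are assumptions inferred from the string splits in the functions above (they are not copied from a real log), and the sketch repeats the three extraction helpers so it can run on its own.

```python
# Hypothetical log lines, shaped to match the string splits used by the parser.
train_line = "Train Epoch: 3 [640/35000 (2%)]\tLoss: 0.412345"
val_line = "Validation set: Average loss: 0.2101, Accuracy: 3650/3870 (94%)"


def get_training_loss(line):
    # Everything after "Loss: " is the batch loss.
    return float(line.split("Loss: ")[1])


def get_val_loss(line):
    # The value between "loss: " and the next comma.
    return float(line.split("loss: ")[1].split(",")[0])


def get_val_acc(line):
    # The number between the first "(" and the following "%".
    return float(line.split("(")[1].split("%")[0])


parsed = (
    get_training_loss(train_line),
    get_val_loss(val_line),
    get_val_acc(val_line),
)
print(parsed)
```

Note that the training and validation lines are told apart only by their first two characters, so any other log line starting with "Tr" or "Va" would confuse the parser.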

#### README.md for training

Along with replacing model.py with the architecture above and adding the augmentation to data.py, a variable `above_thres = False` was added to main.py to implement control logic that targets the overfitting zone of each network. When the accuracy rose above a certain threshold, the flag was set to true and a new optimizer with a lower learning rate (~0.0002 when using SGD) took over. The loss was initially negative log likelihood (NLL); cross entropy was tried later on. The top-ranking model actually used NLL.
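
The control logic can be sketched in a few lines of plain Python. Only the `above_thres` flag and the ~0.0002 rate come from the description above; the function name, the 93% threshold, the starting rate of 0.01, and the accuracy values are assumptions made for illustration.

```python
def next_learning_rate(current_lr, accuracy, above_thres,
                       threshold=93.0, reduced_lr=0.0002):
    """Return (learning_rate, above_thres) after one validation pass."""
    if not above_thres and accuracy >= threshold:
        # First time the threshold is crossed: switch to the lower rate.
        return reduced_lr, True
    return current_lr, above_thres


lr, above_thres = 0.01, False  # assumed starting rate
history = []
for acc in (88.0, 91.5, 93.4, 92.8, 94.1):  # made-up validation accuracies
    lr, above_thres = next_learning_rate(lr, acc, above_thres)
    history.append(lr)

print(history)
```

Once the flag is set, the rate stays low even if accuracy dips back under the threshold, which avoids flip-flopping between the two rates.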

#### References

https://stackoverflow.com/questions/57229054/how-to-implement-my-own-resnet-with-torch-nn-sequential-in-pytorch  
https://pytorch.org/docs/stable/torchvision/transforms.html  
P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In Proceedings of International Joint Conference on Neural Networks (IJCNN’11), 2011.

#### Username on Kaggle

MJL, Michael Luckyman (not the best but built from the ground up and trained from scratch without looking at any premade code for the task)

### Edit Oct 30: 

Although we reached the goal through perspiration, I wanted to keep trying: I felt bad about my score, as it seemed everyone else had somehow found a near-optimal solution while I spent days at a time optimizing hyperparameters, with disappointing results. Again, my model reached the goal, but people seemed to be competing hard, so I felt like I was behind.

I was still trying even before knowing the deadline was extended (chasing personal results). At the same time, I had other assignments planned around the original dates, so I was just running things in the background. At this point I looked around for specific methods we had not covered in class and familiarized myself with spatial transformer networks (STNs), which help keep abstractions translation invariant. This was useful. I also learned about concatenating datasets so that augmented data shows up as additional data points. Finally, I took a closer look at optimizers and saw that the custom scheduler I had for Adam was detrimental: Adam already maintains internal per-parameter state that adapts its step sizes, and that state was erased each time the optimizer was reassigned. After adjusting those elements, performance on Kaggle went up by another percent!
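
To illustrate the optimizer issue, here is a toy, single-parameter Adam written in plain Python (a sketch, not the PyTorch implementation; the class name and the toy loss are made up). It shows that lowering the rate in place keeps the running moment estimates, while creating a new optimizer starts them from zero.

```python
import math


class ScalarAdam:
    """Adam for a single scalar parameter, kept minimal for illustration."""

    def __init__(self, lr, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # running mean of gradients
        self.v = 0.0  # running mean of squared gradients
        self.t = 0    # step counter used for bias correction

    def step(self, param, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def grad(w):
    return 2.0 * (w - 3.0)  # gradient of the toy loss (w - 3)^2


opt = ScalarAdam(lr=0.1)
w = 0.0
for _ in range(20):
    w = opt.step(w, grad(w))

opt.lr = 0.01                # lower the rate in place: m, v and t survive
fresh = ScalarAdam(lr=0.01)  # a re-created optimizer starts from zero state

print(opt.t, opt.v > 0.0, fresh.t, fresh.v)
```

In PyTorch, the in-place route is usually to assign a new value to `param_group["lr"]` for each entry of `optimizer.param_groups`, or to use one of the schedulers in `torch.optim.lr_scheduler`.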

Other than the STN, I deliberately did not change any of the architecture, since I wanted that idea to remain the same. Snippets from the new files follow; again, apart from the STN, the changes are all outside the model. It was really harrowing to think of how else to squeeze out performance, especially after learning from conversations that I was doing something similar to many people, and I don't know how they got there so fast. In the end, a lot was learned and internalized by going through the work myself and thinking.

Additional references at this point:

STNs - https://www.youtube.com/watch?v=T5k0GnBmZVI, along with the original STN paper.

In [None]:
# Diffs in main.py

### Data Initialization and Loading
from data import initialize_data, spec_trans, spec_trans_end, randoTrans # data.py in the same folder
from data import data_transforms
initialize_data(args.data) # extracts the zip files, makes a validation set

train_loader = torch.utils.data.DataLoader(
    torch.utils.data.ConcatDataset([
        datasets.ImageFolder(args.data + "/train_images", transform=spec_trans(transforms.CenterCrop(32))),
        datasets.ImageFolder(args.data + "/train_images", transform=spec_trans(transforms.ColorJitter(brightness=0.4))),
        datasets.ImageFolder(args.data + "/train_images", transform=spec_trans(transforms.ColorJitter(contrast=0.7))),
        datasets.ImageFolder(args.data + "/train_images", transform=spec_trans(transforms.ColorJitter(hue=0.5))),
        datasets.ImageFolder(args.data + "/train_images", transform=spec_trans(transforms.RandomAffine(80))),
        datasets.ImageFolder(args.data + "/train_images", transform=spec_trans(transforms.RandomAffine(0, translate=((0.40,0.40))))),
        datasets.ImageFolder(args.data + "/train_images", transform=spec_trans(transforms.RandomAffine(0, scale=(1.0,1.24), shear=10))),
#         datasets.ImageFolder(args.data + "/train_images", transform=spec_trans(transforms.RandomHorizontalFlip(p=1.0))),
#         datasets.ImageFolder(args.data + "/train_images", transform=spec_trans(transforms.RandomVerticalFlip(p=1.0))),
        datasets.ImageFolder(args.data + "/train_images", transform=spec_trans_end(transforms.RandomErasing(p=1.0, value='random'))),
        datasets.ImageFolder(args.data + "/train_images", transform=randoTrans),
        datasets.ImageFolder(args.data + "/train_images", transform=spec_trans(transforms.RandomPerspective()))
    ]),
    batch_size=args.batch_size, shuffle=True, num_workers=1)

A sample of the actual augmented training set. Notice the erasure, which mimics real-life occlusion:

![fig4](Figure_4.png)

I removed flips because some signs are so similar that a flip could turn one into another. I limited rotation to 80 degrees because a 90-degree rotation could also turn a sign into a different one (e.g. arrows). I still kept the limit above 45 degrees, since the network might learn background cues to orient the image. I limited the brightness change because some images were already quite dark or bright, and a larger change left them under- or overexposed. The hue shift was kept moderate, but large enough to replicate lighting conditions and daytime tints. I scaled up after shearing so that the photo edge would not be mistaken for another shape or sign. Translation was kept. This was all done through trial and error and by thinking about what we would see in real-world data.

In [None]:
# Diffs in data.py

data_transforms = transforms.Compose([
    transforms.Resize((32, 32)),
    # transforms.Grayscale(),
    transforms.ToTensor(),
    transforms.Normalize((0.3337, 0.3064, 0.3171), ( 0.2672, 0.2564, 0.2629))
    # transforms.Normalize((0.5, ), ( 0.5,))
])

# This is a function to abstract away compositions with a specific transform interchanged.

def spec_trans(specific_transform):
    trans = transforms.Compose([
        specific_transform,
        transforms.Resize((32, 32)),
        # transforms.Grayscale(),
        transforms.ToTensor(),
        # transforms.Normalize((0.5, ), ( 0.5,))
        transforms.Normalize((0.3337, 0.3064, 0.3171), ( 0.2672, 0.2564, 0.2629))
    ])
    return trans

def spec_trans_end(specific_transform):
    trans = transforms.Compose([
        transforms.Resize((32, 32)),
        # transforms.Grayscale(),
        transforms.ToTensor(),
        # transforms.Normalize((0.5, ), ( 0.5,))
        transforms.Normalize((0.3337, 0.3064, 0.3171), ( 0.2672, 0.2564, 0.2629)),
        specific_transform
    ])
    return trans    

randoTrans = transforms.Compose([
    transforms.RandomApply(
        [
            transforms.ColorJitter(brightness=0.9),
            transforms.ColorJitter(contrast=0.9),
            transforms.ColorJitter(hue=0.5),
            transforms.RandomAffine(90),
            transforms.RandomAffine(0, translate=((0.40,0.40))),
            transforms.RandomAffine(0, shear=10),
            transforms.RandomHorizontalFlip(p=1.0),
            transforms.RandomVerticalFlip(p=1.0)
        ]
        ,p=0.3),
        transforms.Resize((32, 32)),
        transforms.ToTensor(),
        transforms.Normalize((0.3337, 0.3064, 0.3171), ( 0.2672, 0.2564, 0.2629)),
        transforms.RandomErasing(p=0.9, value='random')
])

In [None]:
# Diff in model.py

# ... 
# Previous architecture above here. 
        
        # STN added for more robust translation invariance
        self.localization = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=7),
            nn.MaxPool2d(2, stride=2),
            nn.ReLU(True),
            nn.Conv2d(8, 10, kernel_size=5),
            nn.MaxPool2d(2, stride=2),
            nn.ReLU(True)
        )

        # Regressor for the 2 x 3 affine matrix
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 4 * 4, 32),
            nn.ReLU(True),
            nn.Linear(32, 3 * 2)
        )

        # Initialize the last layer so the STN starts as the identity transform
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    # Forward for STN
    def stn(self, x):
        xs = self.localization(x)
        xs = xs.view(-1, 10 * 4 * 4)
        theta = self.fc_loc(xs)
        theta = theta.view(-1, 2, 3)

        grid = F.affine_grid(theta, x.size())
        x = F.grid_sample(x, grid)

        return x

    def forward(self, x):
        x = self.stn(x) # Added STN 
        x = F.relu(F.max_pool2d(self.conv_drop(self.conv1(x)), 2))
        x = F.relu(F.max_pool2d(self.conv_drop(self.conv2(x)), 2))
        x = F.relu(self.conv_bn(self.conv3(x)))
        x = F.relu(self.conv_bn(self.conv4(x)))
        x = x.view(-1, 256)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training, p=0.2)
        x = F.relu(self.fc2(x))
        x = F.dropout(x, training=self.training, p=0.2)
        x = self.finallayer(x)
        return F.log_softmax(x, dim=1)
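
Two details of the STN can be checked with plain arithmetic. The sketch below (helper name chosen here for illustration) first verifies that a 32x32 input leaves 10 x 4 x 4 features after the localization layers, then shows that the initial bias `[1, 0, 0, 0, 1, 0]`, reshaped to 2 x 3, maps every normalized coordinate to itself. With the weights zeroed, that bias is the output for any input, so the STN starts out as a no-op.

```python
def apply_affine(theta, x, y):
    """Map a normalized coordinate (x, y) through a 2x3 affine matrix."""
    (a, b, tx), (c, d, ty) = theta
    return (a * x + b * y + tx, c * x + d * y + ty)


# Localization network: unpadded 7x7 conv then 2x2 pool,
# followed by an unpadded 5x5 conv and another 2x2 pool.
size = 32
for kernel in (7, 5):
    size = (size - kernel + 1) // 2
localization_features = 10 * size * size  # 10 output channels

bias = [1, 0, 0, 0, 1, 0]        # initial bias of the last fc_loc layer
theta = [bias[0:3], bias[3:6]]   # same layout as theta.view(-1, 2, 3)

corners = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
mapped = [apply_affine(theta, x, y) for x, y in corners]

print(localization_features)
print(mapped == corners)
```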

New loss convergence plot:

![fig3](Figure_3.png)

Since the training data was augmented with distortions that may never appear in the validation set, one hypothesis is that the training set was legitimately harder to learn, which made the learned abstractions more robust on the validation set from the start. The code and folders were checked to make sure data was not leaking, so the behavior seems to come genuinely from the augmented data rather than from a leak.

### Conclusion 

In this project I learned how to build a network from scratch using what we have learned in class. Through exploration, I struggled (healthily) with and saw firsthand the actual effects of changing parameters and architecture. Eventually, through this self-learning, I reached the goal. I also learned a lot about data augmentation. At the end, I looked up more of the SOTA "this just works" techniques that my friends suggested, but I am proud that I took the time to think through the initial aspects from first principles.