# Sample Complexity Gap

This notebook aims to demonstrate the sample complexity gap stated in **Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets? by Zhiyuan Li, Yi Zhang and Sanjeev Arora** [1]. We set up an experiment in which the gap should appear as an increasing polynomial curve in the input dimension, of degree less than two.

## 1. Methods

For a given input dimension $d$, we seek the number $|S_{tr}|$ of training samples needed for a model to reach $\epsilon=0.9$ test accuracy. We then plot the difference between the number of training samples needed by a Convolutional Neural Network (CNN) and by a Fully Connected Neural Network (FCNN) for increasing values of $d$.

### Data

The inputs are $3\times k \times k$ RGB images for $k\in \mathbb{N}$, yielding input dimensions $d = 3k^2 \in \{3,12,27,48,75,108,147,192,243,300,...\}$. We create a full training set of 10,000 images and a test set of 10,000 images, and ask: how many of the first training samples are needed to reach $90\%$ test accuracy when training until convergence? Each image is constructed as follows.
+ Entry-wise independent Gaussian (mean 0, standard deviation 1)
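
As a minimal sketch of this sampling step, the snippet below draws entry-wise independent standard Gaussian images and lists the input dimensions $d = 3k^2$. The helper name `sample_gaussian_images` and its `seed` argument are hypothetical and not part of the notebook's scripts.

```python
import torch

def sample_gaussian_images(n, k, channels=3, seed=0):
    """Draw n images of shape (channels, k, k) with i.i.d. N(0, 1) entries."""
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(n, channels, k, k, generator=gen)

# Input dimension d = 3 * k^2 for k = 1, ..., 10
dims = [3 * k**2 for k in range(1, 11)]
print(dims)  # [3, 12, 27, 48, 75, 108, 147, 192, 243, 300]

x = sample_gaussian_images(1000, k=8)
print(x.shape)  # torch.Size([1000, 3, 8, 8])
```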

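We explore two different labelling functions, where $R$ and $G$ denote the index sets of the red-channel and green-channel entries of the input $x$: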
\begin{equation}
h_1=\mathbb{1}[\sum_{i\in R} x_i > \sum_{i \in G}x_i] \quad\mathrm{ and }\quad h_2=\mathbb{1}[\sum_{i\in R} x_i^2 > \sum_{i \in G}x_i^2].
\end{equation}
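
The two labelling functions can be sketched as a single function of an exponent `p`. This is an illustration only: the notebook's own `calc_label` lives in a local script that is not shown here, so the name `label_sketch`, the channel order (R, G, B) and the `(N, 1)` output shape are assumptions.

```python
import torch

def label_sketch(x, p):
    """Return 1.0 where the red channel's sum of x**p exceeds the green channel's.

    x: tensor of shape (N, 3, k, k), channel order assumed to be (R, G, B).
    p: 1 for h_1, 2 for h_2.
    """
    red = x[:, 0].pow(p).sum(dim=(1, 2))
    green = x[:, 1].pow(p).sum(dim=(1, 2))
    return (red > green).float().unsqueeze(1)  # shape (N, 1)

x = torch.zeros(2, 3, 2, 2)
x[0, 0] = -3.0  # image 0: red entries all -3, green entries 0
x[1, 1] = 1.0   # image 1: green entries all 1, red entries 0
print(label_sketch(x, p=1).flatten().tolist())  # [0.0, 0.0]
print(label_sketch(x, p=2).flatten().tolist())  # [1.0, 0.0]
```

Image 0 shows how the two functions differ: its red sum is negative, so $h_1$ gives 0, but its red sum of squares is larger than the green one, so $h_2$ gives 1.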

### Models

1. 2-layer CNN.
    + Convolution - one kernel per input channel of size 3x3, 10 output channels, stride size 1, and padding of 1, and bias
    + Activation function
    + Pooling: Max pooling, kernel size 2x2, stride 2
    + Flattening
    + Fully connected layer (? in, 1 out) with bias
    + Sigmoid
2. 2-layer FCNN 
    + Fully connected layer (192 in, 3072 out) with bias
    + Activation function 
    + Fully connected layer (3072 in, 1 out) with bias
    + Sigmoid
    
We try both ReLU and Quadratic activation functions. 
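
The two architectures above could be written roughly as follows. This is a sketch under assumptions, not the contents of the local `models` script: the class names `QuadraticSketch`, `CNNSketch` and `FCNNSketch` are hypothetical, and the input width of the CNN's final layer, $10\lfloor k/2 \rfloor^2$, is inferred from the layer list rather than taken from the source.

```python
import torch
import torch.nn as nn

class QuadraticSketch(nn.Module):
    """Element-wise quadratic activation x -> x**2."""
    def forward(self, x):
        return x ** 2

class CNNSketch(nn.Module):
    def __init__(self, k, activation):
        super().__init__()
        self.conv = nn.Conv2d(3, 10, kernel_size=3, stride=1, padding=1, bias=True)
        self.activation = activation
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Assumed width: 10 channels, each (k // 2) x (k // 2) after pooling
        self.fc = nn.Linear(10 * (k // 2) ** 2, 1)

    def forward(self, x):
        x = self.pool(self.activation(self.conv(x)))
        return torch.sigmoid(self.fc(torch.flatten(x, start_dim=1)))

class FCNNSketch(nn.Module):
    def __init__(self, d, activation, hidden=3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(d, hidden), activation,
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

k = 8
x = torch.randn(4, 3, k, k)
cnn = CNNSketch(k, QuadraticSketch())
fcnn = FCNNSketch(3 * k * k, nn.ReLU())
print(cnn(x).shape, fcnn(x).shape)  # torch.Size([4, 1]) torch.Size([4, 1])
print(sum(p.numel() for p in cnn.conv.parameters()))  # 280
```

The convolution has $10 \cdot 3 \cdot 3 \cdot 3 + 10 = 280$ parameters, which is independent of the image size $k$, while the first FCNN layer grows linearly with $d$.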

### Training algorithm
+ Stochastic Gradient Descent with batch size 64
+ BCELoss
+ Learning rate $\gamma = 0.01$
+ Stopping criterion: stop once at least 10 epochs have passed AND the training loss is below 0.01 AND the rolling average of the relative change in training loss (window size 10) is below 0.01. Otherwise, stop after 1000 epochs.
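
The stopping criterion can be sketched in plain Python as below. The function names `rolling_rel_change` and `should_stop` are hypothetical, and the exact definition of the rolling average (mean of $|l_t - l_{t-1}| / |l_{t-1}|$ over the last `window` steps) is an assumption; the notebook's own `roll_avg_rel_change` helper may differ.

```python
def rolling_rel_change(losses, window):
    """Mean relative change between consecutive losses over the last `window` steps.

    Returns None until window + 1 losses are available.
    """
    if len(losses) < window + 1:
        return None
    recent = losses[-(window + 1):]
    changes = [abs(b - a) / max(abs(a), 1e-12) for a, b in zip(recent, recent[1:])]
    return sum(changes) / window

def should_stop(epoch, losses, window=10, rel_tol=0.01, abs_tol=0.01,
                min_epochs=10, max_epochs=1000):
    """Combine the three convergence conditions with the epoch cap."""
    if epoch >= max_epochs:
        return True
    avg = rolling_rel_change(losses, window)
    return (epoch >= min_epochs and avg is not None
            and losses[-1] < abs_tol and avg < rel_tol)

flat_low = [0.005] * 11   # small, stable loss
flat_high = [0.5] * 11    # stable but too large
print(should_stop(10, flat_low))     # True
print(should_stop(5, flat_low))      # False: fewer than 10 epochs
print(should_stop(10, flat_high))    # False: loss above 0.01
print(should_stop(1000, flat_high))  # True: epoch cap reached
```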

### Model Evaluation
+ The prediction of model $M$ on input $x$ is $\mathbb{1}[M(x)>0.5]$. Test accuracy is the fraction of correct predictions over the test set.
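
A minimal sketch of this evaluation rule, assuming the model outputs probabilities with the same shape as the labels (the function name `accuracy_sketch` is hypothetical):

```python
import torch

def accuracy_sketch(probs, labels, threshold=0.5):
    """Fraction of examples where the thresholded output matches the label."""
    preds = (probs > threshold).float()
    return (preds == labels).float().mean().item()

probs = torch.tensor([[0.9], [0.2], [0.6], [0.4]])
labels = torch.tensor([[1.0], [0.0], [0.0], [0.0]])
print(accuracy_sketch(probs, labels))  # 0.75
```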

### Search algorithm

We seek the number of training samples needed to reach a fixed test accuracy using a bisection-style search.
1. Initial training run on 5,000 samples.
    + If test accuracy > 0.9, take a half step towards 0, i.e. train next on 2,500 samples.
    + If test accuracy <= 0.9, take a half step towards 10,000, i.e. train next on 7,500 samples.
2. Reload the initial weights and retrain, then take a quarter step in the direction given by the same rule. Each subsequent step is half the size of the previous one.

This is repeated $10$ times with different weight initialisations, because noise can make the test-accuracy curves non-monotonic in the number of training samples.
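
The search can be sketched independently of any training code. In the snippet below, `accuracy_at` stands in for "reload the initial weights, train on the first `n` samples, return test accuracy"; it is replaced by a toy monotone curve, and the names `bisect_sample_size` and `toy_accuracy` are hypothetical. The sketch covers a single search and does not include the repetition over weight initialisations.

```python
def bisect_sample_size(accuracy_at, n_max=10_000, target=0.9, steps=10):
    """Return the smallest tried n whose accuracy exceeds `target`, or None."""
    n = n_max // 2      # initial run on half of the training set
    step = n_max // 4   # first move is a half step from the midpoint
    best = None
    for _ in range(steps):
        if accuracy_at(n) > target:
            best = n if best is None else min(best, n)
            n = max(n - step, 1)       # enough samples: move towards 0
        else:
            n = min(n + step, n_max)   # too few samples: move towards n_max
        step = max(step // 2, 1)       # halve the step every iteration
    return best

def toy_accuracy(n):
    """Toy stand-in for a trained model's test accuracy; exceeds 0.9 for n > 3000."""
    return 1.0 - 300.0 / n

found = bisect_sample_size(toy_accuracy)
print(found)
```

With the toy curve, the returned size lies slightly above 3,000, the point where the curve crosses the target accuracy.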

In [11]:
# Imports
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

from torch.utils.data import TensorDataset, DataLoader
from torchsummary import summary
from json import load, dump

# Local python scripts
from helpers import roll_avg_rel_change, calc_label
from models import CNN, FCNN, Quadratic

In [12]:
# Seed random number generation
torch.manual_seed(0)
np.random.seed(0)

In [13]:
# Global constants
learning_rate = 0.01
batch_size = 64
max_epochs = 1000
window = 10 # Window size for convergence crit.
rel_conv_crit = 0.01
abs_conv_crit = 0.01
epsilon = 0.7 # Required test accuracy (the Methods section states 0.9)

# Input shape
channels = 3 # RGB images
img_sizes = np.arange(8, 12) # Image side lengths k = 8, ..., 11
input_sizes = channels * img_sizes**2 # Input dimension d = 3 * k^2
input_shapes = [(channels, img_size, img_size) for img_size in img_sizes]

# Full dataset sizes
N_tr = 10000
N_te = 10000

In [22]:
n_tried=[0, 100, 10000]
n = 100
if 0.5 < 0.9:
    index = max(n_tried.index(n)-1, 0)
else:
    if n_tried.index(n)==len(n_tried)-1:
        n_tried+=[n_tried[-1]*2]
    else:
        n += (n_tried[-1] - n) / 2
        
print(n, n_tried)

100 [0, 100, 10000]


In [10]:
# For increasing input dimension
for i, d in enumerate(input_sizes):
    
    print(f"Input dimension: {d}")
    
    # Full training and test sets
    gauss_x_tr = torch.tensor(np.random.normal(0,1,size=(N_tr,*(input_shapes[i]))),dtype=torch.float32)
    gauss_x_te = torch.tensor(np.random.normal(0,1,size=(N_te,*(input_shapes[i]))),dtype=torch.float32)

    # Full h2 training and test labels (Replace with p=1 for h1)
    gauss_y_tr = calc_label(gauss_x_tr, p=2)
    gauss_y_te = calc_label(gauss_x_te, p=2)
    
    # Models
    models = [CNN(Quadratic()), FCNN(d, Quadratic())]
    names = ['CNNquad', 'FCNNquad']
    for j, model in enumerate(models):
        print(names[j])
        summary(model, input_shapes[i])
        torch.save(model.state_dict(), 'Weights/'+names[j]+'.pth')
        
        # Find correct training sample set size by bisection
        n = 100 # Number of training samples
        iterate = 0 # Search iteration counter (was undefined in this cell)
        found = False
        n_tried = [n]
        while not found:
            print("Iterate ", iterate, "Training samples: ", n)
            
            # Reset models
            model.load_state_dict(torch.load('Weights/'+names[j]+'.pth'))
            
            # Create dataloaders
            dataset = TensorDataset(gauss_x_tr[:n], gauss_y_tr[:n])
            dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
            
            # Optimizer
            optimizer = optim.SGD(model.parameters(), lr=learning_rate)
            
            # Training loop
            criterion = nn.BCELoss()
            model.train()
            epoch = 0
            converged = False
            loss_queue = [] # For rolling training loss stop criterion
            while not converged:
                for batch_x, batch_y in dataloader:
                    optimizer.zero_grad()
                    output = model(batch_x)
                    loss = criterion(output, batch_y)
                    loss.backward()
                    optimizer.step()

                    # Check for convergence
                    roll_avg = roll_avg_rel_change(loss_queue, window, loss.item())
                    if (roll_avg and roll_avg < rel_conv_crit and loss.item() < abs_conv_crit) or epoch >= max_epochs:
                        converged = True
                        break
                
                epoch += 1
                if (epoch%100==0):
                    print("Epoch:", epoch, "Loss: ", loss.item())
            
            # Evaluate model (no gradients are needed at test time)
            model.eval()
            with torch.no_grad():
                out = model(gauss_x_te)
                test_loss = criterion(out, gauss_y_te)
            accuracy = ((out > 0.5).float() == gauss_y_te).float().mean().item()
            found = abs(accuracy - epsilon) < 0.1
            
            print("Finished training: ", epoch, " Acc: ", accuracy)
            
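            # Bisection method for finding correct training set size.
            # ASSUMPTION: the original branches were empty; this update follows the
            # Search algorithm section (move n by a step that halves every iterate).
            iterate += 1
            step = N_tr // 2 ** (iterate + 1)
            if step == 0:
                break # Step size exhausted
            if accuracy < epsilon:
                n = min(n + step, N_tr) # Too few samples: move towards N_tr
            else:
                n = max(n - step, 1) # Enough samples: move towards 0
            n_tried.append(n)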

Input dimension: 192
CNNquad
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1             [-1, 10, 8, 8]             280
         Quadratic-2             [-1, 10, 8, 8]               0
         MaxPool2d-3             [-1, 10, 4, 4]               0
           Sigmoid-4                    [-1, 1]               0
Total params: 280
Trainable params: 280
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.00
Estimated Total Size (MB): 0.01
----------------------------------------------------------------
Iterate  0 Training samples:  100
Epoch: 100 Loss:  0.728851854801178
Epoch: 200 Loss:  0.7228296995162964
Epoch: 300 Loss:  0.7010923624038696
Epoch: 400 Loss:  0.689973771572113
Epoch: 500 Loss:  0.6737563610076904
Epoch: 600 Loss:  0.6960011124610901
Epoch: 700 Loss:  0.689

KeyboardInterrupt: 

In [None]:
# Read the JSON file
file_path = 'test_acc.json'
with open(file_path, 'r') as json_file:
    test_acc = load(json_file)

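# Add experiment to results, keyed by the number of training samples
# (the original key `N` is undefined in this notebook; `n` is assumed)
test_acc[str(n)] = accuracy

# Write accuracy to file
with open(file_path, 'w') as json_file:
    dump(test_acc, json_file)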

1. [Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?](https://arxiv.org/abs/2010.08515) Zhiyuan Li, Yi Zhang, Sanjeev Arora, 2021