# Sample Complexity Gap

This notebook aims to demonstrate the sample complexity gap stated in **Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?** by Zhiyuan Li, Yi Zhang and Sanjeev Arora [1]. We set up an experiment in which the gap should appear as an increasing polynomial curve of degree less than two.

## 1. Methods

For a given input dimension $d$, we seek the number $|S_{tr}|$ of training samples a model needs to reach a test accuracy of $\epsilon=0.9$. We then plot the difference in the number of training samples needed by a Convolutional Neural Network (CNN) and by a Fully Connected Neural Network (FCNN) for increasing values of $d$.

### Data

The inputs are $3\times k \times k$ RGB images for $k\in \mathbb{N}$, yielding input dimensions $d = 3k^2 \in \{..., 192, 243, 300, 363, ...\}$. We create a full training set of 10'000 images and a test set of 10'000 images, and ask: how many of the first training samples are needed to reach $90\%$ test accuracy when training until convergence? The training sets are constructed in the following manner.
+ Entry-wise independent Gaussian (mean 0, standard deviation 1)

We explore two different labelling functions
\begin{equation}
h_1=\mathbb{1}\left[\sum_{i\in R} x_i > \sum_{i \in G}x_i\right] \quad\mathrm{and}\quad h_2=\mathbb{1}\left[\sum_{i\in R} x_i^2 > \sum_{i \in G}x_i^2\right],
\end{equation}
where $R$ and $G$ denote the index sets of the red and green channel entries of $x$.
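
The two labelling functions can be sketched in NumPy as follows. This is only an illustration: the notebook's own `calc_label` helper is not shown here, so `label_sketch` is a hypothetical stand-in, and it assumes that channel 0 is red and channel 1 is green.

```python
import numpy as np

def label_sketch(x, p=1):
    """Label a batch of images of shape (N, 3, k, k).

    Returns 1.0 where the sum of x**p over the red channel exceeds the
    sum over the green channel, else 0.0 (p=1 gives h_1, p=2 gives h_2).
    Assumption: channel 0 is red and channel 1 is green.
    """
    red = (x[:, 0] ** p).sum(axis=(1, 2))
    green = (x[:, 1] ** p).sum(axis=(1, 2))
    return (red > green).astype(np.float32)

# Entry-wise independent standard Gaussian inputs, as in the Data section
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(1000, 3, 8, 8))
y1 = label_sketch(x, p=1)
y2 = label_sketch(x, p=2)
print(y1.mean(), y2.mean())  # both classes should be roughly balanced
```

By symmetry of the Gaussian inputs, both labellings should split the data into two classes of roughly equal size.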

### Models

1. 2-layer CNN.
    + Convolution: 10 output channels, each with one 3x3 kernel per input channel, stride 1, padding 1, with bias
    + Activation function
    + Pooling: Max pooling, kernel size 2x2, stride 2
    + Flattening
    + Fully connected layer (? in, 1 out) with bias
    + Sigmoid
2. 2-layer FCNN 
    + Fully connected layer (192 in, 3072 out) with bias
    + Activation function 
    + Fully connected layer (3072 in, 1 out) with bias
    + Sigmoid
    
We try both ReLU and Quadratic activation functions. 
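
The input width of the CNN's fully connected layer depends on the image side length $k$. A minimal sketch of that arithmetic, assuming the standard output-size formula for convolution and pooling layers with floor rounding (the helper names `conv_out` and `cnn_flat_features` are illustrative and do not come from `models.py`):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

def cnn_flat_features(k, out_channels=10):
    """Number of features after flattening, for k x k input images."""
    after_conv = conv_out(k, kernel=3, stride=1, padding=1)  # stays k
    after_pool = conv_out(after_conv, kernel=2, stride=2)    # floor(k / 2)
    return out_channels * after_pool ** 2

# Side length, input dimension d = 3 * k^2, flattened CNN features
for k in (8, 9, 10, 11):
    print(k, 3 * k * k, cnn_flat_features(k))
```

Under these assumptions the flattened width is $10\lfloor k/2 \rfloor^2$, so odd side lengths lose one row and one column in the pooling step.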

### Training algorithm
+ Stochastic Gradient Descent with batch size 64
+ BCELoss
+ Learning rate $\gamma = 0.01$
+ Stopping criterion: stop after at least 10 epochs once the training loss is below 0.01 and the rolling average of the relative change in training loss (window size 10) is below 0.01, or after 500 epochs at the latest.
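
The stopping criterion can be sketched in plain Python as below. This is one possible reading of the rule, evaluated once per epoch; the notebook's `roll_avg_rel_change` helper is not shown, so `make_stop_check` is a hypothetical substitute and the loss curve is synthetic.

```python
from collections import deque

def make_stop_check(window=10, min_epochs=10, abs_tol=0.01,
                    rel_tol=0.01, max_epochs=500):
    """Return a function that decides whether training should stop."""
    rel_changes = deque(maxlen=window)  # last `window` relative changes
    state = {"prev": None}              # previous training loss

    def should_stop(epoch, loss):
        """epoch: number of completed epochs, loss: latest training loss."""
        prev = state["prev"]
        if prev is not None:
            rel_changes.append(abs(loss - prev) / max(abs(prev), 1e-12))
        state["prev"] = loss
        if epoch >= max_epochs:
            return True
        if epoch < min_epochs or len(rel_changes) < window:
            return False
        roll_avg = sum(rel_changes) / window
        return loss < abs_tol and roll_avg < rel_tol

    return should_stop

stop = make_stop_check()
stopped_at = None
for epoch in range(1, 501):
    loss = 0.005 + 0.1 * 0.5 ** epoch  # synthetic loss curve, not a real run
    if stop(epoch, loss):
        stopped_at = epoch
        break
print(stopped_at)
```

Keeping the epoch cap as a separate check ensures that a run whose loss plateaus above 0.01 still terminates.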

### Model Evaluation
+ The model $M$ prediction is $\mathbb{1}[M(x)>0.5]$. Test accuracy is the percentage of correct predictions over the test set.
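
As a small illustration of this evaluation rule, the following NumPy sketch thresholds sigmoid outputs at 0.5 and computes the fraction of correct predictions. The function name `binary_accuracy` and the toy numbers are made up for the example.

```python
import numpy as np

def binary_accuracy(probs, labels, threshold=0.5):
    """Fraction of samples where 1[prob > threshold] equals the label."""
    preds = (probs > threshold).astype(labels.dtype)
    return float((preds == labels).mean())

probs = np.array([0.9, 0.2, 0.6, 0.4])   # toy model outputs
labels = np.array([1.0, 0.0, 0.0, 0.0])  # toy ground truth
print(binary_accuracy(probs, labels))    # 3 of 4 correct, so 0.75
```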

### Search algorithm

We seek the number of training samples needed to reach a fixed test accuracy using a kind of bisection algorithm.
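1. Initial training run on 5'000 samples.
    + If test accuracy > 0.9, take a half step towards 0 -> 2'500
    + If test accuracy <= 0.9, take a half step towards 10'000 -> 7'500
2. Reload the initial weights and retrain on the new number of samples. Then take a quarter step according to the same rule, and continue with ever smaller steps.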

This is repeated $10$ times with different weight initialisations in case the test-accuracy curves are not monotonically increasing due to noise.
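
The search described above can be sketched as follows. It is a minimal illustration, not the notebook's exact loop: `evaluate` stands in for "reload the initial weights, train on the first `n` samples, return test accuracy", and `toy_accuracy` is a made-up monotone curve used only to exercise the search.

```python
def search_train_size(evaluate, n_max=10_000, target=0.9):
    """Bisection-style search for the training set size reaching `target`."""
    n = n_max // 2     # initial run on half of the full training set
    step = n_max // 4  # first move is a half step of the remaining range
    while step > 0:
        if evaluate(n) > target:
            n -= step  # accurate enough: try fewer samples
        else:
            n += step  # not accurate enough: try more samples
        step //= 2     # quarter step, eighth step, ...
    return n

def toy_accuracy(n):
    """Hypothetical accuracy curve that reaches 0.9 at n = 4000."""
    return min(1.0, 0.9 * n / 4000)

n_found = search_train_size(toy_accuracy)
print(n_found)  # lands within a few samples of 4000
```

On a noisy, non-monotone accuracy curve a single search can end in the wrong place, which is why repeating it over several weight initialisations is useful.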

In [1]:
# Imports
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

from torch.utils.data import TensorDataset, DataLoader
from torchsummary import summary
from json import load, dump

# Local python scripts
from helpers import roll_avg_rel_change, calc_label
from models import CNN, FCNN, Quadratic

In [2]:
# Seed random number generation
torch.manual_seed(0)
np.random.seed(0)

In [3]:
# Global constants
learning_rate = 0.01
batch_size = 64
max_epochs = 500
window = 10 # Window size for convergence crit.
rel_conv_crit = 0.01
abs_conv_crit = 0.01
epsilon = 0.9 # Required accuracy
tolerance = 0.01 # Required tolerance

# Input shape
channels = 3 # RGB images
img_sizes =  np.arange(4, 14) # Image side lengths
input_sizes = 3*img_sizes**2 # Input dimension
input_shapes = [(channels, img_size, img_size) for img_size in img_sizes]

# Full dataset sizes
N_tr = 1000000
N_te = 10000

In [5]:
# For increasing input dimension
for i, input_size in enumerate(input_sizes):
    
    print(f"Input dimension: {input_size}")
    
    # Full training and test sets
    gauss_x_tr = torch.tensor(np.random.normal(0,1,size=(N_tr,*(input_shapes[i]))),dtype=torch.float32)
    gauss_x_te = torch.tensor(np.random.normal(0,1,size=(N_te,*(input_shapes[i]))),dtype=torch.float32)

    # Full h2 training and test labels (Replace with p=1 for h1)
    gauss_y_tr = calc_label(gauss_x_tr, p=2)
    gauss_y_te = calc_label(gauss_x_te, p=2)
    
    # Models
    CNNreLU, CNNquad = CNN(input_shapes[i], nn.ReLU()), CNN(input_shapes[i], Quadratic())
    FCNNreLU, FCNNquad = FCNN(input_size, nn.ReLU()), FCNN(input_size, Quadratic())
    models = [CNNreLU, CNNquad, FCNNreLU, FCNNquad]
    names = ["2-CNN+ReLU", "2-CNN+Quadratic","2-FCNN+ReLU","2-FCNN+Quadratic"]
    
    # Training set size found for each model (0 until found)
    ns = [0] * len(models)
    
    for j, model in enumerate(models):
        print(names[j])
        # summary(model, input_shapes[i])
        torch.save(model.state_dict(), 'Weights/'+names[j]+'.pth')
        
        # Find exact training sample set size by bisection
        n = 100 # Initial number of training samples
        found = False
        n_tried = [0]
        iterate=1
        while not found:
            print("Iterate ", iterate, "Training samples: ", n)
            
            # Reset model
            model.load_state_dict(torch.load('Weights/'+names[j]+'.pth'))
            
            # Create dataloaders
            dataset = TensorDataset(gauss_x_tr[:n], gauss_y_tr[:n])
            dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
            
            # Optimizer
            optimizer = optim.SGD(model.parameters(), lr=learning_rate)
            
            # Training loop
            criterion = nn.BCELoss()
            model.train()
            epoch = 0
            converged = False
            loss_queue = [] # For rolling training loss stop criterion
            while not converged:
                for batch_x, batch_y in dataloader:
                    optimizer.zero_grad()
                    output = model(batch_x)
                    loss = criterion(output, batch_y)
                    loss.backward()
                    optimizer.step()

                    # Check for convergence after every batch
                    roll_avg = roll_avg_rel_change(loss_queue, window, loss.item())
                    if (roll_avg and roll_avg < rel_conv_crit and loss.item() < abs_conv_crit) or epoch >= max_epochs:
                        converged = True
                        break
                
                epoch += 1
            
            # Evaluate model
            with torch.no_grad():
                model.eval()
                out = model(gauss_x_te)
                test_loss = criterion(out, gauss_y_te)
                accuracy = torch.eq((out > 0.5).float(), gauss_y_te).float().mean().item()
                found = abs(accuracy - epsilon) < tolerance
                
                # Save this training set size
                if found:
                    ns[j]=n
                    
            print("Finished training: ", epoch, " Acc: ", accuracy)
            
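            # Bisection: if accurate enough, move halfway to the next smaller size tried;
            # otherwise move halfway to the next larger size tried, or double if there is none
            n_tried.append(n)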
            n_tried.sort()
            idx = n_tried.index(n)
            if accuracy > epsilon:
                n = n // 2 if idx == 0 else (n + n_tried[idx-1]) // 2
            else:
                n = 2 * n if idx == len(n_tried)-1 else (n + n_tried[idx+1]) // 2
                
            # Try again with a different number of training samples
            iterate += 1
    
    # Read the JSON file
    file_path = 'trainset_sizes.json'
    with open(file_path, 'r') as json_file:
        trainset_sizes = load(json_file)

    # Add experiment to results
    trainset_sizes[str(input_size)]=ns

    # Write training set size to file
    with open(file_path, 'w') as json_file:
        dump(trainset_sizes, json_file)

Input dimension: 48
2-CNN+ReLU
Iterate  1 Training samples:  100
Finished training:  501  Acc:  0.598800003528595
Iterate  2 Training samples:  200
Finished training:  501  Acc:  0.6995000243186951
Iterate  3 Training samples:  400
Finished training:  501  Acc:  0.739799976348877
Iterate  4 Training samples:  800
Finished training:  501  Acc:  0.7900999784469604
Iterate  5 Training samples:  1600
Finished training:  501  Acc:  0.8418999910354614
Iterate  6 Training samples:  3200
Finished training:  501  Acc:  0.8809999823570251
Iterate  7 Training samples:  6400
Finished training:  501  Acc:  0.8988000154495239
2-CNN+Quadratic
Iterate  1 Training samples:  100
Finished training:  501  Acc:  0.5504000186920166
Iterate  2 Training samples:  200
Finished training:  501  Acc:  0.6757000088691711
Iterate  3 Training samples:  400
Finished training:  501  Acc:  0.7797999978065491
Iterate  4 Training samples:  800
Finished training:  501  Acc:  0.8367000222206116
Iterate  5 Training samples:

KeyboardInterrupt: 

In [None]:
import matplotlib.pyplot as plt

# Load data
file_path = 'trainset_sizes.json'
with open(file_path, 'r') as json_file:
    trainset_sizes = load(json_file)

# Sort results by input dimension (JSON keys are strings)
sorted_dict = dict(sorted((int(key), value) for key, value in trainset_sizes.items()))
x_values = list(sorted_dict.keys())

# For every model, make line plot
for i, name in enumerate(names):

    # Training set sizes found for increasing input dimension
    y_values = [value[i] for value in sorted_dict.values()]

    # Create a line plot
    plt.plot(x_values, y_values, marker='.', linestyle='-', label=name)
    
# Plot graphics
plt.xlabel('Input dimension')
plt.ylabel('Training set size')
plt.title(f'Training set size required to reach {epsilon} test accuracy')
plt.legend()
plt.grid(True)

# Show the plot
plt.show()
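
The Methods section describes plotting the difference in required training samples between the two architectures. One way to compute that gap and estimate its polynomial degree is a linear fit in log-log space, sketched below. The function `gap_and_slope` is illustrative, the dictionary layout mirrors `trainset_sizes.json` (keys are input dimensions, values follow the model order in `names`), and the numbers are synthetic and generated from a known power law. They are not experimental results.

```python
import numpy as np

def gap_and_slope(sizes, cnn_idx=0, fcnn_idx=2):
    """Return input dimensions, FCNN-minus-CNN gap and its log-log slope."""
    dims = np.array(sorted(int(d) for d in sizes))
    gap = np.array([sizes[str(d)][fcnn_idx] - sizes[str(d)][cnn_idx]
                    for d in dims], dtype=float)
    # Slope of log(gap) against log(d) estimates the polynomial degree;
    # this requires a strictly positive gap at every dimension
    slope, _ = np.polyfit(np.log(dims), np.log(gap), 1)
    return dims, gap, slope

# Synthetic sizes: the FCNN entry grows like d**1.5, the CNN entry is constant
synthetic = {}
for k in range(4, 14):
    d = 3 * k * k
    synthetic[str(d)] = [100, 120, 100 + round(d ** 1.5), 150 + round(d ** 1.5)]

dims, gap, slope = gap_and_slope(synthetic)
print(round(float(slope), 2))  # close to the exponent used to generate the data
```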

1. [Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?](https://arxiv.org/abs/2010.08515) Zhiyuan Li, Yi Zhang, Sanjeev Arora, 2021