In [1]:
import pandas as pd
import math

# Capacity Requirements

This notebook predicts the maximum memory capacity requirement and the expected capacity requirement of the Fashion MNIST dataset.

In [2]:
# Loading Fashion MNIST dataset
fashion_mnist = pd.read_csv('data/train.csv')
fashion_mnist.head()

Unnamed: 0,label,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,6,0,0,0,0,0,0,0,5,0,...,0,0,0,30,43,0,0,0,0,0
3,0,0,0,0,1,2,0,0,0,0,...,3,0,0,0,0,1,0,0,0,0
4,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In the Fashion MNIST dataset, the first column is the label (0 to 9), and the remaining 784 columns are the 784 pixels of a 28 x 28 grayscale image. There are 60,000 training samples in total.
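As a minimal sketch of the layout described above, the following reshapes one flat row of 784 pixel values into a 28 x 28 grid. The helper name `row_to_image` and the synthetic row are illustrative assumptions, not part of the dataset or of this notebook's code, and the sketch assumes the pixels are stored row by row.

```python
def row_to_image(pixels, side=28):
    """Split a flat list of side*side pixel values into `side` rows.

    Assumes row-major order (pixel1..pixel28 form the first image row).
    """
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(pixels)}")
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


# Synthetic row for illustration only: position k holds the value k % 256.
flat_row = [k % 256 for k in range(784)]
image = row_to_image(flat_row)
print(len(image), len(image[0]))  # 28 28
```

With a real row from the dataframe, the label column would be dropped first so that only the 784 pixel values are passed in.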

## Algorithm 1

The following algorithm is the same as the one presented in [nntailoring capacity requirement](https://github.com/fractor/nntailoring/blob/master/capacityreq/capacityreq.py). It is the algorithm from chapter 9 and lectures 6 and 7.

![algorithm 1](images/MEC-algo.png)
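To make the threshold-counting idea concrete, here is a minimal pure-Python sketch on a toy dataset. The function name `count_thresholds` and the toy points are assumptions made for illustration; the sketch only mirrors the sum, sort, and count steps, and counts the first point as a threshold.

```python
def count_thresholds(points):
    """Count label changes along the points ordered by feature sum.

    points: list of (label, features) pairs.
    The first point always counts as one threshold.
    """
    # Steps 1 and 2: sum each point's features and sort by that sum.
    ordered = sorted(points, key=lambda p: sum(p[1]))

    # Step 3: walk the sorted labels and count every change of class.
    thresholds = 0
    current = None
    for label, _ in ordered:
        if label != current:
            current = label
            thresholds += 1
    return thresholds


toy = [
    (0, [0, 1]),  # feature sum 1
    (1, [5, 5]),  # feature sum 10
    (0, [1, 1]),  # feature sum 2
    (1, [2, 2]),  # feature sum 4
    (0, [3, 4]),  # feature sum 7
]
print(count_thresholds(toy))  # 4
```

Sorted by feature sum, the labels read 0, 0, 1, 0, 1, which gives four thresholds: the initial class plus three changes.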

In [12]:
def capacity_req(df):
    '''
    Uses a dummy network to compute the max/expected capacity requirement of a dataset.
    Input: df - pandas dataframe
        The first column of df holds the labels and the remaining columns hold the features.
        Each row in the dataframe is a data point.
    '''
    
    # Input dimensions, number of points, and number of classes
    input_dims = len(df.columns) - 1  # the first column is label
    num_rows = len(df)                # number of data points
    num_classes = len(df['label'].unique())
       
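    # Step 1
    # For every data point x[i] in the dataframe, sum all of its features,
    # i.e. compute sum(x[i][d]) over all dimensions d.
    # Caution: this adds a 'dim_sum' column to df in place, so a second call
    # on the same dataframe would count 'dim_sum' as an extra feature.
    df['dim_sum'] = df.iloc[:,1:].sum(axis=1)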
    
    # Step 2: sort the table by each data point's dimension sum
    sorted_df = df.sort_values(by='dim_sum')
    
    # Step 3: loop over the sorted table and count the number of thresholds (label changes)
    c = -1 
    threshold = 0
    for label in sorted_df['label']:
        if label != c:
            c = label
            threshold += 1
            
    # Max Capacity Requirement
    # The input layer has `threshold` nodes, each with input_dims + 1 parameters
    # The output layer needs `threshold` bits of information as well
    max_cap_req = threshold * (input_dims + 1) + threshold
    
    # Expected Capacity Requirement
    # Follow the formula in the algorithm
    print(math.log(threshold + 1, 2))
    print(input_dims)
    print(math.log(threshold + 1, 2) * input_dims)
    exp_cap = math.ceil(math.log(threshold + 1, 2) * input_dims)
    
    print("Input Dimensions:", input_dims, ", Number of data points:", num_rows)
    print("Number of Thresholds:", threshold, "bits")
    print("Max Capacity Requirement:", max_cap_req, "bits")
    print("Expected Capacity:", exp_cap, "bits")

In [13]:
capacity_req(fashion_mnist)

15.587367335163968
785
12236.083358103715
Input Dimensions: 785 , Number of data points: 60000
Number of Thresholds: 49233 bits
Max Capacity Requirement: 38746371 bits
Expected Capacity: 12237 bits
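As a quick arithmetic check, the reported figures can be reproduced from the printed threshold count and input dimension alone. This sketch takes the two printed values as given and uses `math.log2` in place of `math.log(x, 2)`; the variable names are chosen here for illustration.

```python
import math

thresholds = 49233  # "Number of Thresholds" as printed above
input_dims = 785    # "Input Dimensions" as printed above

# Max requirement: one node per threshold with input_dims + 1 parameters,
# plus one output bit per threshold.
max_cap_req = thresholds * (input_dims + 1) + thresholds

# Expected requirement: log2(thresholds + 1) * input_dims, rounded up.
exp_cap = math.ceil(math.log2(thresholds + 1) * input_dims)

print(max_cap_req)  # 38746371
print(exp_cap)      # 12237
```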


## Algorithm 2

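Here we compute the dataset's MEC by treating the dataset as a dictionary (lookup table) with 60,000 rows, where each row stores one of 10 possible labels and therefore needs log2(10) bits.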

In [5]:
def max_capacity_req_dict(df):
    num_rows = len(df)                # number of data points
    num_classes = len(df['label'].unique())
    max_cap_req = math.ceil(num_rows * math.log(num_classes, 2))
    print("Max Capacity Requirement (dict):", max_cap_req)

In [6]:
max_capacity_req_dict(fashion_mnist)

Max Capacity Requirement (dict): 199316
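The same dictionary bound can be sketched without a dataframe, which makes the arithmetic easy to check by hand. The function name `dict_mec_bits` is an assumption for illustration; it takes the row count and class count directly and uses `math.log2`, which is usually more accurate than `math.log(x, 2)`.

```python
import math


def dict_mec_bits(num_rows, num_classes):
    """Bits needed to store one of `num_classes` labels for each of `num_rows` rows."""
    return math.ceil(num_rows * math.log2(num_classes))


print(dict_mec_bits(60000, 10))  # 199316
print(dict_mec_bits(8, 2))       # 8: a binary label costs one bit per row
```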
