In [1]:
import torch
import sklearn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.linear_model import SGDClassifier
import struct
import gzip

### Reading MNIST data (crash course)

In [2]:
with gzip.open('./data/train-images-idx3-ubyte.gz','rb') as f:
    print(f.read(4))
    print(f.read(4))
    print(f.read(4))
    print(f.read(4))
    print(f.read(4))
    print(f.read(4))

b'\x00\x00\x08\x03'
b'\x00\x00\xea`'
b'\x00\x00\x00\x1c'
b'\x00\x00\x00\x1c'
b'\x00\x00\x00\x00'
b'\x00\x00\x00\x00'


A byte is 8 bits, so its value ranges from 0 to 255.  
Bytes are usually written in hexadecimal (base 16) because it is compact and each hex digit corresponds to exactly 4 bits.  
The values 0-255 in hexadecimal are 0x00 - 0xFF.  
So each byte (8 bits) is represented by two hexadecimal digits.
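
As a quick check of the raw bytes printed above, the four-byte chunks can be converted to integers by hand. This is a minimal sketch using only built-in Python; the variable names `chunks` and `values` are illustrative, and the bytes are read with the most significant byte first.

```python
# The first four 4-byte chunks printed from the training image file.
chunks = [b'\x00\x00\x08\x03', b'\x00\x00\xea`', b'\x00\x00\x00\x1c', b'\x00\x00\x00\x1c']

# Python shows printable bytes as ASCII, so the backtick is simply the byte 0x60.
assert b'\xea`' == b'\xea\x60'

# Interpret each chunk as an unsigned integer, most significant byte first.
values = [int.from_bytes(chunk, byteorder='big') for chunk in chunks]
print(values)  # [2051, 60000, 28, 28]

# Each byte maps to exactly two hexadecimal digits.
print(chunks[1].hex())  # 0000ea60
```

The four values match the magic number, image count, row count, and column count of the training set.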

The structure below describes the MNIST training image file.

```text
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel
```

The pixel data starts at byte offset 16; the 16 bytes before it are metadata.  
We show below how to read this.  

In [3]:
# https://stackoverflow.com/questions/39969045/parsing-yann-lecuns-mnist-idx-file-format
# http://yann.lecun.com/exdb/mnist/

with gzip.open('./data/train-images-idx3-ubyte.gz','rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    nrows, ncols = struct.unpack(">II", f.read(8))
    X_train = np.frombuffer(f.read(), dtype=np.dtype(np.uint8).newbyteorder('>'))
    X_train = X_train.reshape((size, nrows, ncols))

with gzip.open('./data/train-labels-idx1-ubyte.gz','rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    y_train = np.frombuffer(f.read(), dtype=np.dtype(np.uint8).newbyteorder('>'))
    
with gzip.open('./data/t10k-images-idx3-ubyte.gz','rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    nrows, ncols = struct.unpack(">II", f.read(8))
    X_test = np.frombuffer(f.read(), dtype=np.dtype(np.uint8).newbyteorder('>'))
    X_test = X_test.reshape((size, nrows, ncols))
    
with gzip.open('./data/t10k-labels-idx1-ubyte.gz','rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    y_test = np.frombuffer(f.read(), dtype=np.dtype(np.uint8).newbyteorder('>'))

[ChatGPT]

In the given code snippet, the `struct.unpack` function is used to extract data from a binary file according to a specified format. The `struct` module in Python converts between Python values and C structs represented as Python `bytes` objects, so it can pack and unpack data in a binary format.

Let's break down the usage of `struct.unpack` in the code:

1. `magic, size = struct.unpack(">II", f.read(8))`
   Here, `f.read(8)` reads 8 bytes from the file object `f`. The format string `">II"` specifies the format of the data to be unpacked. `">"` indicates big-endian byte order, and `I` represents an unsigned integer of size 4 bytes. Therefore, `struct.unpack(">II", f.read(8))` reads 8 bytes from the file, interprets the first 4 bytes as `magic` and the next 4 bytes as `size`, and assigns the unpacked values to the variables `magic` and `size`.

2. `nrows, ncols = struct.unpack(">II", f.read(8))`
   This line is similar to the previous one. It reads another 8 bytes from the file and interprets the first 4 bytes as `nrows` and the next 4 bytes as `ncols`.

3. `data = np.frombuffer(f.read(), dtype=np.dtype(np.uint8).newbyteorder('>'))`
   Here, `f.read()` reads the remaining data from the file (the cell above stores the result in `X_train`, `y_train`, `X_test`, or `y_test` rather than `data`). The `frombuffer` function from NumPy creates an array directly from the binary data. The data type is `np.uint8`, an 8-bit unsigned integer. The `newbyteorder('>')` call requests big-endian byte order, but since each value is a single byte, byte order has no effect here and plain `np.uint8` gives the same result.

4. `data = data.reshape((size, nrows, ncols))`
   Finally, the array is reshaped using the dimensions obtained from the header. The result is a 3-dimensional array of `size` images, each `nrows` pixels high and `ncols` pixels wide.
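
The four steps above can be tried without the MNIST files on a tiny synthetic buffer. The sketch below assumes a made-up file of 2 images with 2 rows and 3 columns each; `fake_idx` and its contents are invented for illustration, and only the header layout follows the table shown earlier.

```python
import struct
import numpy as np

# Build a tiny fake IDX image buffer: 2 images of 2 rows x 3 columns (made-up data).
header = struct.pack(">IIII", 2051, 2, 2, 3)  # magic, count, rows, cols
pixels = bytes(range(12))                      # 2 * 2 * 3 = 12 pixel bytes
fake_idx = header + pixels

# Steps 1 and 2: unpack the 16-byte header as four big-endian unsigned 32-bit ints.
magic, size = struct.unpack(">II", fake_idx[0:8])
nrows, ncols = struct.unpack(">II", fake_idx[8:16])

# Step 3: interpret the remaining bytes as unsigned 8-bit integers.
data = np.frombuffer(fake_idx[16:], dtype=np.uint8)

# Step 4: reshape the flat array to (images, rows, columns).
data = data.reshape((size, nrows, ncols))

print(magic, size, nrows, ncols)  # 2051 2 2 3
print(data.shape)                 # (2, 2, 3)
print(data[1].tolist())           # [[6, 7, 8], [9, 10, 11]]
```

Plain `np.uint8` is used here because byte order only matters for values wider than one byte.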

[ChatGPT]

There are two common types of endianness: big-endian and little-endian.

Big-endian: In big-endian systems, the most significant byte (the byte containing the highest order bits) is stored at the lowest memory address, while the least significant byte is stored at the highest memory address. This means that the data is stored from left to right, with the most significant byte first.

Little-endian: In little-endian systems, the least significant byte is stored at the lowest memory address, while the most significant byte is stored at the highest memory address. The data is stored from right to left, with the least significant byte first.

To understand this concept, let's consider a 4-byte integer value 0x12345678. Here's how it would be stored in memory based on the endianness:

Big-endian:

Memory Address: 0x00 0x01 0x02 0x03  
Data Value: 0x12 0x34 0x56 0x78

Little-endian:

Memory Address: 0x00 0x01 0x02 0x03  
Data Value: 0x78 0x56 0x34 0x12

Note that the endianness affects the ordering of bytes, but not the individual bits within each byte.

Endianness is important when data is shared between systems or when reading data from a binary file format. It's essential to ensure that both the sender and receiver of data interpret the bytes in the correct order to avoid data corruption or misinterpretation. Most modern systems, including x86 and (in their usual configuration) ARM processors, are little-endian. Big-endian order is still common in network protocols, where it is known as network byte order, in some file formats such as the IDX format used by MNIST, and in some older hardware architectures.
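
The 0x12345678 example can be reproduced with the `struct` module already used in this notebook. This is a small sketch rather than anything MNIST-specific; the variable names are illustrative.

```python
import struct
import sys

value = 0x12345678

big = struct.pack(">I", value)     # most significant byte first
little = struct.pack("<I", value)  # least significant byte first

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# Reading bytes with the wrong byte order silently yields a different number.
wrong = struct.unpack("<I", big)[0]
print(hex(wrong))    # 0x78563412

# Byte order of the machine running this code.
print(sys.byteorder)
```

This is why the header reads above use the `">II"` format: without the explicit `>`, a little-endian machine would decode the image count and dimensions incorrectly.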

In [4]:
# mask = np.random.uniform(low=0.0, high=1.0, size=len(y_train)) < .9

# X_train_ = X_train[mask]
# y_train_ = y_train[mask]

# X_val = X_train[~mask]
# y_val = y_train[~mask]

# print(X_train_.shape, y_train_.shape)
# print(X_val.shape, y_val.shape)

In [5]:
X_train_flattened = X_train.reshape(len(X_train), -1)
X_test_flattened = X_test.reshape(len(X_test), -1)

### Logistic Regression

In [6]:
%%time
classifier_sgd = SGDClassifier(loss='log_loss')
classifier_sgd.fit(X_train_flattened, y_train)

CPU times: total: 1min 21s
Wall time: 1min 35s


In [7]:
%%time
y_pred = classifier_sgd.predict(X_test_flattened)
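# Note: the documented order is classification_report(y_true, y_pred); with the
# arguments swapped as here, the precision and recall columns trade places.
print(classification_report(y_pred, y_test))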

              precision    recall  f1-score   support

           0       0.95      0.97      0.96       958
           1       0.97      0.98      0.97      1131
           2       0.86      0.89      0.87       995
           3       0.81      0.91      0.86       897
           4       0.88      0.93      0.91       934
           5       0.90      0.75      0.82      1075
           6       0.94      0.91      0.93       989
           7       0.86      0.96      0.91       916
           8       0.87      0.70      0.78      1200
           9       0.81      0.90      0.85       905

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.88     10000

CPU times: total: 0 ns
Wall time: 44.5 ms
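
One caveat when reading the report above: scikit-learn documents the signature as `classification_report(y_true, y_pred)`, while the cell passes the predictions first. The sketch below uses plain Python and made-up labels to show what that does to the per-class numbers. The helper `precision_recall` is hypothetical and not part of scikit-learn.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # TP + FP
    actual = sum(t == label for t in y_true)     # TP + FN
    return tp / predicted, tp / actual

# Made-up labels: class 1 appears 4 times in the truth and is predicted 2 times.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]

print(precision_recall(y_true, y_pred, 1))  # (1.0, 0.5)
print(precision_recall(y_pred, y_true, 1))  # (0.5, 1.0)
```

Swapping the arguments swaps precision and recall for each class, and the support column then counts predictions rather than true labels. Per-class F1 and overall accuracy are symmetric in the two arguments, so those values are unaffected.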


### XGB

In [8]:
%%time
classifier_xgb = XGBClassifier()
classifier_xgb.fit(X_train_flattened, y_train)

CPU times: total: 33min 17s
Wall time: 2min 11s


In [9]:
%%time
y_pred = classifier_xgb.predict(X_test_flattened)
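# Note: the documented order is classification_report(y_true, y_pred); with the
# arguments swapped as here, the precision and recall columns trade places.
print(classification_report(y_pred, y_test))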

              precision    recall  f1-score   support

           0       0.99      0.98      0.98       993
           1       0.99      0.99      0.99      1136
           2       0.98      0.98      0.98      1033
           3       0.98      0.98      0.98      1013
           4       0.98      0.98      0.98       973
           5       0.98      0.98      0.98       886
           6       0.98      0.98      0.98       955
           7       0.97      0.98      0.98      1023
           8       0.98      0.98      0.98       972
           9       0.97      0.97      0.97      1016

    accuracy                           0.98     10000
   macro avg       0.98      0.98      0.98     10000
weighted avg       0.98      0.98      0.98     10000

CPU times: total: 1.58 s
Wall time: 86.5 ms


In [None]:
# explain f1
# torch example (dataset, model, gpu, training loop); (conv)
# explain relations
# explain autograd and dag; similarity to sgd [make sgd example]
# show adam optim
# relu and activation functions
# batch norm

# monitor val loss (see sgd)