# MNIST, Hello world

## Get Data

Kaggle's Digit Recognizer competition (https://www.kaggle.com/c/digit-recognizer) provides three files:

1. train.csv
2. test.csv
3. sample_submission.csv



## Import Data Into SQLite DB

Instead of loading data from csv files, I am going to load it from a SQLite database.

    sqlite3 mnist.db
    sqlite> .import --csv train.csv train
    sqlite> select count(1) from train;
    42000
    sqlite> select 0.2 * 42000;
    8400.0
    sqlite> create temp table z as select rowid as id from train order by random() limit 8400;
    sqlite> create table v as select rowid as id, * from train where rowid in z order by random();
    sqlite> create table t as select rowid as id, * from train where rowid not in z order by random();

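I shuffle the images and split them into a training table (`t`, 80%) and a validation table (`v`, 20%). The `id` column keeps each row's original `rowid` in `train`, so every image can be traced back to its source row.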
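
The same split can be sketched from Python with the standard `sqlite3` module. This is a minimal sketch on a toy in-memory table, not the real `mnist.db`; the helper name `split` and the 10-row table are assumptions made for illustration. It materializes the random sample in a temp table, because a view would be re-evaluated (and re-randomized) each time it is referenced, which would let the two tables overlap.

```python
import sqlite3

def split(conn, fraction=0.2):
    """Split table `train` into validation table `v` and training table `t`."""
    n = conn.execute("SELECT count(*) FROM train").fetchone()[0]
    k = int(fraction * n)  # k is an int, so formatting it into the SQL is safe
    # Materialize the sample once; every later query sees the same ids.
    conn.execute(
        "CREATE TEMP TABLE z AS "
        f"SELECT rowid AS id FROM train ORDER BY random() LIMIT {k}")
    conn.execute(
        "CREATE TABLE v AS SELECT rowid AS id, * FROM train "
        "WHERE rowid IN (SELECT id FROM z) ORDER BY random()")
    conn.execute(
        "CREATE TABLE t AS SELECT rowid AS id, * FROM train "
        "WHERE rowid NOT IN (SELECT id FROM z) ORDER BY random()")

# Toy stand-in for the imported train table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train(label TEXT, pixel0 TEXT)")
conn.executemany("INSERT INTO train VALUES (?, ?)",
                 [(str(i % 10), "0") for i in range(10)])
split(conn)

# The two tables should be disjoint and together cover every original row.
v_ids = {row[0] for row in conn.execute("SELECT id FROM v")}
t_ids = {row[0] for row in conn.execute("SELECT id FROM t")}
print(len(v_ids), len(t_ids), v_ids & t_ids)
```

Checking that `v` and `t` share no `id` is a cheap guard against validation images leaking into the training set.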

The columns created by the import all have the TEXT type.

    sqlite> .schema train
    CREATE TABLE train(
      "label" TEXT,
      "pixel0" TEXT,
      "pixel1" TEXT,
      ...
      "pixel782" TEXT,
      "pixel783" TEXT
    );  

I need to remember to convert the values from strings to ints when reading the data. The alternative would be to create the table with the correct column types before importing, but I will leave that approach for another day.
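
The typed-table alternative could look roughly like the following. This is only a sketch under assumptions: it uses Python's `csv` and `sqlite3` modules instead of the `.import` command, the helper name `import_typed` is made up for illustration, and a two-pixel CSV stands in for `train.csv`. Because the columns are declared `INTEGER`, SQLite's type affinity converts the numeric strings read from the CSV into integers on insert.

```python
import csv, io, sqlite3

def import_typed(conn, table, csv_file):
    """Create `table` with INTEGER columns named after the CSV header, then load the rows."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Column names come from the (trusted) CSV header and are quoted as identifiers.
    cols = ", ".join(f'"{name}" INTEGER' for name in header)
    conn.execute(f'CREATE TABLE "{table}"({cols})')
    marks = ", ".join("?" for _ in header)
    # The reader yields the remaining rows as lists of strings.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)

# Hypothetical miniature of train.csv: a label and two pixels per row.
sample = io.StringIO("label,pixel0,pixel1\n7,0,255\n3,12,0\n")
conn = sqlite3.connect(":memory:")
import_typed(conn, "train", sample)

# typeof() reports the storage class SQLite actually used for each value.
rows = conn.execute("SELECT label, pixel1, typeof(pixel1) FROM train").fetchall()
print(rows)
```

With a table like this, the `int(...)` conversions on the Python side would no longer be needed.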

I also import the test dataset:

    sqlite> .import --csv test.csv test

## Dataset

    import contextlib, sqlite3, torch

    class DbDataset(torch.utils.data.IterableDataset):
        """Streams the rows of a SQLite table, fetching `limit` rows per query."""

        def __init__(self, db, table, transform, limit=10000):
            self.db, self.limit = db, limit
            self.transform = transform
            # table is interpolated into the SQL, so it must be a trusted name
            self.select = f"SELECT * FROM {table} LIMIT ? OFFSET ?"

        def __iter__(self):
            # Start at offset 0 on every pass so the dataset can be iterated again
            offset = 0
            while True:
                with contextlib.closing(sqlite3.connect(self.db)) as c:
                    buf = c.execute(self.select, (self.limit, offset)).fetchall()
                if not buf:
                    break
                offset += self.limit
                yield from map(self.transform, buf)

    def trainTransform(t):
        # Rows of t and v are (id, label, pixel0..pixel783), all stored as TEXT
        y = int(t[1])
        X = torch.as_tensor(list(map(int, t[2:])), dtype=torch.float) / 255.
        return (X, y)

    def testTransform(t):
        # Rows of test are (pixel0..pixel783), with no id or label
        return torch.as_tensor(list(map(int, t)), dtype=torch.float) / 255.
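
The paging pattern that `DbDataset` relies on can be tried without PyTorch. The sketch below is a miniature built on assumptions: the generator name `paged_rows` and the five-row table are hypothetical, and it keeps one connection open instead of reconnecting for every page. It issues `LIMIT ? OFFSET ?` queries until a page comes back empty. An explicit `ORDER BY rowid` is added so that consecutive pages are guaranteed to line up, since SQL does not promise a row order without one.

```python
import sqlite3

def paged_rows(conn, table, limit):
    """Yield all rows of `table`, fetching at most `limit` rows per query."""
    offset = 0
    while True:
        page = conn.execute(
            f'SELECT * FROM "{table}" ORDER BY rowid LIMIT ? OFFSET ?',
            (limit, offset)).fetchall()
        if not page:
            break  # an empty page means the table is exhausted
        yield from page
        offset += limit

# Hypothetical five-row table, read two rows at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(id INTEGER, label TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, str(i % 10)) for i in range(5)])

rows = list(paged_rows(conn, "t", limit=2))
print(rows)
```

Only one page is held in memory at a time, which is the point of reading from the database in chunks rather than loading the whole table.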
            

## Simplest Model

    torch.manual_seed(42)
    model = torch.nn.Linear(28 * 28, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

I trained this model for 10 epochs. The final accuracy on the validation set is ~0.9, and the Kaggle score is also ~0.9.

## Accuracy

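    def accuracy():
        # Fraction of validation images (table v) that the model classifies correctly
        ds = DbDataset("mnist.db", "v", trainTransform)
        dl = torch.utils.data.DataLoader(ds, batch_size=256)
        c, total = 0, 0
        with torch.no_grad():  # gradients are not needed for evaluation
            for i, target in dl:
                r = torch.argmax(model(i), dim=1) == target
                c += int(torch.sum(r))
                total += r.shape[0]
        return c / total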

## Optimization

    def epoch():
        # One pass of mini-batch SGD over the training table t
        ds = DbDataset("mnist.db", "t", trainTransform)
        dl = torch.utils.data.DataLoader(ds, batch_size=256)
        for i, target in dl:
            optimizer.zero_grad()
            output = model(i)
            loss = loss_fn(output, target)
            loss.backward()
            optimizer.step()

    def main():
        # Train for 10 epochs, printing the validation accuracy after each one
        for _ in range(10):
            epoch()
            print(accuracy())

## Save results

    import csv

    def write_results():
        ds = DbDataset("mnist.db", "test", testTransform)
        with torch.no_grad():  # inference only
            r = [int(torch.argmax(model(i))) for i in ds]
        # newline='' is what the csv module expects; it avoids blank rows on Windows
        with open('subm.csv', 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=['ImageId', 'Label'])
            writer.writeheader()
            # ImageId is 1-based
            for i, j in enumerate(r, start=1):
                writer.writerow({'ImageId': i, 'Label': j})