<center> <h1> Lecture 6: Embeddings and ML Experiments </h1> </center>
<center> Jillian Fisher, Zaid Harchaoui </center>
    <center> Data 598 (Winter 2023), University of Washington </center>

We will discuss two topics this lecture:
- Embeddings for natural language
- Model Selection with statistical tests

This notebook is inspired by the [D2L book](https://d2l.ai/),  and adapted from lecture material created by Krishna Pillutla.

# Part 1: Embeddings for Natural Language

The field of **natural language processing (NLP)** is concerned with the interaction between computers and natural (human) language. This involves "understanding" the contents of documents, including the contextual nuances of the language within them. 

**Embeddings**:
The use of machine learning for NLP, both in the classical setting and in the modern deep learning era, has relied on *embedding* words in vector spaces.
Words are made of characters, which are combinatorial in nature and have none of the "neighborhood" structure that one expects of vectors in, say, a Euclidean space. 
The magic of embeddings is that they capture some "neighborhood" structure among words, e.g., the embeddings of synonyms are closer together than those of unrelated words. 

![](https://miro.medium.com/max/2400/1*OEmWDt4eztOcm5pr2QbxfA.png)
Image credits: https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8

**Note**: Sometimes, we will work at the level of subword units, rather than words. Mathematically, the same treatment holds irrespective of how we *tokenize* the text. We will refer to these units as *tokens*.


**Types of embeddings**:

- Global (context-free) embeddings: TF-IDF, word2vec, GloVe
- Contextual embeddings: ELMo, BERT, ...

![](http://ai.stanford.edu/blog/assets/img/posts/2020-03-24-contextual/contextual_mouse_transparent_2.png)
Image credit: http://ai.stanford.edu/blog/contextual/


**The history of word embeddings**:
Research started with global (context-free) embeddings; 
later work produced contextual embeddings using deep learning.
$$
\begin{matrix}
\text{TF-IDF}   & \text{word2vec}   &  \text{GloVe}   &   \text{ELMo}     &       \text{BERT}  \\
1972       &2013       &  2014    &   2018     &       2018 
\end{matrix} 
$$



## Constructing Word Embeddings:
We will walk through the different methods of constructing word embeddings and how they differ. 

### Context-free Embeddings
Context-free embeddings represent a word by a fixed (pre-determined) vector, irrespective of the meaning of that word in a particular sentence. 
Therefore, context-free embedding methods always return a single embedding for each word in the vocabulary. 

We have already seen one example of a context-free embedding: Bag of Words. Here we will describe a few other methods. 
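
To make the "one vector per word" idea concrete, here is a minimal sketch of a context-free embedding as a lookup table. The toy vocabulary, the dimension of 4, and the random vectors are illustrative assumptions; they are not a trained embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; real vocabularies are far larger
vocab = ["i", "collected", "my", "paycheck", "at", "the", "bank", "river"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# One row per word: a context-free embedding is a fixed lookup table
embedding_table = rng.normal(size=(len(vocab), 4))

def embed(sentence):
    """Return one vector per (lowercased, whitespace-separated) word."""
    words = sentence.lower().split()
    return np.stack([embedding_table[word_to_index[w]] for w in words])

vectors_1 = embed("I collected my paycheck at the bank")
vectors_2 = embed("the river bank")

# "bank" is the last word of both sentences and gets the same vector in each
print(np.array_equal(vectors_1[-1], vectors_2[-1]))
```

Because the table is indexed by the word alone, the surrounding sentence cannot influence the vector that is returned.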

### **1. Term frequency-inverse document frequency (TF-IDF)**
TF-IDF is a method which combines two metrics to provide a score for each word (in each document). 

It uses the following:

1. *Term frequency (TF)*: the frequency of words in a particular document

$TF_{i,j} = \frac{\text{Term }i \text{ frequency in document }j}{\text{Total num. of terms in document }j}$

2. *Inverse document frequency (IDF)*: the rarity of words in the text

$IDF_{i} = \log \frac{\text{Total documents}}{\text{Num. of documents containing term }i}$ 

$\text{TF-IDF}_{i,j} = TF_{i,j} \times IDF_{i}$

Uses: basic NLP analysis, information retrieval, stop-word removal 

Challenges: does not capture semantic meaning; the vector dimension grows with the vocabulary size, producing large, sparse vectors
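
The two formulas above can be sketched directly in plain Python. This is a minimal illustration that assumes whitespace tokenization and that every queried term occurs in at least one document; the helper names `tf`, `idf`, and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

docs = [
    "i collected my paycheck at the bank".split(),
    "please meet me tomorrow at the river bank".split(),
]

def tf(term, doc):
    # Term frequency: count of the term divided by the document length
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(total documents / documents containing the term)
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_bank = tf_idf("bank", docs[0], docs)          # "bank" appears in every document
score_paycheck = tf_idf("paycheck", docs[0], docs)  # "paycheck" appears only in document 1
print(score_bank, score_paycheck)
```

With these textbook formulas, a word that occurs in every document gets a score of exactly zero. Note that scikit-learn's `TfidfVectorizer` uses a different convention by default: a smoothed IDF, $\ln\frac{1 + n}{1 + \text{df}} + 1$, followed by L2 normalization of each document vector, so its scores for such words are nonzero.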



In [1]:
# Use Scikit-learn for TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd 

sentence1 = "I collected my paycheck at the bank" # "document 1"
sentence2 = "Please meet me tomorrow at the river bank" # "document 2"

corpus = [sentence1, sentence2]

vectorizer = TfidfVectorizer()
tfIdf = vectorizer.fit_transform(corpus)
# Rows are the vocabulary terms (vectorizer.get_feature_names_out()); columns are the documents
df = pd.DataFrame(tfIdf.T.todense(), index=vectorizer.get_feature_names_out(), columns=["TF-IDF Doc1", "TF-IDF Doc2"])
print(df.head(25))


           TF-IDF Doc1  TF-IDF Doc2
at            0.334712     0.278675
bank          0.334712     0.278675
collected     0.470426     0.000000
me            0.000000     0.391668
meet          0.000000     0.391668
my            0.470426     0.000000
paycheck      0.470426     0.000000
please        0.000000     0.391668
river         0.000000     0.391668
the           0.334712     0.278675
tomorrow      0.000000     0.391668


### **2. Word2Vec**
This method was developed at Google in 2013 and was the dominant method until the arrival of Transformer-based models (around 2018). It attempts to incorporate semantic meaning by considering the words surrounding each word (both before and after it). It uses a shallow neural network (input, projection, and output layers) to create the embeddings. It centers on the hypothesis that neighboring words have semantic similarities. But how do we express words being "semantically similar"?

*Cosine Similarity*

In NLP we generally use the "cosine similarity" metric to measure how semantically close two words are. It is the cosine of the angle between the two embedding vectors (equation below) and ranges from $-1$ to $1$. A cosine similarity of 1 means the two vectors point in the same direction (the words are treated as semantically identical), while a value of 0 means the vectors are orthogonal (the words are treated as semantically unrelated). 

$\text{cosine similarity} = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}$
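
As a minimal sketch, the formula can be implemented in a few lines of NumPy. The helper name `cosine_similarity` and the example vectors are illustrative choices.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # perpendicular vectors
opposite = cosine_similarity([1, 2], [-1, -2])            # opposite directions
print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are $1$, $0$, and $-1$. The measure depends only on the direction of the vectors, not on their lengths.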

**Continuous Bag of Words (CBOW)**
Word2Vec uses CBOW to encode the input to its shallow network. 

<img src="https://d2mk45aasx86xg.cloudfront.net/Context_and_current_word_in_CBOW_d75636ea68.webp" width="400">
Image credit: https://www.turing.com/kb/guide-on-word-embeddings-in-nlp

CBOW centers a window (whose size is set by the user) on each word and uses one-hot encodings of the surrounding words as inputs to the shallow network.
The shallow network then tries to predict the current word from the surrounding words. 

<img src="https://d2mk45aasx86xg.cloudfront.net/Continuous_bag_of_word_embedding_in_NLP_7ea4656378.webp" width="400">
Image credit: https://www.turing.com/kb/guide-on-word-embeddings-in-nlp
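
The windowing step of CBOW can be sketched as follows. This only builds the (context, target) training pairs; it does not train the network. The helper name `cbow_pairs` is hypothetical.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]     # up to `window` words before
        right = tokens[i + 1:i + 1 + window]    # up to `window` words after
        pairs.append((left + right, target))
    return pairs

tokens = "please meet me tomorrow at the river bank".split()
pairs = cbow_pairs(tokens, window=2)
for context, target in pairs[:3]:
    print(context, "->", target)
```

Each context word would then be one-hot encoded and fed to the shallow network, which is trained to predict the target. In gensim's `Word2Vec`, the argument `sg=0` (the default) selects CBOW and `sg=1` selects skip-gram.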

Another training method for Word2Vec is skip-gram; see [here](https://www.turing.com/kb/guide-on-word-embeddings-in-nlp) for more details. 

Uses: semantic vectorization of words

Challenges: can be computationally intensive and relies on a linear window for semantics


In [2]:
# Install the nltk and gensim
# Important: make sure pip is installed in your conda environment
# Run "pip install nltk" and "pip install gensim" in your terminal
# Python program to generate word vectors using Word2Vec
 
# importing all necessary modules
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
 
warnings.filterwarnings(action = 'ignore')
 
import gensim
from gensim.models import Word2Vec
 
#  Same data as before
sentence1 = "I collected my paycheck at the bank" # "document 1"
sentence2 = "Please meet me tomorrow at the river bank" # "document 2"

# sent_tokenize automatically separates sentences based on punctuation
corpus = sentence1 +". " +sentence2 +". "

data = []
 
# iterate through each sentence in the file
for i in sent_tokenize(corpus):
    temp = []
     
    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())
 
    data.append(temp)
 
# Create CBOW model
model1 = gensim.models.Word2Vec(data, min_count = 1,
                              vector_size = 100, window = 3)
 
# Print results
print("Length of Word Vector for 'Bank': ", len(model1.wv["bank"]))
print("Cosine similarity between 'paycheck' " +
            "and 'bank': ",
model1.wv.similarity('paycheck', 'bank'))
print("Cosine similarity between 'bank' " +
            "and 'bank': ",
model1.wv.similarity('bank', 'bank'))

print("Top 3 most similar words to 'bank': ", model1.wv.most_similar('bank', topn=3))


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jillianfisher/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Length of Word Vector for 'Bank':  100
Cosine similarity between 'paycheck' and 'bank':  -0.095753424
Cosine similarity between 'bank' and 'bank':  1.0
Top 3 most similar words to 'bank':  [('at', 0.06797593086957932), ('please', 0.033640600740909576), ('tomorrow', 0.009394742548465729)]


## Contextual Embeddings
### **1. Bidirectional Encoder Representations from Transformers (BERT)**
Similar to Word2Vec, BERT uses a model to create the word vectors. However, instead of a simple shallow model, the BERT model used here is a deep neural network with 12 layers (a transformer). We will learn about transformers in depth next week, so for today we will just play with the embeddings from BERT.

**Playing with embeddings**:
We will play with BERT embeddings, a form of contextual embeddings, using the `transformers` library.

BERT and its follow-ups such as RoBERTa provide contextual embeddings that depend on both the left context and the right context of a token. 
On the other hand, GPT-2 and GPT-3 produce representations that only depend on the left context. This comes from how these language models are trained. We will talk about the different pretraining strategies in the lecture.

In [3]:
# Install the transformers library
# Important: make sure pip is installed in your conda environment
# Run "pip install transformers" in your terminal

In [4]:
import torch
import numpy as np
from transformers import BertTokenizer, BertModel

In [5]:
model_name = 'bert-base-uncased'
# Download the pre-trained model + tokenizer (a total of 440 MB)
tokenizer = BertTokenizer.from_pretrained(model_name) # to tokenize the text
model = BertModel.from_pretrained(model_name)  # PyTorch module

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
# Consider these two sentences

sentence1 = "I collected my paycheck at the bank"
sentence2 = "Meet me tomorrow at the river bank"

# Let us tokenize them 
tokens_for_sentence1 = tokenizer.encode(sentence1, return_tensors='pt')
tokens_for_sentence2 = tokenizer.encode(sentence2, return_tensors='pt')

print('Sentence 1:', tokens_for_sentence1)
print('Sentence 2:', tokens_for_sentence2)

print('Sentence 1 Length:', tokens_for_sentence1.shape)
print('Sentence 2 Length:', tokens_for_sentence2.shape)
# the leading 1 is the batch size

Sentence 1: tensor([[ 101, 1045, 5067, 2026, 3477, 5403, 3600, 2012, 1996, 2924,  102]])
Sentence 2: tensor([[ 101, 3113, 2033, 4826, 2012, 1996, 2314, 2924,  102]])
Sentence 1 Length: torch.Size([1, 11])
Sentence 2 Length: torch.Size([1, 9])


In [7]:
# Look at the word pieces that the token IDs correspond to: 
print(tokenizer.convert_ids_to_tokens(tokens_for_sentence1[0]))

['[CLS]', 'i', 'collected', 'my', 'pay', '##che', '##ck', 'at', 'the', 'bank', '[SEP]']


The token ID 2924 corresponds to the word "bank".
Observe now that the contextual embedding of the word "bank" is different in each of the two sentences. 
This would not have been the case for a global embedding. 

In [8]:
outputs1 = model(tokens_for_sentence1,
                return_dict=True)

# Extract contextual embedding for each token
embeddings_for_sentence1 = outputs1.last_hidden_state
print(embeddings_for_sentence1.shape) # [batch_size, num_tokens, dimension]

outputs2 = model(tokens_for_sentence2,
                return_dict=True)

# Extract contextual embedding for each token
embeddings_for_sentence2 = outputs2.last_hidden_state
print(embeddings_for_sentence2.shape) # [batch_size, num_tokens, dimension]


torch.Size([1, 11, 768])
torch.Size([1, 9, 768])


In [9]:
# The last token 102 is a special [SEP]. The second-to-last token 2924 corresponds to "bank".
embedding_for_bank_1 = embeddings_for_sentence1[0, -2, :]
embedding_for_bank_2 = embeddings_for_sentence2[0, -2, :]
print('L2 distance between the embeddings:', 
      torch.norm(embedding_for_bank_1-embedding_for_bank_2).item())


L2 distance between the embeddings: 11.550871849060059


We use the L2 distance here as an example to show that the embeddings of "bank" in each sentence are distinct.
A more commonly used measure of similarity/distance between embeddings is the cosine similarity/distance.

The cosine similarity between two vectors $v_1$ and $v_2$ is defined as:
$$
    S(v_1, v_2) = \frac{v_1^\top v_2}{ \|v_1\| \, \|v_2\|} .
$$

**Exercise**: When $v_1$ and $v_2$ are unit vectors, show that $\|v_1 - v_2\|^2 = 2\big(1 - S(v_1, v_2)\big)$.
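
Before attempting the proof, the identity can be checked numerically. The sketch below uses two random unit vectors; the dimension 768 is chosen only to match the BERT embeddings above, and the check does not replace the derivation asked for in the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors, rescaled to unit length
v1 = rng.normal(size=768)
v2 = rng.normal(size=768)
v1 /= np.linalg.norm(v1)
v2 /= np.linalg.norm(v2)

cosine = v1 @ v2                            # S(v1, v2), since both norms are 1
squared_distance = np.sum((v1 - v2) ** 2)   # ||v1 - v2||^2

print(squared_distance, 2 * (1 - cosine))
```

The two printed numbers agree up to floating-point error, which shows that for unit vectors the L2 distance and the cosine similarity rank pairs of embeddings in the same way.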

## Sentiment Analysis using Embeddings

We will look at the standard NLP task of sentiment analysis. 
Given a piece of text, the goal is to classify it as "positive" or "negative" in sentiment.

![](https://vitalflux.com/wp-content/uploads/2021/10/sentiment-analysis-machine-learning-techniques-640x395.png)
Image credits: https://vitalflux.com/sentiment-analysis-machine-learning-techniques/

Our procedure is as follows:
- We will use a labeled dataset and cast this as a multiclass classification problem.
- We will use BERT embeddings to obtain one vector per token. We will simply take the mean of these vectors as the feature representation of the entire piece of text.
- We will train a simple linear model to predict the output label from these features.


Download the data from [here](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data?select=train.tsv.zip).

We will use movie reviews from Rotten Tomatoes. The sentiment labels are:
- 0 - negative
- 1 - somewhat negative
- 2 - neutral
- 3 - somewhat positive
- 4 - positive

### Load and visualize data

In [10]:
import pandas as pd
filename = './data/train.tsv'
# keep one example per sentence (original data labels each phrase)
data = pd.read_csv(filename, sep='\t').groupby('SentenceId').first()
data = data.drop(columns=['PhraseId'])

print(data.shape)

data.head(4)

(8529, 2)


                                                       Phrase  Sentiment
SentenceId                                                              
1           A series of escapades demonstrating the adage ...          1
2           This quiet , introspective and entertaining in...          4
3           Even fans of Ismail Merchant 's work , I suspe...          1
4           A positively thrilling combination of ethnogra...          3


In [11]:
data.at[3, 'Phrase']

"Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one ."

### Train-test split and featurize

In [12]:
data = data.sample(frac=1)  # shuffle
train_data = data[:1000]
test_data = data[5000:6000]
print(train_data.shape, test_data.shape)

(1000, 2) (1000, 2)


In [13]:
from tqdm.auto import tqdm

@torch.no_grad()
def featurize(x): # x is pd.Series with text
    features = []
    for sen in tqdm(x):
        sen = tokenizer.encode(sen, return_tensors='pt')
        outputs = model(sen, return_dict=True)
        embeddings = outputs.last_hidden_state.squeeze() # (len, dim)
        mean_embedding = embeddings.mean(axis=0)
        features.append(mean_embedding.numpy())
    return np.stack(features)  # (n, dim)


In [14]:
# Takes a few minutes to run
x_train = featurize(train_data['Phrase'])
y_train = train_data['Sentiment'].values

x_test = featurize(test_data['Phrase'])
y_test = test_data['Sentiment'].values


### Train a simple logistic regression classifier to test performance

In [16]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.99, random_state=1).fit(x_train)  # keep 99% of the explained variance
x_train = pca.transform(x_train)
x_test = pca.transform(x_test)

In [17]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, C=0.01).fit(x_train, y_train)

y_train_pred = clf.predict(x_train)
y_test_pred = clf.predict(x_test)

print('Train accuracy:', (y_train_pred == y_train).mean())
print('Test accuracy:', (y_test_pred == y_test).mean())

Train accuracy: 0.543
Test accuracy: 0.426


**Improving this model**: 
In this example, we simply averaged the embeddings of all the tokens
from one example to get an embedding for the entire example. We then trained a linear model. 
Here are two ways to improve this model: 
1. Instead of averaging the embeddings of each token, we could train a recurrent neural network that preserves the sequential nature of the data. We will take a closer look into recurrent networks next week. 
2. The preferred approach today is to attach a prediction layer to the BERT model using the embedding of the [CLS] token (a special token that the tokenizer adds to the start of the sequence). We then fine-tune the entire BERT model on the dataset.
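
The second option can be sketched in PyTorch as follows. To keep the example self-contained, a random tensor stands in for BERT's `last_hidden_state`, and the labels are made up; in actual fine-tuning, the tensor would come from the BERT model and the gradients would flow back into BERT's weights as well.

```python
import torch

torch.manual_seed(0)

batch_size, num_tokens, hidden_dim, num_classes = 4, 11, 768, 5

# Stand-in (assumption) for outputs.last_hidden_state: (batch, tokens, dim)
last_hidden_state = torch.randn(batch_size, num_tokens, hidden_dim)

# Position 0 holds the [CLS] token; its embedding is used as the sequence summary
cls_embedding = last_hidden_state[:, 0, :]            # (batch, dim)

# A linear prediction layer maps the [CLS] embedding to one score per class
classifier_head = torch.nn.Linear(hidden_dim, num_classes)
logits = classifier_head(cls_embedding)               # (batch, num_classes)

# Cross-entropy loss against hypothetical sentiment labels in {0, ..., 4}
labels = torch.tensor([1, 4, 1, 3])
loss = torch.nn.functional.cross_entropy(logits, labels)
print(logits.shape, loss.item())
```

Compared with mean pooling followed by a separate linear classifier, this setup lets the training signal adjust the embeddings themselves.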

# Part 2: Statistical Tests for the Analysis of ML Experiments

In some particular safety-critical applications, it might be necessary to make guarantees of the form 
"*The misclassification error of my classifier is at most 12% on data from the same distribution as our training data*". Think of self-driving cars, for instance. The 12% above is an arbitrary number.

In these cases, we run hypothesis tests to formalize our claims.

**Hypothesis Testing Review**: 
Suppose we want to show that herbal tea helps with migraines. 
In the spirit of "proof by contradiction", 
we assume the opposite to be true and say that herbal tea does not help with migraines. 
If the data looks too "anomalous" under this assumption, we arrive at a 
"contradiction", which means that the data is not consistent with the 
claim that "herbal tea does not help with migraines" (the opposite of what we set out to show).


We have two hypotheses, the null hypothesis (denoted $H_0$; the opposite of what we want to show) and the alternate hypothesis (denoted $H_1$ or $H_a$; what we want to show). 
From looking at the data, we take one of two steps:
- reject the null hypothesis
- "fail" to reject the null hypothesis


**Illustration**: ([credit](https://study.com/cimages/multimages/16/ea0e233d-7bc3-4ba6-a79c-5d8281295985_t_tests.png)): 

Suppose we are given two distributions with means $\mu_1$ and $\mu_2$ respectively. 
We use the difference $\hat\mu_{1, n} - \hat\mu_{2, n}$ between the sample means as the test statistic (TS). Under the null hypothesis $\mu_1 = \mu_2$, its distribution (the bell curve in the plot) is centered at $0$.
![](https://study.com/cimages/multimages/16/ea0e233d-7bc3-4ba6-a79c-5d8281295985_t_tests.png)


Letting $\text{acc}(h)$ denote the classification accuracy of our classification algorithm $h$, we may write this test as follows:
$$
\begin{aligned}
H_0: &\quad \text{acc}(h) \le a_0 \\
H_1: &\quad \text{acc}(h) > a_0 ,
\end{aligned}
$$
where $a_0$ is some pre-specified accuracy.

The outcomes are:
- Reject the null: If our data is convincing enough (i.e., the accuracy on our validation set is significantly larger than $a_0$), we reject the null with a certain level of confidence
- Fail to reject the null: If the validation accuracy is close to or smaller than $a_0$, we say that we do not have strong enough evidence to reject the null (default) hypothesis. 

**The $t$-Test to Assess Classification**:

Suppose we have $K$ training-validation pairs. For each pair, we train on the training set and record the 
validation accuracy, giving $A_1, \cdots, A_K$. 
The empirical mean and variance are:
$$
    m = \frac{1}{K} \sum_{k=1}^K A_k\,, \quad S^2 = \frac{1}{K-1} \sum_{k=1}^K (A_k - m)^2 \,.
$$
The test statistic is then
$$
    T_K := \frac{\sqrt{K}(m - a_0)}{S} .
$$
Under the assumption that the training-validation pairs are independent and that the accuracies $A_k$ are (approximately) normally distributed with mean $a_0$, 
it turns out that $T_K$ is distributed according to the Student $t$ distribution with $K-1$ degrees of freedom. 

In this case, we reject the null with a level of significance $\alpha$ if 
$$
    T_K > t_{K-1, \alpha},
$$
where $t_{K-1, \alpha}$ denotes the $(1-\alpha)$-quantile of the $t_{K-1}$ distribution. 

That is, we reject the null if 
$$
    m > a_0 + \frac{S}{\sqrt{K}} t_{K-1, \alpha} \,.
$$
Observe what happens as $K$ grows or $\alpha$ becomes smaller. 
![](https://lh3.googleusercontent.com/proxy/Rk0TX6KUcZLaFgMU42Qr553ALEHXt1YRIoZRIZfaoTMp69H5UcESVWmj3C-qE1NgtSUyngFqUx-v_O9__tzq29yUeZ3OKcmwVbby2bJ5neKzkBBFGzJhQzR9U0rWxL3kEYYV7ieeZh8hvCfLffhyP2AghYESkJqOa7fg5qAj)
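
The effect of $K$ and $\alpha$ on the rejection threshold $a_0 + \frac{S}{\sqrt{K}} t_{K-1, \alpha}$ can be explored numerically. In this sketch, the values of $a_0$ and $S$ are illustrative, and $S$ is held fixed as $K$ varies (an assumption; in practice $S$ is re-estimated from the data). The helper name `rejection_threshold` is hypothetical.

```python
import numpy as np
import scipy.stats

def rejection_threshold(a_0, s, K, alpha):
    """Smallest mean accuracy that rejects H_0: a_0 + s / sqrt(K) * t_{K-1, alpha}."""
    t_quantile = scipy.stats.t(df=K - 1).ppf(1 - alpha)  # (1 - alpha)-quantile
    return a_0 + s / np.sqrt(K) * t_quantile

a_0, s = 0.87, 0.0025  # illustrative values

for K in [5, 10, 50]:
    print(f"K={K:3d}, alpha=0.05: reject if m > {rejection_threshold(a_0, s, K, 0.05):.5f}")
for alpha in [0.10, 0.05, 0.01]:
    print(f"K=  5, alpha={alpha:.2f}: reject if m > {rejection_threshold(a_0, s, 5, alpha):.5f}")
```

The threshold moves toward $a_0$ as $K$ grows (more evidence makes smaller improvements detectable) and moves away from $a_0$ as $\alpha$ shrinks (a stricter test demands a larger margin).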

The significance level $\alpha$ is the type-I error rate: the probability of rejecting the null hypothesis when it is true. 
The type-II error rate is the probability of failing to reject the null when the alternate is true; the *power* of the test is one minus the type-II error rate. 
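
The meaning of the type-I error rate can be illustrated with a small simulation. The sketch below draws accuracies from a normal distribution whose mean is exactly $a_0$ (the boundary of the null hypothesis) and counts how often the $t$-test rejects anyway. The normal model, its standard deviation, and the number of trials are assumptions made for illustration.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

alpha, K, a_0 = 0.05, 5, 0.87
n_trials = 20_000
threshold = scipy.stats.t(df=K - 1).ppf(1 - alpha)

# Simulate K accuracies per trial under the null: the true mean equals a_0
A = rng.normal(loc=a_0, scale=0.003, size=(n_trials, K))
m = A.mean(axis=1)
S = A.std(axis=1, ddof=1)
T = np.sqrt(K) * (m - a_0) / S

# Fraction of trials in which the null is (incorrectly) rejected
type_1_rate = np.mean(T > threshold)
print(f"Empirical type-I error rate: {type_1_rate:.3f}")
```

The empirical rejection rate should be close to the chosen $\alpha$. Repeating the simulation with a true mean above $a_0$ would instead estimate the power of the test.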

**Illustration**: What is the null hypothesis here?
![](https://qph.fs.quoracdn.net/main-qimg-a25c9f17379bd7b94719a77686dfb519)
Image source: https://effectsizefaq.com/2010/05/31/i-always-get-confused-about-type-i-and-ii-errors-can-you-show-me-something-to-help-me-remember-the-difference/


### The $t$-test in action
Let us assess the accuracy of one of the ConvNets we saw in Week 2. We will construct $5$ different training-validation splits of the data. 
We will test for the following:

$$
\begin{aligned}
H_0: &\quad \text{accuracy} \le 0.87 \\
H_1: &\quad \text{accuracy} > 0.87
\end{aligned}
$$

We will use a significance level of $\alpha = 0.05$.

In [18]:
import numpy as np
import torch
from torchvision.datasets import MNIST, FashionMNIST
from torch.nn.functional import cross_entropy
import time
import scipy.stats

import matplotlib.pyplot as plt 
%matplotlib inline 

torch.manual_seed(0)
np.random.seed(1)

Download the FashionMNIST dataset and divide it into 5 train-val pairs.

In [19]:
train_dataset = FashionMNIST('./data', train=True, download=True)
X_train = train_dataset.data # torch tensor of type uint8
y_train = train_dataset.targets # torch tensor of type Long

X_train = X_train.float()  # convert to float32
X_train = X_train.view(-1, 784)
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean[None, :]) / (std[None, :] + 1e-6)  # avoid divide by zero



# shuffle the data
idxs = np.random.permutation(X_train.shape[0])
size = X_train.shape[0]//10

Xs = [] 
ys = []
for i in range(10): # 10 disjoint chunks, later used as 5 train-val pairs
    subsample_idxs = idxs[i*size : (i+1)*size]
    X = X_train[subsample_idxs]
    y = y_train[subsample_idxs]
    Xs.append(X)
    ys.append(y)




Now we write the model and our helper functions

In [20]:
class ConvNet(torch.nn.Module):
    def __init__(self,num_classes=10):
        super().__init__()
        self.conv_ensemble_1 = torch.nn.Sequential(
            torch.nn.Conv2d(1, 16, kernel_size=5, padding=2),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2))
        self.conv_ensemble_2 = torch.nn.Sequential(
            torch.nn.Conv2d(16, 32, kernel_size=5, padding=2),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2))
        self.fc = torch.nn.Linear(7*7*32, num_classes)
        
    def forward(self, x):
        x = x.view(-1, 1, 28, 28)
        out = self.conv_ensemble_1(x)
        out = self.conv_ensemble_2(out)
        out = out.view(out.shape[0], -1)
        out = self.fc(out)
        return out
    
# Some utility functions to compute the objective and the accuracy
def compute_objective(model, X, y):
    score = model(X)
    # PyTorch's function cross_entropy computes the multinomial logistic loss
    return cross_entropy(input=score, target=y, reduction='mean') 

@torch.no_grad()
def compute_accuracy(model, X, y):
    score = model(X)
    predictions = torch.argmax(score, axis=1)  # class with highest score is predicted
    return (predictions == y).sum() * 1.0 / y.shape[0]

def sgd_one_pass(model, X, y, learning_rate, verbose=False):
    num_examples = X.shape[0]
    average_loss = 0.0
    for i in range(num_examples):
        idx = np.random.choice(X.shape[0])
        # compute the objective. 
        # Note: This function requires X to be of shape (n,d). In this case, n=1 
        objective = compute_objective(model, X[idx:idx+1], y[idx:idx+1]) 
        average_loss = 0.99 * average_loss + 0.01 * objective.item()
        if verbose and (i+1) % 100 == 0:
            print(average_loss)
        
        # compute the gradient using automatic differentiation
        gradients = torch.autograd.grad(outputs=objective, inputs=model.parameters())
        
        # perform SGD update. IMPORTANT: Make the update inplace!
        for (w, g) in zip(model.parameters(), gradients):
            w.data -= learning_rate * g.data
      
    
from tqdm.auto import trange # range + progress bar
def sgd_n_passes(X_train, y_train, X_val, y_val, n_passes, learning_rate):
    model = ConvNet()
    for i in trange(n_passes):
        sgd_one_pass(model, X_train, y_train, learning_rate)
    return compute_accuracy(model, X_val, y_val)

In [21]:
accuracies = []
for i in range(5):
    print(f'Starting run {i+1}')
    X_train, y_train = Xs[2*i], ys[2*i]
    X_val, y_val = Xs[2*i+1], ys[2*i+1]
    acc = sgd_n_passes(X_train, y_train, X_val, y_val, n_passes=30, learning_rate=2.5e-3)
    accuracies.append(acc)

Starting run 1
Starting run 2
Starting run 3
Starting run 4
Starting run 5

In [22]:
# Print accuracies
accuracies = np.asarray(accuracies)
accuracies

array([0.879     , 0.87666667, 0.8756667 , 0.8765    , 0.8721667 ],
      dtype=float32)

Now we run the test. 

In [23]:
alpha = 0.05  # significance level
a_0 = 0.87 # accuracy level we are testing for
K = accuracies.shape[0]
m = np.mean(accuracies)
s = np.std(accuracies, ddof=1)  # divide by K-1

# Compute the test statistic
T =  np.sqrt(K) * (m - a_0) / s
threshold = scipy.stats.t(df=K-1).ppf(1-alpha)  # 1-alpha quantile of t_{K-1}

print(f'Test statistic: {T}\t threshold: {threshold}')

if T > threshold:
    print('Reject the null')
else:
    print('Fail to reject the null')

Test statistic: 5.421106851044113	 threshold: 2.13184678133629
Reject the null
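
As a cross-check of the manual computation, SciPy's one-sample $t$-test gives the same statistic and also reports a $p$-value. This is a sketch that copies the five accuracies printed above; the `alternative='greater'` argument selects the one-sided test and requires a reasonably recent SciPy (1.6 or later).

```python
import numpy as np
import scipy.stats

# Validation accuracies copied from the run above
accuracies = np.array([0.879, 0.87666667, 0.8756667, 0.8765, 0.8721667])

# One-sided test of H_0: accuracy <= 0.87 against H_1: accuracy > 0.87
result = scipy.stats.ttest_1samp(accuracies, popmean=0.87, alternative='greater')
print(f'Test statistic: {result.statistic:.3f}\t p-value: {result.pvalue:.4f}')
```

Rejecting when the $p$-value is below $\alpha$ is equivalent to rejecting when the test statistic exceeds the $(1-\alpha)$-quantile of $t_{K-1}$.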


We can work through what we would have gotten if we were to test whether the 
accuracy is greater than 88%. 

**NOTE**: We must fix the hypotheses before running the experiment. We are not supposed to adaptively change the test depending on the results. This is only "a simulation".

In [24]:
a_0 = 0.88 
# Compute the test statistic
T = np.sqrt(K) * (m - a_0) / s
threshold = scipy.stats.t(df=K-1).ppf(1-alpha)  # 1-alpha quantile of t_{K-1}

print(f'Test statistic: {T}\t threshold: {threshold}')

if T > threshold:
    print('Reject the null')
else:
    print('Fail to reject the null')

Test statistic: -3.614000865536312	 threshold: 2.13184678133629
Fail to reject the null
