
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# GenSen with PyTorch
In this tutorial, you will train a GenSen model for the sentence similarity task. We use the [SNLI](https://nlp.stanford.edu/projects/snli/) dataset in this example. For a more detailed walkthrough of the data processing, see [SNLI Data Prep](../01-prep-data/snli.ipynb). A quickstart version of this notebook can be found [here](../00-quick-start/).

## Overview

### What is GenSen?

GenSen[\[1\]](#References) is a technique for learning general-purpose, fixed-length representations of sentences via multi-task training. These representations are useful for transfer and low-resource learning. GenSen is trained on several data sources with multiple training objectives on over 100 million sentences.

### Why GenSen?

The GenSen model achieves state-of-the-art or competitive results on multiple sentence similarity datasets, such as MRPC, SICK-R, SICK-E and STS. The reported results, compared with other models, are as follows [\[2\]](#References):

| Model | MRPC | SICK-R | SICK-E | STS |
| --- | --- | --- | --- | --- |
| GenSen (Subramanian et al., 2018) | 78.6/84.4 | 0.888 | 87.8 | 78.9/78.6 |
| [InferSent](https://arxiv.org/abs/1705.02364) (Conneau et al., 2017) | 76.2/83.1 | 0.884 | 86.3 | 75.8/75.5 |
| [TF-KLD](https://www.aclweb.org/anthology/D13-1090) (Ji and Eisenstein, 2013) | 80.4/85.9 | - | - | - |

## Outline
This notebook is organized as follows:

1. GenSen Theory
2. Data preparation and inspection
3. Model application, performance and analysis

## 0. Global Settings

In [1]:
import sys
sys.path.append("../../../")

import os
from utils_nlp.dataset.preprocess import to_lowercase, to_nltk_tokens
from utils_nlp.dataset import snli
from utils_nlp.model.gensen_wrapper import GenSenClassifier
from utils_nlp.pretrained_embeddings.glove import download_and_extract 


print("System version: {}".format(sys.version))
BASE_DATA_PATH = '../../../data'

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]


## 3. Data Preparation and inspection

The [SNLI](https://nlp.stanford.edu/projects/snli/) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). 

### 3.1 Load the dataset

We provide a function `load_pandas_df`, which does the following:

* Downloads the SNLI zipfile to the specified directory location
* Extracts the file based on the specified split
* Loads the split as a pandas dataframe

The zipfile contains the following files:

* snli_1.0_dev.txt
* snli_1.0_train.txt
* snli_1.0_test.txt
* snli_1.0_dev.jsonl
* snli_1.0_train.jsonl
* snli_1.0_test.jsonl
    
The loader defaults to reading from the .txt file; however, you can change this to .jsonl by setting the optional file_type parameter when calling the function.
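
To make the difference between the two file types concrete, here is a minimal sketch that parses the same record from both layouts with pandas. It assumes the `.txt` files are tab-separated with a header row and the `.jsonl` files hold one JSON object per line. The two in-memory samples are invented for illustration, and this is not necessarily how `load_pandas_df` reads the files internally.

```python
import io

import pandas as pd

# Hypothetical one-row samples in the two SNLI layouts (not real corpus rows).
txt_sample = (
    "gold_label\tsentence1\tsentence2\n"
    "neutral\tA dog runs.\tAn animal is outside.\n"
)
jsonl_sample = (
    '{"gold_label": "neutral", "sentence1": "A dog runs.", '
    '"sentence2": "An animal is outside."}\n'
)

# .txt layout: tab-separated values with a header row.
df_txt = pd.read_csv(io.StringIO(txt_sample), sep="\t")

# .jsonl layout: one JSON object per line.
df_jsonl = pd.read_json(io.StringIO(jsonl_sample), lines=True)

# Both layouts should yield the same records.
print(df_txt.to_dict("records") == df_jsonl.to_dict("records"))
```

Either layout ends up as a dataframe with one row per sentence pair, so the choice of `file_type` mainly affects parsing, not the downstream steps.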

In [2]:
train = snli.load_pandas_df(BASE_DATA_PATH, file_split="train")
dev = snli.load_pandas_df(BASE_DATA_PATH, file_split="dev")
test = snli.load_pandas_df(BASE_DATA_PATH, file_split="test")

train.head()

|  | gold_label | sentence1_binary_parse | sentence2_binary_parse | sentence1_parse | sentence2_parse | sentence1 | sentence2 | captionID | pairID | label1 | label2 | label3 | label4 | label5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | neutral | ( ( ( A person ) ( on ( a horse ) ) ) ( ( jump... | ( ( A person ) ( ( is ( ( training ( his horse... | (ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o... | (ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ... | A person on a horse jumps over a broken down a... | A person is training his horse for a competition. | 3416050480.jpg#4 | 3416050480.jpg#4r1n | neutral |  |  |  |  |
| 1 | contradiction | ( ( ( A person ) ( on ( a horse ) ) ) ( ( jump... | ( ( A person ) ( ( ( ( is ( at ( a diner ) ) )... | (ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o... | (ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ... | A person on a horse jumps over a broken down a... | A person is at a diner, ordering an omelette. | 3416050480.jpg#4 | 3416050480.jpg#4r1c | contradiction |  |  |  |  |
| 2 | entailment | ( ( ( A person ) ( on ( a horse ) ) ) ( ( jump... | ( ( A person ) ( ( ( ( is outdoors ) , ) ( on ... | (ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o... | (ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ... | A person on a horse jumps over a broken down a... | A person is outdoors, on a horse. | 3416050480.jpg#4 | 3416050480.jpg#4r1e | entailment |  |  |  |  |
| 3 | neutral | ( Children ( ( ( smiling and ) waving ) ( at c... | ( They ( are ( smiling ( at ( their parents ) ... | (ROOT (NP (S (NP (NNP Children)) (VP (VBG smil... | (ROOT (S (NP (PRP They)) (VP (VBP are) (VP (VB... | Children smiling and waving at camera | They are smiling at their parents | 2267923837.jpg#2 | 2267923837.jpg#2r1n | neutral |  |  |  |  |
| 4 | entailment | ( Children ( ( ( smiling and ) waving ) ( at c... | ( There ( ( are children ) present ) ) | (ROOT (NP (S (NP (NNP Children)) (VP (VBG smil... | (ROOT (S (NP (EX There)) (VP (VBP are) (NP (NN... | Children smiling and waving at camera | There are children present | 2267923837.jpg#2 | 2267923837.jpg#2r1e | entailment |  |  |  |  |


### 3.2 Tokenize

Now that the dataset is loaded into a `pandas.DataFrame`, we convert the sentences to tokens. We clean the data before tokenizing, which includes dropping unnecessary columns and renaming the relevant columns to score, sentence1, and sentence2.

In [3]:
def clean(df, file_split):
    # Note: `df` is not used here; clean_snli re-reads the raw file from disk.
    src_file_path = os.path.join(BASE_DATA_PATH, "raw/snli_1.0/snli_1.0_{}.txt".format(file_split))
    # Create the output directory if it does not exist yet.
    clean_dir = os.path.join(BASE_DATA_PATH, "clean/snli_1.0")
    os.makedirs(clean_dir, exist_ok=True)
    dest_file_path = os.path.join(clean_dir, "snli_1.0_{}.txt".format(file_split))
    clean_df = snli.clean_snli(src_file_path).dropna()  # drop rows with any NaN vals
    clean_df.to_csv(dest_file_path)
    return clean_df

train = clean(train, 'train')
dev = clean(dev, 'dev')
test = clean(test, 'test')
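
The cleaning step can be pictured with plain pandas. The sketch below is an assumption about what such a step might look like (keep three columns, rename `gold_label` to `score`, drop incomplete rows). It is not the implementation of `snli.clean_snli`, and the sample rows are invented.

```python
import pandas as pd

# Invented rows that mimic a few of the raw SNLI columns.
raw = pd.DataFrame(
    {
        "gold_label": ["neutral", "entailment", "contradiction"],
        "sentence1": ["A dog runs.", "A dog runs.", "A dog runs."],
        "sentence2": ["An animal is outside.", "A dog moves.", None],
        "pairID": ["a1", "a2", "a3"],
    }
)

cleaned = (
    raw[["gold_label", "sentence1", "sentence2"]]  # keep the relevant columns
    .rename(columns={"gold_label": "score"})       # match the notebook's column name
    .dropna()                                      # drop rows with any missing value
)
print(cleaned)
```

The third row is dropped because its second sentence is missing, which mirrors the effect of the `.dropna()` call in the cell above.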

Once we have the clean pandas dataframes, we lowercase the text and tokenize it. We use the [NLTK](https://www.nltk.org/) library for tokenization.

In [4]:
train_tok = to_nltk_tokens(to_lowercase(train))
dev_tok = to_nltk_tokens(to_lowercase(dev))
test_tok = to_nltk_tokens(to_lowercase(test))

[nltk_data] Downloading package punkt to /home/jihon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to /home/jihon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to /home/jihon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
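
To show what lowercasing plus tokenization produces, here is a small self-contained sketch. It uses a simple regular expression as a stand-in tokenizer so that it runs without downloading NLTK data. The helper name `simple_tokenize` is hypothetical, and NLTK's tokenizer treats many cases (contractions, abbreviations) differently.

```python
import re


def simple_tokenize(sentence):
    """Lowercase a sentence and split it into word and punctuation tokens.

    A rough stand-in for NLTK tokenization, for illustration only.
    """
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


print(simple_tokenize("Two woman are holding packages."))
# ['two', 'woman', 'are', 'holding', 'packages', '.']
```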


In [5]:
dev_tok.head()

|  | score | sentence1 | sentence2 | sentence1_tokens | sentence2_tokens |
| --- | --- | --- | --- | --- | --- |
| 0 | neutral | two women are embracing while holding to go pa... | the sisters are hugging goodbye while holding ... | [two, women, are, embracing, while, holding, t... | [the, sisters, are, hugging, goodbye, while, h... |
| 1 | entailment | two women are embracing while holding to go pa... | two woman are holding packages. | [two, women, are, embracing, while, holding, t... | [two, woman, are, holding, packages, .] |
| 2 | contradiction | two women are embracing while holding to go pa... | the men are fighting outside a deli. | [two, women, are, embracing, while, holding, t... | [the, men, are, fighting, outside, a, deli, .] |
| 3 | entailment | two young children in blue jerseys, one with t... | two kids in numbered jerseys wash their hands. | [two, young, children, in, blue, jerseys, ,, o... | [two, kids, in, numbered, jerseys, wash, their... |
| 4 | neutral | two young children in blue jerseys, one with t... | two kids at a ballgame wash their hands. | [two, young, children, in, blue, jerseys, ,, o... | [two, kids, at, a, ballgame, wash, their, hand... |


## 4. Model application, performance and analysis of the results
The model is implemented as the `GenSenClassifier` class, with the training specifics hidden inside the `fit()` method. The workflow has three steps:

**Model initialization**: This is where we tell our class how to train the model. The main parameters to specify are:
1. `config_file`, the config file containing information such as the number of training epochs and the minibatch size.
2. `cache_dir`, the folder where all the data will be saved.
3. `learning_rate`, the learning rate for the model.
4. `pretrained_embedding_path`, the path to the pretrained embedding vectors.

**Model fit**: This is where we train the model on the data. The method takes three arguments: the training, dev and test set pandas dataframes. Note that the model is trained only on the training set; the test set is used to report the test set accuracy of the trained model, which in turn is an estimate of the generalization capabilities of the algorithm. It is generally useful to look at these quantities to get a first idea of the optimization behaviour.

**Model prediction**: This is where we generate the similarity for a pair of sentences. Once the model has been trained and we are satisfied with its overall accuracy, we use the saved model to show the similarity between two provided sentences.
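
The test set accuracy mentioned in the fit step is simply the fraction of sentence pairs whose predicted label matches the gold label. A minimal sketch, with a hypothetical helper `accuracy` and invented labels (this is not code from `GenSenClassifier`):

```python
def accuracy(gold_labels, predicted_labels):
    """Return the fraction of predictions that match the gold labels."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


# Invented labels for illustration.
gold = ["neutral", "entailment", "contradiction", "entailment"]
pred = ["neutral", "entailment", "neutral", "entailment"]
print(accuracy(gold, pred))  # 3 of 4 correct -> 0.75
```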

### 4.0 Download pretrained vectors.

In [6]:
pretrained_embedding_path = download_and_extract(BASE_DATA_PATH)

Vector file already exists. No changes made.
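
For readers unfamiliar with pretrained vectors, the sketch below shows how a GloVe-style text file is commonly laid out: one token per line followed by its vector components, separated by spaces. The two 3-dimensional vectors and the helper `load_vectors` are invented for illustration. The real file is far larger, and how `GenSenClassifier` consumes it is not shown here.

```python
import io


def load_vectors(handle):
    """Parse lines of the form '<token> <v1> <v2> ...' into a dict."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for a vector file (hypothetical values).
sample = io.StringIO("the 0.1 -0.2 0.3\nfox 0.5 0.0 -0.4\n")
embeddings = load_vectors(sample)
print(len(embeddings), len(embeddings["fox"]))
```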


### 4.1 Initialize Model

In [7]:
# Reload edited modules automatically. The autoreload extension ships with
# IPython, so no separate import is needed.
%load_ext autoreload
%autoreload 2
%aimport utils_nlp.model.gensen

config_filepath = '../../../utils_nlp/model/gensen/sample_config.json'
clf = GenSenClassifier(
    config_file=os.path.abspath(config_filepath),
    pretrained_embedding_path=pretrained_embedding_path,
    learning_rate=0.0001,
    cache_dir=BASE_DATA_PATH,
)

### 4.2 Train Model

In [None]:
%%time
clf.fit(train_tok, dev_tok, test_tok)

../../../data/clean/snli_1.0/snli_1.0_train.txt
../../../data/clean/snli_1.0/snli_1.0_dev.txt
../../../data/clean/snli_1.0/snli_1.0_test.txt
Building vocabulary ...
Attempted to log scalar metric Building vocabulary ...:
2
Building common source vocab ...
model/src_vocab.pkl
Found existing vocab file. Reloading ...
Building target vocabs ...
Found existing vocab file. Reloading ...
Reloading vocab for snli 
Fetching sentences ...
Attempted to log scalar metric Fetching sentences ...:
2
Processing corpus : 0 task snli 
Attempted to log scalar metric Processing corpus : :
0
Attempted to log scalar metric task:
snli
Reached end of dataset, reseting file pointer ...
Fetching sentences ...
Attempted to log scalar metric Fetching sentences ...:
2
Processing corpus : 0 task snli 
Attempted to log scalar metric Processing corpus : :
0
Attempted to log scalar metric task:
snli
Fetched 1000000 sentences
Fetched 1000000 sentences


  "num_layers={}".format(dropout, num_layers))
  torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)


Attempted to log scalar metric loss:
0.9526345729827881
INSIDE FORWARD: torch.Size([1, 16, 2048])

  F.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)
  F.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)
  torch.LongTensor(sorted_src_lens), volatile=True


Attempted to log scalar metric loss:
0.6286092400550842
Attempted to log scalar metric loss:
0.4079805910587311

### 4.3 Predict

In [None]:
sentences = [
    'hello world . the quick brown foxy',
    'the quick brown fox jumped over the lazy dog .',
]

clf.predict(sentences)
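
Since GenSen produces fixed-length sentence vectors, a common way to score a pair of sentences is cosine similarity. The sketch below illustrates that measure on made-up vectors. Whether `clf.predict` uses exactly this measure is an assumption, and the vectors are not real GenSen outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Invented 3-dimensional vectors standing in for sentence representations.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1 (parallel vectors)
print(cosine_similarity(vec_a, vec_c))  # 0.0 (orthogonal vectors)
```

Values near 1 indicate vectors pointing in nearly the same direction, while values near 0 indicate unrelated directions.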

## References

1. Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal, [*Learning general purpose distributed sentence representations via large scale multi-task learning*](https://arxiv.org/abs/1804.00079), ICLR, 2018.
2. Semantic textual similarity. URL: http://nlpprogress.com/english/semantic_textual_similarity.html