<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Evaluation on GenSen model by SentEval

SentEval is a library for evaluating the quality of sentence embeddings. It assesses their generalization power by using them as features on a broad and diverse set of "transfer" tasks. SentEval currently includes 17 downstream tasks.

This notebook shows how to run SentEval and evaluate a trained GenSen model locally. We used the [SentEval](https://github.com/facebookresearch/SentEval) toolkit to run most of our transfer learning experiments. To replicate these numbers, clone the SentEval repository and follow its setup instructions. Once complete, copy this notebook and `gensen.py` into its examples folder and run the cells below to reproduce different rows in Table 2 of our paper. Note: Please set the paths to the pretrained GloVe embeddings (`glove.840B.300d.h5`) and the model folder as appropriate.

## 0 Global settings

Most of the functions used in the notebook can be found in the `gensen.py` file.

In [1]:
# Check core SDK version number
import azureml.core
print("SDK version:", azureml.core.VERSION)

SDK version: 1.0.23


In [2]:
from azureml.telemetry import set_diagnostics_collection

set_diagnostics_collection(send_diagnostics=True)

Turning diagnostics collection on. 


In [4]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

Found the config file in: C:\Users\lishao\Project\NLP\examples\03-eval\senteval\examples\.azureml\config.json
Workspace name: MAIDAPNLP
Azure region: eastus2
Subscription id: 15ae9cb6-95c1-483d-a0e3-b1a1a3b06324
Resource group: nlprg


In [3]:
# default_run:begin
# Import dependencies
# pip install azureml-contrib-notebook
from azureml.core import Workspace, Experiment, RunConfiguration
from azureml.contrib.notebook.notebook_run_config import NotebookRunConfig

# Create new experiment
ws = Workspace.from_config()
exp = Experiment(ws, "simple_notebook_experiment")

In [4]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "gpucluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current AmlCompute. 
print(compute_target.get_status().serialize())

Found existing compute target.
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-04-23T12:27:46.775000+00:00', 'errors': None, 'creationTime': '2019-04-17T17:21:26.968570+00:00', 'modifiedTime': '2019-04-17T17:27:28.740980+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT7200S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


In [None]:
from azureml.core import Datastore
ds = Datastore.register_azure_file_share(workspace=ws,
                                        datastore_name= 'GenSen',
                                        file_share_name='azureml-filestore-09b72610-7938-4ed2-86a2-5004896b12d9',
                                        account_name='maidapnlp0056795534',
                                        account_key='8LtGFZErNlvI6fSrgODqCxJCckkVgq3AL/5S/8ma7Re7xUHgWrNRCfTFnP/QDhF7KDY6ScAORsUpSm7ziog5/Q==')

In [5]:
from azureml.core.conda_dependencies import CondaDependencies
# Customize run configuration to execute in user managed environment
run_config_user_managed = RunConfiguration()
run_config_user_managed.target = compute_target.name

dr = {
    'nsmr': ds.as_mount()
}

run_config_user_managed.data_references = dr

# Specify conda dependencies with scikit-learn
# cd = CondaDependencies.create(conda_packages=['scikit-learn'])
# run_config_user_managed.environment.python.conda_dependencies = cd

In [None]:
# Create notebook run configuration and set parameters values
cfg = NotebookRunConfig(source_directory="./",
                        notebook="gensen_senteval.ipynb",
                        parameters={},
                        run_config=run_config_user_managed)

In [None]:
# Submit experiment and wait for completion
run = exp.submit(cfg)
run.wait_for_completion(show_output=True)

## 1 Use SentEval
To evaluate your sentence embeddings, SentEval requires that you implement two functions:

1. **prepare** (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors etc)
2. **batcher** (transforms a batch of text sentences into sentence embeddings)

### 1.) prepare(params, samples) (optional)

*batcher* only sees one batch at a time while the *samples* argument of *prepare* contains all the sentences of a task.

```
prepare(params, samples)
```
* *params*: senteval parameters.
* *samples*: list of all sentences from the transfer task.
* *output*: No output. Arguments stored in "params" can further be used by *batcher*.

### 2.) batcher(params, batch)
```
batcher(params, batch)
```
* *params*: senteval parameters.
* *batch*: numpy array of text sentences (of size params.batch_size)
* *output*: numpy array of sentence embeddings (of size params.batch_size)
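The division of labor between the two functions can be illustrated with a toy pair that is unrelated to GenSen. This is only a sketch under stated assumptions: `DotDict` is a hypothetical stand-in for the params object SentEval passes in, `word2id` is an illustrative name, and the bag-of-words counts are a placeholder for a real sentence encoder.

```python
import numpy as np


class DotDict(dict):
    """Hypothetical stand-in for the SentEval params object (dict with attribute access)."""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__


def prepare(params, samples):
    # prepare sees every sentence of the task, so it can build a vocabulary once
    # and store it in params for later use by batcher.
    words = sorted({word for sentence in samples for word in sentence})
    params.word2id = {word: index for index, word in enumerate(words)}


def batcher(params, batch):
    # batcher sees a single batch and returns one embedding row per sentence.
    # Here the "embedding" is just a vector of word counts.
    embeddings = np.zeros((len(batch), len(params.word2id)), dtype=np.float32)
    for row, sentence in enumerate(batch):
        for word in sentence:
            if word in params.word2id:
                embeddings[row, params.word2id[word]] += 1.0
    return embeddings


# Sentences arrive already tokenized, as lists of words.
samples = [["a", "good", "movie"], ["a", "bad", "movie"]]
params = DotDict()
prepare(params, samples)
embeddings = batcher(params, samples[:1])
print(embeddings.shape)
```

The only contract that matters is the one described above: whatever `prepare` stores in `params` is available to `batcher`, and `batcher` returns one row per input sentence.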

### 1.1 Prepare function

In [2]:
def prepare(params, samples):
    print('Preparing task : %s ' % (params.current_task))
    vocab = set()
    for sample in samples:
        if params.current_task != 'TREC':
            sample = ' '.join(sample).lower().split()
        else:
            sample = ' '.join(sample).split()
        for word in sample:
            if word not in vocab:
                vocab.add(word)

    vocab.add('<s>')
    vocab.add('<pad>')
    vocab.add('<unk>')
    vocab.add('</s>')
    # If you want to turn off vocab expansion just comment out the below line.
    params['gensen'].vocab_expansion(vocab)

### 1.2 Batcher function

In [21]:
def get_batcher(local_strategy):
    
    def batcher(params, batch):
        # batch is a list of tokenized sentences (each sentence is a list of words)
        max_tasks = ['MR', 'CR', 'SUBJ', 'MPQA', 'ImageCaptionRetrieval']
        if local_strategy == 'best':
            if params.current_task in max_tasks:
                strategy = 'max'
            else:
                strategy = 'last'
        else:
            strategy = local_strategy

        sentences = [' '.join(s).lower() for s in batch]
        _, embeddings = params['gensen'].get_representation(
            sentences, pool=strategy, return_numpy=True
        )
        return embeddings
    
    return batcher
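The strategy selection inside `batcher` can be read in isolation. The sketch below pulls that branch out into a standalone helper; `resolve_pooling` and `MAX_POOL_TASKS` are hypothetical names introduced only for illustration, and the task list is copied from the cell above.

```python
# Tasks for which the 'best' setting uses max pooling (copied from the batcher above).
MAX_POOL_TASKS = {'MR', 'CR', 'SUBJ', 'MPQA', 'ImageCaptionRetrieval'}


def resolve_pooling(task_name, local_strategy):
    """Return the pooling strategy that batcher would pass as pool=..."""
    if local_strategy == 'best':
        # 'best' picks a per-task strategy: max pooling for the listed tasks,
        # the last hidden state for everything else.
        return 'max' if task_name in MAX_POOL_TASKS else 'last'
    # Any other value is used as-is for every task.
    return local_strategy


for task in ['MR', 'TREC']:
    print(task, resolve_pooling(task, 'best'))
```

Passing a fixed value such as `'last'` to `get_batcher` applies the same pooling to all tasks, while `'best'` switches per task.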


## 2 Evaluation of GenSen trained model on Transfer Tasks (SentEval)

### 2.1 Parameters for SentEval

The current list of available tasks is:
```python
['CR', 'MR', 'MPQA', 'SUBJ', 'SST2', 'SST5', 'TREC', 'MRPC', 'SNLI',
'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval',
'STS12', 'STS13', 'STS14', 'STS15', 'STS16',
'Length', 'WordContent', 'Depth', 'TopConstituents','BigramShift', 'Tense',
'SubjNumber', 'ObjNumber', 'OddManOut', 'CoordinationInversion']
```
Users can choose a subset of the above tasks.

1) To perform the actual evaluation, first import senteval and set its parameters:
```python
import senteval
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
```

2) (Optional) Set the parameters of the classifier (when applicable):
```python
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                 'tenacity': 5, 'epoch_size': 4}
```
You can choose **nhid=0** (Logistic Regression) or **nhid>0** (MLP) and define the parameters for training.
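As a sketch of how these settings might be varied, the helper below builds either the configuration shown above or a lighter one for quick prototyping. `make_senteval_params` is a hypothetical helper, `'../data'` is a placeholder path, and the lighter values (5 folds, `rmsprop`, larger batches, fewer epochs) are an assumption modeled on the prototyping settings suggested in the SentEval README; check them against the version you have installed.

```python
def make_senteval_params(task_path, fast=False, nhid=0):
    """Build a SentEval parameter dict (hypothetical helper, not part of SentEval)."""
    if fast:
        # Lighter settings for quick experiments (assumed values, see the note above).
        params = {'task_path': task_path, 'usepytorch': True, 'kfold': 5}
        classifier = {'nhid': nhid, 'optim': 'rmsprop', 'batch_size': 128,
                      'tenacity': 3, 'epoch_size': 2}
    else:
        # Settings used in this notebook.
        params = {'task_path': task_path, 'usepytorch': True, 'kfold': 10}
        classifier = {'nhid': nhid, 'optim': 'adam', 'batch_size': 64,
                      'tenacity': 5, 'epoch_size': 4}
    params['classifier'] = classifier
    return params


# nhid=0 selects logistic regression; nhid=50 would train an MLP with 50 hidden units.
default_params = make_senteval_params('../data')
mlp_params = make_senteval_params('../data', nhid=50)
fast_params = make_senteval_params('../data', fast=True)
print(default_params['classifier'])
```

Results obtained with the lighter settings are generally not comparable to those from the full configuration, so they are best kept for debugging.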

In [4]:
# define transfer tasks
transfer_tasks = ['MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'SST5', 'TREC', 'SICKRelatedness',
                  'SICKEntailment', 'MRPC', 'STS14', 'STSBenchmark', 'STS12', 'STS13', 'STS15', 'STS16']
params_senteval = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params_senteval['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                 'tenacity': 5, 'epoch_size': 4}

# Set up logger
logging.basicConfig(format='%(asctime)s : %(message)s', level=logging.INFO)

### 2.2 Create an instance of the class SE:

```python
se = senteval.engine.SE(params, batcher, prepare)
```

In [23]:
# All the parameters by default.
folder_path = './data/models'
prefix_1 = 'nli_large_bothskip_parse'
prefix_2 = 'nli_large_bothskip'
pretrain = './data/embedding/glove.840B.300d.h5'
strategy = 'best'
cuda = torch.cuda.is_available()

In [22]:
def gensen_eval(folder_path, prefix_1, prefix_2, pretrain, cuda, strategy):
    gensen_1 = GenSenSingle(
        model_folder=folder_path,
        filename_prefix=prefix_1,
        pretrained_emb=pretrain,
        cuda=cuda
    )
    gensen_2 = GenSenSingle(
        model_folder=folder_path,
        filename_prefix=prefix_2,
        pretrained_emb=pretrain,
        cuda=cuda
    )
    gensen = GenSen(gensen_1, gensen_2)
    params_senteval['gensen'] = gensen
    se = senteval.engine.SE(params_senteval, get_batcher(strategy), prepare)
    results_transfer = se.eval(transfer_tasks)

    print('--------------------------------------------')
    print('Table 2 of Our Paper : ')
    print('--------------------------------------------')
    print('MR                [Dev:%.1f/Test:%.1f]' % (results_transfer['MR']['devacc'], results_transfer['MR']['acc']))
    print('CR                [Dev:%.1f/Test:%.1f]' % (results_transfer['CR']['devacc'], results_transfer['CR']['acc']))
    print('SUBJ              [Dev:%.1f/Test:%.1f]' % (results_transfer['SUBJ']['devacc'], results_transfer['SUBJ']['acc']))
    print('MPQA              [Dev:%.1f/Test:%.1f]' % (results_transfer['MPQA']['devacc'], results_transfer['MPQA']['acc']))
    print('SST2              [Dev:%.1f/Test:%.1f]' % (results_transfer['SST2']['devacc'], results_transfer['SST2']['acc']))
    print('SST5              [Dev:%.1f/Test:%.1f]' % (results_transfer['SST5']['devacc'], results_transfer['SST5']['acc']))
    print('TREC              [Dev:%.1f/Test:%.1f]' % (results_transfer['TREC']['devacc'], results_transfer['TREC']['acc']))
    print('MRPC              [Dev:%.1f/TestAcc:%.1f/TestF1:%.1f]' % (results_transfer['MRPC']['devacc'], results_transfer['MRPC']['acc'], results_transfer['MRPC']['f1']))
    print('SICKRelatedness   [Dev:%.3f/Test:%.3f]' % (results_transfer['SICKRelatedness']['devpearson'], results_transfer['SICKRelatedness']['pearson']))
    print('SICKEntailment    [Dev:%.1f/Test:%.1f]' % (results_transfer['SICKEntailment']['devacc'], results_transfer['SICKEntailment']['acc']))
    print('STS12             [Pearson:%.3f/Spearman:%.3f]' % (results_transfer['STS12']['all']['pearson']['mean'], results_transfer['STS12']['all']['spearman']['mean']))
    print('STS13             [Pearson:%.3f/Spearman:%.3f]' % (results_transfer['STS13']['all']['pearson']['mean'], results_transfer['STS13']['all']['spearman']['mean']))
    print('STS14             [Pearson:%.3f/Spearman:%.3f]' % (results_transfer['STS14']['all']['pearson']['mean'], results_transfer['STS14']['all']['spearman']['mean']))
    print('STS15             [Pearson:%.3f/Spearman:%.3f]' % (results_transfer['STS15']['all']['pearson']['mean'], results_transfer['STS15']['all']['spearman']['mean']))
    print('STS16             [Pearson:%.3f/Spearman:%.3f]' % (results_transfer['STS16']['all']['pearson']['mean'], results_transfer['STS16']['all']['spearman']['mean']))
    print('STSBenchmark      [Dev:%.5f/Pearson:%.5f/Spearman:%.5f]' % (results_transfer['STSBenchmark']['devpearson'], results_transfer['STSBenchmark']['pearson'], results_transfer['STSBenchmark']['spearman']))
    print('--------------------------------------------')
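The print statements above all follow one pattern: a task name padded to a fixed width, followed by labeled metrics. A small formatter could express that pattern once. This is only a sketch: `format_row` is a hypothetical helper, and the zeros in `placeholder_results` are dummy values, not evaluation results.

```python
def format_row(name, metrics, fmt='%.1f'):
    """Render one report line, e.g. 'MR                [Dev:0.0/Test:0.0]'.

    metrics is a list of (label, value) pairs, shown in the given order.
    """
    body = '/'.join('%s:%s' % (label, fmt % value) for label, value in metrics)
    # Pad the task name to 18 characters, matching the layout used above.
    return '%-18s[%s]' % (name, body)


# Dummy values purely to exercise the formatter.
placeholder_results = {'MR': {'devacc': 0.0, 'acc': 0.0}}
row = format_row('MR', [('Dev', placeholder_results['MR']['devacc']),
                        ('Test', placeholder_results['MR']['acc'])])
print(row)
```

For the similarity tasks, passing `fmt='%.3f'` reproduces the three-decimal layout used for the Pearson and Spearman scores.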


### 2.3 Results from SentEval

In [24]:

gensen_eval(folder_path, prefix_1, prefix_2, pretrain, cuda, strategy)


Preparing task : MR 
Loading pretrained word embeddings




Training vocab expansion on model
Found 5292 task OOVs 
Found 1781 pretrain OOVs 

                expected 80004 x 512
Loading pretrained word embeddings




Training vocab expansion on model
Found 5292 task OOVs 
Found 1781 pretrain OOVs 

                expected 80004 x 512


2019-04-29 16:51:18,172 : Generating sentence embeddings
  sentences = Variable(torch.LongTensor(sentences), volatile=True)
  rev = Variable(torch.LongTensor(rev), volatile=True)
2019-04-29 16:51:57,127 : Generated sentence embeddings
2019-04-29 16:51:57,130 : Training pytorch-MLP-nhid0-adam-bs64 with (inner) 10-fold cross-validation
2019-04-29 17:00:04,531 : Best param found at split 1: l2reg = 0.0001                 with score 83.23


KeyboardInterrupt: 

## References

1. A. Conneau, D. Kiela, [*SentEval: An Evaluation Toolkit for Universal Sentence Representations*](https://arxiv.org/abs/1803.05449).