# Title: Hello World NLP Project

#### Members Names: Oscar Bobadilla, Igor Ilic

#### Members Emails: {oscar.bobadilla, iilic} @ ryerson.ca

# Introduction:

#### Problem Description:

- The need for better language modeling:
 - NLP model forerunners: ELMo and ULMFiT (LSTM-based)
 - BERT (bidirectional Transformer architecture) [1](http://mlexplained.com/2019/01/07/paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/)
 - XLNet (Transformer-XL architecture) [1](https://mlexplained.com/2019/06/30/paper-dissected-xlnet-generalized-autoregressive-pretraining-for-language-understanding-explained/)

    Earlier models such as ELMo and ULMFiT were LSTM-based and improved on word-embedding models. BERT uses the Transformer architecture, which allowed it to produce state-of-the-art results. XLNet builds on Transformer-XL, which introduces the notion of relative positional embeddings. In addition, the Transformer-XL architecture addresses a limitation of the original Transformer, which takes only fixed-length sequences as input.
    
    XLNet has produced state-of-the-art results in the following tasks:

    - Text classification
    - Question answering
    - Natural language inference
    - Duplicate sentence (question) detection
    - Document ranking
    - Coreference resolution
    
    - Reading comprehension (SQuAD, RACE):
     - Text classification (Yelp, IMDB)
     - Question answering: given a passage, can the model answer a question about it? (Measured using SQuAD v1.1 / v2.0)
     - Sentiment analysis: can the model tell whether a statement is positive or negative? (SST-2)
     - *GLUE language understanding tasks, reading comprehension tasks like SQuAD and RACE, text classification tasks such as Yelp and IMDB, and the ClueWeb09-B document ranking task*


***
#### Context of the Problem:

- Existing models and BERT

    Previous language models, such as ELMo and ULMFiT, were generally trained "left to right": given a sequence of words, they had to predict the next word. BERT's key trait is that, instead of predicting the next word after a sequence of words, it randomly masks words in a sentence and predicts them. This method is effective because it forces the model to learn how to use information from the entire sentence to deduce which words are missing.
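
As a rough illustration of the masking idea described above, the sketch below builds one masked-language-model training example from a list of tokens. It is a simplification under stated assumptions: the function name `mask_tokens`, the toy sentence, and the `mask_prob=0.3` used in the demo are made up for illustration (BERT masks about 15% of tokens), and the real BERT procedure also sometimes substitutes a random token or keeps the original token instead of always writing `[MASK]`.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, targets) for a simplified BERT-style example.

    targets maps each masked position to the original token that the
    model would be asked to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            targets[i] = tok      # remember what should be predicted
        else:
            masked.append(tok)
    return masked, targets


# Toy sentence, for illustration only.
tokens = "the movie was surprisingly good".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(targets)
```

Every unmasked token stays visible on both sides of a masked position, which is what lets the model use context from the whole sentence.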
    

- XLNet and Transformer-XL

    XLNet's generalized autoregressive pretraining method enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and its autoregressive formulation overcomes the limitations of BERT. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining.
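
Written out in the notation of the XLNet paper, "maximizing the expected likelihood over all permutations of the factorization order" corresponds to the objective below. Here $\mathcal{Z}_T$ is the set of all permutations of the index sequence $[1, \dots, T]$, $z_t$ is the $t$-th element of a sampled permutation $\mathbf{z}$, and $\mathbf{z}_{<t}$ are the elements before it:

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```

Because the same parameters $\theta$ are shared across all factorization orders, each token is, in expectation, predicted from context on both sides.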

***
#### Limitations of Other Approaches:
**BERT**
1. The [MASK] token used in training does not appear during fine-tuning. BERT is trained to predict tokens replaced with the special [MASK] token. The problem is that the [MASK] token, which is at the center of training BERT, never appears when fine-tuning BERT on downstream tasks. This can cause a whole host of issues, such as:
 - What does BERT do for tokens that are not replaced with [MASK]?
 - In most cases, BERT can simply copy non-masked tokens to the output. So would it really learn to produce meaningful representations for non-masked tokens?
 - Of course, BERT still needs to accumulate information from all words in a sequence to denoise [MASK] tokens. But what happens if there are no [MASK] tokens in the input sentence?

   There are no clear answers to the above questions, but it's clear that the [MASK] token is a source of train-test skew that can cause problems during fine-tuning.
   

2. BERT generates predictions independently. Another problem stems from the fact that BERT predicts masked tokens in parallel, meaning that during training, it does not learn to handle dependencies between simultaneously masked tokens. In other words, it does not learn dependencies between its own predictions.

   Since BERT is not actually used to unmask tokens in downstream tasks, this is not directly a problem. It still matters because it reduces the number of dependencies BERT learns at once, making the learning signal weaker than it could be.
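
The XLNet paper illustrates this with the sentence "New York is a city". Suppose both models choose "New" and "York" as prediction targets, and XLNet samples the factorization order [is, a, city, New, York]. The two training objectives then reduce to:

```latex
\mathcal{J}_{\text{BERT}} =
  \log p(\text{New} \mid \text{is a city})
  + \log p(\text{York} \mid \text{is a city})

\mathcal{J}_{\text{XLNet}} =
  \log p(\text{New} \mid \text{is a city})
  + \log p(\text{York} \mid \text{New}, \text{is a city})
```

XLNet's second term conditions on "New", so it can capture the dependency between the two targets, while BERT's objective treats them as independent given the unmasked words.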
    
<img src="conceptual_difference.png" alt="Alt text that describes the graphic" title="Title text" />


  
***
#### Solution:

How XLNet addresses these limitations:

**XLNet**
- Permutation language modeling and Transformer-XL:

    XLNet addresses the previously mentioned problems by introducing a variant of language modeling called "permutation language modeling". Like traditional language models, permutation language models are trained to predict one token given the preceding context, but instead of predicting the tokens in sequential order, they predict them in some random order.

    The figure above shows the conceptual difference between BERT and XLNet. Transparent words are masked out so the model cannot rely on them. XLNet learns to predict the words in an arbitrary order but in an autoregressive, sequential manner (not necessarily left-to-right). BERT predicts all masked words simultaneously.

    Aside from using permutation language modeling, XLNet improves upon BERT by using Transformer-XL as its base architecture. Transformer-XL showed state-of-the-art performance in language modeling, so it was a natural choice for XLNet.

    XLNet uses the two key ideas from Transformer-XL: relative positional embeddings and the recurrence mechanism. The hidden states from the previous segment are cached and frozen while conducting the permutation language modeling for the current segment. Since all the words from the previous segment are used as input, there is no need to know the permutation order of the previous segment.
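
To make the permutation idea more concrete, the sketch below turns one sampled factorization order into a visibility mask: a position may only look at positions that come earlier in that order. This is a simplified illustration, not XLNet's actual implementation (which uses two-stream attention and predicts only part of each permutation); the function name `permutation_mask` and the toy sentence are assumptions made for the example.

```python
import random


def permutation_mask(order):
    """Build a visibility mask from a factorization order.

    order[k] is the position predicted at step k. mask[i][j] is True when
    position i may attend to position j, i.e. when j comes strictly
    earlier than i in the factorization order.
    """
    n = len(order)
    step = {pos: k for k, pos in enumerate(order)}  # position -> step index
    return [[step[j] < step[i] for j in range(n)] for i in range(n)]


# Toy sentence, for illustration only.
tokens = ["New", "York", "is", "a", "city"]
order = list(range(len(tokens)))
random.Random(0).shuffle(order)  # one sampled factorization order

mask = permutation_mask(order)
for k, pos in enumerate(order):
    context = [tokens[j] for j in range(len(tokens)) if mask[pos][j]]
    print(f"step {k}: predict {tokens[pos]!r} given {context}")
```

The token order of the input is never shuffled; only the mask changes, so positional information is preserved while the prediction order varies.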


Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

[1](https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335)

<img src="results.png" alt="Results of tests" title="Title text" />

# Background

Summary of results from the paper:

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Tom et al. [1] | They trained a BERT-based transformer to predict answers from the passage of a question | SQuAD dataset for QA | Only 80% accuracy |
| George et al. [2] | They trained an attention-based sequence-to-sequence model using LSTM to predict answers from the passage of a question | SQuAD v2 dataset for QA | High accuracy but poor on unknown answers |
| J. Devlin et al. [10] | Uses 3 BERT variants (original BERT, BERT with whole-word masking, BERT without next-sentence prediction), with the same data and hyperparameters for comparison | BooksCorpus [40], English Wikipedia, Giga5, ClueWeb 2012-B and Common Crawl for pretraining | XLNet outperforms BERT by a sizable margin on all the considered datasets. Ref. 1 (below) |
| Y. Liu et al. [21], Z. Lan et al. [19] | Comparison to other pretrained models: RoBERTa, BERT+DCMN | Full data: BooksCorpus [40], English Wikipedia, Giga5, ClueWeb 2012-B and Common Crawl for pretraining | Ref. 2, Ref. 3, Ref. 4 and Ref. 5. XLNet generally outperforms BERT and RoBERTa |

1. <img src="table_1.png" alt="Results of tests" title="Title text" />
***
**Ref.2 Reading comprehension & document ranking**
2. <img src="table_2.png" alt="Results of tests" title="Title text" />
***
**Ref.3 Question answering**
3. <img src="table_3.png" alt="Results of tests" title="Title text" />
***
**Ref.4 Text classification**
4. <img src="table_4.png" alt="Results of tests" title="Title text" />
***
**Ref.5 Natural language understanding**
5. <img src="table_5.png" alt="Results of tests" title="Title text" />



# Methodology

Though there are many ways to use bidirectional transformers, the one
we explored was classification on the SST-2 dataset. This dataset is commonly used as a benchmark,
so we judged it to be a good place to start.

SST-2 consists of sentences, each classified as positive (1) or negative (0).
Samples are included below.

Training XLNet from scratch is a very complicated task. Because of this, we instead use a pretrained, lighter version of the model (xlnet-base-cased: 110M params) and take its embeddings as input to a classification layer. This differs from the setup behind the results reported in the XLNet paper.

Typically, XLNet would be used by taking the pretrained large version (xlnet-large-cased) and fine-tuning it on the data set, which yields the high results reported in the paper. We briefly discuss this at the end.

# Implementation

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# XLNet specifics
import torch
from transformers import XLNetModel, XLNetTokenizer

## Dataset info
We use the SST-2 dataset (XLNet reached 94.4% accuracy on it in the paper).

In [2]:
df_train = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv',
                 delimiter='\t',
                 names=['sentence','label'])

df_test = pd.read_csv('https://raw.githubusercontent.com/clairett/pytorch-sentiment-classification/master/data/SST2/test.tsv',
                 delimiter='\t',
                 names=['sentence','label'])

split_point = len(df_train)
df = pd.concat([df_train, df_test])

In [3]:
for s, l in df.head().values:
    print(f'Label {l}: {s}')

Label 1: a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
Label 0: apparently reassembled from the cutting room floor of any given daytime soap
Label 0: they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science fiction elements of bug eyed monsters and futuristic women in skimpy clothes
Label 1: this is a visually stunning rumination on love , memory , history and the war between art and commerce
Label 1: jonathan parker 's bartleby should have been the be all end all of the modern office anomie films


In [4]:
df_train['label'].value_counts()

1    3610
0    3310
Name: label, dtype: int64

In [5]:
df_test['label'].value_counts()

0    912
1    909
Name: label, dtype: int64

## Training Model

In [6]:
# See here for all pretrained models: https://huggingface.co/transformers/pretrained_models.html?highlight=pretrained
pretrained_label = 'xlnet-base-cased'

tokenizer = XLNetTokenizer.from_pretrained(pretrained_label)
model = XLNetModel.from_pretrained(pretrained_label)

## Data preparation
### Tokenization

In [7]:
tokenized = df['sentence'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

These added special tokens include a classifier token, `<cls>`, which is the one we are mainly interested in.

In [8]:
print(f'Tokenizer string form: {tokenizer.cls_token}, id: {tokenizer.cls_token_id}')

Tokenizer string form: <cls>, id: 3


In [9]:
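# Record each sentence's <cls> position (its last token) and the longest length
max_len = 0
clf_positions = np.zeros(len(tokenized), dtype=np.int64)
for posn, i in enumerate(tokenized.values):
    clf_positions[posn] = len(i) - 1
    if len(i) > max_len:
        max_len = len(i)

# Right-pad every sentence with 0 up to max_len
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])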

A sample padded sentence looks like the following:

In [10]:
padded[0]

array([   24, 16003,    17,    19,  5787,    21,  1381, 21469,    17,
          88,  7693, 15930,    56,    20,  4111,    21,    18, 11740,
          21,  4974,    23,  6941,  2701,     4,     3,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0])

We can see that the `<cls>` token (id 3) is at the end of the sentence, before the padding.

### Masking
One nuance: we need to pass in an attention mask, which tells XLNet which positions hold real tokens and which are padding.

In [12]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(8741, 86)
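
The mask above is derived from `padded != 0`, so it would also hide any real token whose id happens to be 0 (for example, if the tokenizer uses id 0 for unknown tokens). A minimal alternative sketch, assuming we still have the unpadded token lists, builds the mask from each sentence's length instead. The helper name `pad_with_mask` and the toy ids are hypothetical and not part of the notebook.

```python
def pad_with_mask(sequences, pad_id):
    """Right-pad token-id lists to a common length.

    Returns (padded, mask), where mask is 1 for real tokens and 0 for
    padding. The mask is derived from each sequence's length, so a real
    token whose id happens to equal pad_id is still marked as 1.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Toy ids for illustration only (not real XLNet vocabulary ids).
padded_toy, mask_toy = pad_with_mask([[24, 0, 4, 3], [17, 4, 3]], pad_id=0)
print(padded_toy)  # [[24, 0, 4, 3], [17, 4, 3, 0]]
print(mask_toy)    # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Both return values are plain nested lists, so they can be wrapped with `np.array(...)` before being converted to tensors.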

### Embedding
We now pass the tokenized sentences into XLNet and take the resulting embeddings as input to
another classifier.

In [13]:
inputs = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(inputs, attention_mask=attention_mask)

To select the right feature vector, we take the hidden state at the classifier token's position in each sentence.
The final hidden state holds one vector per token, so we index it with the positions recorded earlier.

In [14]:
# Features
final_hidden_state = last_hidden_states[0]
X = final_hidden_state[np.arange(len(final_hidden_state)), clf_positions]

# Labels
y = df['label']
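
The embedding step above sends all sentences through the model in a single forward pass, which may need a lot of memory. One possible workaround, sketched here under assumptions, is to process the data in batches. The helper `batch_slices` and the batch size of 64 are hypothetical; the commented usage reuses the notebook's `model`, `inputs`, `attention_mask`, and `clf_positions` and is not something we ran.

```python
def batch_slices(n_items, batch_size):
    """Yield slice objects that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield slice(start, min(start + batch_size, n_items))


# Hypothetical usage with the tensors defined above (batch_size is an assumption):
#
# chunks = []
# with torch.no_grad():
#     for s in batch_slices(len(inputs), batch_size=64):
#         hidden = model(inputs[s], attention_mask=attention_mask[s])[0]
#         rows = torch.arange(hidden.shape[0])
#         cols = torch.as_tensor(clf_positions[s])
#         chunks.append(hidden[rows, cols])  # keep only the <cls> vectors
# X = torch.cat(chunks)

print(list(batch_slices(10, 4)))
# [slice(0, 4, None), slice(4, 8, None), slice(8, 10, None)]
```

Keeping only the `<cls>` vector from each batch means the full per-token hidden states never have to be held in memory at once.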

## Classification

With all of the sentences encoded, we can now pass the features into any classifier we like. For simplicity,
we chose the sklearn MLPClassifier with its default values.

In [15]:
num_test_pts = len(df) - split_point
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=num_test_pts)
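
Note that `train_test_split` shuffles by default, so the split above is a random one with the same test size as the original test file, not the original train/test partition. Because `df` was built by concatenating the training rows followed by the test rows, the original partition could be recovered by slicing at `split_point`. The helper below is a hypothetical sketch of that idea on toy data; it is not what produced the score reported in this notebook.

```python
def split_at(features, labels, split_point):
    """Split parallel sequences into (train, test) parts at a fixed index."""
    return (features[:split_point], features[split_point:],
            labels[:split_point], labels[split_point:])


# Toy data: the first 3 rows play the role of the training file.
X_toy = [[0.1], [0.2], [0.3], [0.4], [0.5]]
y_toy = [1, 0, 1, 0, 1]
X_tr, X_te, y_tr, y_te = split_at(X_toy, y_toy, split_point=3)
print(len(X_tr), len(X_te))  # 3 2
```

With the notebook's variables this would look like `split_at(X, y.values, split_point)`, where passing `y.values` avoids any index-based surprises from the concatenated DataFrame.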

In [16]:
clf = MLPClassifier()
clf.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [17]:
clf.score(X_test, y_test)

0.7874794069192751

This is a lot better than random guessing! It doesn't reach XLNet's full accuracy in the paper (94.4%), but this model has:  
- A simple, unoptimized classification layer
- A smaller XLNet whose weights were not fine-tuned

## Fine-tuning XLNet

Although the heuristic test above is fast and easy to implement, it doesn't obtain the results
found in the XLNet paper (94.4% accuracy).

By fine-tuning on the SST-2 dataset, we were able to raise the accuracy to 94.0%. This was accomplished
by setting up the transformers repo and running the [examples](https://github.com/huggingface/transformers/blob/master/examples/README.md):

```bash
export GLUE_DIR=/path/to/data
export TASK_NAME=SST-2

python run_glue.py \
  --model_type xlnet \
  --model_name_or_path xlnet-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir $GLUE_DIR/tmp/$TASK_NAME/
```

yields:
```bash
$ cat eval_results.txt
acc = 0.9403669724770642
```

This run took significantly longer, but fine-tuning adapted the model to this particular data set very well.

# Conclusion and Future Direction

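Using frozen xlnet-base-cased embeddings with a default sklearn MLPClassifier reached about 78.7% accuracy on SST-2, while fine-tuning the same pretrained model on SST-2 reached 94.0%, close to the 94.4% reported in the XLNet paper. The main limitations of the first approach are its unoptimized classification layer and the fact that the XLNet weights were not tuned. A natural extension is to fine-tune the larger xlnet-large-cased model and to tune the classification layer.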

# References:

[1]:  Authors names, title of the paper, Conference Name,Year, page number (iff available)

[2]:  Author names, title of the paper, Journal Name,Journal Vol, Issue Num, Year, page number (iff available)