# Title: Hello World NLP Project

#### Members Names: Oscar Bobadilla, Igor Ilic

#### Members Emails: {oscar.bobadilla, iilic} @ ryerson.ca

# Introduction:

#### Problem Description:

- A need for better language modeling:
 - NLP model forerunners: ELMo and ULMFiT (LSTM-based)
 - BERT (Bidirectional Transformer architecture)  [1](http://mlexplained.com/2019/01/07/paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/)
 - XLNet (Transformer-XL architecture)  [1](https://mlexplained.com/2019/06/30/paper-dissected-xlnet-generalized-autoregressive-pretraining-for-language-understanding-explained/)

    Earlier models such as ELMo and ULMFiT were LSTM-based and improved on plain word-embedding models. BERT uses the Transformer architecture, which allowed it to produce state-of-the-art results. XLNet builds on Transformer-XL, which introduces the notion of relative positional embeddings. In addition, the Transformer-XL architecture addresses a limitation of the original Transformer, which only takes fixed-length sequences as input.
    
    XLNet has produced state-of-the-art results in the following tasks:
     - Reading Comprehension: (RACE)
     - Text Classification: (Yelp, IMDB)
     - Question Answering: Given information, can you answer a question? (Measured using SQuAD v1.1 / v2.0)
     - Sentiment Analysis: Can you figure out if a statement is positive / negative? (SST-2)
     - *GLUE language understanding tasks, reading comprehension tasks like SQuAD and RACE, text classification tasks such as Yelp and IMDB, and the ClueWeb09-B document ranking task*


***
#### Context of the Problem:

- Existing models and BERT

    Previous language models, such as ELMo and ULMFiT, were generally trained "left to right": given a sequence of words, they had to predict the next word. BERT's key trait is that, instead of predicting the next word after a sequence of words, it randomly masks words in a sentence and predicts them. This method is effective because it forces the model to use information from the entire sentence to deduce which words are missing.
    

- XLNet and Transformer-XL

    XLNet's generalized autoregressive pretraining method enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. Thanks to its autoregressive formulation, it also overcomes the limitations of BERT. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining.
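
As a rough illustration of the two training setups described above, the sketch below builds next-word prediction pairs for a left-to-right language model and a BERT-style masked input. It is a simplified toy, not BERT's actual preprocessing: the function names, the example sentence, and the `mask_prob` and `seed` parameters are assumptions made for illustration, and the real recipe does not always substitute `[MASK]` for a selected token.

```python
import random

MASK = "[MASK]"


def next_word_examples(tokens):
    """Left-to-right language modeling: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Simplified BERT-style corruption: hide a random subset of tokens behind [MASK].

    Returns the corrupted sequence and a {position: original_token} dict of targets.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok  # the model must recover this token
        else:
            corrupted.append(tok)
    return corrupted, targets


sentence = "the cat sat on the mat".split()

# Left-to-right: the context only ever contains words before the target.
for context, target in next_word_examples(sentence):
    print(context, "->", target)

# Masked: the context contains words on both sides of each hidden position.
corrupted, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(corrupted, targets)
```

In the left-to-right case the model never sees words to the right of the target, whereas in the masked case every unmasked word in the sentence is available when filling in a hidden position.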

***
#### Limitation About other Approaches:
**BERT**
1. The [MASK] token used in training does not appear during fine-tuning.

    BERT is trained to predict tokens replaced with the special [MASK] token. The problem is that the [MASK] token, which is at the center of training BERT, never appears when fine-tuning BERT on downstream tasks. This can cause a whole host of issues, such as:
 - What does BERT do for tokens that are not replaced with [MASK]?
 - In most cases, BERT can simply copy non-masked tokens to the output. So would it really learn to produce meaningful representations for non-masked tokens?
 - Of course, BERT still needs to accumulate information from all words in a sequence to denoise [MASK] tokens. But what happens if there are no [MASK] tokens in the input sentence?

   There are no clear answers to the above questions, but it is clear that the [MASK] token is a source of train-test skew that can cause problems during fine-tuning.
   

2. BERT generates predictions independently. Another problem stems from the fact that BERT predicts masked tokens in parallel, meaning that during training it does not learn to handle dependencies between simultaneously masked tokens. In other words, it does not learn dependencies between its own predictions.

   Since BERT is not actually used to unmask tokens in downstream tasks, this is not directly a problem. It matters indirectly: it reduces the number of dependencies BERT learns at once, making the learning signal weaker than it could be.
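
A small worked example, in the spirit of the one used in the XLNet paper, may make the independence issue concrete. Assume the sentence is "New York is a city" and that "New" and "York" are the two prediction targets. The objectives below are a simplified sketch under that assumption, not the exact training losses:

$$
\begin{aligned}
\mathcal{J}_{\text{BERT}} &= \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city}) \\
\mathcal{J}_{\text{XLNet}} &= \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New},\ \text{is a city})
\end{aligned}
$$

The second line assumes a factorization order in which "New" is predicted before "York". The only difference is the extra conditioning on "New" in the last term, which is the dependency between the two targets that the BERT-style objective never sees.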
    
<img src="conceptual_difference.png" alt="Conceptual difference between BERT and XLNet" title="Conceptual difference between BERT and XLNet" />


  
***
#### Solution:

How does XLNet do what it does?

**XLNet**
- How XLNet uses Transformer-XL:

    XLNet addresses the previously mentioned problems by introducing a variant of language modeling called "permutation language modeling". Like a traditional language model, a permutation language model is trained to predict one token given the preceding context, but instead of predicting the tokens in sequential order, it predicts them in some random order.

    The figure above shows the conceptual difference between BERT and XLNet. Transparent words are masked out so the model cannot rely on them. XLNet learns to predict the words in an arbitrary order but in an autoregressive, sequential manner (not necessarily left-to-right). BERT predicts all masked words simultaneously.

    Aside from using permutation language modeling, XLNet improves upon BERT by using Transformer-XL as its base architecture. Transformer-XL showed state-of-the-art performance in language modeling, so it was a natural choice for XLNet.

    XLNet uses the two key ideas from Transformer-XL: relative positional embeddings and the recurrence mechanism. The hidden states from the previous segment are cached and frozen while conducting the permutation language modeling for the current segment. Since all the words from the previous segment are used as input, there is no need to know the permutation order of the previous segment.
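
The interaction between the permutation order and the cached previous segment can be sketched as a visibility mask. The code below is a minimal illustration, assuming a single sampled factorization order; `permutation_mask`, `seg_len`, `mem_len`, and `seed` are hypothetical names chosen for this example, and XLNet's actual attention implementation is more involved than this.

```python
import random


def permutation_mask(seg_len, mem_len=0, seed=0):
    """Build a visibility mask for one sampled factorization order.

    Row i describes what the prediction for position i may look at.
    Columns [0, mem_len) are cached tokens from the previous segment;
    columns [mem_len, mem_len + seg_len) are the current segment.
    """
    rng = random.Random(seed)
    order = list(range(seg_len))
    rng.shuffle(order)  # order[k] is the position predicted at step k
    step_of = {pos: step for step, pos in enumerate(order)}

    mask = []
    for i in range(seg_len):
        # Previous segment: always visible, regardless of the permutation.
        memory_part = [True] * mem_len
        # Current segment: only positions predicted earlier in the order.
        current_part = [step_of[j] < step_of[i] for j in range(seg_len)]
        mask.append(memory_part + current_part)
    return order, mask


order, mask = permutation_mask(seg_len=4, mem_len=2, seed=0)
print("factorization order:", order)
for i, row in enumerate(mask):
    print(i, "".join("x" if visible else "." for visible in row))
```

The memory columns are visible in every row, which mirrors the point that the permutation order of the previous segment does not need to be known. Within the current segment, the position predicted first sees nothing and the position predicted last sees every other position, whatever their left-to-right placement.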


Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

[1](https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335)

<img src="results.png" alt="Results of tests" title="Results of tests" />

# Background

SUMMARY OF RESULTS FROM PAPER

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Tom et al. [1] | They trained a BERT-based transformer to predict answers from the passage of a question | SQuAD dataset for QA | Only 80% accuracy |
| George et al. [2] | They trained an attention-based sequence-to-sequence model using LSTM to predict answers from the passage of a question | SQuAD v2.0 dataset for QA | High accuracy but poor on unknown answers |


The last row in this table should be about the method discussed in this paper. (If you can't find the weakness of this method, then write about future improvements; see the future work section of the paper.)

# Methodology

- Explain text classification/sentiment analysis problem
- Describe SST2 database
- Talk about XLNet + Classifier
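
One possible shape for the "XLNet + Classifier" step is sketched below. It assumes the Hugging Face `transformers` library and PyTorch, the public `xlnet-base-cased` checkpoint, and the GLUE convention for SST-2 labels (0 = negative, 1 = positive). None of these choices come from this notebook, and the helper names `argmax_labels` and `classify` are hypothetical. The classification head loaded this way is randomly initialised, so the predictions are meaningless until the model has been fine-tuned on SST-2.

```python
# Assumed label convention for SST-2 (as used in GLUE).
LABELS = {0: "negative", 1: "positive"}


def argmax_labels(logit_rows):
    """Map each row of class logits to the name of its highest-scoring label."""
    return [LABELS[max(range(len(row)), key=row.__getitem__)] for row in logit_rows]


def classify(sentences, model_name="xlnet-base-cased"):
    """Run sentences through XLNet with a 2-way classification head.

    Imports are local so that the helpers above work without the heavy
    dependencies installed. Calling this function downloads the checkpoint.
    """
    import torch
    from transformers import XLNetForSequenceClassification, XLNetTokenizer

    tokenizer = XLNetTokenizer.from_pretrained(model_name)
    model = XLNetForSequenceClassification.from_pretrained(model_name, num_labels=2)
    model.eval()

    # Pad to the longest sentence in the batch and return PyTorch tensors.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (batch_size, 2)
    return argmax_labels(logits.tolist())
```

A call such as `classify(["a gorgeous, witty film", "a tedious mess"])` would return one label per sentence. Fine-tuning on the SST-2 training split would be a separate step that is not shown here.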

Provide details of the method that you are implementing in the next section with figure(s). Your methodology will be just one method discussed in one of the papers of your choice; it can be a merger or a simplified version of the papers. To avoid any confusion, do not present multiple methods, just one unified method as you will implement in the next section.

For figures you can use this tag:

![Alternate text ](Figure.png "Title of the figure, location is simply the directory of the notebook")

# Implementation

In this section, you will provide the code and its explanation. You may have to create more cells after this.

In [2]:
# Code cells

In [4]:
# Code cells

In [3]:
# Code cells

# Conclusion and Future Direction

Write a few sentences about the results and their limitations, and how they can be extended in the future.

# References:

[1]:  Author names, title of the paper, Conference Name, Year, page number (if available)

[2]:  Author names, title of the paper, Journal Name, Journal Vol, Issue Num, Year, page number (if available)