
Comparing Performance and Accuracy of Big Bird and XLNet for Text Summarization

Python 3.7 Maintenance License Apache 2.0

See the presentation (PPT or PDF)

Overview | Findings | Transformers | Big Bird | Method | Dataset | Insights | Future | Conclusion | References

OVERVIEW

In this research, two text summarization models, Big Bird and XLNet, are compared using a set of metrics and a timer. The metrics are the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, and a timer on a 1.6 GHz CPU is used to assess algorithmic efficiency. Both models rely on transfer learning. Transformers are known for solving a variety of Natural Language Processing (NLP) tasks such as text generation. A fundamental question motivates this research: the Google Research team developed an approach called block sparsity to address the self-attention problem inherent in Transformer models, and used a mathematical assessment to argue that block sparsity reduces the quadratic dependency between the number of tokens and memory or time to a linear dependency (Zaheer et al., 2020), a claim this research treats with skepticism.

KEY FINDINGS

  • As indicated by the Randomized Controlled Trial analysis, the Big Bird model's performance is significantly higher than XLNet's at the Bonferroni-corrected significance level.

  • However, XLNet outperforms the Big Bird model in efficiency, based on the time taken to produce a predicted summary for each article text.

  • There is evidence that the Big Bird model produces some redundancies in its predicted summaries.

  • Other evidence shows that the Big Bird model does reduce the quadratic dependency to a linear dependency, contrary to my hypothesis.

TRANSFORMERS ARCHITECTURE

To understand where the quadratic dependency comes from, the Transformer architecture and its history need to be addressed first. Before the Transformer architecture was established, long short-term memory (LSTM) and gated recurrent neural networks were considered the state-of-the-art approaches to NLP problems. The significant constraint of these models is sequential computation, and the attention mechanism removes that constraint. As a result, the Transformer architecture has proven to be a milestone in NLP over the years.

[Figure: the Transformer encoder-decoder architecture (Vaswani et al., 2017)]

As shown above, the architecture consists of two multi-layered parts, the encoder and the decoder, and both take word embeddings as input. The difference between encoder and decoder is relatively straightforward: the encoder works on the representation X = (x1, ..., xn), while the decoder works on the representation Z = (z1, ..., zn). In this research, the encoder's representation is the word embedding of the unsummarized article text, and the decoder's representation is the word embedding of the actual summarized text.

Self-attention is the building block of each head_i, and the heads together form multi-head attention, as shown in the equations below. Scaled dot-product attention multiplies the matrix of queries by the transposed keys, divides by the square root of the key dimensionality, applies the softmax function, and multiplies the result by the values. The softmax is a generalized version of the logistic function, and the values can be considered as weights that are updated. Self-attention is implemented with matrix multiplication in order to be more space-efficient and faster.

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

In head_i, each of Q, K, and V is multiplied by its own weight matrix, and multi-head attention is simply a concatenation of the heads (Vaswani et al., 2017). In graph-theoretic terms, the problem is that the attention mechanism has a quadratic dependency because it forms a fully connected graph over the tokens. This is known as the sparsification problem (Zaheer et al., 2020). For this research, XLNet is used as the comparison against Big Bird; XLNet's key change is to maximize the log-likelihood of the sequence with respect to permutations of the factorization order (Yang et al., 2020).
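
To make the equations above concrete, the following is a minimal NumPy sketch of scaled dot-product attention and multi-head concatenation. The array shapes, weight initialization, and number of heads are illustrative assumptions, not the configuration used by either pre-trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): every token attends to every token
    return softmax(scores) @ V        # hence the quadratic cost in sequence length n

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    # Each head projects X with its own weights, then the heads are concatenated.
    heads = [
        scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy example: 6 tokens, model dimension 8, 2 heads of size 4 (illustrative sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_q = [rng.normal(size=(8, 4)) for _ in range(2)]
W_k = [rng.normal(size=(8, 4)) for _ in range(2)]
W_v = [rng.normal(size=(8, 4)) for _ in range(2)]
W_o = rng.normal(size=(8, 8))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)  # (6, 8)
```

The (n, n) score matrix is exactly the fully connected graph described above, which is what block sparsity later prunes.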

BIG BIRD

As mentioned earlier, this architecture has a sparsification problem: full attention connects every token to every other token, leading to a quadratic increase in memory and time. The Google Research team attempts to remedy the problem with block sparsity, which consists of three different types of connections: global, sliding, and random. For example, the sentence "How have you been so far?" has six tokens, and each token connects to the others in particular ways according to these three connection types.

However, the concept of global and sliding connections is not novel; it is similar to the approach in Generating Long Sequences with Sparse Transformers (Child et al., 2019). What makes the Big Bird algorithm different is the random connection. The number of connections in block sparsity is smaller than the number of connections in full attention. The reduction may seem modest for short inputs, but it becomes more significant as the sequence length increases.

Using the random connection is a concern because it lacks algorithmic intention, and intention in the algorithm is important for ensuring that the model performs well. On the other hand, the Google Research team appears to justify the algorithm through the Central Limit Theorem (CLT) and the Law of Large Numbers (LLN); in other words, their assumption is that the predicted summary becomes consistent as the sequence length grows. There is an alternative suggestion that may remedy this type of connection, which is discussed after the model assessment.

[Figure: Big Bird block sparse attention matrix with global, sliding, and random connections (Gupta, 2021)]

As shown above, each color corresponds to global, sliding, or random connections, while each white slot represents no connection; for example, there is no connection between "work" and "is." This approach reduces time complexity, but it comes at the price of losing theoretical guarantees, as the Google Research team acknowledges (Zaheer et al., 2020).
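
The following is a minimal sketch of how a block sparse attention mask with global, sliding, and random connections could be built. The block size of one token, the number of global tokens, the window width, and the number of random connections are illustrative assumptions rather than Big Bird's exact implementation.

```python
import numpy as np

def block_sparse_mask(n_tokens, n_global=1, window=1, n_random=1, seed=0):
    """Boolean mask where mask[i, j] = True means token i may attend to token j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)

    # Global connections: the first n_global tokens attend everywhere and are attended to by all.
    mask[:n_global, :] = True
    mask[:, :n_global] = True

    # Sliding-window connections: each token attends to `window` neighbors on each side.
    for i in range(n_tokens):
        lo, hi = max(0, i - window), min(n_tokens, i + window + 1)
        mask[i, lo:hi] = True

    # Random connections: each token additionally attends to a few randomly chosen tokens.
    for i in range(n_tokens):
        mask[i, rng.choice(n_tokens, size=n_random, replace=False)] = True
    return mask

mask = block_sparse_mask(n_tokens=6)
print(mask.sum(), "connections instead of", 6 * 6, "in full attention")
```

In a sparse attention layer, the score entries where the mask is False would be set to a large negative value before the softmax, so only the allowed connections contribute; the total number of connections then grows roughly linearly with the sequence length rather than quadratically.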

METHOD

Two approaches are used to experiment on both text summarization models: a partial NLP data science pipeline and Randomized Controlled Trials (RCTs). The partial NLP data science pipeline is a life cycle that includes only literature review, data quality assessment, data cleanup, exploratory analysis, and feature engineering; it does not include predictive modeling, since, as mentioned earlier, Big Bird and XLNet are pre-trained transfer learning models. The RCT part randomly samples ArXiv articles, summarizes them with the pre-trained models, and applies statistical inference. The analysis uses the following features and target variables:

Features:

  • article id
  • article text
  • actual abstract text
  • predicted summary
  • article word counts
  • abstract word counts
  • predicted summary word counts
  • big bird (binary category)

Target variables:

  • time per predicted summary (in seconds)
  • rouge 1 F1 score
  • rouge 2 F1 score
  • rouge L F1 score

The ROUGE-N F1-score is a measure of model accuracy based on the number of matching n-grams between the predicted summary and the ground-truth summary. For example, ROUGE-1 counts matching unigrams while ROUGE-2 counts matching bigrams. ROUGE-L is slightly different: it is based on the longest common subsequence (LCS) between the predicted summary and the ground-truth summary, where the LCS is the longest sequence of tokens that appears in both texts in the same order, not necessarily contiguously. Data collection for both models takes two days to compute.
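
As a concrete reference, below is a minimal sketch of the ROUGE-1 and ROUGE-L F1 computations on whitespace tokens. The tokenization is a simplifying assumption; scoring in practice would typically rely on an established ROUGE package.

```python
from collections import Counter

def rouge_n_f1(pred, ref, n=1):
    """ROUGE-N F1: overlap of n-grams between prediction and reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    p, r = ngrams(pred.split()), ngrams(ref.split())
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def rouge_l_f1(pred, ref):
    """ROUGE-L F1: based on the longest common subsequence (LCS) of tokens."""
    a, b = pred.split(), ref.split()
    # Dynamic programming table for the LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)

print(rouge_n_f1("the model summarizes text", "the model writes text", n=1))  # 0.75
print(rouge_l_f1("the model summarizes text", "the model writes text"))       # 0.75
```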

DATASET

As mentioned earlier, the ArXiv dataset, prepared with TensorFlow, is used to evaluate the models. This dataset contains three features: article id, article text, and actual abstract text. There are three subsets: testing (6,658 entries), training (119,924 entries), and validation (6,633 entries). For this research, the validation set is used to evaluate both models. Based on the exploratory analysis, 70.8% of tokens in the article texts match NLTK dictionaries, while 62.05% of tokens in the abstract texts do. The Big Bird model is pre-trained on a Wikipedia dataset (Zaheer et al., 2020) and the XLNet model is pre-trained on several datasets other than the ArXiv dataset (Yang et al., 2020), so the validation set can be considered unseen data. However, using the entire set is infeasible and time-consuming, so the ArXiv articles are randomly sampled; the sample size is 110 for each model.
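
A minimal sketch of how the validation split could be loaded and subsampled, assuming the TensorFlow Datasets scientific_papers/arxiv corpus is the one in use; the field names, shuffle buffer, and seed are illustrative assumptions.

```python
import tensorflow_datasets as tfds

# Load the ArXiv validation split; "scientific_papers/arxiv" is the assumed TFDS name.
val_ds = tfds.load("scientific_papers/arxiv", split="validation")

# Randomly sample 110 articles, mirroring the sampling size used per model in this research.
sample = val_ds.shuffle(10_000, seed=42).take(110)

for example in tfds.as_numpy(sample):
    article = example["article"].decode("utf-8")
    abstract = example["abstract"].decode("utf-8")
    # ... feed `article` to each pre-trained summarizer and keep `abstract` as the ground truth
```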

ACTIONABLE INSIGHTS

After the data is collected, the information is assessed with statistical inference and descriptive statistics. Before the actionable insights are discussed, one point is important to mention: because the models are analyzed with three different metrics, the Bonferroni correction is applied first. The correction is used to prevent Type I error, albeit conservatively, so it is prone to Type II error (failing to reject the null hypothesis when it is false), which can be concerning in general. On the other hand, this research focuses on the Big Bird model's performance compared to XLNet, and the Big Bird model does outperform XLNet at the corrected significance level on every metric listed in this research. This research attempts to address three questions, in order:

  • Does the Big Bird model outperform the XLNet model on the quality of the predicted summary?

  • Compared with the XLNet model, how fast does Big Bird produce each predicted summary?

  • Does Big Bird successfully reduce the quadratic dependency to a linear dependency in the sequence length?

As shown in the plot above, black (or "0") represents the XLNet model while the second color (or "1") represents the Big Bird model. The confidence intervals are 95% intervals with Bonferroni correction, which makes them wider than without the correction. Average ROUGE-1 for Big Bird and XLNet is 57.65% with a 4.66% margin of error and 25.66% with a 2.27% margin of error, respectively, while average ROUGE-2 is 48.64% with a 5.67% margin of error and 5.57% with a 0.93% margin of error. Average ROUGE-L is 52.78% with a 5.02% margin of error and 14.46% with a 1.01% margin of error. In short, the Big Bird model does outperform the XLNet model at the corrected significance level. However, producing a predicted summary with this model on a 1.6 GHz CPU takes 25.8 minutes at the median (26.6 minutes at the mean). In other words, the model processes text only slightly faster than the average reading speed of 200 words per minute (Rayner et al., 2010), so it does not scale well for local applications.
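
A minimal sketch of how a Bonferroni-corrected 95% confidence interval for each metric could be computed; the score array is a placeholder, and the exact test procedure used in the analysis may differ.

```python
import numpy as np
from scipy import stats

def bonferroni_ci(sample, n_comparisons=3, confidence=0.95):
    """Mean and margin of error with the alpha level split across n_comparisons metrics."""
    alpha = (1 - confidence) / n_comparisons          # e.g. 0.05 / 3 for the three ROUGE metrics
    mean = np.mean(sample)
    sem = stats.sem(sample)                           # standard error of the mean
    t_crit = stats.t.ppf(1 - alpha / 2, df=len(sample) - 1)
    return mean, t_crit * sem

# Placeholder ROUGE-1 scores for one model; the real data has 110 sampled articles per model.
rouge1_scores = np.random.default_rng(0).uniform(0.4, 0.7, size=110)
mean, moe = bonferroni_ci(rouge1_scores)
print(f"ROUGE-1: {mean:.2%} ± {moe:.2%} (Bonferroni-corrected 95% CI)")
```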

There is no evidence of overlap in the time distributions of the Big Bird and XLNet models, which is why a violin plot is used. As the plot shows, XLNet is much faster than Big Bird. Both time distributions are leptokurtic and right-skewed, but the Big Bird model has higher skewness and kurtosis than the XLNet model, which is why skepticism exists about Big Bird's algorithmic efficiency. When loess is applied, however, the result is surprising. Before going further, the reason for using loess needs to be explained: loess is non-parametric, so it helps focus on the relationship between time and word counts with minimal assumptions.

There are two sliders with two different plots: the first shows time per predicted summary, and the second shows the same relationship on a logarithmic scale. The logarithm is used to mitigate outlier problems, since outliers may have a significant impact on loess regression. Both plots confirm that the Big Bird algorithm successfully establishes a linear relationship between the number of tokens and the time per predicted summary, which is surprising. Aside from the scalability issue, the Big Bird algorithm turns out to be successful in the text summarization area.
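
Below is a minimal sketch of the loess (lowess) fit between article word counts and time per predicted summary using statsmodels; the placeholder data frame, column names, and smoothing fraction are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

# Placeholder data frame standing in for the collected results (110 sampled articles).
rng = np.random.default_rng(0)
df = pd.DataFrame({"article_word_counts": rng.integers(1_000, 12_000, size=110)})
df["time_seconds"] = 60 + 0.15 * df["article_word_counts"] + rng.normal(0, 30, size=110)

# Lowess fit on the raw times and on the log-transformed times to dampen outliers.
raw_fit = lowess(df["time_seconds"], df["article_word_counts"], frac=0.5)
log_fit = lowess(np.log(df["time_seconds"]), df["article_word_counts"], frac=0.5)

# Each fit is an array of (word_count, smoothed_time) pairs, ready to plot over the scatter.
print(raw_fit[:3])
print(log_fit[:3])
```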

FUTURE RESEARCH

Two issues have been identified in the Big Bird algorithm: the first, unsurprisingly, is scalability, and the second is the random connection in block sparse attention. Both problems are challenges for future research. To make the algorithm scalable, each self-attention layer in the original architecture needs to be replaced with the Attention Free Transformer (AFT); in addition, block sparse attention itself needs to be modified. In doing so, the Transformer with AFT needs to be assessed on model performance and scalability and then compared with the Big Bird algorithm. The next experiment is to replace each self-attention layer with AFT while preserving block sparse attention. Another experiment is to modify block sparse attention without replacing self-attention. The final experiment combines the self-attention replacement with the modified block sparse attention. In other words, there are five experiments in total, including the Big Bird baseline, to determine which algorithm outperforms the others.

The modified block sparse attention would replace the random connection with a Bayesian connection inspired by Bayesian optimization. This optimization consists of three functions: objective, acquisition, and surrogate. The objective function has a true shape that is unobservable and can only be revealed at some data points, which may be expensive to compute, while the surrogate function is a probabilistic model built to exploit what is known and can be updated as new information arrives. The acquisition function is used to select, in this case, the adjacency matrix that is likely to yield a higher local maximum of the objective function according to the surrogate function (Brochu et al., 2010).
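
A minimal sketch of the Bayesian optimization loop (surrogate, acquisition, objective) on a one-dimensional toy objective; extending this to selecting adjacency matrices for block sparse attention is speculative and would require a different search space.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Toy "expensive" objective; in the proposed setting this would be a model-quality score.
    return -(x - 2.0) ** 2 + np.sin(5 * x)

def expected_improvement(candidates, gp, best_y):
    # Acquisition: expected improvement over the best observed value under the surrogate.
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(3, 1))            # a few initial (expensive) evaluations
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):
    gp.fit(X, y)                              # surrogate: probabilistic model of the objective
    candidates = np.linspace(0, 4, 200).reshape(-1, 1)
    ei = expected_improvement(candidates, gp, y.max())
    x_next = candidates[np.argmax(ei)]        # acquisition picks the next point to evaluate
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next))

print("best x:", X[np.argmax(y)].item(), "best value:", y.max())
```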

Future research should therefore focus on AFT because of its advantages. In the computer vision area, a recent study shows that AFT is efficient and yields strong results (Zhai et al., 2021). However, AFT has not been tested for text summarization in that paper, which is why scalability and block sparsity need to be evaluated with both of these relatively novel approaches in the future.

CONCLUSION

Both the Big Bird and XLNet models are tested for performance and efficiency in a local application, and there is a clear trade-off between accuracy and efficiency. The Big Bird model does better at predicting summaries, and it successfully linearizes the self-attention mechanism using block sparsity, but a cloud environment is a prerequisite for running it efficiently. The Big Bird algorithm is therefore highly recommended for producing summaries as long as a cloud environment is used, although it still has scalability and redundancy problems, as seen in several predicted texts. For future research, in order to determine whether this novel algorithm can be improved, testing AFT and the Bayesian connection is strongly recommended.

REFERENCES

Brochu, E., Cora, VM., and de Freitas, N. “A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.” arXiv, Dec. 2010. https://www.math.umd.edu/~slud/RITF17/Tutorial_on_Bayesian_Optimization.pdf

Child, R., Gray, S., Radford, A., and Sutskever, I. “Generating Long Sequences with Sparse Transformers.” arXiv, 2019. https://arxiv.org/pdf/1904.10509.pdf

Gupta, V. “Understanding BigBird’s Block Sparse Attention.” Huggingface, Mar. 2021. https://huggingface.co/blog/big-bird

Rayner, K., Slattery, TJ., and Bélanger, NN. “Eye movements, the perceptual span, and reading speed.” Psychon Bull Rev., Dec. 2010. doi: 10.3758/PBR.17.6.834

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, AN., Kaiser, Ł., and Polosukhin, I. “Attention is All You Need.” Advances in Neural Information Processing Systems 30, NIPS, 2017. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, QV. “XLNet: Generalized Autoregressive Pretraining for Language Understanding.” arXiv, Jan. 2020. https://arxiv.org/pdf/1906.08237v2.pdf

Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., and Ahmed, A. “Big Bird: Transformers for Longer Sequences.” Advances in Neural Information Processing Systems 33, NeurIPS, 2020. https://papers.nips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf

Zhai, S., Talbott, W., Srivastava, N., Huang, C., Goh, H., Zhang, R., and Susskind, J. “An Attention Free Transformer.” arXiv, Sep. 2021. https://arxiv.org/pdf/2105.14103.pdf
