## Introduction <a class="anchor"  id="chapter1"></a>

The Disaster Tweets dataset comprises several thousand tweets that have been manually labeled as either describing a real disaster or not. For example, someone may colloquially use the word "fire" to describe a song, while other tweets containing the word "fire" may be describing a real fire.

In this notebook, I will build NLP classification models to predict whether tweets in the test dataset describe actual disasters or are just false alarms. The models will rely mostly on BERT and RoBERTa, two powerful NLP models created by researchers at Google AI Language and at Facebook AI, respectively. I will also compare the performance of BERT and RoBERTa against a simple feed-forward neural network (FNN) trained on metadata features extracted from the text.
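To make "metadata features" concrete, here is a minimal sketch of the kind of surface-level features that could be extracted from a tweet. The function name `extract_metadata`, the choice of features, and the example tweet are all illustrative assumptions, not necessarily the features used later in this notebook.

```python
import re

def extract_metadata(tweet: str) -> dict:
    """Compute a few simple surface features of a tweet (illustrative choices)."""
    return {
        "char_count": len(tweet),
        "word_count": len(tweet.split()),          # whitespace-separated tokens
        "hashtag_count": len(re.findall(r"#\w+", tweet)),
        "mention_count": len(re.findall(r"@\w+", tweet)),
        "url_count": len(re.findall(r"https?://\S+", tweet)),
        "exclamation_count": tweet.count("!"),
    }

# Made-up tweet, not taken from the dataset
example = "Huge fire downtown, stay safe! #fire @news http://example.com/x"
features = extract_metadata(example)
print(features)
```

Numeric features like these can be stacked into a vector per tweet and fed to a small feed-forward network as a baseline.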

To build my models, I will use the transformers library from Hugging Face with PyTorch. It is a widely used open-source library for working with transformer-based models such as BERT, GPT-2, and RoBERTa. It provides a large collection of pre-trained models along with tools for tokenization, fine-tuning, and evaluation, which makes it easy to implement and experiment with these models.

## Theory <a class="anchor"  id="chapter2"></a>

BERT is known for performing well on a wide range of NLP tasks, including text classification, question answering, and named entity recognition. It is particularly notable for its ability to capture the meaning of a word from its surrounding context, which has led to its use in search engines such as Google Search to better match results to the intent of a query.

BERT stands for Bidirectional Encoder Representations from Transformers, and was first published in 2018. It is a neural network model that differs from earlier language models in that it uses the context on both the left and the right of each word at the same time (i.e., bidirectionally), rather than reading a sentence in a single direction.

Take the classic example below: 

![image.png](attachment:bf42adc9-fca5-4874-a146-97277de3e4f3.png)

A model that reads strictly from left to right would likely predict correctly that the word "bank" in the first sentence relates to a river bank, because the relevant context comes before the word. However, it would be more likely to misinterpret "bank" in the second sentence, as the context needed to make the correct prediction (financial institution) comes after the word itself. BERT differs in that its self-attention layers let every word attend to all the other words in the sentence, both before and after it, in a single pass.
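The difference in available context can be sketched in a few lines. The sentence below is a made-up stand-in for the "bank" example (it is not the sentence from the image), and the two helper functions are hypothetical names that only list which other tokens each kind of model can use for a given position. They do not implement attention itself.

```python
def left_to_right_context(tokens, i):
    """Tokens a strictly left-to-right model can use for position i."""
    return tokens[:i]

def bidirectional_context(tokens, i):
    """Other tokens a bidirectional encoder can use for position i."""
    return tokens[:i] + tokens[i + 1:]

# Hypothetical sentence where the disambiguating words come after "bank"
tokens = "she went to the bank to deposit her paycheck".split()
i = tokens.index("bank")

print("left-to-right:", left_to_right_context(tokens, i))
print("bidirectional:", bidirectional_context(tokens, i))
```

Only the bidirectional context contains "deposit" and "paycheck", the words that point to the financial meaning.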

One of the key components of the BERT model is masked language modeling (MLM). MLM is a pre-training technique in which randomly selected tokens in the input (about 15% in the original paper) are masked, and the model is trained to predict the original tokens from the surrounding context. By doing so, BERT learns how each word relates to the rest of the sentence. MLM is particularly useful for handling polysemy, where a single word can have multiple meanings depending on the context in which it appears, as we saw above. By training on a large corpus of text using MLM, BERT is able to develop a robust understanding of the contextual nuances of language.
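The masking step can be sketched in plain Python. In the original BERT recipe, about 15% of tokens are selected; of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The function below is a simplified illustration under stated assumptions: the name `mask_tokens`, the tiny vocabulary, and the per-token coin flip (rather than selecting an exact number of positions) are my own choices, not the reference implementation.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for one MLM training example.

    labels[i] is the original token at selected positions and None
    elsewhere (positions with None are ignored by the loss).
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                        # position not selected
        labels[i] = tok                     # model must predict the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK                # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

tokens = "there is a fire in the forest".split()
toy_vocab = ["flood", "song", "storm"]      # toy vocabulary for random replacement
inputs, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=1)
print(inputs)
print(labels)
```

During pre-training, the loss is computed only at the positions where `labels` is not `None`.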

Another important aspect of the BERT model is next sentence prediction (NSP). This is a pre-training task where the model is presented with pairs of sentences and is trained to predict whether the second sentence actually follows the first in the original text or was sampled at random from the corpus. By training on this task, BERT is encouraged to learn relationships between sentences, which is important in tasks such as question answering and textual entailment.
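A rough sketch of how NSP training pairs could be built is shown below: half of the time the second sentence is the true next sentence, and half of the time it is a random sentence from a different document. The function name `make_nsp_pairs` and the toy documents are assumptions made for illustration, and the sketch requires at least two documents.

```python
import random

def make_nsp_pairs(documents, seed=0):
    """Build (sentence_a, sentence_b, label) triples for NSP.

    documents: list of documents, each a list of sentences (needs >= 2 documents).
    label is "IsNext" when sentence_b really follows sentence_a, else "NotNext".
    """
    rng = random.Random(seed)
    pairs = []
    for doc in documents:
        # Sentences from every other document serve as negative candidates
        others = [s for other in documents if other is not doc for s in other]
        for a, true_next in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                pairs.append((a, true_next, "IsNext"))
            else:
                pairs.append((a, rng.choice(others), "NotNext"))
    return pairs

# Toy corpus, invented for this example
documents = [
    ["The river flooded overnight.", "Roads were closed by morning.", "Crews arrived at noon."],
    ["The band released a new album.", "Fans called the first track fire."],
]
for a, b, label in make_nsp_pairs(documents, seed=0):
    print(label, "|", a, "->", b)
```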

RoBERTa (a Robustly Optimized BERT Pretraining Approach) is an extension of the BERT model that was introduced by Facebook AI in 2019. It keeps BERT's architecture but revisits the pre-training recipe to address some of the weaknesses of the original model. While BERT achieved state-of-the-art results on several NLP tasks when it was released, RoBERTa surpassed it on benchmarks such as GLUE, SQuAD, and RACE.

The key differences between RoBERTa and BERT lie in the way they are pre-trained. BERT was trained on a large corpus of text using MLM and NSP, as described above. RoBERTa uses the same masked language modeling objective but with several modifications. Its authors increased the batch size, the amount of training data, and the training time, removed the next sentence prediction objective, and applied dynamic masking, so that a new masking pattern is generated each time a sequence is fed to the model instead of being fixed during preprocessing. RoBERTa also uses a different tokenization method, byte-level byte-pair encoding (BPE) with a larger vocabulary in place of BERT's WordPiece, which allows it to encode rare words and unusual characters without resorting to an unknown token. Together, these modifications account for RoBERTa's improved performance over BERT.
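RoBERTa's tokenizer is a byte-level BPE, meaning its base alphabet is the 256 possible byte values rather than a fixed set of characters. The sketch below illustrates only that base layer: any string, including emoji that are common in tweets, can be losslessly broken down into bytes, so nothing is ever out of vocabulary. The learned BPE merges that combine frequent byte sequences into larger tokens are not implemented here, and the helper names are hypothetical.

```python
def to_byte_symbols(text: str) -> list:
    """Decompose text into UTF-8 byte values (the base symbols of byte-level BPE)."""
    return list(text.encode("utf-8"))

def from_byte_symbols(symbols: list) -> str:
    """Reassemble the original text from its byte values."""
    return bytes(symbols).decode("utf-8")

text = "wildfire 🔥"
symbols = to_byte_symbols(text)

print(symbols)
print(from_byte_symbols(symbols) == text)
```

Because every symbol falls in the range 0 to 255, a tokenizer built on top of this alphabet can represent any input without an unknown token.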