# Large Language Models and Transfer Learning 
	
The deep learning models covered so far vary in architecture, but they share a high level of complexity and a high training cost. Building some of these models from scratch requires a massive amount of labeled data, substantial compute, and the budget to pay for that compute, and those resources are not always available. To work around this, the industry leans on **transfer learning** with pre-trained **large language models** (LLMs) to bring these powerful neural networks to NLP. Transfer learning is a machine learning technique in which a model trained on one task is repurposed for a second, related task. It aims to reduce the amount of training data and compute required for the new task by reusing the features and weights learned by the previously trained model. Those weights can also be fine-tuned, and how much fine-tuning is needed depends on how similar the original task is to the new one. While fine-tuning is useful for tailoring LLMs to specific tasks, it still requires a decent amount of labeled data. With the rise of transfer learning, research institutions have released more and more pre-trained LLMs for public use.
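
To make the idea of reusing learned weights concrete, here is a minimal sketch in plain Python. It assumes a hypothetical "pretrained" layer (`PRETRAINED_WEIGHTS`, made-up numbers) that stays frozen, and trains only a small new head on a handful of labeled examples for the new task. A real LLM works at a vastly larger scale, but the division of labor is the same.

```python
import math

# Hypothetical stand-in for a pretrained model: a fixed feature extractor.
# Its weights are treated as already learned and are never updated below.
PRETRAINED_WEIGHTS = [[0.9, -0.4], [-0.3, 0.8]]

def pretrained_features(x):
    """Apply the frozen layer (linear map followed by tanh)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_WEIGHTS]

def predict(head, bias, feats):
    """Probability of class 1 from the new task-specific head."""
    z = sum(w * f for w, f in zip(head, feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, epochs=200, lr=0.5):
    """Fit only the new head with logistic-regression gradient steps."""
    head, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            feats = pretrained_features(x)  # frozen: these weights do not change
            err = predict(head, bias, feats) - y
            head = [w - lr * err * f for w, f in zip(head, feats)]
            bias -= lr * err
    return head, bias

# Tiny labeled set for the new task (made-up numbers).
new_task_data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
head, bias = train_head(new_task_data)
predictions = [round(predict(head, bias, pretrained_features(x))) for x, _ in new_task_data]
print(predictions)
```

Only `head` and `bias` are learned here, which is why so few labeled examples are enough. Fine-tuning would go one step further and also update the pretrained weights.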

## BERT

The first common family of LLMs is the **BERT** family. **Bidirectional Encoder Representations from Transformers** (BERT) was originally developed by Google. It is an encoder-only transformer built from multiple stacked self-attention layers. BERT and BERT-adjacent models produce context-aware representations of the words or phrases in a sentence, which helps capture the relationships between them. BERT is commonly used for tasks like text classification, question answering, and sentiment analysis.


In [None]:
pip install transformers

### Hugging Face
[Hugging Face](https://huggingface.co/) is an open source community where pre-trained models are published for use. BERT and thousands of other models are available for free via Hugging Face's `transformers` library. The library supports both `TensorFlow` and `PyTorch` APIs.

In [1]:
import torch
from transformers import BertTokenizer



In [3]:
# Instantiate the tokenizer with the pretrained bert-base-uncased vocabulary
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Test it out
test_string = "Let's test out the Bert Tokenizer!!"

# Apply the tokenizer 
output = tokenizer.tokenize(test_string)
print(output)

['let', "'", 's', 'test', 'out', 'the', 'bert', 'token', '##izer', '!', '!']


In [4]:
# Convert to the indices using the BERT Tokenizer 
bert_output = tokenizer.convert_tokens_to_ids(output)
print(bert_output)

[2292, 1005, 1055, 3231, 2041, 1996, 14324, 19204, 17629, 999, 999]
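
The `##izer` piece in the output above comes from WordPiece-style subword splitting. The sketch below shows the greedy longest-match-first idea on a toy vocabulary. The vocabulary entries and ids are hypothetical, not BERT's real ones, and the real tokenizer also lowercases and splits on punctuation before this step.

```python
# Toy vocabulary: these entries and ids are hypothetical, not BERT's real ones.
TOY_VOCAB = {"[UNK]": 0, "let": 1, "test": 2, "token": 3, "##izer": 4, "##s": 5, "!": 6}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

tokens = wordpiece("tokenizer", TOY_VOCAB)
ids = [TOY_VOCAB[t] for t in tokens]  # id conversion is a plain vocabulary lookup
print(tokens, ids)
```

Because rare words are broken into known pieces, the vocabulary can stay small while still covering words it has never seen whole.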


## ELMo	
**ELMo**, or **Embeddings from Language Models**, is another common large language model that uses context from both the left and the right to generate accurate word representations. It was developed by the Allen Institute for Artificial Intelligence, and its output is commonly used as input to other NLP models. Keep in mind that the output is embeddings: numeric representations of words that convey contextual information from the surrounding words. As a result, ELMo is useful for tasks that require a deep understanding of complex language, like text generation or machine translation.

### TensorFlow Hub
As an alternative to Hugging Face, [TensorFlow Hub](https://www.tensorflow.org/hub) is a repository of pretrained models that can be used, fine-tuned, and deployed. However, unlike Hugging Face's `transformers` library, TensorFlow Hub is only compatible with the `TensorFlow` APIs.

## GPT	
The next family of models has recently taken over the spotlight: the **GPT**, or **Generative Pre-trained Transformer**, family of models developed by OpenAI. As the name states, GPT models are generative, and they are trained using a self-supervised approach. This means they are trained on massive amounts of unlabeled text using the transformer architecture, with the text itself supplying the training signal. Because they’re trained on so much data and can capture long-term dependencies, the GPT family of models is well suited for tasks like question answering and text generation.
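
As a rough illustration of what self-supervised means here, the sketch below turns raw text into next-token prediction examples with no human labeling. The helper name `next_token_pairs` and the whitespace split are assumptions for illustration only. GPT models use subword tokens and far longer contexts.

```python
def next_token_pairs(tokens, context_size):
    """Turn an unlabeled token sequence into (context, next_token) examples."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs

text = "the cat sat on the mat"   # any raw, unlabeled text works
tokens = text.split()             # whitespace split, only for illustration
pairs = next_token_pairs(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Every position in the text yields a training example, which is why unlabeled corpora can be used at such scale.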

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from torch import nn
from torch.optim import Adam
from transformers import GPT2Model, GPT2Tokenizer
from tqdm import tqdm

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

In [None]:
example_text = "Let's test the GPT2 tokenizer!!"
gpt2_input = tokenizer(example_text, padding="max_length", max_length=10, truncation=True, return_tensors="pt")

In [None]:
print(gpt2_input)
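
To see what `padding="max_length"`, `truncation=True`, and `padding_side = "left"` are doing in the cell above, here is a small plain-Python sketch. The function `pad_or_truncate` is a hypothetical helper, not part of `transformers`, and the short id lists are made up. The pad id 50256 is GPT-2's end-of-text token, which the notebook reuses as the pad token.

```python
def pad_or_truncate(ids, max_length, pad_id, padding_side="left"):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]                      # truncation keeps the first tokens
    n_pad = max_length - len(ids)
    pad, real = [pad_id] * n_pad, [1] * len(ids)
    if padding_side == "left":
        return pad + ids, [0] * n_pad + real
    return ids + pad, real + [0] * n_pad

PAD_ID = 50256  # GPT-2's end-of-text id, reused as the pad token in this notebook
short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=PAD_ID)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what tells the model to ignore the padded positions, so every sequence in a batch can share one length.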

In [None]:
# Define the labels dictionary
labels = {"0":0,"1":1}

In [None]:
class Dataset(torch.utils.data.Dataset):
    """
    Torch dataset class 
    """
    def __init__(self, df):
        """
        Constructor 
        
        @params:
        df: pd.DataFrame
        
        @returns:
        None
        """
        self.labels = [labels[label] for label in df['spam']]
        self.texts = [tokenizer(text,
                                padding='max_length',
                                max_length=128,
                                truncation=True,
                                return_tensors="pt") for text in df['text']]
        
    def classes(self):
        """
        Function to get labels
        
        @params:
        
        @returns:
        self.labels: labels of target in pd.DataFrame
        """
        return self.labels
    
    def __len__(self):
        """
        Get the number of examples in the dataset
        
        @params:
        
        @returns:
        len(self.labels): int, number of labeled examples
        """
        return len(self.labels)
    
    def get_batch_labels(self, idx):
        """
        Get the label for a given index as an array
        
        @params:
        idx: int, index 
        
        @returns:
        np.array(self.labels[idx]): np.array, label at the given index
        """
        # Get a batch of labels
        return np.array(self.labels[idx])
    
    def get_batch_texts(self, idx):
        """
        Get the tokenized text input for a given index
        
        @params:
        idx: int, index
        
        @returns:
        self.texts[idx]: tokenizer output (input_ids and attention_mask) for the index
        """
        # Get a batch of inputs
        return self.texts[idx]
    
    def __getitem__(self, idx):
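        """
        Get the tokenized text and label for a given index

        @params:
        idx: int, index

        @returns:
        batch_texts, batch_y: tokenizer output and np.array label for the index
        """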
        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_texts, batch_y
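
The class above follows the map-style dataset protocol: anything with `__len__` and `__getitem__`. The plain-Python sketch below shows roughly how a data loader walks such an object in batches. `ToyDataset` and `iterate_batches` are hypothetical names for illustration, and a real `DataLoader` additionally shuffles on request and collates each batch into tensors.

```python
class ToyDataset:
    """Minimal map-style dataset: anything with __len__ and __getitem__."""
    def __init__(self, texts, labels):
        self.texts, self.labels = texts, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

def iterate_batches(dataset, batch_size):
    """Yield lists of examples, roughly what a data loader does without shuffling."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

toy = ToyDataset(["a", "b", "c"], [0, 1, 0])
batches = list(iterate_batches(toy, batch_size=2))
print(batches)
```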

In [None]:
# Read in emails data
df = pd.read_csv("data/emails.csv")
df["spam"] = df["spam"].astype("str")

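# Shuffle, then split 80/10/10 into training, validation, and test sets
df_shuffled = df.sample(frac=1, random_state=35)
train_end, val_end = int(0.8 * len(df)), int(0.9 * len(df))
df_train = df_shuffled.iloc[:train_end]
df_val = df_shuffled.iloc[train_end:val_end]
df_test = df_shuffled.iloc[val_end:]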

print(f"Rows in training set: {len(df_train)}, Rows in validation set: {len(df_val)}, Rows in test set: {len(df_test)}")

In [None]:
class SimpleGPT2SequenceClassifier(nn.Module):
    """
    GPT2 Sequence classifier class
    """
    def __init__(self, hidden_size: int, num_classes:int ,max_seq_len:int, gpt_model_name:str):
        """
        SimpleGPT2SequenceClassifier constructor 
        
        @params:
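        hidden_size: int, size of each GPT-2 hidden state (768 for "gpt2")
        num_classes: int, number of output classes
        max_seq_len: int, padded sequence length, matching the tokenizer's max_length
        gpt_model_name: str, name of the pretrained GPT-2 checkpoint to load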

        @returns:
        None
        """
        super(SimpleGPT2SequenceClassifier,self).__init__()
        self.gpt2model = GPT2Model.from_pretrained(gpt_model_name)
        self.fc1 = nn.Linear(hidden_size*max_seq_len, num_classes)

        
    def forward(self, input_id, mask):
        """
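        Forward pass
        
        @params:
        input_id: torch.Tensor, token ids of shape (batch_size, max_seq_len)
        mask: torch.Tensor, attention mask marking real tokens (1) and padding (0)
        
        @returns:
        linear_output: torch.Tensor, class logits of shape (batch_size, num_classes)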
        """
        gpt_out, _ = self.gpt2model(input_ids=input_id, attention_mask=mask, return_dict=False)
        batch_size = gpt_out.shape[0]
        linear_output = self.fc1(gpt_out.view(batch_size,-1))
        return linear_output
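
The linear layer's input size, `hidden_size*max_seq_len`, follows from flattening every token's hidden state into one long vector per example. The sketch below mimics that flattening on nested lists and works out the sizes for the values this notebook uses. `flatten_per_example` is a hypothetical helper standing in for `tensor.view(batch_size, -1)`.

```python
def flatten_per_example(batch):
    """Mimic tensor.view(batch_size, -1) on nested lists of shape (batch, seq, hidden)."""
    return [[value for token_vector in example for value in token_vector]
            for example in batch]

toy_batch = [[[1, 2], [3, 4], [5, 6]],      # example 0: 3 tokens, hidden size 2
             [[7, 8], [9, 10], [11, 12]]]   # example 1
flat = flatten_per_example(toy_batch)
print(flat)

hidden_size, max_seq_len, num_classes = 768, 128, 2   # values used in this notebook
flattened_features = hidden_size * max_seq_len         # inputs to the linear layer
head_parameters = flattened_features * num_classes + num_classes  # weights plus biases
print(flattened_features, head_parameters)
```

This is also why `max_seq_len` must match the tokenizer's `max_length`: a different sequence length would produce a flattened vector the linear layer cannot accept.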

In [None]:
def train(model, train_data, val_data, learning_rate, epochs):
    """
    Function that fine tunes GPT2 model using the Email Spam data 
    
    @params:
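    model: SimpleGPT2SequenceClassifier, classifier to fine-tune
    train_data: pd.DataFrame, training rows with 'text' and 'spam' columns
    val_data: pd.DataFrame, validation rows with 'text' and 'spam' columns
    learning_rate: float, learning rate for the Adam optimizer
    epochs: int, number of passes over the training data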
    
    @returns:
    None
    
    """
    train, val = Dataset(train_data), Dataset(val_data)
    
    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)
    
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    
    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)
    
    if use_cuda:
        model = model.cuda()
        criterion = criterion.cuda()

    for epoch_num in range(epochs):
        total_acc_train = 0
        total_loss_train = 0
        
        for train_input, train_label in tqdm(train_dataloader):
            train_label = train_label.to(device)
            mask = train_input['attention_mask'].squeeze(1).to(device)
            input_id = train_input["input_ids"].squeeze(1).to(device)
            
            model.zero_grad()

            output = model(input_id, mask)
            
            batch_loss = criterion(output, train_label)
            total_loss_train += batch_loss.item()
            
            acc = (output.argmax(dim=1)==train_label).sum().item()
            total_acc_train += acc

            batch_loss.backward()
            optimizer.step()
            
        total_acc_val = 0
        total_loss_val = 0
        
        with torch.no_grad():
            
            for val_input, val_label in val_dataloader:
                val_label = val_label.to(device)
                mask = val_input['attention_mask'].squeeze(1).to(device)
                input_id = val_input['input_ids'].squeeze(1).to(device)
                
                output = model(input_id, mask)
                
                batch_loss = criterion(output, val_label)
                total_loss_val += batch_loss.item()
                
                acc = (output.argmax(dim=1)==val_label).sum().item()
                total_acc_val += acc
                
            print(
            f"Epochs: {epoch_num + 1} | Train Loss: {total_loss_train/len(train_data): .3f} \
            | Train Accuracy: {total_acc_train / len(train_data): .3f} \
            | Val Loss: {total_loss_val / len(val_data): .3f} \
            | Val Accuracy: {total_acc_val / len(val_data): .3f}")
            
EPOCHS = 1
model = SimpleGPT2SequenceClassifier(hidden_size=768, num_classes=2, max_seq_len=128, gpt_model_name="gpt2")
LR = 1e-5

# Train model for a single epoch
train(model, df_train, df_val, LR, EPOCHS)
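
The two quantities tracked in the loop, cross-entropy loss and accuracy, can be reproduced by hand. The sketch below is plain Python under the assumption of made-up logits for a batch of two emails. It mirrors what `nn.CrossEntropyLoss` (with its default mean reduction) and `output.argmax(dim=1)` compute, without using PyTorch.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    shifted = [z - max(logits) for z in logits]          # for numerical stability
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return log_sum_exp - shifted[label]

def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits for a batch of two emails and their true labels.
batch_logits = [[2.0, 0.5], [0.1, 0.3]]
batch_labels = [0, 0]

batch_loss = sum(cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)) / len(batch_labels)
correct = sum(argmax(l) == y for l, y in zip(batch_logits, batch_labels))
print(round(batch_loss, 3), correct)
```

The first example is confidently correct and contributes little loss. The second is wrong by a narrow margin and contributes most of it.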

In [None]:
def evaluate(model, test_data):
    """
    Evaluate the model performance 
    
    @params:
    model: SimpleGPT2SequenceClassifier, trained GPT2 classifier
    test_data: pd.DataFrame
    
    @returns:
    true_labels: List[int], true labels
    predictions_labels: List[int], predicted labels 
    
    """

    test = Dataset(test_data)

    test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:
        model = model.cuda()

        
    # Tracking variables
    predictions_labels = []
    true_labels = []
    
    total_acc_test = 0
    with torch.no_grad():

        for test_input, test_label in test_dataloader:

            test_label = test_label.to(device)
            mask = test_input['attention_mask'].squeeze(1).to(device)
            input_id = test_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)

            acc = (output.argmax(dim=1) == test_label).sum().item()
            total_acc_test += acc
            
            # add original labels
            true_labels += test_label.cpu().numpy().flatten().tolist()
            # add predictions to the list
            predictions_labels += output.argmax(dim=1).cpu().numpy().flatten().tolist()
    
    print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')
    return true_labels, predictions_labels
    

In [None]:
# Generate predictions with accuracy 
true_labels, pred_labels = evaluate(model, df_test)

In [None]:
# Plot confusion matrix of results
fig, ax = plt.subplots(figsize=(8, 8))
cm = confusion_matrix(y_true=true_labels, y_pred=pred_labels, labels=range(len(labels)), normalize='true')
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=list(labels.keys()))
disp.plot(ax=ax)
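
With `normalize='true'`, each row of the confusion matrix is divided by the number of examples in that true class, so the rows sum to 1. Here is a small plain-Python sketch of that normalization on made-up labels. `normalized_confusion_matrix` is a hypothetical helper, not the scikit-learn function.

```python
def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Row i, column j: fraction of true-class-i examples predicted as class j."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Made-up labels: 0 = not spam, 1 = spam.
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 0]
cm_rows = normalized_confusion_matrix(toy_true, toy_pred, n_classes=2)
print(cm_rows)
```

Row normalization makes the diagonal read as per-class recall, which is easier to interpret than raw counts when the classes are imbalanced.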

## ULMFiT	
The final model family is the **ULMFiT**, or **Universal Language Model Fine-tuning**, family of models. These models come from fast.ai and are trained with a transfer learning approach. Unlike the transformer-based models mentioned previously, ULMFiT is built on an LSTM architecture, and its transfer learning approach keeps these models flexible across tasks beyond the one they were initially trained on. They’re able to adapt to new tasks using a small amount of data, while some other LLMs require massive amounts of data to adapt to new tasks. ULMFiT models are well suited for tasks like text classification, sentiment analysis, and language generation.

To use the ULMFiT model, we'll need to install and import the `fastai` library.

## Machine Learning vs. Deep Learning vs. LLMs for NLP 

While it’s important to distinguish between techniques and pretrained models, that distinction also matters for mapping the right technique to the right task. When deciding which model to use, whether traditional machine learning, deep learning, or a pretrained model, there are a few guiding principles to consider. The first is access to labeled data. Many of these models require large amounts of data to fine-tune billions of parameters. Quality labeled data is hard to come by and could be a limiting factor for building a DL model from scratch. Another factor to consider is cost. Large language models, while powerful, can be expensive to fine-tune depending on the task, especially considering they might need pricey GPU hardware. A third principle is any regulatory or security requirement around sensitive data. Regulatory requirements can raise the bar for model interpretability, which can be hard to achieve for some NLP models. With these principles and others in mind, the approach to NLP should be the same as any other in ML: iterative. Start with a simple baseline model, then add complexity and measure each model's performance against the baseline. This could mean starting with a simple machine learning model as a baseline and iterating all the way up to an LLM to achieve the best performance.
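
As one possible starting point for the baseline mentioned above, here is a sketch of a multinomial Naive Bayes spam classifier in plain Python. The training emails are made up, and the function names are hypothetical. In practice a library implementation and a real train/test split would be used, but even a model this small gives a number to beat before reaching for an LLM.

```python
import math
from collections import Counter

def train_naive_bayes(examples):
    """examples: list of (text, label). Returns per-class word counts and class counts."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        word_counts[label].update(text.lower().split())
        class_counts[label] += 1
    return word_counts, class_counts

def predict_naive_bayes(text, word_counts, class_counts):
    """Pick the class with the highest log-probability, using add-one smoothing."""
    vocab = set(word_counts[0]) | set(word_counts[1])
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training emails: 1 = spam, 0 = not spam.
toy_emails = [
    ("win a free prize now", 1),
    ("free money claim your prize", 1),
    ("meeting agenda for monday", 0),
    ("please review the project report", 0),
]
counts, priors = train_naive_bayes(toy_emails)
baseline_pred = predict_naive_bayes("claim your free prize", counts, priors)
print(baseline_pred)
```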


## A Quick Note on Challenges in NLP 
Like any other field in machine learning, natural language processing comes with its own unique set of challenges. One of the more prominent challenges in NLP is data sparsity. Most tasks deal with large vocabularies, but it’s nearly impossible to have examples of every possible use of language. To combat this, data scientists rely on a corpus that properly represents the language of the task; however, such a corpus can be hard to come by and labor-intensive to verify. Relatedly, the unstructured nature of NLP data proves to be a challenge as well. Language is fluid and can be inconsistent depending on tone. As a result, models can pick up on inaccurate context depending on the corpus. A third prominent challenge in NLP, especially pertaining to LLMs, is the domain adaptation of pretrained models. If the task at hand is significantly different from the task an LLM was trained on, lots of data and compute are required to fine-tune the model to fit the problem domain. Sometimes the luxury of compute and large amounts of labeled data isn’t available. Like any other machine learning domain, NLP has its challenges, and most of them pertain to the data.
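
One simple way to put a number on data sparsity is the out-of-vocabulary rate: the share of tokens in new text that never appeared in the training corpus. The sketch below assumes whitespace tokenization and two made-up corpora. `oov_rate` is a hypothetical helper for illustration.

```python
def oov_rate(train_texts, new_texts):
    """Fraction of tokens in new_texts whose word never appears in train_texts."""
    vocabulary = {word for text in train_texts for word in text.lower().split()}
    new_tokens = [word for text in new_texts for word in text.lower().split()]
    unseen = [word for word in new_tokens if word not in vocabulary]
    return len(unseen) / len(new_tokens)

# Made-up corpora for illustration.
train_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
new_corpus = ["the ferret sat on the sofa"]
rate = oov_rate(train_corpus, new_corpus)
print(f"{rate:.2f}")
```

A high rate on text from the target domain is an early hint that the corpus does not represent that domain well.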


## Conclusion 


To review, NLP is a field in machine learning that focuses on understanding and processing natural human language. NLP involves techniques ranging from text processing and sentiment analysis to machine translation and question answering. NLP models can use shallow or deep learning-based architectures. Large pretrained language models like BERT, ELMo, GPT, and ULMFiT are used to accurately predict the meaning of words or phrases in a sentence and to understand the relationships between them. NLP models are used for a variety of tasks, such as text classification, question answering, sentiment analysis, and language generation. Like any subfield of machine learning, NLP has its challenges, and most of them trace back to the data. However, when executed well, NLP models can have a massive impact on everyday life and innovation.
