### Installation

In [36]:
#!pip install langchain langchain-community langchain-openai transformers datasets evaluate bitsandbytes accelerate python-dotenv



In [2]:
import dotenv, os
dotenv.load_dotenv()

True
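`ChatOpenAI`, used later in this chapter, reads its API key from the `OPENAI_API_KEY` environment variable, so it can help to check for the key right after loading the `.env` file. The sketch below is one way to do that; the helper name `require_env` is hypothetical and not part of any library.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value

# Usage (commented out so the sketch runs without a real key):
# api_key = require_env("OPENAI_API_KEY")
```

Failing early with a clear message is usually easier to debug than an authentication error raised in the middle of a long `map` call.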

# Finetuning large language models

In the previous chapter, we explored the inner workings of large language models and how they can be leveraged for various tasks, such as text generation and sequence classification, through effective prompting and zero-shot capabilities. We also delved into the vast array of pre-trained models available, courtesy of the vibrant community.

However, while these pre-trained models exhibit remarkable versatility, their general-purpose training may not always be optimized for specific tasks or domains. Fine-tuning emerges as a crucial technique to adapt and refine a language model's understanding to the nuances of a particular dataset or task.

Consider the field of medical research, where a language model pre-trained solely on general web text may struggle to perform effectively out-of-the-box. By fine-tuning the model on a corpus of medical literature, its ability to generate relevant medical text or assist in information extraction from healthcare documents can be significantly enhanced.

Conversational models present another compelling use case. As discussed earlier, large pre-trained models are primarily trained to predict the next token, which may not seamlessly translate to engaging, conversational interactions. By fine-tuning these models on datasets containing everyday conversations and informal language structures, we can adapt their outputs to emulate the natural flow and nuances of interfaces like ChatGPT.

The primary objective of this chapter is to establish a solid foundation in fine-tuning large language models (LLMs). Consequently, we will delve into the following key areas:

- Classifying the topic of a text using a fine-tuned encoder model
- Generating text in a particular style using a fine-tuned decoder model
- Solving multiple tasks with a single model via instruction fine-tuning
- Parameter-efficient fine-tuning techniques that enable training on smaller GPUs
- Techniques for reducing computational requirements during model inference

Through this comprehensive exploration, you will gain insights into tailoring language models to excel in specific tasks and domains, unleashing their true potential for a wide range of applications.

## Text Classification

As we discussed in earlier chapters, LLMs are generally used for generative tasks, where the objective is to predict the next token. Other NLP tasks, such as text classification and named entity recognition, are not as easily expressed with this default objective. Here we will see an example of using an LLM for text classification and then fine-tune a model to improve the metrics.

### Identify a dataset

Let's pick a publicly available dataset to demonstrate the technique. Here we'll use the AG News dataset, a well-known dataset released for non-commercial use and widely used for benchmarking text classification models and for research in data mining, information retrieval, and data streaming.

Let's explore the dataset to learn about its text and labels. The dataset provides 120,000 training examples, more than enough data to fine-tune a model. Fine-tuning requires very little data compared to pre-training, and a few thousand examples should be enough to get a good baseline model.
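Since a few thousand examples are often enough, it can be useful to draw a small, class-balanced subset before fine-tuning. The sketch below shows the idea in plain Python on toy rows that mimic the `text`/`label` structure; the function name `stratified_sample` is hypothetical. With the `datasets` library, `raw_datasets['train'].shuffle(seed=42).select(range(n))` gives a random (though not stratified) subset.

```python
import random
from collections import defaultdict

def stratified_sample(examples, per_class, seed=42):
    """Pick up to `per_class` examples for each label, reproducibly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        sample.extend(group[:per_class])
    rng.shuffle(sample)  # avoid returning the rows grouped by label
    return sample

# Toy data standing in for AG News rows (same keys: "text" and "label").
toy = [{"text": f"article {i}", "label": i % 4} for i in range(40)]
subset = stratified_sample(toy, per_class=3)
print(len(subset))  # 12
```

Fixing the seed keeps the subset stable across runs, which makes later comparisons between models fair.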

In [3]:
from datasets import load_dataset
import evaluate

accuracy_metric = evaluate.load("accuracy")
raw_datasets = load_dataset("ag_news")

Let's print the first sample from the training dataset. The output shows that each sample is a dictionary with two keys: `text` and `label`. The `text` key contains the text of the news article, while the `label` key contains an integer representing the category of the article. In this particular example, the article is labeled with the integer `2`, which corresponds to the `business` category in the dataset's label encoding scheme.

In [7]:
raw_datasets['train'][0]

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'label': 2}

Before the era of LLMs, we used RNNs or BERT-style models to capture the meaning of a sentence and then fine-tuned them for a downstream task. Let's look at some ways to achieve the same results using LLMs.

Supervised fine-tuning is only one option for text classification with LLMs. Prompt engineering, which needs no labeled training data, has emerged as a viable alternative. How well do LLMs classify text when guided only by a natural language prompt? Can this approach compete with supervised learning? We explore these questions in the next section.

## Prompt Engineering

Let's look at the zero-shot capability of large language models (LLMs), in which they perform a task without any training data for that specific task.

We first extract the label names from the dataset using `raw_datasets['train'].features['label'].names`. This gives us the list of label names corresponding to the integer labels in the dataset.

Next, we create a dictionary `id_to_label` that maps the lowercase label names to their corresponding integer labels. (Despite its name, the dictionary is keyed by label name, not by id.)

Finally, we add an expanded `science/technology` entry alongside `sci/tech`. This is useful because the LLM has likely seen full words more often than abbreviations, and we should try to stay close to its natural vocabulary.

In [8]:
labels = raw_datasets['train'].features['label'].names
id_to_label = {l.lower():i for i,l in enumerate(labels)}
id_to_label['science/technology'] = 3  # expanded form of the 'sci/tech' label

In [9]:
id_to_label

{'world': 0,
 'sports': 1,
 'business': 2,
 'sci/tech': 3,
 'science/technology': 3}
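Because model output may differ in case or whitespace from the dictionary keys, a small parsing helper can make the lookup more forgiving. The sketch below rebuilds the mapping shown above so that it is self-contained; `label_to_id`, `id_to_name`, and `parse_label` are hypothetical names, and using `-1` for unknown labels is an assumption of this sketch.

```python
# Same mapping as above, written out so the sketch is self-contained.
label_to_id = {"world": 0, "sports": 1, "business": 2,
               "sci/tech": 3, "science/technology": 3}

# Reverse direction: keep the first name seen for each id.
id_to_name = {}
for name, idx in label_to_id.items():
    id_to_name.setdefault(idx, name)

def parse_label(raw, unknown=-1):
    """Map a model's text output to an integer id, tolerating case and whitespace."""
    return label_to_id.get(raw.strip().lower(), unknown)

print(parse_label(" Science/Technology "))  # 3
print(parse_label("politics"))              # -1
print(id_to_name[3])                        # sci/tech
```

Returning a sentinel instead of raising `KeyError` lets an evaluation loop continue and simply count unparseable outputs as wrong.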

Here, we set up the prompt engineering pipeline for text classification using an LLM. We start by importing the necessary libraries from LangChain, a framework for building applications with LLMs.

We define a `ChatPromptTemplate` named `tagging_prompt` that is used to construct the prompt for the LLM. The prompt instructs the LLM to extract the news label from the given article according to the `Classification` schema.



In [16]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI
from enum import Enum
from langchain.output_parsers.enum import EnumOutputParser


tagging_prompt = ChatPromptTemplate.from_template(
"""
Extract the News Label from the following article. Follow the instructions below.

Only extract the properties mentioned in the 'Classification' function.

Passage:
{input}
"""
)
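To see what the model actually receives, it helps to render the template with a concrete article. The stdlib sketch below mimics the placeholder substitution with `str.format`; it is not LangChain's implementation, and `TEMPLATE` and `render` are hypothetical names. With LangChain itself, `tagging_prompt.format_messages(input=...)` returns the rendered messages.

```python
TEMPLATE = """
Extract the News Label from the following article. Follow the instructions below.

Only extract the properties mentioned in the 'Classification' function.

Passage:
{input}
"""

def render(article: str) -> str:
    """Fill the {input} placeholder, as the template does at call time."""
    return TEMPLATE.format(input=article)

print(render("Oil prices rose sharply on Monday."))
```

Printing a rendered prompt before sending thousands of requests is a cheap way to catch a wrong placeholder name or stray whitespace.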

Next, we define a `Classification` class that inherits from `BaseModel`. This class describes the structured output that the LLM should generate. It has a single field, `label`, of type `str`, restricted to one of the four label categories:

- world
- sports
- business
- science/technology

In [12]:

class Classification(BaseModel):
    label:str = Field(enum=['world','sports','business','science/technology']) # Note: Using the expanded label of sci/tech
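As far as we can tell, the `enum=` argument on `Field` is passed into the JSON schema that the LLM sees, and may not be enforced when the object is constructed. If you want a hard guarantee on the Python side, one option is to validate the returned string against a standard-library `Enum`. This is a sketch under that assumption; `NewsLabel` and `validate_label` are hypothetical names.

```python
from enum import Enum

class NewsLabel(str, Enum):
    WORLD = "world"
    SPORTS = "sports"
    BUSINESS = "business"
    SCI_TECH = "science/technology"

def validate_label(raw: str) -> NewsLabel:
    """Return the matching NewsLabel or raise ValueError for anything else."""
    return NewsLabel(raw.strip().lower())

print(validate_label("Business").value)  # business
try:
    validate_label("politics")
except ValueError as err:
    print("rejected:", err)
```

A check like this turns an unexpected category into an explicit error instead of a silent mismatch during evaluation.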


We then initialize the LLM using `ChatOpenAI` from the `langchain_openai` library. In this example, we use OpenAI's `gpt-4-turbo` model. We set the `temperature` parameter to 0 to make the outputs as deterministic as possible. Additionally, we use the `with_structured_output` method to specify that the LLM should generate outputs in the format defined by the `Classification` class.

In [14]:
# LLM
llm = ChatOpenAI(temperature=0, model="gpt-4-turbo").with_structured_output(
    Classification
)


tagging_chain = tagging_prompt | llm
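The `|` operator composes the prompt and the model into a chain: the output of the left step becomes the input of the right step. The toy class below imitates that behavior with `__or__` to make the data flow explicit. It is only an illustration, not how LangChain implements it, and `Step`, `fake_prompt`, and `fake_llm` are hypothetical stand-ins.

```python
class Step:
    """A callable wrapper that supports `a | b` composition."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # Run self first, then pass its output to `other`.
        return Step(lambda value: other.invoke(self.invoke(value)))

# Stand-ins for the prompt and the model (both hypothetical).
fake_prompt = Step(lambda d: f"Classify this article: {d['input']}")
fake_llm = Step(lambda text: "sports" if "match" in text else "world")

chain = fake_prompt | fake_llm
print(chain.invoke({"input": "The match ended in a draw."}))  # sports
```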

## Invoking the LLM for text classification

Here, we invoke the LLM for text classification using the `tagging_chain` we created earlier. First, we define the input text `inp` that we want to classify.

In [15]:
inp = "Iphone is a great technology. Everybody should use it."
tagging_chain.invoke({"input": inp})

Classification(label='science/technology')

In the example above, the LLM correctly classifies the input text as `science/technology`, showing that prompt engineering can work for text classification.

We can then repeat this process for all the test samples. Below, we call `tagging_chain.invoke` on every example and save the output as a new feature of the dataset, stored as `result`.

In [None]:
def process_text(example):
    # Classify one example; store an empty string if the call fails.
    try:
        out = tagging_chain.invoke({'input': example['text']})
        example['result'] = out.label
    except Exception:
        example['result'] = ''
    return example

dataset_processed = raw_datasets['test'].map(process_text, num_proc=8)



Map (num_proc=8):   0%|          | 0/7600 [00:00<?, ? examples/s]
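API calls can fail transiently (timeouts, rate limits), and with 7,600 test examples a few failures are likely. One possible refinement is to retry each call a few times before giving up. The sketch below uses a stub classifier so it runs offline; `classify_with_retry` and `flaky_classifier` are hypothetical names. In the notebook, `classify` could be something like `lambda t: tagging_chain.invoke({'input': t}).label`.

```python
import time

def classify_with_retry(classify, text, retries=3, delay=0.0):
    """Call `classify(text)`; retry on failure, return '' if every attempt fails."""
    for attempt in range(retries):
        try:
            return classify(text)
        except Exception:
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    return ""

# Stub that fails twice, then succeeds, to exercise the retry path.
calls = {"n": 0}

def flaky_classifier(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated API timeout")
    return "sports"

print(classify_with_retry(flaky_classifier, "Lakers win in overtime"))  # sports
```

The empty-string fallback matches the convention used in `process_text`, so downstream evaluation code does not need to change.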

In [None]:
dataset_processed.save_to_disk('data/processed.hf')

### Evaluation

In [13]:
dataset_processed[1]

{'text': 'The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.',
 'label': 3,
 'result': 'science/technology'}

In [17]:
references = list(dataset_processed['label'])
# Map predicted label names to ids; failed calls ('' results) get -1 so they count as errors.
predictions = [id_to_label[p] if p != '' else -1 for p in dataset_processed['result']]
accuracy = accuracy_metric.compute(references=references, predictions=predictions)
print('accuracy {}'.format(accuracy['accuracy']))
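Overall accuracy can hide weak classes, so a per-class breakdown is often worth a look. The following is a small pure-Python sketch; the function name `accuracy_report` is hypothetical and the reference and prediction lists are made-up toy values, not results from the run above.

```python
from collections import Counter

def accuracy_report(references, predictions):
    """Return overall accuracy and a per-class accuracy dict."""
    correct = Counter()
    total = Counter()
    for ref, pred in zip(references, predictions):
        total[ref] += 1
        if ref == pred:
            correct[ref] += 1
    overall = sum(correct.values()) / len(references)
    per_class = {label: correct[label] / total[label] for label in sorted(total)}
    return overall, per_class

# Toy values for illustration only; -1 marks a failed call.
refs = [0, 1, 2, 3, 3, 2]
preds = [0, 1, 3, 3, -1, 2]
overall, per_class = accuracy_report(refs, preds)
print(overall)
print(per_class)
```

For AG News, a breakdown like this would show, for example, whether business articles are being confused with science/technology ones.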

## Finetuning the model

There are two main approaches to adapting large language models (LLMs) to a specific domain or task such as text classification.

1. **Building an entire domain-specific model from scratch**

- This approach involves training a foundation model entirely on industry-specific knowledge and data, using self-supervised learning techniques like next-token prediction and masking.

- It requires a massive amount of domain-specific data and significant computational resources.

- An example of this approach is **BloombergGPT**, which was trained on decades of financial data and reportedly required about $2.7 million and 53 days of training.

- The advantage of this approach is that the resulting model is highly specialized and tailored to the specific domain, potentially leading to better performance on domain-specific tasks.

2. **Fine-tuning a pre-trained LLM**

- This approach involves taking a pre-trained LLM, such as GPT or BERT, and fine-tuning it on a smaller, domain-specific dataset.

- It requires less data, computation, and time than training from scratch, making it a more efficient and cost-effective option.

- Various techniques can be employed to enhance or complement the fine-tuning process, such as transfer learning, retrieval-augmented generation (RAG), and multi-task learning.

- RAG combines the strengths of pre-trained models and information retrieval systems, enabling the model to retrieve and incorporate domain-specific knowledge during inference.

- Multi-task learning involves training a single model on multiple related tasks simultaneously, allowing the model to learn shared representations and benefit from task synergies.
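To make the retrieval step of RAG concrete, here is a deliberately simple sketch: it scores documents against a question with bag-of-words cosine similarity and prepends the best match to the prompt. Real systems typically use dense embeddings and a vector store instead. The documents, the question, and all function names here are made up for illustration.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as a word -> count mapping."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = bow(query)
    return sorted(documents, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

# Made-up knowledge base for illustration.
docs = [
    "The central bank raised interest rates to curb inflation.",
    "The striker scored twice in the cup final.",
    "A new telescope captured images of a distant galaxy.",
]
question = "Why did interest rates go up?"
context = retrieve(question, docs, k=1)[0]
prompt = f"Context: {context}\n\nQuestion: {question}"
print(prompt)
```

The model's weights stay untouched in this setup; only the prompt changes, which is why RAG is often used alongside fine-tuning rather than instead of it.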

Let's see how we can fine-tune the Llama 3 model for a text classification task.

### Rationale for different approaches



## Reference

- Text classification using large language models: https://aclanthology.org/2023.findings-emnlp.603.pdf