# Lesson 1

## Tokenization & Embeddings

### What You’ll Learn:

* How LLMs process text using tokenization
* How to convert text into embeddings
* Why embeddings are crucial for NLP tasks

### Real-World Application
* Chatbots & Virtual Assistants: Extracting sentence embeddings to match user queries with relevant responses.
* Search & Information Retrieval: Converting documents into embeddings for similarity search (used in Google Search, Semantic Search, etc.).
* Sentiment Analysis & Text Classification: Embeddings serve as features for classification models in finance, healthcare, and marketing.
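To make the similarity-search idea concrete, here is a minimal sketch of ranking documents by cosine similarity to a query embedding. The three-dimensional vectors and document names are hypothetical stand-ins (a real BERT embedding has 768 dimensions), so treat this as an illustration of the ranking step rather than a production retriever.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings (assumption: made-up values for illustration).
doc_embeddings = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
    "password reset": [0.0, 0.2, 0.9],
}
# Stands in for the embedding of a user query such as "how do I get my money back?"
query_embedding = [0.8, 0.2, 0.1]

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_embeddings,
    key=lambda name: cosine_similarity(query_embedding, doc_embeddings[name]),
    reverse=True,
)
best_match = ranked[0]
print("Best match:", best_match)
```

In practice the vectors would come from a model such as the one loaded in the next cell, and the ranking step stays the same.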

In [2]:
# Import necessary libraries
from transformers import AutoTokenizer, AutoModel
import torch

# Load a pre-trained tokenizer and model (BERT in this case)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Sample text for tokenization
text = "Large Language Models are revolutionizing NLP."

In [3]:
# Step 1: Tokenize the text
tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
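BERT's tokenizer uses WordPiece, which splits words it does not know into smaller known pieces and marks continuation pieces with a `##` prefix. The sketch below imitates that greedy longest-match splitting on a tiny hypothetical vocabulary; the function name and the vocabulary are assumptions for illustration, and the real tokenizer also handles casing, punctuation, and special tokens such as `[CLS]` and `[SEP]`.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (the real bert-base-uncased vocabulary has about 30,000 entries).
toy_vocab = {"token", "##ization", "##ize", "embed", "##ding", "##s"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("embeddings", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Each piece is then mapped to an integer ID, which is what ends up in `tokens["input_ids"]`.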

Training:
* The goal is to adjust the model's weights to minimize the error between its predictions and the actual targets.
* This involves calculating gradients (the rate of change of the error with respect to the weights) and using them to update the weights.
* During training, PyTorch keeps track of all operations performed on tensors that have requires_grad=True to enable gradient calculation.

Inference:
* Since we're using a pre-trained model to make predictions on new data, **the model's weights are fixed, so there's no need to calculate gradients.**
* Calculating and storing gradients during inference is a waste of computational resources and memory.
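The difference between the two modes can be sketched without any framework. In this toy example (a one-weight model on made-up data, chosen only for illustration), training repeatedly computes the gradient of the error and uses it to update the weight, while inference just applies the fixed weight. PyTorch automates the gradient computation, and `torch.no_grad()` tells it to skip that bookkeeping.

```python
# Toy data generated from y = 2x, so the ideal weight is 2.0 (assumption: made-up data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mse(w):
    """Mean squared error of the model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w):
    """Derivative of the error with respect to w: mean(2 * x * (w*x - y))."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

# Training: gradients are needed because they drive the weight updates.
w = 0.0
learning_rate = 0.01
for step in range(200):
    w -= learning_rate * mse_gradient(w)

# Inference: the weight is fixed, so no gradient is computed at all.
prediction = w * 5.0
print(f"learned w = {w:.4f}, prediction for x=5: {prediction:.4f}")
```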

In [4]:
# Step 2: Generate embeddings using the model
with torch.no_grad():  # No gradient calculation needed for inference
    output = model(**tokens)

`output.last_hidden_state`: This represents the final hidden state of the model, which is a tensor containing the contextualized representations of each token in the input sequence.

In [5]:
# Step 3: Extract the embeddings (CLS token represents the sentence embedding)
embedding = output.last_hidden_state[:, 0, :]

In [6]:
# Print shape of embedding vector
print("Embedding Shape:", embedding.shape)

Embedding Shape: torch.Size([1, 768])


In [8]:
# This is what the text became. A tensor with numbers for the LLM to understand.
embedding

tensor([[-3.4580e-01, -1.7613e-01,  9.1574e-02, -3.1258e-01, -4.3672e-01,
         -4.6326e-01,  3.5836e-01,  6.3136e-01, -1.6741e-01, -6.0901e-01,
          1.5760e-02,  2.0739e-01, -3.9844e-01,  1.2286e-01,  2.5884e-01,
          2.2635e-01, -2.9151e-01,  5.7592e-01,  2.7204e-01,  7.5218e-02,
         -3.6411e-01, -6.5309e-01,  1.4161e-01, -5.1403e-02,  6.1204e-02,
         -1.0301e-01,  8.3905e-02, -3.2397e-01,  4.2635e-02, -2.5525e-01,
         -5.8510e-01,  4.0276e-01, -1.2440e-01, -4.1155e-01,  4.0300e-01,
         -9.8790e-02, -2.1416e-01, -3.4497e-01,  6.1450e-01,  4.2820e-01,
         -6.4149e-01,  2.3772e-02,  5.7509e-02, -2.9054e-01, -2.8852e-01,
         -2.0358e-01, -3.3098e+00, -3.1943e-01, -3.3256e-01, -3.7594e-01,
         -4.7578e-01,  6.8495e-02, -1.1553e-01,  3.6843e-01,  2.6826e-01,
          2.3217e-01, -1.3763e-01,  2.9508e-01,  3.0863e-02, -1.1439e-02,
          7.3251e-02,  4.7569e-01, -3.8645e-01, -2.8943e-01, -4.0224e-01,
          2.9392e-01, -4.6213e-01,  1.

# Lesson 2

## Transformer Architecture Basics

### What You’ll Learn:

* How the Attention Mechanism works in Transformers
* How to build a Mini-Transformer Model from scratch using PyTorch.

### Real-World Application
* Machine Translation (Google Translate): Transformers power translation models like MarianMT and M2M-100.
* Speech Recognition (Whisper): Transformer-based models transcribe speech, the kind of task behind voice assistants like Siri and Alexa.
* Text Summarization (GPT, T5, BART): Automatic document summarization for legal, medical, and news texts.

In [10]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Define a simple self-attention mechanism
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads  # Splitting embedding into multiple heads

        # Query, Key, and Value matrices
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(embed_size, embed_size)

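    def forward(self, value, key, query, mask=None):
        N = query.shape[0]  # Batch size
        value_len, key_len, query_len = value.shape[1], key.shape[1], query.shape[1]

        # Apply linear transformations, then split the embedding into heads:
        # (N, seq_len, embed_size) -> (N, heads, seq_len, head_dim)
        values = self.values(value).view(N, value_len, self.heads, self.head_dim).transpose(1, 2)
        keys = self.keys(key).view(N, key_len, self.heads, self.head_dim).transpose(1, 2)
        queries = self.queries(query).view(N, query_len, self.heads, self.head_dim).transpose(1, 2)

        # Compute scaled dot-product attention per head (scale by sqrt(head_dim))
        energy = torch.matmul(queries, keys.transpose(-2, -1)) / (self.head_dim ** 0.5)

        if mask is not None:
            # The mask must broadcast to (N, heads, query_len, key_len), e.g. shape (N, 1, 1, key_len)
            energy = energy.masked_fill(mask == 0, float("-1e20"))  # Apply mask for padding

        attention = torch.softmax(energy, dim=-1)  # Normalize with softmax

        # Multiply attention weights by values, then merge the heads back together
        out = torch.matmul(attention, values)  # (N, heads, query_len, head_dim)
        out = out.transpose(1, 2).reshape(N, query_len, self.embed_size)
        return self.fc_out(out)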

# Define a Transformer Encoder block
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        # Feedforward network
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)
        x = self.dropout(self.norm1(attention + query))  # Residual connection + Layer Norm + Dropout
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))  # Residual connection + Layer Norm + Dropout
        return out


* Attention: The core idea is to let the model focus on relevant parts of the input.
* Multi-head Attention: Instead of just one way to focus, the model uses multiple "heads," each focusing on different aspects of the relationships between words.
* Attention Head: Each head is a separate, parallel attention mechanism. It learns its own set of weights to calculate attention scores, allowing it to capture different patterns and relationships within the data.
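To see what a single attention head computes, here is a minimal framework-free sketch of scaled dot-product attention. The tiny two-dimensional inputs are made up for illustration, and the learned query/key/value projections are left out so the arithmetic stays visible: each token scores every other token, softmax turns the scores into weights, and the output is a weighted average of the values.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on plain Python lists."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three tokens with 2-dimensional embeddings (assumption: toy values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: the same sequence supplies queries, keys, and values.
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Multi-head attention runs several of these in parallel on different slices of the embedding and concatenates the results.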

In [12]:
# Example usage
embed_size = 512  # Size of word embeddings
heads = 8  # Number of attention heads
dropout = 0.1
forward_expansion = 4

In [13]:
# Create a Transformer block
transformer_block = TransformerBlock(embed_size, heads, dropout, forward_expansion)

In [14]:
# Dummy input (batch_size=2, seq_len=5, embed_size=512)
x = torch.rand((2, 5, embed_size))

#### Transformer Block:
This is the fundamental building block of the Transformer architecture. It typically consists of:
* Multi-head attention.
* Feed-forward neural networks.
* Layer normalization and residual connections.

The forward-pass cell below uses the following pieces:
* Forward pass: Feeding input data (x) through the Transformer block to compute its output.
* x, x, x: The block takes three inputs: value, key, and query. Here the same input x is passed for all three, which is what makes this self-attention: the sequence attends to different positions of itself.
* mask=None: The mask parameter prevents the model from attending to certain positions of the input sequence (for example, padding). Setting it to None means no masking is applied, so the model can attend to all positions.
* output = transformer_block(x, x, x, mask=None): This line executes the forward pass. The transformer_block processes the input and produces an output tensor.
* print("Transformer Output Shape:", output.shape): This line prints the shape of the output tensor. The block preserves the input dimensions (batch size, sequence length, embedding size), which is what lets blocks be stacked.
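As a small illustration of what the mask does, the sketch below applies the same trick as `masked_fill` in the attention code: masked positions get a very large negative score, so softmax assigns them zero weight. The scores and the mask are made-up values, and `masked_softmax` is a hypothetical helper name.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, ignoring positions where mask is 0."""
    # Replace masked scores with a huge negative number, as masked_fill does.
    filled = [s if keep else -1e20 for s, keep in zip(scores, mask)]
    m = max(filled)
    exps = [math.exp(s - m) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]

# Attention scores of one query against four keys (assumption: toy values).
scores = [2.0, 1.0, 0.5, 0.5]
# The last two positions are padding and should receive no attention.
mask = [1, 1, 0, 0]

weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

The remaining weights still sum to 1, so the real tokens share all of the attention.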

In [15]:
# Forward pass through Transformer
# In essence, you took some input data (x), passed it through one layer of a transformer, and then printed the shape of the result. You are seeing what the output of that layer is.
output = transformer_block(x, x, x, mask=None)
print("Transformer Output Shape:", output.shape)

Transformer Output Shape: torch.Size([2, 5, 512])


# Lesson 3

## Fine-Tuning LLMs (BERT for Text Classification)
### 💡 What You’ll Learn:

* How to fine-tune a pre-trained LLM (BERT) on a text classification task
* How to train the model on a real dataset using Hugging Face’s transformers library
* Why fine-tuning is crucial for adapting LLMs to specific domains


### 📌 Real-World Applications
* Customer Feedback Analysis: Companies analyze sentiment from reviews to improve products/services (e.g., Amazon, Google Reviews).
* Social Media Monitoring: Detecting positive/negative sentiment in tweets, Facebook posts, and Reddit comments.
* Healthcare & Finance: Analyzing patient reports, legal contracts, and financial news for risk assessment.

In [20]:
#! pip install datasets --quiet

* If your text is in English, the tokenizer `"bert-base-uncased"` is a common starting point.
* For other languages, you'll need a multilingual BERT such as `"bert-base-multilingual-cased"` or a language-specific BERT.
* If your text is from a specialized domain (e.g., medical, legal, scientific), consider using a BERT model that has been pre-trained on that domain.
* `"bert-base-uncased"` is a smaller model, while `"bert-large-uncased"` is larger. A larger model can be more accurate but needs more memory and compute.

In [48]:
# Import necessary libraries
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW
from datasets import load_dataset

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Load a sample sentiment classification dataset (IMDB reviews)
dataset = load_dataset("imdb")

In [49]:
# Tokenization function
def tokenize_function(examples):
    # Return plain lists; datasets stores them in Arrow format, so converting to tensors here has no lasting effect.
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Tokenize dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Return PyTorch tensors for the columns the model needs, so the DataLoader yields stacked tensors
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# Use a small subset for quick training
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(200))

# Load pre-trained BERT model for classification (2 classes: Positive/Negative)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)

# DataLoader: It handles batching, shuffling, and parallel loading of data, making it easier to train neural networks.
# Takes your training data (train_dataset).
# Groups the data into batches of 8 samples.
# Randomly shuffles the data before each epoch.
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

# Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training loop
model.train()
for epoch in range(3):  # Train for 3 epochs
    for batch in train_loader:
        optimizer.zero_grad()
        inputs = {key: batch[key].to(device) for key in ["input_ids", "attention_mask"]}
        labels = batch["label"].to(device)
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

print("Training Complete! ✅")


In [55]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the fine-tuned model and tokenizer
# model_path = "./bert-sentiment-model"
# tokenizer = BertTokenizer.from_pretrained(model_path)
# model = BertForSequenceClassification.from_pretrained(model_path)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # Set model to evaluation mode

# Function to predict sentiment
def predict_sentiment(text):
    tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)

    with torch.no_grad():
        outputs = model(**tokens)

    logits = outputs.logits
    prediction = torch.argmax(logits, dim=1).item()  # Get class with highest probability

    sentiment = "Positive" if prediction == 1 else "Negative"
    return sentiment

# Test the model on custom inputs
test_sentences = [
    "Great movie!",
    "This is the worst product I have ever bought. Totally disappointed.",
    "The food was okay, nothing special but not bad either."
]

for sentence in test_sentences:
    sentiment = predict_sentiment(sentence)
    print(f"Review: {sentence}\nPredicted Sentiment: {sentiment}\n")


Review: Great movie!
Predicted Sentiment: Positive

Review: This is the worst product I have ever bought. Totally disappointed.
Predicted Sentiment: Positive

Review: The food was okay, nothing special but not bad either.
Predicted Sentiment: Negative
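`predict_sentiment` keeps only the class with the highest logit. When predictions look doubtful, it can help to convert the logits to probabilities with softmax and check how confident the model is. The sketch below shows that step on hypothetical logits (the values are made up, not taken from the model above).

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Negative", "Positive"]  # index 0 = negative, index 1 = positive, as in IMDB

# Hypothetical logits for one review (assumption: illustrative values only).
example_logits = [0.2, 0.5]

probs = softmax(example_logits)
confidence = max(probs)
predicted = labels[probs.index(confidence)]
print(f"{predicted} (confidence {confidence:.2f})")
```

A confidence close to 0.5 means the model is barely leaning one way, which is common after training on a small subset for a few epochs.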



# Lesson 4
## Retrieval-Augmented Generation (RAG)

### What You’ll Learn:

* What RAG is and why it’s useful for LLMs
* How to retrieve relevant documents and generate responses using FAISS and LangChain
* How to apply RAG to augment LLMs with real-world data

### Why Use RAG?
* Large Language Models (LLMs) are limited by their training data, which may be outdated or incomplete.
* RAG improves responses by retrieving relevant external knowledge before generating text.
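The retrieve-then-generate flow can be sketched in a few lines of plain Python. This toy version uses word-count vectors instead of a neural embedding model and a brute-force L2 search instead of FAISS; the documents, helper names, and prompt wording are all assumptions for illustration. It stops at building the prompt, which is the text an LLM would then receive.

```python
import math

# A tiny hypothetical knowledge base.
documents = [
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "FAISS is a library for efficient similarity search over dense vectors.",
    "BERT is a Transformer encoder pre-trained on large text corpora.",
]

def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    return [w.strip(".,?!'").lower() for w in text.split()]

# Vocabulary built from the documents; each word is one vector dimension.
vocab = sorted({w for doc in documents for w in tokenize(doc)})

def embed(text):
    """Word-count vector (a stand-in for a real embedding model)."""
    words = tokenize(text)
    return [float(words.count(term)) for term in vocab]

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query, k=1):
    """Return the k documents closest to the query (brute-force search)."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: l2_distance(q, embed(d)))
    return ranked[:k]

query = "When was the Eiffel Tower completed?"
context = retrieve(query, k=1)

# The retrieved text is placed in the prompt before the question.
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {query}"
print(prompt)
```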

### Real-World Applications
* Legal & Financial Research: Quickly find relevant documents in large knowledge bases (e.g., LexisNexis, Bloomberg).
* Medical Chatbots: Retrieve medical research papers before generating answers (e.g., PubMed).
* Enterprise AI Assistants: Use RAG for real-time knowledge augmentation in corporate settings (e.g., internal wikis).

In [None]:
# Install required libraries (if not installed)
# pip install faiss-cpu transformers langchain openai datasets sentence-transformers

# Note: newer LangChain releases move these imports to langchain_community and langchain_openai
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from datasets import load_dataset

# Load pre-trained embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Load sample documents (e.g., Wikipedia articles)
dataset = load_dataset("wikipedia", "20220301.simple", split="train").shuffle(seed=42).select(range(500))  # Use a subset for speed

# Collect the document texts
documents = [doc["text"] for doc in dataset]

# Embed the documents and build a FAISS index for fast retrieval.
# from_texts also creates the docstore that maps index positions back to the texts.
vector_store = FAISS.from_texts(documents, embedding_model)

# Create a LangChain retriever
retriever = vector_store.as_retriever()

# Use GPT-4 as the LLM for generation (GPT-4 is a chat model, so use the chat wrapper; requires OPENAI_API_KEY)
llm = ChatOpenAI(model_name="gpt-4", temperature=0.7)

# Build RAG pipeline
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# Query the RAG system
query = "What is the history of the Eiffel Tower?"
response = qa_chain.run(query)
print("\n🔍 Retrieved Answer:\n", response)


# Lesson 5

## Fine-Tuning LLMs with Custom Datasets
### 💡 What You’ll Learn:

* How to fine-tune GPT-style models on domain-specific data
* How to prepare custom datasets for fine-tuning
* How to train an LLM using Hugging Face’s transformers library

### Why Fine-Tune an LLM?
* Pre-trained models (like GPT-3.5, GPT-4, LLaMA) are general-purpose, so they may lack the depth, terminology, or style a specific domain requires.
* Fine-tuning adapts LLMs for specialized fields (e.g., legal, finance, healthcare).

### 📌 Real-World Applications
* Chatbots & Virtual Assistants: Fine-tuned models provide domain-specific knowledge.
* Legal & Finance AI: Fine-tune models on legal contracts, financial reports, or medical texts.
* Enterprise AI Solutions: Adapt LLMs for company-specific knowledge.

### Python Code: Fine-Tuning a GPT-Style Model (LLaMA-2)
We’ll fine-tune Meta’s LLaMA-2 (7B model) on a custom dataset using the transformers library.

In [56]:
# !pip install transformers datasets peft accelerate bitsandbytes

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load a custom dataset (e.g., Stack Overflow Q&A for fine-tuning)
dataset = load_dataset("stanfordnlp/stackoverflow")

# Load tokenizer for LLaMA-2
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# LLaMA-2's tokenizer has no padding token by default, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

# Preprocessing function
def preprocess_data(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# Apply tokenization
tokenized_dataset = dataset.map(preprocess_data, batched=True)
train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(10000))  # Subset for quick training


Load Model and Fine Tune

📌 How This Works
* Dataset Preparation: We load a custom dataset (StackOverflow Q&A).
* Tokenization: Converts text into LLM-readable tokens.
* Fine-Tuning: Trains LLaMA-2 using low-bit precision (8-bit) for efficiency.
* Model Saving: The trained model is saved for deployment.
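For the dataset preparation step, a custom dataset usually has to be flattened into a single `text` field, since that is what the preprocessing function tokenizes. Here is a minimal sketch, assuming question/answer records; the example records and the prompt template are hypothetical, and any consistent template would do.

```python
import json

# Hypothetical raw Q&A records (assumption: your data has these two fields).
raw_examples = [
    {"question": "How do I reverse a list in Python?", "answer": "Use my_list[::-1] or my_list.reverse()."},
    {"question": "What does git stash do?", "answer": "It shelves uncommitted changes so you can restore them later."},
]

def to_training_text(example):
    """Join question and answer into one string using a fixed template."""
    return f"### Question:\n{example['question']}\n\n### Answer:\n{example['answer']}"

# One record per example, each with the single "text" field the tokenizer step expects.
records = [{"text": to_training_text(ex)} for ex in raw_examples]

# JSON Lines: one JSON object per line, a common format for training data.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

Written to a file such as `train.jsonl`, this can be loaded with `load_dataset("json", data_files="train.jsonl")`.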

In [None]:
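import torch
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Load LLaMA-2 model in 8-bit precision
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# A purely quantized model cannot be trained directly, so attach small trainable LoRA adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05))

# Training arguments
training_args = TrainingArguments(
    output_dir="./llama-finetuned",
    per_device_train_batch_size=4,  # 4 training examples at a time on each device.
    num_train_epochs=3,  # 3 epochs means the model will see all the training data three times.
    save_steps=500,  # Every 500 training steps, the model will save a checkpoint.
    save_total_limit=2,  # Keep at most 2 checkpoints; the oldest is deleted.
    fp16=True,  # Enable mixed-precision training using 16-bit floating-point numbers (FP16).
    logging_dir="./logs",  # Directory where training logs will be saved.
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # Builds the labels from input_ids so the causal LM loss can be computed
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Start fine-tuning, then save the result for deployment
trainer.train()
trainer.save_model("./llama-finetuned")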
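As a rough check on what these training arguments imply, the arithmetic below estimates the number of optimizer steps and checkpoints for the 10,000-example subset. It assumes a single device and no gradient accumulation (both are assumptions, not settings from the cell above), so the numbers change on multi-GPU machines.

```python
import math

num_examples = 10_000            # size of the training subset
per_device_batch_size = 4
num_devices = 1                  # assumption: a single GPU
gradient_accumulation_steps = 1  # assumption: no gradient accumulation
num_epochs = 3
save_steps = 500
save_total_limit = 2

# Examples consumed per optimizer step.
effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps

steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

# A checkpoint is written every save_steps steps, but only the newest few are kept.
checkpoints_written = total_steps // save_steps
checkpoints_kept = min(checkpoints_written, save_total_limit)

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"checkpoints written: {checkpoints_written}, kept on disk: {checkpoints_kept}")
```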


# Lesson 6

## Optimizing LLM Inference with Quantization & LoRA
### 💡 What You’ll Learn:

* How to reduce memory usage and speed up inference for LLMs
* How to apply quantization to shrink model size
* How to fine-tune using LoRA (Low-Rank Adaptation) to save computation

### 📌 Why Optimize Large Models?
* LLMs (like LLaMA-2, GPT-4, Falcon) are huge and require expensive GPUs to run.
* Using quantization and LoRA, we can:
  * ✅ Reduce GPU memory usage (2x–4x smaller models)
  * ✅ Speed up inference (faster responses in production)
  * ✅ Enable fine-tuning on consumer GPUs

In [None]:
# We’ll apply quantization using bitsandbytes to reduce the model’s size while keeping performance high.
pip install transformers accelerate bitsandbytes peft

In [None]:
# 📌 Step 2: Load & Quantize LLaMA-2

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Set up quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load in 4-bit mode
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"  # Use NF4 quantization
)

# Load the LLaMA-2 model in 4-bit mode
model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, device_map="auto")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("✅ Model successfully loaded in 4-bit mode!")


📌 How Quantization Helps
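🔹 Full model (7B parameters) → about 28 GB for the weights in FP32, about 14 GB in FP16
🔹 4-bit quantized model → about 3.5 GB for the weights, plus overhead for activations and the KV cache ✅
🔹 Accuracy usually stays close to the original with a small loss; the speed gain depends on hardware and kernels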
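The memory needed just to hold the weights follows from simple arithmetic: number of parameters times bits per parameter. The sketch below works this out for a 7B-parameter model; `weight_memory_gb` is a hypothetical helper, and the estimate ignores activations, the KV cache, and quantization metadata, so real usage is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # a 7B-parameter model

estimates = {
    name: weight_memory_gb(num_params, bits)
    for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]
}

for name, gb in estimates.items():
    print(f"{name:>5}: {gb:.1f} GB")
```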

📌 Applying LoRA for Efficient Fine-Tuning
Instead of training all model parameters, LoRA (Low-Rank Adaptation) freezes the pre-trained weights and trains only small low-rank adapter matrices, which sharply reduces the number of trainable parameters and the optimizer memory.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

# Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,  # Rank (smaller = faster)
    lora_alpha=16,
    lora_dropout=0.05
)

# Apply LoRA adapters to the model
model = get_peft_model(model, lora_config)

print("✅ LoRA adapters applied successfully!")


📌 How LoRA Helps
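✅ Trains only a small fraction of the parameters (often well under 1%) instead of full fine-tuning
✅ Memory-efficient → fine-tune large models on a single GPU
✅ Small, swappable adapters → keep one base model and load a different adapter per task, or merge an adapter into the base weights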
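The savings come from replacing a full weight update with two thin matrices. For a linear layer with a `d_out x d_in` weight, LoRA trains a `d_out x r` matrix and an `r x d_in` matrix instead. The sketch below counts parameters for one such layer, assuming a 4096 x 4096 projection (the hidden size of LLaMA-2 7B) and rank 8; `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (parameters in the full weight, parameters in the LoRA adapter)."""
    full = d_in * d_out
    # Adapter = B (d_out x rank) plus A (rank x d_in).
    lora = rank * (d_in + d_out)
    return full, lora

# Assumption: one 4096 x 4096 attention projection with rank r=8.
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
fraction = lora / full

print(f"full weight:  {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({fraction:.2%} of the layer)")
```

Because adapters are usually attached to only some of the layers, the trainable share of the whole model is smaller still.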

📌 Real-World Applications

* Deploying LLMs on Edge Devices: Run 4-bit models on consumer GPUs or even mobile devices.
* Speeding Up Chatbots: Reduce latency in real-time applications like AI customer support.
* Enterprise AI Scaling: Fine-tune huge models on low-cost hardware using LoRA adapters.

# Lesson 7

## Deploying the Optimized Multi-GPU Trained Model (FastAPI & Docker) 🚀
Now that we’ve optimized and trained our LLM using DeepSpeed/FSDP, let's deploy it using FastAPI for real-world inference.

📌 Steps to Deploy the Model

1️⃣ Load the trained model & tokenizer<br>
2️⃣ Create an API endpoint using FastAPI<br>
3️⃣ Dockerize the API for scalable deployment<br>



📌 Step 1: Install Required Dependencies

In [None]:
pip install fastapi uvicorn torch transformers deepspeed


📌 Step 2: Create a FastAPI Inference Server
This script will serve the LLM and generate text from API requests.

📜 app.py (FastAPI Server)

In [None]:
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize FastAPI app
app = FastAPI()

# Load model and tokenizer in half precision on the GPU
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
model.eval()  # Inference only

# Define request format
class RequestData(BaseModel):
    prompt: str
    max_length: int = 100  # Total length in tokens, including the prompt

# A plain `def` endpoint runs in FastAPI's thread pool, so the blocking generate() call does not stall the event loop
@app.post("/generate/")
def generate_text(data: RequestData):
    inputs = tokenizer(data.prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():  # No gradients needed for inference
        output = model.generate(**inputs, max_length=data.max_length)
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return {"generated_text": response}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000


📌 Step 3: Create a Dockerfile for Deployment
To package and run the API anywhere, let’s build a Docker container.

📜 Dockerfile

In [None]:
# Use official PyTorch image with CUDA support
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy app files
COPY app.py .

# Expose API port
EXPOSE 8000

# Run FastAPI server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]


📌 Step 4: Build & Run the Docker Container

In [None]:
# Create a requirements.txt file
printf "fastapi\nuvicorn\ntorch\ntransformers\ndeepspeed\n" > requirements.txt

# Build Docker image
docker build -t llama-api .

# Run the container
docker run -p 8000:8000 --gpus all llama-api


📌 Step 5: Test the API
Once running, send a POST request to generate text.

Using curl

In [None]:
curl -X 'POST' 'http://localhost:8000/generate/' \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "Once upon a time,", "max_length": 50}'


Or using Python

In [None]:
import requests

url = "http://localhost:8000/generate/"
data = {"prompt": "The future of AI is", "max_length": 50}
response = requests.post(url, json=data)
print(response.json())


Real-World Deployment
* Deploy on Cloud (AWS, Azure, GCP): Use GPU-enabled VM (A100, V100)
* Scale with Kubernetes: Run multiple containers for load balancing
* Serve via API Gateway: Expose API securely for enterprise use

# Lesson 8

## 📌 Python Code: Fine-Tuning LLM on Healthcare Data

We'll take a pre-trained LLM (like GPT-2 or LLaMA) and fine-tune it on a healthcare dataset. Let's assume you have a corpus of medical texts or patient dialogues.


Step 1: Install Required Libraries

In [None]:
pip install transformers datasets accelerate


Step 2: Load and Preprocess Healthcare Data

For fine-tuning, we need a dataset with medical text. We can use a dataset from the Hugging Face Hub or any domain-specific dataset.

In [None]:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a healthcare dataset (e.g., medical dialogues)
dataset = load_dataset("med_dialogue")  # Example name; substitute the ID or path of your own medical dialogue dataset

# Load the tokenizer and model
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-2 has no padding token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Preprocess dataset: Tokenize the data
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding="max_length", max_length=512)

# Apply the preprocessing function to the dataset
encoded_dataset = dataset.map(preprocess_function, batched=True)


Step 3: Fine-Tune the Model on Healthcare Data

In [None]:
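from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Training arguments for fine-tuning
training_args = TrainingArguments(
    output_dir="./healthcare_model",
    evaluation_strategy="epoch",  # Named eval_strategy in recent transformers releases
    num_train_epochs=3,
    per_device_train_batch_size=4,
    logging_dir="./logs",
    save_steps=500,
    fp16=True,
)

# Fine-tuning using the Trainer API
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],  # Assumes the dataset provides a "test" split
    # Builds the labels from input_ids so the causal LM loss can be computed
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()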


Step 4: Save and Evaluate the Fine-Tuned Model

After fine-tuning, save the model and tokenizer for later use.

In [None]:
# Save the fine-tuned model and tokenizer
model.save_pretrained("./fine_tuned_healthcare_model")
tokenizer.save_pretrained("./fine_tuned_healthcare_model")

# Evaluate the model performance on a validation set
eval_results = trainer.evaluate(encoded_dataset["test"])
print(f"Evaluation Results: {eval_results}")


📌 Fine-Tuning for Finance and Legal
You can apply a similar approach to Finance and Legal AI using industry-specific datasets:

* Finance: Datasets like Financial News, Stock Market Data, and Quarterly Earnings Reports.
* Legal: Use datasets like Legal Cases, Contracts, or Court Transcripts.

Just replace the healthcare dataset with finance or legal text and adjust the tokenization and preprocessing steps accordingly.