# Introduction

Dive into the fascinating world of data privacy and LLMs with this hands-on Jupyter Notebook! We're taking a pre-trained language model for a spin, fine-tuning it on both redacted and un-redacted data, and uncovering the intriguing impacts of data privacy on AI performance.

Through this notebook, we aim to:

- Fine-tune a pre-selected LLM on both redacted and un-redacted datasets.
- Apply redaction to the dataset through the Private AI platform.
- Compare the outputs and evaluate the performance of the LLM when not fine-tuned, fine-tuned on un-redacted data, and fine-tuned on redacted data.

## A Note on Prerequisites

Running this notebook in its entirety requires a Private AI platform API key. [Get an API key here](https://www.private-ai.com/api-key/).

# Setup

Before fine-tuning our language model, we need to set up the environment with the necessary libraries. This notebook uses transformers for pre-trained models and tokenizers, torch for PyTorch's deep learning capabilities, and utility libraries such as pandas, tqdm, and sklearn. The cells below install and import the required modules; the custom QADataset class that loads and processes our QA (Question-Answering) data is defined later, in the fine-tuning section.

In [1]:
!pip install privateai_client datasets transformers

Collecting privateai_client
  Downloading privateai_client-1.3.2-py3-none-any.whl (20 kB)
Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Collecting huggingface-hub<1.0.0,>=0.14.0 (from datasets

In [2]:
from privateai_client import PAIClient
from privateai_client import request_objects

from datasets import load_dataset
import pandas as pd
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, get_linear_schedule_with_warmup
from torch.optim import AdamW
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt


In [3]:
from getpass import getpass

# Please visit https://www.private-ai.com/start/ to get a free key
PRIVATEAI_API_KEY = getpass("Please enter your Private AI Cloud API key: ")

Please enter your key: ··········


In [4]:
from google.colab import drive
drive.mount('/content/drive')

# Location of the source Q&A CSV and the folder where outputs and models are saved
SOURCE_DATA_URL = '/content/drive/My Drive/NvidiaDocumentationQandApairs.csv'
OUTPUT_DESTINATION = '/content/drive/My Drive/PrivateAI Fine Tuning Example/'


Mounted at /content/drive


In interactive applications like live demonstrations or presentations within Jupyter Notebooks, encountering detailed error tracebacks can be disruptive and may hinder the user experience. To make any potential errors during the chat interaction less intrusive and more user-friendly, we implement a custom exception handler. This handler simplifies the error messages, making them more concise and easier to understand at a glance.

In [None]:
import sys
import warnings

ipython = get_ipython()

def exception_handler(exception_type, exception, traceback):
    # Print only "ExceptionName: message" instead of the full traceback
    print("%s: %s" % (exception_type.__name__, exception), file=sys.stderr)

# Route every traceback IPython would display through the simplified handler
ipython._showtraceback = exception_handler

# Silence library warnings to keep the demo output short
warnings.filterwarnings("ignore")
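
To see what the handler prints without running IPython, the same formatting can be exercised in plain Python. This is a minimal sketch: `format_exception_summary` is a hypothetical helper name, not part of IPython or of this notebook.

```python
def format_exception_summary(exception_type, exception):
    """Return the one-line summary that the custom handler writes to stderr."""
    return "%s: %s" % (exception_type.__name__, exception)


try:
    int("not a number")
except ValueError as err:
    summary = format_exception_summary(type(err), err)

# A single line naming the exception type and its message, with no traceback
print(summary)
```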

# Interacting with the LLM
The following function will allow us to instantiate a model and interact with it via chat.

In [5]:
def start_chat(model_type):

    tokenizer = AutoTokenizer.from_pretrained(model_type, padding_side='left')

    model = AutoModelForCausalLM.from_pretrained(model_type)

    # Chat for 5 turns; each turn is generated independently of earlier turns
    for step in range(5):
        # Encode the new user input, append the EOS token, and return a PyTorch tensor
        new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

        # Generate a response, capping the total sequence length at 1000 tokens
        chat_history_ids = model.generate(new_user_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

        # Print only the newly generated tokens, skipping the echoed user input
        print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, new_user_input_ids.shape[-1]:][0], skip_special_tokens=True)))
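
For a decoder-only model, `model.generate` returns the prompt tokens followed by the newly generated tokens, which is why `start_chat` slices from `new_user_input_ids.shape[-1]` onward before decoding. The sketch below reproduces that slicing with plain Python lists. `fake_generate`, `extract_reply`, and the token ids are made up for illustration and stand in for the real model.

```python
def fake_generate(prompt_ids, continuation_ids):
    """Stand-in for model.generate: the output is the prompt plus the continuation."""
    return prompt_ids + continuation_ids


def extract_reply(generated_ids, prompt_length):
    """Keep only the tokens produced after the prompt."""
    return generated_ids[prompt_length:]


EOS_ID = 0                                   # made-up id playing the role of the EOS token
prompt_ids = [11, 12, 13, EOS_ID]            # made-up ids for the user input plus EOS
generated_ids = fake_generate(prompt_ids, [21, 22, 23, EOS_ID])

reply_ids = extract_reply(generated_ids, len(prompt_ids))
print(reply_ids)
```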

# Interacting with the Pre-Trained Model (No Fine-Tuning)

Before we assess the fine-tuned models, it's essential to establish a baseline for comparison. This baseline helps us understand the improvements our fine-tuning process brings, especially in handling domain-specific queries. In this section, we interact with the pre-trained 'microsoft/DialoGPT-medium' model without any fine-tuning. Our goal is to observe its response quality to a domain-specific question, providing a reference point for later comparisons with our fine-tuned models.

In [13]:
## Sample Questions to test
# Does CUDA support ISO
# Who made an AI system that sings Christmas songs

start_chat('microsoft/DialoGPT-medium')

>> User:Does CUDA support ISO


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


DialoGPT: I don't know, but I'm sure it does.


KeyboardInterrupt: ignored

# Fine Tuning Dataset
For this exercise, we'll be using a list of question-answer pairs based on Nvidia technical documentation.

In [6]:
df = pd.read_csv(SOURCE_DATA_URL)
df[['question', 'answer']].head()

| | question | answer |
|---|---|---|
| 0 | What is Hybridizer? | Hybridizer is a compiler from Altimesh that en... |
| 1 | How does Hybridizer generate optimized code? | Hybridizer uses decorated symbols to express p... |
| 2 | What are some parallelization patterns mention... | The text mentions using parallelization patter... |
| 3 | How can you benefit from accelerators without ... | You can benefit from accelerators' compute hor... |
| 4 | What is an example of using Hybridizer? | An example in the text demonstrates using Para... |


# Fine-Tuning the Model
In this section, we introduce the QADataset class, which tokenizes the question-answer pairs, and the fine_tune_model function, which streamlines fine-tuning our chosen large language model (LLM) on a specific dataset.


In [7]:
class QADataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []

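        # max_len is the longer of the longest question and the longest answer, in tokens.
        # Note: the encoded text is question + EOS + answer, so pairs longer than max_len are truncated.
        max_len_question, max_len_answer = self.find_max_length(data, tokenizer)
        self.max_len = max(max_len_question, max_len_answer)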

        questions = data['question'].tolist()
        answers = data['answer'].tolist()

        for question, answer in zip(questions, answers):
            encoding = tokenizer(question + tokenizer.eos_token + answer,
                                 truncation=True,
                                 max_length=self.max_len,
                                 padding='max_length',
                                 return_tensors='pt')
            self.input_ids.append(encoding['input_ids'])
            self.attn_masks.append(encoding['attention_mask'])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

    def find_max_length(self, data, tokenizer):
        max_len_question = 0
        max_len_answer = 0

        for _, row in data.iterrows():
            tokenized_question = tokenizer.tokenize(row['question'])
            tokenized_answer = tokenizer.tokenize(row['answer'])

            max_len_question = max(max_len_question, len(tokenized_question))
            max_len_answer = max(max_len_answer, len(tokenized_answer))

        return max_len_question, max_len_answer
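
To make the encoding step concrete, the sketch below mimics what the `tokenizer(...)` call in `QADataset` produces for one pair: the question and answer joined by an end-of-sequence token, truncated to `max_len`, padded on the right, and accompanied by an attention mask. It assumes a toy whitespace tokenizer. `encode_pair`, `EOS`, and `PAD` are hypothetical names, the sample answer is shortened for illustration, and real tokenizers work on subword ids rather than words.

```python
EOS = "<eos>"   # placeholder for tokenizer.eos_token
PAD = "<pad>"   # placeholder padding token (the notebook reuses EOS as the pad token)


def encode_pair(question, answer, max_len):
    """Join, truncate, and right-pad a question-answer pair; return tokens and mask."""
    tokens = question.split() + [EOS] + answer.split()
    tokens = tokens[:max_len]                        # mirrors truncation=True
    attention_mask = [1] * len(tokens)               # 1 marks a real token
    padding = max_len - len(tokens)
    tokens = tokens + [PAD] * padding                # mirrors padding='max_length'
    attention_mask = attention_mask + [0] * padding  # 0 marks a padded position
    return tokens, attention_mask


tokens, mask = encode_pair("What is Hybridizer?", "A compiler from Altimesh.", max_len=10)
print(tokens)
print(mask)
```

Calling `encode_pair` with a `max_len` smaller than the joined pair cuts off the end of the answer, which is the effect of choosing a sequence length shorter than question plus answer.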


In [8]:
def fine_tune_model(train_data, model_type, save_name,
                    batch_size=4, epochs=3, max_length=64, learning_rate=3e-5):
    # Note: max_length is currently unused; QADataset derives the sequence length from the data

    # Reduce the volume of training output to streamline demonstrations
    transformers.logging.set_verbosity_error()

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_type)
    model = AutoModelForCausalLM.from_pretrained(model_type)

    # Setting padding token
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id


    # Create the DataLoader
    dataset = QADataset(train_data, tokenizer)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    # Define optimizer and scheduler
    optimizer = AdamW(model.parameters(), lr=learning_rate)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(dataloader)*epochs)

    # Training loop
    model.train()

    train_losses = []
    validation_losses = []
    total_len = len(dataloader)

    for epoch in range(epochs):
        total_loss = 0
        progress_bar = tqdm(dataloader, desc=f'Epoch {epoch+1}/{epochs}', leave=False)


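        for input_ids_batch, attn_masks_batch in dataloader:
            # Each item has shape (1, max_len); drop only that axis so a batch of size 1 keeps its batch dimension
            input_ids_batch = input_ids_batch.squeeze(1).to(device)
            attn_masks_batch = attn_masks_batch.squeeze(1).to(device)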

            optimizer.zero_grad()
            outputs = model(input_ids_batch, attention_mask=attn_masks_batch, labels=input_ids_batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            scheduler.step()
            progress_bar.set_postfix({'training_loss': f'{loss.item():.3f}'})
            progress_bar.update(1)

            total_loss += loss.item()

        avg_train_loss = total_loss / total_len
        print(f"Epoch: {epoch + 1}, Training loss: {avg_train_loss:.4f}")
        train_losses.append(avg_train_loss)


    # Save the model
    model.save_pretrained(save_name)
    tokenizer.save_pretrained(save_name)

    return model, tokenizer
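
The scheduler created by `get_linear_schedule_with_warmup` decays the learning rate linearly from its initial value to zero over all training steps, after an optional linear warmup. The sketch below reimplements that rule in plain Python to show the numbers involved. `linear_schedule_lr` is a hypothetical helper, and the figure of 1777 steps per epoch is taken from the training output shown later in this notebook.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


steps_per_epoch = 1777          # batches per epoch reported by the progress bar
epochs = 3
total_steps = steps_per_epoch * epochs

# Learning rate at the start, the midpoint, and the end of training
for step in (0, total_steps // 2, total_steps):
    lr = linear_schedule_lr(step, base_lr=3e-5, num_warmup_steps=0,
                            num_training_steps=total_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```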

# Fine-Tuning on Domain-Specific Data (Un-Redacted)
The power of large language models (LLMs) like GPT-3 or DialoGPT lies in their capacity to generate human-like text based on the data they were trained on. However, while these models are proficient in general tasks, they may lack expertise in specific domains. To address this, we fine-tune our model on un-redacted, domain-specific data, which, in this case, is a set of question-and-answer pairs from Nvidia documentation. This approach aims to enhance the model's proficiency in our domain of interest.

__Important__
Please note that the resulting model and tokenizer will be saved to the 'save_name' folder. We will reference this folder later when we test the model fine-tuned on un-redacted data.

In [9]:

df = pd.read_csv(SOURCE_DATA_URL)
model_type = 'microsoft/DialoGPT-medium'
save_name = OUTPUT_DESTINATION + 'fine_tuned_DialoGPT'  # OUTPUT_DESTINATION already ends with '/'

model, tokenizer = fine_tune_model( df, model_type, save_name)

Epoch 1/3:   0%|          | 0/1777 [00:00<?, ?it/s]

Epoch: 1, Training loss: 1.2834


Epoch 2/3:   0%|          | 0/1777 [00:00<?, ?it/s]

Epoch: 2, Training loss: 0.9670


Epoch 3/3:   0%|          | 0/1777 [00:00<?, ?it/s]

Epoch: 3, Training loss: 0.8523
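
One way to read these numbers: the reported loss is a mean token-level cross-entropy, so its exponential is a perplexity. The sketch below converts the three epoch losses printed above. Keep in mind that this is training loss, so it measures fit to the training set rather than generalization, and the average in this notebook also covers padded positions because the labels are the unmasked input ids.

```python
import math

# Training losses printed by the fine-tuning run above
epoch_losses = [1.2834, 0.9670, 0.8523]

# Perplexity is the exponential of the mean cross-entropy (in nats)
perplexities = [math.exp(loss) for loss in epoch_losses]

for epoch, (loss, ppl) in enumerate(zip(epoch_losses, perplexities), start=1):
    print(f"Epoch {epoch}: loss = {loss:.4f}, perplexity = {ppl:.2f}")
```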


# Interacting with the First Fine-Tuned Model (Un-Redacted Data)

After establishing a baseline with the pre-trained model, we now explore the capabilities of our first fine-tuned model. This model has been fine-tuned on un-redacted, domain-specific data, potentially endowing it with a more profound understanding of domain-specific queries. However, it's important to note that while the model may exhibit enhanced subject matter expertise, it could also inadvertently reveal sensitive information contained within the training data.

In [10]:

## Sample questions to test
# Does CUDA support ISO
# Name an organization with deep AI research
# Who made an AI system that sings Christmas songs?

start_chat(OUTPUT_DESTINATION + 'fine_tuned_DialoGPT')

>> User:Does CUDA support ISO




DialoGPT: Yes, CUDA support in ISO is supported.


KeyboardInterrupt: ignored

# AI-Powered Redaction of Domain-Specific Data

Our previous efforts fine-tuned the language model on un-redacted data, so the resulting model risks leaking personally identifiable information (PII). To address this concern, we apply AI-powered redaction to our domain-specific dataset, using the Private AI API, before fine-tuning. This step anonymizes or removes sensitive information in the dataset, protecting individual privacy and confidential information.

__Important__
Accessing the cloud API requires a key. Please visit https://www.private-ai.com/start and click "Get Started Now" to get a free key.

In [12]:
%%time
def redact_text_through_API(dataframe, batch_size=1000):

    # Initialize a new DataFrame to store the results
    new_dataframe = pd.DataFrame(columns=dataframe.columns)

    # Split the dataframe into chunks
    batches = [dataframe[i:i+batch_size] for i in range(0, dataframe.shape[0], batch_size)]

    client = PAIClient("https", 'api.private-ai.com/deid' )
    client.add_api_key(PRIVATEAI_API_KEY)

    # if instead of the Cloud API you have a local container, here is a sample of how to connect to it
    #client = PAIClient(url="http://localhost:8080")

    marker = " >>>>>>>>>> "

    progress_bar = tqdm(range(0, len(dataframe)), desc='Processing Requests', leave=False)
    for batch in batches:

        # Initialize lists to store the unpacked responses
        prompt_answer_d = []
        prompt_question_d = []

        concat_list = []

        # Loop over each row of the chunk
        for index, row in batch.iterrows():
            concatenated_text = row['question'] + marker + row['answer']
            concat_list.append(concatenated_text)

        # Send text to Private AI platform to be redacted
        text_request = request_objects.process_text_obj(text=concat_list, link_batch=True)
        response = client.process_text(text_request).processed_text

        # Unpack response request
        for row in response:
            # Unpack the response and append them to respective lists
            # Using marker to split the response into original parts
            resp_parts = row.split(marker)

            # Append each part to respective lists
            prompt_question_d.append(resp_parts[0])
            prompt_answer_d.append(resp_parts[1])

        # Create a temporary DataFrame to store the results of this chunk
        temp_df = pd.DataFrame({
            'question': prompt_question_d,
            'answer': prompt_answer_d
        })

        # Append the results of this chunk to the new dataframe
        new_dataframe = pd.concat([new_dataframe, temp_df], ignore_index=True)
        # Advance by the actual chunk size; the final chunk may be smaller than batch_size
        progress_bar.update(len(batch))


    # If the original dataframe has more columns, you can copy them to the new dataframe here

    print('Completed request')

    return new_dataframe




CPU times: user 5 µs, sys: 1e+03 ns, total: 6 µs
Wall time: 8.58 µs
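
The marker technique used in `redact_text_through_API` can be illustrated without calling the service. The sketch below joins a question and an answer with the marker, passes the text through a stand-in redactor, and splits it back apart. `toy_redact` is a hypothetical regex-based placeholder and does not reflect how the Private AI API detects entities, and the sample text is made up. Splitting with `maxsplit=1` and checking the number of parts are defensive additions, not part of the original function.

```python
import re

MARKER = " >>>>>>>>>> "


def toy_redact(text):
    """Hypothetical stand-in for the redaction service: masks one hard-coded name."""
    return re.sub(r"\bAlice\b", "[NAME_1]", text)


def redact_pair(question, answer):
    """Redact a question-answer pair in a single call and split it back apart."""
    combined = question + MARKER + answer
    redacted = toy_redact(combined)
    parts = redacted.split(MARKER, 1)
    if len(parts) != 2:
        # The marker did not survive redaction, so the pair cannot be recovered
        raise ValueError("marker was altered during redaction; cannot split the pair")
    return parts[0], parts[1]


question, answer = redact_pair("Who wrote the kernel?", "Alice wrote the kernel.")
print(question)
print(answer)
```

Sending the pair as one string keeps the question and answer in the same request, at the cost of depending on the marker passing through redaction unchanged.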


In [15]:
%%time
df = pd.read_csv(SOURCE_DATA_URL)
redacted_df = redact_text_through_API(df, batch_size=100)
redacted_df.to_csv(OUTPUT_DESTINATION  +'output_redact.csv', index=False)

Processing Requests:   0%|          | 0/7108 [00:00<?, ?it/s]

Completed request
CPU times: user 9.63 s, sys: 392 ms, total: 10 s
Wall time: 6min 14s
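
The chunking in `redact_text_through_API` can be checked with simple arithmetic. With the 7108 rows shown by the progress bar and `batch_size=100`, the slicing expression yields 72 chunks, the last of which holds only 8 rows. The sketch below uses a plain list in place of the DataFrame; `chunk` is a hypothetical helper.

```python
def chunk(items, batch_size):
    """Split a sequence into consecutive slices of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


rows = list(range(7108))             # stand-in for the 7108 question-answer rows
batches = chunk(rows, batch_size=100)

print(len(batches))                  # number of requests sent to the API
print(len(batches[-1]))              # size of the final, partial chunk
```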


# Fine-Tuning on Redacted Data
Having applied AI-powered redaction to our domain-specific dataset, we proceed to fine-tune a new instance of our language model using this modified data. This example will help us understand the implications of training on redacted data, particularly regarding the model's ability to comprehend and generate domain-specific information while maintaining privacy.

__Important__ Please note that the resulting model and tokenizer will be saved to the 'save_name' folder. We will reference this folder later when we test the model fine-tuned on redacted data.

In [None]:
df = pd.read_csv(OUTPUT_DESTINATION  + 'output_redact.csv')
model_type = 'microsoft/DialoGPT-medium'
save_name = OUTPUT_DESTINATION + 'fine_tuned_DialoGPT_redact'  # OUTPUT_DESTINATION already ends with '/'

model, tokenizer = fine_tune_model( df, model_type, save_name)

Epoch 1/3:   0%|          | 0/1777 [00:00<?, ?it/s]

# Interacting with the Second Fine-Tuned Model (Redacted Data)
Having observed the responses from the pre-trained and the first fine-tuned models, we now turn our attention to a model fine-tuned on redacted data. This version aims to strike a balance between retaining domain-specific expertise and upholding data privacy. By training on redacted data, we expect the model to demonstrate proficiency in the subject matter while minimizing the disclosure of sensitive information.

In [None]:
## Sample questions to test
# Does CUDA support ISO
# Who made an AI system that sings Christmas songs?

# Load the model from the folder it was saved to during fine-tuning
start_chat(OUTPUT_DESTINATION + 'fine_tuned_DialoGPT_redact')

>> User: who made an AI system created the sings Christmas songs?


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


DialoGPT: The [OCCUPATION_8] of the [ORGANIZATION_2] created an AI system that automatically sang the songs of the [ORGANIZATION_2] Christmas music.


KeyboardInterrupt: Interrupted by user
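
The reply above contains redaction placeholders such as `[OCCUPATION_8]`, which suggests a simple check when comparing models: count how often generated text includes such markers. The sketch below assumes placeholders follow the `[LABEL_N]` pattern seen in this output. `find_placeholders` is a hypothetical helper, and other label formats would need a different pattern.

```python
import re

# Matches markers like [ORGANIZATION_2]: an upper-case label, an underscore, and a number
PLACEHOLDER_PATTERN = re.compile(r"\[([A-Z]+(?:_[A-Z]+)*)_(\d+)\]")


def find_placeholders(text):
    """Return (label, index) tuples for every redaction marker found in the text."""
    return [(label, int(index)) for label, index in PLACEHOLDER_PATTERN.findall(text)]


reply = ("The [OCCUPATION_8] of the [ORGANIZATION_2] created an AI system that "
         "automatically sang the songs of the [ORGANIZATION_2] Christmas music.")

placeholders = find_placeholders(reply)
print(placeholders)
print(len(placeholders))
```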


# Conclusion
In this notebook, we explored how fine-tuning a large language model (LLM) trades off domain-specific knowledge against data privacy. We fine-tuned separate instances of the same model on un-redacted and redacted datasets, then chatted with each one to gauge its performance.