# 00 Data Augmentation
In this notebook, I perform data augmentation using a Large Language Model (LLM). The goal is to generate new documents by modifying the original ones, ensuring that the data is augmented while preserving textual diversity and maintaining the core meaning.

### Notes
I reviewed the standard approaches for textual data augmentation in the literature. However, given my knowledge of the Generative AI field, I preferred to delegate the creation of new samples to an LLM (in this case, Gemini). I believe that, when used correctly, these models can outperform traditional data augmentation techniques, especially for textual data.
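
For reference, two of the traditional token-level techniques (random deletion and a simple form of shuffling, both named later in the prompt) can be sketched in a few lines. This is only an illustration and not part of the pipeline in this notebook; the function names `random_deletion` and `random_swap`, the probability `p`, and the example sentence are hypothetical.

```python
import random

def random_deletion(tokens, p, rng):
    """Drop each token with probability p; always keep at least one token."""
    if not tokens:
        return []
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions n_swaps times (a simple form of shuffling)."""
    tokens = list(tokens)  # work on a copy so the input is left untouched
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sentence = "patient reports mild pain in the left knee".split()
deleted = random_deletion(sentence, p=0.2, rng=rng)
swapped = random_swap(sentence, n_swaps=2, rng=rng)
print(" ".join(deleted))
print(" ".join(swapped))
```

Operations like these only rearrange or remove existing tokens, whereas the LLM-based approach can also rephrase content.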

## 0 | Setup

### 0.1 | Install dependencies

In [None]:
!pip3 install pydantic==2.8.2 -q
!pip3 install google-cloud-aiplatform==1.62.0 -q
!pip3 install llama-index-llms-gemini==0.3.1 -q
!pip3 install llama-index==0.11.0 -q
!pip3 install llama-index-llms-vertex==0.3.2 -q

### 0.2 | Google Cloud Login

In [None]:
!gcloud auth login
!gcloud auth application-default login
!gcloud config set project <your-project-id>

### 0.3 | Declare Environment variables 
In this section, we define and initialize the environment variables required for the notebook. These variables typically include project configurations, model parameters, and other settings necessary to run the data augmentation process. Declaring them ensures a consistent and flexible environment for executing the subsequent steps.

In [None]:
import os

os.environ['PROJECT_ID'] = '<your-project-id>'
os.environ['REGION'] = 'europe-west4'
os.environ['MODEL'] = 'gemini-2.0-flash'  # Flash variant chosen for speed
os.environ['TEMPERATURE'] = '1'
os.environ['MAX_RETRIES'] = '5'
os.environ['MAX_TOKENS'] = '8192'
os.environ['PROMPT_TEMPLATE'] = """
# Goal

I will input a medical document belonging to a specific category (orthopaedics, radiology, gastroenterology, neurology, urology).

The objective is to perform data augmentation, i.e. to generate a new document inspired by the original but sufficiently distinct, while maintaining consistency with the reference medical category.
Use techniques such as:
    - random deletion,
    - random insertion,
    - shuffling,
    - synonym replacement

# Important

- A machine learning algorithm must interpret the generated document as different from the original one
- Never include control characters such as \\u0000-\\u001F or other special characters

# Parameters

Consider these parameters during the generation of the response:

    - <category>: The category label of the document
    - <original_document>: The input text document that needs augmentation

<category>
{category}
</category>

<original_document>
{original_document}
</original_document>


"""

### 0.4 | Output Model Definition
Pydantic models structure and validate the model's output, ensuring that it adheres to the expected format. Here the model has a single field, `text`, which holds the augmented document.

In [None]:
from pydantic import BaseModel, Field

class ResponseAI(BaseModel):
    text: str = Field(
        description="The new document", examples=["Lorem ipsum "]
    )

### 0.5 | Agent Class

This class keeps the code organized and provides a component that can be reused in other notebooks or applications.

In [None]:
from google.cloud import aiplatform
from vertexai.generative_models import HarmBlockThreshold, HarmCategory
from llama_index.core import PromptTemplate
from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.core.program import LLMTextCompletionProgram
from llama_index.llms.vertex import Vertex

class VertexAIAgent:
    def __init__(self, project_id, region, model_name, temperature, max_retries, max_tokens, prompt_template, safety_settings=None):
        """
        Initialize the Vertex AI Model for text generation.

        :param project_id: Google Cloud project ID
        :param region: Google Cloud region
        :param model_name: Name of the model to use
        :param temperature: Temperature setting for randomness
        :param max_retries: Max retries for requests
        :param max_tokens: Max tokens for the response
        :param prompt_template: The prompt template to use for generating text
        :param safety_settings: (optional) Custom safety settings for content filtering
        """
        self.project_id = project_id
        self.region = region
        self.model_name = model_name
        self.temperature = temperature
        self.max_retries = max_retries
        self.max_tokens = max_tokens
        self.prompt_template = prompt_template

        aiplatform.init(project=self.project_id, location=self.region)

        if safety_settings is None:
            safety_settings = {
                HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
                HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
                HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_NONE,
                HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
                HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
            }

        self.llm = Vertex(
            model=self.model_name,
            temperature=self.temperature,
            max_retries=self.max_retries,
            max_tokens=self.max_tokens,
            safety_settings=safety_settings,
        )

        self.parser = PydanticOutputParser(output_cls=ResponseAI)

        self.prompt_template = PromptTemplate(template=self.prompt_template)

        self.llm_assistant = LLMTextCompletionProgram.from_defaults(
            output_cls=ResponseAI,
            output_parser=self.parser,
            prompt=self.prompt_template,
            llm=self.llm,
            verbose=True,
            #response_validation=False
        )

    def generate_text(self, original_document, category):
        """
        Generate augmented text from the original document for the given category.

        :param original_document: The original document text to augment
        :param category: The category label of the document
        :return: The augmented text generated by the model
        """
        # The program fills the prompt placeholders and parses the reply into ResponseAI.
        output = self.llm_assistant(
            original_document=original_document,
            category=category,
        )
        return output.text
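
The keyword arguments passed to `generate_text` fill the `{category}` and `{original_document}` placeholders of the prompt template. As a rough illustration of that substitution, the sketch below uses plain `str.format` as a stand-in for the LlamaIndex `PromptTemplate`, with a shortened, hypothetical template (`DEMO_TEMPLATE`) and helper (`fill_prompt`).

```python
# Shortened stand-in for PROMPT_TEMPLATE; only the two placeholders matter here.
DEMO_TEMPLATE = """<category>
{category}
</category>

<original_document>
{original_document}
</original_document>
"""

def fill_prompt(template: str, category: str, original_document: str) -> str:
    """Substitute both placeholders; any other {name} in the template raises KeyError."""
    return template.format(category=category, original_document=original_document)

prompt = fill_prompt(
    DEMO_TEMPLATE,
    category="Neurology",
    original_document="Patient presents with recurrent headaches.",
)
print(prompt)
```

With `str.format`, only the template is parsed for braces, so curly braces inside the document text are inserted unchanged.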

## Augmentation

### 1 | Train Data Exploration

#### 1.1 | Distribution

In [None]:
import pandas as pd

train_data = pd.read_csv("./train/train_data.csv")
label_counts = train_data['domain'].value_counts()

print("Label Distribution:\n", label_counts)


#### 1.2 | Minority Classes

In [None]:
# Treat every class with fewer samples than the largest class as a minority class
minority_classes = label_counts[label_counts < label_counts.max()].index.tolist()

print("Minority Classes:", minority_classes)
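
One way to use this list is to work out how many augmented documents each class would need in order to match the largest class. The sketch below is an assumption about how that could be done: it uses `collections.Counter` on a hypothetical label list rather than the real `train_data`, and the helper name `samples_needed` is not part of the notebook.

```python
from collections import Counter

def samples_needed(labels):
    """Return {label: extra samples required to reach the largest class size}."""
    counts = Counter(labels)
    if not counts:
        return {}
    target = max(counts.values())
    return {label: target - n for label, n in counts.items() if n < target}

# Hypothetical labels, for illustration only.
demo_labels = ["Neurology"] * 5 + ["Urology"] * 3 + ["Radiology"] * 5 + ["Orthopedic"] * 1
deficit = samples_needed(demo_labels)
print(deficit)  # {'Urology': 2, 'Orthopedic': 4}
```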

### 2 | Augment Data
In this section, Gemini generates new documents from the originals. The output is validated with Pydantic.


#### 2.1 | Initialize LLM

In [None]:
_llm = VertexAIAgent(
    project_id=os.environ['PROJECT_ID'],
    region=os.environ['REGION'],
    model_name=os.environ['MODEL'],
    # Environment variables are always strings, so convert the numeric settings.
    temperature=float(os.environ['TEMPERATURE']),
    max_retries=int(os.environ['MAX_RETRIES']),
    max_tokens=int(os.environ['MAX_TOKENS']),
    prompt_template=os.environ['PROMPT_TEMPLATE'],
)

#### 2.2 | Utils

In [None]:
import re

def clean_text(text: str) -> str:
    """Removes control characters and unwanted whitespace from the text."""
    text = re.sub(r'[\x00-\x1F\x7F]', '', text)  # Remove control characters
    text = text.replace("\ufeff", "")  # Remove BOM characters if present
    text = text.strip()  # Trim whitespace
    return text


def save_augmented_document(augmented_text: str, category: str, file_name: str, out_folder: str):
    """
    Save the augmented version of the document to the specified folder.

    :param augmented_text: The generated augmented text.
    :param category: The category of the document (used in the output file name).
    :param file_name: The original file name, used to create the augmented file name.
    :param out_folder: The folder path where the augmented document will be saved.
    """

    augmented_file_path = os.path.join(out_folder, f"aug_{category}_{file_name}")

    with open(augmented_file_path, "w", encoding="utf-8") as aug_file:
        aug_file.write(augmented_text)

    print(f"Augmented document `{file_name}` saved to `{augmented_file_path}`")
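
Note that the `\x00-\x1F` range also covers tab (`\x09`), line feed (`\x0A`) and carriage return (`\x0D`), so `clean_text` joins lines without a separator. If that matters for a given corpus, one possible variant, sketched here under the hypothetical name `clean_text_keep_lines`, keeps line breaks and turns tabs into spaces before removing the remaining control characters. The sample string is made up for the comparison.

```python
import re

def clean_text(text: str) -> str:
    """Same logic as the helper above: drop all ASCII control characters and the BOM."""
    text = re.sub(r'[\x00-\x1F\x7F]', '', text)
    text = text.replace("\ufeff", "")
    return text.strip()

def clean_text_keep_lines(text: str) -> str:
    """Hypothetical variant: keep line breaks, turn tabs into spaces, drop other controls."""
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\t", " ")
    text = re.sub(r'[\x00-\x09\x0B-\x1F\x7F]', '', text)  # every control char except \n
    text = text.replace("\ufeff", "")
    return text.strip()

sample = "\ufeffChief complaint:\tknee pain\r\nHistory:\x00 none\n"
print(repr(clean_text(sample)))
print(repr(clean_text_keep_lines(sample)))
```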


#### 2.3 | Generate and Save

In [None]:

out_folder = "./train/augment"
os.makedirs(out_folder, exist_ok=True)

for category in ["Neurology", "Orthopedic", "Urology", "Radiology","Gastroenterology"]:

    category_files = train_data[train_data['domain'] == category]['file_name'].tolist()

    for file_name in category_files:
        file_path = f"./train/notes/{file_name}"

        try:
            with open(file_path, "r", encoding="utf-8") as file:
                original_text = file.read()
        except UnicodeDecodeError:
            with open(file_path, "r", encoding="latin1") as file:
                original_text = file.read()

        out_document = _llm.generate_text(
            original_document=clean_text(original_text),
            category=category
        )

        save_augmented_document(augmented_text=out_document,
                                category=category,
                                file_name=file_name,
                                out_folder=out_folder)
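
Because each LLM call can fail or be interrupted, it may help to make the loop resumable. The following is only a sketch under assumptions: `generate` is any callable that accepts the same keyword arguments as `generate_text` (a stub is used here so the block runs offline), and the helper name `augment_one` is hypothetical. It skips files that already have an output and reports failures instead of stopping the whole run.

```python
import os
import tempfile

def augment_one(generate, text, category, file_name, out_folder):
    """Write one augmented file; return 'skipped', 'saved' or 'failed'."""
    out_path = os.path.join(out_folder, f"aug_{category}_{file_name}")
    if os.path.exists(out_path):
        return "skipped"  # already generated in a previous run
    try:
        augmented = generate(original_document=text, category=category)
    except Exception as exc:  # broad on purpose: one failure should not stop the loop
        print(f"Failed on {file_name}: {exc}")
        return "failed"
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(augmented)
    return "saved"

def stub_generate(original_document, category):
    """Offline stand-in for the LLM call (assumption, for illustration only)."""
    return f"[{category}] {original_document.upper()}"

with tempfile.TemporaryDirectory() as tmp:
    first = augment_one(stub_generate, "mild knee pain", "Orthopedic", "1.txt", tmp)
    second = augment_one(stub_generate, "mild knee pain", "Orthopedic", "1.txt", tmp)
    print(first, second)
```

In the notebook loop, `generate` would be `_llm.generate_text`, and the returned status could be counted per category to see how many documents still need a retry.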


#### 2.4 | Save additional Training Data
Saves a .csv file with the header (file_name, domain).

In [None]:
import os
import csv
import re

def extract_domain_and_filename(file_name:str)->tuple[str, str]:
    """Extracts the domain from the filename while keeping the full filename."""
    match = re.match(r'aug_([A-Za-z]+)_\d+\.txt', file_name)
    if match:
        domain = match.group(1)
        return file_name, domain
    return file_name, "Unknown"

def generate_csv(folder_path:str, output_file:str):
    """Generates a CSV file with file names and domains."""
    file_entries = []

    for file_name in sorted(os.listdir(folder_path)):
        if file_name.endswith(".txt"):
            full_file_name, domain = extract_domain_and_filename(file_name)
            file_entries.append([full_file_name, domain])

    with open(output_file, mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["file_name", "domain"])
        writer.writerows(file_entries)

    print(f"CSV file '{output_file}' created successfully.")

generate_csv(folder_path="./train/augment/",output_file="./train/train_data_aug.csv")
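
As a quick check of the naming convention this step relies on: the pattern expects `aug_<letters>_<digits>.txt`, so a file whose original name is not purely numeric falls back to `"Unknown"`. The block below repeats the regular expression on hypothetical file names; the helper name `domain_of` is an assumption used only for this illustration.

```python
import re

# Same pattern as in extract_domain_and_filename.
PATTERN = re.compile(r'aug_([A-Za-z]+)_\d+\.txt')

def domain_of(file_name: str) -> str:
    """Return the domain encoded in an augmented file name, or 'Unknown'."""
    match = PATTERN.match(file_name)
    return match.group(1) if match else "Unknown"

for name in ["aug_Neurology_1042.txt", "aug_Urology_note7.txt", "1042.txt"]:
    print(name, "->", domain_of(name))
```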

## Considerations

This data augmentation approach generates new documents with enough variation to maintain diversity, so that our machine learning algorithms treat them as distinct samples.
For example, you may notice that details such as age and vitals change slightly, which lets us treat each generated document as a new sample.