# Data Augmentation Using Phi-4-mini

This repository demonstrates how to use a Large Language Model (LLM), specifically Microsoft Phi-4-mini-instruct, to perform data augmentation on a structured dataset. The goal is to generate additional realistic records for an Employee Salary Dataset to enlarge the training data for a salary prediction model.

Data augmentation is a machine learning technique that expands and diversifies a dataset by generating synthetic records. It can improve model robustness and generalization when the synthetic records resemble the real data. LLMs suit this task because they can infer the schema and value patterns from a few example rows in a prompt and produce plausible new rows in the same format.



In [None]:
!pip install transformers torch pandas

In [None]:
# Import necessary libraries
from transformers import pipeline
import pandas as pd
from io import StringIO

In [None]:
# Load a Pretrained LLM (Microsoft Phi 4 Mini instruct)
generator = pipeline("text-generation", model="microsoft/Phi-4-mini-instruct")

In [None]:
# Define a Structured Prompt for Data Generation
prompt = """
Generate a structured table in CSV format with columns:
Employee ID, Name, Age, Department, Salary ($), Experience (Years).

Example:
1, John Doe, 28, Engineering, 75000, 3
2, Jane Smith, 32, Marketing, 85000, 5
3, Alice Brown, 45, HR, 95000, 10
4, Robert White, 38, Engineering, 90000, 7
5, Emily Davis, 29, Finance, 72000, 4
6, Michael Johnson, 50, Sales, 110000, 20
7, Sarah Wilson, 31, HR, 78000, 6
8, David Lee, 42, Marketing, 88000, 12
9, Jennifer Moore, 27, Engineering, 71000, 2
10, Kevin Clark, 35, Finance, 93000, 8
11, Jessica Taylor, 30, Sales, 79000, 5
12, William Martin, 37, HR, 87000, 9
13, Olivia Adams, 40, Engineering, 99000, 14
14, Daniel Harris, 26, Finance, 70000, 2
15, Sophia Anderson, 33, Marketing, 85000, 7
16, Matthew Thomas, 29, Sales, 73000, 3
17, Laura Jackson, 36, HR, 89000, 10
18, Anthony Rodriguez, 41, Engineering, 105000, 15
19, Lisa Scott, 39, Marketing, 92000, 11
20, Andrew Hall, 34, Finance, 94000, 9
"""
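
The prompt above lists example rows but does not say how many new rows to produce or where the IDs should continue. A minimal sketch of building a more explicit prompt from seed rows follows. The names `COLUMNS`, `build_prompt`, `seed_rows`, and `example_prompt` are hypothetical and not part of the original notebook, and the instruction wording is only one possible phrasing.

```python
# Hypothetical helper: none of these names appear in the original notebook.
COLUMNS = ["Employee ID", "Name", "Age", "Department", "Salary ($)", "Experience (Years)"]


def build_prompt(seed_rows, n_new):
    """Return a prompt asking for n_new rows that continue seed_rows."""
    # Continue numbering after the largest ID among the examples.
    next_id = max(row[0] for row in seed_rows) + 1
    header = ", ".join(COLUMNS)
    examples = "\n".join(", ".join(str(v) for v in row) for row in seed_rows)
    return (
        f"Generate {n_new} new rows of CSV data with columns:\n"
        f"{header}.\n"
        f"Start Employee ID at {next_id}, do not repeat any example row, "
        "and output only CSV rows with no header or commentary.\n\n"
        f"Example:\n{examples}\n"
    )


# A few of the example rows from the prompt above, as tuples.
seed_rows = [
    (1, "John Doe", 28, "Engineering", 75000, 3),
    (2, "Jane Smith", 32, "Marketing", 85000, 5),
    (3, "Alice Brown", 45, "HR", 95000, 10),
]
example_prompt = build_prompt(seed_rows, n_new=10)
print(example_prompt)
```

Stating the row count and the starting ID makes the request easier to check afterwards, although the model may still ignore either instruction.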

In [None]:
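# Generate Synthetic Data using Microsoft Phi 4 Mini
# max_new_tokens caps only the generated continuation (max_length also counts the prompt tokens);
# return_full_text=False returns the continuation without echoing the prompt.
generated_data = generator(prompt, max_new_tokens=1000, num_return_sequences=1, return_full_text=False)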


In [None]:
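# Parse the Generated Text into a DataFrame
generated_text = generated_data[0]['generated_text']
# Drop the echoed prompt if the pipeline included it in the output
generated_text = generated_text.replace(prompt, "").strip()
# skipinitialspace strips the blank after each comma; on_bad_lines="skip" (pandas >= 1.3) drops rows with too many fields
columns = ["Employee ID", "Name", "Age", "Department", "Salary", "Experience"]
data = pd.read_csv(StringIO(generated_text), names=columns, skipinitialspace=True, on_bad_lines="skip")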
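
Model output often contains commentary, markdown, or a truncated final row in addition to the CSV rows, which `pd.read_csv` will either misparse or reject. The sketch below filters the text down to lines that look like complete rows before any DataFrame is built. It uses only the standard library; `FIELDS`, `ROW_START`, `extract_rows`, and `sample_output` are hypothetical names, and the sample text is hand-written for illustration, not actual model output.

```python
import csv
import re
from io import StringIO

FIELDS = ["Employee ID", "Name", "Age", "Department", "Salary", "Experience"]
# Assumption: a data row starts with an integer ID followed by a comma.
ROW_START = re.compile(r"^\s*\d+\s*,")


def extract_rows(text):
    """Return a list of dicts for the lines of text that look like complete data rows."""
    candidates = [line for line in text.splitlines() if ROW_START.match(line)]
    rows = []
    for fields in csv.reader(StringIO("\n".join(candidates)), skipinitialspace=True):
        fields = [f.strip() for f in fields]
        if len(fields) != len(FIELDS):
            continue  # truncated or malformed row
        emp_id, name, age, dept, salary, exp = fields
        try:
            rows.append({
                "Employee ID": int(emp_id),
                "Name": name,
                "Age": int(age),
                "Department": dept,
                "Salary": float(salary),
                "Experience": int(exp),
            })
        except ValueError:
            continue  # a numeric column held non-numeric text
    return rows


# Hand-written sample standing in for model output.
sample_output = """Here are more rows:
21, Grace Kim, 30, Finance, 81000, 6
22, Omar Diaz, 44, Sales, 97000
| 23 | bad | row |
23, Nina Patel, 27, HR, 69000, 3
"""
rows = extract_rows(sample_output)
print(rows)
```

Passing the result to `pd.DataFrame(rows)` yields a DataFrame with the same six columns used elsewhere in this notebook.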

In [None]:
# Print the Generated Data
print(data)
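
Before synthetic rows are added to training data, it is worth checking that they are plausible and do not duplicate the examples. The following is a minimal sketch of such a check on rows held as dictionaries (for a DataFrame, `data.to_dict("records")` produces this shape). The function `clean_generated`, the age bounds of 18 to 70, and the rule that experience cannot exceed age minus 18 are all assumptions chosen for illustration, and the `generated` rows are made up.

```python
def clean_generated(rows, seed_names, next_id):
    """Drop implausible or duplicate rows and renumber Employee ID from next_id."""
    seen = {name.lower() for name in seed_names}
    kept = []
    for row in rows:
        name_key = row["Name"].lower()
        # Assumed plausibility rules; adjust to the real dataset.
        plausible = (
            18 <= row["Age"] <= 70
            and 0 <= row["Experience"] <= row["Age"] - 18
            and row["Salary"] > 0
        )
        if not plausible or name_key in seen:
            continue
        seen.add(name_key)
        # Assign fresh sequential IDs so they never collide with the seed data.
        kept.append({**row, "Employee ID": next_id + len(kept)})
    return kept


# Made-up rows: one duplicate of a seed name, one with impossible experience.
generated = [
    {"Employee ID": 1, "Name": "John Doe", "Age": 28, "Department": "Engineering", "Salary": 75000, "Experience": 3},
    {"Employee ID": 21, "Name": "Grace Kim", "Age": 30, "Department": "Finance", "Salary": 81000, "Experience": 6},
    {"Employee ID": 21, "Name": "Omar Diaz", "Age": 25, "Department": "Sales", "Salary": 97000, "Experience": 15},
    {"Employee ID": 30, "Name": "Nina Patel", "Age": 27, "Department": "HR", "Salary": 69000, "Experience": 3},
]
cleaned = clean_generated(generated, seed_names=["John Doe", "Jane Smith"], next_id=21)
for row in cleaned:
    print(row)
```

Renumbering the IDs instead of trusting the generated ones avoids collisions when the model repeats or skips numbers.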