# Corporate Synergy Bot 7B - Training Notebook

This notebook trains a LoRA adapter on Mistral-7B-Instruct-v0.2 that translates between plain English and corporate speak.

**Important**: Select a GPU runtime before running any cells (Runtime → Change runtime type → T4 GPU).

## 1. Install Dependencies

In [ ]:
# Check GPU availability and CUDA version first
!nvidia-smi
!nvcc --version

# Install PyTorch built for CUDA 11.8 (Colab usually ships with PyTorch preinstalled, so this may be redundant)
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install required packages with specific versions
!pip install -q transformers==4.36.2
!pip install -q datasets==2.14.7
!pip install -q peft==0.7.1
!pip install -q accelerate==0.25.0
!pip install -q bitsandbytes==0.41.3
!pip install -q tensorboard
!pip install -q huggingface-hub

# Verify CUDA installation
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")

## 2. Check GPU Availability

In [None]:
# Check GPU
!nvidia-smi

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

## 3. Login to Hugging Face

In [None]:
from huggingface_hub import login

# Log in to Hugging Face. Use a token with write access, since training pushes to the Hub.
print("Please enter your Hugging Face token:")
login()

## 4. Load Dataset with Fixed Schema

In [None]:
from datasets import load_dataset, Dataset, DatasetDict
import pandas as pd

# Try to load the dataset
try:
    dataset = load_dataset("phxdev/corporate-speak-dataset")
    print("✅ Dataset loaded successfully!")
    print(f"Dataset structure: {dataset}")
except Exception as e:
    print(f"Error loading dataset: {e}")
    print("Creating dataset from scratch...")
    
    # If loading fails, create a simple dataset
    data = {
        "text": [
            "### Instruction: Transform to corporate speak\n### Input: let's meet tomorrow\n### Response: Let's sync up tomorrow to align on our objectives",
            "### Instruction: Transform to corporate speak\n### Input: good job\n### Response: Excellent execution on those deliverables",
            "### Instruction: Translate corporate speak to plain English\n### Input: We need to leverage our synergies\n### Response: We need to work together",
        ] * 1000  # 3 templates x 1000 repeats = 3,000 rows
    }
    
    # Create a 90/10 train/validation split
    df = pd.DataFrame(data)
    split_idx = int(len(df) * 0.9)
    train_df = df[:split_idx]
    val_df = df[split_idx:]
    
    dataset = DatasetDict({
        "train": Dataset.from_pandas(train_df, preserve_index=False),
        "validation": Dataset.from_pandas(val_df, preserve_index=False)
    })

# Display first example
print("\nFirst training example:")
print(dataset['train'][0])
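
The fallback examples above all follow one `### Instruction` / `### Input` / `### Response` template. A minimal sketch of building and parsing that template in plain Python follows. `build_prompt` and `parse_prompt` are hypothetical helper names, not part of the dataset or any library, and the parser assumes each field fits on one line.

```python
def build_prompt(instruction: str, input_text: str, response: str = "") -> str:
    """Format one example in the notebook's three-section template."""
    prompt = (
        f"### Instruction: {instruction}\n"
        f"### Input: {input_text}\n"
        f"### Response:"
    )
    # Training examples carry the target text; inference prompts stop at the marker.
    return f"{prompt} {response}" if response else prompt


def parse_prompt(text: str) -> dict:
    """Split a formatted example back into its fields (single-line fields only)."""
    fields = {}
    for line in text.split("\n"):
        for key in ("Instruction", "Input", "Response"):
            marker = f"### {key}:"
            if line.startswith(marker):
                fields[key.lower()] = line[len(marker):].strip()
    return fields


example = build_prompt(
    "Transform to corporate speak",
    "good job",
    "Excellent execution on those deliverables",
)
print(example)
print(parse_prompt(example))
```

Leaving `response` empty yields the inference-style prompt used later when testing the model.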

## 5. Initialize Model and Tokenizer

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Model name
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# Load tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Mistral's tokenizer ships without a pad token, so reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print(f"✅ Tokenizer loaded: {model_name}")

## 6. Load Model with 4-bit Quantization

In [None]:
# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load model
print("Loading model with 4-bit quantization...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# The KV cache is incompatible with gradient checkpointing during training
model.config.use_cache = False

print("✅ Model loaded successfully!")
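
As a rough back-of-envelope check on why 4-bit loading matters on a 16 GB T4, the sketch below estimates the memory taken by the weights alone. The parameter count of about 7.24 billion is an assumption, and the estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: roughly 7.24 billion parameters for Mistral-7B.
NUM_PARAMS = 7.24e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
four_bit_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gib:.1f} GiB")
print(f"4-bit weights: ~{four_bit_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the fp16 footprint, which leaves room for the LoRA adapter, activations, and optimizer state.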

## 7. Configure LoRA

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare model for k-bit training
print("Preparing model for LoRA training...")
model = prepare_model_for_kbit_training(model)

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
)

# Apply LoRA
model = get_peft_model(model, peft_config)
print("\n✅ LoRA configuration applied!")
model.print_trainable_parameters()
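
To sanity-check the figure printed by `print_trainable_parameters()`, the adapter size can be worked out by hand: each targeted projection gains two matrices, A (r × in_features) and B (out_features × r). The sketch below assumes the layer dimensions from Mistral-7B's public config (hidden size 4096, 32 layers, 32 attention heads, 8 key/value heads); verify them against `model.config` before relying on the result.

```python
# Assumed Mistral-7B dimensions; check against model.config.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS   # per-head width
KV_DIM = NUM_KV_HEADS * HEAD_DIM      # k/v projections are narrower (grouped-query attention)

# (in_features, out_features) for each targeted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "k_proj": (HIDDEN_SIZE, KV_DIM),
    "v_proj": (HIDDEN_SIZE, KV_DIM),
    "o_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
}


def lora_param_count(rank: int) -> int:
    """Trainable parameters added by LoRA across all layers."""
    # Each adapter adds rank * in_features (A) plus out_features * rank (B).
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


print(f"Trainable LoRA parameters at r=16: {lora_param_count(16):,}")
print(f"LoRA scaling factor (lora_alpha / r): {32 / 16}")
```

Doubling `r` doubles the adapter size, while `lora_alpha / r` sets how strongly the adapter's output is scaled.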

## 8. Tokenize Dataset

In [None]:
def tokenize_function(examples):
    # Ensure we're working with the text column
    if 'text' in examples:
        texts = examples['text']
    else:
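        # Fallback: build the text from instruction/input/output columns.
        # `examples` is a dict of columns, so take the batch size from a column;
        # len(examples) would count columns, not rows.
        batch_size = len(examples[next(iter(examples))])
        empty = [''] * batch_size
        texts = [f"### Instruction: {inst}\n### Input: {inp}\n### Response: {out}"
                 for inst, inp, out in zip(examples.get('instruction', empty),
                                           examples.get('input', empty),
                                           examples.get('output', empty))]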
    
    return tokenizer(
        texts,
        truncation=True,
        max_length=512,
        padding="max_length"
    )

# Tokenize the dataset
print("Tokenizing dataset...")
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names
)

print("✅ Dataset tokenized successfully!")
print(f"Training examples: {len(tokenized_dataset['train'])}")
print(f"Validation examples: {len(tokenized_dataset['validation'])}")
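
With `batched=True`, the function passed to `map` receives a dict of columns rather than a list of rows, so the row count has to come from a column's length. The plain-Python sketch below illustrates this with an ordinary dict standing in for the batch; `batch_size` and `rows_to_texts` are hypothetical helper names.

```python
# A column-oriented batch: three columns, two rows.
batch = {
    "instruction": ["Transform to corporate speak", "Transform to corporate speak"],
    "input": ["let's meet tomorrow", "good job"],
    "output": [
        "Let's sync up tomorrow to align on our objectives",
        "Excellent execution on those deliverables",
    ],
}


def batch_size(columns: dict) -> int:
    """Number of rows, read from the first column."""
    return len(columns[next(iter(columns))])


def rows_to_texts(columns: dict) -> list:
    """Join instruction/input/output columns into the training template."""
    empty = [""] * batch_size(columns)
    return [
        f"### Instruction: {inst}\n### Input: {inp}\n### Response: {out}"
        for inst, inp, out in zip(
            columns.get("instruction", empty),
            columns.get("input", empty),
            columns.get("output", empty),
        )
    ]


print(f"Columns: {len(batch)}, rows: {batch_size(batch)}")
for text in rows_to_texts(batch):
    print(text)
    print("-" * 50)
```

Note that `len(batch)` counts the columns, which only matches the row count by coincidence.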

## 9. Setup Training Arguments

In [None]:
from transformers import TrainingArguments

# Training arguments
training_args = TrainingArguments(
    output_dir="./corporate-synergy-bot-7b",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.1,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    logging_steps=25,
    save_steps=100,
    eval_steps=100,
    save_total_limit=3,
    evaluation_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    push_to_hub=True,
    hub_model_id="phxdev/corporate-synergy-bot-7b",
)

print("✅ Training arguments configured!")
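
The batch size, gradient accumulation, and epoch settings together determine how many optimizer steps the run takes, which in turn decides how often `logging_steps`, `eval_steps`, and `save_steps` fire. The sketch below approximates that arithmetic for a single GPU. The dataset size of 800 is a hypothetical value, and the exact rounding inside `Trainer` may differ slightly.

```python
import math


def training_schedule(num_examples, per_device_batch_size=4, grad_accum_steps=4,
                      num_epochs=3, warmup_ratio=0.1, num_devices=1):
    """Approximate optimizer-step counts for a given dataset size."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer step is taken every `grad_accum_steps` batches.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_steps = math.ceil(num_epochs * steps_per_epoch)
    return {
        "effective_batch_size": per_device_batch_size * grad_accum_steps * num_devices,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


# Hypothetical training-set size; substitute len(tokenized_dataset["train"]).
schedule = training_schedule(num_examples=800)
for name, value in schedule.items():
    print(f"{name}: {value}")

EVAL_STEPS = 100
print(f"evaluations during training: {schedule['total_steps'] // EVAL_STEPS}")
```

For small datasets the total step count can fall near or below `eval_steps`, in which case lowering `eval_steps` and `save_steps` gives `load_best_model_at_end` more checkpoints to choose from.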

## 10. Create Trainer

In [None]:
from transformers import Trainer, DataCollatorForLanguageModeling

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator
)

print("✅ Trainer created and ready!")

## 11. Start Training

In [None]:
# Train the model
print("🚀 Starting training...")
print("This will take approximately 2-3 hours on a T4 GPU")
print("-" * 50)

trainer.train()

print("\n✅ Training complete!")

## 12. Save and Push Model

In [None]:
# Save the model locally
print("Saving model...")
trainer.save_model()
print("✅ Model saved locally!")

# Push to Hugging Face Hub
print("\nPushing to Hugging Face Hub...")
trainer.push_to_hub()
tokenizer.push_to_hub("phxdev/corporate-synergy-bot-7b")

print("\n🎉 Model successfully pushed to: https://huggingface.co/phxdev/corporate-synergy-bot-7b")

## 13. Test the Trained Model

In [None]:
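def generate_response(prompt, max_new_tokens=150):
    # Send inputs to whichever device the model was placed on
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    model.eval()  # disable dropout for inference
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            use_cache=True,  # the cache was disabled for training; re-enable it here
            pad_token_id=tokenizer.eos_token_id
        )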
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract just the response part
    if "### Response:" in response:
        response = response.split("### Response:")[-1].strip()
    return response

# Test examples
test_cases = [
    "### Instruction: Transform to corporate speak\n### Input: let's meet tomorrow\n### Response:",
    "### Instruction: Transform to corporate speak\n### Input: I need help\n### Response:",
    "### Instruction: Translate corporate speak to plain English\n### Input: We need to leverage our synergies\n### Response:",
    "### Instruction: Transform to tech corporate speak (seniority: senior)\n### Input: good job on the project\n### Response:"
]

print("🧪 Testing the model...\n")
for test in test_cases:
    print(f"Input: {test.split('### Input: ')[1].split('### Response:')[0].strip()}")
    print(f"Output: {generate_response(test)}")
    print("-" * 50)
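
The response-extraction step can be checked without a GPU by running it on plain strings. The sketch below mirrors the split on `### Response:` and, as an added assumption that is not in the original function, also cuts the output off if the model starts a new `###` section on its own. `extract_response` is a hypothetical helper name.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str) -> str:
    """Return the text after the last response marker, or the whole string."""
    if RESPONSE_MARKER not in decoded:
        return decoded.strip()
    answer = decoded.split(RESPONSE_MARKER)[-1]
    # Assumption: anything after a further "###" is an unwanted continuation.
    return answer.split("###")[0].strip()


decoded_example = (
    "### Instruction: Transform to corporate speak\n"
    "### Input: let's meet tomorrow\n"
    "### Response: Let's sync up tomorrow to align on our objectives\n"
    "### Instruction: Transform to corporate speak"
)
print(extract_response(decoded_example))
```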

## 14. Create Model Card

In [None]:
model_card = """---
license: apache-2.0
base_model: mistralai/Mistral-7B-Instruct-v0.2
tags:
- generated_from_trainer
- text-generation
- conversational
- corporate-speak
datasets:
- phxdev/corporate-speak-dataset
language:
- en
---

# Corporate Synergy Bot 7B

This model transforms casual language into professional corporate communication and vice versa.

## Model Details

- **Base Model**: Mistral-7B-Instruct-v0.2
- **Training**: LoRA fine-tuning
- **Parameters**: r=16, alpha=32
- **Dataset**: 7,953 bidirectional examples

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = PeftModel.from_pretrained(model, "phxdev/corporate-synergy-bot-7b")

prompt = "### Instruction: Transform to corporate speak\\n### Input: let's meet\\n### Response:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Examples

**Casual → Corporate:**
- "let's meet" → "Let's sync up to align on our objectives"
- "good job" → "Excellent execution on those deliverables"

**Corporate → Casual:**
- "We need to leverage our synergies" → "We need to work together"
"""

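# Save the model card locally
with open("README.md", "w", encoding="utf-8") as f:
    f.write(model_card)

print("✅ Model card created!")

# Upload it so it replaces the card auto-generated by trainer.push_to_hub()
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="README.md",
    path_in_repo="README.md",
    repo_id="phxdev/corporate-synergy-bot-7b",
)
print("✅ Model card uploaded!")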

## 🎉 Congratulations!

Your Corporate Synergy Bot 7B has been trained and uploaded to Hugging Face!

**Next Steps:**
1. Check your model at: https://huggingface.co/phxdev/corporate-synergy-bot-7b
2. Create a demo Space using the `app.py` file in the demo folder
3. Share your bot with the community!

Remember: To maximize stakeholder value, we must leverage our synergies through collaborative paradigm shifts! 😄