# Fine-tuning Mistral-7B with LoRA for Code Review and API Deployment

This notebook fine-tunes the `mistralai/Mistral-7B-Instruct-v0.2` model for code review
and then serves the result over a web API. It relies on four tools:

- **Hugging Face Transformers:** For loading the pre-trained model and tokenizer.
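- **PEFT (Parameter-Efficient Fine-Tuning):** To apply LoRA (Low-Rank Adaptation), which trains small low-rank update matrices instead of the full weights and so cuts the number of trainable parameters and the optimizer memory.
- **BitsAndBytes:** For 4-bit quantization, which shrinks weight memory to roughly a quarter of its 16-bit size so that a 7B model can fit on a single consumer-grade GPU.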
- **FastAPI & Ngrok:** To deploy the fine-tuned model as a publicly accessible web API.
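
To make the LoRA saving concrete, the sketch below counts the parameters a rank-`r` adapter adds to one linear layer, `r * (d_in + d_out)`, and totals them for the `q_proj`/`v_proj` setup with `r=8` that this notebook configures later. The layer shapes (hidden size 4096, a 1024-wide value projection, 32 decoder layers) are assumptions taken from the published Mistral-7B configuration; nothing here is measured from the model.

```python
# Rough count of trainable LoRA parameters, assuming Mistral-7B layer shapes.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A rank-r adapter adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)

HIDDEN = 4096      # assumed hidden size
KV_DIM = 1024      # assumed v_proj output width (8 KV heads x 128 dims)
NUM_LAYERS = 32    # assumed number of decoder layers
RANK = 8           # matches r=8 in this notebook's LoraConfig

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * NUM_LAYERS
full_q_proj = HIDDEN * HIDDEN  # dense weights in a single q_proj, for comparison

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Dense q_proj weights:      {full_q_proj:,}")
```

Under these assumptions the adapter holds about 3.4 million parameters, far fewer than a single dense `q_proj` matrix in one layer. `model.print_trainable_parameters()` reports the actual figure once the adapter is attached.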

The process is broken down into the following steps:
1.  **Setup:** Installing the required libraries.
2.  **Configuration & Data Preparation:** Logging in to the Hub, loading a custom dataset, and tokenizing it for the model.
3.  **Model Loading:** Loading the base Mistral model with 4-bit quantization.
4.  **LoRA Configuration & Training:** Setting the LoRA parameters and running the training job with the `Trainer` class.
5.  **Saving & Pushing to Hub:** Saving the trained LoRA adapter and pushing it to the Hugging Face Hub.
6.  **API Deployment:** Creating a FastAPI endpoint and exposing it to the web with ngrok.

## 1. SETUP: INSTALL LIBRARIES

In [None]:
# First, we install all the required Python packages.
# - `datasets`: For loading our training data.
# - `accelerate`: To enable efficient training on different hardware setups.
# - `peft`: The Parameter-Efficient Fine-Tuning library from Hugging Face.
# - `bitsandbytes`: For model quantization.
# - `transformers`: The core library for working with transformer models.
# - `uvicorn`, `fastapi`, `pyngrok`, `nest-asyncio`, `pydantic`: For creating and deploying the API.
#
# The `-q` flag requests a quiet installation; `-U` upgrades any copy of
# `bitsandbytes` or `transformers` that is already installed.

!pip install -q datasets accelerate peft
!pip install -q -U bitsandbytes transformers
!pip install -q uvicorn fastapi pyngrok nest-asyncio pydantic
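
Because the install commands above do not pin versions, it can help to record what was actually installed before going further. The following is a minimal sketch using only the standard library; the package list simply mirrors the `pip install` lines, and the helper name `installed_versions` is made up for this example.

```python
# Report installed versions of the packages this notebook depends on.
from importlib.metadata import PackageNotFoundError, version

PACKAGES = [
    "datasets", "accelerate", "peft", "bitsandbytes", "transformers",
    "uvicorn", "fastapi", "pyngrok", "nest-asyncio", "pydantic",
]

def installed_versions(names):
    """Map each distribution name to its version string, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```

Keeping this output alongside the notebook makes it easier to reproduce a run later, since library APIs in this stack change between releases.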

## 2. CONFIGURATION & DATA PREPARATION

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from huggingface_hub import login

# --- Hugging Face Configuration ---
# Read the token from the HF_TOKEN environment variable rather than committing it
# to the notebook. The placeholder below is only a fallback for quick local tests.
HF_TOKEN = os.environ.get("HF_TOKEN", "YOUR_HUGGING_FACE_TOKEN_HERE")
login(token=HF_TOKEN)

# --- Model and Dataset Configuration ---
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"
DATA_FILE = "training_data.jsonl"  # Assumes this file is in the same directory
OUTPUT_HUB_REPO = "your-hf-username/mistral-code-review-lora" # Replace with your desired repo

# --- Load Dataset ---
# We load the training data from a JSON Lines file. Each line should be a JSON
# object with 'prompt' and 'completion' keys.
try:
    dataset = load_dataset("json", data_files=DATA_FILE)['train']
except FileNotFoundError:
    print(f"Error: The data file '{DATA_FILE}' was not found.")
    print("Please make sure it is uploaded to your Colab/Jupyter environment.")
    raise  # Stop here: the following cells cannot run without `dataset`.

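# --- Tokenizer Initialization ---
# The Mistral tokenizer ships without a padding token, so we reuse the
# end-of-sequence token for padding.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token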

def tokenize_function(examples):
    """
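    Tokenizes the concatenated prompt and completion text from the dataset.
    'labels' mirrors 'input_ids', the usual setup for causal language model
    fine-tuning: the model shifts the labels internally to predict the next
    token, so the loss covers prompt tokens as well as completion tokens.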
    """
    # Concatenate prompt and completion for a single training sequence
    combined_text = [p + c for p, c in zip(examples['prompt'], examples['completion'])]

    tokenized_inputs = tokenizer(
        combined_text,
        truncation=True,
        padding='max_length',
        max_length=512, # You can adjust this based on your VRAM and typical sample length
    )

    tokenized_inputs["labels"] = tokenized_inputs["input_ids"]
    return tokenized_inputs

# --- Process Dataset ---
# Apply the tokenization function to the entire dataset.
tokenized_dataset = dataset.map(tokenize_function, batched=True)
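
The tokenization step assumes every line of `training_data.jsonl` is a JSON object with string `prompt` and `completion` fields. A quick check before training can catch malformed records early. The sketch below uses only the standard library; the sample records and the helper name `validate_jsonl` are hypothetical and exist only to make the example runnable.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("prompt", "completion")

def validate_jsonl(path):
    """Return (number of valid records, list of problems found)."""
    valid, problems = 0, []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            missing = [k for k in REQUIRED_KEYS if not isinstance(record.get(k), str)]
            if missing:
                problems.append(f"line {line_no}: missing or non-string {missing}")
            else:
                valid += 1
    return valid, problems

# Hypothetical sample: two well-formed records and one without 'completion'.
sample_lines = [
    json.dumps({"prompt": "Review this diff:\n+ x = 1\n", "completion": "Looks fine."}),
    json.dumps({"prompt": "Review this diff:\n- y = 2\n", "completion": "Explain the removal."}),
    json.dumps({"prompt": "Review this diff:\n+ z = 3\n"}),
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

valid_count, issues = validate_jsonl(sample_path)
os.remove(sample_path)
print(f"{valid_count} valid record(s)")
for issue in issues:
    print(issue)
```

To check the real data, call `validate_jsonl(DATA_FILE)` and fix any reported lines before running `load_dataset`.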

## 3. MODEL LOADING AND QUANTIZATION

In [None]:
# Here, we load the base Mistral model. To make it fit into memory, we use
# 4-bit quantization via the BitsAndBytes library.

# --- BitsAndBytes Configuration (4-bit quantization) ---
# This configuration enables loading the model in 4-bit precision, which
# significantly reduces the memory footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # Use the "Normal Float 4" quantization type
    bnb_4bit_use_double_quant=True,    # Also quantize the quantization constants to save more memory
    bnb_4bit_compute_dtype=torch.bfloat16 # Use bfloat16 for computations
)

# --- Load Base Model ---
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto" # Automatically maps model layers to available devices (GPU/CPU)
)

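# --- Prepare Model for LoRA ---
# This step readies a quantized model for PEFT training: it freezes the base
# weights, upcasts the remaining non-quantized parameters (such as layer norms)
# to float32 for stability, and enables gradient checkpointing. The trainable
# LoRA adapters themselves are added later by `get_peft_model`.
model = prepare_model_for_kbit_training(model)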
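
A back-of-the-envelope calculation shows why 4-bit loading matters. The sketch below estimates weight memory only, and every constant in it is an assumption: a parameter count of about 7.24 billion, one float32 scaling constant per block of 64 weights, and, with double quantization, 8-bit constants that are themselves grouped in blocks of 256. It treats all parameters as quantized and ignores activations, LoRA weights, and optimizer state, so real usage will be higher.

```python
# Rough weight-memory estimate for 16-bit versus 4-bit (NF4) storage.

PARAMS = 7.24e9       # assumed parameter count for Mistral-7B
BLOCK_SIZE = 64       # assumed number of weights sharing one scaling constant
SECOND_BLOCK = 256    # assumed block size when the constants are quantized too

def weight_gb(params: float, bits_per_param: float) -> float:
    """Convert a per-parameter bit width into gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

# Extra bits per weight spent on scaling constants.
overhead_plain = 32 / BLOCK_SIZE                                      # float32 constants
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK)   # 8-bit constants plus their own float32 constants

fp16_gb = weight_gb(PARAMS, 16)
nf4_gb = weight_gb(PARAMS, 4 + overhead_plain)
nf4_dq_gb = weight_gb(PARAMS, 4 + overhead_double)

print(f"16-bit weights:             {fp16_gb:.2f} GB")
print(f"NF4 weights:                {nf4_gb:.2f} GB")
print(f"NF4 + double quantization:  {nf4_dq_gb:.2f} GB")
```

Under these assumptions, double quantization trims the per-weight overhead from 0.5 bits to about 0.127 bits, which is the memory saving that `bnb_4bit_use_double_quant=True` is after.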

## 4. LoRA CONFIGURATION AND MODEL TRAINING

In [None]:
# Now we define the LoRA configuration and set up the training arguments.

# --- LoRA Configuration ---
# Configures the parameters for Low-Rank Adaptation.
lora_config = LoraConfig(
    r=8,                             # Rank of the update matrices (a lower rank means fewer trainable parameters)
    lora_alpha=16,                   # Scaling factor: the LoRA update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"], # Apply LoRA to the query and value projections in the attention layers
    lora_dropout=0.1,                # Dropout for regularization
    bias="none",                     # Do not train bias terms
    task_type="CAUSAL_LM"            # Specify the task type
)

# --- Apply PEFT to the Model ---
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() 

# --- Training Arguments ---
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,   # Adjust based on your GPU memory
    gradient_accumulation_steps=4,   # Effective batch size per device: 2 x 4 = 8
    num_train_epochs=1,              # One pass over the data; raise it if the loss is still falling
    logging_dir="logs",
    save_total_limit=1,              # Only keep the latest checkpoint
    save_steps=50,                   # Save a checkpoint every 50 steps
    logging_steps=10,                # Log training metrics every 10 steps
    report_to="none",                # Can be set to "wandb", "tensorboard", etc.
    fp16=True                        # fp16 mixed precision; on GPUs with bfloat16 support, bf16=True matches the compute dtype above
)

# --- Initialize Trainer ---
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,             # Recent transformers releases rename this argument to `processing_class`
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

# --- Start Training ---
print("Starting model training...")
trainer.train()
print("Training complete.")
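
The training arguments interact: the per-device batch size and the gradient accumulation steps together set the effective batch size, which in turn determines how many optimizer steps one epoch takes and how often `save_steps` and `logging_steps` fire. The sketch below works through that arithmetic for a hypothetical dataset of 1,000 examples on a single GPU. The helper `training_schedule` is made up for this example, and the `Trainer` may round slightly differently when the dataset does not divide evenly.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, epochs=1, num_devices=1):
    """Return (effective batch size, approximate number of optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

NUM_EXAMPLES = 1000  # hypothetical dataset size, not taken from the notebook

effective_batch, total_steps = training_schedule(
    NUM_EXAMPLES, per_device_batch=2, grad_accum=4
)
checkpoint_steps = list(range(50, total_steps + 1, 50))  # save_steps=50
log_events = total_steps // 10                           # logging_steps=10

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps:      {total_steps}")
print(f"Checkpoints at steps: {checkpoint_steps}")
print(f"Logged steps:         {log_events}")
```

With a much smaller dataset, `save_steps=50` may never trigger during the run, which is worth keeping in mind when relying on intermediate checkpoints.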

## 5. SAVE AND PUSH MODEL TO HUB

In [None]:
# After training, we save the LoRA adapter locally and then push it to the
# Hugging Face Hub for easy access and deployment.

# --- Save LoRA Adapter and Tokenizer Locally ---
LORA_ADAPTER_DIR = "mistral-code-review-lora-adapter"
model.save_pretrained(LORA_ADAPTER_DIR)
tokenizer.save_pretrained(LORA_ADAPTER_DIR)
print(f"Model adapter and tokenizer saved to '{LORA_ADAPTER_DIR}'.")

# --- Push to Hugging Face Hub ---
# This pushes only the trained adapter weights, not the full model. The earlier
# `login()` call already stored the credentials, so no token argument is needed
# (`use_auth_token` is deprecated in recent library versions).
model.push_to_hub(OUTPUT_HUB_REPO)
tokenizer.push_to_hub(OUTPUT_HUB_REPO)
print(f"Model and tokenizer pushed to Hugging Face Hub: {OUTPUT_HUB_REPO}")
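
Before pushing, it can be reassuring to confirm that the saved directory really contains a small adapter rather than a full model. PEFT normally writes an `adapter_config.json` next to an adapter weights file (`adapter_model.safetensors` in recent versions). The sketch below summarizes such a directory with the standard library only; the helper `summarize_adapter` is hypothetical, and the stand-in directory with made-up contents is there only so the example runs without a trained adapter.

```python
import json
import tempfile
from pathlib import Path

def summarize_adapter(adapter_dir):
    """Return file sizes in bytes plus a few fields from adapter_config.json, if present."""
    root = Path(adapter_dir)
    sizes = {p.name: p.stat().st_size for p in sorted(root.iterdir()) if p.is_file()}
    config_path = root / "adapter_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8")) if config_path.exists() else {}
    fields = {key: config.get(key) for key in ("r", "lora_alpha", "target_modules")}
    return sizes, fields

# Stand-in directory with made-up contents, used only to demonstrate the helper.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_config = {"r": 8, "lora_alpha": 16, "target_modules": ["q_proj", "v_proj"]}
    (Path(demo_dir) / "adapter_config.json").write_text(json.dumps(demo_config), encoding="utf-8")
    (Path(demo_dir) / "adapter_model.safetensors").write_bytes(b"\x00" * 1024)
    demo_sizes, demo_fields = summarize_adapter(demo_dir)

print(demo_sizes)
print(demo_fields)
```

On a real run, `summarize_adapter(LORA_ADAPTER_DIR)` should show an adapter file measured in megabytes, not gigabytes, along with the rank and target modules set in `LoraConfig`.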

## 6. API DEPLOYMENT WITH FASTAPI AND NGROK

In [None]:
# This final section sets up a web server to serve our fine-tuned model and
# uses ngrok to create a public URL for it.

from fastapi import FastAPI
from pydantic import BaseModel
import nest_asyncio
from pyngrok import ngrok
import uvicorn

# --- Initialize FastAPI App ---
app = FastAPI(
    title="Code Review API",
    description="An API to get AI-powered code reviews using a fine-tuned Mistral model.",
    version="1.0.0"
)

# --- Load Fine-Tuned Model for Inference ---
# We reload the base model and then apply our trained LoRA adapter on top.
# If this runs in the same session as training, free the training model first
# (or restart the runtime) so that two copies of the base model do not compete
# for GPU memory.
print("Loading model for inference...")
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto"
)
inference_model = PeftModel.from_pretrained(base_model, OUTPUT_HUB_REPO)
inference_model.eval()  # Disable dropout for inference
inference_tokenizer = AutoTokenizer.from_pretrained(OUTPUT_HUB_REPO)
print("Inference model and tokenizer loaded successfully.")

# --- Define API Request Schema ---
class ReviewRequest(BaseModel):
    """Defines the structure for a code review request."""
    diff: str

# --- Define API Endpoint ---
@app.post("/review")
def review_code(request: ReviewRequest):
    """
    Accepts a code diff and returns a constructive review.
    """

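    # Build the prompt from adjacent string literals so that the function's
    # indentation does not leak into the text sent to the model.
    prompt = (
        "You are an expert code reviewer. Here's a code diff from a pull request:\n"
        f"```diff\n{request.diff}\n```\n"
        "Please write a constructive review summarizing:\n"
        "- What the code does\n"
        "- What changed\n"
        "- How it can be improved (if anything)\n"
        "- Whether it follows best practices\n"
    )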

    print("Generating review for the provided diff...")
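    inputs = inference_tokenizer(prompt, return_tensors="pt").to(inference_model.device)

    # Generate the review. `do_sample=True` is required for `temperature` to take effect.
    output = inference_model.generate(
        **inputs, max_new_tokens=400, do_sample=True, temperature=0.7
    )
    # Decode only the newly generated tokens so the prompt is not echoed back.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    review = inference_tokenizer.decode(new_tokens, skip_special_tokens=True)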
    print("Review generated successfully.")

    return {"review": review}

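# --- Ngrok Tunneling Setup ---
# Read the authtoken from the NGROK_AUTH_TOKEN environment variable (a name chosen
# for this notebook); the placeholder below is only a fallback.
NGROK_AUTH_TOKEN = os.environ.get("NGROK_AUTH_TOKEN", "YOUR_NGROK_AUTH_TOKEN_HERE")
ngrok.set_auth_token(NGROK_AUTH_TOKEN)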

# `nest_asyncio` is needed to run `uvicorn` in a notebook environment
nest_asyncio.apply()

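# --- Expose API using Ngrok ---
# `ngrok.connect` returns a tunnel object; its `public_url` attribute holds the address.
public_url = ngrok.connect(8000).public_url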
print(f"Public API URL: {public_url}")
print("Send a POST request to the /review endpoint of the URL above.")
print("Example using curl:")
print(f"curl -X POST \"{public_url}/review\" -H \"Content-Type: application/json\" -d '{{\"diff\": \"your_code_diff_here\"}}'")


# Start the server. This call blocks the cell until it is interrupted.
uvicorn.run(app, host="0.0.0.0", port=8000)
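
Since `uvicorn.run` blocks the notebook, the endpoint has to be called from a separate process or machine. As an alternative to the `curl` command printed above, here is a minimal standard-library client sketch. The function names and the example URL are hypothetical; substitute the public URL printed by the notebook.

```python
import json
import urllib.request

def build_review_request(base_url: str, diff: str) -> urllib.request.Request:
    """Build a POST request for the /review endpoint with a JSON body."""
    payload = json.dumps({"diff": diff}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/review",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def post_review(base_url: str, diff: str, timeout: float = 120.0) -> str:
    """Send the diff to the API and return the 'review' field of the response."""
    request = build_review_request(base_url, diff)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["review"]

# Hypothetical URL; nothing is sent here, the request is only constructed.
example = build_review_request("https://example.ngrok.app", "- old_line\n+ new_line")
print(example.get_method(), example.full_url)
```

Calling `post_review(public_url, diff_text)` from another Python session would then return the generated review as a string. The generous timeout reflects that generating up to 400 tokens on a quantized 7B model can take a while.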