# Automated Code Review Assistant (GARA)
## Fine-tuning Mistral-7B with LoRA and GitHub CI/CD Integration

This notebook covers **Phase 2: LLM Fine-Tuning** and the model-hosting part of **Phase 3: GitHub Integration and Automation** of the GARA project. It fine-tunes the Mistral-7B model with Low-Rank Adaptation (LoRA) on a custom dataset of code diffs and review comments, then sets up a basic inference API that CI/CD workflows can call.

**Project Phases:**
1.  **Data Collection & Preprocessing:** (Covered in `data_collection.py` / `preprocessing.py`)
2.  **LLM Fine-Tuning:** (This notebook)
3.  **GitHub Integration and Automation:** (This notebook - model hosting, API endpoint)
4.  **Evaluation and Optimization:** (Subsequent steps)
5.  **Documentation & Presentation:** (Overall project documentation)

### 1. Setup and Library Installation
First, we install the required libraries. `bitsandbytes` provides 4-bit quantization and `accelerate` handles device placement, which together make training on a single GPU feasible. `peft` enables LoRA fine-tuning, `transformers` provides the core LLM functionality, and `datasets` loads the training data. `pyngrok`, `uvicorn`, `fastapi`, and `nest-asyncio` are used to set up the inference API.

In [None]:
!pip install -q datasets accelerate peft
!pip install -U bitsandbytes
!pip install -U transformers
!pip install -q pyngrok uvicorn fastapi nest-asyncio
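
Because the install commands above are unpinned, it can help to record which versions were actually installed before training. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the `pip install` lines above.

```python
from importlib import metadata

# Distribution names taken from the pip install commands in this section.
PACKAGES = [
    "datasets", "accelerate", "peft", "bitsandbytes", "transformers",
    "pyngrok", "uvicorn", "fastapi", "nest-asyncio",
]

def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions().items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Saving this output alongside the trained adapter makes it easier to reproduce a run later.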

### 2. Hugging Face Authentication and Dataset Loading
To access and push models on Hugging Face Hub, authentication is required. We also load our preprocessed `training_data.jsonl` dataset, which contains `prompt` and `completion` pairs (code diffs and their corresponding review comments).

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, TaskType, PeftModel, PeftConfig, prepare_model_for_kbit_training
import torch
import os
from huggingface_hub import login  # Interactive authentication helper

# IMPORTANT: Authenticate interactively with login() (used below) or set HF_TOKEN
# as an environment variable through Colab secrets. Avoid pasting the token into
# the notebook itself, where it can leak through saved outputs or version control.

# os.environ["HF_TOKEN"] = "hf_YOUR_TOKEN_HERE"  # Discouraged: prefer interactive login or Colab secrets.

# Prompts for your Hugging Face token if you are not already logged in.
# The token needs 'write' permission if you plan to push models.
login()

# Load the preprocessed dataset
dataset = load_dataset("json", data_files="training_data.jsonl")['train']
print(f"Dataset loaded with {len(dataset)} examples.")
print("Sample example from dataset:")
print(dataset[0])
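
Since the rest of the notebook depends on every record having non-empty `prompt` and `completion` strings, a quick validation pass over the JSONL file can catch problems before training starts. The sketch below is an illustration only: the helper `validate_jsonl` is hypothetical, and the sample records are made up (they borrow the prompt template used later for inference, which is an assumption about how the training prompts were built).

```python
import json

def validate_jsonl(lines, required=("prompt", "completion")):
    """Return (line_number, problem) tuples; an empty list means all records look usable."""
    problems = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_no, "record is not a JSON object"))
            continue
        for key in required:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((line_no, f"missing or empty '{key}'"))
    return problems

# Illustrative records only; the real file is produced by the preprocessing step.
sample = [
    json.dumps({"prompt": "Review this code diff:\n\n+x = 1\n\nComment:",
                "completion": " Use a descriptive variable name."}),
    json.dumps({"prompt": "Review this code diff:\n\n-y = 2\n\nComment:"}),
]
print(validate_jsonl(sample))
```

To check the real file, pass an open file handle instead of the sample list, for example `validate_jsonl(open("training_data.jsonl", encoding="utf-8"))`.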

### 3. Tokenization
We load the tokenizer for our base model (`mistralai/Mistral-7B-Instruct-v0.2`) and reuse its end-of-sequence token for padding. The `tokenize` function then converts each `prompt` and `completion` pair into numerical `input_ids` and an `attention_mask`, truncating or padding every sequence to 512 tokens. The `labels` are set to `input_ids` for causal language modeling, where the model predicts the next token in the sequence.

In [None]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    # dataset.map(..., batched=True) passes each column as a list of strings,
    # so join prompt and completion pairwise. A plain `+` would concatenate
    # the two lists instead of the individual strings.
    combined = [p + c for p, c in zip(batch['prompt'], batch['completion'])]
    tokenized = tokenizer(
        combined,
        truncation=True,
        padding='max_length',
        max_length=512,
    )
    return {
        'input_ids': tokenized['input_ids'],
        'attention_mask': tokenized['attention_mask'],
        'labels': tokenized['input_ids'],
    }

# Apply the tokenization function to the entire dataset
tokenized_dataset = dataset.map(tokenize, batched=True) 
print(f"Tokenized dataset size: {len(tokenized_dataset)}")
print("Sample tokenized example (first 10 input_ids):")
print(tokenized_dataset[0]['input_ids'][:10])
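
To make the role of `labels` concrete: in causal language modeling, the prediction at position `i` is scored against the token at position `i + 1`, and label positions set to `-100` are ignored by the loss. Hugging Face causal LM models perform this shift internally, which is why `labels` can simply be a copy of `input_ids`. The toy sketch below uses made-up token IDs and a hypothetical helper, `next_token_targets`, to show which (context, target) pairs contribute to the loss.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def next_token_targets(input_ids, labels):
    """List the (context, target) pairs that contribute to the loss."""
    pairs = []
    for i in range(len(input_ids) - 1):
        target = labels[i + 1]  # prediction at position i is compared with label i + 1
        if target != IGNORE_INDEX:
            pairs.append((input_ids[: i + 1], target))
    return pairs

# Made-up token IDs: three real tokens followed by two padding positions.
input_ids = [11, 22, 33, 2, 2]
labels = [11, 22, 33, IGNORE_INDEX, IGNORE_INDEX]  # padding masked out

for context, target in next_token_targets(input_ids, labels):
    print(f"{context} -> {target}")
```

Note that `DataCollatorForLanguageModeling(mlm=False)`, used later in this notebook, rebuilds `labels` from `input_ids` and masks every position equal to the pad token. Because the pad token here is the EOS token, EOS positions are masked as well, so the model may get little signal about when to stop generating.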

### 4. Model Loading and LoRA Configuration
We load the base Mistral-7B model with 4-bit quantization using `bitsandbytes` to reduce memory footprint and enable training on Colab's GPUs. The model is then prepared for LoRA (Low-Rank Adaptation) training, which is a parameter-efficient fine-tuning technique.

In [None]:
# Configure 4-bit quantization with BitsAndBytes
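bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    # float16 matches fp16=True in the TrainingArguments below and works on GPUs
    # without bfloat16 support; switch both to bfloat16/bf16 on Ampere or newer.
    bnb_4bit_compute_dtype=torch.float16
)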

# Load the base pre-trained model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare the model for k-bit training (important for LoRA with quantization)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)

print("Model loaded and prepared for LoRA training:")
model.print_trainable_parameters() 
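
The figure reported by `print_trainable_parameters()` can be estimated by hand. LoRA adds two matrices per adapted linear layer, `A` with shape `(r, d_in)` and `B` with shape `(d_out, r)`, so each adapted module contributes `r * (d_in + d_out)` trainable weights. The sketch below assumes the dimensions from the published Mistral-7B configuration (hidden size 4096, 32 layers, 8 key/value heads with head dimension 128, so `v_proj` maps 4096 to 1024). These values are not stated in this notebook, so treat them as assumptions and compare the result against the actual printout.

```python
def lora_param_count(r, shapes, num_layers):
    """Trainable LoRA weights: r * (d_in + d_out) per adapted module, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed Mistral-7B projection shapes (d_in, d_out); not taken from this notebook.
shapes = {
    "q_proj": (4096, 4096),
    "v_proj": (4096, 1024),  # grouped-query attention: 8 KV heads * 128 dims
}

r, lora_alpha = 8, 16
total = lora_param_count(r, shapes, num_layers=32)
print(f"Estimated trainable LoRA parameters: {total:,}")
print(f"LoRA scaling factor (lora_alpha / r): {lora_alpha / r}")
```

Under these assumptions the adapter has a few million trainable weights, a small fraction of a percent of the base model, which is what makes fine-tuning feasible on a single GPU.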

### 5. Fine-tuning with Hugging Face Trainer
We define `TrainingArguments` for our fine-tuning process, specifying parameters like batch size, number of epochs, logging, and saving strategies. The `Trainer` class then orchestrates the fine-tuning, optimizing the LoRA adapter weights.

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="outputs", 
    per_device_train_batch_size=2, 
    gradient_accumulation_steps=4, 
    num_train_epochs=1, 
    logging_dir="logs", 
    save_total_limit=1, 
    save_steps=20, 
    logging_steps=5, 
    report_to="none", 
    fp16=True
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older transformers releases
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

print("Starting fine-tuning...")
trainer.train()
print("Fine-tuning complete!")
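
The interaction between batch size, gradient accumulation, and `save_steps` is easier to see with numbers. The following sketch estimates the effective batch size, the number of optimizer steps, and how many checkpoints are written for the settings above. The helper `training_schedule` is hypothetical, the dataset size of 1,000 is a placeholder, and the exact step count can differ slightly from what `Trainer` reports because its handling of a partial final accumulation window has varied between releases.

```python
import math

def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs=1, num_devices=1, save_steps=20):
    """Estimate effective batch size, optimizer steps, and checkpoint count."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    steps_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum_steps))
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch_size": effective_batch,
        "total_optimizer_steps": total_steps,
        "checkpoints_written": total_steps // save_steps,
    }

# 1,000 examples is a placeholder; substitute len(tokenized_dataset).
print(training_schedule(num_examples=1000, per_device_batch_size=2, grad_accum_steps=4))
```

With `save_total_limit=1`, only the most recent of these checkpoints is kept on disk.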

### 6. Saving and Pushing the Fine-tuned Adapter to Hugging Face Hub
After fine-tuning, we save the LoRA adapter weights and the tokenizer. These are then pushed to the Hugging Face Hub, making the fine-tuned model accessible for inference and future use.

In [None]:
model.save_pretrained("lora-adapter")
tokenizer.save_pretrained("lora-adapter")
print("LoRA adapter and tokenizer saved locally to 'lora-adapter' directory.")

# Push the fine-tuned adapter and tokenizer to Hugging Face Hub
# Ensure your Hugging Face token has 'write' permissions.
try:
    # Replace 'ishaanj91' with your Hugging Face username or organization name
    hub_model_name = "ishaanj91/mistral-code-review-lora"
    # login() above already stored the token, so no explicit auth argument is
    # needed (the old use_auth_token keyword is deprecated).
    model.push_to_hub(hub_model_name)
    tokenizer.push_to_hub(hub_model_name)
    print(f"Successfully pushed LoRA adapter and tokenizer to {hub_model_name} on Hugging Face Hub.")
except Exception as e:
    print(f"Failed to push to Hugging Face Hub. Error: {e}")
    print("Please ensure your Hugging Face token has 'write' permissions and you have access to the repository.")
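
Before relying on the pushed adapter, it can be worth confirming that the local save produced the files that `PeftModel.from_pretrained` will look for. This is a small sketch with a hypothetical helper, `check_adapter_dir`. It assumes the usual PEFT layout of an `adapter_config.json` plus a weights file, which is `adapter_model.safetensors` in recent releases and `adapter_model.bin` in older ones.

```python
from pathlib import Path

def check_adapter_dir(path):
    """Return the names of expected adapter files that are missing from `path`."""
    directory = Path(path)
    missing = []
    if not (directory / "adapter_config.json").is_file():
        missing.append("adapter_config.json")
    # Weights are stored as safetensors by recent PEFT releases, .bin by older ones.
    weight_files = ("adapter_model.safetensors", "adapter_model.bin")
    if not any((directory / name).is_file() for name in weight_files):
        missing.append("adapter_model.safetensors (or adapter_model.bin)")
    return missing

missing = check_adapter_dir("lora-adapter")
print("Adapter directory looks complete." if not missing else f"Missing: {missing}")
```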

### 7. Setting up the Inference API (FastAPI with ngrok)
To integrate our fine-tuned model into CI/CD workflows, we need an API endpoint. FastAPI is used to create a simple web server, and `ngrok` is employed to expose this local server to the internet, providing a temporary public URL for testing.

In [None]:
from fastapi import FastAPI
from pydantic import BaseModel
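from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

app = FastAPI(title="GARA Code Review API")

base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
lora_adapter_id = "ishaanj91/mistral-code-review-lora" # Replace with your actual repo name

print(f"Loading base model: {base_model_id}...")
# Pass quantization settings through BitsAndBytesConfig; the bare load_in_4bit
# keyword on from_pretrained is deprecated in recent transformers releases.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,  # works on GPUs without bfloat16 support
    ),
)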
print("Base model loaded.")

print(f"Loading LoRA adapter from: {lora_adapter_id}...")
model = PeftModel.from_pretrained(base_model, lora_adapter_id)
print("LoRA adapter loaded.")

print(f"Loading tokenizer from: {lora_adapter_id}...")
tokenizer = AutoTokenizer.from_pretrained(lora_adapter_id)
print("Tokenizer loaded.")

class ReviewRequest(BaseModel):
    diff: str

@app.post("/review")
def review_code(request: ReviewRequest):
    prompt = f"Review this code diff:\n\n{request.diff}\n\nComment:"
    
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Generate the review from the model
    output = model.generate(**inputs, max_new_tokens=300, temperature=0.7, do_sample=True)
    
    # generate() returns the prompt tokens followed by the new tokens, so slice
    # off the prompt by its token length. Splitting the decoded text on
    # 'Comment:' would break whenever the diff itself contains that string.
    prompt_length = inputs["input_ids"].shape[1]
    review = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True).strip()
    
    return {"review": review}
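
To show what a caller of this endpoint has to send, here is a minimal client sketch using only the standard library. The function names `build_review_request` and `request_review` are hypothetical, and the base URL is a placeholder. The only details taken from the API above are the `/review` path, the `diff` field of `ReviewRequest`, and the `review` key in the response.

```python
import json
import urllib.request

def build_review_request(base_url, diff):
    """Build a POST request for /review with the JSON body ReviewRequest expects."""
    body = json.dumps({"diff": diff}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/review",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def request_review(base_url, diff, timeout=120):
    """Send the diff and return the 'review' string from the JSON response."""
    request = build_review_request(base_url, diff)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["review"]

# Placeholder URL; no network call happens until request_review() is invoked.
req = build_review_request("http://localhost:7860", "+print('hello')")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Generation with `max_new_tokens=300` on a quantized 7B model can take a while, which is why the sketch uses a generous default timeout.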

### 8. Exposing the API with ngrok
Finally, we use `ngrok` to create a public URL for our FastAPI application running on Colab. This allows external services (like GitHub Actions) to send review requests to our model.

In [None]:
from pyngrok import ngrok
import nest_asyncio
import uvicorn
import threading

# Apply nest_asyncio to allow running asyncio event loops within Jupyter/Colab
nest_asyncio.apply()

# IMPORTANT: Set your ngrok authentication token here
# You can get one from https://ngrok.com/signup
# ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN_HERE") # Uncomment and replace if you haven't set it globally

# Define the port for our FastAPI application
PORT = 7860
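# ngrok.connect() returns a tunnel object; its public_url attribute holds the address
tunnel = ngrok.connect(PORT)
public_url = tunnel.public_url
print(f"Your FastAPI application is publicly accessible at: {public_url}")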

# Start the Uvicorn server in a separate thread
# This allows the Colab cell to remain responsive while the server runs
uvicorn_thread = threading.Thread(
    target=uvicorn.run,
    args=(app,), # Pass the FastAPI app object directly
    kwargs={"host": "0.0.0.0", "port": PORT}
)
uvicorn_thread.start()

print(f"FastAPI app is running locally on http://0.0.0.0:{PORT}")
print("Visit the public URL to interact with your API (e.g., append /docs for Swagger UI).")
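
When an external service such as a GitHub Actions job sends review requests, pull request diffs can be much longer than the 512-token sequences used during training. One option, sketched below under that assumption, is to split a unified diff into one chunk per file and send each chunk as its own request. The helper `split_diff_by_file` is hypothetical and relies only on the `diff --git` header lines that `git diff` emits; the sample diff is made up.

```python
def split_diff_by_file(diff_text):
    """Split a unified diff into one chunk per file, keyed on 'diff --git' headers."""
    chunks, current = [], []
    for line in diff_text.splitlines():
        if line.startswith("diff --git ") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

# Made-up two-file diff for illustration.
sample_diff = "\n".join([
    "diff --git a/app.py b/app.py",
    "--- a/app.py",
    "+++ b/app.py",
    "@@ -1 +1 @@",
    "-x = 1",
    "+x = 2",
    "diff --git a/util.py b/util.py",
    "--- a/util.py",
    "+++ b/util.py",
    "@@ -3 +3 @@",
    "-pass",
    "+return None",
])

chunks = split_diff_by_file(sample_diff)
for chunk in chunks:
    print(chunk.splitlines()[0])
```

Each chunk can then be posted to the `/review` endpoint separately, and the responses combined into a single pull request comment.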
