# TeluguGPT — Colab Notebook

This Colab-friendly notebook covers **data processing**, **tokenization**, **LoRA fine-tuning** (PEFT) using `sarvamai/sarvam-1` (or fallback), and a **minimal Streamlit deployment** via `ngrok`. Follow cells in order. ⚠️ **Read license notes before using `sarvam/sarvam-1` for deployment.**

## 0) Setup & Install Dependencies
Run the following cell in Google Colab. This installs required packages (may take several minutes). If you use Colab Pro, enable GPU runtime (Runtime → Change runtime type → GPU).

In [1]:
# Install required libraries
# Note: installing bitsandbytes and some packages may require CUDA compatibility on the Colab runtime.
!pip install -q torch --index-url https://download.pytorch.org/whl/cu118
!pip install -q transformers==4.33.2 datasets accelerate==0.20.3 bitsandbytes==0.40.0 peft "xformers>=0.0.20" gradio streamlit pyngrok gtts trl unsloth
# Optional: Coqui TTS (may add heavy deps)
!pip install -q TTS
# Show versions
import torch, transformers, datasets, peft, accelerate, bitsandbytes
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("accelerate:", accelerate.__version__)
print("bitsandbytes:", bitsandbytes.__version__)

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mBuilding wheel for tokenizers [0m[1;32m([0m[32mpyproject.toml[0m[1;32m)[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m See above for output.
  
  [1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for tokenizers (pyproject.toml) ... [?25l[?25herror
[31m  ERROR: Failed building wheel for tokenizers[0m[31m
[0m[31mERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (tokenizers)[0m[31m
[0m[31mERROR: Ignored the following versions that require a different python version: 0.0.10.2 Requires-Python >=3.6.0, <3.9; 0.0.10.3 Requires-Python >=3.6.0, <3.9; 0.0.11 Requires-Python >=3.6.0, <3.9; 0.0.12 Requi

## 1) Upload your dataset (`content_cleaned.json`)
Upload your `content_cleaned.json` file using the file uploader below. The notebook expects the JSON file to be an array of objects with fields like `instruction`, `input`, `output`, `moral` (or `response`).

In [2]:
from google.colab import files
import json, os
uploaded = files.upload()
# Save uploaded files
for fn in uploaded:
    print("Uploaded:", fn)
# Check file
if "content_cleaned.json" in uploaded:
    with open("content_cleaned.json","r",encoding="utf-8") as f:
        data = json.load(f)
    print("Loaded JSON entries:", len(data))
    print("Example:", list(data[0].items())[:3])
else:
    # Try to find a json file
    for fn in uploaded:
        if fn.endswith(".json"):
            os.rename(fn, "content_cleaned.json")
            with open("content_cleaned.json","r",encoding="utf-8") as f:
                data = json.load(f)
            print("Renamed and loaded:", fn)
            break


Saving content_cleaned.json to content_cleaned (1).json
Uploaded: content_cleaned (1).json
Renamed and loaded: content_cleaned (1).json


## 2) Prepare Hugging Face Dataset
This cell converts your JSON into a `datasets.Dataset`, formatting prompt/response pairs.

In [3]:
from datasets import Dataset
import json
import os

# Load JSON
with open("content_cleaned.json","r",encoding="utf-8") as f:
    raw = json.load(f)

# Normalize keys (support variations)
examples = []
for rec in raw:
    instruction = rec.get("instruction") or rec.get("prompt") or ""
    inp = rec.get("input","")
    # some files use 'response' or 'output'
    response = rec.get("response") or rec.get("output") or rec.get("answer") or ""
    moral = rec.get("moral","")
    # combine if moral present
    if moral:
        response = response + "\n\nనీతి: " + moral
    prompt = instruction.strip()
    if inp:
        prompt = prompt + "\n\nఇన్‌పుట్: " + inp.strip()
    # final prompt style in Telugu request
    full_prompt = f"ఈ క్రింది ఆజ్ఞ ఆధారంగా తెలుగు కథను రాయండి:\n\n{prompt}\n\nకథ:"
    examples.append({"prompt": full_prompt, "response": response.strip()})

print("Converted examples:", len(examples))
ds = Dataset.from_list(examples)
ds = ds.train_test_split(test_size=0.05)
print(ds)
ds.save_to_disk("data/telugu_stories_ds")
print('Saved dataset to data/telugu_stories_ds')

Converted examples: 20
DatasetDict({
    train: Dataset({
        features: ['prompt', 'response'],
        num_rows: 19
    })
    test: Dataset({
        features: ['prompt', 'response'],
        num_rows: 1
    })
})


Saving the dataset (0/1 shards):   0%|          | 0/19 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1 [00:00<?, ? examples/s]

Saved dataset to data/telugu_stories_ds


## 3) Tokenize Dataset using model tokenizer
Use `sarvamai/sarvam-1` tokenizer. If you prefer a fallback open model (e.g., Mistral), change `BASE_MODEL` accordingly.

In [4]:
from transformers import AutoTokenizer
from datasets import load_from_disk
BASE_MODEL = "sarvamai/sarvam-1"  # change to fallback if needed
ds = load_from_disk("data/telugu_stories_ds")
print("Loading tokenizer:", BASE_MODEL)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=False)
max_length = 1024

# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(batch):
    # Here we concatenate prompt + response so model learns continuation
    texts = [p + "\n\n" + r for p,r in zip(batch["prompt"], batch["response"])]
    out = tokenizer(texts, truncation=True, padding="max_length", max_length=max_length)
    out["labels"] = out["input_ids"].copy()
    return out

tokenized = ds.map(tokenize_fn, batched=True, remove_columns=ds["train"].column_names)
tokenized.save_to_disk("data/tokenized_telugu")
print("Saved tokenized dataset to data/tokenized_telugu")

Loading tokenizer: sarvamai/sarvam-1


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Map:   0%|          | 0/19 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/19 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1 [00:00<?, ? examples/s]

Saved tokenized dataset to data/tokenized_telugu


In [None]:
# Update bitsandbytes for 8-bit quantization
!pip install -U bitsandbytes



## 4) Fine-tune with LoRA (PEFT)
This cell runs a small LoRA fine-tuning run. Adjust batch sizes/epochs to match your GPU. **Ensure you have accepted the model license on Hugging Face if required.**

In [6]:
import torch
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_from_disk
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import bitsandbytes # Import bitsandbytes

BASE_MODEL = "sarvamai/sarvam-1"  # change if needed
tokenized = load_from_disk("data/tokenized_telugu")
train_ds = tokenized["train"]
eval_ds = tokenized["test"]

print("Loading base model (this may take some time)...")
# Ensure bitsandbytes is updated before loading the model
# !pip install -U bitsandbytes # This is already done in a previous cell, but we ensure it's imported
import bitsandbytes # Add import here
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, load_in_8bit=True, device_map='auto')

# Prepare for k-bit training if using bitsandbytes
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj","v_proj","k_proj","o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=False)
# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./models/story_model_lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./models/story_model_lora")
print('Saved LoRA model to ./models/story_model_lora')

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading base model (this may take some time)...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

  trainer = Trainer(
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


Step,Training Loss


Saved LoRA model to ./models/story_model_lora


## 5) Quick Inference Test
Load the LoRA adapters and generate a story for a sample prompt.

In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE_MODEL = "sarvamai/sarvam-1"
LORA_DIR = "./models/story_model_lora"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=False)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, LORA_DIR, torch_dtype=torch.float16)
model.eval()

prompt = "ఒక చక్కటి పిల్లల bedtime కథ చెప్పండి: ఒక చిన్న ఆవు మరియు నక్క గురించి."
inputs = tokenizer(prompt, return_tensors="pt").to(next(model.parameters()).device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8, top_p=0.95)
text = tokenizer.decode(out[0], skip_special_tokens=True)
print("Generated:\n", text)

`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Generated:
 ఒక చక్కటి పిల్లల bedtime కథ చెప్పండి: ఒక చిన్న ఆవు మరియు నక్క గురించి. 
 
 ### దశల వారీ వివరణ: 
 1. **శీర్షికను పరిచయం చేయండి*: "ఒక చిన్న ఆవు మరియు నక్క గురించి" 
 2. **పరిచయం*: కథను పరిచయం చేయడానికి ఒక సాధారణ కథాంశాన్ని అందించండి (రెండు జంతువులను కలిగి ఉంటుంది). 
 3. **పాల్గొనేవారు**: "ఒక చిన్న ఆవు మరియు నక్క" 
 4. **సెట్-అప్ ** : "ది లిటిల్ ఆవు అండ్ ది ఫాక్స్" తో ప్రారంభించి, రెండు పాత్రలను పరిచయం చేస్తుంది. 
 5. **సంఘర్షణ: "కానీ అప్పుడు వారు ఆడుకోవడం ఆపలేదు." 
 6. **తీర్మానం*: "అప్పుడు వారు ఆట ఆడటం మానేయడానికి నిరాకరించారు" తో ముగుస్తుంది. 
 
 ### ఉదాహరణ: 
 **ఒక చిన్న ఆవు, ఒక నక్క గురించి.** 
 --- 
 ### దశలవారీ వివరణ: 
 1. **శీర్షికను పరిచయం చేయండి **: "ఒక చిన్న ఆవు మరియు నక్క గురించి" 
 2. **


In [26]:
!wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
!dpkg -i cloudflared-linux-amd64.deb


Selecting previously unselected package cloudflared.
(Reading database ... 126374 files and directories currently installed.)
Preparing to unpack cloudflared-linux-amd64.deb ...
Unpacking cloudflared (2025.8.1) ...
Setting up cloudflared (2025.8.1) ...
Processing triggers for man-db (2.10.2-1) ...


In [27]:
# Start Streamlit app on port 8501 and expose it with Cloudflared
!streamlit run app.py --server.port 8501 & cloudflared tunnel --url http://localhost:8501


[90m2025-09-14T19:19:35Z[0m [32mINF[0m Thank you for trying Cloudflare Tunnel. Doing so, without a Cloudflare account, is a quick way to experiment and try it out. However, be aware that these account-less Tunnels have no uptime guarantee, are subject to the Cloudflare Online Services Terms of Use (https://www.cloudflare.com/website-terms/), and Cloudflare reserves the right to investigate your use of Tunnels for violations of such terms. If you intend to use Tunnels in production you should use a pre-created named tunnel by following: https://developers.cloudflare.com/cloudflare-one/connections/connect-apps
[90m2025-09-14T19:19:35Z[0m [32mINF[0m Requesting new quick Tunnel on trycloudflare.com...

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.
[0m
2025-09-14 19:19:36.448 Port 8501 is already in use
[90m2025-09-14T19:19:38Z[0m [32mINF[0m +--------------------------------------------------------------------------------------------+
[90m2