# **Fine-Tuning the Turnsense Model (latishab/turnsense)**
In this notebook we load the published turnsense model from Hugging Face (based on SmolLM2-135M and already fine-tuned for turn detection) and fine-tune it further on an additional dataset.

**Steps:**
1. Load the tokenizer and dataset.
2. Preprocess and format the dataset.
3. Load the base model with a classification head.
4. Load the turnsense adapter on top of the base model.
5. Fine-tune using the Hugging Face `Trainer`.
6. Save the newly fine-tuned adapter.
7. Reload the adapter and evaluate it on the held-out test set.


In [1]:
!pip install --upgrade transformers datasets peft trl bitsandbytes scikit-learn

Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting trl
  Downloading trl-0.25.0-py3-none-any.whl.metadata (11 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Downloading datasets-4.4.1-py3-none-any.whl (511 kB)
Downloading trl-0.25.0-py3-none-any.whl (462 kB)
Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl (59.4 MB)

In [2]:
from transformers import AutoTokenizer
from datasets import Dataset, load_dataset
from sklearn.model_selection import train_test_split
import pandas as pd

## 1. Load Tokenizer and Dataset
tokenizer = AutoTokenizer.from_pretrained("latishab/turnsense")
dataset = load_dataset("latishab/turns-2k")["train"]

# Convert to pandas DataFrame for stratified splitting
df = dataset.to_pandas()

# Stratified split (80% train, 20% test), preserving the label ratio in both parts
train_df, test_df = train_test_split(
    df,
    test_size=0.20,
    stratify=df["label"],
    random_state=42
)

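# Convert back to Hugging Face Dataset format
# (preserve_index=False avoids carrying the pandas index along as an extra column)
ds = {
    "train": Dataset.from_pandas(train_df, preserve_index=False),
    "test": Dataset.from_pandas(test_df, preserve_index=False)
}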
print(f"Dataset split: {len(ds['train'])} training samples, {len(ds['test'])} test samples")

## 2. Format Each Example
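def format_example(example):
    # Prefix each utterance with the <|user|> tag; inference must use the same format.
    text = f"<|user|> {example['content'].strip()}"

    # Pad or truncate to a fixed length so every example has the same shape.
    inputs = tokenizer(text, padding="max_length", truncation=True, max_length=256)
    inputs["labels"] = example["label"]
    return inputs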

ds["train"] = ds["train"].map(format_example)
ds["test"] = ds["test"].map(format_example)
print("Dataset formatted.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/50.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/655 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/75.1k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Dataset split: 1600 training samples, 400 test samples


Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Dataset formatted.
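
The stratified split above keeps the label ratio the same in the training and test parts. As a minimal sketch of what stratification does, the pure-Python helper below splits each label group separately. Note that `stratified_split` and `toy_rows` are hypothetical names for illustration only; the notebook itself relies on scikit-learn's `train_test_split` with `stratify=`, whose internals differ.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key="label", test_size=0.20, seed=42):
    """Split rows so that each label keeps roughly the same share in both parts."""
    # Group the rows by label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Take the same fraction of every label group for the test set.
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical toy data: 40 rows with label 0 and 60 rows with label 1.
toy_rows = [{"content": f"utterance {i}", "label": 0 if i < 40 else 1} for i in range(100)]
toy_train, toy_test = stratified_split(toy_rows)

print(len(toy_train), len(toy_test))
print(sum(r["label"] for r in toy_train) / len(toy_train))
print(sum(r["label"] for r in toy_test) / len(toy_test))
```

Both parts end up with the same share of label 1 as the full toy dataset, which is the property the `stratify=df["label"]` argument is there to provide.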


In [3]:
from transformers import AutoModelForSequenceClassification, AutoConfig
from peft import PeftModel

## 3. Load the base model WITH classification head
base_model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
config = AutoConfig.from_pretrained(base_model_name)

# Binary classification, so the head needs num_labels=2
config.num_labels = 2
base_model = AutoModelForSequenceClassification.from_pretrained(base_model_name, config=config)

## 4. Load the Turnsense adapter on top of the base model
adapter_model_name = "latishab/turnsense"
model = PeftModel.from_pretrained(base_model, adapter_model_name)

print("Successfully loaded Turnsense with its adapter for classification!")

config.json:   0%|          | 0.00/861 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/269M [00:00<?, ?B/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at HuggingFaceTB/SmolLM2-135M-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


adapter_config.json:   0%|          | 0.00/839 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/156M [00:00<?, ?B/s]

Successfully loaded Turnsense with its adapter for classification!


In [13]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding

## 5. Setup TrainingArguments
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    num_train_epochs=3,
    warmup_ratio=0.1,
    weight_decay=0.0,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=5,
    metric_for_best_model="loss",
    load_best_model_at_end=True,
    fp16=True,
    lr_scheduler_type="constant_with_warmup",
    optim="adamw_8bit",
    seed=3407,
    report_to="none"  # Disables wandb
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
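
The arguments above determine how many optimizer steps the run takes and when evaluation happens. The sketch below works through that arithmetic for the values in this notebook (1600 training samples, batch size 16, 3 epochs, evaluation every 50 steps), assuming a single device and no gradient accumulation. `training_schedule` and `lr_at` are hypothetical helpers, not part of `transformers`, and the exact rounding `Trainer` applies to the warmup step count may differ by one.

```python
import math

def training_schedule(num_samples, batch_size, epochs, warmup_ratio, eval_steps):
    """Approximate step counts for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    # Assumption: simple rounding; the library's own rounding may differ slightly.
    warmup_steps = round(total_steps * warmup_ratio)
    eval_points = list(range(eval_steps, total_steps + 1, eval_steps))
    return steps_per_epoch, total_steps, warmup_steps, eval_points

def lr_at(step, base_lr, warmup_steps):
    """Linear warmup to base_lr, then constant: the shape of constant_with_warmup."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr

steps_per_epoch, total_steps, warmup_steps, eval_points = training_schedule(
    num_samples=1600, batch_size=16, epochs=3, warmup_ratio=0.1, eval_steps=50
)
print(steps_per_epoch, total_steps, warmup_steps)
print(eval_points)
print(lr_at(15, 1e-5, warmup_steps), lr_at(200, 1e-5, warmup_steps))
```

Under these assumptions the run has 300 steps in total, which lines up with the evaluation steps 50 through 300 reported in the training log.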

In [15]:
# Weights & Biases logging is disabled via report_to="none" in TrainingArguments,
# so no wandb setup is needed here.

## 6. Train the Model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    data_collator=data_collator
)
trainer.train()

## 7. Save the fine-tuned adapter
adapter_save_path = "turnsense_finetuned_adapter"
model.save_pretrained(adapter_save_path)
print(f"Adapter saved to: {adapter_save_path}")

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss,Validation Loss
50,0.1004,0.11273
100,0.109,0.11086
150,0.0889,0.109088
200,0.1112,0.107627
250,0.1017,0.106516
300,0.0907,0.1051


Adapter saved to: turnsense_finetuned_adapter
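
With `metric_for_best_model="loss"` and `load_best_model_at_end=True`, the checkpoint with the lowest validation loss is the one restored at the end of training. As a rough illustration of that selection rule, the snippet below applies it to the validation losses logged above. `best_checkpoint` is a hypothetical helper, not the logic `Trainer` runs internally.

```python
# (step, validation loss) pairs copied from the training log above.
eval_log = [
    (50, 0.11273),
    (100, 0.11086),
    (150, 0.109088),
    (200, 0.107627),
    (250, 0.106516),
    (300, 0.1051),
]

def best_checkpoint(log, greater_is_better=False):
    """Return the (step, metric) pair with the best metric value."""
    pick = max if greater_is_better else min
    return pick(log, key=lambda row: row[1])

best_step, best_loss = best_checkpoint(eval_log)
first_loss = eval_log[0][1]
relative_drop = (first_loss - best_loss) / first_loss

print(best_step, best_loss)
print(f"relative drop in validation loss: {relative_drop:.1%}")
```

The validation loss decreases at every evaluation, so the last checkpoint is also the best one here. The improvement is small, which is plausible given that the starting adapter was already fine-tuned for this task.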


In [20]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

## 8. Load tokenizer, base model, and fine-tuned adapter
# Use the same tokenizer as in training so inputs are encoded identically.
tokenizer = AutoTokenizer.from_pretrained("latishab/turnsense")
base_model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
base_model = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=2)
adapter_model_name = "turnsense_finetuned_adapter"
model = PeftModel.from_pretrained(base_model, adapter_model_name)
model.eval()  # disable dropout for inference

def predict_single_text(text, threshold=0.5):
    """
    Predict the label for a single piece of text based on the fine-tuned model.
    """
    # Apply the same format used during training
    formatted_text = f"<|user|> {text.strip()}"

    # Tokenize the formatted input, padding or truncating to 256 tokens
    inputs = tokenizer(formatted_text, padding="max_length", truncation=True, max_length=256, return_tensors="pt")

    # Get model prediction
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # Apply softmax to get probabilities
    probs = torch.nn.functional.softmax(logits, dim=1)

    # Use the threshold for classification
    pred = 1 if probs[0][1].item() >= threshold else 0
    confidence = probs[0][pred].item()

    return pred, confidence

## 9. Test the fine-tuned model
while True:
    text_input = input("Enter a text to classify (or type 'exit' to stop): ")
    if text_input.lower() == 'exit':
        break
    try:
        prediction, confidence = predict_single_text(text_input)
        print(f"Prediction: {prediction} (Confidence: {confidence:.4f})")
    except Exception as e:
        print(f"Error during prediction: {e}")

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at HuggingFaceTB/SmolLM2-135M-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Enter a text to classify (or type 'exit' to stop): im not sure but...
Prediction: 0 (Confidence: 0.9244)
Enter a text to classify (or type 'exit' to stop): hello there!
Prediction: 1 (Confidence: 0.9648)
Enter a text to classify (or type 'exit' to stop): haha!
Prediction: 1 (Confidence: 0.8815)
Enter a text to classify (or type 'exit' to stop): wait.
Prediction: 1 (Confidence: 0.6416)
Enter a text to classify (or type 'exit' to stop): so i was thinking of
Prediction: 0 (Confidence: 0.9877)


KeyboardInterrupt: Interrupted by user
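
The decision rule inside `predict_single_text` is a softmax over the two logits followed by a threshold on the probability of label 1. The sketch below reproduces that rule in plain Python so the effect of the `threshold` argument is easy to see. The logits are made up for illustration and `decide` is a hypothetical stand-in for the model call.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decide(logits, threshold=0.5):
    """Predict label 1 when its probability reaches the threshold, else label 0."""
    probs = softmax(logits)
    pred = 1 if probs[1] >= threshold else 0
    # As in the notebook, the confidence is the probability of the predicted label.
    return pred, probs[pred]

example_logits = [0.0, 1.0]  # hypothetical logits for labels 0 and 1

print(decide(example_logits))        # default threshold of 0.5
print(decide(example_logits, 0.8))   # stricter threshold for label 1
```

One consequence of this design: with a threshold above 0.5, the model can return label 0 with a reported confidence below 0.5, because the confidence is the probability of the chosen label rather than of the most likely one.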

In [19]:
from sklearn.metrics import classification_report, confusion_matrix
from tqdm.auto import tqdm

def run_evaluation(test_dataframe):
    """
    Runs the model over the entire test dataframe and prints a
    classification report and confusion matrix.
    """
    print("Evaluating model on the test dataset...")

    # Get the true labels from the dataframe
    true_labels = test_dataframe['label'].tolist()

    all_predictions = []

    # Loop over the 'content' column with a progress bar
    for text in tqdm(test_dataframe['content'], desc="Predicting on test set"):
        prediction, _ = predict_single_text(text)
        all_predictions.append(prediction)

    print("\n--- Evaluation Results ---")

    print("\nConfusion Matrix:")
    print(confusion_matrix(true_labels, all_predictions))

    print("\nClassification Report:")
    print(classification_report(true_labels, all_predictions, target_names=["Label 0", "Label 1"]))

run_evaluation(test_df)

Evaluating model on the test dataset...


Predicting on test set:   0%|          | 0/400 [00:00<?, ?it/s]


--- Evaluation Results ---

Confusion Matrix:
[[169   8]
 [  5 218]]

Classification Report:
              precision    recall  f1-score   support

     Label 0       0.97      0.95      0.96       177
     Label 1       0.96      0.98      0.97       223

    accuracy                           0.97       400
   macro avg       0.97      0.97      0.97       400
weighted avg       0.97      0.97      0.97       400
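
The figures in the classification report follow directly from the confusion matrix, where rows are true labels and columns are predicted labels. As a quick cross-check, the sketch below recomputes precision, recall, F1, and accuracy from the four counts printed above. `per_class_metrics` is a hypothetical helper written for this illustration; the notebook itself uses scikit-learn's `classification_report`.

```python
# Confusion matrix from the evaluation output: rows = true label, columns = predicted label.
cm = [[169, 8],
      [5, 218]]

def per_class_metrics(cm):
    """Return {label: (precision, recall, f1, support)} and the overall accuracy."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    report = {}
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        actual = sum(cm[k])                          # row sum (support)
        precision = tp / predicted
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        report[k] = (precision, recall, f1, actual)
    return report, correct / total

report, accuracy = per_class_metrics(cm)
for label, (precision, recall, f1, support) in report.items():
    print(f"Label {label}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} support={support}")
print(f"accuracy={accuracy:.4f}")
```

The recomputed values round to the same two-decimal figures as the report: 13 of the 400 test examples are misclassified, 8 of them true label 0 predicted as label 1 and 5 the other way around.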

