# Task 2: Fine-Tuning a Language Model to Extract Datetime from Natural Language

### 🎯 Objective:
To fine-tune a lightweight language model that takes fuzzy, human-written date/time queries (like "yesterday around 8:30 pm" or "last Saturday between 2 and 5") and outputs a structured JSON object with exact datetime format.

The output format should be:
```json
{
  "start": "YYYY-MM-DDTHH:MM:SS",
  "end": "YYYY-MM-DDTHH:MM:SS" // or null
}


In [None]:
!pip install -q transformers datasets peft accelerate trl bitsandbytes

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m504.6/504.6 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m107.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m81.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m45.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m12.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [None]:
!pip install bitsandbytes



In [None]:
!pip install -U bitsandbytes



In [None]:
import torch
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.device("cuda" if torch.cuda.is_available() else "cpu"))

CUDA available: True
Device: cuda


In [14]:
from getpass import getpass
from huggingface_hub import login
import os

os.environ["HF_TOKEN"] = getpass("Enter your Hugging Face token:")
login(token=os.environ["HF_TOKEN"])


Enter your Hugging Face token:··········


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


## 🗃️ Step 2: Create a Training Dataset
We define a small set of input-output pairs where:
- The **input** is a fuzzy time query
- The **output** is a structured JSON with `start` and optionally `end` datetime


In [None]:
from datasets import Dataset

# Sample data
data = [
    {"input": "yesterday evening around 8:30", "output": {"start": "2025-08-01T20:30:00", "end": None}},
    {"input": "this morning at 7-10", "output": {"start": "2025-08-02T07:00:00", "end": "2025-08-02T10:00:00"}},
    {"input": "last night 11:30", "output": {"start": "2025-08-01T23:30:00", "end": None}},
    {"input": "show it on 26th April at 10", "output": {"start": "2025-04-26T10:00:00", "end": None}},
    {"input": "yesterday between 3-4 pm", "output": {"start": "2025-08-01T15:00:00", "end": "2025-08-01T16:00:00"}}
]

# Convert to HF Dataset
dataset = Dataset.from_list([
    {"text": f"Query: {x['input']}\nAnswer: {x['output']}"} for x in data
])


## 🤖 Step 3: Load Base Model - TinyLlama
We're using `TinyLlama/TinyLlama-1.1B-Chat-v1.0` from Hugging Face for speed and performance.

We load the tokenizer and the base model in 4-bit precision to keep memory usage low.


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Or "Meta-Llama-3-8B-Instruct" if available to you

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto", trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

## 🪶 Step 4: Apply LoRA for Efficient Fine-Tuning
We freeze the base model and train only ~0.1% of the parameters using LoRA, a method for fast and lightweight adaptation.


In [None]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 1,126,400 || all params: 1,101,174,784 || trainable%: 0.1023


## 🔡 Step 5: Tokenize Dataset
We convert our input strings into token IDs, truncate to 256 tokens max, and pad them for training.


In [None]:
def tokenize(example):
    tokenized_example = tokenizer(example["text"], truncation=True, padding="max_length", max_length=256)
    tokenized_example["labels"] = tokenized_example["input_ids"].copy()  # Add labels
    return tokenized_example

tokenized_dataset = dataset.map(tokenize)

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

## ⚙️ Step 6: Set Up Trainer
We train for 3 epochs with batch size 2 and log every step. Progress bar and logs are enabled.

## 🧠 Step 7: Fine-Tune the Model
We run the training loop using Hugging Face Trainer.

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./llama3-timeparser",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_dir="./logs",
    logging_steps=1,               # ✅ Log every step
    disable_tqdm=False,            # ✅ Show progress bar
    save_total_limit=1,
    fp16=True,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

print("🚀 Starting training...")
trainer.train()
print("✅ Training completed!")


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


🚀 Starting training...


Step,Training Loss
1,11.8267
2,11.5882
3,11.4414
4,11.1652
5,11.8267
6,12.2939
7,11.8267
8,11.1652
9,11.7326


✅ Training completed!


## 💾 Step 8: Save Fine-Tuned Model
We save both the model and tokenizer so we can reload them for inference.

In [None]:
model.save_pretrained("llama3_timeparser_lora")
tokenizer.save_pretrained("llama3_timeparser_lora")

('llama3_timeparser_lora/tokenizer_config.json',
 'llama3_timeparser_lora/special_tokens_map.json',
 'llama3_timeparser_lora/chat_template.jinja',
 'llama3_timeparser_lora/tokenizer.model',
 'llama3_timeparser_lora/added_tokens.json',
 'llama3_timeparser_lora/tokenizer.json')

In [None]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

query = "last week Saturday 10"
output = pipe(f"Query: {query}\nAnswer:", max_new_tokens=50, do_sample=False)
print(output[0]["generated_text"])


Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Query: last week Saturday 10
Answer: 11

Question 3: What is the name of the author of the book "The Great Gatsby"?
Answer: F. Scott Fitzgerald

Question 4: What is the name of the character played by


## 🧪 Step 9: Run Inference on Custom Query
We run the model on a new query and expect JSON output with start/end datetime.


In [None]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = """Query: yesterday evening around 8:30
Answer: {"start": "2025-08-01T20:30:00", "end": null}

Query: this morning at 7-10
Answer: {"start": "2025-08-02T07:00:00", "end": "2025-08-02T10:00:00"}

Query: last night 11:30
Answer: {"start": "2025-08-01T23:30:00", "end": null}

Query: last week Saturday 10
Answer:"""

output = pipe(prompt, max_new_tokens=50, do_sample=False)
print(output[0]["generated_text"])

Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Query: yesterday evening around 8:30
Answer: {"start": "2025-08-01T20:30:00", "end": null}

Query: this morning at 7-10
Answer: {"start": "2025-08-02T07:00:00", "end": "2025-08-02T10:00:00"}

Query: last night 11:30
Answer: {"start": "2025-08-01T23:30:00", "end": null}

Query: last week Saturday 10
Answer: {"start": "2025-07-29T10:00:00", "end": "2025-08-05T10:00:00"}




## ✅ Step 10: Validate the JSON Output
We check if the output is:
- Valid JSON
- Matches ISO datetime format
- Parseable with Python’s `datetime` module


In [None]:
import re
import json

# The model's output
response_text = output[0]["generated_text"]

# Extract last response block (handles trailing junk if any)
if "Query: last week Saturday 10" in response_text:
    response = response_text.split("Query: last week Saturday 10")[-1].strip()

# Define ISO datetime regex pattern
iso_pattern = r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$'

# Try parsing JSON
try:
    json_start = response.find("{")
    json_data = json.loads(response[json_start:])  # from first { onward

    def validate_iso(dt):
        if dt is None:
            return True
        return bool(re.match(iso_pattern, dt))

    is_valid_start = validate_iso(json_data.get("start"))
    is_valid_end = validate_iso(json_data.get("end"))

    if is_valid_start and is_valid_end:
        print("✅ JSON datetime structure is valid:")
        print(json_data)
    else:
        print("❌ Invalid datetime format in JSON.")
        print(json_data)

except Exception as e:
    print("❌ Failed to parse JSON:", e)
    print("Raw output:", response)


✅ JSON datetime structure is valid:
{'start': '2025-07-29T10:00:00', 'end': '2025-08-05T10:00:00'}


In [None]:
from datetime import datetime

def try_parse(dt):
    try:
        if dt is None:
            return True
        datetime.fromisoformat(dt)
        return True
    except:
        return False

if try_parse(json_data.get("start")) and try_parse(json_data.get("end")):
    print("✅ Both datetime strings are valid and real")
else:
    print("❌ One or both datetime values are not parseable")


✅ Both datetime strings are valid and real
