### Link to kaggle notebook: 
https://www.kaggle.com/code/paridhidchoudhary25/ml-assignment-1-paridhi


### Link to trained model "calendar_gmail_classifier" 
https://www.kaggle.com/models/paridhidchoudhary25/calendar-gmail-classifier

### Introduction
This notebook performs binary text classification of user queries into Gmail-related and Calendar-related intents.
The workflow is as follows:
1. Dataset creation: 117 queries with ground-truth labels (0 = gmail, 1 = calendar).
2. Model selection and training: DistilBERT is selected and fine-tuned on the training split.
3. The trained model is saved to the current working directory and also uploaded as a Kaggle input, so the training loop does not have to be rerun. The final trained model is attached to the submission notebook as an input named "calendar_gmail_classifier".
4. The function predict_class returns the predicted class of an individual query.
5. Additional tasks (Brownie tasks):
   (a). extract_time_range: returns the date range mentioned in a calendar-related query, if any.
   (b). extract_people: uses NER to identify names of people mentioned in a query.
   (c). analyze_query: wrapper that predicts the class of a query, extracts the date range if the predicted class is 1 (calendar-related query), and extracts people's names if present.

### Installations

In [47]:
!pip install transformers torch
!pip install dateparser
!pip install spacy
!python -m spacy download en_core_web_sm



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)





Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 83.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


### Imports

In [48]:
from collections import Counter
import random
from sklearn.model_selection import train_test_split

import torch
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification
)

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.optim import AdamW
from sklearn.metrics import classification_report, accuracy_score

import re
import calendar
from datetime import date, timedelta
import dateparser

import spacy



### Dataset Creation

In [49]:
# Label mapping
LABEL_GMAIL = 0
LABEL_CALENDAR = 1


In [50]:
#Dataset building

# dataset = [
#     ("query text here", label),
# ]

dataset = [

    # -------- GMAIL QUERIES --------
    ("Find emails from Sarah about the project", 0),
    ("Show unread messages in my inbox", 0),
    ("Search for emails with PDF attachments", 0),
    ("Find messages with subject 'quarterly report'", 0),
    ("Show all emails from marketing@company.com", 0),
    ("Find emails I starred last week", 0),
    ("Show emails with large attachments", 0),
    ("Find messages labeled 'urgent'", 0),
    ("Show emails received yesterday", 0),
    ("Search for mails from HR team", 0),
    ("Show me unread messages in my inbox", 0),
    ("Find my conversation with the finance team", 0),
    ("Show recent messages", 0),
    ("Emails about travel reimbursement", 0),
    ("Find messages mentioning bonus", 0),
    ("Show inbox emails from last week", 0),
    ("Search emails related to internship", 0),
    ("Find messages sent by my manager", 0),
    ("Show mails I haven’t read yet", 0),
    ("Find all email threads with Sarah", 0),
    ("Show promotions emails", 0),
    # -------- CALENDAR QUERIES --------
    ("When is my next meeting with the design team", 1),
    ("Show events scheduled for next Tuesday", 1),
    ("Find appointments with Dr. Johnson", 1),
    ("When did I schedule the quarterly review", 1),
    ("Show all recurring meetings", 1),
    ("Find events where I am marked optional", 1),
    ("When is the marketing presentation scheduled", 1),
    ("Show all day events in May", 1),
    ("Find meetings I have not responded to yet", 1),
    ("When is the team lunch scheduled", 1),
    ("Show my calendar for June 2025", 1),
    ("What meetings do I have tomorrow", 1),
    ("Do I have anything scheduled today", 1),
    ("Find my meetings next week", 1),
    ("Show events from last month", 1),
    ("When is my interview scheduled", 1),
    ("Check my availability on Friday", 1),
    ("What is planned for Monday", 1),
    ("Show my schedule for today", 1),
    ("Find upcoming events", 1),
]



In [51]:
# -------- EDGE / AMBIGUOUS CASES --------
dataset.extend([
    ("Find my conversation from yesterday", 0),
    ("What did I do last Monday", 1),
    ("Show updates from yesterday", 0),
    ("Sarah meeting details", 1),
    ("Find notes from last week", 0),
    ("What is happening tomorrow", 1),
    ("Check messages from yesterday", 0),
    ("Any plans for this weekend", 1),
])


In [52]:
dataset.extend([
    ("Find emails sent to me this morning", 0),
    ("Show messages containing invoice", 0),
    ("Search emails about flight booking", 0),
    ("Find mails where I was cc'd", 0),
    ("Show email threads from the placement cell", 0),
    ("Find messages with attachments from last month", 0),
    ("Emails related to hostel accommodation", 0),
    ("Show mails exchanged with Juspay HR", 0),
    ("Find emails mentioning interview feedback", 0),
    ("Search inbox for emails from registrar", 0),
    ("Find messages discussing project deadline", 0),
    ("Show emails that mention budget approval", 0),
    ("Find unread mails from professors", 0),
    ("Emails I received during the weekend", 0),
    ("Show communication from admin office", 0),
    ("Find mail conversations about scholarship", 0),
    ("Search emails containing meeting notes", 0),
    ("Find emails sent by placement coordinator", 0),
    ("Show mail threads with multiple recipients", 0),
    ("Find emails flagged as important", 0),
])


In [53]:
dataset.extend([
    ("When is my next interview round", 1),
    ("Show schedule for the upcoming week", 1),
    ("Find events planned for this month", 1),
    ("What meetings are lined up today", 1),
    ("Show my calendar availability next week", 1),
    ("Find all scheduled reviews", 1),
    ("What is on my agenda tomorrow", 1),
    ("Show events I am invited to", 1),
    ("Find overlapping meetings on Friday", 1),
    ("What appointments do I have this afternoon", 1),
    ("Show schedule for exam week", 1),
    ("Find my calendar events with external guests", 1),
    ("What is planned after lunch today", 1),
    ("Show meetings scheduled in the evening", 1),
    ("Find events I declined", 1),
    ("When is the orientation session", 1),
    ("Check my schedule for next Monday", 1),
    ("Show timeline of events today", 1),
    ("Find planned sessions for tomorrow", 1),
    ("What is booked on my calendar this weekend", 1),
])


In [54]:
dataset.extend([
    ("Show discussions from last week", 0),
    ("What was planned last Friday", 1),
    ("Find updates shared yesterday", 0),
    ("Any engagements tomorrow", 1),
    ("Check my history from Monday", 0),
    ("What commitments do I have today", 1),
    ("Find information shared by Sarah", 0),
    ("Do I have anything lined up later", 1),
])


In [55]:
dataset.extend([
    ("Find emails discussing salary breakup", 0),
    ("Show mails related to offer letter", 0),
    ("Search emails containing NDA document", 0),
    ("Find messages exchanged with legal team", 0),
    ("Show emails mentioning background verification", 0),
    ("Find mails received after office hours", 0),
    ("Emails about relocation assistance", 0),
    ("Show communication related to joining formalities", 0),
    ("Find inbox messages from unknown senders", 0),
    ("Search emails that mention onboarding", 0),
])


In [56]:
dataset.extend([
    ("When is my onboarding session scheduled", 1),
    ("Show events planned after joining date", 1),
    ("Find meetings scheduled before noon", 1),
    ("What is my schedule during induction week", 1),
    ("Show calendar events for the first week of July", 1),
    ("Find sessions scheduled back to back", 1),
    ("What appointments are set for tomorrow morning", 1),
    ("Show events I am hosting", 1),
    ("Find meetings scheduled outside working hours", 1),
    ("What is planned on my joining day", 1),
])


In [57]:
labels = [label for _, label in dataset]
Counter(labels)


Counter({0: 59, 1: 58})

In [58]:
#Shuffling 
random.seed(42)
random.shuffle(dataset)


In [59]:
#Train test val split
texts = [q for q, _ in dataset]
labels = [l for _, l in dataset]

X_train, X_temp, y_train, y_temp = train_test_split(
    texts, labels, test_size=0.30, random_state=42, stratify=labels
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))


81 18 18
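
The 81/18/18 split follows from applying the two `test_size` fractions to the 117 labeled queries. Below is a small arithmetic sketch, assuming scikit-learn rounds the held-out share up to a whole sample and leaves the remainder on the other side. The helper name `split_sizes` is illustrative and not part of the notebook.

```python
import math
from typing import Tuple

def split_sizes(n_samples: int, test_fraction: float) -> Tuple[int, int]:
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(test_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

n_train, n_temp = split_sizes(117, 0.30)   # first split: train vs. temp
n_val, n_test = split_sizes(n_temp, 0.50)  # second split: val vs. test
print(n_train, n_val, n_test)
```

With only 18 queries in each of the validation and test sets, a single misclassified query moves accuracy by about 5.6 percentage points, so the reported scores should be read as rough estimates.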


### Model Development
DistilBERT (`distilbert-base-uncased`) is fine-tuned with a two-class sequence classification head.

In [60]:
tokenizer = DistilBertTokenizerFast.from_pretrained(
    "distilbert-base-uncased"
)


In [61]:
def tokenize_texts(texts):
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=64,
        return_tensors="pt"
    )


In [62]:
#Model Initialization
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [63]:
#AdamW optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)


In [64]:
#Training Loop
model.train()

train_inputs = tokenize_texts(X_train)
train_labels = torch.tensor(y_train)

for epoch in range(5):
    optimizer.zero_grad()

    outputs = model(
        input_ids=train_inputs["input_ids"],
        attention_mask=train_inputs["attention_mask"],
        labels=train_labels
    )

    loss = outputs.loss
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch+1} | Training Loss: {loss.item():.4f}")


Epoch 1 | Training Loss: 0.6921
Epoch 2 | Training Loss: 0.6615
Epoch 3 | Training Loss: 0.6109
Epoch 4 | Training Loss: 0.5217
Epoch 5 | Training Loss: 0.4556
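
The loop above passes all 81 training queries through the model as a single batch, so each epoch performs exactly one optimizer step (five updates in total). If more updates per epoch were wanted, one option is to iterate over mini-batches. The helper below is a hypothetical sketch in plain Python: `iter_batches` and the batch size of 16 are assumptions, not part of the notebook. Each yielded pair would be tokenized and passed through the same `zero_grad` / `backward` / `step` sequence.

```python
import random
from typing import Iterator, List, Sequence, Tuple

def iter_batches(
    texts: Sequence[str],
    labels: Sequence[int],
    batch_size: int = 16,
    seed: int = 42,
) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield shuffled (texts, labels) mini-batches covering every example once."""
    order = list(range(len(texts)))
    random.Random(seed).shuffle(order)  # local RNG; global random state is untouched
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield [texts[i] for i in idx], [labels[i] for i in idx]

# Stand-in data with the same size as the training split (81 queries).
demo_texts = [f"query {i}" for i in range(81)]
demo_labels = [i % 2 for i in range(81)]

batches = list(iter_batches(demo_texts, demo_labels))
print(len(batches), [len(t) for t, _ in batches])
```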


In [65]:
#Validation check
model.eval()

with torch.no_grad():
    val_inputs = tokenize_texts(X_val)
    val_outputs = model(
        input_ids=val_inputs["input_ids"],
        attention_mask=val_inputs["attention_mask"]
    )
    val_preds = torch.argmax(val_outputs.logits, dim=1)

accuracy = (val_preds == torch.tensor(y_val)).float().mean()
print("Validation Accuracy:", accuracy.item())


Validation Accuracy: 0.8888888955116272


In [66]:
# Save the fine-tuned model and tokenizer (optional) so the training loop does not have to be rerun
MODEL_DIR = "/kaggle/working/calendar_gmail_classifier"

model.save_pretrained(MODEL_DIR)
tokenizer.save_pretrained(MODEL_DIR)



('/kaggle/working/calendar_gmail_classifier/tokenizer_config.json',
 '/kaggle/working/calendar_gmail_classifier/special_tokens_map.json',
 '/kaggle/working/calendar_gmail_classifier/vocab.txt',
 '/kaggle/working/calendar_gmail_classifier/added_tokens.json',
 '/kaggle/working/calendar_gmail_classifier/tokenizer.json')

### Link to pretrained model
https://www.kaggle.com/models/paridhidchoudhary25/calendar-gmail-classifier

The trained model is loaded from a Kaggle input at the path below. If the input is not visible, add the model linked above as an input and update the path accordingly.

In [67]:
# Load the trained model from the Kaggle input instead of rerunning the training loop
# pretrained model path
MODEL_DIR = "/kaggle/input/calendar-gmail-classifier/pytorch/default/1"

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)


### Evaluation Metrics

The classifier is evaluated using accuracy, precision, recall, and F1-score. 
While accuracy gives a general performance estimate, precision and recall are 
particularly important for ambiguous queries (e.g., "find my conversation from yesterday"), 
where misclassification can lead to incorrect downstream actions. 
F1-score is used as the primary metric to balance precision and recall.


In [68]:
#Test set evaluation
model.eval()

with torch.no_grad():
    test_inputs = tokenize_texts(X_test)
    test_outputs = model(
        input_ids=test_inputs["input_ids"],
        attention_mask=test_inputs["attention_mask"]
    )
    test_preds = torch.argmax(test_outputs.logits, dim=1)

print("Test Accuracy:", accuracy_score(y_test, test_preds))
print(classification_report(y_test, test_preds))


Test Accuracy: 0.9444444444444444
              precision    recall  f1-score   support

           0       1.00      0.89      0.94         9
           1       0.90      1.00      0.95         9

    accuracy                           0.94        18
   macro avg       0.95      0.94      0.94        18
weighted avg       0.95      0.94      0.94        18
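
The per-class figures in the report can be reproduced by hand. The counts below are inferred from the report itself (9 queries per class, with one Gmail query predicted as Calendar) rather than read from the notebook's predictions, and `prf` is an illustrative helper.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Counts consistent with the report: 8 of 9 Gmail queries correct,
# all 9 Calendar queries correct, 1 Gmail query predicted as Calendar.
gmail = prf(tp=8, fp=0, fn=1)
calendar_cls = prf(tp=9, fp=1, fn=0)
accuracy = (8 + 9) / 18

for name, scores in [("gmail (0)", gmail), ("calendar (1)", calendar_cls)]:
    print(name, " ".join(f"{s:.2f}" for s in scores))
print(f"accuracy {accuracy:.2f}")
```

The single error lowers Gmail recall (8/9) and Calendar precision (9/10) at the same time, which is why both classes end up with an F1 near 0.94.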



In [69]:
def predict_class(test_query: str) -> int:
    """
    Predicts the class of a user query.
    Returns:
        0 -> Gmail-related query
        1 -> Calendar-related query
    """
    model.eval()

    inputs = tokenizer(
        test_query,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=64
    )

    with torch.no_grad():
        outputs = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"]
        )

    return torch.argmax(outputs.logits, dim=1).item()


### Evaluation on a few examples

In [70]:
print(predict_class("Find emails from HR about joining"))      # Expected: 0
print(predict_class("Show my meetings for next week"))         # Expected: 1


0
1


In [71]:
print(predict_class("Find my meetings before noon"))                   # Expected: 1
print(predict_class("When is my next meeting with the design team?"))  # Expected: 1
print(predict_class("agenda for friday"))                              # Ambiguous



1
1
1


The ambiguous query "agenda for friday" is classified as Calendar (1). The model has no explicit keyword rules, so this prediction comes from contextual patterns learned during fine-tuning.
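
Because `predict_class` returns only the argmax, it does not show how close an ambiguous query was to the other class. One way to expose that is to convert the two logits into probabilities with a softmax. The sketch below uses plain Python and made-up logit values, which are not outputs of the trained model. With the real model, the same step would be applied to `outputs.logits`.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for [gmail, calendar]; illustrative values only.
examples = {
    "clear-cut query": [-2.0, 2.0],
    "borderline query": [0.1, 0.3],
}

for name, logits in examples.items():
    probs = softmax(logits)
    label = probs.index(max(probs))
    print(f"{name}: class={label}, confidence={max(probs):.2f}")
```

A confidence close to 0.5 would flag a query whose predicted class should be treated with caution.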

### Time Range Extraction

In [72]:
def extract_time_range(query: str):
    """
    Extracts time range from queries like:
    - June 2025
    - last week / next week / this week
    - yesterday / today / tomorrow
    """
    
    q = query.lower()
    today = date.today()

    # Handle relative week expressions
    if "last week" in q:
        start = today - timedelta(days=today.weekday() + 7)
        end = start + timedelta(days=6)
        return {"from": start.isoformat(), "to": end.isoformat()}

    if "this week" in q:
        start = today - timedelta(days=today.weekday())
        end = start + timedelta(days=6)
        return {"from": start.isoformat(), "to": end.isoformat()}

    if "next week" in q:
        start = today + timedelta(days=(7 - today.weekday()))
        end = start + timedelta(days=6)
        return {"from": start.isoformat(), "to": end.isoformat()}

    # Handle relative day expressions
    if "yesterday" in q:
        d = today - timedelta(days=1)
        return {"from": d.isoformat(), "to": d.isoformat()}

    if "today" in q:
        return {"from": today.isoformat(), "to": today.isoformat()}

    if "tomorrow" in q:
        d = today + timedelta(days=1)
        return {"from": d.isoformat(), "to": d.isoformat()}

    # Handle explicit Month Year (June 2025)
    month_year_pattern = r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+(\d{4})"
    match = re.search(month_year_pattern, query, re.IGNORECASE)

    if match:
        month_name, year = match.groups()
        month = list(calendar.month_name).index(month_name.capitalize())
        year = int(year)

        start = date(year, month, 1)
        end = date(year, month, calendar.monthrange(year, month)[1])

        return {
            "from": start.isoformat(),
            "to": end.isoformat()
        }

    # Fallback: parse the whole query as a single date (may return None for full sentences)
    parsed_date = dateparser.parse(query)

    if parsed_date:
        d = parsed_date.date()
        return {"from": d.isoformat(), "to": d.isoformat()}

    return None


In [73]:
print(extract_time_range("Find my meetings for June 2025"))
print(extract_time_range("Find my meeting notes from last week"))
print(extract_time_range("Show meetings next week"))
print(extract_time_range("What meetings do I have tomorrow"))
print(extract_time_range("Show events today"))



{'from': '2025-06-01', 'to': '2025-06-30'}
{'from': '2025-12-08', 'to': '2025-12-14'}
{'from': '2025-12-22', 'to': '2025-12-28'}
{'from': '2025-12-19', 'to': '2025-12-19'}
{'from': '2025-12-18', 'to': '2025-12-18'}


In [74]:
print(extract_time_range("Find my meeting notes from yesterday"))

{'from': '2025-12-17', 'to': '2025-12-17'}
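
The week arithmetic in `extract_time_range` depends on `date.today()`, so the printed ranges change from day to day. The sketch below isolates the same Monday-to-Sunday calculation with an explicit reference date so it can be checked by hand. `week_bounds` is a hypothetical helper, and the reference date 2025-12-18 is inferred from the outputs above.

```python
from datetime import date, timedelta

def week_bounds(ref: date, offset_weeks: int = 0):
    """Return (Monday, Sunday) of the week `offset_weeks` away from `ref`."""
    # date.weekday() is 0 for Monday, so subtracting it lands on this week's Monday.
    monday = ref - timedelta(days=ref.weekday()) + timedelta(weeks=offset_weeks)
    return monday, monday + timedelta(days=6)

ref = date(2025, 12, 18)  # a Thursday
for label, offset in [("last week", -1), ("this week", 0), ("next week", 1)]:
    start, end = week_bounds(ref, offset)
    print(label, start.isoformat(), end.isoformat())
```

Passing the reference date as a parameter also makes this logic straightforward to unit-test.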


### People/Entity Extraction

In [75]:
nlp = spacy.load("en_core_web_sm")


In [76]:

# Simple rule-based extraction (kept below for reference) fails on ambiguous queries such as "Sarah meeting details"; spaCy-based NER handles these better
# def extract_people(query: str):
#     """
#     Extracts person names using simple rule-based patterns.
#     """
#     pattern = r"(from|with|to|by|about)\s+([A-Z][a-z]+)"
#     matches = re.findall(pattern, query)
#     return [name for _, name in matches]

def extract_people(query: str):
    people = set()

    # spaCy NER (primary)
    doc = nlp(query)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            people.add(ent.text)

    # Fallback: "by <Name>" / "from <Name>" pattern for capitalized names spaCy misses
    fallback_pattern = r"\b(?:by|from)\s+([A-Z][a-z]+)\b"
    matches = re.findall(fallback_pattern, query)
    for name in matches:
        people.add(name)

    # Sort for a deterministic order (set iteration order is arbitrary)
    return sorted(people)


In [77]:
extract_people("Find emails from Sarah about the project delivered by Rohan")


['Rohan', 'Sarah']
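
The regex fallback only fires for a capitalized word that directly follows `by` or `from`, which matters for lowercase input such as "discussion invite from sarah". Below is a small check of the pattern on its own, without spaCy (`regex_people` is an illustrative name, not part of the notebook):

```python
import re

FALLBACK_PATTERN = r"\b(?:by|from)\s+([A-Z][a-z]+)\b"

def regex_people(query: str):
    """Names captured by the fallback pattern alone, in order of appearance."""
    return re.findall(FALLBACK_PATTERN, query)

print(regex_people("Find emails from Sarah about the project delivered by Rohan"))
print(regex_people("messages shared by Amit"))
print(regex_people("discussion invite from sarah"))  # lowercase name: no match
print(regex_people("Sarah meeting details"))         # no by/from cue: no match
```

In the last two cases the result depends entirely on whether spaCy's NER tags the name as `PERSON`.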

### Final Evaluation
`analyze_query` combines the steps above: for each query it predicts the class, extracts people's names, and extracts the time range only when the predicted class is 1.

In [78]:
def analyze_query(query: str):
    result = {
        "predicted_class": predict_class(query),
        "time_range": None,
        "people": extract_people(query)
    }

    if result["predicted_class"] == 1:
        result["time_range"] = extract_time_range(query)

    return result


In [79]:
ambiguous_queries = [
    "find my conversation from yesterday",
    "show updates from last week",
    "what was discussed on monday",
    "notes from the team meeting",
    "messages about the review",
    "discussion scheduled yesterday",
]
calendar_ambiguous = [
    "what is planned for tomorrow",
    "review scheduled next week",
    "agenda for friday",
    "follow-up discussion today",
]
gmail_ambiguous = [
    "updates shared yesterday",
    "conversation about deployment",
    "notes sent last week",
    "threads related to onboarding",
]
entity_ambiguous = [
    "Rohan discussion yesterday",
    "Sarah update from last week",
    "meeting details from Ananya",
    "messages shared by Amit",
]
mixed_signal = [
    "emails from the meeting yesterday",
    "calendar update sent last week",
    "discussion invite from sarah",
]


In [80]:
test_queries = (
    ambiguous_queries +
    calendar_ambiguous +
    gmail_ambiguous +
    entity_ambiguous +
    mixed_signal
)

for q in test_queries:
    print(q)
    print(analyze_query(q))
    print("-" * 40)


find my conversation from yesterday
{'predicted_class': 0, 'time_range': None, 'people': []}
----------------------------------------
show updates from last week
{'predicted_class': 0, 'time_range': None, 'people': []}
----------------------------------------
what was discussed on monday
{'predicted_class': 1, 'time_range': None, 'people': []}
----------------------------------------
notes from the team meeting
{'predicted_class': 1, 'time_range': None, 'people': []}
----------------------------------------
messages about the review
{'predicted_class': 0, 'time_range': None, 'people': []}
----------------------------------------
discussion scheduled yesterday
{'predicted_class': 1, 'time_range': {'from': '2025-12-17', 'to': '2025-12-17'}, 'people': []}
----------------------------------------
what is planned for tomorrow
{'predicted_class': 1, 'time_range': {'from': '2025-12-19', 'to': '2025-12-19'}, 'people': []}
----------------------------------------
review scheduled next week
{'pr

### Analysis of Ambiguous Queries

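The examples above are intentionally ambiguous. Most are not in the labeled dataset, although 
"find my conversation from yesterday" also appears there (with different capitalization), so it is not a truly unseen example. 
The model has no explicit keyword rules; its predictions reflect associations learned during fine-tuning. 
For example, queries containing "conversation" or "messages" are classified as Gmail-related, 
while terms like "scheduled", "agenda", and "planned" lean towards Calendar intent. 
Time extraction is performed only for queries predicted as Calendar, so a Gmail-classified query such as 
"show updates from last week" gets a `time_range` of None even though it mentions a date.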
