# **TRAIN CLASSIFIER — DistilBERT Accident Model**

In [1]:
import pandas as pd
import torch
from datasets import Dataset
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import TrainingArguments, Trainer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
import joblib
import os

This cell sets up the tools needed before training a machine-learning model. It imports pandas to load and work with the dataset, and torch because the model we'll train runs on PyTorch. The Dataset class from the datasets library converts the data into a format the training pipeline can consume. The cell also brings in a tokenizer and the DistilBERT model class, which will later learn to classify text, in other words to assign each tweet to a category based on its wording. Next come the tools that manage training: TrainingArguments, which holds settings such as batch size, learning rate, and the number of passes over the data, and Trainer, which runs the training loop using those settings.

Because the labels are in text form (like “accident,” “roadblock,” “flood”), the code imports LabelEncoder to turn them into numbers. It also imports accuracy_score and f1_score so you can measure how well the model performs. NumPy is included for handling arrays, while joblib saves and loads Python objects such as the fitted label encoder. Finally, os lets the code interact with the file system, for example to create folders. Altogether, these imports gather the components needed to prepare, train, evaluate, and save a text classification model.

In [2]:
print("Loading dataset...")
df = pd.read_csv("data_mmda_traffic_spatial.csv")

Loading dataset...


This cell loads the data you'll be working with. The print("Loading dataset...") line is a status message that tells you the program is starting to read the dataset. Then pd.read_csv("data_mmda_traffic_spatial.csv") opens the CSV file containing the traffic-related data and converts it into a pandas DataFrame, which makes the data easier to explore, filter, and feed into a model.

In [3]:
# Keep only Tweet + Type
df = df[['Tweet', 'Type']].dropna()

This line prepares the dataset so it is clean and focused on the information the model actually needs. Writing df = df[['Tweet', 'Type']] tells pandas to keep only two columns: the text of the tweet and the label that describes what kind of event the tweet refers to, such as an accident, roadwork, or flooding. Narrowing the dataset down to these two columns drops everything the classifier will not use.

After that, .dropna() is applied, which removes any rows where either the tweet text or the label is missing. This is important because machine-learning models can’t learn from incomplete data; a missing tweet or missing label would only cause errors later on. So this step acts like a quick cleanup—making sure every row in your dataset is complete, meaningful, and ready to be processed. In essence, this single line ensures that your data is both relevant and reliable before it gets passed on to the next stages of your pipeline.

In [4]:
# Encode labels
label_encoder = LabelEncoder()
df["label"] = label_encoder.fit_transform(df["Type"])

This section of the code is responsible for transforming the text labels into a format that a machine-learning model can actually understand. In a dataset, the column "Type" likely contains category names written as words—for example, “Accident,” “Traffic Jam,” “Road Closure,” or whatever labels you’re using to classify the tweets. The problem is that machine-learning models don’t understand words as labels; they need numbers. They can learn from patterns in the text of the tweets, but when it comes to the output label, they require clean numerical representations.

That’s where the LabelEncoder() comes in. When you create label_encoder = LabelEncoder(), you’re initializing a tool that automatically converts each unique label in the “Type” column into its own number. For example, “Accident” might become 0, “Flooding” might become 1, “Roadwork” might become 2, and so on.

The line df["label"] = label_encoder.fit_transform(df["Type"]) does two things at once:

fit — It looks at all the unique values in the “Type” column and learns how many categories there are and assigns each one a numerical value.

transform — It replaces every word-based label in the dataset with its assigned number, adding the results into a new column called "label".

By the end of this process, the dataset contains numerical labels that the model can work with, and the fitted encoder still remembers which number stands for which category. This encoding step forms the bridge between human-readable labels and the machine-readable format needed to train the classifier.
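
To make the fit and transform steps concrete, here is a library-free sketch of the mapping that LabelEncoder builds. It assumes only that the encoder numbers the unique labels in sorted order, which is how its classes_ attribute is arranged; the label strings are made up for illustration and are not taken from the dataset.

```python
# Library-free sketch of what LabelEncoder does; the labels are made up.
types = ["VEHICULAR ACCIDENT", "FLOODING", "VEHICULAR ACCIDENT", "ROAD CLOSURE"]

# fit: collect the unique labels in sorted order; the position is the code
classes = sorted(set(types))
to_id = {name: i for i, name in enumerate(classes)}

# transform: replace each label with its integer code
encoded = [to_id[name] for name in types]

# inverse_transform: map the codes back to the original names
decoded = [classes[i] for i in encoded]

print(classes)   # ['FLOODING', 'ROAD CLOSURE', 'VEHICULAR ACCIDENT']
print(encoded)   # [2, 0, 2, 1]
```

Because the codes depend on the sorted set of labels seen during fitting, the same fitted encoder has to be reused at prediction time to decode the model's outputs.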

In [5]:
# Convert to HF Dataset
dataset = Dataset.from_pandas(df)

This line gets the data into the right format for the Hugging Face training pipeline. Up to this point, the dataset has been stored as a pandas DataFrame, which is great for viewing, cleaning, and manipulating data but is not what the Trainer expects. Hugging Face has its own data structure called a Dataset, which is designed to work efficiently with tokenization, batching, and model training.

Dataset.from_pandas(df) converts the cleaned DataFrame into a Hugging Face Dataset. Think of it as moving the data from a regular notebook into the model's preferred workspace. Hugging Face Datasets support fast operations such as mapping a tokenizer over every row, shuffling, and splitting into training and testing sets.

Another benefit is that Hugging Face Datasets plug directly into the Trainer class, which is used later to train the DistilBERT model. Converting the data now means no extra formatting work is needed at that point.

In [6]:
# Split 80/20
dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

print("Tokenizing...")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["Tweet"], padding=True, truncation=True)

tokenized = dataset.map(tokenize_function, batched=True)

train_dataset = tokenized["train"]
eval_dataset = tokenized["test"]

Tokenizing...


Map:   0%|          | 0/8228 [00:00<?, ? examples/s]

Map:   0%|          | 0/2057 [00:00<?, ? examples/s]

This part of the code prepares your dataset for training by splitting it and tokenizing it. First, dataset = dataset.train_test_split(test_size=0.2, seed=42) divides the data into two parts: 80% for training the model and 20% for evaluating it on tweets it was not trained on. The seed=42 argument makes the split reproducible, so you get the same partition every time you run the code. The two subsets are assigned to train_dataset and eval_dataset; these first assignments are overwritten with the tokenized versions at the end of the cell.

After that, the code prints “Tokenizing...” and loads the tokenizer with DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased"). The tokenizer breaks each tweet into tokens (small pieces of text) and maps each token to an integer ID that DistilBERT understands. The helper function tokenize_function applies the tokenizer to the "Tweet" column. With truncation=True, any text longer than the model's maximum input length (512 tokens for this checkpoint) is cut off. With padding=True, shorter texts are padded up to the longest text in the group of rows being tokenized together, and an attention mask records which positions are real tokens and which are padding.

Calling dataset.map(tokenize_function, batched=True) applies the function to the whole dataset in groups of rows (1,000 at a time by default), which is much faster than tokenizing one tweet at a time. Finally, the tokenized training and evaluation sets are stored as train_dataset and eval_dataset, ready to be fed to the model. In essence, this block turns your raw tweets into structured, model-ready input.
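
The effect of padding and truncation can be illustrated without loading a tokenizer. The sketch below is a simplified stand-in, not the tokenizer's actual implementation: the token IDs are made up, MAX_LEN is set to a tiny value for readability, and a real tokenizer would also keep the closing special token when it truncates.

```python
PAD_ID = 0    # padding ID assumed for this sketch
MAX_LEN = 6   # illustrative truncation limit; the real limit is much larger

def pad_and_truncate(batch, max_len=MAX_LEN, pad_id=PAD_ID):
    """Cut sequences to max_len, then pad them to the longest one in the batch."""
    truncated = [ids[:max_len] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Three made-up sequences of token IDs with different lengths
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 12, 102], [101, 102]]
out = pad_and_truncate(batch)

for ids, mask in zip(out["input_ids"], out["attention_mask"]):
    print(ids, mask)
```

Every row comes out with the same length, which is what lets the rows be stacked into a single tensor during training.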

In [7]:
# Rename label column
train_dataset = train_dataset.rename_column("label", "labels")
eval_dataset = eval_dataset.rename_column("label", "labels")

train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
eval_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

print(f"Training labels: {df['label'].nunique()} classes")

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=df["label"].nunique()
)

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    acc = accuracy_score(p.label_ids, preds)
    f1 = f1_score(p.label_ids, preds, average='weighted')
    return {"accuracy": acc, "f1": f1}

training_args = TrainingArguments(
    output_dir="./results_mmda", # This is the output directory for logs and checkpoints, not the final model path
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    load_best_model_at_end=True,
    report_to=[]
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

print("Training model...")
trainer.train()

print("Evaluating...")
results = trainer.evaluate()
print(results)

Training labels: 368 classes


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training model...


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 1.3084 | 0.61992 | 0.910549 | 0.879969 |
| 2 | 0.5231 | 0.518263 | 0.929023 | 0.903829 |
| 3 | 0.419 | 0.492057 | 0.928051 | 0.902755 |


Evaluating...


{'eval_loss': 0.4920569062232971, 'eval_accuracy': 0.9280505590666018, 'eval_f1': 0.9027554053927442, 'eval_runtime': 3.9432, 'eval_samples_per_second': 521.652, 'eval_steps_per_second': 65.428, 'epoch': 3.0}


This section of the code prepares everything needed for model training, sets up the evaluation process, and then actually trains and tests your DistilBERT classifier. First, the code renames the "label" column to "labels" in both the training and evaluation datasets because Hugging Face models specifically expect the target column to be called "labels"—without this, the Trainer wouldn’t know what the model is supposed to predict. After renaming, both datasets are formatted for PyTorch using set_format(), which ensures that each batch fed to the model will include only the essential components: the token IDs, the attention mask, and the label for each tweet. The program then prints how many unique classes (or categories) your dataset contains, which helps confirm that the number of labels matches what the model will be trained to predict.

Next, the model is initialized using DistilBertForSequenceClassification.from_pretrained, which loads a pre-trained DistilBERT model and attaches a new classification head sized to the number of categories found in your dataset. The function compute_metrics evaluates the model's performance: it takes the highest-scoring class for each example as the prediction, then calculates accuracy and a weighted F1-score against the true labels. After that, the TrainingArguments block sets the rules for training, such as where to write checkpoints, how many passes to make over the dataset (three epochs), the learning rate (2e-5), and the training batch size (16). Setting load_best_model_at_end=True tells the Trainer to reload the best checkpoint when training finishes (judged by validation loss unless another metric is named), which is why the model is both evaluated and saved every epoch. Setting report_to=[] turns off logging to external platforms.

With everything configured, the Trainer object is created by combining the model, training arguments, datasets, tokenizer, and evaluation metrics into one unified training system. When the program reaches trainer.train(), it officially begins training the model, gradually teaching DistilBERT how to recognize patterns in traffic-related tweets. Once training finishes, the code runs trainer.evaluate() to test how well the model performs on data it hasn’t seen before. The resulting accuracy and F1-score are printed out so you can understand how well the model learned to classify the tweets. Overall, this entire block handles the final preparation, learning process, and performance evaluation of your machine-learning model—from raw tokens all the way to trained predictions.
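
To show what the two metrics measure, here is a library-free sketch of accuracy and weighted F1. It is meant to mirror what accuracy_score and f1_score(average='weighted') compute, not to replace them, and the logits and labels are made up: a model that always predicts the majority class on an imbalanced toy set.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its true count."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Made-up logits for 10 examples and 2 classes; class 0 always scores highest
logits = [[1.5, -0.3]] * 10
y_pred = [max(range(len(row)), key=row.__getitem__) for row in logits]  # argmax
y_true = [0] * 8 + [1] * 2

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} weighted_f1={wf1:.4f}")
```

In this toy case accuracy is 0.8 while weighted F1 is lower, because the minority class is never predicted correctly. A gap between the two metrics is therefore a hint that some classes are handled worse than others.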

During training, the training loss went down steadily from 1.3084 in the first epoch to 0.4190 in the third, a sign that the model kept fitting the patterns in your traffic tweets. The validation loss also decreased, from 0.6199 to 0.4921, which suggests the model is not just memorizing the training data but is learning patterns that carry over to tweets it was not trained on.

Accuracy and F1 are high throughout. In the first epoch the model already reached 91.05% accuracy; it rose to 92.90% in the second epoch and stayed essentially flat at 92.81% in the third. The weighted F1-score followed the same pattern: 0.8800, then 0.9038, then 0.9028. Note that the F1-score sits a little below accuracy. With 368 classes, this gap suggests that some categories, likely the rarer ones, are predicted less reliably than the common ones.

After training finished, the evaluation step confirms these results. The model achieved:

Evaluation Loss: 0.4921

Evaluation Accuracy: 92.81%

Evaluation F1 Score: 0.9028

These numbers match the final epoch's performance, as expected: the third epoch had the lowest validation loss, so it is the checkpoint that load_best_model_at_end reloads. Validation loss was still falling at that point, so there is no clear sign of overfitting within three epochs, although accuracy had already leveled off. The evaluation also ran quickly (about 3.9 seconds), at roughly 522 samples per second.
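
The throughput figures in the printed evaluation dictionary can be cross-checked with a little arithmetic. The sketch below takes the evaluation set size (2057) from the second Map progress line and assumes an evaluation batch size of 8, which is the TrainingArguments default and is not set explicitly in the notebook.

```python
import math

eval_examples = 2057     # rows in the evaluation split (from the Map output)
eval_runtime = 3.9432    # 'eval_runtime' in seconds, from the printed results
eval_batch_size = 8      # assumption: library default, not set in the notebook

samples_per_second = eval_examples / eval_runtime
eval_steps = math.ceil(eval_examples / eval_batch_size)   # number of batches
steps_per_second = eval_steps / eval_runtime

print(eval_steps)
print(round(samples_per_second, 1), round(steps_per_second, 1))
```

Both rates land close to the reported 521.652 samples per second and 65.428 steps per second, which is consistent with the assumed batch size.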

In [8]:
# Define the final model save path
FINAL_MODEL_SAVE_PATH = "./mmda_bert_best"

This line of code creates a variable called FINAL_MODEL_SAVE_PATH, which stores the folder location where the final and best version of the trained model will be saved. In this case, the model will be saved inside a directory named "mmda_bert_best" within the current project folder. By defining this path in one place, the rest of the program can easily refer to it whenever the model needs to be saved or loaded, helping keep the code organized and easier to maintain.

In [9]:
# Ensure the directory exists before saving the model and label encoder
os.makedirs(FINAL_MODEL_SAVE_PATH, exist_ok=True)

trainer.save_model(FINAL_MODEL_SAVE_PATH)
joblib.dump(label_encoder, f"{FINAL_MODEL_SAVE_PATH}/label_encoder.pkl") # Save the label_encoder
print("Model and label encoder saved!")

Model and label encoder saved!


This part of the code makes sure the folder where the model will be saved actually exists. The os.makedirs() function creates the directory specified by FINAL_MODEL_SAVE_PATH, and the exist_ok=True argument prevents an error if the folder is already there. After that, trainer.save_model() stores the trained model in that directory. Because the tokenizer was passed to the Trainer, its files are written to the same folder, which is what allows the tokenizer to be loaded from this path later. The next line uses joblib.dump() to save the label_encoder as a separate file named label_encoder.pkl inside the same folder. Finally, the print() statement confirms that the save calls completed. Everything needed for future predictions is now stored in one place.

# **RUN INFERENCE — Predict Accident Type**

In [10]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, pipeline
import joblib


These lines import the tools needed to run the trained DistilBERT model. From the transformers library come DistilBertTokenizerFast, which converts text into the numeric format the model understands, and DistilBertForSequenceClassification, the model class used to load a classifier that sorts text into categories. The pipeline function is a convenience wrapper that handles tokenization, inference, and output formatting in one call. Lastly, joblib is a library commonly used for saving and loading Python objects such as label encoders. Together, these imports provide everything needed to load the saved components and run predictions.

In [11]:
# Define the correct model path
MODEL_PATH = "./mmda_bert_best"

This line creates a variable called MODEL_PATH, which holds the location of the DistilBERT model you trained and saved earlier (the same folder as FINAL_MODEL_SAVE_PATH). The leading ./ makes the path relative to the current working directory, which in a notebook is normally the folder the notebook runs from. Defining the path in a single variable also helps avoid mistakes later on: instead of typing the folder name several times, you refer to MODEL_PATH, so a change of location only has to be made in one place.

In [12]:
# Load trained model
model = DistilBertForSequenceClassification.from_pretrained(MODEL_PATH)
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_PATH)

This part of the code is responsible for loading the trained DistilBERT model so it can be used for predictions. The first line uses DistilBertForSequenceClassification.from_pretrained(MODEL_PATH) to load the model files stored in the directory you previously saved, which is defined by MODEL_PATH. This loads not just the model’s architecture but also the learned weights—the “knowledge” the model gained during training. The second line loads the tokenizer using DistilBertTokenizerFast.from_pretrained(MODEL_PATH). The tokenizer is essential because it converts raw text into the numerical input format the model understands. By loading both the model and the tokenizer from the same folder, you ensure they match perfectly, preventing errors and guaranteeing consistent results. Overall, these lines prepare the system to take in real text inputs and generate accurate predictions using your trained DistilBERT model.

In [13]:
# FIXED: Correct path to label_encoder.pkl
label_encoder = joblib.load(f"{MODEL_PATH}/label_encoder.pkl")

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

def classify_incident(text):
    prediction = classifier(text)[0]['label']
    idx = int(prediction.split("_")[-1])  # pipeline labels look like LABEL_0, LABEL_1, ...
    return label_encoder.inverse_transform([idx])[0]

Device set to use cuda:0


This section of the code loads the label encoder and sets up the classification pipeline so the system can interpret predictions correctly. The first line uses joblib.load() to retrieve the label_encoder.pkl file from the model directory. This label encoder is important because during training, text labels were converted into numeric values, and now you need the same encoder to translate the model’s numeric predictions back into meaningful category names.

Next, a pipeline for text classification is created using the loaded model and tokenizer. This pipeline acts as a ready-to-use tool that handles all the steps automatically: it tokenizes the input text, sends it to the model, and returns a prediction label.

The function classify_incident(text) is then defined to make predictions simpler. When text is passed into this function, it sends the text through the classifier and extracts the predicted label. Because no class names were stored in the model's configuration, the pipeline returns generic names such as "LABEL_0" or "LABEL_1". The code splits the string on the underscore and takes the numeric part at the end. This number is then passed to label_encoder.inverse_transform(), which converts it back into the original human-readable category name used in your dataset.
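
The decoding step can be tried in isolation. In this sketch the list classes stands in for the fitted encoder's class names, and both the names and the label string are made up for illustration.

```python
# Made-up class names, in the order an encoder would have numbered them
classes = ["FLOODING", "ROAD CLOSURE", "VEHICULAR ACCIDENT"]

def decode_label(pipeline_label, classes):
    """Turn a generic pipeline label such as 'LABEL_2' into a class name."""
    idx = int(pipeline_label.rsplit("_", 1)[-1])  # numeric suffix after the last "_"
    return classes[idx]

print(decode_label("LABEL_2", classes))  # VEHICULAR ACCIDENT
```

An alternative that the notebook does not use is to store the class names in the model configuration's id2label mapping before saving, so that the pipeline returns readable names directly.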

In [14]:
df = pd.read_csv("data_mmda_traffic_spatial.csv")

df['Predicted_Type'] = df['Tweet'].apply(classify_incident)

print(df[['Tweet', 'Predicted_Type']].head())

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


                                               Tweet  \
0  MMDA ALERT: Vehicular accident at Ortigas Emer...   
1  MMDA ALERT: Stalled L300 due to mechanical pro...   
2  MMDA ALERT: Vehicular accident at EDSA Rockwel...   
3  MMDA ALERT: Stalled L300 due to mechanical pro...   
4  MMDA ALERT: Vehicular accident at Ortigas Club...   

                           Predicted_Type  
0                      VEHICULAR ACCIDENT  
1  STALLED L300 DUE TO MECHANICAL PROBLEM  
2                      VEHICULAR ACCIDENT  
3  STALLED L300 DUE TO MECHANICAL PROBLEM  
4                      VEHICULAR ACCIDENT  


This part of the code loads your dataset containing MMDA tweets by reading the file "data_mmda_traffic_spatial.csv" into a pandas DataFrame. Once the data is loaded, a new column called "Predicted_Type" is created. This column is generated by applying the classify_incident function to every tweet in the "Tweet" column. Essentially, for each row, the program takes the tweet text, feeds it to your trained DistilBERT model, and returns a predicted incident category using the label encoder. After processing all tweets, the code prints the first few rows—showing each tweet alongside its predicted classification.

In the sample output, the predicted types agree with the tweet text. Tweets mentioning accidents are classified as "VEHICULAR ACCIDENT," while tweets about a stalled L300 are labeled "STALLED L300 DUE TO MECHANICAL PROBLEM." This shows the model turning raw MMDA alerts into structured categories. Keep in mind that these tweets come from the same file that was used for training, so the model has already seen most of them; the metrics from the held-out evaluation split are the better measure of how accurate it is.

# **ACCIDENT QUESTION–ANSWERING SYSTEM**

In [15]:
import pandas as pd
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering, pipeline


This section of the code starts by importing the tools you need to work with your dataset and a question-answering model. The first import, pandas as pd, brings in the pandas library, which is commonly used for reading, organizing, and analyzing structured data such as CSV files. The next line imports three important components from the transformers library. DistilBertTokenizerFast is responsible for breaking down text into tokens—the smaller units that the model can understand. DistilBertForQuestionAnswering loads a pre-trained DistilBERT model specifically designed for answering questions based on a given passage or context. Finally, the pipeline function allows you to easily create a ready-to-use question-answering tool without having to manually handle tokenization, model inputs, and prediction formatting. Together, these imports prepare your environment for building a system that can read data and intelligently answer questions using a powerful Transformer model.

In [16]:
# 1. Load your real dataset
df = pd.read_csv("data_mmda_traffic_spatial.csv")
df = df[['Tweet']].dropna()

This section of the code handles the first step of the question-answering workflow: loading the dataset. It begins by reading the file "data_mmda_traffic_spatial.csv", which contains the MMDA traffic-related tweets, into a pandas DataFrame. The code then selects only the "Tweet" column, because the tweet text is the only thing the question-answering model needs. Any other columns in the file are ignored at this stage.

Once the "Tweet" column is isolated, .dropna() removes any rows where the tweet text is missing. This matters because a missing value is not a string, and passing it to the question-answering pipeline later would cause an error. Note that .dropna() only removes missing values; a tweet that is present but consists of an empty string would remain.

In [17]:
# Use Tweet column as context
contexts = df['Tweet'].tolist()

This line takes the cleaned Tweet column and turns it into a plain Python list. Selecting df['Tweet'] picks out the tweet texts, and .tolist() converts the pandas column into a list of strings. A list is easy to loop over, which is how the question-answering step uses it: each tweet in contexts will serve in turn as the passage the model searches for an answer.

In [18]:
# 2. Load DistilBERT QA model
qa_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased-distilled-squad")
qa_model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad")
qa_pipeline = pipeline("question-answering", model=qa_model, tokenizer=qa_tokenizer)

Device set to use cuda:0


This section of your script sets up the Question-Answering (QA) model that will read your tweet data. You start by loading the DistilBERT tokenizer, which converts raw text into numerical tokens that the model can understand. Both the tokenizer and the model come from the checkpoint distilbert-base-uncased-distilled-squad, whose model weights were fine-tuned on the SQuAD dataset, meaning the model has been trained to answer questions based on given text passages.

Next, you load the corresponding DistilBERT QA model, which is designed specifically for extracting answers from text. This model is a lighter and faster version of BERT, making it ideal for real-time or large-scale text processing while still maintaining good accuracy.

After loading both the tokenizer and the model, you wrap them in a HuggingFace pipeline configured for "question-answering." Pipelines make your workflow much easier by handling all the behind-the-scenes steps—tokenization, model inference, and answer extraction—so you only need to provide a question and a context.

The output message, “Device set to use cuda:0”, simply means that the system detected a GPU and automatically assigned the model to run on it. This significantly speeds up processing, especially when running QA over thousands of tweets.

In [19]:
# 3. Ask question
question = input("Enter your WH-question about the accident: ")

best_answer = None
best_score = 0
best_context = None

Enter your WH-question about the accident: Where did the collision occur?


This part of your script sets up the interaction between the user and the QA system. It begins by asking the user to type a WH-question—for example “What type of accident happened?”, “Where did the collision occur?”, or “Who was involved?”. Whatever the user enters becomes the question that the model will try to answer using the tweets as context.

After capturing the question, the code initializes three variables: best_answer, best_score, and best_context. These act as trackers while the system evaluates many possible answers.

best_answer will eventually store the model’s strongest answer to the question.

best_score keeps track of the highest confidence score the model has produced so far.

best_context will store the specific tweet that gave the best answer.

By starting these variables as None or 0, the script prepares a clean slate so it can later compare all the model’s outputs and determine which tweet contains the most accurate, most confident answer. Essentially, this section sets up the foundation for scanning through all your tweet data to find the single most relevant and most reliable answer to the user’s question.

When the script runs this cell, it pauses and waits for the user to type a question. In the run shown here, the question entered was “Where did the collision occur?”. The question does not name a particular road or landmark, so the model is simply being asked to find, somewhere among the tweets, the span of text that best answers it. This input becomes the reference point as the system scans each tweet, scores the candidate answers, and keeps the one with the highest confidence.

In [20]:
# 4. Search entire dataset for the best answer
for ctx in contexts:
    try:
        result = qa_pipeline(question=question, context=ctx)
        if result['score'] > best_score:
            best_score = result['score']
            best_answer = result['answer']
            best_context = ctx
    except:
        continue

This cell scans every tweet in the dataset to find the highest-scoring answer to the user's question. For each tweet, referred to as ctx, the program sends both the question and the tweet to the question-answering model, which extracts a candidate answer from that tweet and returns a confidence score. The code then compares that score to the highest one found so far; if the new score is better, it updates the current best answer along with the score and the tweet it came from.

The whole call is wrapped in a try/except block so that if the model runs into a problem with a particular tweet, for example because of unexpected input, the program does not crash; it skips that tweet and continues searching. By the end of the loop, the program has evaluated every tweet and kept the answer with the highest confidence. Bear in mind that the model always extracts some answer from each tweet and the score only reflects how confident it is within that tweet, so the winner is the most confident answer found, not a guaranteed correct one.
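
The keep-the-best pattern can be run without a model. In the sketch below, fake_qa is a hypothetical stand-in for the real pipeline: it returns the same 'score' and 'answer' keys but uses a crude word-overlap score, and it raises an error on empty input so the skip logic is exercised. The contexts are made up.

```python
def fake_qa(question, context):
    """Hypothetical stand-in for the QA pipeline; same keys, toy scoring."""
    if not context.strip():
        raise ValueError("empty context")
    question_words = set(question.lower().rstrip("?").split())
    context_words = context.split()
    hits = sum(1 for w in context_words if w.lower() in question_words)
    return {"score": hits / len(context_words), "answer": context_words[-1]}

def best_answer(question, contexts, qa):
    """Return the highest-scoring result across contexts, or None if all fail."""
    best = None
    for ctx in contexts:
        try:
            result = qa(question=question, context=ctx)
        except Exception:
            continue  # skip contexts the model cannot process
        if best is None or result["score"] > best["score"]:
            best = {**result, "context": ctx}
    return best

contexts = ["accident at EDSA", "", "stalled bus at Ortigas"]
best = best_answer("Where was the accident?", contexts, fake_qa)
print(best)
```

Returning None when nothing could be scored makes the no-answer case explicit, which the three separate tracker variables in the notebook represent with their initial values.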

In [21]:
# 5. Show answer
print("Question:", question)
print("Answer:", best_answer)
print("Confidence:", best_score)
print("From Tweet:", best_context)

Question: Where did the collision occur?
Answer: Aurora Katipunan
Confidence: 0.9974426150001818
From Tweet: MMDA ALERT: Vehicular accident at Aurora Katipunan before F/O NB involving truck and car as of 10:23 PM. 1 lane occupied. MMDA assisted. #mmda


This section of the code displays the result after the program finishes searching the dataset. It prints the original question, then the answer the model selected, then the confidence score (the closer the number is to 1, the more confident the model is), and finally the exact tweet from which the answer was extracted.

In the example output, the question was “Where did the collision occur?” and the model answered “Aurora Katipunan” with a confidence of about 0.997. The source tweet reports a vehicular accident at Aurora Katipunan involving a truck and a car, so the answer is a location taken directly from that tweet. Many tweets in the dataset describe collisions at other places; this one was returned only because it produced the highest score. The same applies to questions that name a specific place: the loop does not filter tweets by location, so the top-scoring tweet may describe an incident somewhere else. Printing the question, the answer, the confidence, and the source tweet together lets you check for yourself whether the answer fits what was asked.
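
One possible way to keep answers on topic, which is not part of the notebook, is to narrow the list of tweets before the search runs. The sketch below shows a simple case-insensitive keyword filter; the function name filter_contexts and the sample tweets are hypothetical.

```python
def filter_contexts(contexts, keyword):
    """Keep only the contexts that mention the keyword (case-insensitive)."""
    key = keyword.lower()
    return [ctx for ctx in contexts if key in ctx.lower()]

# Made-up tweets in the style of the dataset
tweets = [
    "MMDA ALERT: Vehicular accident at Ortigas involving car and bus.",
    "MMDA ALERT: Vehicular accident at Aurora Katipunan involving truck and car.",
    "MMDA ALERT: Stalled L300 at EDSA due to mechanical problem.",
]

ortigas_only = filter_contexts(tweets, "ortigas")
print(ortigas_only)
```

Passing the filtered list to the search loop instead of the full list would restrict candidate answers to tweets that mention the place, at the cost of missing tweets that spell it differently.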