# Zero-Shot Top Three Labeling

Summary: This notebook uses log entries, label titles, and label descriptions to perform zero-shot classification. A BERT-based sentence-embedding model converts the log entries and label descriptions into embeddings, which we then compare using cosine similarity. The three labels with the highest similarity to each log entry are identified and saved to a CSV file (testset_after_fine_tune_09.csv in this run) for future reference. At the end of the notebook, a "Label Analysis" section presents a 3D scatter plot and a heatmap of the label description embeddings.

### Import necessary packages and files

In [1]:
import json
import torch
import numpy as np
from pathlib import Path
from collections import defaultdict, Counter
from pprint import pprint
from sklearn.metrics import precision_recall_fscore_support
from sentence_transformers import SentenceTransformer, util
import pandas as pd




In [2]:
occ_logs_file_path = ("occ-logs-2022-11-01-12-31.csv")
#model = SentenceTransformer("paraphrase-mpnet-base-v2", device="cpu")
model = SentenceTransformer("tsdae-model") # use the model in the folder

In [3]:
label_description = {
"TRAPPED IN ELEVATOR":	'The "TRAPPED IN ELEVATOR" category under "QUARTERLY INCIDENT AND OCCURRENCE REPORT" covers incidents related to elevator malfunctions or entrapments, which may impact passenger safety. These events involve passengers getting stuck in elevators at various station locations, such as platforms, parking garages, or transit areas. These incidents are reported when Station Agents, elevator technicians, or the fire department are called to respond and extricate the trapped individuals.',
"VANDALISM":	'The "VANDALISM" category under "QUARTERLY INCIDENT AND OCCURRENCE REPORT" covers incidents involving the misuse or discharge of fire extinguishers, or vandalism to fire extinguishers or hose cabinets within the transit system. Examples include fire extinguishers being discharged in stations, facilities, or inside trains.',
"FIRE":	'The "FIRE" category under "QUARTERLY INCIDENT AND OCCURRENCE REPORT" events at BART encompasses incidents involving fires or related hazards which may affect the operation of the transit system or the safety of patrons or BART personnel. Examples include fires in BART equipment, stations, facilities, or along the right-of-way. Fires appearing to have originated from activities occurring in a homeless encampment will be categorized as "HENC". The "FIRE" reporting threshold will be considered a fire that requires suppression by BART personnel or the fire department.',
"EVACUATION":	'The "FLS EVACUATION" category under "QUARTERLY INCIDENT AND OCCURRENCE REPORT" at BART refers to events that necessitate the evacuation of stations, trains, or other BART facilities. Examples may include instances of derailment, mechanical failure, smoke, fire, arson, hazardous materials leaks, fires on trains or nearby structures, or other safety concerns that require evacuation as a precautionary measure.',
"HENC":	'The "HENC (Homeless Encampment)" category under "QUARTERLY INCIDENT AND OCCURRENCE REPORT" at BART encompasses events related to fires or other safety hazards originating from activities occurring in homeless encampments near BART facilities or tracks. These incidents may include fires, smoke, or other dangerous situations that can impact the safety of BART patrons, personnel, or the operation of the transit system.',
"SMOKE":	'The "SMOKE" category under "QUARTERLY INCIDENT AND OCCURRENCE REPORT" at BART includes events involving smoke sightings or incidents related to smoke near or within BART facilities, tracks, or trains which may affect the operation of the transit system or the safety of patrons or BART personnel. These events may be caused by various factors, such as fires, electrical or mechanical issues, or debris in the trackway.',
"CSE":	'The "CSE (Computer System Engineering)" category for BART delays encompasses issues related to the communication and control systems that govern the operations of the transit network. This includes disruptions in the Integrated Control System (ICS) and Field Interface Process (FIP) communication link, which may occur during specific time intervals (e.g., 1000-1130), as well as problems with the ICS and Supervisory Control and Data Acquisition (SCADA) communication. These delays arise from technical complications within the computer systems that manage and regulate BART infrastructure, impacting the overall efficiency and timeliness of the transit system.',
"Operation":	'The "Operation" category for BART delays involves issues that stem from personnel, procedures, and management-related factors affecting the performance of the transit system. This includes delays due to staffing shortages, train operators making their own turns, and train management zone procedures. Additionally, delays may arise from train operators falling ill, following specific procedures such as auxiliary power off, circuit breaker trips, late boarding, or accidentally opening doors on the wrong side of the train. These operational delays impact the efficiency and punctuality of the BART system, as they are primarily related to the human and procedural aspects of the transit network.',
"Police":	'The "Police" category for BART delays pertains to incidents that require intervention or investigation by the BART Police Department (BPD) and subsequently impact the operations of the transit system. These delays may result from BPD conducting sweeps or fare inspections, as well as holding trains due to various situations on or near the platforms. Examples of such situations include fights, smoking, intoxication, disorderly conduct, disturbances, drug use, theft, assault, battery, suspicious activity, or welfare checks. BPD may also be involved in searching for suspects, dealing with animals, or responding to incidents involving weapons or lewd acts. These police-related delays can disrupt the normal flow of the BART system, causing delays as officers work to ensure the safety and security of passengers and staff.',
"Track":	'The "Track" category for BART delays refers to issues with the physical infrastructure of the rail system that can lead to disruptions in service. One example of such a delay cause is rail kinks, which occur when the track warps or bends, potentially compromising the safety and smooth operation of the trains. These track-related delays impact the overall performance of the BART system as they require inspection, repair, or maintenance to ensure the reliability and safety of the rail network.',
"Train Control":	'The "Train Control" category for BART delays involves issues related to the systems that manage and regulate train movement and coordination. Examples of delay causes within this category include false occupancy, which occurs when the control system mistakenly detects the presence of a train on a section of track, and routing problems, which can arise from issues with the geolocation system. These train control-related delays impact the overall efficiency and safety of the BART system, as they can cause disruptions to train scheduling, coordination, and navigation, requiring prompt resolution to restore smooth operations.',
"Traction Power":	'The "Traction Power" category for BART delays pertains to issues with the electrical systems responsible for powering the trains. This includes problems such as loose third rail cover boards, which may require manual operation or removal, and disruptions to the third rail power supply. Delays within this category impact the overall functionality and reliability of the BART system, as they can cause trains to lose power or operate inefficiently. Addressing traction power issues is crucial to maintaining the safety and performance of the transit network.',
"Vehicle":	'The "Vehicle" category for BART delays encompasses issues related to the trains and their various components, which can impact the performance and reliability of the transit system. Causes for delays in this category include propulsion problems, door malfunctions, inverter issues, car shortages, and complications with the Automatic Train Operations (ATO) system. Other delay factors may involve brake issues, late dispatches, communication problems that require resetting, or disabled trains within the yard. These vehicle-related delays can significantly affect the efficiency of the BART system, as they require inspection, repair, or maintenance to ensure the safe and smooth operation of the trains.'
}

## Fine Tune Model

Build sentence pairs of the form [label_description, entry] for fine-tuning.

In [6]:
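from sklearn.model_selection import train_test_split
labeled_data = pd.read_csv("labeled_data.csv")
train_set, test_set = train_test_split(labeled_data, test_size=0.2)
labels = labeled_data.CATEGORY.unique()
# Look up descriptions case-insensitively: some keys are upper case ("FIRE"), others are not ("Police").
description_by_category = {key.upper(): text for key, text in label_description.items()}
# Build pairs from train_set only, so the held-out test_set entries are not seen during fine-tuning.
input = []  # note: shadows the built-in input(); the name is kept because a later cell reads it
for category, group in train_set.groupby('CATEGORY'):
    for entry in group['ENTRY']:
        input.append([description_by_category[str(category).upper()], entry])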

Fine-tune the model on these pairs. Each (label description, entry) pair is treated as similar and assigned an arbitrary target similarity score of 0.9.

In [8]:
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader
train_examples = []
for i in input:
    train_examples.append(InputExample(texts = i, label = 0.9)) # can try out different values, like 1.0
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model=model)
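
To make the training objective concrete: with its default settings, CosineSimilarityLoss takes the mean squared error between the cosine similarity of the two embeddings in a pair and the pair's label. The sketch below reproduces that arithmetic in plain Python on made-up 2-D vectors; the function names and the toy vectors are illustrative assumptions, not part of the notebook or the library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs, targets):
    """Mean squared error between each pair's cosine similarity and its target."""
    errors = [
        (cosine_similarity(u, v) - target) ** 2
        for (u, v), target in zip(pairs, targets)
    ]
    return sum(errors) / len(errors)

# Toy 2-D "embeddings" (made up for illustration).
aligned = ([1.0, 0.0], [1.0, 0.0])       # cosine similarity 1.0
orthogonal = ([1.0, 0.0], [0.0, 1.0])    # cosine similarity 0.0

# Every pair in the notebook uses the same target, 0.9.
loss_aligned = cosine_similarity_loss([aligned], [0.9])        # (1.0 - 0.9) ** 2
loss_orthogonal = cosine_similarity_loss([orthogonal], [0.9])  # (0.0 - 0.9) ** 2
```

Because every training pair here is a positive pair with the same target, this objective only pulls entries toward their own label description; it never pushes an entry away from an unrelated label. Adding pairs with a low target score might be worth trying, though that is a suggestion rather than something this notebook tests.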

In [9]:
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3, warmup_steps=100)

Iteration: 100%|██████████| 3/3 [00:09<00:00,  3.06s/it]
Iteration: 100%|██████████| 3/3 [00:08<00:00,  2.99s/it]
Iteration: 100%|██████████| 3/3 [00:08<00:00,  2.89s/it]
Epoch: 100%|██████████| 3/3 [00:26<00:00,  8.96s/it]


## Zero-Shot Classification

The earlier zero-shot classifier has been adapted to support tagging: each train log is tagged with its three most similar categories, ranked by the cosine similarity between the log embedding and each label description embedding.

In [64]:
class ZeroShotClassifier:

    def __init__(self, model=None):
        self.model = model
        self.labels = []
        self.label_embeddings = None
        self.scores = []

    def train(self, labels, descriptions):
        self.labels = list(labels)
        # Use the classifier's own model rather than the notebook-level global `model`.
        self.label_embeddings = self.model.encode(descriptions)

    def predict(self, input_texts=None, input_embeddings=None, output_scores=False):

        if input_embeddings is None:
            input_embeddings = self.model.encode(input_texts)

        S = util.pytorch_cos_sim(input_embeddings, self.label_embeddings)

        predicted_labels = []
        predicted_scores = []
        self.scores = []  # reset so repeated predict() calls do not accumulate stale scores

        for i in range(input_embeddings.shape[0]):
            label_scores = S[i].tolist()
            # Rank every label by its similarity to entry i, highest first.
            scored = sorted(
                zip(self.labels, label_scores),
                key=lambda x: x[1],
                reverse=True
            )

            # Keep the three best labels and their scores, rounded to two decimals.
            pred = [label for label, _ in scored[:3]]
            score = [round(value, 2) for _, value in scored[:3]]

            predicted_scores.append(scored)
            predicted_labels.append(pred)
            self.scores.append(score)

        if output_scores:
            return predicted_labels, predicted_scores
        else:
            return predicted_labels
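
The ranking step inside predict can be illustrated without loading a model. The sketch below scores one entry against a handful of labels using hand-written 3-D vectors and keeps the three best; the vectors and the helper name top_k_labels are hypothetical and stand in for real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_labels(entry_vec, label_vecs, k=3):
    """Return the k (label, rounded score) pairs most similar to entry_vec."""
    scored = [
        (label, cosine_similarity(entry_vec, vec))
        for label, vec in label_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(label, round(score, 2)) for label, score in scored[:k]]

# Made-up 3-D stand-ins for label description embeddings.
label_vecs = {
    "FIRE": [1.0, 0.0, 0.0],
    "SMOKE": [0.8, 0.6, 0.0],
    "Police": [0.0, 0.0, 1.0],
    "Track": [0.0, 1.0, 0.0],
}
entry_vec = [0.9, 0.1, 0.0]  # made-up stand-in for a log entry embedding

top3 = top_k_labels(entry_vec, label_vecs)
top3_labels = [label for label, _ in top3]  # FIRE, SMOKE, Track in this toy case
```

Note that the third label is always returned even when its similarity is low, as with Track here, so the recorded scores are worth reading alongside the labels.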

### Run classifier and get predictions

The following cell runs the classifier on the test split of the labeled data, using the labels and descriptions of both the Delay and Fire Life Safety (FLS) categories.

In [66]:

my_classifier = ZeroShotClassifier(
    model=model,
)

my_classifier.train(
    labels=label_description.keys(),
    descriptions=list(label_description.values())
)

pred = my_classifier.predict(
    test_set['ENTRY'].tolist()
)

Format the data and save the predictions to testset_after_fine_tune_09.csv, which contains each entry, its top three labels, and the similarity score for each label.

In [67]:

test_set['Pred'] = pred
test_set['Total Scores'] = my_classifier.scores
test_set = test_set.sort_values(by=["Pred"], ascending=True)
test_set.to_csv("testset_after_fine_tune_09.csv")
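
The notebook saves the predictions but does not compute a summary metric. One possible check, sketched below under the assumption that the true label is the CATEGORY column and that labels should be compared case-insensitively, is the share of entries whose true label appears among the three predictions. The function name top_k_hit_rate and the toy data are illustrative.

```python
def top_k_hit_rate(true_labels, predicted_top_k):
    """Fraction of entries whose true label is among its predicted labels."""
    hits = sum(
        1
        for truth, preds in zip(true_labels, predicted_top_k)
        if truth.upper() in {p.upper() for p in preds}
    )
    return hits / len(true_labels)

# Toy data (made up): four entries with their true labels and top-three predictions.
true_labels = ["FIRE", "Police", "SMOKE", "Track"]
predicted = [
    ["FIRE", "SMOKE", "HENC"],
    ["Operation", "Police", "Vehicle"],
    ["FIRE", "HENC", "EVACUATION"],          # miss: SMOKE is not in the top three
    ["Track", "Traction Power", "Train Control"],
]

hit_rate = top_k_hit_rate(true_labels, predicted)  # 3 hits out of 4 entries
```

On the real data this would be called with test_set['CATEGORY'].tolist() and test_set['Pred'].tolist().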

## Label Analysis

In [8]:
import plotly.express as px
from sklearn.decomposition import PCA

# 3D scatter plot: project the label embeddings onto their first three principal components
pca3 = PCA(n_components=3)
threed = pca3.fit_transform(my_classifier.label_embeddings)
label_names = list(label_description.keys())
# Group of each label, in the same order as label_description (6 FLS, then 7 Delay Management)
label_type = ["FLS"] * 6 + ["Delay Management"] * 7
scatter = pd.DataFrame({
    "type": label_type,
    "x": threed[:, 0],
    "y": threed[:, 1],
    "z": threed[:, 2],
    "label": label_names,
})
fig = px.scatter_3d(scatter, x='x', y='y', z='z', symbol='type', color='label')
fig.show()

# Heatmap: pairwise cosine similarity between label embeddings
labels_similarity = util.pytorch_cos_sim(my_classifier.label_embeddings, my_classifier.label_embeddings)
fig = px.imshow(labels_similarity.tolist(), x=label_names, y=label_names)
fig.show()

Comments: In the 3D scatter plot, the FLS labels and the Delay Management labels form two clearly separated groups.
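
Since a three-component PCA projection discards part of the variance, the visual separation could also be checked numerically in the full embedding space. A minimal sketch, assuming each label carries an FLS or Delay Management group tag, compares the mean cosine similarity of label pairs within the same group to that of pairs across groups. The helper name and the 2-D toy vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_within_and_across(vectors, groups):
    """Mean pairwise similarity for same-group pairs and for cross-group pairs."""
    within, across = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine_similarity(vectors[i], vectors[j])
        if groups[i] == groups[j]:
            within.append(sim)
        else:
            across.append(sim)
    return sum(within) / len(within), sum(across) / len(across)

# Made-up 2-D stand-ins for label embeddings, two per group.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
groups = ["FLS", "FLS", "Delay Management", "Delay Management"]

within_mean, across_mean = mean_within_and_across(vectors, groups)
```

A within-group mean well above the cross-group mean would support the separation seen in the plot. On the real data the inputs would be my_classifier.label_embeddings.tolist() and the group tag of each label.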