## BERTopic Inference

Use this notebook to assign topics to new data (i.e., data that were not part of the original 3-month training set) and save the results as a new `.xlsx` file.


In [7]:
import preprocess
import numpy as np
import pandas as pd
import re
from tqdm.auto import tqdm

from bertopic import BERTopic

ModuleNotFoundError: No module named 'numpy.strings'

In [2]:
# setting to display progress bars
tqdm.pandas()

## Read and preprocess new data

Read the new data that you would like the previously trained model to process.  
As an example, the cell below reads the training data.  
Change the variables `passwd` and `path_to_data` to match the new data file.  
Note that the new data must be in the same format as the training data (i.e., `CONFIDENTIAL_2024UBC_X99_3-Month-DataSet-REDACTED-Shared 2024-05-15.xlsx`).

In [3]:
passwd = 'Capstone_2024' # please edit this to the password for the new data file

path_to_data = "../data/CONFIDENTIAL_2024UBC_X99_3-Month-DataSet-REDACTED-Shared 2024-05-15.xlsx" # please change this to the path of the new data file

# this `read_encrypted_data` function is designed to read data that are saved in the same format as the training data provided
# i.e., CONFIDENTIAL_2024UBC_X99_3-Month-DataSet-REDACTED-Shared 2024-05-15.xlsx

# To run this function properly with new data, please make sure the new data are saved in the same format
data = preprocess.read_encrypted_data(passwd = passwd, path_to_data = path_to_data)
df = preprocess.reformat_data(data)
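
Because `compile_text` (defined later in this notebook) reads seven specific columns from each row, a quick check right after loading can catch a mismatched file early. The sketch below is not part of `preprocess`: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the column list is simply the set of fields that `compile_text` references. It assumes `preprocess.reformat_data` returns a dataframe with those column names.

```python
# Columns that compile_text reads from each row (taken from this notebook).
REQUIRED_COLUMNS = [
    "Occurrence_Type",
    "Occurrence_Type_UCR_Category",
    "Occurrence_Report_Category",
    "Priority",
    "Public_Generated_Event_Flag",
    "Event_Attended_Flag",
    "Event_Remarks_Text",
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Hypothetical usage with the dataframe loaded above:
# missing = missing_columns(df.columns)
# if missing:
#     raise ValueError(f"New data is missing expected columns: {missing}")
```

Failing fast here gives a clearer message than the `KeyError` that would otherwise surface partway through text compilation.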

In [4]:
# glossary is a dictionary that contains police jargon acronyms as key and their meanings as values
glossary = {
    "CO": "Complainant",
    "VI": "Victim",
    "OFF": "Offender",
    "VREG": "Vehicle Registration",
    "LRT": "Light Rail Transit", 
    #"Standby":	"Police presence requested to remove belongings from a location",
    "REG":	"Regimental number of officer",
    "UNIT":	"police vehicle unit",
    "BMQ":	"Broadcast Message Question (message broadcasted over the police radio instead of CAD to alert units)",
    "RTOC":	"Real Time Operations Centre",
    "CSS":	"Court Services Section",
    "ECO":	"Emergency Call Operator (call taker at 911)",
    "KOC":	"Knows of call (indicates Duty Sergeant (Sgt.) is aware of event)",
    "CST":	"Constable/Officer",
    "SENTRY": "The Police Records System",
    #"Event Priority":	'Level of response required (see "Data Column Summary" for a breakdown and definition)',
    "APU":	"Arrest Processing Unit",
    "EPO":	"Emergency Protection Order",
    #"dispatch cad code": "refers to event type code generated by call taking assistance tool. Dispatch CAD codes are mapped to our internal event subtypes (10 codes)",
    #"cross-referenced":	"some calls may be related, hence will be cross-referenced",
    "LOI": "Location of Interest",
    "DOAP":"Downtown Outreach Addictions Partnership run by Alpha House Society (nonprofit) (CPS partners with various nonprofit agencies providing service to people in need)",
    "POET":"Prolific Offender Engagement Team",
    "BWC":"Body Worn Camera"
}


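# a function to expand acronyms in a text, e.g. "CO" -> "CO (Complainant)"
def hybrid_expand_acronyms(text, glossary=glossary):
    for acronym, full_term in glossary.items():
        # Use word boundaries to replace only whole words (matching is case-sensitive);
        # re.escape guards against acronyms that contain regex metacharacters
        text = re.sub(rf'\b{re.escape(acronym)}\b', f'{acronym} ({full_term})', text)
    return text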


# a function to compile text from the dataframe row
def compile_text(x):
    text = (
        f"Description of the behavior or criminal offense: {x['Occurrence_Type']}. \n" 
        f"Broadest level of categorization: {x['Occurrence_Type_UCR_Category']}. \n" 
        f"Secondary level categorization: {x['Occurrence_Report_Category']}. \n"
        f"The priority level assigned to the call by the ECO (911 call taker): {x['Priority']}. \n"
        f"Was the call initiated by a member of the public? {x['Public_Generated_Event_Flag']}. \n" 
        f"Flag that indicates the call was attended in person by an officer: {x['Event_Attended_Flag']}. \n"
        f"The log of the event: {hybrid_expand_acronyms(x['Event_Remarks_Text'])}" 
    )
    text = re.sub(r"\[redacted\]", "[MASK]", text, flags=re.IGNORECASE)
    return text


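# a function to output embeddings
# assumes `model` exposes a SentenceTransformer-style `encode` method;
# the model's own device is used ("auto" is not a valid torch device string)
def output_embedding(txt, model):
    try:
        embd = model.encode(txt)
        return embd
    except Exception as e:
        print(f"Error encoding text: {e}")
        return None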
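
To make the effect of the acronym expansion and redaction masking concrete, here is a small self-contained sketch. It mirrors the word-boundary approach of `hybrid_expand_acronyms` on a three-entry subset of the glossary; `sample_glossary`, `expand`, and the sample remark are made up for illustration and are not part of the project code.

```python
import re

# A three-entry subset of the glossary, for illustration only.
sample_glossary = {
    "CO": "Complainant",
    "OFF": "Offender",
    "LRT": "Light Rail Transit",
}


def expand(text, glossary):
    """Keep each whole-word acronym and append its meaning in parentheses."""
    for acronym, full_term in glossary.items():
        pattern = rf"\b{re.escape(acronym)}\b"
        # A function replacement avoids backslash processing in the meaning text.
        text = re.sub(pattern, lambda m: f"{acronym} ({full_term})", text)
    return text


# Made-up remark: lowercase "co" and the word "COLD" must stay untouched.
remark = "CO states OFF left via LRT, co was COLD. [REDACTED] called back."
expanded = expand(remark, sample_glossary)
masked = re.sub(r"\[redacted\]", "[MASK]", expanded, flags=re.IGNORECASE)
print(masked)
# CO (Complainant) states OFF (Offender) left via LRT (Light Rail Transit), co was COLD. [MASK] called back.
```

Matching is case-sensitive and limited to whole words, so ordinary words that happen to contain an acronym are left alone, while the redaction marker is replaced regardless of case.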


In [5]:
docs = df.progress_apply(compile_text, axis=1)

  0%|          | 0/9751 [00:00<?, ?it/s]
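
One thing to watch for with new data: `re.sub` raises a `TypeError` when it receives a non-string, so a row whose `Event_Remarks_Text` is empty (NaN) would stop the compilation step. A minimal sketch of a guard follows; `safe_remarks` is a hypothetical helper, and whether the new data actually contains empty remarks is an assumption to verify.

```python
import math


def safe_remarks(value):
    """Return `value` as a string, mapping None and float NaN to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


# Hypothetical usage before compiling the documents:
# df["Event_Remarks_Text"] = df["Event_Remarks_Text"].map(safe_remarks)
```

This handles Python `None` and floating-point NaN (which pandas uses for missing values in object columns); other missing-value markers would need their own check.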

## Load Trained BERTopic Model

In [7]:
# load the previously trained BERTopic model

bertopic_model_path = "../models/tuned_gte_large_model"

loaded_model = BERTopic.load(bertopic_model_path)

## Inference

In [8]:
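# reduce log output during inference
# note: `bertopic._utils` is a private module, so this import may break between BERTopic versions
from bertopic._utils import MyLogger
logger = MyLogger("ERROR")
loaded_model.verbose = False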

In [9]:
# Use the trained BERTopic model to assign a topic to each new document
# This may take a while to run, depending on the size of the new data
topics, probabilities = loaded_model.transform(docs)

In [10]:
topic_df = loaded_model.get_topic_info()
display(topic_df)

| Unnamed: 0 | Topic | Count | Name | Representation | Aspect1 | Representative_Docs |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | -1 | 3580 | 0_"Public Inquiry Calls to 911"assistant\n\nIt... | ["Public Inquiry Calls to 911"assistant\n\nIt ... | | |
| 1 | 0 | 82 | HARASSMENT/THREATS | ["Online Harassment"assistant\n\nBased on the ... | [description behavior criminal, harassment, ca... | |
| 2 | 1 | 29 | FRAUD | [Fraud Investigation Reportsassistant\n\nTopic... | [offense fraud defraud, fraud priority level, ... | |
| 3 | 2 | 29 | THEFT | [Theft Incident Reportsassistant\n\nBased on t... | [theft priority level, offense theft 5000, off... | |
| 4 | 3 | 17 | ASSAULT | ["Assault and Complaints"assistant\n\nBased on... | [complaint assault, chief complaint assault, a... | |
| ... | ... | ... | ... | ... | ... | ... |
| 71 | 70 | 27 | 64_Custody Issue Incidentsassistant\n\nBased o... | [Custody Issue Incidentsassistant\n\nBased on ... | [description behavior criminal, behavior crimi... | |
| 72 | 71 | 27 | 65_"Police Officer Reports Without Timers"assi... | ["Police Officer Reports Without Timers"assist... | [incident involves custody, custody issue disp... | |
| 73 | 72 | 26 | 66_"Protest Events in Calgary"assistant\n\nBas... | ["Protest Events in Calgary"assistant\n\nBased... | [description behavior criminal, behavior crimi... | |
| 74 | 73 | 25 | 67_"Regimental Number Officer Response Events"... | ["Regimental Number Officer Response Events"as... | [protesters, description behavior criminal, 91... | |


In [11]:
# save inferred topics to the original dataframe
df["Topic"] = topics

# merge topic info into the original dataframe as well
infered_df = df.merge(topic_df[["Topic", "Name", "Representation","Aspect1"]], how="left", on="Topic")
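
As far as we understand BERTopic, the `Count` column returned by `get_topic_info()` describes the data the model was fitted on, not the documents just passed to `transform`, so it is worth counting the new assignments directly. The following is a minimal sketch under that assumption; `summarize_topics` and `example_topics` are hypothetical names, and the topic ids are made up to stand in for the `topics` returned above.

```python
from collections import Counter


def summarize_topics(topics):
    """Count documents per topic id and compute the share of outliers (topic -1)."""
    counts = Counter(topics)
    total = sum(counts.values())
    outlier_share = counts.get(-1, 0) / total if total else 0.0
    return counts, outlier_share


# Made-up topic ids standing in for the `topics` returned by transform().
example_topics = [-1, 0, 0, 2, -1, 1, 0, -1]
counts, outlier_share = summarize_topics(example_topics)
print(counts.most_common())
print(f"{outlier_share:.1%} of documents were left unassigned")
```

A high outlier share on new data can be a sign that the new calls differ from the training period, which is useful to know before relying on the exported topics.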

## Output new data with topics inferred by the BERTopic model

In [12]:
# output the new data with topics assigned by our trained BERTopic Model
output_file_name = "../results/bertopic_infered_new_data.xlsx" # Please change the output path and file name as you see fit
infered_df.to_excel(output_file_name,index=False)