>[Introduction](#scrollTo=vRixN8r3vOun)

>>[Traffic Accident Data](#scrollTo=vRixN8r3vOun)

>>[Project Highlight](#scrollTo=vRixN8r3vOun)

>>[Next Steps](#scrollTo=vRixN8r3vOun)

>[Remove PII information using Nvidia Nemo](#scrollTo=ZV3Wn-E4n2In)

>>[Check the effect of nemo_curator.modifiers.pii_modifier](#scrollTo=538oKVB7nmey)

>[Build a ChatBot (gpt-3.5-turbo) based on the Redacted Traffic Accident Data](#scrollTo=saGoGhIStWdY)



# Introduction
## Traffic Accident Data
- The traffic accident data used in this project were downloaded from the Michigan Traffic Crash Facts website (https://www.michigantrafficcrashfacts.org/legacy/querytool#q1;0;2022;;) and then cleaned. The dataset includes detailed information such as Crash Date, Crash Time, Crash Type, Light, Road Surface Condition, and Vehicle Type, plus information about the persons involved in each accident, such as Age, Sex, and Race. It also has a "Narrative" column containing unstructured free text extracted from the police accident report.
- It would be useful to build a ChatBot that analyzes the traffic accident data and answers questions such as:
      "What are the most common traffic accident patterns?",
      "What is the average age of the drivers involved in the traffic accidents?",
      "What is the most frequent weather condition when traffic accidents happened?",
      "What time of day do most accidents occur?",
      "Which vehicle makes are most frequently involved in accidents?"
- One challenge is removing personally identifiable information (PII) from the traffic accident reports: victim names sometimes appear in the Narrative column.
## Project Highlight
- I first applied NVIDIA NeMo Curator (nemo-curator) to remove personally identifiable information from the traffic accident data.
- Then I created a vector database using llama-index: each row of the dataset is treated as a traffic accident file, with the Narrative as the main body and the other columns as metadata. On top of it I built the chatbot using gpt-3.5-turbo.
- The spot checks in this notebook show that the NeMo pii_modifier can identify person names in the Narrative column and replace them with a placeholder.
- llama-index greatly simplifies the pipeline needed to get a ChatBot up and running.
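
As a rough illustration of the row-to-document mapping described above, the sketch below turns each row into a plain dictionary with the Narrative as text and the remaining columns as metadata. The toy rows and the helper name `rows_to_records` are made up for illustration; the notebook itself builds llama-index `Document` objects later on.

```python
import pandas as pd


def rows_to_records(df, text_column="Narrative"):
    """Turn each row into {'text': ..., 'metadata': {...}} (hypothetical helper)."""
    records = []
    for _, row in df.iterrows():
        # Missing narratives become empty strings so every record has text.
        text = "" if pd.isna(row[text_column]) else str(row[text_column])
        metadata = {col: row[col] for col in df.columns if col != text_column}
        records.append({"text": text, "metadata": metadata})
    return records


# Toy data, not taken from the real dataset.
toy = pd.DataFrame(
    {
        "Crash ID": [1, 2],
        "Age": [33, 60],
        "Narrative": ["VEH #1 REAR ENDED VEH #2.", None],
    }
)
records = rows_to_records(toy)
print(records[0])
```

Keeping the structured columns as metadata means they travel with the narrative when a record is retrieved, so the model can see values such as the age alongside the text.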
## Next Steps
- Due to limited capacity, I randomly sampled only 564 traffic accident records from 2022. According to the description on the website, there are 293341 traffic accidents in Michigan alone, so it would be good to download more data.
- We can also improve the prompt the chatbot uses when communicating with the LLM.
- We can also try LLMs other than gpt-3.5-turbo.
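
One possible direction for the prompt improvement mentioned above is a question-answering template that tells the model to stay within the retrieved records. The sketch below is plain Python string formatting; the wording of the template and the name `build_prompt` are hypothetical. The placeholder names `context_str` and `query_str` follow the convention llama-index uses for its question-answering prompts, so a template like this could presumably be wrapped in a llama-index `PromptTemplate`, but that wiring is not shown or tested here.

```python
# Hypothetical prompt wording; adjust to taste.
QA_TEMPLATE = (
    "You are analyzing redacted traffic accident reports.\n"
    "Use only the records below to answer.\n"
    "If the records are not enough to answer, say so.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Question: {query_str}\n"
    "Answer: "
)


def build_prompt(context_str, query_str):
    """Fill the template with retrieved context and the user question."""
    return QA_TEMPLATE.format(context_str=context_str, query_str=query_str)


prompt = build_prompt(
    context_str="Unit 1 struck Unit 2 while changing lanes.",
    query_str="What happened?",
)
print(prompt)
```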


In [1]:
!pip install nemo-curator
!pip install llama-index

Collecting llama-index
  Using cached llama_index-0.11.19-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.4.0,>=0.3.4 (from llama-index)
  Using cached llama_index_agent_openai-0.3.4-py3-none-any.whl.metadata (728 bytes)
Collecting llama-index-cli<0.4.0,>=0.3.1 (from llama-index)
  Using cached llama_index_cli-0.3.1-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.12.0,>=0.11.19 (from llama-index)
  Using cached llama_index_core-0.11.19-py3-none-any.whl.metadata (2.4 kB)
Collecting llama-index-embeddings-openai<0.3.0,>=0.2.4 (from llama-index)
  Using cached llama_index_embeddings_openai-0.2.5-py3-none-any.whl.metadata (686 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.3.0 (from llama-index)
  Using cached llama_index_indices_managed_llama_cloud-0.4.0-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Using cached llama_index_legacy-0.9.48.post3-py3-none-any.whl.metadata (8.5 k

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
import pandas as pd
import torch
from tqdm.auto import tqdm
import difflib

from nemo_curator.modifiers.pii_modifier import PiiModifier
from llama_index.core import VectorStoreIndex, Document
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

# Remove PII information using Nvidia Nemo
- The columns that may contain PII in the traffic accident dataset are ['Driver Licence State', 'Age', 'Sex', 'Race', 'Narrative'].
- The Narrative column is the police description of the accident, and it sometimes mentions the names of the people involved.
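
The next cell joins these columns into one text string per row, redacts it, and splits it back. A minimal sketch of that round trip on toy data is shown below, without the redaction step. It assumes that each of the first four values is a single token without spaces; a value containing a space would shift the split. The toy rows are made up.

```python
import pandas as pd

focus_columns = ["Driver Licence State", "Age", "Sex", "Race", "Narrative"]

# Toy rows, not taken from the real dataset.
toy = pd.DataFrame(
    {
        "Driver Licence State": ["MI", "OH"],
        "Age": [33, 60],
        "Sex": ["M", "M"],
        "Race": ["W", "W"],
        "Narrative": ["VEH #1 REAR ENDED VEH #2.", "Unit 1 struck Unit 2."],
    }
)

# Join the focus columns into one space-separated string per row.
combined = toy[focus_columns].astype(str).agg(" ".join, axis=1)

# Split on whitespace at most len(focus_columns) - 1 times, so the
# narrative (the last piece) keeps its internal spaces.
parts = combined.str.split(n=len(focus_columns) - 1, expand=True)
parts.columns = focus_columns

print(parts)
```

Note that after the round trip every value is a string (for example "33" rather than 33).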



In [None]:
# Define the columns where PII information can be included
focus_columns = ['Driver Licence State', 'Age', 'Sex', 'Race', 'Narrative']

# Custom preprocessing function to combine focus columns
def preprocess_data(df):
    df['text'] = df[focus_columns].astype(str).agg(' '.join, axis=1)
    return df

# Custom postprocessing function to split the redacted text back into original columns.
# Assumes every focus column except the last holds a single token with no whitespace.
def postprocess_data(df):
    redacted_parts = df['text'].str.split(n=len(focus_columns)-1, expand=True)
    for i, col in enumerate(focus_columns):
        df[col] = redacted_parts[i]
    df = df.drop(columns=['text'])
    return df

# Define input and output files
input_file = "/content/drive/MyDrive/nemo_llamma_traffic_accidents/traffic_accident_2022_filtered_data.csv"
output_file = "/content/drive/MyDrive/nemo_llamma_traffic_accidents/traffic_accident_2022_filtered_data_redacted.csv"

# Initialize the PiiModifier
modifier = PiiModifier(
    language="en",
    supported_entities=["PERSON", "PHONE_NUMBER", "ID_NUMBER"],
    anonymize_action="replace",
    batch_size=32,  # Adjust this based on the GPU memory
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Load the deidentifier
deidentifier = modifier.load_deidentifier()

# Read the CSV file
print(f"Reading file: {input_file}")
df = pd.read_csv(input_file)
print(f"File read successfully. Shape: {df.shape}")

# Apply preprocessing
df = preprocess_data(df)

# Apply the PiiModifier
print("Applying PII modification...")
batch_size = 32  # Adjust this based on GPU memory
texts = df['text'].tolist()
modified_texts = []
changes = []

for i in tqdm(range(0, len(texts), batch_size)):
    batch = texts[i:i+batch_size]
    modified_batch = deidentifier.deidentify_text_batch(batch)
    modified_texts.extend(modified_batch)

    # Compare original and modified texts
    for original, modified in zip(batch, modified_batch):
        diff = list(difflib.ndiff(original.split(), modified.split()))
        changes.append('\n'.join([d for d in diff if d.startswith('+ ') or d.startswith('- ')]))

df['text'] = modified_texts
df['changes'] = changes

# Apply postprocessing
df = postprocess_data(df)

# Write the modified data to disk
df.to_csv(output_file, index=False)

print(f"PII redaction complete. Redacted file saved as {output_file}")

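# Display a sample of changes.
# Note: postprocess_data has already overwritten the focus columns, so the
# values printed as "Original columns" below are the redacted ones.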
print("\nSample of changes made:")
sample_size = min(5, len(df))
for _, row in df.sample(sample_size).iterrows():
    print(f"\nOriginal columns: {row[focus_columns].to_dict()}")
    print(f"Changes:\n{row['changes']}")

# Optional: Save changes to a separate file
changes_file = "/content/drive/MyDrive/nemo_llamma_traffic_accidents/traffic_accident_2022_filtered_data_changes.csv"
df[['Crash ID', 'changes']].to_csv(changes_file, index=False)
print(f"\nDetailed changes saved to {changes_file}")




Reading file: /content/drive/MyDrive/nemo_llamma_traffic_accidents/traffic_accident_2022_filtered_data.csv
File read successfully. Shape: (564, 30)
Applying PII modification...


  0%|          | 0/18 [00:00<?, ?it/s]



PII redaction complete. Redacted file saved as /content/drive/MyDrive/nemo_llamma_traffic_accidents/traffic_accident_2022_filtered_data_redacted.csv

Sample of changes made:

Original columns: {'Driver Licence State': 'MI', 'Age': '33', 'Sex': 'M', 'Race': 'W', 'Narrative': 'VEHICLE#2 WAS SLOWING OR STOPPED FOR TRAFFIC COMING OFF SB 23 RAMP INTO THE ROUNDABOUT. VEHICLE#2 WAS REAR ENDED BY VEHICLE#1.\\N\\NVEHICLE #1 STATED THAT HE COULD NOT STOP IN TIME AND REAR ENDED VEHICLE#2 CAUSING THE ACCIDENT. \\N\\NNO INJURIES REPORTED, MEDICAL REFUSED. '}
Changes:


Original columns: {'Driver Licence State': 'MI', 'Age': '23', 'Sex': 'M', 'Race': 'W', 'Narrative': 'BOTH VEHICLES WERE IN HEAVY STOP & GO TRAFFIC.  VEH #2 STARTED, MOVED FORWARD THEN STOPPED AGAIN.  VEH #1 ALSO MOVED FORWARD AND DID NOT STOP REAR ENDING VEH #2. '}
Changes:


Original columns: {'Driver Licence State': 'OH', 'Age': '60', 'Sex': 'M', 'Race': 'W', 'Narrative': 'Unit 1 struck Unit 2 while changing lanes on SB US 23. Driv

## Check the effect of nemo_curator.modifiers.pii_modifier
- accident id 343305

  Before applying PiiModifier, the Narrative of this traffic accident is: Unit 1 lost control while negotiating a curve and ran off the roadway right before overturning. Vincent reported a headache.

  After applying PiiModifier: Unit 1 lost control while negotiating a curve and ran off the roadway right before overturning. \<PERSON\> reported a headache.

  The name "Vincent" has been replaced with \<PERSON\>, so the sensitive information has been removed.
- One can also compare accident id 1250945 in traffic_accident_2022_filtered_data.csv and traffic_accident_2022_filtered_data_redacted.csv.
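
The comparison above can also be done programmatically. The sketch below applies the same word-level `difflib.ndiff` comparison used in the redaction cell to the narrative quoted above; the helper name `word_changes` is hypothetical, and the two strings are copied from the example rather than read from the CSV files.

```python
import difflib


def word_changes(original, modified):
    """Return the words removed ('- ') and added ('+ ') between two texts."""
    diff = difflib.ndiff(original.split(), modified.split())
    return [d for d in diff if d.startswith(("+ ", "- "))]


before = (
    "Unit 1 lost control while negotiating a curve and ran off the "
    "roadway right before overturning. Vincent reported a headache."
)
after = (
    "Unit 1 lost control while negotiating a curve and ran off the "
    "roadway right before overturning. <PERSON> reported a headache."
)

changes = word_changes(before, after)
print(changes)
```

An empty list means the redaction left the text unchanged, which matches the empty "Changes:" entries in the sample output above.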

# Build a ChatBot (gpt-3.5-turbo) based on the Redacted Traffic Accident Data

In [4]:
from google.colab import userdata
# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Load the data
df_accidents_cleaned = pd.read_csv("/content/drive/MyDrive/traffic_accident_2022_filtered_data_redacted.csv")
df_accidents_cleaned = df_accidents_cleaned.drop(["changes"], axis=1)

# Initialize OpenAI LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.7)

# Initialize OpenAI Embedding
embed_model = OpenAIEmbedding()

# Configure global settings
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512

# Convert each row into a Document: Narrative is the text, all other columns are metadata
def df_to_documents(df):
    return [
        Document(
            text=row['Narrative'],
            metadata={col: row[col] for col in df.columns if col != 'Narrative'},
        )
        for _, row in df.iterrows()
    ]

documents = df_to_documents(df_accidents_cleaned)

# Create a node parser
node_parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=50)

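# Create an index from the documents.
# The parser is passed via `transformations`; recent llama-index versions appear
# to ignore a `node_parser` keyword here, which would skip the custom chunking.
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[node_parser],
)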

# Function to ask questions
def ask_question(question):
    try:
        query_engine = index.as_query_engine(similarity_top_k=3)
        response = query_engine.query(question)
        return response.response
    except Exception as e:
        return f"An error occurred while processing the question: {str(e)}"

# Example questions
questions = [
    "What are the most common traffic accident patterns?",
    "What is the average age of the drivers involved in the traffic accidents?",
    "What is the most frequent weather condition when traffic accidents happened?",
    "What time of day do most accidents occur?",
    "Which vehicle makes are most frequently involved in accidents?",
]

# Ask and print answers to example questions
for question in questions:
    print(f"Q: {question}")
    answer = ask_question(question)
    print(f"A: {answer}\n")

# Interactive Q&A
while True:
    user_question = input("Ask a question about the traffic accident data (or type 'quit' to exit): ")
    if user_question.lower() == 'quit':
        break
    answer = ask_question(user_question)
    print(f"A: {answer}\n")

Q: What are the most common traffic accident patterns?
A: Rear-end collisions and sideswipe-same incidents are among the most common traffic accident patterns based on the provided context information.

Q: What is the average age of the drivers involved in the traffic accidents?
A: The average age of the drivers involved in the traffic accidents is 53.3 years.

Q: What is the most frequent weather condition when traffic accidents happened?
A: The most frequent weather condition when traffic accidents happened based on the provided context information is snow.

Q: What time of day do most accidents occur?
A: Accidents in the provided context occurred during different times of the day - both during daylight and at night.

Q: Which vehicle makes are most frequently involved in accidents?
A: HONDA, FORD, and TOYOTA vehicles are most frequently involved in accidents based on the provided context information.

Ask a question about the traffic accident data (or type 'quit' to exit): quit
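
Because the query engine retrieves only the three most similar records (`similarity_top_k=3`), answers to aggregate questions such as the average age probably reflect just those few records rather than all 564. A direct pandas computation is a simple cross-check. The sketch below uses toy data; `Age` is a column named earlier in this notebook, while the `Crash Type` column name is assumed from the dataset description and all values are made up.

```python
import pandas as pd

# Toy data, not taken from the real dataset.
toy = pd.DataFrame(
    {
        "Age": [33, 23, 60, 41],
        "Crash Type": ["Rear-End", "Rear-End", "Sideswipe-Same", "Rear-End"],
    }
)

# Coerce to numbers in case some values were stored as text.
average_age = pd.to_numeric(toy["Age"], errors="coerce").mean()

# Count how often each crash type occurs and pick the most frequent one.
crash_type_counts = toy["Crash Type"].value_counts()
most_common_crash_type = crash_type_counts.idxmax()

print(f"Average age: {average_age:.1f}")
print(f"Most common crash type: {most_common_crash_type}")
```

Running the same two computations on the redacted CSV would give dataset-wide figures to compare with the chatbot's answers.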
