This repository contains supplementary materials for the following conference paper:
Shreya Singhal, Andres Felipe Zambrano, Maciej Pankiewicz, Xiner Liu, Chelsea Porter and Ryan S. Baker De-Identifying Student Personally Identifying Information with GPT-4 In Proceedings of the 16th International Conference on Educational Data Mining (EDM 2024).
This research project focuses on utilizing Large Language Models (LLMs) to remove personally identifiable data (PII) from forum posts.
original_files
: These files contain the original data with PII.
human_redacted_files
: These are the files that humans have de-identified.
results
: These are the files that are generated by de_identified_csv_generator.py
prompts.csv
: This CSV file includes a list of prompts to be used during the de-identification process and for evaluation.
-
de_identified_csv_generator.py
This component takes as input an original file. It generates an OpenAI de-identified CSV file as output for every prompt. -
de_identified_csv_evaluator.py
This component evaluates the reliability of the de-identification process using GPT4. It takes files from theresults/OpenAI_redacted_files
folder (created by the De-identified CSV Generator) andhuman_redacted_files
folder as input. The evaluator creates/updates the metrics.csv file with metrics for each file and each prompt.
Before getting started, ensure you have the following prerequisites:
Python Environment: Make sure you have Python installed on your system.
OpenAI API Key: Obtain an API key from OpenAI to access their GPT-4 model for text generation. You can sign up for an API key on the OpenAI platform.
Input Data: Prepare the following input data:
original_files
: These files contain the original data with PII.
human_redacted_files
: These are the files that humans have de-identified.
Here's how to initiate the project:
Step 1: Organize Data Place your original files with PII in the original_files folder and human redacted files in human_redacted_files folder. We have artificially created some sample files for your reference in the folder.
Step 2: cd to the repository and add OpenAI API Key to de_identified_csv_generator.py
Step 3: Run the de_identified_csv_generator.py
Execute the de_identified_csv_generator script, providing the necessary input files and the output folder (which will be automatically created):
This script will process the files, remove PII, and generate an OpenAI de-identified CSV file within an OpenAI_redacted_files folder inside the results folder.
Step 4: Run the de_identified_csv_evaluator.py
To evaluate the accuracy of the de-identification process, run the De-identified CSV Evaluator script:
This script will analyze the de-identified CSV files in the results folder and update the metrics csv with accuracy, precision, recall and kappa values.