De-identification Project

Overview

This repository contains supplementary materials for the following conference paper:

Shreya Singhal, Andres Felipe Zambrano, Maciej Pankiewicz, Xiner Liu, Chelsea Porter, and Ryan S. Baker. De-Identifying Student Personally Identifying Information with GPT-4. In Proceedings of the 16th International Conference on Educational Data Mining (EDM 2024).

Objective

This research project uses Large Language Models (LLMs) to remove personally identifiable information (PII) from forum posts.

Project Structure

Data

original_files: These files contain the original data with PII.

human_redacted_files: These are the files that humans have de-identified.

results: These are the files generated by de_identified_csv_generator.py.

prompts.csv: This CSV file includes a list of prompts to be used during the de-identification process and for evaluation.

Code Files

de_identified_csv_generator.py: This component takes an original file as input and generates an OpenAI de-identified CSV file as output for every prompt.

de_identified_csv_evaluator.py: This component evaluates the reliability of the de-identification process using GPT-4. It takes files from the results/OpenAI_redacted_files folder (created by de_identified_csv_generator.py) and the human_redacted_files folder as input. The evaluator creates or updates the metrics.csv file with metrics for each file and each prompt.
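
The "creates or updates" behaviour of metrics.csv can be pictured as an upsert keyed by file and prompt. The sketch below is only an illustration under assumptions: the column names (file, prompt) and the function name upsert_metrics are hypothetical and are not taken from de_identified_csv_evaluator.py.

```python
import csv
from pathlib import Path

# Hypothetical column layout; the real metrics.csv may use different headers.
FIELDS = ["file", "prompt", "accuracy", "precision", "recall", "kappa"]


def upsert_metrics(metrics_csv, file_name, prompt_id, metrics):
    """Insert or replace the row for (file_name, prompt_id); return the row count."""
    path = Path(metrics_csv)
    rows = {}
    # Load any existing rows, keyed by (file, prompt).
    if path.exists():
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                rows[(row["file"], row["prompt"])] = row
    # Add a new row or overwrite the existing one for this key.
    new_row = {"file": file_name, "prompt": str(prompt_id)}
    for name in FIELDS[2:]:
        new_row[name] = str(metrics[name])
    rows[(file_name, str(prompt_id))] = new_row
    # Rewrite the whole file so it always has one row per (file, prompt).
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows.values())
    return len(rows)
```

Rewriting the file on each call keeps one row per file and prompt, so re-running an evaluation replaces stale numbers instead of appending duplicates.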

Prerequisites

Before getting started, ensure you have the following prerequisites:

Python Environment: Make sure you have Python installed on your system.

OpenAI API Key: Obtain an API key from OpenAI to access their GPT-4 model for text generation. You can sign up for an API key on the OpenAI platform.

Input Data: Prepare the following input data:

original_files: These files contain the original data with PII.

human_redacted_files: These are the files that humans have de-identified.

Getting Started

Here's how to initiate the project:

Step 1: Organize the data. Place your original files (with PII) in the original_files folder and the human-redacted files in the human_redacted_files folder. Artificially created sample files are included in these folders for reference.

Step 2: Add the API key. Change into the repository directory (cd) and add your OpenAI API key to de_identified_csv_generator.py.
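
How the script stores the key is not shown here. If you prefer not to hard-code it, one common alternative is to read it from an environment variable. The sketch below assumes a variable named OPENAI_API_KEY and a hypothetical helper load_api_key; neither is confirmed to be part of this repository.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Failing early avoids a confusing authentication error later on.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

Keeping the key out of the source file also reduces the risk of committing it to version control by accident.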

Step 3: Run de_identified_csv_generator.py. Execute the script, providing the necessary input files and the output folder (which is created automatically). The script processes the files, removes PII, and writes an OpenAI de-identified CSV file to the OpenAI_redacted_files folder inside the results folder.
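
The overall flow of this step (one output CSV per prompt) might look roughly like the sketch below. It is not the repository's implementation: the text column name, the output file naming, and the redact_fn callable are all assumptions, and the model call is replaced by a stub so the example runs offline. A real redact_fn would wrap a call to the OpenAI API.

```python
import csv
from pathlib import Path


def generate_redacted_csvs(original_csv, prompts, out_dir, redact_fn, text_column="text"):
    """Write one redacted CSV per prompt and return the list of output paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # output folder is created if missing
    with open(original_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    written = []
    for i, prompt in enumerate(prompts, start=1):
        # Hypothetical naming scheme: <input stem>_prompt<i>.csv
        out_path = out_dir / f"{Path(original_csv).stem}_prompt{i}.csv"
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=[text_column, "redacted"])
            writer.writeheader()
            for row in rows:
                text = row[text_column]
                writer.writerow({text_column: text, "redacted": redact_fn(prompt, text)})
        written.append(out_path)
    return written


def fake_redact(prompt, text):
    """Offline stand-in for the model call; it only masks one hard-coded name."""
    return text.replace("Alice", "[REDACTED]")
```

Passing the model call in as a function keeps the file handling testable without network access or an API key.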

Step 4: Run de_identified_csv_evaluator.py. To evaluate the accuracy of the de-identification process, run the evaluator script. It analyzes the de-identified CSV files in the results folder and updates metrics.csv with accuracy, precision, recall, and kappa values.
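
For readers unfamiliar with these four metrics, the sketch below shows how they can be computed from two aligned lists of binary labels (1 = redacted, 0 = not redacted), with the human redaction treated as the reference. The unit of comparison and the function name binary_agreement_metrics are assumptions for illustration; the evaluator script may compute them differently.

```python
def binary_agreement_metrics(human, model):
    """Accuracy, precision, recall, and Cohen's kappa for two binary label lists."""
    if len(human) != len(model) or not human:
        raise ValueError("Label lists must be non-empty and of equal length.")
    pairs = list(zip(human, model))
    tp = sum(1 for h, m in pairs if h and m)          # both redacted
    tn = sum(1 for h, m in pairs if not h and not m)  # both left as-is
    fp = sum(1 for h, m in pairs if not h and m)      # model redacted, human did not
    fn = sum(1 for h, m in pairs if h and not m)      # human redacted, model did not
    n = len(pairs)
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Chance agreement: both say "redacted" plus both say "not redacted".
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    # Kappa is undefined when chance agreement is 1; 0.0 is reported here by convention.
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "kappa": kappa}
```

Kappa is useful alongside accuracy because most text in a forum post is usually not PII, so two annotators can agree often by chance alone.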
