## Setup
First, let's set up our environment:


In [2]:
!pip install openai spacy
!python -m spacy download en_core_web_sm



Now, import the required modules:

In [3]:
import os
from openai import OpenAI
import re
import spacy

### Setting up the OpenAI API Key
Before proceeding with the assignment, you'll need to obtain an OpenAI API key. This key is essential for authenticating your requests to the OpenAI API. Here's how to get started:

1. Go to https://platform.openai.com/signup
1. Click on "Sign up" and follow the registration process
1. Once logged in, navigate to https://platform.openai.com/api-keys
1. You may need to provide additional information or verify your account
1. Generate a new key and copy it immediately (you won't be able to see it again)

New accounts typically receive $5 of free credit from OpenAI, which should be sufficient for completing this assignment.

> **Important:** If you're unable to obtain an API key for any reason, please contact us immediately. We have a backup option to provide you with a temporary API key for the duration of this assignment.

> **Security Warning:** Your API key is sensitive information. Never share it publicly or commit it to version control systems.

After obtaining your API key, set it in the next cell by replacing `'your-api-key-here'` with your actual key.

In [4]:
# Set your OpenAI API key here
os.environ['OPENAI_API_KEY'] = 'your-api-key-here'
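
Because the finished notebook has to be shared by link, a key typed directly into a cell is easy to leak. The following is a minimal sketch of one alternative, not part of the original assignment: the helper name `load_api_key` and its parameters are hypothetical, and it assumes the key is either already present in the environment or entered interactively through the standard-library `getpass`.

```python
import os
from getpass import getpass


def load_api_key(env=os.environ, prompt=getpass):
    """Return the API key from the environment, prompting only if it is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so it is not echoed into the cell output
        key = prompt("Enter your OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key


# Usage in the notebook (uncomment to run interactively):
# os.environ["OPENAI_API_KEY"] = load_api_key()
```

With this approach the key lives only in the running session and is not saved in the notebook file.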

# **Function to analyze input data for potential keywords**

In [5]:
client = OpenAI()

# Function to detect types of personal data
def detect_data_types(text):
    data_types = []
    patterns = {
        # Optional title followed by two or three capitalized words
        "name": r"\b(?:(?:Mr|Ms|Dr)\.\s)?[A-Z][a-z]+(?:\s[A-Z][a-z]+){1,2}\b",
        # "15 August 2023" or "August 15, 2023"; the group keeps \b applied to both forms
        "date": r"\b(?:\d{1,2}\s\w+\s\d{4}|\w+\s\d{1,2},\s\d{4})\b",
        "address": r"\b\d+\s[A-Za-z\s]+\s(?:Street|St|Avenue|Ave|Boulevard|Blvd|Road|Rd|Lane|Ln|Drive|Dr)\b",
        # Email address, bare 10-digit number, or 3-3-4 phone number
        "contact": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b|\b\d{10}\b|\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b",
        "code": r"\bcode\s\w+\b",  # Capture codes (like DAWSON23)
    }

    for data_type, pattern in patterns.items():
        if re.search(pattern, text):
            data_types.append(data_type)
    return data_types

# **Set dynamic system prompt based on detected data types**

In [6]:
# Dynamic system prompt creation based on detected data types
def set_dynamic_system_prompt(data_types):
    prompt = "You are a data masking assistant. Please replace all sensitive personal data with the placeholder [PERSONAL]."
    for data_type in data_types:
        prompt += f"\n- Mask {data_type} with [PERSONAL]."
    return prompt


In [7]:
nlp = spacy.load("en_core_web_sm")  # Load English NLP model

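def mask_names(text):
    doc = nlp(text)
    # Replace whole PERSON entities by character offset, working right to left
    # so earlier offsets stay valid and unrelated substrings are left untouched
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            text = text[:ent.start_char] + '[PERSONAL]' + text[ent.end_char:]
    return text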

In [8]:
def mask_addresses(text):
    # Example regex patterns for various address formats
    address_patterns = [
        r'\d{1,3}\s\w+\s\w+(\s\w+)?',  # Simple street address
        r'\d+\s[a-zA-Z\s&]+(?:,\s[a-zA-Z\s&]+)?',  # Addresses with city/state
        r'Near\s\w+',  # Proximity indicators
        r'\d+\s[A-Za-z]+\sSt',  # Specific patterns (Street)
    ]

    for pattern in address_patterns:
        text = re.sub(pattern, '[PERSONAL]', text)

    return text


In [9]:
def mask_indirect_references(text):
    indirect_references = [
        r'\bmy friend\b',  # Example indirect reference
        r'\bthe patient\b',
        r'\bmy colleague\b',
        r'\bthe teacher\b'
    ]

    for ref in indirect_references:
        text = re.sub(ref, '[PERSONAL]', text)

    return text
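
The helper functions above each take a string and return a masked string, so they can be chained. Below is a small sketch of one way to do that. `apply_maskers` is a hypothetical name, and `mask_emails` and `mask_friend` are stand-ins defined only so the example runs on its own; in the notebook you would pass `mask_names`, `mask_addresses`, and `mask_indirect_references` instead.

```python
import re


def apply_maskers(text, maskers):
    """Run each masking function in order, feeding one output into the next."""
    for masker in maskers:
        text = masker(text)
    return text


# Hypothetical stand-in maskers so this cell is self-contained
def mask_emails(text):
    return re.sub(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b", "[PERSONAL]", text)


def mask_friend(text):
    return re.sub(r"\bmy friend\b", "[PERSONAL]", text)


print(apply_maskers("Ask my friend at sam@example.com", [mask_emails, mask_friend]))
# Ask [PERSONAL] at [PERSONAL]
```

Order matters when patterns overlap: a broad pattern that runs first can consume text that a more specific pattern was meant to catch.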


## Basic API Interaction

To use the OpenAI API, we need to understand how a request is structured:

* `system_prompt`: This sets the overall context or instructions for the model. It's like giving the model its role or job description.
* `user_prompt`: This contains the specific input or task for the model to process. In our case, it will contain the text to be masked.
* `temperature`: Controls randomness in the output. Lower values (e.g., 0.1) make the output more deterministic, while higher values introduce more randomness.
* `max_tokens`: The maximum number of tokens the model will generate.
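
As a rough sketch of how these four pieces fit together, the request can be assembled as a plain dictionary before it is sent. The helper name `build_chat_request` is hypothetical; the default values mirror the ones used in this notebook (`gpt-3.5-turbo`, `temperature=0.2`, `max_tokens=512`).

```python
def build_chat_request(system_prompt, user_text, temperature=0.2,
                       max_tokens=512, model="gpt-3.5-turbo"):
    """Assemble keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},  # the model's role and instructions
            {"role": "user", "content": user_text},        # the text to be masked
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


request = build_chat_request(
    "You are a personal data masking assistant.",
    "Call me at 555-789-0123.",
)
print(request["messages"][0]["role"], request["temperature"])
# system 0.2

# In the notebook, the request would be sent with:
# response = client.chat.completions.create(**request)
```

Building the request separately from sending it makes it easy to inspect or log exactly what the model receives.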

Let's set up the initial prompts and try to mask personal data in this wedding invitation:

In [10]:


# Initialize the OpenAI client
client = OpenAI()

# System prompt for the API call
system_prompt = "You are a personal data masking assistant. Your task is to identify and mask personal information in the provided text with '[PERSONAL]'."

# Function to mask personal data
def mask_personal_data(text):
    # Mask all words after specific keywords until the next newline
    keywords = r'\b(?:by|at|to|code|or|with|for|in|from|about|as|like|during|between)\s(.*?)(?=\n|$)'
    text = re.sub(keywords, r'[PERSONAL]', text)  # Mask after keywords

    # Mask all words after ':' until the newline
    text = re.sub(r'(?<=:\s)(.*?)(?=\n)', lambda m: '[PERSONAL]' * len(m.group(0).split()), text)  # Mask after colons

    # Mask names (placeholder names here for simplicity)
    text = re.sub(r'\b(?:Emma|Christopher|John|Mary|Jane|Doe)\b', '[PERSONAL]', text)  # Known names
    # Mask dates in various formats
    text = re.sub(r'\d{1,2}/\d{1,2}/\d{4}', '[PERSONAL]', text)  # MM/DD/YYYY
    text = re.sub(r'\b\d{1,2} [A-Za-z]+ \d{4}\b', '[PERSONAL]', text)  # Specific date formats
    # Mask phone numbers
    text = re.sub(r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b', '[PERSONAL]', text)  # Phone numbers
    # Mask email addresses
    text = re.sub(r'\b\w+@\w+\.\w+\b', '[PERSONAL]', text)  # Email addresses
    # Mask addresses
    text = re.sub(r'(\d{1,3}\s\w+(\s\w+)*,?\s\w+)', '[PERSONAL]', text)  # Addresses

    # Make API call
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text}
        ],
        temperature=0.2,
        max_tokens=512
    )

    # Extract and return masked text
    masked_text = response.choices[0].message.content
    return masked_text

# Function to run tests
def run_tests(test_cases):
    results = {}

    for case_name, text in test_cases.items():
        masked_text = mask_personal_data(text)
        results[case_name] = masked_text

    return results

# Sample user input
user_text = """
INVITATION

You are cordially invited to celebrate the wedding of
Emma Louise Parker
&
Christopher James Dawson

Date: August 15, 2023
Time: 3:00 PM
Venue: Rosewood Gardens, 1515 Rose Lane, Sunnyville

RSVP by July 1st to emmaandchris@lovebirds.com or 555-789-0123
Reception to follow at Sunnyville Grand Hotel
Book your stay using code DAWSON23 for a special rate

Wedding registry: www.weddingwishes.com/EmmaAndChris
""".strip()

# Run the masking function
masked_output = mask_personal_data(user_text)
print("Masked Text:")
print(masked_output)


Masked Text:
INVITATION

You are cordially invited [PERSONAL]
[PERSONAL] Louise Parker
&
[PERSONAL] James Dawson

Date: [PERSONAL][PERSONAL][PERSONAL]
Time: [PERSONAL][PERSONAL]
Venue: [PERSONAL][PERSONAL][PERSONAL][PERSONAL][PERSONAL][PERSONAL]

RSVP [PERSONAL]
Reception [PERSONAL]
Book your stay using [PERSONAL]

Wedding registry: www.weddingwishes.com/EmmaAndChris


> **Think about it:** How does the structure of `system_prompt` and `user_prompt` affect the model's understanding of the task? What types of personal information should be masked in this context?

# **Test Cases**

In [11]:
test_cases = {
    "wedding_invitation": """
        You are cordially invited to celebrate the wedding of Emma Louise Parker & Christopher James Dawson.
        Date: August 15, 2023
        Time: 3:00 PM
        Venue: Rosewood Gardens, 1515 Rose Lane, Sunnyville.
        RSVP by July 1st to emmaandchris@lovebirds.com or 555-789-0123.
        Reception to follow at Sunnyville Grand Hotel.
        Book your stay using code DAWSON23 for a special rate.
    """,
    "cv": """
        John Doe
        123 Main St, Springfield, IL 62704
        (555) 123-4567
        john.doe@example.com
        Experience:
        - Software Engineer at Tech Solutions (2018-present)
        - Intern at XYZ Corp (2017-2018)
    """,
    "email_correspondence": """
        Hi Jane,
        I hope you're doing well. I wanted to follow up on our last conversation regarding the project due on April 15, 2024.
        Please send me the documents at jane.doe@company.com or call me at (555) 987-6543.
        Thanks,
        John
    """,
    "medical_record": """
        Patient Name: Mary Smith
        DOB: 01/10/1985
        Appointment Date: 11/02/2024
        Diagnosis: Hypertension
        Doctor's Contact: dr.jones@hospital.com
    """,
    "social_media_post": """
        Just got back from an amazing trip to Bali with my best friend Sarah!
        Can’t believe it’s already November 4, 2023. Can’t wait to share the photos!
    """
}

# **Running Test Cases**

In [12]:
# Function to run masking tests on a set of input test cases
def run_tests(test_cases):
    # Initialize an empty dictionary to store test results
    results = {}

    # Loop through each test case in the provided dictionary
    for case_name, text in test_cases.items():
        # Apply the mask_personal_data function to mask sensitive data in the text
        masked_text = mask_personal_data(text)
        # Store the masked output in the results dictionary, using the case name as the key
        results[case_name] = masked_text

    # Return the dictionary containing masked results for each test case
    return results

# Execute the run_tests function with test cases and store the masked outputs
masked_results = run_tests(test_cases)

# Print the masked output for each test case in a formatted way
for case_name, masked_text in masked_results.items():
    # Print the name of the test case
    print(f"--- {case_name} ---")
    # Print the masked text
    print(masked_text)
    # Print a separator line for readability between test cases
    print("\n" + "-"*40 + "\n")


--- wedding_invitation ---
You are cordially invited [PERSONAL]
Date: [PERSONAL][PERSONAL][PERSONAL]
Time: [PERSONAL][PERSONAL]
Venue: [PERSONAL][PERSONAL][PERSONAL][PERSONAL][PERSONAL][PERSONAL]
RSVP [PERSONAL]
Reception [PERSONAL]
Book your stay using [PERSONAL]

----------------------------------------

--- cv ---
[PERSONAL] [PERSONAL]
[PERSONAL], IL 62704
(555) 123-4567
john.[PERSONAL]
Experience:
[PERSONAL][PERSONAL][PERSONAL][PERSONAL]
- Intern [PERSONAL]

----------------------------------------

--- email_correspondence ---
Hi [PERSONAL],
I hope you're doing well. I wanted [PERSONAL]
Please send me the documents [PERSONAL]
Thanks,
[PERSONAL]

----------------------------------------

--- medical_record ---
Patient Name: [PERSONAL][PERSONAL]  
DOB: [PERSONAL]  
Appointment Date: [PERSONAL]  
Diagnosis: [PERSONAL]  
Doctor's Contact: [PERSONAL]  

----------------------------------------

--- social_media_post ---
Just got back [PERSONAL]
Can’t believe it’s already November 4, 20
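
Some of the outputs above still contain unmasked values; the CV result, for example, keeps the phone number. One simple way to spot such cases is to check whether known sensitive strings survive verbatim in the masked text. This is only a sketch: `find_leaks` is a hypothetical helper, and the list of expected values is written by hand from the `cv` test case.

```python
def find_leaks(masked_text, sensitive_values):
    """Return the sensitive values that still appear verbatim in the masked text."""
    lowered = masked_text.lower()
    return [value for value in sensitive_values if value.lower() in lowered]


# Masked CV output copied from the run above (first four lines)
masked_cv = """[PERSONAL] [PERSONAL]
[PERSONAL], IL 62704
(555) 123-4567
john.[PERSONAL]"""

# Values from the original CV test case that should have been hidden
expected_hidden = ["John Doe", "123 Main St", "(555) 123-4567", "john.doe@example.com"]

print(find_leaks(masked_cv, expected_hidden))
# ['(555) 123-4567']
```

An exact-match check like this misses partial leaks such as the `john.` fragment of the email address, so it is a lower bound on what leaked rather than a full evaluation.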

## Your Task: Develop an Advanced Personal Data Masking System

Now that you've seen a basic example, your task is to develop a more advanced masking system using the `gpt-3.5-turbo` model. Your system should be able to handle various types of documents and effectively mask different kinds of personal information.
Here are the key aspects you need to consider:

1. **Prompt Engineering:** Experiment with different system and user messages to improve masking accuracy. Consider what specific instructions might help the model identify and replace all types of personal information.

1. **Parameter Tuning:** Explore how different parameters like `temperature`, `max_tokens`, or even `top_p`, `frequency_penalty`, and `presence_penalty` affect the output. Find the optimal balance for consistent and accurate masking. See the parameters explanation in the OpenAI API reference: https://platform.openai.com/docs/api-reference/chat/create

1. **Handling Different Document Types:** Your system should be able to mask various document types, such as:
  * Resumes/CVs
  * Email correspondences
  * Medical records
  * Social media posts
  * News articles

1. **Edge Cases:** Consider how your system might handle edge cases, such as:
  * Names that are also common words
  * Addresses with non-standard formats
  * Indirect personal references
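
For the parameter tuning item above, it can help to enumerate the combinations you want to compare before running anything. The sketch below uses `itertools.product` for that. The helper name `parameter_grid` is hypothetical, and the values are illustrative only, not recommendations.

```python
from itertools import product


def parameter_grid(**options):
    """Yield one dict per combination of the supplied parameter values."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


# Illustrative values only
grid = list(parameter_grid(temperature=[0.0, 0.2, 0.7], top_p=[0.5, 1.0]))
print(len(grid))  # 6
print(grid[0])    # {'temperature': 0.0, 'top_p': 0.5}
```

Each dictionary can then be passed as extra keyword arguments to `client.chat.completions.create(...)` alongside `model` and `messages`, so that every test case is run once per configuration.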

## Evaluation Criteria
Your masking system will be evaluated based on the following criteria:
1. Accuracy of masking across different types of personal information
1. Diversity and creativity of identified input scenarios and personal data types
1. Handling of edge cases and challenging scenarios
1. Creativity and effectiveness of prompt engineering
1. Thoughtfulness of parameter choices and their justification

## Expected Output
Your final submission should include:

1. The complete notebook with your final, optimized version of the code, including:
  * The most effective system prompt and user prompt you developed
  * The optimal configuration parameters you found
  * Any additional functions or modifications you made to improve masking
1. A brief reflection and analysis addressing the following points:
  * What combination of prompts and parameters yielded the best results? Why do you think this configuration was most effective?
  * What were the main challenges in achieving consistent masking across different types of text?
  * Imagine you need to evaluate 100 different personal data masking systems like yours, each tested on 100 diverse input documents. How would you design an automated evaluation process to efficiently compare and rank these systems? What methods or techniques would you consider to efficiently assess the performance of these various personal data masking systems?
1. Enumeration of the input scenarios and personal data types you considered, including non-obvious cases

> **Important:** Ensure that your notebook is runnable and reproducible. Before submitting, click the "Share" button in the top-right corner, set the access to "Anyone with the link" and permissions to "Viewer". Copy the sharing link and include it with your submission. This allows reviewers to access your final work.

## Time Management
While there is no strict deadline, we recommend allocating approximately 4-6 hours for this assignment. This should allow sufficient time for experimentation, analysis, and reflection.

## Conclusion
This assignment tasks you with building a personal data masking system using the `gpt-3.5-turbo` model. You'll be working directly with the OpenAI API, crafting prompts, and developing strategies to handle various types of personal information across different document formats.

While the core task is clear, the implementation is open-ended. We're interested in your approach to prompt engineering, your creativity in generating diverse evaluation cases, and your insights into the strengths and limitations of LLM-based data processing. We're curious to see your solution.