# Resume Anonymization Script

## Overview
This script **anonymizes Personally Identifiable Information (PII)** in resumes. It reads resumes from a CSV file, detects sensitive information, and replaces it with placeholders to protect candidates' privacy.

## Features
- **Reads resume data from a CSV file**.
- **Identifies PII elements** such as names, email addresses, phone numbers, and URLs.
- **Replaces PII with generic placeholders** such as `[REDACTED NAME]` and `[REDACTED EMAIL]`.
- **Exports the anonymized data** into a new CSV file for further processing or sharing.
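
The placeholder idea can be sketched without any third-party packages. The snippet below is a simplified illustration only: the two regular expressions and the helper name `replace_with_placeholders` are assumptions made for this example, and they are far less thorough than the Presidio-based detection the script uses.

```python
import re

# Simplified patterns, assumed for illustration only.
# The script itself relies on Presidio for detection, not on these regexes.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def replace_with_placeholders(text):
    """Replace email-like and phone-like substrings with fixed placeholders."""
    text = EMAIL_PATTERN.sub("[REDACTED EMAIL]", text)
    text = PHONE_PATTERN.sub("[REDACTED PHONE]", text)
    return text


sample = "Contact: jane.doe@example.com or 555-123-4567"
print(replace_with_placeholders(sample))
# prints: Contact: [REDACTED EMAIL] or [REDACTED PHONE]
```

Regexes like these work for rigid formats such as emails but cannot recognize names, which is presumably why the script delegates detection to a dedicated analyzer.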

## Why Anonymization?
Protecting PII is crucial in resume processing, especially when handling large datasets for **machine learning models, recruitment analysis, or compliance with data privacy regulations (e.g., GDPR, CCPA)**.

## Installation & Setup
1. **Clone the repository**:
   ```bash
   git clone https://github.com/hantayc/mirra_matcher.git
   cd mirra_matcher
   ```

2. **Set up a virtual environment (if needed)**:
   ```bash
   python -m venv mirra_env
   source mirra_env/bin/activate  # Windows: mirra_env\Scripts\Activate
   ```
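
Once the environment is active, it may help to confirm that the packages the script imports are available before running it. The check below is a minimal sketch: the module names come from the script's import statements, while the helper `find_missing_modules` is a hypothetical name introduced here.

```python
import importlib.util

# Top-level module names taken from the script's import statements.
REQUIRED_MODULES = ["pandas", "nltk", "presidio_analyzer", "presidio_anonymizer"]


def find_missing_modules(module_names):
    """Return the names in `module_names` that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Any module reported as missing needs to be installed into the virtual environment before the script will run.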


import re
import pandas as pd
import nltk
from nltk.corpus import words
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Download word list (only required once)
nltk.download('words')
english_words = set(words.words())

# Initialize Presidio components
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def clean_text_spacing(text):
    """Inserts missing spaces at lowercase/uppercase and letter/digit boundaries."""
    if pd.isna(text):
        return text

    text = text.strip()  # Remove leading/trailing whitespace

    # Add space between lowercase and uppercase transitions (e.g., "ProjectManagement" → "Project Management")
    text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)

    # Add space between a letter and a following digit (e.g., "June2022" → "June 2022").
    # Only letters are matched, so punctuation inside phone numbers is left intact.
    text = re.sub(r'([A-Za-z])(\d)', r'\1 \2', text)

    # Add space between a digit and a following letter (e.g., "1049E-mail" → "1049 E-mail")
    text = re.sub(r'(\d)([A-Za-z])', r'\1 \2', text)

    # Normalize multiple spaces to a single space
    text = re.sub(r'\s+', ' ', text).strip()

    return text

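def anonymize_first_words(text, num_words=3):
    """Redacts any of the first `num_words` words that is not in the NLTK English word list.

    Heuristic: treats unknown leading words as a likely name.
    """
    words_list = text.split()

    for i in range(min(num_words, len(words_list))):
        # Ignore surrounding punctuation so that "Manager," is looked up as "manager"
        token = words_list[i].strip(".,;:!?()\"'").lower()
        if token and token not in english_words:
            words_list[i] = "[REDACTED NAME]"

    return " ".join(words_list)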

def anonymize_resume(text):
    """Applies Presidio anonymization after fixing spacing and first-word checks."""
    if pd.isna(text):
        return text

    # Preprocess text to fix spacing issues
    text = clean_text_spacing(text)

    # Aggressively redact the first three words if they are not real English words
    text = anonymize_first_words(text, num_words=3)

    # Analyze text with Presidio
    analyzer_results = analyzer.analyze(
        text=text, 
        entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "URL"], 
        language="en"
    )

    # Define anonymization rules
    operators = {
        "DEFAULT": OperatorConfig("replace", {"new_value": "[REDACTED]"}),  
        "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[REDACTED PHONE]"}),
        "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "[REDACTED EMAIL]"}),
        "PERSON": OperatorConfig("replace", {"new_value": "[REDACTED NAME]"}),
        "URL": OperatorConfig("replace", {"new_value": "[REDACTED LINK]"})  
    }

    # Apply Presidio anonymization
    anonymized_result = anonymizer.anonymize(text=text, analyzer_results=analyzer_results, operators=operators)

    return anonymized_result.text

# Load Resume Data
df = pd.read_csv('1200 Resumes 2024.csv')

def preprocess_text(text):
    """Removes the BOM (byte order mark) and collapses runs of whitespace."""
    if pd.isna(text):
        return text
    
    # Remove BOM (Byte Order Mark) if present
    text = text.replace("\ufeff", "").strip()

    # Normalize spaces
    text = re.sub(r'\s+', ' ', text)

    return text

# 1. Strip the BOM and normalize whitespace before any other processing
df["resume_clean"] = df["resume_clean"].apply(preprocess_text)

# 2. Fix missing spaces across the entire dataset
df["cleaned_resume"] = df["resume_clean"].apply(clean_text_spacing)

# 3. Anonymize the entire dataset (first-word check, then Presidio)
df["anonymized_resume"] = df["cleaned_resume"].apply(anonymize_resume)

# 4. Save the anonymized data, dropping the columns that still contain the original text
df.drop(columns=["resume_clean", "cleaned_resume"]).to_csv('1200 Resumes 2024 Anonymized.csv', index=False)
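
Automated detection can miss entities, so a spot check of the anonymized text may be worthwhile before sharing the output. The sketch below is one possible approach, not part of the original script: the patterns and the helper `find_leftover_pii` are assumptions for illustration, and it operates on a plain list of strings such as the values of the `anonymized_resume` column.

```python
import re

# Simplified patterns, assumed for illustration; they only catch obvious leftovers.
LEFTOVER_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def find_leftover_pii(texts):
    """Return (row_index, kind) pairs for texts that still match a pattern."""
    hits = []
    for index, text in enumerate(texts):
        if not isinstance(text, str):
            continue  # Skip missing values such as None or NaN
        for kind, pattern in LEFTOVER_PATTERNS.items():
            if pattern.search(text):
                hits.append((index, kind))
    return hits


rows = [
    "[REDACTED NAME] Project Manager [REDACTED EMAIL]",
    "Reach me at someone@example.org",
    None,
]
print(find_leftover_pii(rows))
# prints: [(1, 'email')]
```

With the DataFrame from the script, the same check could be run as `find_leftover_pii(df["anonymized_resume"].tolist())`. An empty result does not prove the data is free of PII, since names in particular are not covered by these patterns.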

