# Resume Anonymization Script

## Overview
This script **anonymizes Personally Identifiable Information (PII)** in resumes. It reads a CSV file of resumes, detects sensitive information, and replaces it with placeholders to protect candidates' privacy.

## Features
- **Reads resume data from a CSV file**.
- **Identifies PII elements** such as names, emails, phone numbers, and addresses.
- **Replaces PII with generic placeholders** such as `[REDACTED NAME]`, `[REDACTED PHONE]`, and `[REDACTED EMAIL]`.
- **Exports the anonymized data** into a new CSV file for further processing or sharing.
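
The read, redact, and export steps listed above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the script's actual implementation: the `redact` and `anonymize_csv` helpers, the simplified patterns, and the in-memory CSV are all hypothetical, and only emails and phone numbers are handled. The `resume_clean` column name is the one used in the notebook cells further down.

```python
import csv
import io
import re

# Simplified, hypothetical patterns; names and addresses need more than a regex.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}")
PHONE = re.compile(r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def redact(text):
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL.sub("[REDACTED EMAIL]", text)
    return PHONE.sub("[REDACTED PHONE]", text)


def anonymize_csv(src, dst, column="resume_clean"):
    """Copy CSV rows from src to dst, redacting the given column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = redact(row[column])
        writer.writerow(row)


# In-memory stand-ins for the input and output files (made-up data).
source = io.StringIO(
    "id,resume_clean\n"
    "1,Reach me at jane.doe@example.com or 555-123-4567\n"
)
target = io.StringIO()
anonymize_csv(source, target)
print(target.getvalue())
```

With real files, `source` and `target` would be handles opened with `open(..., newline="")`.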

## Why Anonymization?
Protecting PII is crucial in resume processing, especially when handling large datasets for **machine learning models, recruitment analysis, or compliance with data privacy regulations (e.g., GDPR, CCPA)**.

## Installation & Setup
1. **Clone the repository**:
   ```bash
   git clone https://github.com/hantayc/mirra_matcher.git
   cd mirra_matcher
   ```

2. **Set up a virtual environment (if needed)**:
   ```bash
   python -m venv mirra_env
   source mirra_env/bin/activate  # Windows: mirra_env\Scripts\Activate
   ```


In [32]:
import re

import pandas as pd
import spacy

# Load spaCy's English model (install it once with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Define regex patterns for PII
PHONE_PATTERN = r"(\+?1[-.\s]?)?(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})"
EMAIL_PATTERN = r"([a-zA-Z0-9._%+-]+@\S+\.[a-zA-Z]{2,7})"
NAME_PATTERN = r"\b[A-Z][a-zA-Z]+\b(?:\s[A-Z][a-zA-Z]+)?"

def clean_text_spacing(text):
    """Fixes spacing issues like missing spaces after names and numbers."""
    text = text.strip()
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)  # Space between lowercase-uppercase transitions
    text = re.sub(r"([A-Za-z])(\d)", r"\1 \2", text)  # Space between a letter and a following digit
    text = re.sub(r"(\d)([A-Za-z])", r"\1 \2", text)  # Space between a digit and a following letter
    text = re.sub(r"\s+", " ", text)  # Remove extra spaces
    return text.strip()

JOB_TITLES = {
    "Cybersecurity Specialist", "Data Engineer", "Software Engineer", "Project Manager",
    "Senior Manager", "Business Analyst", "Product Manager", "Consultant", "Data Scientist"
}  # Expand this list based on real job titles

def anonymize_names(text):
    """Replaces detected names in the first four words with [REDACTED NAME], but skips job titles."""
    text = clean_text_spacing(text)
    words = text.split()
    first_four_words = " ".join(words[:4])  # Slicing already handles texts shorter than four words

    doc = nlp(first_four_words)

    detected_names = []

    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.text.strip() not in JOB_TITLES:
            detected_names.append(ent.text.strip())

    # Backup regex for names
    potential_names = re.findall(NAME_PATTERN, first_four_words)
    for name in potential_names:
        if name not in JOB_TITLES:
            detected_names.append(name)

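    # Remove duplicates, then redact longer matches first so that a full name
    # is replaced before any of its parts (set order alone is not deterministic)
    for name in sorted(set(detected_names), key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(name)}\b", "[REDACTED NAME]", text, count=1)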

    return text

def anonymize_phone_numbers(text):
    """Replaces phone numbers with [REDACTED PHONE]"""
    return re.sub(PHONE_PATTERN, "[REDACTED PHONE]", text)

def anonymize_emails(text):
    """Replaces email addresses with [REDACTED EMAIL]"""
    return re.sub(EMAIL_PATTERN, "[REDACTED EMAIL]", text)

def anonymize_resume(text):
    """Master function to anonymize PII"""
    text = text.strip()
    text = anonymize_phone_numbers(text)
    text = anonymize_emails(text)
    text = anonymize_names(text)
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces caused by redactions
    return text
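
To see what the regex layer catches on its own, the two contact patterns can be run against made-up strings. The sample texts and the `redact_contacts` helper below are invented for illustration. The spaCy step is left out so the snippet runs without a model download.

```python
import re

# Patterns copied from the cell above.
PHONE_PATTERN = r"(\+?1[-.\s]?)?(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})"
EMAIL_PATTERN = r"([a-zA-Z0-9._%+-]+@\S+\.[a-zA-Z]{2,7})"


def redact_contacts(text):
    """Apply the phone and email substitutions in the same order as anonymize_resume."""
    text = re.sub(PHONE_PATTERN, "[REDACTED PHONE]", text)
    return re.sub(EMAIL_PATTERN, "[REDACTED EMAIL]", text)


us_sample = "Contact: jane.doe@example.com, +1 (555) 123-4567"
uk_sample = "Phone: +44 20 7946 0958"

print(redact_contacts(us_sample))
print(redact_contacts(uk_sample))
```

The second string comes back unchanged because `PHONE_PATTERN` expects a 3-3-4 digit grouping. Resumes with international numbers would need an additional pattern.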

In [33]:
# Load the Resume Data
df = pd.read_csv('1200 Resumes 2024.csv')

# Specify the row numbers you want to test
selected_rows = [16, 19, 25, 42]  # Change these row indices as needed

# Select the chosen rows
sample_resumes = df.loc[selected_rows, "resume_clean"]

# Apply anonymization function
df_sample = pd.DataFrame({
    "Original Resume": sample_resumes,
    "Anonymized Resume": sample_resumes.apply(anonymize_resume)
})
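
Before sharing anonymized output, it can help to scan it for anything the patterns should have caught. The `find_leaks` helper below is a hypothetical addition, not part of the original notebook, and the sample strings are made up. It only reports which texts still contain an email-like or phone-like string, so it says nothing about missed names.

```python
import re

# Patterns copied from the anonymization cell.
PHONE_PATTERN = r"(\+?1[-.\s]?)?(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})"
EMAIL_PATTERN = r"([a-zA-Z0-9._%+-]+@\S+\.[a-zA-Z]{2,7})"

CHECKS = (("phone", PHONE_PATTERN), ("email", EMAIL_PATTERN))


def find_leaks(texts):
    """Return {position: [labels]} for every text that still matches a PII pattern."""
    leaks = {}
    for position, text in enumerate(texts):
        labels = [label for label, pattern in CHECKS if re.search(pattern, text)]
        if labels:
            leaks[position] = labels
    return leaks


samples = [
    "[REDACTED NAME] [REDACTED PHONE] SUMMARY Accomplished data leader",
    "Call 215-555-0100 or mail a.b@example.org",
]
report = find_leaks(samples)
print(report)
```

With the notebook's data, the same function could be called as `find_leaks(df_sample["Anonymized Resume"])`. The keys are positions within the series, not the original row labels.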


In [34]:
# Display anonymized examples in a readable format
for i, row in df_sample.iterrows():
    print(f"\n=== ORIGINAL RESUME (Row {i}) ===")
    print(row["Original Resume"])
    print("\n=== ANONYMIZED RESUME ===")
    print(row["Anonymized Resume"])
    print("=" * 80)  # Separator line



=== ORIGINAL RESUME (Row 16) ===
﻿Shikha MalikSr. Enterprise Data Architect , USA    215-313-1722SUMMARY	Accomplished data leader with a robust background spanning over 21 years, specializing in the establishment and institutionalization of data strategies, policies, and frameworks. Acknowledged for effectively managing data to mitigate operational risks and ensuring strict regulatory compliance through innovative solutions. Skilled in stakeholder management and fostering strong relationships, leading teams to high levels of achievement. Recognized for a sharp ability to identify and streamline inefficiencies, with demonstrated strengths in data governance, strategy formulation, and meeting business needs. Eager to apply this extensive background in a role that drives organizational growth and sets new benchmarks for success.KEY COMPETENCIES· Data Strategy· Enterprise Data Architecture· Policy and Standards· Data Governance· Road-mapping· Regulatory Compliance· Data Analytics· Strateg