# Spotify Data Cleaner

This notebook cleans **personally identifiable information (PII)** from your downloaded Spotify account data so you can use it safely for class projects.

## Prerequisites

- Extract your Spotify export zip into a folder named **`Spotify Account Data`** in the same directory as this notebook, so that files like `Userdata.json`, `Identifiers.json`, and `Follow.json` sit directly inside `Spotify Account Data/`.
- Install the Faker library (`pip install Faker`) in the environment that runs this notebook.
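
Before running the cleaning cells, it can help to confirm the export is where the notebook expects it. The following is a minimal sketch, assuming the five file names that the cleaning steps read; `missing_files` and `EXPECTED_FILES` are hypothetical names, not part of the notebook.

```python
from pathlib import Path

# File names read by the cleaning steps in this notebook.
EXPECTED_FILES = [
    "Userdata.json",
    "Identifiers.json",
    "Follow.json",
    "Inferences.json",
    "Playlist1.json",
]

def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present directly inside data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]

missing = missing_files("Spotify Account Data")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected files found.")
```

If any names are reported as missing, the zip was probably extracted into a nested folder or the export does not include those files.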

## What this notebook does

1. **Replaces** identifiers (username, email, birthdate, names, etc.) with realistic fake values using the [Faker](https://faker.readthedocs.io/) library.
2. **Writes** cleaned JSON files into a new folder, **`Spotify_Account_Data_Cleaned/`**, without modifying your original data.
3. **Zips** that folder into a single file whose name is a **random number** (e.g. `71234987833478823.zip`) for anonymous submission.
4. **Prints a summary** of every replacement made at the end.

Run the cells in order from top to bottom.

In [None]:
# Imports and configuration
import json
import zipfile
import shutil
import random
import re
from pathlib import Path
from faker import Faker

# Paths: read from Spotify Account Data, write to a new cleaned folder
DATA_DIR = Path("Spotify Account Data")
OUTPUT_DIR = Path("Spotify_Account_Data_Cleaned")
ZIP_OUTPUT_DIR = Path(".")  # directory where the .zip file will be created (current folder)

# Create the output folder (removes existing so each run starts clean)
if OUTPUT_DIR.exists():
    shutil.rmtree(OUTPUT_DIR)
OUTPUT_DIR.mkdir(parents=True)

# List to record every PII replacement for the final summary
changes_summary = []

## Faker and consistent replacements

We use **Faker** to generate fake but realistic values. To keep data consistent across files (e.g. the same email in `Userdata.json` and `Identifiers.json`), we use a single **mapping**: the first time we see an original value, we generate a fake replacement and store it; every later time we see that same value, we reuse the same replacement.

You can change or remove the `seed` below to get different fake data on each run; using a seed makes results reproducible.
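
The first-seen mapping idea can be sketched without Faker. This is only an illustration under stated assumptions: `rng`, `fake_email`, and `consistent_fake` are hypothetical stand-ins, and the map is keyed by `(kind, original)` rather than by the original value alone, which is an optional variation on the approach described above.

```python
import random

# Stand-in for Faker so the sketch runs without extra packages (assumption).
rng = random.Random(42)   # private seeded generator; the global `random` state is untouched
mapping = {}              # (kind, original) -> replacement

def consistent_fake(kind, original, generator):
    """Return the stored replacement for (kind, original), creating it on first use."""
    key = (kind, original)
    if key not in mapping:
        mapping[key] = generator()
    return mapping[key]

def fake_email():
    """Hypothetical generator: a random user number at example.com."""
    return f"user{rng.randrange(10_000)}@example.com"

first = consistent_fake("email", "me@real.example", fake_email)
again = consistent_fake("email", "me@real.example", fake_email)
other = consistent_fake("email", "you@real.example", fake_email)

print(first == again)  # True: the same original always gets the same replacement
print(len(mapping))    # 2: one entry per distinct original
```

Keying by kind keeps a username and a playlist first name that happen to be the same string from sharing one replacement; a single-key map is simpler and works as long as such collisions do not matter.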

In [None]:
# Seed this Faker instance explicitly so replacements are reproducible
# (change or remove the seed for different results)
fake = Faker()
fake.seed_instance(42)
pii_map = {}  # original value -> replacement value

def get_replacement(original, generator):
    """Return a consistent fake replacement for `original`, recording in pii_map."""
    if original is None or (isinstance(original, str) and original.strip() == ""):
        return original
    if original not in pii_map:
        pii_map[original] = generator()
    return pii_map[original]

def record_change(file_name, field, old_val, new_val):
    """Append a change to the summary list for the final report."""
    changes_summary.append({
        "file": file_name,
        "field": field,
        "old": old_val,
        "new": new_val,
    })

## Step 1: Clean `Userdata.json`

`Userdata.json` contains account-level PII: **username**, **email**, **birthdate**, **gender**, **Facebook UID**, and **creation time**. We replace each with a Faker-generated value and write the result to the cleaned folder.
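
The per-field replacement pattern can also be written as a table of field names and generators. The sketch below is one possible shape, not the notebook's implementation; it uses a seeded `random.Random` as a stand-in for Faker, and `GENERATORS` and `clean_fields` are hypothetical names.

```python
import random

rng = random.Random(0)  # stand-in for Faker (assumption); keeps the sketch dependency-free

# Field name -> zero-argument callable that produces a replacement (all hypothetical).
GENERATORS = {
    "username": lambda: f"user_{rng.randrange(100_000)}",
    "email": lambda: f"user{rng.randrange(100_000)}@example.com",
    "gender": lambda: rng.choice(["male", "female", "non-binary"]),
}

def clean_fields(record, generators):
    """Replace non-empty fields in place; return a list of (field, old, new) tuples."""
    changes = []
    for field, make_value in generators.items():
        old = record.get(field)
        if old:  # skip missing, None, and empty values
            record[field] = make_value()
            changes.append((field, old, record[field]))
    return changes

sample = {"username": "realname", "email": "me@real.example", "gender": None, "country": "US"}
changes = clean_fields(sample, GENERATORS)
print([field for field, _, _ in changes])  # ['username', 'email']
```

Fields that are absent or empty are left alone, and fields without a generator (here `country`) pass through unchanged.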

In [None]:
userdata_path = DATA_DIR / "Userdata.json"
with open(userdata_path, "r", encoding="utf-8") as f:
    userdata = json.load(f)

# Replace each PII field and record the change
if userdata.get("username"):
    old_u = userdata["username"]
    userdata["username"] = get_replacement(old_u, fake.user_name)
    record_change("Userdata.json", "username", old_u, userdata["username"])

if userdata.get("email"):
    old_e = userdata["email"]
    userdata["email"] = get_replacement(old_e, fake.email)
    record_change("Userdata.json", "email", old_e, userdata["email"])

if userdata.get("birthdate"):
    old_b = userdata["birthdate"]
    userdata["birthdate"] = get_replacement(old_b, lambda: fake.date_of_birth(minimum_age=18, maximum_age=80).strftime("%Y-%m-%d"))
    record_change("Userdata.json", "birthdate", old_b, userdata["birthdate"])

if userdata.get("gender"):
    old_g = userdata["gender"]
    userdata["gender"] = get_replacement(old_g, lambda: fake.random_element(elements=["male", "female", "non-binary"]))
    record_change("Userdata.json", "gender", old_g, userdata["gender"])

if userdata.get("facebookUid"):
    old_f = userdata["facebookUid"]
    userdata["facebookUid"] = get_replacement(old_f, lambda: fake.numerify(text="###############"))
    record_change("Userdata.json", "facebookUid", old_f, userdata["facebookUid"])

if userdata.get("creationTime"):
    old_c = userdata["creationTime"]
    userdata["creationTime"] = get_replacement(old_c, lambda: fake.date_between(start_date="-15y", end_date="-1y").strftime("%Y-%m-%d"))
    record_change("Userdata.json", "creationTime", old_c, userdata["creationTime"])

with open(OUTPUT_DIR / "Userdata.json", "w", encoding="utf-8") as f:
    json.dump(userdata, f, indent=2)

print("Userdata.json cleaned and written to", OUTPUT_DIR)

## Step 2: Clean `Identifiers.json`

This file stores login identifiers (e.g. email). We replace `identifierValue` with the **same** fake email used in `Userdata.json` for that original email, so the account stays consistent across files.

In [None]:
identifiers_path = DATA_DIR / "Identifiers.json"
with open(identifiers_path, "r", encoding="utf-8") as f:
    identifiers = json.load(f)

# Replace email with same mapping as Userdata (so same original -> same fake)
if identifiers.get("identifierType") == "email" and identifiers.get("identifierValue"):
    old_val = identifiers["identifierValue"]
    identifiers["identifierValue"] = get_replacement(old_val, fake.email)
    record_change("Identifiers.json", "identifierValue", old_val, identifiers["identifierValue"])

with open(OUTPUT_DIR / "Identifiers.json", "w", encoding="utf-8") as f:
    json.dump(identifiers, f, indent=2)

print("Identifiers.json cleaned and written to", OUTPUT_DIR)

## Step 3: Clean `Follow.json`

`Follow.json` contains display names in **userIsFollowing**, **userIsFollowedBy**, and **userIsBlocking**. We replace each name with a consistent fake name made of a first name and a last initial.

In [None]:
def replace_follow_names(data, list_key, file_name):
    """Replace each display name in the given list with a fake name; record changes."""
    if list_key not in data or not isinstance(data[list_key], list):
        return
    new_list = []
    for name in data[list_key]:
        if name:
            new_name = get_replacement(name, lambda: f"{fake.first_name()} {fake.last_name()[0]}.")
            record_change(file_name, list_key, name, new_name)
            new_list.append(new_name)
        else:
            new_list.append(name)
    data[list_key] = new_list

follow_path = DATA_DIR / "Follow.json"
with open(follow_path, "r", encoding="utf-8") as f:
    follow = json.load(f)

replace_follow_names(follow, "userIsFollowing", "Follow.json")
replace_follow_names(follow, "userIsFollowedBy", "Follow.json")
replace_follow_names(follow, "userIsBlocking", "Follow.json")

with open(OUTPUT_DIR / "Follow.json", "w", encoding="utf-8") as f:
    json.dump(follow, f, indent=2)

print("Follow.json cleaned and written to", OUTPUT_DIR)

## Step 4: Clean `Inferences.json`

`Inferences.json` contains demographic and behavioral segments (e.g. `demographic_age_50-54`, `demographic_male`). We replace age bands and gender with generic values so the data remains useful for analysis but not identifying.
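
To preview which entries would be touched, the segments can first be classified without changing anything. This is a small sketch; `classify_segment` is a hypothetical helper, and the third sample string is made up for illustration.

```python
import re

AGE_BAND = re.compile(r"demographic_age_(\d+)-(\d+)")
GENDER_SEGMENTS = {"demographic_male", "demographic_female"}

def classify_segment(segment):
    """Return 'age', 'gender', or 'other' for one inference entry."""
    if not isinstance(segment, str):
        return "other"
    if AGE_BAND.fullmatch(segment):  # the whole string must be an age band
        return "age"
    if segment in GENDER_SEGMENTS:
        return "gender"
    return "other"

samples = ["demographic_age_50-54", "demographic_male", "1P_Custom_Example"]
print([classify_segment(s) for s in samples])  # ['age', 'gender', 'other']
```

Using `fullmatch` rather than `match` means a string that merely starts with an age band is treated as `other`; whether that stricter rule is wanted depends on the segments present in your export.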

In [None]:
# Generic replacements for demographic segments (consistent per original value)
GENERIC_AGE_BANDS = ["demographic_age_25-29", "demographic_age_30-34", "demographic_age_35-39"]
GENERIC_GENDERS = ["demographic_male", "demographic_female"]

def replace_inference(original):
    """Replace a single inference string if it is demographic age or gender."""
    if re.match(r"demographic_age_\d+-\d+", original):
        return get_replacement(original, lambda: fake.random_element(elements=GENERIC_AGE_BANDS))
    if original in ("demographic_male", "demographic_female"):
        return get_replacement(original, lambda: fake.random_element(elements=GENERIC_GENDERS))
    return original

inferences_path = DATA_DIR / "Inferences.json"
with open(inferences_path, "r", encoding="utf-8") as f:
    inferences_data = json.load(f)

new_inferences = []
for val in inferences_data.get("inferences", []):
    new_val = replace_inference(val)
    if new_val != val:
        record_change("Inferences.json", "inferences", val, new_val)
    new_inferences.append(new_val)
inferences_data["inferences"] = new_inferences

with open(OUTPUT_DIR / "Inferences.json", "w", encoding="utf-8") as f:
    json.dump(inferences_data, f, indent=2)

print("Inferences.json cleaned and written to", OUTPUT_DIR)

## Step 5: Clean playlist names in `Playlist1.json`

Playlist **name** fields can contain first names (e.g. "Rob and Harper Christmas Mix", "Harper's Favorites"). We detect common first names in each name string and replace them with Faker first names, reusing the same mapping so the same name is always replaced the same way.
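
Splitting on whitespace misses names that touch punctuation, such as `Harper,` or `(Rob)`. One alternative is a single regular expression with word boundaries. The sketch below assumes a fixed, hypothetical `NAME_MAP` so the output is deterministic; in the notebook the replacements would come from Faker instead.

```python
import re

# Hypothetical fixed mapping so the sketch is deterministic.
NAME_MAP = {"Harper": "Dana", "Rob": "Lee"}

# \b word boundaries let names match next to punctuation but not inside longer words.
NAME_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, NAME_MAP)) + r")\b")

def replace_names(text):
    """Swap every whole-word name in text for its mapped replacement."""
    return NAME_PATTERN.sub(lambda m: NAME_MAP[m.group(1)], text)

print(replace_names("Rob and Harper Christmas Mix"))  # Lee and Dana Christmas Mix
print(replace_names("Harper's Favorites, (Rob)"))     # Dana's Favorites, (Lee)
print(replace_names("Robot Rock"))                    # Robot Rock
```

Because only the name itself is substituted, possessive endings and surrounding punctuation are preserved without special handling.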

In [None]:
# Hard-coded set of common first names to look for in playlist titles.
# Names outside this set are not detected, so add any that appear in your own playlists.
COMMON_FIRST_NAMES = {
    "Harper", "Rob", "Jake", "Alex", "Sam", "Jordan", "Taylor", "Morgan", "Casey",
    "Jamie", "Quinn", "Avery", "Riley", "Hayden", "Parker", "Drew", "Blake", "Reese",
    "James", "John", "Robert", "Michael", "David", "William", "Richard", "Joseph",
    "Mary", "Patricia", "Jennifer", "Linda", "Elizabeth", "Barbara", "Susan", "Jessica",
    "Chris", "Nick", "Matt", "Mike", "Dave", "Tom", "Steve", "Joe", "Dan", "Ben",
    "Amy", "Kate", "Sarah", "Emily", "Anna", "Laura", "Lisa", "Rachel", "Megan",
}

def replace_first_names_in_string(text):
    """Replace any common first name in text with a Faker first name; return new string and record changes."""
    if not text or not text.strip():
        return text
    words = text.split()
    new_words = []
    for word in words:
        # Handle possessive: "Harper's" -> base "Harper", suffix "'s"
        base = re.sub(r"'s$", "", word, flags=re.IGNORECASE)
        if base in COMMON_FIRST_NAMES:
            replacement = get_replacement(base, fake.first_name)
            record_change("Playlist1.json", "playlist name", base, replacement)
            # Preserve possessive if original had it
            if word.endswith("'s") or word.endswith("'S"):
                new_words.append(replacement + "'s")
            else:
                new_words.append(replacement)
        else:
            new_words.append(word)
    return " ".join(new_words)

playlist_path = DATA_DIR / "Playlist1.json"
with open(playlist_path, "r", encoding="utf-8") as f:
    playlists_data = json.load(f)

for playlist in playlists_data.get("playlists", []):
    if playlist.get("name"):
        old_name = playlist["name"]
        playlist["name"] = replace_first_names_in_string(old_name)

with open(OUTPUT_DIR / "Playlist1.json", "w", encoding="utf-8") as f:
    json.dump(playlists_data, f, indent=2)

print("Playlist1.json cleaned and written to", OUTPUT_DIR)

## Step 6: Copy all remaining files unchanged

All other files (e.g. **StreamingHistory**, **YourLibrary**, **Marquee**, **Payments**, other playlists) are copied into the cleaned folder so the dataset stays complete and the output has the same structure as the original. Only the five files we modified above are written with obfuscated data; everything else is copied as-is.
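
After the copy has run, one way to confirm that the cleaned folder mirrors the original structure is to compare the sets of relative file paths. This is a sketch with hypothetical helper names (`relative_files`, `compare_trees`); it compares names only, not contents.

```python
from pathlib import Path

def relative_files(root):
    """Return the set of file paths under root, relative to root, as POSIX strings."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}

def compare_trees(source_dir, cleaned_dir):
    """Return (missing_from_cleaned, unexpected_in_cleaned) as sorted lists."""
    source = relative_files(source_dir)
    cleaned = relative_files(cleaned_dir)
    return sorted(source - cleaned), sorted(cleaned - source)
```

Calling `compare_trees("Spotify Account Data", "Spotify_Account_Data_Cleaned")` once both folders exist should return two empty lists if every file was either cleaned or copied.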

In [None]:
# Which files we already wrote (cleaned); only these are skipped when copying
cleaned_files = {"Userdata.json", "Identifiers.json", "Follow.json", "Inferences.json", "Playlist1.json"}

# Copy every file from the source folder (including subdirs) into OUTPUT_DIR, preserving structure.
# Skip only the five files we already wrote with obfuscated data.
for path in DATA_DIR.rglob("*"):
    if not path.is_file():
        continue
    rel = path.relative_to(DATA_DIR)
    # Skip top-level files we already cleaned and wrote
    if len(rel.parts) == 1 and rel.name in cleaned_files:
        continue
    dest = OUTPUT_DIR / rel
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(path, dest)
    print("Copied:", rel)

print("All remaining files copied to", OUTPUT_DIR)

## Step 7: Create ZIP with random numeric filename

The cleaned folder is zipped so you can submit or share it easily. The filename is a **random number** (e.g. `71234987833478823.zip`) so it does not identify you.
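
As a variation, the random name can be drawn from the standard-library `secrets` module, which uses the operating system's random source and is therefore unaffected by any seed set on `random`. The sketch below is an assumption-laden alternative, not the notebook's method; `random_zip_stem` and `zip_folder` are hypothetical helpers.

```python
import secrets
import zipfile
from pathlib import Path

def random_zip_stem(digits=17):
    """Return a random decimal string with exactly `digits` digits and no leading zero."""
    low = 10 ** (digits - 1)
    return str(low + secrets.randbelow(9 * low))

def zip_folder(folder, zip_path, top_level):
    """Zip every file under folder beneath top_level/ and return the archive's entry names."""
    folder = Path(folder)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                arcname = f"{top_level}/{path.relative_to(folder).as_posix()}"
                zf.write(path, arcname)
    # Reopen the archive to report what was actually stored.
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()
```

Returning the entry names makes it easy to check that nothing in the archive, including the top-level folder name, carries identifying text.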

In [None]:
# Random numeric filename (e.g. 71234987833478823.zip); same number used as folder name inside zip
zip_stem = str(random.randint(10**16, 10**18 - 1))
zip_name = zip_stem + ".zip"
zip_path = ZIP_OUTPUT_DIR / zip_name

with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
    for f in OUTPUT_DIR.rglob("*"):
        if f.is_file():
            arcname = f"{zip_stem}/{f.relative_to(OUTPUT_DIR)}"  # e.g. 71234987833478823/Userdata.json
            zf.write(f, arcname)

print("Created zip:", zip_path.resolve())

## Summary of changes

Below is a report of every PII replacement made in this run. Use it to verify what was anonymized before you submit or share the zip file.
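
Beyond reading the report, the cleaned folder can be searched for original values that survived. The following is a rough sketch using plain substring search over the JSON text; `find_leaks` is a hypothetical helper, and short or common values (such as a gender string) will produce many incidental matches.

```python
import json
from pathlib import Path

def find_leaks(cleaned_dir, originals):
    """Return sorted (relative_path, original) pairs where an original value still appears."""
    cleaned_dir = Path(cleaned_dir)
    leaks = []
    for path in cleaned_dir.rglob("*.json"):
        text = path.read_text(encoding="utf-8", errors="replace")
        for value in originals:
            if not isinstance(value, str) or not value.strip():
                continue
            # json.dump escapes non-ASCII characters by default, so check both spellings.
            escaped = json.dumps(value)[1:-1]
            if value in text or escaped in text:
                leaks.append((path.relative_to(cleaned_dir).as_posix(), value))
    return sorted(leaks)
```

A typical call would pass only distinctive values, for example the original email and username taken from `changes_summary`, together with `OUTPUT_DIR`; an empty result means none of them appear in any cleaned or copied JSON file.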

In [None]:
# Group changes by file for a clear report
from collections import defaultdict
by_file = defaultdict(list)
for c in changes_summary:
    by_file[c["file"]].append((c["field"], c["old"], c["new"]))

print("=" * 60)
print("SUMMARY OF PII REPLACEMENTS")
print("=" * 60)
for file_name in sorted(by_file.keys()):
    print(f"\n--- {file_name} ---")
    for field, old_val, new_val in by_file[file_name]:
        print(f"  {field}:  {old_val!r}  ->  {new_val!r}")
print("\n" + "=" * 60)
print(f"Total replacements: {len(changes_summary)}")
print("=" * 60)