### Business Case

Data anonymization is important when publishing work. To publish my main script (Scoreholio Stat Collection), I wanted to replace my players' names with anonymous ones. This lets me provide test data that others can run the script on without risking the privacy of my players. I am not providing the original CSV files, for obvious privacy reasons.

In [1]:
import pandas as pd
import numpy as np
import names
from faker import Faker
import os
import glob
import re
from ipynb.fs.full.methods import get_combined_list

#### Set the input path to the directory of Scoreholio bracket standing outputs, downloaded directly from the Scoreholio website.

In [2]:
input_path = r'C:\Users\chein\School\Practice\Cornhole Stat Collection\ChiTown Baggers Bracket Standing Outputs' # use your path

In [3]:
output_path = r'C:\Users\chein\School\Practice\Cornhole Stat Collection\Chitown Anonymous Data'

#### get_combined_list combines all CSVs in the folder into one dataframe, grouped by player name.

In [4]:
realnames = get_combined_list(input_path)
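
The body of `get_combined_list` lives in the separate `methods` notebook and is not shown here. As a rough illustration of the idea only, the sketch below uses a hypothetical stand-in, `combine_player_names`, which assumes each CSV has `Player1Name` and `Player2Name` columns and returns just the unique names in a `Player Name` column. The real method may compute more than this.

```python
import glob
import os

import pandas as pd


def combine_player_names(folder):
    """Hypothetical stand-in for get_combined_list: one row per unique player."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    # pd.concat raises ValueError if the folder contains no CSV files.
    combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
    # Stack both player columns into a single series of names.
    stacked = pd.concat(
        [combined["Player1Name"], combined["Player2Name"]], ignore_index=True
    )
    players = stacked.dropna().drop_duplicates().reset_index(drop=True)
    return players.to_frame(name="Player Name")
```

Only the `Player Name` column is needed by the rest of this notebook, so that is all the sketch produces.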

#### Rename column for easier referencing.

In [5]:
realnames = realnames.rename(columns={"Player Name": "playername"})

#### Check the number of unique player names.

In [6]:
len(realnames["playername"].unique())

93

#### Set all_names to the set of unique player names in the realnames dataframe.

In [7]:
all_names = set(realnames["playername"])

This creates a dictionary that maps each player name to a fake name. I used last_name for no particular reason. Using fake.unique ensures that every player gets a distinct fake name, so no two real players are accidentally merged under one fake name.

In [8]:
# Create a Faker object, which is used to generate fake names.
fake = Faker()
# Non-unique alternative, kept for reference: two players could share a fake name.
# mapper = {k: fake.last_name() for k in all_names}
mapper = {k: fake.unique.last_name() for k in all_names}
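
To make the uniqueness requirement concrete, here is a minimal sketch of a sanity check that a mapping is one-to-one. The helper `find_collisions` and the placeholder names are hypothetical and are not part of the original notebook.

```python
from collections import defaultdict


def find_collisions(mapper):
    """Return fake names assigned to more than one real name (hypothetical helper)."""
    owners = defaultdict(list)
    for real, fake_name in mapper.items():
        owners[fake_name].append(real)
    return {name: reals for name, reals in owners.items() if len(reals) > 1}


# Made-up placeholder names, not real player data.
good = {"Player A": "Kirk", "Player B": "Carter"}
bad = {"Player A": "Kirk", "Player B": "Kirk"}

print(find_collisions(good))  # {}
print(find_collisions(bad))   # {'Kirk': ['Player A', 'Player B']}
```

Running `find_collisions(mapper)` on the real mapping should return an empty dictionary when every player received a distinct fake name.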

#### This goes through each file from the input path directory, anonymizes it, and then saves it to the output directory.

In [9]:
all_files = glob.glob(os.path.join(input_path, "*.csv"))
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # Replace real player names with their fake counterparts.
    df = df.replace({"Player1Name": mapper})
    df = df.replace({"Player2Name": mapper})
    # Blank out the team name, as it also contains player data.
    df["TeamName"] = " "
    # Build "<original file name>-anon.csv" from the file's base name.
    base = os.path.splitext(os.path.basename(filename))[0]
    df.to_csv(os.path.join(output_path, base + "-anon.csv"), index=False, header=True)
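
The column-scoped replacement used in the loop can be tried on a toy dataframe. This is a minimal sketch with made-up placeholder names and a made-up mapping. It also shows one caveat: a name that is missing from the mapper passes through unchanged, so it can be worth checking for leftovers before publishing.

```python
import pandas as pd

# Made-up placeholder names and mapping, not real player data.
toy_mapper = {"Player A": "Kirk", "Player B": "Carter"}
toy = pd.DataFrame(
    {
        "TeamName": ["Player A / Player B", "Player B / Player C"],
        "Player1Name": ["Player A", "Player B"],
        "Player2Name": ["Player B", "Player C"],
    }
)

# Same steps as the loop: replace names per column, then blank the team name.
anon = toy.replace({"Player1Name": toy_mapper}).replace({"Player2Name": toy_mapper})
anon["TeamName"] = " "

# Any value that is not a known fake name was never anonymized.
allowed = set(toy_mapper.values())
leftovers = (set(anon["Player1Name"]) | set(anon["Player2Name"])) - allowed
print(leftovers)  # {'Player C'}
```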

Below is example output after anonymization. Columns that contain only null values are excluded here; the main script handles the actual data cleaning.

In [10]:
df = df[df.columns[~df.isnull().all()]]
df.head()

| | GameID | GameName | TeamID | TeamName | Player1Name | Player2Name | Seed | Place | Wins | Losses |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | pKiHADI5c9EsrG7pH1gq | Rizzo's Social Blind Draw | fCyWKFwZouRcBkkLMBpa | | Kirk | Carter | 5 | 1 | 6 | 0 |
| 1 | pKiHADI5c9EsrG7pH1gq | Rizzo's Social Blind Draw | GJAovQFbTBshk30cCyVI | | Singh | Nunez | 8 | 2 | 5 | 2 |
| 2 | pKiHADI5c9EsrG7pH1gq | Rizzo's Social Blind Draw | VKhFgkFZGc8nMQbnLWYC | | Hensley | Kennedy | 10 | 3 | 3 | 2 |
| 3 | pKiHADI5c9EsrG7pH1gq | Rizzo's Social Blind Draw | zSSSt1TXTTsQnkpbstc1 | | Montoya | Robertson | 6 | 4 | 3 | 2 |
| 4 | pKiHADI5c9EsrG7pH1gq | Rizzo's Social Blind Draw | c0fkg3xTT0RW9el9QChX | | Mendez | Martinez | 4 | 5 | 2 | 2 |
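
The column filter used for this preview keeps only columns with at least one non-null value. A minimal sketch on a toy dataframe (the `Notes` column is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: "Notes" is entirely null, the other columns are not.
toy = pd.DataFrame({"Wins": [6, 5], "Losses": [0, 2], "Notes": [np.nan, np.nan]})

# Same expression as in the cell above, applied to the toy frame.
trimmed = toy[toy.columns[~toy.isnull().all()]]
print(list(trimmed.columns))  # ['Wins', 'Losses']
```

`toy.dropna(axis=1, how="all")` gives the same result and may read more directly.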
