## Scrubbing User Data
Before analysis, we scrub the data so that no traceable, personally identifying information remains attached to individual responses.

Import Dependencies

In [69]:
# import dependencies
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as sci
import pandas as pd
from pathlib import Path

Load the raw CSV data into a data frame. The raw data should be located in `./raw-data/`.

In [70]:
# file_name = input()
file_name = "responses_raw.csv"
file_path = "./raw-data/" + file_name
df = pd.read_csv(file_path)

Prune rows with duplicate emails from the dataset, keeping the last occurrence of each email.

In [71]:
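# Column that holds the optional contest email
col_name = (
    "(Optional) Provide your email for a chance to win a $20.00 Tim Horton's gift card"
)

# drop_duplicates returns a new frame, so assign it back to keep the result.
# Note: missing emails (NaN) compare equal here, so rows without an email
# collapse to a single row.
df = df.drop_duplicates(
    subset=[col_name],
    keep="last",
)
df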

Unnamed: 0,Timestamp,What do you think is an acceptable amount of time to wait for services at your pharmacy?,How long do you usually wait at the pharmacy?,Have you used Amazon Lockers or a similar pick-up Lockers for before?,How likely are you to use an automated pick-up locker to pick up your prescriptions?,Email Address,What is your age range?,(Optional) Provide your email for a chance to win a $20.00 Tim Horton's gift card,How often do you usually visit pharmacies in a given year,Think about the last time you went to the pharmacy. How would you best describe your experience?
2,10/5/2022 18:48:45,11-15 minutes,0-3 minutes,No,Somewhat likely,,Under 21,zoe.cushman@protonmail.com,,
3,10/5/2022 19:10:21,4-6 minutes,more than 15 minutes,No,Very likely,,21 - 35,nbudatho@uwaterloo.ca,,
4,10/5/2022 19:20:25,4-6 minutes,7-9 minutes,No,Not very likely,,21 - 35,glmdenney17@gmail.com,,
5,10/5/2022 19:25:40,4-6 minutes,0-3 minutes,No,Likely,,21 - 35,m.balghonaim@gmail.com,,
6,10/5/2022 19:32:27,4-6 minutes,7-9 minutes,No,Very likely,,21 - 35,jjwilkin@uwaterloo.ca,,
...,...,...,...,...,...,...,...,...,...,...
919,10/6/2022 19:38:05,4-6 minutes,0-3 minutes,Yes,Likely,,35 - 40,matildagreenh78@gmail.com,,
920,10/6/2022 19:38:09,4-6 minutes,7-9 minutes,Yes,Likely,,21 - 35,alesandrurakuqi843@gmail.com,,
921,10/6/2022 19:38:13,4-6 minutes,11-15 minutes,Yes,Very likely,,21 - 35,gregoryarroyo099@gmail.com,,
922,10/6/2022 19:38:40,7-9 minutes,7-9 minutes,Yes,Very likely,,35 - 40,anunley661@gmail.com,,
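
One caveat with `drop_duplicates` on an optional field: pandas treats missing values as equal to each other, so every response without an email counts as a "duplicate" of the others. The following is a minimal sketch of one way to prune repeated emails while keeping all responses that left the field blank. The shortened column name, the sample rows, and the helper `drop_duplicate_emails` are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical shortened column name and made-up rows, for illustration only.
email_col = "email"
sample = pd.DataFrame(
    {
        "answer": ["a", "b", "c", "d", "e"],
        email_col: ["x@example.com", None, "x@example.com", None, "y@example.com"],
    }
)


def drop_duplicate_emails(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Keep the last row for each email and every row that has no email."""
    has_email = frame[column].notna()
    # True for every occurrence of a value except the last one
    is_repeat = frame.duplicated(subset=[column], keep="last")
    # Only drop rows that both carry an email and repeat a later one
    return frame[~(has_email & is_repeat)]


deduped = drop_duplicate_emails(sample, email_col)
print(deduped)
```

On the sample above, only the first `x@example.com` row is removed, whereas a plain `drop_duplicates(subset=[email_col], keep="last")` would also remove one of the two rows with no email.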


Collect the emails and store them separately from the response data.

In [72]:
emails = pd.DataFrame(df[col_name])
emails.rename(
    columns={col_name: "email"},
    inplace=True,
)
# Both calls return new frames, so the result must be assigned back
emails = emails.dropna().drop_duplicates()
email_output_file_name = "emails_for_reward.csv"
filepath = Path("./output/" + email_output_file_name)
filepath.parent.mkdir(parents=True, exist_ok=True)
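# index=False keeps the original row labels out of the file,
# so an email cannot be matched back to its survey response
emails.to_csv(filepath, index=False)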
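
With the emails stored on their own, the remaining step in the scrub is to remove the identifying columns from the frame that will be analyzed. The following is a minimal sketch of that step, assuming the email columns are the only directly identifying fields. The shortened column names and the sample rows are hypothetical and stand in for the real survey headers.

```python
import pandas as pd

# Hypothetical shortened column names and made-up rows, for illustration only.
pii_columns = ["Email Address", "email"]
responses = pd.DataFrame(
    {
        "Timestamp": ["10/5/2022 18:48:45", "10/5/2022 19:10:21"],
        "What is your age range?": ["Under 21", "21 - 35"],
        "Email Address": [None, None],
        "email": ["a@example.com", "b@example.com"],
    }
)

# errors="ignore" skips any listed column that is not present in the frame
scrubbed = responses.drop(columns=pii_columns, errors="ignore")

# Renumber the rows so labels from the raw file do not carry over
scrubbed = scrubbed.reset_index(drop=True)
print(scrubbed.columns.tolist())
```

Writing `scrubbed` out with `to_csv(..., index=False)` then gives an analysis file that shares neither emails nor row labels with the reward list.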