## ReadDeletionEventsDataLake.ipynb

#### Authors:
* The first part of this script concerning reading data from Data Lake was forked from the GitHub of 'fsanchez' on April 1, 2020 (https://github.bus.zalan.do/fsanchez/notebooks/blob/master/ReadingCustomerDeletionEventsFromDataLake.ipynb?short_path=f54a222)
* The last part of this script concerning publishing to Nakadi was forked from the GitHub of 'eiunkar' on April 1, 2020
(https://github.com/eiunkar/pyNakadi)
* The rest of this script, including the API call to Qualtrics, was written by Kevin Stine

#### Purpose:
This code was created for the purpose of identifying whether any Zalando Voices members (the research panel maintained by the Voice of Customers team) have chosen to delete all of their Zalando data.
If this is true, the code will print out a list of these Zalando Voices members so that the user can delete them from Zalando Voices.
This code also sends a message (via Nakadi) to Zalando's compliance network to prove that we looked through these deletion requests and acted on them.

##### Specific Tasks of this Code:
1. Read in Zalando data deletion requests from the Data Lake
2. Extract email addresses from those requests and compare them to the Zalando Voices panel data in QuestionPro
3. Print out matching email addresses so that the **user of this code** can manually delete those members' data from Zalando Voices
4. Publish data deletion confirmations to Nakadi (proving that we processed these requests).
5. Record the requests which have been processed so far and their outcomes (deleted or no customer data found) in a local file (called "LoggedDeletionRequests.csv") so that we don't re-process data that we've already checked.


#### Additional Notes:
This script, and the structure of how it is run, should eventually undergo substantial changes to make it more compliant with Zalando's architecture standards:
1. Firstly, this script will ideally not live in DataLab. It should be re-implemented in a service which offers easier automation and easier access for coworkers (such as Amazon Web Services). **Note: This will take longer to implement and I have not prioritized this highly yet. -Kevin, July 29, 2021**

2. The authorization which enables this script to interact with the Data Lake and Nakadi (by generating a ZToken) currently depends on the credentials of the user. Once this script is set up to run automatically, this authorization will not be able to depend on the user's credentials, or there will need to be a 'robot' user whose credentials can be used for authorization.

### Get the User of this Script, and the User's ZToken

In [None]:
import auth
import getpass

user  = getpass.getuser()
#NOTE: If auth.get_valid_token() fails in DataLab, you can click the 'Z Token' button at the top of the screen
#and it should work again.
token = auth.get_valid_token()

### Create a Connection to the Data Lake

In [None]:
from pyhive import presto

connection = presto.connect(
    protocol='https',
    host='datalab.presto.zalando.net',
    port=443,
    username=user,
    password=token
)

### Find Out Which Deletion Requests are New

In [None]:
import pandas as pd
from datetime import date

#First, I need to look within my local CSV to find the last date this script processed and, by extension, the last
#data deletion request this script processed.
path = "LoggedDeletionRequests.csv"  #Team-folder path: "/teams/team-ur/DataDeletion/LoggedDeletionRequests.csv"
#Select the column with ['dt'] rather than attribute access (.dt), which is easy to confuse with pandas' datetime accessor
LastDateParsed = pd.read_csv(path, usecols=['dt'], header=0)['dt'].max()

#Now that I know the last date on which I processed deletion requests, I can query all the requests from that date
#through the current date and thereby ensure that I've processed every request. NOTE: This DOES mean that I can
#double-process requests (for example, a person who requested deletion on the day I last ran this script could be
#included in that previous deletion round as well as the current one). For example, when this code was run on
#28/04/2020, 212 cases out of ~3,500 new requests were re-processed.
Today = str(date.today())
LastDateParsed
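
The double-processing described in the comments above could be avoided by filtering on the event ID instead of relying on the date alone. A minimal sketch, assuming the log file carries a `metadata_eid` column (the later logging step suggests it does, but this is not confirmed here). `drop_already_logged` is a hypothetical helper, and the small in-memory frames stand in for the real log and query result:

```python
import pandas as pd

def drop_already_logged(new_requests: pd.DataFrame, logged: pd.DataFrame) -> pd.DataFrame:
    """Return only the requests whose metadata_eid is not yet in the log."""
    seen = set(logged["metadata_eid"])
    return new_requests[~new_requests["metadata_eid"].isin(seen)].reset_index(drop=True)

# Illustrative data only; the real frames come from the CSV log and the Data Lake.
logged = pd.DataFrame({"metadata_eid": ["a1", "a2"], "dt": ["2020-04-27", "2020-04-28"]})
new_requests = pd.DataFrame({"metadata_eid": ["a2", "a3"], "dt": ["2020-04-28", "2020-04-29"]})

fresh = drop_already_logged(new_requests, logged)
print(fresh["metadata_eid"].tolist())
```

Whether skipping already-logged requests is acceptable for compliance purposes is a separate question; the sketch only shows the mechanics.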

### Read in New Deletion Requests from Data Lake

Please make sure you have access to the event_confidential.care_api_customer_data_deletion_requested table: https://lake.docs.zalando.net/access/access/

In [None]:
#dt is compared as a 'YYYY-MM-DD' string; BETWEEN includes both end dates
query = f"""SELECT metadata_eid, dt, email_address, customer_number
           FROM event_delta_confidential.care_api_customer_data_deletion_requested
           WHERE dt BETWEEN '{LastDateParsed}' AND '{Today}'"""
df = pd.read_sql_query(query, connection)
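
Because the query above is assembled from strings, a malformed date in the log file would flow straight into the SQL. One way to guard against that, sketched here under the assumption that `dt` is always a `YYYY-MM-DD` string (as the notebook's use of `str(date.today())` suggests), is to parse both bounds before building the query. `build_deletion_query` is a hypothetical helper, not part of the original notebook:

```python
from datetime import date

TABLE = "event_delta_confidential.care_api_customer_data_deletion_requested"

def build_deletion_query(start: str, end: str) -> str:
    """Build the deletion-request query after validating both date bounds."""
    # date.fromisoformat raises ValueError if the text is not a valid ISO date
    start_day = date.fromisoformat(start)
    end_day = date.fromisoformat(end)
    if start_day > end_day:
        raise ValueError(f"start date {start} is after end date {end}")
    return (
        "SELECT metadata_eid, dt, email_address, customer_number "
        f"FROM {TABLE} "
        f"WHERE dt BETWEEN '{start_day.isoformat()}' AND '{end_day.isoformat()}'"
    )

example_query = build_deletion_query("2020-04-01", "2020-04-28")
print(example_query)
```

In the notebook this would be called as `build_deletion_query(LastDateParsed, Today)` before handing the result to `pd.read_sql_query`.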

In [None]:
#Seeing how many new cases we have. print() is needed because a notebook cell only displays its last expression.
print(len(df))
df

### Read in QuestionPro data (Zalando Voices panel)

In [None]:
import urllib3
import json

#URL to QuestionPro Directory
url="https://eu.questionpro.com/a/api/questionpro.micropanel.panelDataExport?apiKey={0}"

#Customize this to where you store your personal API token. It should be in your private DataLab space (e.g. "home/kstine/...")
#for security. When you save the file with your API token, the file should contain ONLY this text:
#{"questionpro_api_token":"PUT_API_TOKEN_HERE"}

#api_location = "/nfs/aakyuez_untitled.txt" #Aysenur's file location
api_location = "/nfs/Tokens/API_Token.json" #Kevin's file location

#Reading in the API token from the external file.
#It is stored as api_token so that it does not overwrite the ZToken held in `token`.
with open(api_location) as myfile:
    api_token = json.load(myfile)['questionpro_api_token']

url = url.format(api_token)

http = urllib3.PoolManager()

response = http.request(method='GET', url=url)
response = json.loads(response.data.decode('utf-8'))

#Stop here if the export failed; otherwise `members` would be undefined (or left over from an earlier run)
if response['status']['id'] != 200:
    raise RuntimeError("QuestionPro export failed with status " + str(response['status']['id']))
members = response['response']['report']['data']

# with open ("/nfs/questionPro_dummy.json") as myfile:
#     members = json.load(myfile)['vals']
#print(members)

#Iterating over array of objects to extract email addresses into a list
qpro_emails = [obj['Email Address'] for obj in members]

### Indicate which deletion requests represent Zalando Voices members

In [None]:
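#Looking emails up in a set avoids scanning the whole QuestionPro list once per deletion request
qpro_email_set = set(qpro_emails)
df['inQuestionPro'] = ["customer_deleted" if df_email in qpro_email_set else "no_customer_data_found" for df_email in df['email_address'].tolist()]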
#If this command only returns the column headers (no rows), it means that we didn't find any customers in QuestionPro
#who need to be deleted.
df[df['inQuestionPro']=='customer_deleted']
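
The comparison above matches email addresses exactly as stored, so differences in letter case or stray whitespace would cause a panel member to be missed. A hedged sketch of a more forgiving match follows. Whether case-insensitive matching is appropriate for this data is an assumption, and `normalize_email` and `label_requests` are hypothetical helpers:

```python
import pandas as pd

def normalize_email(value) -> str:
    """Lower-case and trim an email address; non-string values become ''."""
    return value.strip().lower() if isinstance(value, str) else ""

def label_requests(requests: pd.DataFrame, panel_emails) -> pd.Series:
    """Label each request by whether its normalized email is in the panel."""
    panel = {normalize_email(e) for e in panel_emails} - {""}
    matches = requests["email_address"].map(normalize_email).isin(panel)
    return matches.map({True: "customer_deleted", False: "no_customer_data_found"})

# Illustrative data only.
requests = pd.DataFrame({"email_address": ["Anna@Example.com ", "bob@example.com", None]})
labels = label_requests(requests, ["anna@example.com"])
print(labels.tolist())
```

The labels are the same two strings the notebook writes into `inQuestionPro`, so the result could be assigned to that column directly.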

---
## IMPORTANT:
At this point in the script, you (the user) should delete the Zalando Voices data for every email address listed above (ZV members who have requested deletion). Instructions for how to do this are in this document:
https://docs.google.com/spreadsheets/d/1ZSQ_o2KurGlV1TzJj2FL9Z_7P90onV23xb4B5yuYxRc/edit#gid=1615454121



Once you have finished deleting these members' data, run the rest of this script, which notifies Zalando that we processed the requests and records which requests we processed in a local file (for us to read in next time).

---

### Posting Confirmations of Processing to Nakadi - Setup

In [None]:
%pip install pynakadi

In [None]:
from pyNakadi import NakadiClient, NakadiException
import auth
import pytz
from datetime import datetime

token = auth.get_valid_token()
#--------!!!--------
#--------!!!--------
#URL to which we need to be posting our event
url = "https://nakadi-live.aruha.zalan.do" #Real, live URL!
#url = "https://nakadi-staging.aruha-test.zalan.do" #Test URL
#--------!!!--------
#--------!!!--------

#datetime.utcnow() is deprecated as of Python 3.12; datetime.now(pytz.UTC) yields the same timezone-aware timestamp
time = datetime.now(pytz.UTC).isoformat()


### Transforming Data in Preparation for Posting to Nakadi:

In [None]:
#Creating series of JSONs with our data
def jsonification(metadata_eid, inQuestionPro, time):
    return {
        "metadata": {
            "eid": metadata_eid,
            "occurred_at": time
        },
        "deletion_request_event_id": metadata_eid,
        "team_id": "team-ur",
        "deletion_acknowledgment": inQuestionPro
    }

toPost = pd.Series(jsonification(row.metadata_eid,row.inQuestionPro,time) for row in df.itertuples())

#----------
#Chunking requests into sets of 50 for faster pushing to Nakadi:
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

chunked_Posts = chunker(toPost,50)
print("Posts chunked and ready for Nakadi")
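
To make the chunking step concrete, here is the same `chunker` applied to a plain list of 120 placeholder events (the event contents are illustrative only):

```python
def chunker(seq, size):
    """Return a generator of consecutive slices of seq, each at most `size` long."""
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

# Placeholder events; the notebook builds real ones with jsonification().
events = [{"deletion_request_event_id": f"eid-{i}"} for i in range(120)]

chunk_sizes = [len(chunk) for chunk in chunker(events, 50)]
print(chunk_sizes)  # [50, 50, 20]
```

Note that `chunker` returns a generator, which can be consumed only once. If the publishing cell is re-run without re-running this cell first, `chunked_Posts` will already be exhausted and nothing will be posted.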

### Publishing Nakadi Confirmations:

In [None]:
#Meta-data needed to publish to Nakadi
event_type = "infosec.customer_data_deleted"
client = NakadiClient(token, url)
NakadiPosts = []

for chunk in chunked_Posts:
    events = chunk.tolist()
    try:
        client.post_events(event_type, events)
        NakadiPosts.extend(events)
    except NakadiException as ex:
        #A failed chunk is skipped, not retried, so its events will be missing from NakadiPosts
        print(f'NakadiException[{ex.code}]: {ex.msg}')

print(f"{len(NakadiPosts)} of {len(df)} posts sent to {url}")
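
The loop above moves on when a chunk raises `NakadiException`, yet the logging step at the end of the notebook records every row of `df` as processed. One possible way to keep the two in sync is to track posted and failed events separately. The sketch below uses a stand-in `post` callable in place of pyNakadi's client so that it runs without network access; `publish_in_chunks` and `fake_post` are hypothetical names:

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]

def publish_in_chunks(
    events: List[Event],
    post: Callable[[List[Event]], None],
    size: int = 50,
) -> Tuple[List[Event], List[Event]]:
    """Post events chunk by chunk and return (posted, failed) lists."""
    posted: List[Event] = []
    failed: List[Event] = []
    for pos in range(0, len(events), size):
        chunk = events[pos:pos + size]
        try:
            post(chunk)
            posted.extend(chunk)
        except Exception as ex:  # the notebook would catch NakadiException here
            print(f"Chunk starting at index {pos} failed: {ex}")
            failed.extend(chunk)
    return posted, failed

# Stand-in for client.post_events that rejects the second chunk.
calls = []
def fake_post(chunk):
    calls.append(len(chunk))
    if len(calls) == 2:
        raise RuntimeError("simulated outage")

events = [{"deletion_request_event_id": f"eid-{i}"} for i in range(120)]
posted, failed = publish_in_chunks(events, fake_post)
print(len(posted), len(failed))
```

In the notebook, `post` could be `lambda chunk: client.post_events(event_type, chunk)`, and only the rows whose event IDs appear in `posted` would then be written to the log.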

### Double-Checking What we Sent to Nakadi:

In [None]:
#Taking a look at what I've posted to Nakadi:
#len(NakadiPosts)
NakadiPosts[-3:]

In [None]:
#===DOUBLE-CHECKING===
#Taking a look at what we told Nakadi that we deleted:
deletedEids = df['metadata_eid'][df['inQuestionPro']=='customer_deleted'].tolist()
#This confirms whether or not the EID we were supposed to delete was in the Nakadi Posts:
deletedPosts = [post for post in NakadiPosts if post['metadata']['eid'] in deletedEids]
deletedPosts

In [None]:
#===DOUBLE-CHECKING===
#Did we tell Nakadi that we deleted the right people?
df[df['inQuestionPro']=='customer_deleted']

### Recording our Activity to Local "LoggedDeletionRequests.csv" file

In [None]:
#Note: This appends to the same "LoggedDeletionRequests.csv" file that was read at the start of this script
#Email addresses are dropped before logging
Newdf = df.drop(columns='email_address')
#header=False and index=False append plain data rows beneath the file's existing header
Newdf.to_csv('LoggedDeletionRequests.csv', mode='a', header=False, index=False)
print(str(len(Newdf)) + " rows written")
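
The log written here is what the next run reads to find its starting date. The round trip can be checked in isolation with a temporary file. This is a sketch only: the column names of the log are assumed from the query and the `inQuestionPro` column, and the data is made up:

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "LoggedDeletionRequests.csv")

    # An existing log with a header row, as read_csv(header=0) expects.
    pd.DataFrame({
        "metadata_eid": ["a1"],
        "dt": ["2020-04-27"],
        "customer_number": ["100"],
        "inQuestionPro": ["no_customer_data_found"],
    }).to_csv(log_path, index=False)

    # Requests processed in the current run (illustrative data).
    processed = pd.DataFrame({
        "metadata_eid": ["a2"],
        "dt": ["2020-04-28"],
        "email_address": ["anna@example.com"],
        "customer_number": ["101"],
        "inQuestionPro": ["customer_deleted"],
    })

    # Append without the email column and without repeating the header.
    processed.drop(columns="email_address").to_csv(
        log_path, mode="a", header=False, index=False
    )

    # What the next run would see.
    log = pd.read_csv(log_path, dtype=str)
    last_date = log["dt"].max()

print(log.columns.tolist())
print(last_date)
```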