## Introduction

The following pages contain an exercise designed to give you a flavor of the topics you will work on as a Privacy Hub data analyst. Your answers will show us how you work and think, as well as how well you communicate. When in doubt about how to proceed, please use your best judgment. We require that your answers are entirely your own work and that you cite any external resources you use.

Your answers should include the code used to produce your findings. Where you describe your findings, do so in a non-technical way, being careful with wording. The reader should not need to execute your code to see your answers.

Good luck!

## Technical Instructions

The remainder of this notebook is divided into two sections. Section 1 consists of questions for you to answer directly in the notebook. Your submission should be an HTML or PDF export of this notebook. Send it as an email attachment to the address from which you received this exercise.

Section 2 consists of an appendix that will be referenced in Section 1.

### Personal Hobbies

The following questions involve working with the dataset that was sent with this notebook. This is toy data which has been created within Privacy Hub and does not reflect any real people. The dataset includes a number of variables relating to physical attributes and hobbies, plus a ‘token’ which is used to distinguish one patient from another. A token is a value which is created from one or more input values. It looks nonsensical (as it is a ‘hash’) and cannot be reverse-engineered to recover the original value. An example is provided alongside a more detailed explanation in the Appendix.

This data has been sent to you from a prospective client who intends to use the data for some analyses on inferring hobby trends across different demographics. The client has informed you that the tokens are created using data within the dataset itself, which is encrypted using a robust and trusted algorithm.

For the following questions, where appropriate, use code and language that are concise, efficient, easy to read, and annotated to explain what is going on. Where a question is open-ended, please keep your explanation succinct.

a) Using the dataset and information above, determine which variables have been used to create the token. Can you comment on how this token might be improved?
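One possible way to approach part a) is to test, for each combination of columns, whether its values map one-to-one onto the token. The sketch below assumes the data has been loaded into a pandas DataFrame and that the token column is named `token`. The function name `token_input_candidates` and the toy column names are hypothetical and are not taken from the supplied dataset.

```python
from itertools import combinations

import pandas as pd


def token_input_candidates(df, token_col="token", max_size=3):
    """Return column subsets whose values map one-to-one onto the token."""
    columns = [c for c in df.columns if c != token_col]
    n_tokens = df[token_col].nunique()
    matches = []
    for size in range(1, max_size + 1):
        for subset in combinations(columns, size):
            cols = list(subset)
            # Distinct (subset values, token) pairs seen in the data.
            pairs = df[cols + [token_col]].drop_duplicates()
            n_keys = len(pairs[cols].drop_duplicates())
            # One-to-one: every key has one token and every token has one key.
            if len(pairs) == n_keys == n_tokens:
                matches.append(subset)
    return matches


# Toy data (hypothetical columns): the token depends on sex and eye_color.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "eye_color": ["blue", "blue", "blue", "green"],
    "age": [30, 41, 52, 41],
    "token": ["t1", "t2", "t1", "t3"],
})
candidates = token_input_candidates(toy)
print(candidates)
```

On the toy frame only the pair `("sex", "eye_color")` passes the check. A one-to-one match is consistent with those columns being the token inputs but does not prove it, and the number of combinations grows quickly, which is why `max_size` limits the search.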

In [2]:
# Install pandas into the environment used by the current Jupyter kernel.
# Note: the PyPI package is named "pandas"; "panda" is an unrelated package.
import sys
!{sys.executable} -m pip install pandas


In [4]:
# Install a conda package in the current Jupyter kernel
import sys
!conda install --yes --prefix {sys.prefix} numpy

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\ProgramData\anaconda3

  added / updated specs:
    - numpy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2023.12.12 |       haa95532_0         127 KB
    certifi-2023.11.17         |  py311haa95532_0         160 KB
    ------------------------------------------------------------
                                           Total:         286 KB

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2023.11.~ --> pkgs/main::ca-certificates-2023.12.12-haa95532_0 

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge/noarch::certifi-2023.11.1~ --> pkgs/main/win-64::certifi-2023.11.17-py311haa95532_0 



Downloading and Extract


b) Calculate the average weight of people with blue eyes aged between 20 and 50 (inclusive). Assume every row represents a unique individual.
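A minimal sketch of the filter-then-average calculation in part b) is given here. The column names `eye_color`, `age` and `weight` are assumptions, as is the idea that eye color labels may need trimming and lower-casing before comparison; the toy values are invented for illustration.

```python
import pandas as pd


def mean_weight_blue_eyed(df, min_age=20, max_age=50):
    """Mean weight of blue-eyed rows with min_age <= age <= max_age."""
    # Normalize labels so " Blue" and "blue" are treated alike (assumption).
    eyes = df["eye_color"].astype(str).str.strip().str.lower()
    # Coerce non-numeric entries to NaN so they drop out of the calculation.
    age = pd.to_numeric(df["age"], errors="coerce")
    weight = pd.to_numeric(df["weight"], errors="coerce")
    # Series.between includes both endpoints by default.
    mask = (eyes == "blue") & age.between(min_age, max_age)
    return weight[mask].mean()


toy = pd.DataFrame({
    "eye_color": [" Blue", "blue", "blue", "brown"],
    "age": [20, 50, 51, 35],
    "weight": [60.0, 80.0, 100.0, 70.0],
})
average = mean_weight_blue_eyed(toy)
print(average)
```

Only the first two toy rows qualify (ages 20 and 50 sit on the inclusive boundaries), so the result is the mean of 60 and 80.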

c) Comment on the distribution of values within the height column. Can you explain why there may be such a wide range of values? (Hint: it may be useful to research real height distributions in the human population.)

d) Notice that at row 12 (including the header) the columns begin to shift to the left, causing misalignment. Further queries reveal that this error is due to age information being omitted during manual data entry. Given the format of the file, can you identify a simple fix that prevents this misalignment and keeps the formatting consistent?
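To illustrate the kind of misalignment described in part d), the sketch below flags rows whose field count differs from the header. It assumes the file is comma-delimited text, which the exercise does not state explicitly; the function name `find_misaligned_rows` and the sample rows are hypothetical.

```python
import csv
import io


def find_misaligned_rows(text, delimiter=","):
    """Return (line number, field count) for rows that do not match the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    bad = []
    # Line numbers count the header as line 1.
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            bad.append((line_no, len(row)))
    return bad


sample = (
    "token,age,sex,height\n"
    "t1,34,F,165\n"
    "t2,M,180\n"    # age omitted entirely: later values shift left
    "t3,,F,158\n"   # age left empty but delimiter kept: columns stay aligned
)
misaligned = find_misaligned_rows(sample)
print(misaligned)
```

In the sample, the row that omits the age altogether is flagged, while the row that keeps an empty field between two delimiters is not. That suggests one simple convention: always write the delimiter for a missing value.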

e) The client wishes to use the dataset partly to analyze the relationship between sex and favorite hobby. Comment on any pre-processing steps you think are necessary before performing such analysis.
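As one example of the pre-processing that part e) asks about, the sketch below harmonizes category labels before cross-tabulating. The column names `sex` and `favorite_hobby`, the label mapping in `SEX_LABELS`, and the toy values are all assumptions made for illustration; the real dataset may need different or additional steps.

```python
import pandas as pd

# Hypothetical mapping from raw labels to a single spelling per category.
SEX_LABELS = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}


def clean_category(series):
    """Trim whitespace and lower-case text labels; missing values stay missing."""
    return series.str.strip().str.lower()


def sex_hobby_table(df):
    """Cross-tabulate harmonized sex against harmonized favorite hobby."""
    sex = clean_category(df["sex"]).map(SEX_LABELS)
    hobby = clean_category(df["favorite_hobby"])
    return pd.crosstab(sex, hobby)


toy = pd.DataFrame({
    "sex": ["M", "male ", "F", "Female", "f"],
    "favorite_hobby": ["Chess", "chess", " Running", "running", "chess"],
})
table = sex_hobby_table(toy)
print(table)
```

Without the clean-up step, "M" and "male " or "Chess" and "chess" would be counted as separate categories, which would distort any comparison between groups.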

f) Upon further conversation, the client reveals to you that the dataset provided is a sample of a larger dataset, which consists of 100,000 unique individuals.

    i) Comment on the validity of using this data sample to draw any statistical conclusions about the entire dataset.
    ii) What advice would you give the client should they ask how best to prepare a data sample?
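One approach that could be discussed for part f) ii) is a reproducible stratified random sample, which keeps the proportions of a chosen variable the same in the sample as in the full data. The sketch below is an illustration under assumptions: the `sex` column, the 25% sampling fraction and the seed are hypothetical choices, and stratification is only one of several reasonable sampling designs.

```python
import pandas as pd


def stratified_sample(df, strata_col, frac, seed=0):
    """Draw the same fraction of rows at random from every stratum."""
    # A fixed random_state makes the sample reproducible.
    return df.groupby(strata_col).sample(frac=frac, random_state=seed)


# Toy population (hypothetical): 80 rows labelled F and 20 labelled M.
population = pd.DataFrame({
    "sex": ["F"] * 80 + ["M"] * 20,
    "value": range(100),
})
sample = stratified_sample(population, "sex", frac=0.25)
counts = sample["sex"].value_counts().to_dict()
print(counts)
```

The sample keeps the 80:20 split of the toy population (20 and 5 rows), whereas taking the first rows of a file, for example, could over-represent whichever group happens to be listed first.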

g) Part of our work at Privacy Hub involves applying industry-standard guidance to various real-life use cases. Such guidance often takes the form of modifications to a dataset to reduce the risk of re-identification. Briefly describe how you would implement the following modifications.

    i) Height must be capped at 183cm for females and 198cm for males.
    ii) All marital status values present must be limited to one of: "Married", "Single", "Divorced", "Widowed", or "Unknown". Furthermore, marital status must not be present for patients under 30.
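The two modifications in part g) could be sketched as follows. The column names (`sex`, `age`, `height_cm`, `marital_status`), the sex codes `"F"` and `"M"`, and the choice to map unlisted marital status values to "Unknown" are assumptions; the exercise does not specify them.

```python
import pandas as pd

HEIGHT_CAP_CM = {"F": 183, "M": 198}
ALLOWED_MARITAL = {"Married", "Single", "Divorced", "Widowed", "Unknown"}


def apply_modifications(df):
    """Cap height by sex, restrict marital status, and blank it under age 30."""
    out = df.copy()

    # i) Replace heights above the sex-specific cap with the cap itself.
    caps = out["sex"].map(HEIGHT_CAP_CM)
    out["height_cm"] = out["height_cm"].mask(out["height_cm"] > caps, caps)

    # ii) Standardize spelling, then send anything unlisted to "Unknown"
    # (assumption: unlisted values are recoded rather than dropped).
    marital = out["marital_status"].astype(str).str.strip().str.title()
    marital = marital.where(marital.isin(ALLOWED_MARITAL), "Unknown")
    # Remove marital status entirely for anyone under 30.
    out["marital_status"] = marital.mask(out["age"] < 30)
    return out


toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "age": [45, 29, 31, 60],
    "height_cm": [190, 200, 170, 180],
    "marital_status": ["married", "Single", "separated", " Widowed "],
})
result = apply_modifications(toy)
print(result)
```

In the toy output the two heights above the caps are reduced to 183 and 198, "separated" becomes "Unknown", and the 29-year-old's marital status is left empty.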

### Appendix

To illustrate the tokenization process, consider the data found in the table below. To create the associated tokens, the variables var1 and var2 have been fed into a tokenization engine. Thus, rows with the same entries in both of these variables have the same token, despite potentially representing different individuals (this is known as a token collision, or clash).

| var1 | var2 | token     |
|------|------|-----------|
| A    | X    | 2fw4hcl7  |
| A    | Y    | oa3t6saj  |
| B    | X    | a56plm9   |
| A    | X    | 2fw4hcl7  |
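The behavior in the table can be reproduced with a small sketch. It assumes a keyed hash (HMAC-SHA256) truncated to eight characters as a stand-in for the tokenization engine, which the exercise does not describe; the function `make_token` and the key are hypothetical, and the tokens it produces will not match the illustrative values in the table.

```python
import hashlib
import hmac


def make_token(values, secret_key, length=8):
    """Derive a short token from the input values with a keyed hash."""
    # Join the inputs with a separator so ("A", "X") always hashes the same way.
    message = "|".join(str(v) for v in values).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return digest[:length]


# Same (var1, var2) rows as the table above.
rows = [("A", "X"), ("A", "Y"), ("B", "X"), ("A", "X")]
key = b"demo-key-not-for-production"  # hypothetical secret
tokens = [make_token(row, key) for row in rows]
for row, token in zip(rows, tokens):
    print(row, token)
```

The first and last rows share a token because their inputs are identical, which is the collision described above. Adding more distinguishing input variables reduces such collisions, and keeping the key secret makes it harder to rebuild tokens by hashing guessed inputs.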
