# Take-Home Assignment - Human Native - Walkthrough

## Introduction

We are the ML team at Human Native. We want to build a system that reduces the workload of the operations team: this team currently reviews hundreds of thousands of media items (text, image, audio, video, animation).

The engineering team currently has a setup like so:

```
Dataset(org_id, id, name, type)
Data(dataset_id, id, value, flag)
```



We want to build a model that automatically flags data that might be in violation.

We are doing this to be able to provide a service that will allow consumers to provide structured information about where, why and how the data is in violation.

Hence, what this small MVP project wishes to show is that:
1. for a specific modality of data (we will use text)
2. we can return structured information about where, why and how the data is in violation
3. we want to build a model that does this
4. we also want to provide recommendations about the current data structures used by the engineering team to make this problem as easy as possible.

Here are some specifications:
- we will note our assumptions as we go along
- we want to build our own model, where we can
- we don't need to worry about persistence
- our model currently has access to all previously flagged data items as well as those verified as correct by the operations team

### Rough Plan 

We will design a pipeline that is independent of the model we use to classify, so that we can iterate rapidly. We will then train our own DistilBERT model for classification.
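
One way to keep the pipeline independent of the classifier is to have every model expose the same small interface. The sketch below is an assumption about how that could look, not the design used later in this walkthrough: `Span`, `PIIDetector`, `RegexEmailDetector` and `flag_text` are hypothetical names, and the email regex is deliberately naive.

```python
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Span:
    """One detected PII occurrence (hypothetical structure)."""
    entity_type: str
    start: int
    end: int
    score: float


class PIIDetector(Protocol):
    """Anything with a matching `detect` method can be plugged into the pipeline."""

    def detect(self, text: str) -> list[Span]: ...


class RegexEmailDetector:
    """Toy stand-in; a Presidio or DistilBERT wrapper would expose the same method."""

    _pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def detect(self, text: str) -> list[Span]:
        return [Span("EMAIL_ADDRESS", m.start(), m.end(), 1.0)
                for m in self._pattern.finditer(text)]


def flag_text(detector: PIIDetector, text: str) -> int:
    """Return 1 if the detector finds any PII in the text, else 0."""
    return int(len(detector.detect(text)) > 0)


print(flag_text(RegexEmailDetector(), "mail me at a@b.com"))
```

Swapping models then only means passing a different detector object to `flag_text`; the rest of the pipeline does not change.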

## MVP of an MVP

Here, we simply take in a piece of text and build something that returns an object containing the original text, the text with redactions, and structured information about where the PII is.
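
As a rough illustration of the object described here, the sketch below bundles the three pieces into one dataclass. `PIIReport` and its field names are assumptions made for this example; they are not part of Presidio or of the engineering team's schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PIIReport:
    """Hypothetical return object: original text, redacted text and PII spans."""
    original_text: str
    redacted_text: str
    spans: list[dict] = field(default_factory=list)  # each: type, start, end, score

    @property
    def flag(self) -> int:
        """1 if at least one PII span was recorded, else 0."""
        return int(bool(self.spans))

    def to_json(self) -> str:
        """Serialise the report, including the derived flag."""
        payload = asdict(self)
        payload["flag"] = self.flag
        return json.dumps(payload)


report = PIIReport(
    original_text="my email is a@b.com",
    redacted_text="my email is [EMAIL_ADDRESS]",
    spans=[{"type": "EMAIL_ADDRESS", "start": 12, "end": 19, "score": 1.0}],
)
print(report.to_json())
```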

In [3]:
# Relevant imports
from presidio_analyzer import AnalyzerEngine, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
import json
from pprint import pprint

sample_text = "hello my name is  Mac Walker, my email is macskyewalker@gmail.com and I live at 10 downing street"
sample_entities = ["PHONE_NUMBER", "PERSON", "EMAIL_ADDRESS", "LOCATION", "UK_NINO"]

# Please see a full list of supported entities here: (https://microsoft.github.io/presidio/supported_entities/)


analyser = AnalyzerEngine()

analyser_results = analyser.analyze(text=sample_text, entities = sample_entities, language='en')

print((analyser_results[2]))


type: LOCATION, start: 80, end: 97, score: 0.85


Great, so we now have an extremely basic system that can detect PII. Right now, this is just a tool, so let's build functionality around it to actually go through our data source.

In [4]:
analyser = AnalyzerEngine()

def presidio_detect_PII(text, entities, language):

    assert isinstance(text, str), "text must be a string"
    assert isinstance(entities, list) and all(isinstance(e, str) for e in entities), "entities must be a list of strings"
    assert isinstance(language, str), "language must be a string"

    analyser_results = analyser.analyze(text=text, entities=entities, language=language)
    return analyser_results

# Small Test
sample_results = presidio_detect_PII(text= sample_text, entities = sample_entities, language = 'en')
print(sample_results)

[type: EMAIL_ADDRESS, start: 42, end: 65, score: 1.0, type: PERSON, start: 18, end: 28, score: 0.85, type: LOCATION, start: 80, end: 97, score: 0.85]


In [5]:
anonymiser = AnonymizerEngine()
def presidio_replace_PII(text, entities, language):

    analyzer_results = presidio_detect_PII(text = text, entities = entities,language=language)

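    # Build one "replace" operator per detected entity type, e.g. PERSON -> "[PERSON]".
    # Uses the local analyzer_results rather than the global analyser_results from the first cell.
    operators = {
            result.entity_type: OperatorConfig("replace", {"new_value": f"[{result.entity_type}]"}) for result in analyzer_results
        } # got this from ChatGPT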

    anonymised_results = anonymiser.anonymize(
        text = text,
        analyzer_results = analyzer_results,
        operators = operators
        
    )
    return anonymised_results

#Small Test
sample_anonymised = presidio_replace_PII(text= sample_text, entities = sample_entities, language = 'en')
print(sample_anonymised)

text: hello my name is  [PERSON], my email is [EMAIL_ADDRESS] and I live at [LOCATION]
items:
[
    {'start': 70, 'end': 80, 'entity_type': 'LOCATION', 'text': '[LOCATION]', 'operator': 'replace'},
    {'start': 40, 'end': 55, 'entity_type': 'EMAIL_ADDRESS', 'text': '[EMAIL_ADDRESS]', 'operator': 'replace'},
    {'start': 18, 'end': 26, 'entity_type': 'PERSON', 'text': '[PERSON]', 'operator': 'replace'}
]



Okay, so we have some functions that allow us to take in a piece of text, detect where the PII is, and return the text that has been anonymised.
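
Note that the offsets in the anonymised `items` (for example `LOCATION` at 70 to 80) refer to positions in the redacted text, whereas the analyser offsets (80 to 97) refer to the original text. The pure-Python sketch below illustrates the replacement step under the assumption that spans do not overlap; `redact` is a hypothetical helper written for this example and is not how Presidio implements it.

```python
def redact(text: str, spans: list[dict]) -> str:
    """Replace each span with a [TYPE] placeholder.

    Spans are processed from right to left so that replacing one span
    does not shift the offsets of the spans still to be processed.
    Assumes the spans do not overlap.
    """
    redacted = text
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['type']}]"
        redacted = redacted[:span["start"]] + placeholder + redacted[span["end"]:]
    return redacted


sample_text = "hello my name is  Mac Walker, my email is macskyewalker@gmail.com and I live at 10 downing street"
# Offsets copied from the analyser output above.
sample_spans = [
    {"type": "EMAIL_ADDRESS", "start": 42, "end": 65},
    {"type": "PERSON", "start": 18, "end": 28},
    {"type": "LOCATION", "start": 80, "end": 97},
]
print(redact(sample_text, sample_spans))
# hello my name is  [PERSON], my email is [EMAIL_ADDRESS] and I live at [LOCATION]
```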

We want to compare two kinds of model: the Presidio-based approach above, which is off the shelf rather than our own, and a few ML models that we can train ourselves.

The next step, before training our model, should be construction of a relatively big dataset (1000 blog post entries), so we can start with our model training.

In [21]:
import random
from faker import Faker

faker = Faker()  # instance used by generate_blogpost below

def generate_blogpost(pii_mode):
    base_sentences = faker.paragraph(nb_sentences=6).split('. ')
    flag = 1 if pii_mode != "none" else 0
    pii_metadata = []

    pii_snippets = [
        ("NAME", faker.name),
        ("EMAIL", faker.email),
        ("LOCATION", faker.city),
        ("PHONE", faker.phone_number),
        ("URL", lambda: f"https://{faker.domain_name()}"),
    ]

    if pii_mode == "none":
        full_text = '. '.join(base_sentences).strip()
        return full_text, flag, pii_mode, pii_metadata

    pii_type, pii_func = random.choice(pii_snippets)
    pii_value = pii_func()

    if pii_mode == "standalone":
        insert_idx = random.randint(0, len(base_sentences))
        base_sentences.insert(insert_idx, pii_value)

    elif pii_mode == "raw":
        idx = random.randint(0, len(base_sentences) - 1)
        base_sentences[idx] += f". {pii_value}"

    elif pii_mode == "embedded":
        embedded_templates = [
            ("NAME", f"my name is {faker.name()}"),
            ("EMAIL", f"you can email me at {faker.email()}"),
            ("LOCATION", f"I live in {faker.city()}"),
            ("PHONE", f"my number is {faker.phone_number()}"),
            ("URL", f"visit my site at https://{faker.domain_name()}"),
        ]
        pii_type, pii_text = random.choice(embedded_templates)
        idx = random.randint(0, len(base_sentences) - 1)
        if random.random() < 0.5:
            base_sentences[idx] = pii_text + '. ' + base_sentences[idx]
        else:
            base_sentences[idx] += '. ' + pii_text
        pii_value = pii_text  # overwrite for consistency

    # Join final text and record location of the PII span
    full_text = '. '.join(base_sentences).strip()

    if pii_value in full_text:
        start = full_text.index(pii_value)
        end = start + len(pii_value)
        pii_metadata.append({
            "type": pii_type,
            "start": start,
            "end": end,
            "value": pii_value
        })

    return full_text, flag, pii_mode, pii_metadata


Almost all of the above was created using ChatGPT - it did this in like 5 minutes of discussion! This is great though. Let's go over what we currently have:

Dataset(org_id, id, name, type): I haven't 'created' this yet, because we basically only have 1 entry for it (which is what we will be using to train our model).
 - ```Dataset(faker_id: (haven't made one), faker_blog_posts_id : a5c85cbf-c960-431f-b2e5-d4ba1a9601b9, faker_blog_posts, text)```


Data(dataset_id, id, value, flag):
- ```Data(a5c85cbf-c960-431f-b2e5-d4ba1a9601b9,bbb0d0ec-ec58-4f68-8047-692a59edfd44, "Value idea trade left. Bit practice already billion call degree. Tax professor mission stock because. Ahead each fish onto.", 0, none)```

So we have an example of our Data and our Dataset. Let us make some observations about our data. There are a few.

---------- 

1. the naming of ```org_id, id, dataset_id and id``` (again?!!!) is confusing to me. Hence, I recommend changing the id naming convention to maintain readability, to this: 
- ```Dataset(org_id, id, name, type)``` -> ```Dataset(org_id, dataset_id, name, type)```
- ```Data(dataset_id, id, value, flag)``` -> ```Data(dataset_id, data_id, value, flag)```

2. We have the value, which is basically the data values (text, images, animation etc). We also have a flag which supposedly tells us whether it contains PII? We should actually include WHERE in the content the PII is, WHAT the PII is (what has been detected) and HOW we have dealt with it (have we removed? anonymised? encrypted?)

3. We currently have a flag. But where is that flag from? Is it from the operations team who have flagged this, or is it from our model? We should keep track of which model gave which flag so we can continuously check against the best models.
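
The three recommendations above could be combined roughly as follows. This is only a sketch: `Violation`, `flag_source`, `action` and the example values are names invented here for illustration, not fields that exist in the engineering team's current schema.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    org_id: str
    dataset_id: str  # renamed from `id` (recommendation 1)
    name: str
    type: str        # modality, e.g. "text"


@dataclass
class Violation:
    entity_type: str  # WHAT was detected, e.g. "EMAIL_ADDRESS"
    start: int        # WHERE it starts in `value`
    end: int          # WHERE it ends in `value`
    action: str       # HOW it was handled, e.g. "none", "replace", "encrypt"


@dataclass
class Data:
    dataset_id: str
    data_id: str      # renamed from `id` (recommendation 1)
    value: str
    flag: int
    flag_source: str  # who set the flag (recommendation 3), e.g. "operations_team" or a model name
    violations: list[Violation] = field(default_factory=list)  # recommendation 2


row = Data(
    dataset_id="ds-1",
    data_id="row-1",
    value="you can email me at a@b.com",
    flag=1,
    flag_source="presidio_baseline",
    violations=[Violation("EMAIL_ADDRESS", 20, 27, "none")],
)
print(row.value[row.violations[0].start:row.violations[0].end])
```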

In [24]:
# Generate 1000 blog posts
from uuid import uuid4
import pandas as pd

dataset_id = str(uuid4())
data_rows = []

for mode in ["embedded", "raw", "standalone", "none"]:
    for _ in range(250):
        text, flag, pii_mode, pii_spans = generate_blogpost(pii_mode=mode)
        data_rows.append({
            'dataset_id': dataset_id,
            'id': str(uuid4()),
            'value': text,
            'flag': flag,
            'pii_mode': pii_mode,
            'pii_spans': json.dumps(pii_spans)
        })

# Shuffle the dataset
random.shuffle(data_rows)

# Convert to DataFrame
df = pd.DataFrame(data_rows)
df.head()

df.to_csv("Data_with_spans.csv", index=False)
df.to_json("Data_with_spans.json", orient="records", lines=True)
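
Since `generate_blogpost` records span offsets with `str.index`, it may be worth checking that every stored span really points at its value before training on the data. The helper below is a hypothetical sanity check, assuming rows shaped like the `data_rows` entries above (with `pii_spans` stored as a JSON string); the three example rows are made up for illustration.

```python
import json


def check_spans(rows: list[dict]) -> list[str]:
    """Return the ids of rows whose recorded spans are inconsistent with the text."""
    bad_ids = []
    for row in rows:
        spans = json.loads(row["pii_spans"])
        # A flagged row with no recorded span cannot be used for span-level training.
        if row["flag"] == 1 and not spans:
            bad_ids.append(row["id"])
            continue
        # Every recorded span must slice out exactly the recorded value.
        if any(row["value"][s["start"]:s["end"]] != s["value"] for s in spans):
            bad_ids.append(row["id"])
    return bad_ids


example_rows = [
    {"id": "row-1", "value": "Hello. my name is Jane Doe. Bye.", "flag": 1,
     "pii_spans": json.dumps([{"type": "NAME", "start": 7, "end": 26,
                               "value": "my name is Jane Doe"}])},
    {"id": "row-2", "value": "Hello. Jane Doe. Bye.", "flag": 1,
     "pii_spans": json.dumps([{"type": "NAME", "start": 0, "end": 5,
                               "value": "Jane"}])},
    {"id": "row-3", "value": "Nothing to see here.", "flag": 0,
     "pii_spans": json.dumps([])},
]
print(check_spans(example_rows))
```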



In [27]:
import pandas as pd
from tqdm import tqdm
import time
from presidio_analyzer import AnalyzerEngine

df = pd.read_csv("Data_with_spans.csv")

sample_entities = ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION", "URL"]

analyser = AnalyzerEngine()

start = time.time()
found_flags = []

for _, row in tqdm(df.iterrows(), total=len(df), desc="Detecting PII with Presidio"):
    text = row["value"]
    results = analyser.analyze(text=text, entities=sample_entities, language="en")
    found_flags.append(int(len(results) > 0))

df["found_flag"] = found_flags
df.to_csv("Data_with_spans_and_found_flag.csv", index=False)

print(f"Time taken for 1000 paragraphs: {time.time() - start:.2f} seconds")


Detecting PII with Presidio: 100%|██████████| 1000/1000 [00:19<00:00, 50.21it/s]

Time taken for 1000 paragraphs: 19.94 seconds





In [29]:
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv("Data_with_spans_and_found_flag.csv")

print("\n📊 Overall Classification Report:")
print(classification_report(df["flag"], df["found_flag"], target_names=["no_pii", "has_pii"]))

print("\n🔄 Confusion Matrix:")
print(confusion_matrix(df["flag"], df["found_flag"]))



📊 Overall Classification Report:
              precision    recall  f1-score   support

      no_pii       0.80      0.96      0.87       250
     has_pii       0.99      0.92      0.95       750

    accuracy                           0.93      1000
   macro avg       0.89      0.94      0.91      1000
weighted avg       0.94      0.93      0.93      1000


🔄 Confusion Matrix:
[[240  10]
 [ 60 690]]
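
To make the link between the confusion matrix and the report explicit, the figures can be recomputed by hand. The sketch below assumes scikit-learn's layout for `confusion_matrix` (rows are true labels, columns are predictions, labels in sorted order), so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Values copied from the confusion matrix above.
tn, fp = 240, 10   # texts without PII: correctly passed / wrongly flagged
fn, tp = 60, 690   # texts with PII: missed / correctly flagged

precision_has_pii = tp / (tp + fp)
recall_has_pii = tp / (tp + fn)
precision_no_pii = tn / (tn + fn)
recall_no_pii = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"has_pii: precision={precision_has_pii:.2f}, recall={recall_has_pii:.2f}")
print(f"no_pii:  precision={precision_no_pii:.2f}, recall={recall_no_pii:.2f}")
print(f"accuracy={accuracy:.2f}")
```

In other words, the baseline misses 60 of the 750 texts that contain PII and wrongly flags 10 of the 250 clean texts.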


In [28]:
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.read_csv("Data_with_spans_and_found_flag.csv")

print("\n📁 Accuracy by PII Mode:\n")

for mode in sorted(df["pii_mode"].unique()):
    subset = df[df["pii_mode"] == mode]
    y_true = subset["flag"]
    y_pred = subset["found_flag"]

    acc = accuracy_score(y_true, y_pred)
    print(f"{mode.title():<12}: Accuracy = {acc:.3f} ({len(subset)} samples)")



📁 Accuracy by PII Mode:

Embedded    : Accuracy = 0.944 (250 samples)
None        : Accuracy = 0.960 (250 samples)
Raw         : Accuracy = 0.920 (250 samples)
Standalone  : Accuracy = 0.896 (250 samples)


Okay, so let's think about what we have - a _baseline_.

This, in my opinion, is a really useful thing to have. It is all well and good fine-tuning a huge transformer model for PII detection (which we will proceed to do), but we should understand _why_ we are doing it and what benefits it will give this process.

Our Presidio-only model (which, with the default `AnalyzerEngine()`, relies on a spaCy NER model behind the scenes for entities such as `PERSON` and `LOCATION`) achieves an accuracy of 93%. This is okay - not amazing.
On closer inspection, we can see a difference between the accuracies of the four different types of PII injection we used to create the dummy dataset. Our sample size is too small to draw firm conclusions, though the model appears to be better at classifying `None` and `Embedded` than `Raw` and `Standalone`.
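
To put a rough number on the sample-size caveat, a confidence interval can be attached to each per-mode accuracy. The sketch below uses the normal approximation for a binomial proportion, which is an assumption on my part (a Wilson interval would be a reasonable alternative); `accuracy_interval` is a hypothetical helper and the accuracies are copied from the per-mode output above.

```python
import math


def accuracy_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a proportion (z=1.96 gives ~95%)."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


per_mode_accuracy = {"Embedded": 0.944, "None": 0.960, "Raw": 0.920, "Standalone": 0.896}
intervals = {mode: accuracy_interval(acc, 250) for mode, acc in per_mode_accuracy.items()}

for mode, (low, high) in intervals.items():
    print(f"{mode:<12}: [{low:.3f}, {high:.3f}]")
```

With 250 samples per mode, each interval is roughly 2 to 4 percentage points wide on either side, which is comparable to the gaps between neighbouring modes; the `Embedded` and `Standalone` intervals, for instance, overlap.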

I posit that the reason for this is that for both `None` and `Embedded`, these are more 


FINISH LATER
