# Comparison of GCP's PII detection tools vs Presidio

In this notebook we evaluate the PII detection capabilities offered by GCP DLP against Presidio.
To avoid sending sensitive Mozilla data through GCP DLP, we run the evaluation using the public AOL search dataset (see [prepare-aol-data.ipynb](./prepare-aol-data.ipynb) for preprocessing steps).

__[Presidio](https://microsoft.github.io/presidio/)__
- Open-source PII handling toolkit from Microsoft
- Offers detection of a decent range of PII entities using a combination of pattern matching and public entity recognition models (spaCy)
- Main shortcoming is that it doesn't offer detection of street addresses, which DLP does
- Running locally, it is quite slow (10K queries per minute for numeric PII). This may be related to the fact that the main `analyze` function does not run in batch mode, and must be called separately for each query string. We'll need to evaluate further whether it is fast enough for production.


__[GCP DLP](https://cloud.google.com/dlp/docs/how-to)__
- Closed-source offering in the GCP ecosystem
- Offers detection of a broad range of PII entities, including street addresses and a long list of global ID number formats
- To use, a request containing the query must be made to its REST API. There is a Python client library which wraps the issuance of requests and processing of responses.
- As the API backend can be updated at any time, results may not be exactly the same from one run to the next
- In our evaluations, it was about 10x faster than Presidio (~100K queries in <1 min). It can process text in batches subject to limitations on number of records and total data size.
- For a production deployment, we will need to investigate the policy/legal implications of sending our data through this additional API, as well as performance (eg. latency). There is also an additional [cost](https://cloud.google.com/dlp/pricing) for using the API (~$2-3/GB).
- It offers integration with BigQuery, so another option could be to use it as part of a separate sanitization pipeline for internal data, as opposed to live querying.
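
The batching constraint noted above (a cap on both the number of records and the total data size per request) can be handled by grouping queries before they are sent. The following is only a rough sketch: `batch_queries`, `max_records`, and `max_bytes` are hypothetical names, and the default limits are placeholders rather than DLP's documented quotas.

```python
from typing import Iterable, Iterator, List


def batch_queries(
    queries: Iterable[str],
    max_records: int = 1000,   # placeholder, not an official DLP limit
    max_bytes: int = 500_000,  # placeholder, not an official DLP limit
) -> Iterator[List[str]]:
    """Yield lists of queries that respect both a record cap and a byte cap."""
    batch: List[str] = []
    batch_bytes = 0
    for q in queries:
        size = len(q.encode("utf-8"))
        # Start a new batch if adding this query would exceed either cap.
        if batch and (len(batch) >= max_records or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(q)
        batch_bytes += size
    if batch:
        yield batch


batches = list(batch_queries(["a", "bb", "ccc", "dddd"], max_records=2, max_bytes=5))
print(batches)
```

A single query larger than `max_bytes` still ends up in a batch of its own, so a real client would need to truncate or skip such inputs separately.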

## Setup for GCP DLP

This documents the steps needed to set up access to DLP for testing (more details [here](https://cloud.google.com/dlp/docs/inspect-sensitive-text)). These have already been completed and do not need to be repeated.

1. Set up [prototype GCP project](https://docs.telemetry.mozilla.org/cookbooks/gcp-projects.html): `search-sanitization-dev`
2. Enable the DLP API in the console
3. Create service account & key: `suggest-test-dlp`
4. Activate service account with CLI: 
```bash
gcloud auth activate-service-account \
    suggest-test-dlp@search-sanitization-dev.iam.gserviceaccount.com \
    --key-file=<abs_path_to_key>
```

## Running this notebook

In order to run this notebook, first install the client libraries into the environment:
```bash
pip install -r https://raw.githubusercontent.com/googleapis/python-dlp/main/samples/snippets/requirements.txt
```

The GCP credentials must be specified in the environment.
One way to do this is on starting the Jupyter server:
```bash
GOOGLE_APPLICATION_CREDENTIALS=path/to/file jupyter notebook ...
```
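
Before running the DLP cells, it can be useful to confirm that the variable is actually visible to the kernel. This is a small sketch using only the standard library; `credentials_status` is a hypothetical helper, not part of `sanitization_lib` or the Google client library.

```python
import os
from pathlib import Path


def credentials_status(env=os.environ) -> str:
    """Return 'unset', 'missing-file', or 'ok' for GOOGLE_APPLICATION_CREDENTIALS."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "unset"
    if not Path(path).is_file():
        return "missing-file"
    return "ok"


print(credentials_status())
```

This only checks that the key file exists; it does not verify that the service account has access to the DLP API.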

In [1]:
# Sibling import
import sys
sys.path.insert(0, "../src")

In [2]:
import pandas as pd
from IPython.display import display

from suggest_search_tools import sanitization_lib

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.show_dimensions", True)

In [3]:
# Dataset of raw search queries
AOL_QUERIES_CSV = "../assets/aol_queries.csv.gz"
# DataFrame subset for numeric evaluation with inspection results
NUMERIC_INSPECTED_PKL = "./aol_numeric_inspected.pkl"
# Small manually labeled subset for numeric evaluation
NUMERIC_LABELED_PKL = "../assets/aol_numeric_labeled.pkl"

As a test set, we use queries pulled from the AOL search dataset.

In [4]:
aol_df = pd.read_csv(AOL_QUERIES_CSV)

In [5]:
print(f"Num queries: {len(aol_df):,}")

Num queries: 1,216,584


In [14]:
aol_df

Unnamed: 0,query,has_num
0,& bayshoredermatology group,False
1,& order,False
2,& secrets hidden in our childish li,False
3,& wg dj n x m iw un x1 vk e qe g w ' 1 g n k nm o o j w q. qb ; -,True
4,&c91904my yahoo mail,True
...,...,...
1216579,÷nenßOn()ßf÷,False
1216580,÷÷÷ (OG·O=f ±·n,False
1216581,ø â áí',False
1216582,ùèõáíö,False


## Numeric-type PII

In this section, we test out PII detection on numeric-type PII (eg. phone numbers, ID numbers, IP addresses, street addresses). We compare Presidio against GCP DLP across several of these PII types (US-only for now).

For evaluation purposes, we create a dataset containing:
- the subset of all AOL queries containing numerals
- a ~5% sample of the queries which don't contain numerals (the majority)
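
The construction described above can be made reproducible by fixing a random seed. The sketch below uses only the standard library on a toy list; `build_eval_set` is a hypothetical helper, and detecting numerals with `str.isdigit` is an assumption about how the `has_num` flag is defined in the preprocessing notebook.

```python
import random
from typing import List


def build_eval_set(queries: List[str], n_nonnumeric: int, seed: int = 0) -> List[str]:
    """Keep every query containing a digit plus a seeded sample of the rest."""
    numeric = [q for q in queries if any(ch.isdigit() for ch in q)]
    nonnumeric = [q for q in queries if not any(ch.isdigit() for ch in q)]
    rng = random.Random(seed)  # fixed seed -> same sample on every run
    sampled = rng.sample(nonnumeric, min(n_nonnumeric, len(nonnumeric)))
    return numeric + sampled


toy = ["call 555 0100", "cheap flights", "zip 94105", "cat videos", "pizza near me"]
eval_set = build_eval_set(toy, n_nonnumeric=2, seed=42)
print(eval_set)
```

With pandas, the equivalent is to pass `random_state=<int>` to `DataFrame.sample`.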

In [6]:
### Reload saved version (includes evaluation results). ###

# from sanitization_lib import EntityResult
# aol_num_eval = pd.read_pickle(NUMERIC_INSPECTED_PKL)

In [537]:
aol_num = aol_df.query("has_num")
# NOTE: no seed was set here, so rerunning will create a different dataset.
# Pass `random_state=<int>` to `sample` to make the sample reproducible.
aol_nonnum = aol_df.query("~has_num").sample(50000)

aol_num_eval = pd.concat([aol_num, aol_nonnum])

In [29]:
print(f"Num queries (numeric eval): {len(aol_num_eval):,}")

Num queries (numeric eval): 143,857


In [30]:
aol_num_eval["has_num"].value_counts()

True     93857
False    50000
Name: has_num, Length: 2, dtype: int64

### Run Presidio

We consider all global and US numeric-related entities.

Note that Presidio doesn't have address detection, but we enable location detection instead to see how well that performs.

In [60]:
PRESIDIO_ENTITIES_NUM = [
    "CREDIT_CARD",
    "CRYPTO",
    "IBAN_CODE",
    "IP_ADDRESS",
    "MEDICAL_LICENSE",
    "PHONE_NUMBER",
    "US_BANK_NUMBER",
    "US_DRIVER_LICENSE",
    "US_ITIN",
    "US_PASSPORT",
    "US_SSN",
    "LOCATION",
]

presidio = sanitization_lib.PresidioScanner(entities=PRESIDIO_ENTITIES_NUM)

In [539]:
%%time

# Timing is approximately linear, 10K queries per minute on local machine

aol_num_eval["presidio"] = presidio.scan_strings(aol_num_eval["query"])

CPU times: user 13min 59s, sys: 15.2 s, total: 14min 14s
Wall time: 14min 51s


How many queries had PII detected?

In [56]:
aol_num_eval["presidio"].notna().sum()

15616

### Run DLP

We enable corresponding numeric-related global and US entities.

We include street address as well as location detection for comparison against Presidio.

In [14]:
GCP_ENTITIES_NUM = [
    "CREDIT_CARD_NUMBER",
    "IBAN_CODE",
    "IP_ADDRESS",
    "LOCATION",
    "PASSPORT",
    "PHONE_NUMBER",
    "STREET_ADDRESS",
    "SWIFT_CODE",
    "VEHICLE_IDENTIFICATION_NUMBER",
    "US_DRIVERS_LICENSE_NUMBER",
    "US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER",
    "US_PASSPORT",
    "US_SOCIAL_SECURITY_NUMBER",   
]

dlp = sanitization_lib.DLPScanner(entities=GCP_ENTITIES_NUM)

In [543]:
%%time

aol_num_eval["dlp"] = dlp.scan_strings(aol_num_eval["query"])

CPU times: user 2.78 s, sys: 496 ms, total: 3.27 s
Wall time: 1min 25s
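
The wall times reported for the two scanners can be converted into throughput figures to check the "about 10x" estimate. This is plain arithmetic on the numbers printed in this notebook; the DLP figure includes network round trips, so it will vary between runs.

```python
NUM_QUERIES = 143_857           # size of the numeric evaluation set
PRESIDIO_WALL_S = 14 * 60 + 51  # 14min 51s
DLP_WALL_S = 1 * 60 + 25        # 1min 25s


def queries_per_minute(n_queries: int, wall_seconds: float) -> float:
    """Convert a total wall time into a queries-per-minute rate."""
    return n_queries / wall_seconds * 60


presidio_qpm = queries_per_minute(NUM_QUERIES, PRESIDIO_WALL_S)
dlp_qpm = queries_per_minute(NUM_QUERIES, DLP_WALL_S)
speedup = PRESIDIO_WALL_S / DLP_WALL_S

print(f"Presidio: {presidio_qpm:,.0f} queries/min")
print(f"DLP:      {dlp_qpm:,.0f} queries/min")
print(f"Speedup:  {speedup:.1f}x")
```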


In [564]:
aol_num_eval["dlp"] = aol_num_eval["dlp"].mask(aol_num_eval["dlp"].isna(), None)

How many queries had PII detected?

In [566]:
aol_num_eval["dlp"].notna().sum()

23285

In [567]:
aol_num_eval[aol_num_eval["presidio"].notna() | aol_num_eval["dlp"].notna()].sample(30)

Unnamed: 0,query,has_num,presidio,dlp
522764,king kong cake topper,False,"[(LOCATION, king kong, 0.85)]",
91847,banks in new york 1993-1996,True,"[(LOCATION, new york, 0.85)]","[(LOCATION, new york, LIKELY)]"
1017576,wilson n6,True,,"[(STREET_ADDRESS, wilson n6, LIKELY)]"
160782,career fairs in miami,False,"[(LOCATION, miami, 0.85)]","[(LOCATION, miami, POSSIBLE)]"
553808,list of resetrunts in the wyndmoor pa. area,False,,"[(LOCATION, wyndmoor, LIKELY), (LOCATION, pa, LIKELY)]"
738536,pms.ccsd.k12.co.us,True,"[(LOCATION, pms.ccsd.k12.co.us, 0.85)]","[(LOCATION, co, POSSIBLE), (LOCATION, us, LIKELY)]"
451120,http rapidshare.de files 13753056 ned-calls preist.rar,True,"[(US_BANK_NUMBER, 13753056, 0.05), (US_DRIVER_LICENSE, 13753056, 0.01)]",
1059903,www.breckenridge home oweners.com,False,,"[(LOCATION, breckenridge, POSSIBLE)]"
709255,paso robles junior golf march 24,True,,"[(LOCATION, paso robles, LIKELY)]"
20162,86th street cinema btwn 2nd and 3rd,True,,"[(STREET_ADDRESS, 86th street, LIKELY)]"


Persist inspection results.

In [574]:
aol_num_eval.to_pickle(NUMERIC_INSPECTED_PKL)

### Locations

How well does the location detection work?

- By looking at some examples, we find it's not a great indicator of PII. It flags general locations, eg "cities in texas" and misses some actual addresses.

We will not use location detection further.

In [25]:
(
    aol_num_eval
    .assign(
        has_location=aol_num_eval["presidio"].map(lambda x: x is not None and "LOCATION" in [y.type for y in x]),
        has_address=aol_num_eval["dlp"].map(lambda x: x is not None and "STREET_ADDRESS" in [y.type for y in x]),
    )
    .query("has_location")
#     .query("has_address")
    .sample(30)
)

Unnamed: 0,query,has_num,presidio,dlp,has_location,has_address
174692,chad africa,False,"[(LOCATION, chad africa, 0.85)]","[(LOCATION, chad, POSSIBLE)]",True,False
850346,smithville pa,False,"[(LOCATION, smithville, 0.85), (LOCATION, pa, 0.85)]","[(LOCATION, smithville, POSSIBLE), (LOCATION, pa, LIKELY)]",True,False
247608,days inn in panama city,False,"[(LOCATION, panama city, 0.85)]","[(LOCATION, panama city, LIKELY)]",True,False
493734,jason gerhard and ohio,False,"[(LOCATION, ohio, 0.85)]","[(LOCATION, ohio, LIKELY)]",True,False
537955,las vegas twenty dollar double eagle 1907,True,"[(LOCATION, las vegas, 0.85)]","[(LOCATION, las vegas, LIKELY)]",True,False
427011,homes for rent in orlando,False,"[(LOCATION, orlando, 0.85)]","[(LOCATION, orlando, LIKELY)]",True,False
151695,california women steals police suv 3 3 2006,True,"[(LOCATION, california, 0.85), (PHONE_NUMBER, 3 3 2006, 0.4)]","[(LOCATION, california, LIKELY)]",True,False
1012049,who was the leader of canada during world war 2,True,"[(LOCATION, canada, 0.85)]","[(LOCATION, canada, POSSIBLE)]",True,False
921315,theatre in tucson arizona in april 2006,True,"[(LOCATION, tucson arizona, 0.85)]","[(LOCATION, tucson, LIKELY), (LOCATION, arizona, LIKELY)]",True,False
562111,los angeles museum of modern art,False,"[(LOCATION, los angeles, 0.85)]","[(LOCATION, los angeles, LIKELY), (LOCATION, museum of modern art, POSSIBLE)]",True,False


Drop location from detection results for both Presidio and GCP.

In [31]:
def drop_locations(r):
    if r is None:
        return None
    r = [x for x in r if x.type != "LOCATION"]
    if not r:
        return None
    return r

aol_num_eval["presidio"] = aol_num_eval["presidio"].map(drop_locations)
aol_num_eval["dlp"] = aol_num_eval["dlp"].map(drop_locations)
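
To illustrate the behaviour of `drop_locations` on individual values, here is a self-contained sketch. The `EntityResult` namedtuple is a stand-in for the result objects returned by the scanners: only the `type` and `score` attributes are used in this notebook, and the field name `text` is an assumption.

```python
from collections import namedtuple

# Stand-in for the scanner result objects; the `text` field name is assumed.
EntityResult = namedtuple("EntityResult", ["type", "text", "score"])


def drop_locations(results):
    """Same logic as the cell above, repeated so the example is self-contained."""
    if results is None:
        return None
    kept = [x for x in results if x.type != "LOCATION"]
    # An empty list is collapsed to None so that `.notna()` means "PII detected".
    return kept or None


mixed = [
    EntityResult("LOCATION", "california", 0.85),
    EntityResult("PHONE_NUMBER", "3 3 2006", 0.4),
]
only_location = [EntityResult("LOCATION", "miami", 0.85)]

print(drop_locations(mixed))
print(drop_locations(only_location))
print(drop_locations(None))
```

Collapsing empty lists to `None` matters because the later comparisons count a query as flagged whenever its result is not null.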

### Detection results comparison

How does detection match up for the two libraries?

In [54]:
def pivot_detection_counts(counts_df, index="presidio", columns="dlp"):
    return (
        counts_df
        .pivot_table(
            index=index,
            columns=columns,
            fill_value=0,
            aggfunc="sum",
            margins=True,
            margins_name="total"
        )
        .style.format("{:,}")
    )

In [34]:
pivot_detection_counts(
    aol_num_eval
    .groupby([
        aol_num_eval["presidio"].notna().astype(str),
        aol_num_eval["dlp"].notna().astype(str)
    ])
    .size()
    .reset_index(name="count")
)

Unnamed: 0_level_0,count,count,count
dlp,False,True,total
presidio,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
False,139226,2137,141363
True,1485,1009,2494
total,140711,3146,143857
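
The table above can be condensed into a few agreement figures. The sketch below is plain arithmetic on the four cell counts; the choice of metrics (Jaccard overlap and per-tool confirmation rates) is ours and only describes agreement between the tools, not correctness.

```python
# Counts from the Presidio-vs-DLP table above (locations already dropped).
both = 1_009
presidio_only = 1_485
dlp_only = 2_137
neither = 139_226

total = both + presidio_only + dlp_only + neither
flagged_by_either = both + presidio_only + dlp_only

# Jaccard overlap: queries flagged by both / queries flagged by at least one.
jaccard = both / flagged_by_either
# Share of each tool's detections that the other tool also flagged.
presidio_confirmed_by_dlp = both / (both + presidio_only)
dlp_confirmed_by_presidio = both / (both + dlp_only)

print(f"Total queries:             {total:,}")
print(f"Jaccard overlap:           {jaccard:.1%}")
print(f"Presidio confirmed by DLP: {presidio_confirmed_by_dlp:.1%}")
print(f"DLP confirmed by Presidio: {dlp_confirmed_by_presidio:.1%}")
```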


Presidio doesn't include addresses. How do these results change if we exclude address detection?

In [35]:
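def detected(r, exclude=()):
    # Returns "True"/"False" as strings (to match the other groupby key),
    # ignoring any entity types listed in `exclude`.
    if not r:
        return "False"
    return str(any(x.type not in exclude for x in r))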

pivot_detection_counts(
    aol_num_eval
    .groupby([
        aol_num_eval["presidio"].notna().astype(str),
        aol_num_eval["dlp"].map(lambda x: detected(x, exclude=["STREET_ADDRESS"])),
    ])
    .size()
    .reset_index(name="count")
)

Unnamed: 0_level_0,count,count,count
dlp,False,True,total
presidio,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
False,140871,492,141363
True,1516,978,2494
total,142387,1470,143857


What do the scores look like for detected entities?

In [17]:
def show_score_distributions(result_series):
    flat_results = result_series.explode().dropna()
    flat_results = pd.concat(
        [
            flat_results.map(lambda r: r.type).rename("entity"),
            flat_results.map(lambda r: r.score).rename("score"),
        ],
        axis="columns"
    )
    score_cts = flat_results.value_counts().sort_index().to_frame(name="count")

    # Proportion of each score level within its entity type.
    display(
        score_cts
        .assign(prop=lambda x: x["count"].groupby("entity").transform(lambda s: s / s.sum()))
        .style.format({"prop": "{:.1%}"})
    )

    display(
        score_cts
        .groupby("score")
        .agg({"count": "sum"})
        .assign(prop=lambda x: x["count"] / x["count"].sum())
        .style.format({"prop": "{:.1%}"})
    )

Presidio:

In [18]:
show_score_distributions(aol_num_eval["presidio"])

Unnamed: 0_level_0,Unnamed: 1_level_0,count,prop
entity,score,Unnamed: 2_level_1,Unnamed: 3_level_1
CREDIT_CARD,1.0,12,1.0
IP_ADDRESS,0.6,408,0.978417
IP_ADDRESS,0.95,9,0.021583
LOCATION,0.85,15206,1.0
MEDICAL_LICENSE,1.0,48,1.0
PHONE_NUMBER,0.4,1002,0.974708
PHONE_NUMBER,0.75,26,0.025292
US_BANK_NUMBER,0.05,838,0.989374
US_BANK_NUMBER,0.4,9,0.010626
US_DRIVER_LICENSE,0.01,1655,0.98865


Unnamed: 0_level_0,count,prop
score,Unnamed: 1_level_1,Unnamed: 2_level_1
0.01,1655,8.5%
0.05,1127,5.8%
0.3,3,0.0%
0.4,1040,5.3%
0.5,5,0.0%
0.6,408,2.1%
0.75,26,0.1%
0.85,15206,77.8%
0.95,9,0.0%
1.0,60,0.3%


DLP:

In [48]:
show_score_distributions(aol_num_eval["dlp"])

Unnamed: 0_level_0,Unnamed: 1_level_0,count,prop
entity,score,Unnamed: 2_level_1,Unnamed: 3_level_1
CREDIT_CARD_NUMBER,POSSIBLE,7,0.4375
CREDIT_CARD_NUMBER,UNLIKELY,7,0.4375
CREDIT_CARD_NUMBER,VERY_UNLIKELY,2,0.125
IP_ADDRESS,LIKELY,67,0.208075
IP_ADDRESS,POSSIBLE,254,0.78882
IP_ADDRESS,UNLIKELY,1,0.003106
PASSPORT,UNLIKELY,1,1.0
PHONE_NUMBER,LIKELY,10,0.009174
PHONE_NUMBER,POSSIBLE,416,0.381651
PHONE_NUMBER,UNLIKELY,622,0.570642


Unnamed: 0_level_0,count,prop
score,Unnamed: 1_level_1,Unnamed: 2_level_1
LIKELY,1829,55.3%
POSSIBLE,685,20.7%
UNLIKELY,748,22.6%
VERY_LIKELY,20,0.6%
VERY_UNLIKELY,24,0.7%


### Manual labeling

__Note: this section documents the manual labeling process.
It has already been performed and the results can be loaded from file.__

In order to evaluate correctness, we manually label a small subset of queries as PII/non-PII.

To ensure adequate coverage of the different PII categories, the subset is sampled with stratification over the DLP score levels as follows:

- 500 queries containing numerals and flagged as LIKELY/VERY_LIKELY
- 500 queries containing numerals and flagged as POSSIBLE
- 500 queries containing numerals and flagged as UNLIKELY/VERY_UNLIKELY
- 500 non-flagged queries containing numerals
- 500 queries not containing numerals (regardless of whether they were flagged)

When a query has multiple PII detections, the highest detected level is used.
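
The "highest detected level" rule can be written with an explicit ordering of the DLP likelihood levels. This is an illustrative sketch, independent of the notebook's own `get_dlp_max` helper; `LIKELIHOOD_ORDER`, `BUCKET`, and `max_likelihood_bucket` are hypothetical names, and the buckets mirror the three strata listed above.

```python
# DLP likelihood levels from lowest to highest.
LIKELIHOOD_ORDER = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]
# Buckets matching the strata used for sampling.
BUCKET = {
    "VERY_UNLIKELY": "low",
    "UNLIKELY": "low",
    "POSSIBLE": "med",
    "LIKELY": "high",
    "VERY_LIKELY": "high",
}


def max_likelihood_bucket(scores):
    """Return the bucket of the highest likelihood in `scores`, or None if empty."""
    if not scores:
        return None
    top = max(scores, key=LIKELIHOOD_ORDER.index)
    return BUCKET[top]


print(max_likelihood_bucket(["UNLIKELY", "LIKELY"]))
print(max_likelihood_bucket(["POSSIBLE", "VERY_UNLIKELY"]))
print(max_likelihood_bucket([]))
```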

Each query is labeled as one of
- is PII
- maybe PII
- not PII

regardless of PII type.
For the purposes of this analysis, only numeric-related PII types are considered. People's names are _not_ labeled as PII.

In [7]:
### Reload saved version (includes labels). ###

# num_aol_labeled = pd.read_pickle(NUMERIC_LABELED_PKL)

In [632]:
def get_dlp_max(r):
    if not isinstance(r, list):
        return None
    scores = [x.score for x in r]
    if "LIKELY" in scores or "VERY_LIKELY" in scores:
        return "high"
    if "POSSIBLE" in scores:
        return "med"
    return "low"
        

aol_num_eval["dlp_max_score"] = aol_num_eval["dlp"].map(get_dlp_max)

In [634]:
aol_num_eval["dlp_max_score"].value_counts(dropna=False)

None    140711
high      1814
med        673
low        659
Name: dlp_max_score, Length: 4, dtype: int64

In [635]:
num_aol_labeling = pd.concat([
    aol_num_eval.query("has_num & (dlp_max_score == 'high')").sample(500),
    aol_num_eval.query("has_num & (dlp_max_score == 'med')").sample(500),
    aol_num_eval.query("has_num & (dlp_max_score == 'low')").sample(500),
    aol_num_eval.query("has_num & dlp.isna()").sample(500),
    aol_num_eval.query("~has_num").sample(500),
])
num_aol_labeling = num_aol_labeling.sort_values("query")
num_aol_labeling["is_pii"] = None

In [637]:
num_aol_labeling[["has_num", "dlp_max_score"]].value_counts(dropna=False)

has_num  dlp_max_score
False    NaN              500
True     high             500
         low              500
         med              500
         NaN              500
Length: 5, dtype: int64

In [54]:
TMP_LABELING_CSV = "./aol_num_queries.csv"

In [388]:
num_aol_labeling.to_csv(TMP_LABELING_CSV, columns=["query", "is_pii"])

Queries are labelled manually using `t` (true - is PII) / `m` (maybe) / `<blank>` (false - not PII).


- CSV file is opened using the [Spreadsheet Editor](https://github.com/jupyterlab-contrib/jupyterlab-spreadsheet-editor) extension
- The labels are entered in the `is_pii` column.
- The CSV is saved in the editor and reloaded here.

In [650]:
num_aol_labeled = pd.read_csv(TMP_LABELING_CSV, index_col=0)

In [653]:
num_aol_labeled

Unnamed: 0,query,is_pii
335,----- forwarded by jackie sloan res clubcorp us on 02 27 2006 03 28 pm -----,
350,-----------------forwarded message subj fwd some thoughts 4 u date 4 12 2006 1 42 12 p.m. central daylight time from gdvtek to gdvtek -----------------forwarded message subj fwd some thoughts 4 u date 4 12 2006 1 41 30 p.m. central daylight,
547,...............0000.00000.,
679,.7,
680,.71.244.90.56.,t
...,...,...
1214022,zach winslow 12 31 2005,
1214026,zach winslow r.i.p. 12 31 2005,
1215292,zip code 619 o'hare street edinburg texas,t
1215479,zip codes in georgia,


In [654]:
assert num_aol_labeled["query"].isna().sum() == 0
assert len(num_aol_labeled) == len(num_aol_labeling)

Fill out full labels for PII.

In [655]:
import numpy as np

PII_LABELS = {
    "t": "yes",
    "m": "maybe",
    # Blank cells are read back from the CSV as NaN.
    np.nan: "no",
}

In [656]:
num_aol_labeled["is_pii"] = num_aol_labeled["is_pii"].map(PII_LABELS)

In [657]:
assert num_aol_labeled["is_pii"].isna().sum() == 0

In [9]:
num_aol_labeled

Unnamed: 0,query,is_pii
335,----- forwarded by jackie sloan res clubcorp us on 02 27 2006 03 28 pm -----,no
350,-----------------forwarded message subj fwd some thoughts 4 u date 4 12 2006 1 42 12 p.m. central daylight time from gdvtek to gdvtek -----------------forwarded message subj fwd some thoughts 4 u date 4 12 2006 1 41 30 p.m. central daylight,no
547,...............0000.00000.,no
679,.7,no
680,.71.244.90.56.,yes
...,...,...
188720,chinese diet tea,no
1023870,wood filing cabinets,no
395072,gria,no
164596,cartoon.network,no


Pull in the inspection results from the full DF.

In [661]:
num_aol_labeled = pd.merge(
    aol_num_eval, num_aol_labeled, left_index=True, right_index=True, suffixes=(None, "_")
)

In [662]:
assert len(num_aol_labeled.query("query != query_")) == 0

In [663]:
num_aol_labeled = num_aol_labeled.drop(columns="query_")

In [10]:
num_aol_labeled.head()

Unnamed: 0,query,has_num,presidio,dlp,dlp_max_score,is_pii
335,----- forwarded by jackie sloan res clubcorp us on 02 27 2006 03 28 pm -----,True,"[(PHONE_NUMBER, 02 27 2006 03 28, 0.4)]","[(PHONE_NUMBER, 02 27 2006 03 28, UNLIKELY)]",low,no
350,-----------------forwarded message subj fwd some thoughts 4 u date 4 12 2006 1 42 12 p.m. central daylight time from gdvtek to gdvtek -----------------forwarded message subj fwd some thoughts 4 u date 4 12 2006 1 41 30 p.m. central daylight,True,,"[(PHONE_NUMBER, 12 2006 1 42 12, UNLIKELY), (PHONE_NUMBER, 12 2006 1 41 30, UNLIKELY)]",low,no
547,...............0000.00000.,True,,"[(PHONE_NUMBER, 0000.00000, UNLIKELY)]",low,no
679,.7,True,,,,no
680,.71.244.90.56.,True,"[(IP_ADDRESS, 71.244.90.56, 0.6)]","[(IP_ADDRESS, 71.244.90.56, POSSIBLE)]",med,yes


Persist labeling results.

In [665]:
num_aol_labeled.to_pickle(NUMERIC_LABELED_PKL)

### Detection results accuracy

How do the two inspection libraries compare?

In [25]:
num_aol_labeled["is_pii"] = pd.Categorical(
    num_aol_labeled["is_pii"], categories=["yes", "maybe", "no"], ordered=True
)

def lab_gp(r):
    # pd.notna treats both None and NaN as "no DLP detection".
    if pd.notna(r["dlp_max_score"]):
        return f"num_flagged_{r['dlp_max_score']}"
    if r["has_num"]:
        return "num_unflagged"
    return "nonnum"

num_aol_labeled["grouping"] = pd.Categorical(
    num_aol_labeled.apply(lab_gp, axis="columns"),
    categories=["num_flagged_high", "num_flagged_med", "num_flagged_low", "num_unflagged", "nonnum"],
    ordered=True
)

In [680]:
dlp_presidio_comp = (
    num_aol_labeled
    .groupby(["grouping", "is_pii"], observed=True)
    .agg(
        dlp_flagged=pd.NamedAgg("dlp", lambda s: s.notna().sum()),
        dlp_not_flagged=pd.NamedAgg("dlp", lambda s: s.isna().sum()),
        presidio_flagged=pd.NamedAgg("presidio", lambda s: s.notna().sum()), 
        presidio_not_flagged=pd.NamedAgg("presidio", lambda s: s.isna().sum()), 
    )
)

In [681]:
dlp_presidio_comp

Unnamed: 0_level_0,Unnamed: 1_level_0,dlp_flagged,dlp_not_flagged,presidio_flagged,presidio_not_flagged
grouping,is_pii,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
num_flagged_high,yes,276,0,40,236
num_flagged_high,maybe,13,0,1,12
num_flagged_high,no,211,0,4,207
num_flagged_med,yes,466,0,464,2
num_flagged_med,maybe,12,0,7,5
num_flagged_med,no,19,0,11,8
num_flagged_low,yes,24,0,20,4
num_flagged_low,maybe,125,0,64,61
num_flagged_low,no,354,0,88,266
num_unflagged,maybe,0,12,4,8


In [682]:
(
    num_aol_labeled["is_pii"]
    .value_counts()
    .sort_index()
    .to_frame()
    .assign(prop=lambda x: x["is_pii"] / x["is_pii"].sum())
    .style.format({"prop": "{:.1%}"})
)
    

Unnamed: 0,is_pii,prop
yes,766,30.6%
maybe,165,6.6%
no,1569,62.8%


In [683]:
dlp_presidio_comp.groupby(level=-1).sum()

Unnamed: 0_level_0,dlp_flagged,dlp_not_flagged,presidio_flagged,presidio_not_flagged
is_pii,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
yes,766,0,524,242
maybe,150,15,76,89
no,584,985,106,1463
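
As a rough summary of the table above, precision and recall can be computed for each tool within the labeled sample. In this sketch, treating `yes` as positive and `no` as negative while leaving out the `maybe` rows is our assumption. Because the labeled subset is stratified on DLP's own score levels, these figures describe the sample only and are not estimates for the full query population.

```python
# Rows of the aggregated table above: label -> (flagged, not_flagged).
dlp_counts = {"yes": (766, 0), "maybe": (150, 15), "no": (584, 985)}
presidio_counts = {"yes": (524, 242), "maybe": (76, 89), "no": (106, 1463)}


def precision_recall(counts):
    """Treat 'yes' as positive and 'no' as negative; 'maybe' rows are left out."""
    tp, fn = counts["yes"]
    fp, _tn = counts["no"]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


dlp_p, dlp_r = precision_recall(dlp_counts)
pres_p, pres_r = precision_recall(presidio_counts)

print(f"DLP:      precision={dlp_p:.1%} recall={dlp_r:.1%}")
print(f"Presidio: precision={pres_p:.1%} recall={pres_r:.1%}")
```

Note that the DLP counts include street address detections, for which Presidio has no equivalent entity.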


In [684]:
dlp_presidio_comp.swaplevel().sort_index()

Unnamed: 0_level_0,Unnamed: 1_level_0,dlp_flagged,dlp_not_flagged,presidio_flagged,presidio_not_flagged
is_pii,grouping,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
yes,num_flagged_high,276,0,40,236
yes,num_flagged_med,466,0,464,2
yes,num_flagged_low,24,0,20,4
maybe,num_flagged_high,13,0,1,12
maybe,num_flagged_med,12,0,7,5
maybe,num_flagged_low,125,0,64,61
maybe,num_unflagged,0,12,4,8
maybe,nonnum,0,3,0,3
no,num_flagged_high,211,0,4,207
no,num_flagged_med,19,0,11,8


#### Observations

__Numeric flagged high:__
- Yes PII: mostly street addresses (some with phone numbers) & IP addresses
    * DLP detects pretty much all
    * Presidio doesn't have street address detection and doesn't flag any addresses
        + generally flags phone numbers (\~0.4) and IP addresses (\~0.6)
        + some false positives, flagging ZIP codes or phone numbers as SSN/DL, although with low confidence (0.01-0.05)
- Maybe PII: partial addresses (eg. street only, no city)
    * DLP flags all
    * Presidio flags none (except for one low-confidence false positive)
- Not PII: generally location-related, some URL-like strings
    * eg. city/zip only, institution/city/zip, query/zip, or even general queries containing numbers
    * DLP flags all as LIKELY address
    * Presidio correctly ignores them except for a couple of low-confidence false positives


__Numeric flagged med:__
- Yes PII: IP addresses & phone numbers
    * DLP correctly flags all
    * Presidio generally flags all correctly
        + phone numbers (0.4) and IP addresses (0.6)
- Maybe PII: some long numbers without context: could be financial/ID or else product/record numbers
    * DLP generally flags them as credit card (prob correct) or phone number (prob wrong)
    * Presidio flags a few as phone numbers, some as other ID numbers, the rest unflagged
- Not PII: generally product ID numbers
    * DLP generally flags as phone number
    * Presidio ignores some and flags some as phone/credit card


__Numeric flagged low:__
- Yes PII: IP addresses & phone numbers (non-traditional format), some online order/booking numbers
    * DLP flags almost all as phone numbers (some incorrectly)
    * Presidio generally does a good job, flagging phone numbers and some 
- Maybe PII: various long numeric strings, eg tokens, ID numbers, phone numbers without area codes, 9-digit numbers (could be ZIP codes)
    * DLP flags as phone/credit card/ID
    * Presidio flags in various categories, many with low confidence
- Not PII: general queries which include numbers, eg. dates, model IDs
    * DLP generally flags most as phone number, some as SSN/ID number
    * Presidio flags some as phone number, most as ID numbers with low confidence


__Numeric unflagged:__
- Maybe PII: partial addresses (eg. street only, no city), URL tokens
    * Presidio flags some as low-confidence ID numbers
- Not PII: general queries which include numbers
    * Presidio flags a couple with low confidence


__Non-numeric:__
- Maybe PII: institution name + city/state (eg. hotel)
    * never flagged
- Not PII:
    * never flagged

In [27]:
(
    num_aol_labeled
    .query("grouping == 'num_flagged_high' and is_pii == 'yes'")[:10]
)

Unnamed: 0,query,has_num,presidio,dlp,dlp_max_score,is_pii,grouping
3251,000&1rc l1aaa&cl en&ct na&1si navt&rsres 1&1y us&1ffi &1l &1g &1pl &1v address&1n &1pn &1a 13391 el prado ave&1c garden grove&1s ca&1z 92840-6255&panelbtn 1&2y us&2ffi &2l &2g &2pl &2v &2n &2pn &2a &2c orange&2s ca&2z 92869,True,"[(US_SSN, 92840-6255, 0.05)]","[(PHONE_NUMBER, 92840-6255, UNLIKELY), (STREET_ADDRESS, 13391 el prado ave&1c garden grove&1s ca&1z 92840-6255&panelbtn, LIKELY)]",high,yes,num_flagged_high
3600,1 100 sanjuan ave.redlands ca 92374 92374,True,"[(PHONE_NUMBER, 92374 92374, 0.4)]","[(STREET_ADDRESS, 100 sanjuan ave.redlands ca 92374, LIKELY)]",high,yes,num_flagged_high
4365,1003 tenth street modesto california zip code,True,,"[(STREET_ADDRESS, 1003 tenth street modesto california, LIKELY)]",high,yes,num_flagged_high
4472,1015 yahoola road dahlonega ga,True,,"[(STREET_ADDRESS, 1015 yahoola road dahlonega ga, LIKELY)]",high,yes,num_flagged_high
4566,103 state street marlboro ma,True,,"[(STREET_ADDRESS, 103 state street marlboro ma, LIKELY)]",high,yes,num_flagged_high
4634,104 monroe lane egg harbor twp,True,,"[(STREET_ADDRESS, 104 monroe lane, LIKELY)]",high,yes,num_flagged_high
4689,1040 moore st tribeca ny,True,,"[(STREET_ADDRESS, 1040 moore st tribeca ny, LIKELY)]",high,yes,num_flagged_high
4697,1040 waverly ave holtsville ny 10742,True,,"[(STREET_ADDRESS, 1040 waverly ave holtsville ny 10742, LIKELY)]",high,yes,num_flagged_high
4698,1040 waverly ave hotsville ny 10742,True,,"[(STREET_ADDRESS, 1040 waverly ave hotsville ny 10742, LIKELY)]",high,yes,num_flagged_high
4872,107 briarwood terde soto mo 63020,True,,"[(STREET_ADDRESS, 107 briarwood terde soto mo 63020, LIKELY)]",high,yes,num_flagged_high


### PII types

In [40]:
def scan_results_contain(x, types):
    if not x:
        return False
    if isinstance(types, str):
        types = [types]
    for y in x:
        if y.type in types:
            return True
    return False

In [48]:
num_aol_labeled_types = num_aol_labeled.assign(
    address_dlp=num_aol_labeled["dlp"].map(lambda x: scan_results_contain(x, "STREET_ADDRESS")),
    ip_dlp=num_aol_labeled["dlp"].map(lambda x: scan_results_contain(x, "IP_ADDRESS")),
    phone_dlp=num_aol_labeled["dlp"].map(lambda x: scan_results_contain(x, "PHONE_NUMBER")),
    other_dlp=lambda x: ~x["address_dlp"] & ~x["ip_dlp"] & ~x["phone_dlp"] & x["dlp"].notna(),
    ip_presidio=num_aol_labeled["presidio"].map(lambda x: scan_results_contain(x, "IP_ADDRESS")),
    phone_presidio=num_aol_labeled["presidio"].map(lambda x: scan_results_contain(x, "PHONE_NUMBER")),
    other_presidio=lambda x: ~x["ip_presidio"] & ~x["phone_presidio"] & x["presidio"].notna(),
)

Addresses:

In [50]:
(
    num_aol_labeled_types
    .query("address_dlp")["is_pii"]
    .value_counts()
    .sort_index()
    .to_frame()
    .assign(prop=lambda x: x["is_pii"] / x["is_pii"].sum())
    .style.format({"prop": "{:.1%}"})
)
    

Unnamed: 0,is_pii,prop
yes,251,53.0%
maybe,13,2.7%
no,210,44.3%


IP addresses:

In [69]:
pivot_detection_counts(
(
    num_aol_labeled_types
    .groupby([
        num_aol_labeled_types["ip_dlp"].astype(str),
        num_aol_labeled_types["ip_presidio"].astype(str),
        "is_pii",
    ])
    .size()
    .reset_index(name="count")
)
    ,
    index=["ip_dlp", "ip_presidio"],
    columns="is_pii"
)
    

Unnamed: 0_level_0,Unnamed: 1_level_0,count,count,count,count
Unnamed: 0_level_1,is_pii,yes,maybe,no,total
ip_dlp,ip_presidio,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
False,False,556,165,1569,2290
False,True,0,0,0,0
True,False,0,0,0,0
True,True,210,0,0,210
total,,766,165,1569,2500


In [75]:
(
    num_aol_labeled_types
    .query("is_pii == 'yes'")
    .groupby(["address_dlp", "ip_dlp", "phone_dlp", "other_dlp", "ip_presidio", "phone_presidio", "other_presidio"])
    .size()
    .to_frame(name="count")
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,count
address_dlp,ip_dlp,phone_dlp,other_dlp,ip_presidio,phone_presidio,other_presidio,Unnamed: 7_level_1
False,False,False,True,False,False,True,2
False,False,True,False,False,False,False,6
False,False,True,False,False,False,True,1
False,False,True,False,False,True,False,296
False,True,False,False,True,False,False,137
False,True,False,False,True,True,False,72
False,True,True,False,True,False,False,1
True,False,False,False,False,False,False,233
True,False,False,False,False,True,False,2
True,False,True,False,False,False,False,3


In [70]:
num_aol_labeled_types

Unnamed: 0,query,has_num,presidio,dlp,dlp_max_score,is_pii,grouping,address_dlp,ip_dlp,phone_dlp,other_dlp,ip_presidio,phone_presidio,other_presidio
335,----- forwarded by jackie sloan res clubcorp us on 02 27 2006 03 28 pm -----,True,"[(PHONE_NUMBER, 02 27 2006 03 28, 0.4)]","[(PHONE_NUMBER, 02 27 2006 03 28, UNLIKELY)]",low,no,num_flagged_low,False,False,True,False,False,True,False
350,-----------------forwarded message subj fwd some thoughts 4 u date 4 12 2006 1 42 12 p.m. central daylight time from gdvtek to gdvtek -----------------forwarded message subj fwd some thoughts 4 u date 4 12 2006 1 41 30 p.m. central daylight,True,,"[(PHONE_NUMBER, 12 2006 1 42 12, UNLIKELY), (PHONE_NUMBER, 12 2006 1 41 30, UNLIKELY)]",low,no,num_flagged_low,False,False,True,False,False,False,False
547,...............0000.00000.,True,,"[(PHONE_NUMBER, 0000.00000, UNLIKELY)]",low,no,num_flagged_low,False,False,True,False,False,False,False
679,.7,True,,,,no,num_unflagged,False,False,False,False,False,False,False
680,.71.244.90.56.,True,"[(IP_ADDRESS, 71.244.90.56, 0.6)]","[(IP_ADDRESS, 71.244.90.56, POSSIBLE)]",med,yes,num_flagged_med,False,True,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188720,chinese diet tea,False,,,,no,nonnum,False,False,False,False,False,False,False
1023870,wood filing cabinets,False,,,,no,nonnum,False,False,False,False,False,False,False
395072,gria,False,,,,no,nonnum,False,False,False,False,False,False,False
164596,cartoon.network,False,,,,no,nonnum,False,False,False,False,False,False,False
