## Sampling PDPC Decision summaries using spaCy

The PDPC publishes a summary alongside every decision. 
The full decision offers far more detail and depth, 
but the summary already captures the interesting and important information about each case.

This notebook samples summaries from the zeeker database, runs them through the standard 
spaCy English model, and visualises the parses with displaCy. 

The objective is to let a user see the relationships between words in a summary. 
Those relationships can then be used to write new matcher rules that extract information from the summaries using natural language processing.

In [1]:
# We set up the environment

import pandas as pd
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')


In [2]:
# Zeeker database access and setup

%run ../set_up_zeekerDB.ipynb

### The following variables are now available:
* `db_client` -- access the MongoDB Atlas ZeekerDB (Read only) 
* `data_collection` -- collection of documents containing meta information of a decision 
* `corpus_collection` -- collection of documents containing the text of a decision

In [9]:
# Construct our sampled summaries

samples = data_collection.aggregate(
   [ { "$sample": { "size": 7 } } ]
)

summaries_docs = [nlp(sample['summary']) for sample in samples]

summaries_docs

[Breach of the Protection Obligation by Ninja Logistics. Directions, including a financial penalty of $90,000, were imposed on Ninja Logistics for failing to put in place reasonable security arrangements to protect customers’ data in relation to the Tracking Function Page on the Ninja Logistics website.  This resulted in customers’ data on the website to be accessible by the public. ,
 Breach of the Protection Obligation by Creative. A financial penalty of $15,000 was imposed on Creative for failing to put in place reasonable security arrangements to prevent the unauthorised disclosure of personal data of users of its online support forum.,
 Breach of Consent Obligation by a Registered Salesperson. A financial penalty of $500 was imposed on a registered salesperson of a property firm for disclosing personal data of two of his landlord’s tenants to a third party tenant without consent.,
 Breach of Protection Obligation by DataPost. A financial penalty of $3,000 was imposed on DataPost, 
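Because `$sample` draws a random set on every run, re-executing the cell above will generally return different summaries from the ones shown. For experimenting offline with repeatable draws, a seeded local sample can stand in for the database call. This is a minimal sketch under assumptions: `sample_summaries` is a hypothetical helper, and `records` is assumed to be a list of dictionaries with a `summary` key, shortened here from the output above.

```python
import random

def sample_summaries(records, size, seed=None):
    """Return up to `size` summaries drawn at random without replacement."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    size = min(size, len(records))  # random.sample raises ValueError if size > len(records)
    return [record["summary"] for record in rng.sample(records, size)]

# Illustrative records only; the text is shortened from the sampled output.
records = [
    {"summary": "Breach of the Protection Obligation by Ninja Logistics."},
    {"summary": "Breach of the Protection Obligation by Creative."},
    {"summary": "Breach of Consent Obligation by a Registered Salesperson."},
]

print(sample_summaries(records, size=2, seed=42))
```

Passing the same `seed` returns the same selection, which makes it easier to compare rule changes against a fixed set of summaries.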

In [10]:
doc = summaries_docs[0]

print(doc.text)
doc_sentences = list(doc.sents)
displacy.render(doc_sentences, style='dep')
    

Breach of the Protection Obligation by Ninja Logistics. Directions, including a financial penalty of $90,000, were imposed on Ninja Logistics for failing to put in place reasonable security arrangements to protect customers’ data in relation to the Tracking Function Page on the Ninja Logistics website.  This resulted in customers’ data on the website to be accessible by the public. 


In [11]:
doc = summaries_docs[1]

print(doc.text)
doc_sentences = list(doc.sents)
displacy.render(doc_sentences, style='dep')

Breach of the Protection Obligation by Creative. A financial penalty of $15,000 was imposed on Creative for failing to put in place reasonable security arrangements to prevent the unauthorised disclosure of personal data of users of its online support forum.


In [12]:
doc = summaries_docs[2]

print(doc.text)
doc_sentences = list(doc.sents)
displacy.render(doc_sentences, style='dep')

Breach of Consent Obligation by a Registered Salesperson. A financial penalty of $500 was imposed on a registered salesperson of a property firm for disclosing personal data of two of his landlord’s tenants to a third party tenant without consent.


In [13]:
doc = summaries_docs[3]

print(doc.text)
doc_sentences = list(doc.sents)
displacy.render(doc_sentences, style='dep')



In [14]:
doc = summaries_docs[4]

print(doc.text)
doc_sentences = list(doc.sents)
displacy.render(doc_sentences, style='dep')

Breach of Protection Obligation by DataPost. A financial penalty of $3,000 was imposed on DataPost, as a data intermediary, for failing to make reasonable security arrangements to prevent the unauthorised disclosure of the personal data of two customers of a bank. DataPost was also directed to review its working procedures relating to data printing and enveloping operations, improve the training of its staff, and review its personal data protection policy.


In [15]:
doc = summaries_docs[5]

print(doc.text)
doc_sentences = list(doc.sents)
displacy.render(doc_sentences, style='dep')

Discontinued Investigations Against My Digital Lock. The investigation on alleged disclosure of personal data by My Digital Lock has been discontinued. An advisory notice has been issued to My Digital Lock. The reasons for discontinuation are explained in the grounds of decision.


In [16]:
doc = summaries_docs[6]

print(doc.text)
doc_sentences = list(doc.sents)
displacy.render(doc_sentences, style='dep')

No Breach of the Access Obligation by MCST 4436 (River Isles). MCST 4436 (River Isles) was not found to be in breach of the PDPA in relation to allowing subsidiary proprietors to view CCTV footage.


## Rules for detecting financial penalties, warnings and directions

In [17]:
# Financial Penalty Pattern

financial_penalty_pattern = [{'LOWER': 'financial'},
                              {'LOWER': 'penalty'},
                              {'POS': 'ADP'},
                              {'LOWER': '$'},
                              {'LIKE_NUM': True}]

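def on_match_fp1(matcher, doc, i, matches):
    # spaCy calls this once per match; i is the index of the current match
    match_id, start, end = matches[i]
    string_id = nlp.vocab.strings[match_id]
    span = doc[end-1:end]  # the last token of the match is the amount
    print("Matched! ", string_id, span.text)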
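
A token pattern like this can be cross-checked against a plain regular expression, which is handy when spaCy or the model is not at hand. The sketch below is a regex stand-in, not the spaCy `Matcher`: it ignores part-of-speech tags and simply accepts any single word between `penalty` and the dollar sign. `PENALTY_RE` and `find_penalties` are hypothetical names.

```python
import re

# "financial penalty", one word (the preposition), then "$" and an amount
PENALTY_RE = re.compile(
    r"financial\s+penalty\s+\w+\s+\$(\d+(?:,\d{3})*(?:\.\d+)?)",
    re.IGNORECASE,
)

def find_penalties(text):
    """Return the amount string of every financial penalty mentioned in `text`."""
    return [match.group(1) for match in PENALTY_RE.finditer(text)]

summary = (
    "Directions, including a financial penalty of $90,000, were imposed on "
    "Ninja Logistics for failing to put in place reasonable security arrangements."
)
print(find_penalties(summary))  # → ['90,000']
```

If the regex and the matcher disagree on a summary, that summary is a good candidate for inspecting in displaCy.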


In [18]:
# Warning pattern

warning = [
    {'LOWER': 'warning'},
    {'POS': 'AUX'},
    {'LOWER': 'issued'}    
]

def on_match_warn(matcher, doc, i, matches):
    # spaCy calls this once per match; i is the index of the current match
    match_id, start, end = matches[i]
    string_id = nlp.vocab.strings[match_id]
    print("Matched! ", string_id)

In [19]:
# Directions pattern

directions1 = [
    {'LOWER': 'directions'},
    {'POS': 'AUX'},
    {'LOWER': 'issued'}    
]

def on_match_direction1(matcher, doc, i, matches):
    # spaCy calls this once per match; i is the index of the current match
    match_id, start, end = matches[i]
    string_id = nlp.vocab.strings[match_id]
    print("Matched! ", string_id)

In [20]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
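# Each rule is a list of patterns, with its callback passed as on_match.
# spaCy v3 expects this form; it is also accepted from spaCy v2.2.2 onwards.
matcher.add("Financial", [financial_penalty_pattern], on_match=on_match_fp1)
matcher.add("Warning", [warning], on_match=on_match_warn)
matcher.add("Directions #1", [directions1], on_match=on_match_direction1)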

In [21]:
doc = summaries_docs[0]

print("doc: ", doc.text)
matcher(doc);

doc:  Breach of the Protection Obligation by Ninja Logistics. Directions, including a financial penalty of $90,000, were imposed on Ninja Logistics for failing to put in place reasonable security arrangements to protect customers’ data in relation to the Tracking Function Page on the Ninja Logistics website.  This resulted in customers’ data on the website to be accessible by the public. 
Matched!  Financial 90,000


In [22]:
doc = summaries_docs[1]

print("doc: ", doc.text)
matcher(doc);

doc:  Breach of the Protection Obligation by Creative. A financial penalty of $15,000 was imposed on Creative for failing to put in place reasonable security arrangements to prevent the unauthorised disclosure of personal data of users of its online support forum.
Matched!  Financial 15,000


In [23]:
doc = summaries_docs[2]

print("doc: ", doc.text)
matcher(doc);

doc:  Breach of Consent Obligation by a Registered Salesperson. A financial penalty of $500 was imposed on a registered salesperson of a property firm for disclosing personal data of two of his landlord’s tenants to a third party tenant without consent.
Matched!  Financial 500


In [24]:
doc = summaries_docs[3]

print("doc: ", doc.text)
matcher(doc);



In [25]:
doc = summaries_docs[4]

print("doc: ", doc.text)
matcher(doc);

doc:  Breach of Protection Obligation by DataPost. A financial penalty of $3,000 was imposed on DataPost, as a data intermediary, for failing to make reasonable security arrangements to prevent the unauthorised disclosure of the personal data of two customers of a bank. DataPost was also directed to review its working procedures relating to data printing and enveloping operations, improve the training of its staff, and review its personal data protection policy.
Matched!  Financial 3,000


In [26]:
doc = summaries_docs[5]

print("doc: ", doc.text)
matcher(doc);

doc:  Discontinued Investigations Against My Digital Lock. The investigation on alleged disclosure of personal data by My Digital Lock has been discontinued. An advisory notice has been issued to My Digital Lock. The reasons for discontinuation are explained in the grounds of decision.


In [27]:
doc = summaries_docs[6]

print("doc: ", doc.text)
matcher(doc);

doc:  No Breach of the Access Obligation by MCST 4436 (River Isles). MCST 4436 (River Isles) was not found to be in breach of the PDPA in relation to allowing subsidiary proprietors to view CCTV footage.
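
The callback prints each amount as text, such as `90,000` or `500`. To total or compare penalties, the text has to become a number. A minimal sketch, assuming amounts are whole dollars written with comma thousands separators; `parse_amount` is a hypothetical helper, and the list simply repeats the four amounts matched above.

```python
def parse_amount(text):
    """Convert a matched amount such as '90,000' or '$500' to an int, or None."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    return int(cleaned) if cleaned.isdecimal() else None

matched = ["90,000", "15,000", "500", "3,000"]  # amounts printed by the matcher runs above
amounts = [parse_amount(text) for text in matched]
print(amounts, sum(amounts))  # → [90000, 15000, 500, 3000] 108500
```

Returning `None` for text that is not a plain number keeps unexpected matches visible instead of raising partway through a batch.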
