## Sampling PDPC Decision summaries using spaCy

The PDPC provides a summary for every decision published. 
While the source decision is very useful in terms of detail and depth, 
interesting and important information is available in the summaries with respect to each decision.

This notebook samples summaries from the zeeker database and runs them through the standard 
spaCy English model with displayCy. 

The objective is to allow a user to see relationships between words that can be used 
to create new matcher rules to obtain information from the summaries using natural language processing.

In [1]:
# We set up the environment

import pandas as pd
import spacy
from pymongo import MongoClient
from spacy import displacy

In [2]:
# Zeeker database access and setup

connection_string = "mongodb+srv://zeeker:{pw}@zeeker-a2-otlvr.gcp.mongodb.net/test?retryWrites=true&w" \
                        "=majority".format(pw='1qRJdWU5A1ro4X2w')
db_client = MongoClient(connection_string)
data_collection = db_client['pdpc-decisions']['decisions']
corpus_collection = db_client['pdpc-decisions']['corpus']

In [3]:
# Construct our sampled summaries

unique_ids = []
text = []
summaries = [result['summary'] for result in data_collection.find({}, {'summary': 1})]

#for summary in summaries:
#    text.append(summary['summary'])
#    unique_ids.append(summary['_id'])

summaries = pd.Series(summaries).sample(n=7)
print(summaries)

74     Breach of Protection Obligation by Dimsum Prop...
85     Breach of Openness and Protection Obligations ...
2      Breach of the Protection Obligation by NTUC In...
96     Breach of Protection and Retention Obligations...
67     Breach of Protection Obligation by SLF Green M...
133    Breach of Protection Obligation by Challenger ...
8      Breach of the Protection Obligation by Creativ...
dtype: object


In [4]:
nlp = spacy.load('en_core_web_sm')

summaries_docs = [nlp(summary) for summary in summaries]

for doc in summaries_docs:
    print(doc.text)
    doc_sentences = list(doc.sents)
    displacy.render(doc_sentences, style='dep')
    



Breach of Openness and Protection Obligations by Habitat for Humanity Singapore. Directions were issued to Habitat for Humanity Singapore for breaches of the PDPA.  The organisation did not make reasonable security arrangements to prevent unauthorised disclosure of its volunteers’ personal data, failed to put in place data protection policies, and omitted to communicate data protection policies and practices to its staff.




Breach of Protection and Retention Obligations by Social Metric. A financial penalty of $18,000 and directions were issued to Social Metric for leaving the personal data exposed to the world wide web via unprotected URL links; and failure to remove personal data of its clients’ customers from its website when they no longer served a legal or business purpose.


Breach of Protection Obligation by SLF Green Maid Agency. Directions were issued to SLF Green Maid Agency for failing to make reasonable security arrangements to prevent the unauthorised disclosure of individuals’ personal data.




Breach of the Protection Obligation by Creative. A financial penalty of $15,000 was imposed on Creative for failing to put in place reasonable security arrangements to prevent the unauthorised disclosure of personal data of users of its online support forum.


# Rules for detecting financial penalties, warnings and directions

In [5]:
from spacy.matcher import Matcher



matcher = Matcher(nlp.vocab)

financial_penalty_pattern = [{'LOWER': 'financial'},
                              {'LOWER': 'penalty'},
                              {'POS': 'ADP'},
                              {'LOWER': '$'},
                              {'LIKE_NUM': True}]

def on_match_fp1(matcher, doc, id, matches):
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[end-1:end]
        print("Matched! ", string_id, span.text)

warning = [
    {'LOWER': 'warning'},
    {'POS': 'AUX'},
    {'LOWER': 'issued'}    
]

def on_match_warn(matcher, doc, id, matches):
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        print("Matched! ", string_id)

directions1 = [
    {'LOWER': 'directions'},
    {'POS': 'AUX'},
    {'LOWER': 'issued'}    
]

def on_match_direction1(matcher, doc, id, matches):
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        print("Matched! ", string_id)

matcher.add("Financial", on_match_fp1, financial_penalty_pattern)
matcher.add("Warning", on_match_warn, warning)
matcher.add("Directions #1", on_match_direction1, directions1)

for doc in summaries_docs:
    print("doc: ", doc.text)
    matcher(doc)
    print("End")

End
doc:  Breach of Openness and Protection Obligations by Habitat for Humanity Singapore. Directions were issued to Habitat for Humanity Singapore for breaches of the PDPA.  The organisation did not make reasonable security arrangements to prevent unauthorised disclosure of its volunteers’ personal data, failed to put in place data protection policies, and omitted to communicate data protection policies and practices to its staff.
Matched!  Directions #1
End
End
doc:  Breach of Protection and Retention Obligations by Social Metric. A financial penalty of $18,000 and directions were issued to Social Metric for leaving the personal data exposed to the world wide web via unprotected URL links; and failure to remove personal data of its clients’ customers from its website when they no longer served a legal or business purpose.
Matched!  Financial 18,000
Matched!  Directions #1 issued
Matched!  Financial
Matched!  Directions #1
End
doc:  Breach of Protection Obligation by SLF Green Maid Ag