## Sampling PDPC Decision summaries using spaCy

The PDPC publishes a summary alongside every decision.
While the full decision offers more detail and depth,
the summary captures the important information about each case in a compact form.

This notebook samples summaries from the zeeker database and runs them through the standard
spaCy English model, visualising the results with displaCy.

The objective is to let a user see the relationships between words, which can then be used
to write new matcher rules that extract information from the summaries using natural language processing.

In [1]:
# We set up the environment

import pandas as pd
import spacy
from pymongo import MongoClient
from spacy import displacy

In [2]:
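# Zeeker database access and setup

import os

# Read the database password from the environment instead of hardcoding it in the notebook.
# ZEEKER_DB_PASSWORD is an assumed variable name; set it before running this cell.
connection_string = (
    "mongodb+srv://zeeker:{pw}@zeeker-a2-otlvr.gcp.mongodb.net/test"
    "?retryWrites=true&w=majority"
).format(pw=os.environ["ZEEKER_DB_PASSWORD"])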
db_client = MongoClient(connection_string)
data_collection = db_client['pdpc-decisions']['decisions']
corpus_collection = db_client['pdpc-decisions']['corpus']

In [3]:
# Construct our sampled summaries

# Fetch only the summary field of every decision
summaries = [result['summary'] for result in data_collection.find({}, {'summary': 1})]

# Draw a random sample of seven summaries (a different sample on each run)
summaries = pd.Series(summaries).sample(n=7)
print(summaries)

135    Breach of the Protection Obligation by ERGO In...
119    Breach of Protection Obligation by GMM Technow...
64     Breach of Notification and Consent Obligations...
46     Breach of the Accountability Obligation by Sat...
63     Breach of the Protection Obligation by Bud Cos...
123    Breach of the Protection and Accountability Ob...
100    No Breach of Consent Obligation by MyRepublic....
dtype: object
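
Because `sample` draws a different subset on every run, the summaries shown in this notebook will change each time it is executed. If a repeatable sample is wanted, one option is to pass a fixed `random_state`. The sketch below uses stand-in strings rather than the real summaries, and the seed value 42 is an arbitrary assumption.

```python
import pandas as pd

# Stand-in data; in the notebook this would be the list of summaries from the database.
example_summaries = [f"Summary {i}" for i in range(20)]

# A fixed random_state makes the draw repeatable across runs.
first_draw = pd.Series(example_summaries).sample(n=7, random_state=42)
second_draw = pd.Series(example_summaries).sample(n=7, random_state=42)

# Both draws select the same rows in the same order.
print(first_draw.equals(second_draw))
```

Leaving `random_state` unset, as the cell above does, is still the better choice when the goal is to browse a fresh set of summaries on each run.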


In [4]:
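# Load the small English model and parse each sampled summary
nlp = spacy.load('en_core_web_sm')

summaries_docs = [nlp(summary) for summary in summaries]

# Print each summary, then render the dependency parse of each of its sentences
for doc in summaries_docs:
    print(doc.text)
    doc_sentences = list(doc.sents)
    displacy.render(doc_sentences, style='dep')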



Breach of Protection Obligation by GMM Technoworld. A financial penalty of $3,000 was imposed on GMM Technoworld for failing to implement proper and adequate security measures on its official website, resulting in an unauthorised public disclosure of approximately 190 of its customers' personal data.


Breach of Notification and Consent Obligations by Aventis. A financial penalty of $12,500 was imposed on Aventis for using the personal data of individuals beyond the notified purposes, and for failure to give effect to the withdrawal of consent within a reasonable time.


Breach of the Accountability Obligation by Saturday Club. Saturday Club was found in breach of the PDPA for failing to put in place written policies and practices necessary to ensure its compliance with the PDPA. Saturday Club was directed to put in place a data protection policy to comply with the provisions of the PDPA and to conduct training to ensure its employees are aware of and comply with the requirements of the PDPA.


Breach of the Protection Obligation by Bud Cosmetics. A financial penalty of $11,000 was imposed on Bud Cosmetics for failing to put in place reasonable security arrangements to protect the personal data of its customers from unauthorised disclosure. Directions were also issued to the organisation to engage qualified personnel to conduct a security audit, develop an IT security policy, and implement a training policy.


Breach of the Protection and Accountability Obligations by Society of Tourist Guides. Directions, including a financial penalty of $20,000, were imposed on Society of Tourist Guides for breaches of the PDPA. First, the organisation failed to put in place reasonable measures to protect its members’ personal data. Second, it did not appoint a data protection officer. Lastly, it did not have written policies and practices necessary to ensure its compliance with the PDPA.


No Breach of Consent Obligation by MyRepublic. MyRepublic was found not in breach of the consent obligation with respect to the use of an individual’s personal data for debt recovery purposes.


# Rules for detecting financial penalties, warnings and directions
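
Before the full rule set in the next cell, here is a minimal sketch of how a single token pattern can pull a penalty amount out of a summary and turn it into a number. It is not the notebook's own rule: it assumes a blank English pipeline (`spacy.blank("en")`), so it matches the literal word "of" instead of a part-of-speech tag, and the names `nlp_blank`, `matcher_sketch` and `extract_penalties` are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline: tokenizer and lexical attributes only, no tagger or parser.
nlp_blank = spacy.blank("en")
matcher_sketch = Matcher(nlp_blank.vocab)

# "financial penalty of $<number>". The literal "of" stands in for a POS-based
# token because a blank pipeline assigns no part-of-speech tags.
pattern = [
    {"LOWER": "financial"},
    {"LOWER": "penalty"},
    {"LOWER": "of"},
    {"ORTH": "$"},
    {"LIKE_NUM": True},
]
matcher_sketch.add("Financial", [pattern])


def extract_penalties(text):
    """Return the penalty amounts found in text as integers."""
    doc = nlp_blank(text)
    amounts = []
    for _match_id, _start, end in matcher_sketch(doc):
        # The last token of the match is the amount, e.g. "3,000".
        digits = doc[end - 1].text.replace(",", "")
        # LIKE_NUM also accepts number words such as "three"; keep digits only.
        if digits.isdigit():
            amounts.append(int(digits))
    return amounts


sample = "A financial penalty of $3,000 was imposed on GMM Technoworld."
print(extract_penalties(sample))
```

Returning integers rather than the matched text makes the amounts easier to sum or compare across decisions.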

In [5]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# "financial penalty" + preposition + "$" + amount, e.g. "financial penalty of $3,000"
financial_penalty_pattern = [{'LOWER': 'financial'},
                              {'LOWER': 'penalty'},
                              {'POS': 'ADP'},
                              {'LOWER': '$'},
                              {'LIKE_NUM': True}]

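def on_match_fp1(matcher, doc, i, matches):
    # spaCy calls this once per match; i is the index of the current match in matches
    match_id, start, end = matches[i]
    string_id = nlp.vocab.strings[match_id]
    span = doc[end-1:end]  # the last matched token, i.e. the amount
    print("Matched! ", string_id, span.text)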

warning = [
    {'LOWER': 'warning'},
    {'POS': 'AUX'},
    {'LOWER': 'issued'}    
]

def on_match_warn(matcher, doc, i, matches):
    # Called once per match; report only the current match
    match_id, start, end = matches[i]
    string_id = nlp.vocab.strings[match_id]
    print("Matched! ", string_id)

directions1 = [
    {'LOWER': 'directions'},
    {'POS': 'AUX'},
    {'LOWER': 'issued'}    
]

def on_match_direction1(matcher, doc, i, matches):
    # Called once per match; report only the current match
    match_id, start, end = matches[i]
    string_id = nlp.vocab.strings[match_id]
    print("Matched! ", string_id)

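# Patterns go in a list and the callback is a keyword argument
# (accepted from spaCy v2.2.2, required in v3)
matcher.add("Financial", [financial_penalty_pattern], on_match=on_match_fp1)
matcher.add("Warning", [warning], on_match=on_match_warn)
matcher.add("Directions #1", [directions1], on_match=on_match_direction1)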

for doc in summaries_docs:
    print("doc: ", doc.text)
    matcher(doc)
    print("End")

End
doc:  Breach of Protection Obligation by GMM Technoworld. A financial penalty of $3,000 was imposed on GMM Technoworld for failing to implement proper and adequate security measures on its official website, resulting in an unauthorised public disclosure of approximately 190 of its customers' personal data.
Matched!  Financial 3,000
End
doc:  Breach of Notification and Consent Obligations by Aventis. A financial penalty of $12,500 was imposed on Aventis for using the personal data of individuals beyond the notified purposes, and for failure to give effect to the withdrawal of consent within a reasonable time.
Matched!  Financial 12,500
End
doc:  Breach of the Accountability Obligation by Saturday Club. Saturday Club was found in breach of the PDPA for failing to put in place written policies and practices necessary to ensure its compliance with the PDPA. Saturday Club was directed to put in place a data protection policy to comply with the provisions of the PDPA and to conduct train