## Sampling PDPC Decision summaries using spaCy

The PDPC publishes a summary of every decision it releases.
The full decision offers more detail and depth,
but each summary already carries interesting and important information about its decision.

This notebook samples summaries from the zeeker database and runs them through the standard
spaCy English model, visualising the parse with displaCy.

The objective is to let a user see the relationships between words, which can then be used
to write new matcher rules that extract information from the summaries using natural language processing.

In [42]:
# We set up the environment

import pandas as pd
import spacy
from pymongo import MongoClient
from spacy import displacy

In [43]:
# Zeeker database access and setup

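import os

# The password is read from the environment rather than hardcoded in the notebook.
# ZEEKER_DB_PASSWORD is an assumed variable name; set it before running this cell.
connection_string = (
    "mongodb+srv://zeeker:{pw}@zeeker-a2-otlvr.gcp.mongodb.net/test"
    "?retryWrites=true&w=majority"
).format(pw=os.environ["ZEEKER_DB_PASSWORD"])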
db_client = MongoClient(connection_string)
data_collection = db_client['pdpc-decisions']['decisions']
corpus_collection = db_client['pdpc-decisions']['corpus']

In [44]:
# Construct our sampled summaries

summaries = [result['summary'] for result in data_collection.find({}, {'summary': 1})]

summaries = pd.Series(summaries).sample(n=7)
print(summaries)

135    Breach of the Protection Obligation by ERGO In...
74     Breach of Protection Obligation by Aviva. A fi...
104    Breach of Protection Obligation by Tech Mahind...
83     No Breach of Consent and Notification Obligati...
92     Breach of Protection Obligation by DataPost. A...
111    Breach of Openness and Protection Obligations ...
121    Breach of the Protection Obligation by L'Oreal...
dtype: object
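
The sample above changes on every run, which makes it hard to revisit a particular summary. A minimal sketch of a repeatable draw, assuming a fixed `random_state` is acceptable; `example_summaries` is a hypothetical stand-in for the database results:

```python
import pandas as pd

# Hypothetical stand-in for the summaries fetched from the zeeker database.
example_summaries = pd.Series([
    "Breach of Protection Obligation by Aviva.",
    "Breach of Protection Obligation by DataPost.",
    "No Breach of Consent and Notification Obligations by Comfort Transportation and CityCab.",
    "Breach of Protection Obligation by Tech Mahindra (Singapore).",
])

# Passing the same random_state selects the same rows on every call.
first_draw = example_summaries.sample(n=2, random_state=0)
second_draw = example_summaries.sample(n=2, random_state=0)
print(first_draw.equals(second_draw))
```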


In [45]:
nlp = spacy.load('en_core_web_sm')

summaries_docs = [nlp(summary) for summary in summaries]

for doc in summaries_docs:
    print(doc.text)
    doc_sentences = list(doc.sents)
    displacy.render(doc_sentences, style='dep')
    

Breach of Protection Obligation by Aviva. A financial penalty of $6,000 was imposed on Aviva for failing to make reasonable security arrangements to prevent the unauthorised disclosure of the personal data of its insurance policyholder and his dependent.
Breach of Protection Obligation by Tech Mahindra (Singapore). A financial penalty of $10,000 was imposed on Tech Mahindra (Singapore), as a data intermediary, for failing to make reasonable security arrangement to prevent unauthorised access and unauthorised modification of Singapore Telecommunications Limited customers' personal data via MyBill (mybill.singtel.com) and MyAccount (myaccount.singtel.com) webpages.
No Breach of Consent and Notification Obligations by Comfort Transportation and CityCab. Two complaints were made against Comfort Transportation and CityCab for disclosing taxi drivers’ mobile phone numbers as part of the taxi booking process. Both organisations were not found to be in breach of the consent or notification obl

## Rules for detecting financial penalties, warnings and directions
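
A rule here is a spaCy `Matcher` pattern: a list of dictionaries, each describing one token. As a minimal sketch of how such a rule can return data rather than print it, the example below assumes a blank English pipeline (tokenizer only, so the preposition is matched literally instead of by `POS`) and a hypothetical helper, `extract_penalties`:

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: no model download needed, but no POS tags either.
nlp_blank = spacy.blank("en")
sketch_matcher = Matcher(nlp_blank.vocab)

# Assumption: the preposition is matched as the literal word "of" because a
# blank pipeline cannot supply the POS attribute.
penalty_pattern = [
    {"LOWER": "financial"},
    {"LOWER": "penalty"},
    {"LOWER": "of"},
    {"ORTH": "$"},
    {"LIKE_NUM": True},
]
sketch_matcher.add("Financial", [penalty_pattern])


def extract_penalties(text):
    """Return the penalty amounts found in text as integers (hypothetical helper)."""
    doc = nlp_blank(text)
    amounts = []
    for match_id, start, end in sketch_matcher(doc):
        # The last token of the match is the number, e.g. "6,000".
        digits = doc[end - 1].text.replace(",", "")
        if digits.isdigit():  # skip number words such as "six"
            amounts.append(int(digits))
    return amounts


print(extract_penalties("A financial penalty of $6,000 was imposed on Aviva."))
```

Returning a list keeps the amounts available for later steps, such as adding them to a DataFrame column.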

In [46]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

financial_penalty_pattern = [{'LOWER': 'financial'},
                              {'LOWER': 'penalty'},
                              {'POS': 'ADP'},
                              {'LOWER': '$'},
                              {'LIKE_NUM': True}]

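def on_match_fp1(matcher, doc, i, matches):
    # spaCy calls this once per match; i is the index of the match that triggered the call
    match_id, start, end = matches[i]
    string_id = nlp.vocab.strings[match_id]
    amount = doc[end - 1]  # the final token of the pattern is the amount
    print("Matched! ", string_id, amount.text)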

warning = [
    {'LOWER': 'warning'},
    {'POS': 'AUX'},
    {'LOWER': 'issued'}    
]

def on_match_warn(matcher, doc, i, matches):
    # Report only the match that triggered this call, not every match in the doc
    match_id, start, end = matches[i]
    string_id = nlp.vocab.strings[match_id]
    print("Matched! ", string_id)

directions1 = [
    {'LOWER': 'directions'},
    {'POS': 'AUX'},
    {'LOWER': 'issued'}    
]

def on_match_direction1(matcher, doc, i, matches):
    # Report only the match that triggered this call, not every match in the doc
    match_id, start, end = matches[i]
    string_id = nlp.vocab.strings[match_id]
    print("Matched! ", string_id)

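# Patterns are passed as a list and the callback as a keyword argument,
# the form accepted by spaCy 2.2.2 and later, including v3.
matcher.add("Financial", [financial_penalty_pattern], on_match=on_match_fp1)
matcher.add("Warning", [warning], on_match=on_match_warn)
matcher.add("Directions #1", [directions1], on_match=on_match_direction1)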

for doc in summaries_docs:
    print("doc: ", doc.text)
    matcher(doc)
    print("End")

End
doc:  Breach of Protection Obligation by Aviva. A financial penalty of $6,000 was imposed on Aviva for failing to make reasonable security arrangements to prevent the unauthorised disclosure of the personal data of its insurance policyholder and his dependent.
Matched!  Financial 6,000
End
doc:  Breach of Protection Obligation by Tech Mahindra (Singapore). A financial penalty of $10,000 was imposed on Tech Mahindra (Singapore), as a data intermediary, for failing to make reasonable security arrangement to prevent unauthorised access and unauthorised modification of Singapore Telecommunications Limited customers' personal data via MyBill (mybill.singtel.com) and MyAccount (myaccount.singtel.com) webpages.
Matched!  Financial 10,000
End
doc:  No Breach of Consent and Notification Obligations by Comfort Transportation and CityCab. Two complaints were made against Comfort Transportation and CityCab for disclosing taxi drivers’ mobile phone numbers as part of the taxi booking process. B