# Challenge - GDPR Compliant

![](http://eleanorglanvillecentre.lincoln.ac.uk/assets/images/content/_large/adalovelacehero.jpg)

The `ada_lovelace.txt` file, located in the same folder, contains some information about Ada Lovelace. The problem is that this file is full of identifying information about people and, as such, is really not GDPR-compliant 😱 (info: the [General Data Protection Regulation](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) is a regulation in EU law on data protection and privacy).

## Guidelines
The objective of this exercise is to write a function that cleans up a file by replacing all mentions of people's names with "\[REDACTED\]", in order to comply with European law.

In [2]:
# TODO : Imports
import spacy

!spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)

In [3]:
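# TODO : load file and have a look at it
# Decode explicitly as UTF-8 so that "née" and the en dash are read correctly on every platform
with open('ada lovelace.txt', 'r', encoding='utf-8') as f:
    text = f.read()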

print(text)

Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation, and published the first algorithm intended to be carried out by such a machine. As a result, she is sometimes regarded as the first to recognise the full potential of a "computing machine" and one of the first computer programmers.

Lovelace became close friends with her tutor Mary Somerville, who introduced her to Charles Babbage in 1833. She had a strong respect and affection for Somerville, and they corresponded for many years. Other acquaintances included the scientists Andrew Crosse, Sir David Brewster, Charles Wheatstone, Michael Faraday and the author Charles Dickens.
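
Accented characters are a quick check that the file was decoded correctly. When UTF-8 bytes are read with a legacy codec such as cp1252 (an assumption about the platform default, typical on Windows), "née" shows up as "nÃ©e". The following is a minimal sketch that reproduces and reverses the effect; `sample` is a hypothetical stand-in for a fragment of the file.

```python
# Hypothetical stand-in for a fragment of the file contents.
sample = "née Byron; 10 December 1815 – 27 November 1852"

# Simulate reading UTF-8 bytes with the wrong codec (cp1252).
garbled = sample.encode("utf-8").decode("cp1252")
print(garbled)

# Undo the damage: re-encode with the wrong codec, then decode as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```

Passing `encoding='utf-8'` to `open()` avoids the problem at the source.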


**Q1.** Using the spaCy NER tools, identify the **entities** in this document and their corresponding labels.

In [5]:
# TODO : Named Entities Recognition
txt_doc = nlp(text)

# Print each entity's text next to its label
for ent in txt_doc.ents:
    print(ent.text, ent.label_)

Augusta Ada King PERSON
Lovelace PERSON
10 December 1815 DATE
27 November 1852 DATE
English NORP
Charles Babbage's PERSON
the Analytical Engine ORG
first ORDINAL
first ORDINAL
first ORDINAL
one CARDINAL
first ORDINAL
Lovelace PERSON
Mary Somerville PERSON
Charles Babbage PERSON
1833 DATE
Somerville GPE
many years DATE
Andrew Crosse PERSON
David Brewster PERSON
Charles Wheatstone PERSON
Michael Faraday PERSON
Charles Dickens PERSON
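
To see which labels matter for redaction, the entity list can be summarised per label. Below is a minimal sketch on a hand-copied subset of the (text, label) pairs printed above; with spaCy the pairs could be built as `[(e.text, e.label_) for e in txt_doc.ents]`. The variable names are illustrative only.

```python
from collections import Counter

# Subset of the (text, label) pairs printed above, copied by hand.
pairs = [
    ("Augusta Ada King", "PERSON"),
    ("10 December 1815", "DATE"),
    ("Charles Babbage's", "PERSON"),
    ("the Analytical Engine", "ORG"),
    ("Mary Somerville", "PERSON"),
    ("Somerville", "GPE"),
    ("Charles Dickens", "PERSON"),
]

# How many entities carry each label.
label_counts = Counter(label for _, label in pairs)

# Only the entities that a PERSON-based redaction would catch.
people = [text for text, label in pairs if label == "PERSON"]

print(label_counts)
print(people)
```

Note that the standalone "Somerville" is tagged `GPE` in the output above, so a filter on `PERSON` alone leaves it in place.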


**Q2.** Display the identified entities in a more visual manner.

In [7]:
# TODO : NER visualization
from spacy import displacy

displacy.render(txt_doc, style='ent')

**Q3.** Write a function `replace_name_by_redacted` that will modify the document in order to replace all occurrences of names with "\[REDACTED\]", and apply it to the file.

In [8]:
# TODO : `replace_name_by_redacted`
def replace_name_by_redacted(token):
    """Return "[REDACTED]" for a token inside a PERSON entity, else the token's text."""
    if token.ent_type_ == "PERSON":
        return "[REDACTED]"
    else:
        return token.text
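
Because the check above runs once per token, a multi-token name yields one placeholder per token. One possible alternative is to redact whole entities by character offset. The sketch below is plain Python and `redact_spans` is a hypothetical helper, not part of spaCy. With spaCy, the spans could be collected as `[(e.start_char, e.end_char) for e in txt_doc.ents if e.label_ == "PERSON"]`.

```python
def redact_spans(text, spans, placeholder="[REDACTED]"):
    """Replace each (start, end) character span in `text` with `placeholder`.

    Spans are assumed not to overlap, which holds for spaCy's `doc.ents`.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])  # remainder after the last span
    return "".join(pieces)


sentence = "She met Charles Babbage in 1833."
# Locate the name with str.index for this example.
start = sentence.index("Charles Babbage")
spans = [(start, start + len("Charles Babbage"))]
print(redact_spans(sentence, spans))
```

This keeps the original spacing and punctuation, since only the characters inside each span are touched.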

**Q4.** Write a function `make_doc_GDPR_compliant` that will modify the document in order to replace all occurrences of names with "\[REDACTED\]", and apply it to the file.

In [10]:
def make_doc_GDPR_compliant(txt_doc):
    """Return the document text with every PERSON token replaced by "[REDACTED]"."""
    replaced_redact = [replace_name_by_redacted(tok) for tok in txt_doc]
    return " ".join(replaced_redact)

make_doc_GDPR_compliant(txt_doc)

'[REDACTED] [REDACTED] [REDACTED] , Countess of [REDACTED] ( née Byron ; 10 December 1815 – 27 November 1852 ) was an English mathematician and writer , chiefly known for her work on [REDACTED] [REDACTED] [REDACTED] proposed mechanical general - purpose computer , the Analytical Engine . She was the first to recognise that the machine had applications beyond pure calculation , and published the first algorithm intended to be carried out by such a machine . As a result , she is sometimes regarded as the first to recognise the full potential of a " computing machine " and one of the first computer programmers . \n\n [REDACTED] became close friends with her tutor [REDACTED] [REDACTED] , who introduced her to [REDACTED] [REDACTED] in 1833 . She had a strong respect and affection for Somerville , and they corresponded for many years . Other acquaintances included the scientists [REDACTED] [REDACTED] , Sir [REDACTED] [REDACTED] , [REDACTED] [REDACTED] , [REDACTED] [REDACTED] and the au
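
Joining tokens with a single space puts a space before punctuation, as in the output above. spaCy tokens carry their trailing whitespace in `whitespace_`, so the original spacing can be restored when the pieces are reassembled. Below is a sketch using a hypothetical `FakeToken` stand-in so that it runs without a model; real `Token` objects expose the same three attributes.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in mirroring three spaCy Token attributes."""
    text: str
    ent_type_: str = ""
    whitespace_: str = " "


def redact_keep_spacing(tokens, placeholder="[REDACTED]"):
    """Redact PERSON tokens while keeping each token's trailing whitespace."""
    out = []
    for tok in tokens:
        word = placeholder if tok.ent_type_ == "PERSON" else tok.text
        out.append(word + tok.whitespace_)
    return "".join(out)


tokens = [
    FakeToken("Her"),
    FakeToken("tutor"),
    FakeToken("Mary", "PERSON"),
    FakeToken("Somerville", "PERSON", ""),
    FakeToken(","),
    FakeToken("who"),
    FakeToken("wrote", whitespace_=""),
    FakeToken(".", whitespace_=""),
]
print(redact_keep_spacing(tokens))
```

With a real document, the same function could be called as `redact_keep_spacing(txt_doc)`.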