# Challenge - GDPR Compliant

![](http://eleanorglanvillecentre.lincoln.ac.uk/assets/images/content/_large/adalovelacehero.jpg)

The `ada_lovelace.txt` file, located in the same folder, contains some information about Ada Lovelace. The problem is that this file is full of identifying information about people and, as such, is really not GDPR-compliant 😱 (info: the [General Data Protection Regulation](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) is a regulation in EU law on data protection and privacy).

## Guidelines
The objective of this exercise is to write a function that will clean up a file by replacing all mentions of people's names with "\[REDACTED\]", in order to comply with European law.

In [6]:
!pip install spacy --user
!python -m spacy download en_core_web_md

In [90]:
# TODO : Imports
import spacy
# Loads the displacy submodule, used for the visualization in Q2
from spacy import displacy

In [92]:
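# TODO : load file and have a look at it
# The text contains non-ASCII characters ("née", "–"), so state the encoding explicitly
with open('ada lovelace.txt', 'r', encoding='utf-8') as file:
    file_contents = file.read()
print(file_contents)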

Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation, and published the first algorithm intended to be carried out by such a machine. As a result, she is sometimes regarded as the first to recognise the full potential of a "computing machine" and one of the first computer programmers. 

Lovelace became close friends with her tutor Mary Somerville, who introduced her to Charles Babbage in 1833. She had a strong respect and affection for Somerville, and they corresponded for many years. Other acquaintances included the scientists Andrew Crosse, Sir David Brewster, Charles Wheatstone, Michael Faraday and the author Charles Dickens.


**Q1.** Using the spaCy NER tools, identify the **entities** in this document and their corresponding labels.

In [94]:
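# TODO : Named Entities Recognition
nlp = spacy.load('en_core_web_md')  # English pipeline downloaded in the install cell
doc = nlp(file_contents)
for ent in doc.ents:
    # Entity text followed by its predicted label (PERSON, DATE, ...)
    print(ent.text, ent.label_)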

Augusta Ada King PERSON
Lovelace PERSON
née Byron PERSON
10 December 1815 DATE
27 CARDINAL
November 1852 DATE
English NORP
Charles Babbage's PERSON
the Analytical Engine ORG
first ORDINAL
first ORDINAL
first ORDINAL
one CARDINAL
first ORDINAL
Lovelace PERSON
Mary Somerville PERSON
Charles Babbage PERSON
1833 DATE
Somerville GPE
many years DATE
Andrew Crosse PERSON
David Brewster PERSON
Charles Wheatstone PERSON
Michael Faraday PERSON
Charles Dickens PERSON
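
The flat listing above mixes people, dates and ordinals, which makes it hard to see at a glance what would need redacting. As a minimal sketch, the pairs can be grouped by label. `group_entities_by_label` is a hypothetical helper (not part of spaCy), and the sample pairs are copied from the output above so the snippet runs without a model.

```python
from collections import defaultdict


def group_entities_by_label(entities):
    """Group (text, label) pairs by label, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# A few pairs taken from the listing above (assumed input for illustration)
sample = [
    ("Augusta Ada King", "PERSON"),
    ("10 December 1815", "DATE"),
    ("Lovelace", "PERSON"),
    ("Somerville", "GPE"),
    ("Lovelace", "PERSON"),
]

grouped = group_entities_by_label(sample)
print(grouped["PERSON"])
```

With the real document, the same helper could be fed `((ent.text, ent.label_) for ent in doc.ents)`. Grouping this way also makes mislabelled names easier to spot, such as "Somerville" appearing under `GPE`.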


**Q2.** Display the identified entities in a more visual manner.

In [95]:
# TODO : NER visualization
spacy.displacy.render(doc, style="ent", jupyter=True)
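
For a redaction task, it can help to highlight only the `PERSON` entities. displaCy accepts an `options` dictionary whose `ents` key restricts the labels that are rendered. The sketch below assumes a blank English pipeline with hand-set entities so that it runs without downloading a model; the sentence and the spans are illustrative, not taken from the model's output.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained NER (assumption for a self-contained demo)
nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("Ada Lovelace met Charles Babbage in 1833.")

# Entities set by hand, using token indices: [start, end)
demo_doc.ents = [
    Span(demo_doc, 0, 2, label="PERSON"),  # "Ada Lovelace"
    Span(demo_doc, 3, 5, label="PERSON"),  # "Charles Babbage"
    Span(demo_doc, 6, 7, label="DATE"),    # "1833"
]

# jupyter=False returns the markup as a string instead of displaying it
html = displacy.render(
    demo_doc, style="ent", options={"ents": ["PERSON"]}, jupyter=False
)
```

In the notebook, the same `options` argument can be passed together with `jupyter=True` on the `doc` built from the file.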

**Q3.** Write a function `replace_name_by_redacted` that will modify the document in order to replace all occurrences of names with "\[REDACTED\]", and apply it to the file.

In [97]:
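# TODO : `replace_name_by_redacted`
def replace_name_by_redacted(token):
    """Return "[REDACTED]" for a token tagged PERSON, otherwise the token text with its trailing whitespace."""
    if token.ent_type_ == "PERSON":
        # Note: the trailing whitespace of a redacted token is not kept
        return "[REDACTED]"
    return token.text_with_ws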

**Q4.** Write a function `make_doc_GDPR_compliant` that will modify the document in order to replace all occurrences of names with "\[REDACTED\]", and apply it to the file.

In [108]:
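def make_doc_GDPR_compliant(doc):
    # Rebuild the text token by token, redacting every token tagged PERSON
    redacted_text = "".join(replace_name_by_redacted(token) for token in doc)
    # Caution: this overwrites the original file with the redacted text
    with open('ada lovelace.txt', 'w', encoding='utf-8') as file:
        file.write(redacted_text)
    return redacted_text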

make_doc_GDPR_compliant(doc)

'[REDACTED][REDACTED][REDACTED], Countess of [REDACTED]([REDACTED][REDACTED]; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on [REDACTED][REDACTED][REDACTED]proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation, and published the first algorithm intended to be carried out by such a machine. As a result, she is sometimes regarded as the first to recognise the full potential of a "computing machine" and one of the first computer programmers. \n\n[REDACTED]became close friends with her tutor [REDACTED][REDACTED], who introduced her to [REDACTED][REDACTED]in 1833. She had a strong respect and affection for Somerville, and they corresponded for many years. Other acquaintances included the scientists [REDACTED][REDACTED], Sir [REDACTED][REDACTED], [REDACTED][REDACTED], [REDACTED][REDACTED]and the author [REDACTED][REDACTED].'
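
In the output above, multi-word names turn into several consecutive placeholders and the space after each redacted token disappears, because the replacement is done token by token. One possible alternative, sketched below, works on character spans so that each entity becomes a single placeholder and the surrounding text is left untouched. `redact_spans` is a hypothetical helper, and the sample sentence and spans are built by hand so the snippet runs without a model.

```python
def redact_spans(text, spans, labels=("PERSON",), placeholder="[REDACTED]"):
    """Replace each (start, end, label) span whose label is in `labels` by `placeholder`."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        # Skip labels we do not redact, and spans overlapping one already handled
        if label not in labels or start < cursor:
            continue
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])  # remainder after the last entity
    return "".join(pieces)


text = (
    "Lovelace became close friends with her tutor Mary Somerville, "
    "who introduced her to Charles Babbage in 1833."
)
# Hand-built spans standing in for (ent.start_char, ent.end_char, ent.label_)
names = ["Lovelace", "Mary Somerville", "Charles Babbage"]
spans = [(text.index(n), text.index(n) + len(n), "PERSON") for n in names]
spans.append((text.index("1833"), text.index("1833") + 4, "DATE"))

redacted = redact_spans(text, spans)
print(redacted)
```

With a spaCy document, the spans could be taken from `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]` and applied to `doc.text`. This only redacts what the model labels as `PERSON`: in the results above, "Somerville" was tagged `GPE` in one sentence and therefore stays in the text, so a manual review remains necessary.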