This notebook provides an additional description of the system. In it, I discuss the system details and demonstrate its capabilities by asking 10 questions.

## System assumptions

The system is a simple chatbot with a command-line interface that answers GDPR-related questions based on the `GDPR Art 1-21.pdf` file. This version uses the OpenAI API for the language model (`gpt-3.5-turbo-0125`) and FAISS for similarity search over OpenAI embeddings (`text-embedding-3-large`). You can easily replace these models with other OpenAI models by defining them as environment variables.
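
As a rough illustration of the environment-variable override, the models could be resolved as in the sketch below. The variable names `LLM_MODEL` and `EMBEDDING_MODEL` are hypothetical placeholders (the project may use different names); only the default model names come from the description above.

```python
import os

# Defaults taken from the description above.
DEFAULT_LLM_MODEL = "gpt-3.5-turbo-0125"
DEFAULT_EMBEDDING_MODEL = "text-embedding-3-large"


def load_model_config(env=None):
    """Return the model names, preferring environment overrides.

    LLM_MODEL and EMBEDDING_MODEL are hypothetical variable names
    used for illustration only.
    """
    env = os.environ if env is None else env
    return {
        "llm_model": env.get("LLM_MODEL", DEFAULT_LLM_MODEL),
        "embedding_model": env.get("EMBEDDING_MODEL", DEFAULT_EMBEDDING_MODEL),
    }


config = load_model_config()
print(config["llm_model"], config["embedding_model"])
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without touching the real environment.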

Given the limited time, I made several simplifications:
* PDF processing is relatively simple; more steps could be added for better text cleaning;
* I did not experiment much with different LLMs and embedding models; the choice of models comes from experience;
* Processed documents are stored in JSON format within the project rather than in an external database, to avoid adding another dependency to the project;
* The chatbot stores chat history in memory and not in a database, so it remembers only messages from the current session. This can be easily changed by leveraging [LangChain integrations with DBs](https://python.langchain.com/v0.1/docs/integrations/memory/);
* I did not experiment with different chunking approaches for RAG vectorization;
* There was no exhaustive quality testing of the RAG. I used 10 questions to demonstrate the system's capabilities in this notebook.
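
To illustrate the JSON storage choice from the list above, here is a minimal sketch of saving and reloading processed articles with the standard library. The file name `processed_articles.json`, the helper names, and the record layout are assumptions for illustration; the `article_number` and `article_summary` fields mirror the metadata that appears in the logs later in this notebook.

```python
import json
import tempfile
from pathlib import Path


def save_documents(documents, path):
    """Write processed articles to a JSON file."""
    payload = json.dumps(documents, ensure_ascii=False, indent=2)
    Path(path).write_text(payload, encoding="utf-8")


def load_documents(path):
    """Read processed articles back; return None if the file is missing."""
    path = Path(path)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical record; the text is a placeholder, not real GDPR content.
documents = [
    {
        "article_number": 1,
        "article_summary": "Subject-matter and objectives",
        "text": "placeholder article text",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "processed_articles.json"  # hypothetical file name
    save_documents(documents, json_path)
    restored = load_documents(json_path)

print(restored == documents)
```

Returning `None` for a missing file lets the caller decide whether to re-process the PDF or fail.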

## System design

There are two main components of the system: RAG and the chatbot. The RAG component is a FAISS index that stores document embeddings in memory. When the user asks a question, the question is used as a query to retrieve a few relevant documents (4 by default) from the index. Building the index involved several text processing steps: the text was divided into articles, some redundant and repeated elements were removed, and finally each article was split into chunks of at most 1000 characters with a 100-character overlap. The resulting documents were then vectorized using the `text-embedding-3-large` embedding model from OpenAI.
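
The chunking and retrieval steps can be sketched in plain Python. This is a simplified stand-in, not the project's code: the actual system relies on a LangChain text splitter and a FAISS index, whereas `chunk_text` below uses a fixed-size sliding window and `top_k` does a brute-force cosine-similarity search over toy vectors (the metric used by the real index may differ). All function names are made up for the illustration.

```python
import math


def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into windows of at most chunk_size characters,
    where consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, document_vectors, k=4):
    """Return the indices of the k vectors most similar to the query."""
    scores = [cosine_similarity(query_vector, v) for v in document_vectors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


chunks = chunk_text("x" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 700]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.5, 0.5]]
print(top_k([1.0, 0.0], toy_vectors, k=2))  # [0, 2]
```

The overlap means the last 100 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still seen whole in at least one chunk when it is short enough.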

The chatbot is implemented using LangChain and has fairly basic logic. It receives the user's message, asks the RAG for several relevant documents, and checks whether the documents are relevant to the question. If they are, it answers the question based on the information available; if not, it simply responds "I don't know". The chatbot stores the interaction history, so it remembers all previous messages within the session.
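
The control flow described above can be sketched as follows. This is a toy outline with hypothetical names (`ToyChatbot`, `retrieve`, `is_relevant`, `generate`), not the project's `Chatbot` class: the retriever, the relevance check, and the answer generator are injected as plain callables standing in for the FAISS index and the LLM calls.

```python
class ToyChatbot:
    """Illustrative retrieve -> check relevance -> answer loop with in-memory history."""

    def __init__(self, retrieve, is_relevant, generate):
        self.retrieve = retrieve        # question -> list of documents
        self.is_relevant = is_relevant  # (question, documents) -> bool
        self.generate = generate        # (question, documents, history) -> answer
        self.history = []               # (speaker, message) pairs, current session only

    def respond(self, question):
        documents = self.retrieve(question)
        if documents and self.is_relevant(question, documents):
            answer = self.generate(question, documents, self.history)
        else:
            answer = "I don't know."
        # Record both turns so later questions can refer to earlier ones.
        self.history.append(("Human", question))
        self.history.append(("AI", answer))
        return answer


# Stub components for demonstration; a real system would call the index and the LLM.
corpus = ["Article 17 describes the right to erasure."]
toy_bot = ToyChatbot(
    retrieve=lambda question: corpus,
    is_relevant=lambda question, documents: "erasure" in question.lower(),
    generate=lambda question, documents, history: f"Based on the context: {documents[0]}",
)

print(toy_bot.respond("What is the right to erasure?"))
print(toy_bot.respond("How big is the moon?"))
print(len(toy_bot.history))
```

Because the history lives in a plain list on the object, it disappears when the process exits, which matches the session-only memory described earlier.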

For deployment, I used [poetry](https://python-poetry.org/) for virtual environment management. All dependencies are specified in `pyproject.toml`. The service is containerized with Docker to make its setup very easy.

## System capabilities with examples

Let's see how the chatbot answers questions and what information it gets from the RAG. To do this, we'll configure the logger to show only the logs we need, namely which documents the RAG picks to add to the context for the response. We'll check not only GDPR-related questions but also questions unrelated to the document, to make sure the bot responds `"I don't know."` to them.

In [1]:
import logging
# Show DEBUG logs from the engine modules, including the retrieved documents.
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG)
# Quiet noisy third-party loggers so that only the relevant logs remain.
logging.getLogger('httpx').setLevel(level=logging.WARNING)
logging.getLogger('langchain_text_splitters').setLevel(level=logging.ERROR)
logging.getLogger('openai').setLevel(level=logging.WARNING)
logging.getLogger('httpcore').setLevel(level=logging.WARNING)

from engine.chatbot import Chatbot

In [2]:
bot = Chatbot()

DEBUG:engine.retriever:Starting index setup.
DEBUG:engine.retriever:Trying to read raw documents file.
DEBUG:engine.retriever:Raw documents file was found.
DEBUG:engine.retriever:Converting documents to LangChain documents for convenient processing.
DEBUG:engine.retriever:Documents were converted to LangChain documents format for convenient processing.
DEBUG:engine.retriever:Splitting documents for vectorization.
DEBUG:engine.retriever:Documents were splitted.
DEBUG:engine.retriever:Creating FAISS index with OpenAI embeddings.
DEBUG:faiss.loader:Environment variable FAISS_OPT_LEVEL is not set, so let's pick the instruction set according to the current CPU
INFO:faiss.loader:Loading faiss.
INFO:faiss.loader:Successfully loaded faiss.
DEBUG:engine.retriever:FAISS index with OpenAI embeddings was created.
DEBUG:engine.retriever:Index setup was successful.


### Question 1

Let's start with the first question. Below we can see the logs with the 4 most relevant documents found by the RAG.

In [3]:
question = "What is the principle of transparency stands for?"
response = bot.respond(question)

DEBUG:engine.chatbot:Found relevant document:
 page_content='data subject

Recitals

(58) The principle of transparency requires that any information addressed to the public or to the data subject be
concise, easily accessible and easy to understand, and that clear and plain language and, additionally, where
appropriate, visualisation be used. Such information could be provided in electronic form, for example, when
addressed to the public, through a website. This is of particular relevance in situations where the proliferation of
actors and the technological complexity of practice make it difficult for the data subject to know and understand
whether, by whom and for what purpose personal data relating to him or her are being collected, such as in the case
of online advertising. Given that children merit specific protection, any information and communication, where
processing is addressed to a child, should be in such a clear and plain language that the child can easily understand.' met

In [4]:
# Response
print(response)

The principle of transparency, as outlined in the GDPR, requires that any information addressed to the public or to the data subject be concise, easily accessible, and easy to understand. It emphasizes the use of clear and plain language, and where appropriate, visualization. This principle ensures that individuals are informed about the collection, use, and processing of their personal data, including the purposes of the processing and their rights in relation to such processing.


As you can see, the bot has access to the chat history.

In [6]:
print(bot.memory)

Human: What is the principle of transparency stands for?
AI: The principle of transparency, as outlined in the GDPR, requires that any information addressed to the public or to the data subject be concise, easily accessible, and easy to understand. It emphasizes the use of clear and plain language, and where appropriate, visualization. This principle ensures that individuals are informed about the collection, use, and processing of their personal data, including the purposes of the processing and their rights in relation to such processing.


### Question 2

Now let's ask a question that is not relevant to GDPR.

In [7]:
question = "How big is the moon?"
response = bot.respond(question)

DEBUG:engine.chatbot:Found relevant document:
 page_content='[5] Commission Recommendation of 6 May 2003 concerning the definition of micro, small and medium-sized
https://eur-lex.europa.eu/legal-
L
enterprises
content/EN/AUTO/?uri=OJ:L:2003:124:TOC

20.5.2003,

(C(2003)

1422)

124,

36).

(OJ

p.' metadata={'article_number': 1, 'article_summary': "Subject-matter and objectives: This article outlines the GDPR's purpose, which is to protect individuals' rights regarding their personal data and to regulate the processing of such data.\n"}
DEBUG:engine.chatbot:Found relevant document:
 page_content='personal data, and equivalent sanctions in all Member States as well as effective cooperation between the supervisory
authorities of different Member States. The proper functioning of the internal market requires that the free
movement of personal data within the Union is not restricted or prohibited for reasons connected with the protection
of natural persons with regard to the processing of

In [9]:
# Great, it doesn't know :)
print(response)

I don't know.


### Question 3

Let's try something trickier, a question that mentions GDPR.

In [12]:
question = "What is the main differences between GDPR and US Data Privacy Laws?"
response = bot.respond(question)

DEBUG:engine.chatbot:Found relevant document:
 page_content='Guidelines & Case Law

WP29, Update of Opinion on applicable law in light of the CJEU judgement in Google Spain (2010).

EDPB, Guidelines 3/2018 on the Territorial Scope of the GDPR (2019).

EDPB, Guidelines 05/2021 on the Interplay between the application of Article 3 and the provisions on international
transfers as per Chapter V of the GDPR (2021).

Case Law

CJEU, Pammer and Hotel Alpenhof GesmbH/Reederei Karl Schlüter GmbH & Co. KG and Heller, C-585/08 and
C-144/09 (2010).

CJEU, Google Spain SL/Agencia española de protección de datos, C-131/12 (2014).

CJEU, Verein für Konsumenteninformation/Amazon EU Sàrl, C-191/15 (2015).

CJEU, Weltimmo s.r.o./Nemzeti Adatvédelmi és Információszabadság Hatóság, C-230/14 (2015).

CJEU, Unabhängiges Landeszentrum für Datenschutz Schleswig-Holstein/Wirtschaftsakademie Schleswig-Holstein
GmbH, C-210/16 (2018).' metadata={'article_number': 3, 'article_summary': 'Territorial scope: Defines 

In [13]:
print(response)

I don't know.


### Question 4

In [10]:
question = "What are the additional GDPR requirements for financial institutions?"
response = bot.respond(question)

DEBUG:engine.chatbot:Found relevant document:
 page_content='parties or organisations, churches or parishes. Trade union membership may be revealed by the deduction of an annual

membership fee from a person’s bank account. Personal data concerning health may be gathered from analysing medical bills

paid by a data subject. Finally, information on certain purchases may reveal information concerning a person’s sex life or

sexual orientation.

Moreover, through the sum of financial transactions, different kinds of behavioural patterns could be revealed, including

special categories of personal data and additional services that are facilitated by account information services might rely on

profiling as defined by article 4 (4) of the GDPR. Therefore, the chances are considerable that a service provider processing

information on financial transactions of data subjects also processes special categories of personal data.

2. Paragraph 1 shall not apply if one of the following applies:

Re

In [11]:
print(response)

Financial institutions processing personal data are subject to the same GDPR requirements as other organizations. However, due to the nature of the data they handle, such as financial transactions and potentially special categories of personal data, they may have additional obligations to ensure the security and confidentiality of this sensitive information. Financial institutions must also comply with sector-specific regulations and guidelines, such as those related to payment services and the processing of financial data.


### Question 5

In [14]:
question = "What is a breach of personal data?"
response = bot.respond(question)

DEBUG:engine.chatbot:Found relevant document:
 page_content='– “Integrity breach” – where there is an unauthorised or accidental alteration of personal data.

– “Availability breach” – where there is an accidental or unauthorised loss of access to, or destruction of, personal data.

(13) ‘genetic data’ means personal data relating to the inherited or acquired genetic characteristics of a natural
person which give unique information about the physiology or the health of that natural person and which result, in
particular, from an analysis of a biological sample from the natural person in question;

(34) Genetic data should be defined as personal data relating to the inherited or acquired genetic characteristics of a natural person

Recitalswhich result from the analysis of a biological sample from the natural person in question,

in particular chromosomal,

deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) analysis, or from the analysis of another element enabling equivalent

inform

In [15]:
print(response)

A breach of personal data, as defined in the GDPR, refers to a breach of security leading to the accidental or unlawful destruction, loss, alteration, unauthorized disclosure of, or access to personal data. This can include breaches such as unauthorized access to personal data, accidental deletion of data, or any other incident that compromises the security and integrity of personal data.


### Question 6

In [16]:
question = "Let's say I want to do a research and I need to process personal data, how to get a consent from data subjects?"
response = bot.respond(question)

DEBUG:engine.chatbot:Found relevant document:
 page_content='Consent should cover all processing activities carried out for the same purpose or purposes. When the processing has multiple

purposes, consent should be given for all of them. If the data subject's consent is to be given following a request by electronic means,

the request must be clear, concise and not unnecessarily disruptive to the use of the service for which it is provided.

(33) It is often not possible to fully identify the purpose of personal data processing for scientific research purposes at the time of

data collection. Therefore, data subjects should be allowed to give their consent to certain areas of scientific research when in

keeping with recognised ethical standards for scientific research. Data subjects should have the opportunity to give their consent onlyto certain areas of research or parts of research projects to the extent allowed by the intended purpose.

Guidelines & Case Law

Document

EDPB, Guid

In [17]:
print(response)

When obtaining consent from data subjects for research purposes, it is important to ensure that the consent process complies with the GDPR requirements. The consent should cover all processing activities carried out for the research purposes and should be given for all of them. If the data subject's consent is to be given following a request by electronic means, the request must be clear, concise, and not unnecessarily disruptive to the use of the service for which it is provided.

In the context of scientific research, where it may not be possible to fully identify the purpose of personal data processing at the time of data collection, data subjects should be allowed to give their consent to certain areas of research in line with recognized ethical standards. Data subjects should have the opportunity to give their consent only to certain areas of research or parts of research projects to the extent allowed by the intended purpose.

It is essential to clearly explain to data subjects t

### Question 7

In [18]:
question = "Do I always need to remove data when data subject asks to remove his data?"
response = bot.respond(question)

DEBUG:engine.chatbot:Found relevant document:
 page_content='1. The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or
her without undue delay and the controller shall have the obligation to erase personal data without undue delay
where one of the following grounds applies:

(a) the personal data are no longer necessary in relation to the purposes for which they were collected or otherwise
processed;

(b) the data subject withdraws consent on which the processing is based according to point (a) of Article 6(1), or
point (a) of Article 9(2), and where there is no other legal ground for the processing;

(c) the data subject objects to the processing pursuant to Article 21(1) and there are no overriding legitimate
grounds for the processing, or the data subject objects to the processing pursuant to Article 21(2);

(d) the personal data have been unlawfully processed;' metadata={'article_number': 17, 'article_summary': "Right to e

In [19]:
print(response)

In certain circumstances, you are obligated to remove personal data when a data subject requests its erasure. According to the GDPR, the data subject has the right to obtain from the controller the erasure of personal data concerning them without undue delay in specific situations. These situations include when the personal data are no longer necessary for the purposes for which they were collected, when the data subject withdraws consent and there is no other legal ground for processing, when the data subject objects to the processing and there are no overriding legitimate grounds, or when the personal data have been unlawfully processed.

However, there are exceptions to this right to erasure. For example, if the processing of personal data is necessary for exercising the right of freedom of expression and information, for compliance with a legal obligation, for the performance of a task carried out in the public interest, for archiving purposes in the public interest, scientific or 

### Question 8

In [20]:
question = "When it comes to crimes, is the authority required to get a permission to process data from the convicted person?"
response = bot.respond(question)

DEBUG:engine.chatbot:Found relevant document:
 page_content='Processing of personal data relating to criminal convictions and offences or related security measures based on
Article 6(1) shall be carried out only under the control of official authority or when the processing is authorised by
Union or Member State law providing for appropriate safeguards for the rights and freedoms of data subjects. Any
comprehensive register of criminal convictions shall be kept only under the control of official authority.

General Data Protection Regulation (EU GDPR)

The latest consolidated version of the Regulation with corrections by Corrigendum, OJ L 127, 23.5.2018, p. 2
((EU) 2016/679). Source: EUR-lex.Related information Article 10. Processing of personal data relating

to criminal convictions and offences

Legislation

Guidelines & Case Law' metadata={'article_number': 10, 'article_summary': 'Processing of personal data relating to criminal convictions and offences: States that processing perso

In [21]:
print(response)

Processing of personal data relating to criminal convictions and offences or related security measures must be carried out under the control of official authority or when authorized by Union or Member State law providing appropriate safeguards for the rights and freedoms of data subjects. In the context of crimes, the authority processing such data does not necessarily need permission from the convicted person, as the processing is typically carried out for specific legal purposes related to the prevention, investigation, detection, or prosecution of criminal offenses. However, the processing must be in accordance with the relevant laws and regulations that provide safeguards for the rights of data subjects.


### Question 9

In [22]:
question = "I need to write an email to restrict the processing of my data. Give me an example or a template."
response = bot.respond(question)

DEBUG:engine.chatbot:Found relevant document:
 page_content='The duration of the restriction is also unclear. The European legal texts
mention that data must not be processed “for a period enabling the controller
to verify [their] accuracy”, for example, or “pending the verification” whether
exists a legitimate ground overriding a person’s objection to processing. The
time during which data cannot be processed will be a question of facts or of
individual situations. One thing is clear though: a person must be informed
before the restriction of processing is lifted.

Data Subject Request Letter Sample

Author

Expert commentary

Concern: Request to restrict the processing of my personal data

Dear Madam, Dear Sir,

I am entitled to ask you to restrict the processing of my personal data under
Article 18(1) of the General Data Protection Regulation (GDPR).' metadata={'article_number': 18, 'article_summary': 'Right to restriction of processing: Provides data subjects the right to restrict 

In [23]:
print(response)

Certainly! Here is an example template for an email requesting the restriction of processing of personal data:

Subject: Request to Restrict Processing of Personal Data

Dear [Recipient's Name],

I am writing to formally request the restriction of processing of my personal data in accordance with Article 18(1) of the General Data Protection Regulation (GDPR).

I hereby request that you cease processing my personal data until the accuracy of the data can be verified. Please refrain from any further processing of my personal data pending the verification process.

I trust that you will comply with this request promptly and inform me once the restriction of processing has been implemented. Please acknowledge receipt of this request at your earliest convenience.

Thank you for your attention to this matter.

Sincerely,
[Your Name]


### Question 10

In [24]:
question = "If we have an application for children and we process their data, is it possible from GDPR point of view?"
response = bot.respond(question)

DEBUG:engine.chatbot:Found relevant document:
 page_content='minimise the amount of personal data the controller has to process itself.

3. Paragraph 1 shall not affect the general contract law of Member States such as the rules on the validity,
formation or effect of a contract in relation to a child.

General Data Protection Regulation (EU GDPR)

The latest consolidated version of the Regulation with corrections by Corrigendum, OJ L 127, 23.5.2018, p. 2
((EU) 2016/679). Source: EUR-lex.Related information Article 8. Conditions applicable to child's

consent in relation to information society services

Recitals' metadata={'article_number': 8, 'article_summary': "Conditions applicable to child's consent in relation to information society services: Sets the age of consent for data processing related to information society services at 16, with the possibility for member states to lower it to no less than 13 years.\n"}
DEBUG:engine.chatbot:Found relevant document:
 page_content='1. Where 

In [25]:
print(response)

Under the GDPR, processing the personal data of children is subject to specific requirements and safeguards. When offering information society services directly to a child, the processing of their personal data is lawful if the child is at least 16 years old. If the child is below the age of 16, processing their data is lawful only if and to the extent that consent is given or authorized by the holder of parental responsibility over the child.

Therefore, if you have an application specifically targeted at children, you must ensure compliance with the GDPR requirements regarding obtaining valid consent for processing their personal data. This may involve obtaining parental consent for children below the age of 16 or any lower age set by Member States, as well as implementing appropriate measures to protect the privacy and rights of children when processing their data.
