1. Process data

    1. Read documents
    2. Chunk documents
    3. Store and index documents

In [1]:
import os
import json
import openai
import pandas as pd
from dotenv import load_dotenv
from langchain.document_loaders import Docx2txtLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from demo_utils import demo_utils
from openai.embeddings_utils import cosine_similarity
from IPython.display import Markdown

utils = demo_utils()
collection = utils.getorcreatecollection()

# start from an empty collection so earlier runs do not leave stale documents behind
if collection is not None and collection.count() > 0:
    utils.deletecollection()
    collection = utils.getorcreatecollection()
print("Collection count: ", collection.count())


Collection count:  0


1.1. Read document

`CHUNK_SIZE` is a knob you can tune to change the test results.
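To get a feel for how chunk size and overlap interact, here is a minimal sketch of a fixed-width sliding window. It is a simplified stand-in and not the algorithm used by `RecursiveCharacterTextSplitter`, which also splits on separators; the function name `sliding_chunks` and the sample string are made up for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

A larger overlap repeats more context across neighbouring chunks, at the cost of producing more chunks for the same text.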

In [2]:
CHUNK_SIZE = 500
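CHUNK_OVERLAP = int(CHUNK_SIZE * 0.4)  # keep the overlap a whole number of characters

# Split a .docx file into text chunks, save each chunk as a .txt file,
# and optionally add each chunk to the chromadb collection.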
def docx2txt(filepath, extract_page: bool=False, target_path: str="./chunks-txt", collection=None):
    if extract_page:
        loader = Docx2txtLoader(filepath)
        data = loader.load()
        print(data)

        prefixfilename  = os.path.splitext(os.path.basename(filepath))[0]

        for i in range(0, len(data)):
            with open(f"./{target_path}/{prefixfilename}_{str(i).zfill(2)}.txt", "w") as f:
                f.write(data[i].page_content)
    else:
        loader = Docx2txtLoader(filepath)
        data = loader.load()

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = CHUNK_SIZE,
            chunk_overlap  = CHUNK_OVERLAP,
            length_function = len,
        )

        # text_splitter = CharacterTextSplitter(separator="\n\n", 
        #                                       chunk_size = CHUNK_SIZE, 
        #                                       chunk_overlap  = CHUNK_OVERLAP, 
        #                                       length_function = len,)

        texts = text_splitter.split_documents(data)

        i = 0 
        prefixfilename  = os.path.splitext(os.path.basename(filepath))[0]        
        for t in texts:
            # save text as txt file
            target_file_name = f"./{target_path}/{prefixfilename}_{str(i).zfill(2)}"
            with open(f"./{target_file_name}.txt", "w") as f:
                f.write(t.page_content)
            if collection is not None:
                # save text to chromadb
                collection.add(
                    documents = [t.page_content],
                    ids = [target_file_name]
                    )
            i += 1


1.2. Chunk document 

Set the source and target paths used to process the documents.

In [5]:
source_path = "doc=docx"
source_files = []

for subfolder in os.listdir(source_path):
    for file in os.listdir(f"{source_path}/{subfolder}"):
        # if file extension is docx
        if file.endswith(".docx"):
            source_file_path = f"{source_path}/{subfolder}/{file}"
            version = subfolder.split("version=")[1]
            source_files.append({"source_file_path": source_file_path, "version": version})

print(source_files)

# create a folder named contents=chunks/doc=txt
target_path = "contents=chunks/doc=txt"
if not os.path.exists(f"./{target_path}"):
    os.makedirs(f"./{target_path}")

for item in source_files:
    # make sure the version folder exists, then clear out chunk files from earlier runs
    os.makedirs(f"./{target_path}/version={item['version']}", exist_ok=True)
    for file in os.listdir(f"./{target_path}/version={item['version']}"):
        # remove each file directly inside the folder (this is not recursive)
        os.remove(f"./{target_path}/version={item['version']}/{file}")

    if collection is not None:
        docx2txt(item['source_file_path'], extract_page=False, target_path=f"./{target_path}/version={item['version']}", collection=collection)
        print("Collection documents count", collection.count())
    else:
        docx2txt(item['source_file_path'], extract_page=False, target_path=f"./{target_path}/version={item['version']}")


Insert of existing embedding ID: ././contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_00
Add of existing embedding ID: ././contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_00


[{'source_file_path': 'doc=docx/version=1/Data Protection in Relational Databases v1.docx', 'version': '1'}, {'source_file_path': 'doc=docx/version=2/Data Protection in Relational Databases v2.docx', 'version': '2'}]


Insert of existing embedding ID: ././contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_01
Add of existing embedding ID: ././contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_01
Insert of existing embedding ID: ././contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_02
Add of existing embedding ID: ././contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_02
Insert of existing embedding ID: ././contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_03
Add of existing embedding ID: ././contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_03
Insert of existing embedding ID: ././contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_04
Add of existing embedding ID: ././contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_04
Insert of existing embedding ID: ././contents=chunks/doc=txt/version=1/Data 

Collection documents count 20
Collection documents count 41
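The "existing embedding ID" warnings above suggest the cell was run against a collection that already held some of these IDs. One possible way to avoid them is to add only the IDs that are not yet stored. The sketch below assumes the collection exposes Chroma-style `get(ids=...)` and `add(documents=..., ids=...)` methods; `add_new_only` and `FakeCollection` are hypothetical names, and the fake class exists only so the example runs without a database.

```python
def add_new_only(collection, documents, ids):
    """Add only the documents whose IDs are not already in the collection."""
    existing = set(collection.get(ids=ids)["ids"])
    fresh = [(doc, doc_id) for doc, doc_id in zip(documents, ids) if doc_id not in existing]
    if fresh:
        collection.add(
            documents=[doc for doc, _ in fresh],
            ids=[doc_id for _, doc_id in fresh],
        )
    return len(fresh)


class FakeCollection:
    """In-memory stand-in (hypothetical) that mimics only get/add/count."""

    def __init__(self):
        self._docs = {}

    def get(self, ids):
        return {"ids": [i for i in ids if i in self._docs]}

    def add(self, documents, ids):
        for doc, doc_id in zip(documents, ids):
            self._docs[doc_id] = doc

    def count(self):
        return len(self._docs)


col = FakeCollection()
first = add_new_only(col, ["a", "b"], ["doc_00", "doc_01"])
second = add_new_only(col, ["a", "b", "c"], ["doc_00", "doc_01", "doc_02"])
print(first, second, col.count())
# 2 1 3
```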


In [6]:
# if collection is not None:
#     # print("collection is not None")
#     # save collection to chromadb
#     # collection.get(ids=["Data Protection in Relational Databases v1_0"])
#     print(collection.query(
#         query_texts=["encryption"],
#         include=["metadatas", "documents", "distances"]
#     ))

Check versions

In [7]:
# get a list of versions from source_files 
versions = []
for item in source_files:
    versions.append(item['version'])
print(versions)

['1', '2']
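The versions are kept as strings taken from folder names of the form `version=N`. As a hedged sketch, the helper below (the name `parse_version` and the folder list are illustrative) skips folders that do not follow that pattern and sorts versions numerically, so that `'10'` would come after `'2'` rather than before it.

```python
import re
from typing import Optional


def parse_version(folder_name: str) -> Optional[str]:
    """Return the version string from a 'version=N' folder name, else None."""
    match = re.fullmatch(r"version=(\d+)", folder_name)
    return match.group(1) if match else None


folders = ["version=10", "version=2", "notes", "version=1"]
sorted_versions = sorted(
    (v for v in map(parse_version, folders) if v is not None),
    key=int,  # numeric order, not alphabetical
)
print(sorted_versions)
# ['1', '2', '10']
```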


Get the files for both version 1 and version 2

In [8]:
# read txt from both versions
# get the list of chunk files
version1_files = []
version2_files = []

for version in versions:
    # os.listdir returns names in arbitrary order; sort so chunk i of each version lines up
    for file in sorted(os.listdir(f"./{target_path}/version={version}")):
        if os.path.isfile(f"./{target_path}/version={version}/{file}"):
            if version == "1":
                version1_files.append(f"./{target_path}/version={version}/{file}")
            elif version == "2":
                version2_files.append(f"./{target_path}/version={version}/{file}")

# print the number of files in each version
print("Version 1 has", len(version1_files), "files")
print("Version 2 has", len(version2_files), "files")

Version 1 has 20 files
Version 2 has 21 files
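The two versions produce different numbers of chunks (20 and 21), so comparing files by list position leaves one version 2 chunk without a partner. A minimal sketch of pairing by the numeric suffix in the file name follows; `chunk_index`, `pair_chunks`, and the sample paths are illustrative names, and the sketch still assumes that chunks with the same index cover roughly the same part of the document.

```python
import os
import re


def chunk_index(path):
    """Return the numeric suffix of a chunk file such as 'name_07.txt'."""
    match = re.search(r"_(\d+)\.txt$", os.path.basename(path))
    if match is None:
        raise ValueError(f"no chunk index in {path!r}")
    return int(match.group(1))


def pair_chunks(files_a, files_b):
    """Pair files by chunk index; return (pairs, indices present in only one list)."""
    by_index_a = {chunk_index(p): p for p in files_a}
    by_index_b = {chunk_index(p): p for p in files_b}
    shared = sorted(by_index_a.keys() & by_index_b.keys())
    unpaired = sorted(by_index_a.keys() ^ by_index_b.keys())
    return [(by_index_a[i], by_index_b[i]) for i in shared], unpaired


v1 = ["v1/doc v1_01.txt", "v1/doc v1_00.txt"]
v2 = ["v2/doc v2_00.txt", "v2/doc v2_02.txt", "v2/doc v2_01.txt"]
pairs, unpaired = pair_chunks(v1, v2)
print(pairs)
print(unpaired)
# [2]
```

Unpaired indices can then be reported as added or removed chunks instead of being silently skipped.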


2. Analysis

Use GPT-4 to find the differences between the two documents

![WordDocument](../../images/WordDocuments_ReviewSampleImg.png)
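The prompts in this section pass a cosine similarity score for each pair of chunks and mention a review threshold of 0.9. How `utils.similarity` computes the score is not shown in this notebook; the sketch below assumes each chunk has already been turned into an embedding vector and uses made-up three-dimensional vectors for illustration.

```python
import math

REVIEW_THRESHOLD = 0.9  # threshold quoted in the system message


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def needs_review(similarity, threshold=REVIEW_THRESHOLD):
    """Flag a pair for review when its similarity falls below the threshold."""
    return similarity < threshold


# toy "embeddings", made up for illustration only
emb_v1 = [0.2, 0.1, 0.7]
emb_v2 = [0.2, 0.1, 0.6]
score = cosine_similarity(emb_v1, emb_v2)
print(round(score, 4), needs_review(score))
```

In the result rows shown further down, every pair scores above 0.99, including the pairs where text was added, so a 0.9 threshold on its own would not separate changed chunks from unchanged ones in this data set.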

In [9]:
# system_message = """
# You are a compliance reviewer in a security team. You are responsible for reviewing the compliance policies and providing insights. 

# ## Review process
# There are two versions of the documents with cosine similarity. 
# If the cosine similarity is less than 0.9, then the reviewer needs to review the differences.
# Version 1 has base line. 
# Version 2 has updated document. 

# ## Review guideline
# Make a markdown table to show the difference that is found during the process.
# The table includes Line, Versions, and the exact differences in bold.
# If there is no difference, then do not return a table and say "no difference is found"
# If two documents are completely different then return "Not able to compare"
# For example, if version 2 has completely different topic or topics than version 1, then return "Not able to compare"

# ## Response Example
# [Use a table to summarize the answers]
# |Item Number|Line|Version 1|Version 2|  
# |-|-|-|-|  
# |1|5|Access control can be implemented at different levels of the database, such as the schema, table, column, row, or cell level.|Access control can be implemented at different levels of the database, such as the schema, table, column, row, or cell level. **The view and the stored procedures also need to have controlled access.**|  
# """

Define Meta Prompt and Prompt

In [10]:
system_message = """
You are a compliance reviewer in a security team. You are responsible for reviewing the compliance policies and providing insights. 

## Review process
You are given two versions of a document together with their cosine similarity.
If the cosine similarity is less than 0.9, then the reviewer needs to review the differences.
Version 1 is the baseline.
Version 2 is the new document.
The two documents are similar and have overlapping sentences.
Find the overlapping sentences between the two documents.
Then compare the two documents starting from the overlapping sentences.

## Review guideline
### 1. Rule
Respond in JSON format wrapped in '```'.
In the 'note' field, you can add additional information about the difference: added, removed, or modified.

### 2. When the two versions are identical
Use 'No difference is found' in the note field.
Do not return 'N/A' in the version1 and version2 fields.

### 3. When the two versions are off the topic or completely different
Use "Not able to compare" in the note field.

### 4. When the two versions are similar but not identical
Return the differences in the version1 and version2 fields.

## Examples
### Example When the two versions are similar but not identical
The cosine similarity between the two documents is 0.993.

---version 1 --- version1 file name Databases v1_0.txt
Access control can be implemented at different levels of the database, such as the schema, table, column, row, or cell level.
------

---version 2 --- version2 file name Databases v2_0.txt
Access control can be implemented at different levels of the database, such as the schema, table, column, row, or cell level. The view and the stored procedures also need to have controlled access.
------

Your Answer:
```
{
    "similarity": 0.993,
    "version1_file_name":"Databases v1_0.txt",
    "version2_file_name":"Databases v2_0.txt",
    "version1":"**A** B C D E",
    "version2":"B **F G** E",
    "note":"Removed and replaced"
}
```
### Example When the two versions are identical
The cosine similarity between the two documents is 0.992.

---version 1 --- version1 file name Databases v1_0.txt
Access control can be implemented at different levels of the database.
------

---version 2 --- version2 file name Databases v2_0.txt
Access control can be implemented at different levels of the database.
------

Your Answer:
```
{
    "similarity": 0.992,
    "version1_file_name":"Databases v1_0.txt",
    "version2_file_name":"Databases v2_0.txt",
    "version1":"",
    "version2":"",
    "note":"No difference is found"
}
```
### Example When the two versions are off the topic or completely different
The cosine similarity between the two documents is 0.990.

---version 1 --- version1 file name Databases v1_0.txt
Access control can be implemented at different levels of the database.
------

---version 2 --- version2 file name Databases v2_0.txt
Americano is made with espresso and hot water.
------

Your Answer:
```
{
    "similarity": 0.990,
    "version1_file_name":"Databases v1_0.txt",
    "version2_file_name":"Databases v2_0.txt",
    "version1":"N/A",
    "version2":"N/A",
    "note":"Not able to compare"
}
```
"""

In [11]:
user_prompt = """
The cosine similarity between the two documents is {{similarity}}.

---version 1 --- version1 file name {{version1_file_name}}
{{version1}}
------

---version 2 --- version2 file name {{version2_file_name}}
{{version2}}
------

Your Answer:
"""

In [12]:
for i in range(0, len(version1_files)):
    print(version1_files[i], version2_files[i])

./contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_00.txt ./contents=chunks/doc=txt/version=2/Data Protection in Relational Databases v2_00.txt
./contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_01.txt ./contents=chunks/doc=txt/version=2/Data Protection in Relational Databases v2_01.txt
./contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_02.txt ./contents=chunks/doc=txt/version=2/Data Protection in Relational Databases v2_02.txt
./contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_03.txt ./contents=chunks/doc=txt/version=2/Data Protection in Relational Databases v2_03.txt
./contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_04.txt ./contents=chunks/doc=txt/version=2/Data Protection in Relational Databases v2_04.txt
./contents=chunks/doc=txt/version=1/Data Protection in Relational Databases v1_05.txt ./contents=chunks/doc=txt/version=2/Data Protection in
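The `user_prompt` template uses `{{name}}` placeholders. As an alternative to chaining `str.replace` calls, a single regular-expression pass fills every placeholder at once and does not re-scan text that was just substituted, so a chunk that happens to contain `{{version2}}` is left alone. This is a sketch only; `render` and the sample template are hypothetical.

```python
import re


def render(template, values):
    """Fill {{name}} placeholders in one pass; substituted text is not re-scanned."""

    def lookup(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value for placeholder {key!r}")
        return str(values[key])

    return re.sub(r"\{\{(\w+)\}\}", lookup, template)


template = "Similarity {{similarity}}\n{{version1}}\n{{version2}}"
filled = render(
    template,
    {
        "similarity": 0.993,
        "version1": "text that mentions {{version2}} literally",
        "version2": "B",
    },
)
print(filled)
```

A missing value raises `KeyError` instead of leaving an unfilled placeholder in the prompt.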

In [13]:
# utils.similarity(version1_files[0], version2_files[0])
analysis_results = []

for i in range(0, len(version1_files)):
    # get the similarity score
    similarity = utils.similarity(version1_files[i], version2_files[i])
    # get the text
    with open(version1_files[i], 'r') as f:
        version1 = f.read()
    with open(version2_files[i], 'r') as f:
        version2 = f.read()
    # generate the prompt
    updated_user_prompt = (
        user_prompt.replace("{{similarity}}", str(similarity))
        .replace("{{version1}}", version1)
        .replace("{{version2}}", version2)
        .replace("{{version1_file_name}}", os.path.basename(version1_files[i]))
        .replace("{{version2_file_name}}", os.path.basename(version2_files[i]))
    )
    # generate the response
    system_msg = {"role":"system","content":system_message}
    user_msg = {"role":"user","content":updated_user_prompt}
    prompt = [system_msg, user_msg]
    response = utils.run(prompt, temperature=0.0, max_tokens=1000, top_p=0.0)
    # analysis_results.append(response)
    analysis_results.append(json.loads(response.split("```")[1]))
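Splitting the response on the fence marker and taking element `[1]` works when the model returns exactly one plain fenced block. It may fail if the model adds a `json` language tag after the fence or omits the fence entirely. A more forgiving parser could look like the sketch below; `extract_json` is a hypothetical helper and the sample responses are made up.

```python
import json
import re

TICKS = "`" * 3  # built programmatically to avoid a literal fence inside this example
FENCED = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def extract_json(response):
    """Parse the first fenced block as JSON; fall back to the whole response."""
    match = FENCED.search(response)
    payload = match.group(1) if match else response
    return json.loads(payload.strip())


plain = TICKS + '\n{"note": "Added"}\n' + TICKS
tagged = "Here is my answer:\n" + TICKS + 'json\n{"note": "Added"}\n' + TICKS
bare = '{"note": "Added"}'
print(extract_json(plain), extract_json(tagged), extract_json(bare))
```

`json.loads` still raises `json.JSONDecodeError` on malformed output, so the calling loop may want to catch that and record the failing pair rather than stop.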

In [14]:
results = pd.DataFrame(analysis_results)
results

Unnamed: 0,similarity,version1_file_name,version2_file_name,version1,version2,note
0,0.994271,Data Protection in Relational Databases v1_00.txt,Data Protection in Relational Databases v2_00.txt,,,No difference is found
1,0.996183,Data Protection in Relational Databases v1_01.txt,Data Protection in Relational Databases v2_01.txt,,,No difference is found
2,0.995027,Data Protection in Relational Databases v1_02.txt,Data Protection in Relational Databases v2_02.txt,,,No difference is found
3,0.99545,Data Protection in Relational Databases v1_03.txt,Data Protection in Relational Databases v2_03.txt,that users or roles should only have the minim...,that users or roles should only have the minim...,Added information about controlling access to ...
4,0.994719,Data Protection in Relational Databases v1_04.txt,Data Protection in Relational Databases v2_04.txt,,,No difference is found
5,0.994354,Data Protection in Relational Databases v1_05.txt,Data Protection in Relational Databases v2_05.txt,,,No difference is found
6,0.995011,Data Protection in Relational Databases v1_06.txt,Data Protection in Relational Databases v2_06.txt,,,No difference is found
7,0.994964,Data Protection in Relational Databases v1_07.txt,Data Protection in Relational Databases v2_07.txt,,,No difference is found
8,0.99548,Data Protection in Relational Databases v1_08.txt,Data Protection in Relational Databases v2_08.txt,,Personal Identifiable Information (PII) must b...,Added
9,0.995237,Data Protection in Relational Databases v1_09.txt,Data Protection in Relational Databases v2_09.txt,,We take snapshots of the database every 15 min...,Added


In [15]:
pd.set_option("display.max_colwidth", None)  # show full cell contents
display(results[["version1","version2"]])
pd.set_option("display.max_colwidth", 50)  # restore the default width of 50

Unnamed: 0,version1,version2
0,,
1,,
2,,
3,"that users or roles should only have the minimum level of access required to perform their tasks. Access control can be implemented at different levels of the database, such as the schema, table, column, row, or cell level.","that users or roles should only have the minimum level of access required to perform their tasks. Access control can be implemented at different levels of the database, such as the schema, table, column, row, or cell level. **The view and the stored procedures also need to have controlled access.**"
4,,
5,,
6,,
7,,
8,,Personal Identifiable Information (PII) must be deleted in 7 days when a user requests opt-out from our service.
9,,"We take snapshots of the database every 15 minutes, and incremental backups of the database are taken daily. A full backup of the database is performed once a month."
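Most rows report no difference, so it can help to separate the rows that need attention before the insights step. The sketch below works on a list of dictionaries shaped like `analysis_results`; the function name `group_by_status` and the sample rows are illustrative, with the note strings taken from the system message.

```python
from collections import defaultdict


def group_by_status(rows):
    """Group result rows into unchanged, manual_review, and changed."""
    groups = defaultdict(list)
    for row in rows:
        note = row.get("note", "")
        if note == "No difference is found":
            groups["unchanged"].append(row)
        elif note == "Not able to compare":
            groups["manual_review"].append(row)
        else:
            groups["changed"].append(row)
    return dict(groups)


# illustrative rows, not the full output of the notebook
sample_rows = [
    {"version1": "", "version2": "", "note": "No difference is found"},
    {"version1": "", "version2": "PII must be deleted in 7 days.", "note": "Added"},
    {"version1": "N/A", "version2": "N/A", "note": "Not able to compare"},
]
groups = group_by_status(sample_rows)
print({status: len(items) for status, items in groups.items()})
# {'unchanged': 1, 'changed': 1, 'manual_review': 1}
```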


In [16]:
system_message_insights="""
You are a compliance reviewer in a security team. You are responsible for reviewing the compliance policies and providing insights.

## Review process
### 1. Find differences
Identify differences. Show what is added, removed or modified as a Markdown table.

### 2. Extract differences
If there are changes regarding specification, business or technical requirements, then extract them as key phrases.
Extract key phrases of changes and rephrase them as clear terms to describe the changes.

### 3. Provide insights
Provide additional explanation around the changes.
What should be done to comply with the changes?
What features should be turned on or off?


## Response
Follow the 'Review process' and reorganize into the categories 'Access control', 'Encryption' and 'Audit'. Summarize using bullet points.
Use Markdown to format the response.

## Access control
### Differences
* Version1:
 - Focuses on recording access, modification, deletion of data, and execution of queries or commands.
 - Does not specify the duration for which logs should be kept.
* Version2:
 - Includes authentication and authorization in the list of activities to be logged.
 - Specifies that logs must be kept for 24 months.

### Key Phrases
* keyword

### Insights
* Actionable items 
 - Step by step guide to apply the changes or rules

 
## Encryption
### Differences
* Version1:
* Version2:

### Key Phrases
* keyword

### Insights
* Actionable items 
 - Step by step guide to apply the changes or rules


## Audit
### Differences
* Version1:
* Version2:

### Key Phrases
* keyword

### Insights
* Actionable items 
 - Step by step guide to apply the changes or rules
"""

In [17]:
user_prompt_insights="""
--- version 1 ---
{{version1}}
------

--- version 2 ---
{{version2}}
------
"""

In [18]:
diff_ver1 = ""
diff_ver2 = ""
for _, row in results.iterrows():
    diff_ver1 += row['version1'] + "\n"
    diff_ver2 += row['version2'] + "\n"

system_msg_insights = {"role":"system","content":system_message_insights}
user_msg_insights = {"role":"user","content":user_prompt_insights.replace("{{version1}}", diff_ver1).replace("{{version2}}", diff_ver2)}
prompt_insights = [system_msg_insights, user_msg_insights]
res_insights = utils.run(prompt_insights, temperature=0.0, max_tokens=2500, top_p=0.0)
display(Markdown(res_insights))

## Access control
### Differences
| Version1 | Version2 |
| --- | --- |
| N/A | The view and the stored procedures also need to have controlled access. |

### Key Phrases
* Controlled access to views and stored procedures

### Insights
* Actionable items 
 - Implement access control on views and stored procedures in addition to the existing levels (schema, table, column, row, or cell level).

## Encryption
### Differences
| Version1 | Version2 |
| --- | --- |
| Use algorithms such as AES, RSA, or ECC | Use algorithms such as AES256, RSA, or ECC |
| N/A | PII must be encrypted. |

### Key Phrases
* Use AES256 for encryption
* Encrypt PII

### Insights
* Actionable items 
 - Upgrade encryption algorithm to AES256.
 - Ensure all PII data is encrypted.

## Audit
### Differences
| Version1 | Version2 |
| --- | --- |
| the access, modification, or deletion of the data | the authentication, authorization, access, modification, or deletion of the data |
| N/A | The log must be kept for 24 months and make sure the logs are stored in an immutable storage to prevent alteration. |
| Use a scalable and efficient audit system to manage and analyze the audit logs. | Use an efficient audit system to manage and analyze the audit logs. |
| Use a proactive and reactive audit strategy to review and act on the audit logs. | N/A |

### Key Phrases
* Log authentication and authorization activities
* Keep logs for 24 months
* Store logs in immutable storage

### Insights
* Actionable items 
 - Extend logging to include authentication and authorization activities.
 - Retain logs for a minimum of 24 months.
 - Store logs in an immutable storage to prevent alteration.

In [19]:
# for item in results.iterrows():    
#     system_msg_insights = {"role":"system","content":system_message_insights}
#     user_msg_insights = {"role":"user","content":user_prompt_insights.replace("{{version1}}", item[1]['version1']).replace("{{version2}}", item[1]['version2'])}
#     prompt_insights = [system_msg_insights, user_msg_insights]
#     res_insights = utils.run(prompt_insights, temperature=0.0, max_tokens=2500, top_p=0.0)
#     display(Markdown(res_insights))

3. Automate

Based on the update, generate new guidance

In [23]:
system_message_automate="""
You are a compliance policy maker in a security team.

## Review insights
### 1. Identify the goal of the changes
Identify differences. Show what is added, removed or modified as a Markdown table.

### 2. Use the insights to share updated rules
Provide additional explanation around the changes.
What should be done to comply with the changes?
What features should be turned on or off?

## Response
Generate updated guidance based on the insights
Quote and cite the 'Insights', then generate a new policy.
Use Markdown to format the response.

## Summary of the changes
[Provide insightful summary of the changes]

## Access Control
### Summary
[Summary of the changes in Access Control]
### How to apply
[Provide step by step guide how to apply new, update or modified guides]
### How to collect evidence
[Describe how to collect evidence]

## Encryption
### Summary
[Summary of the changes in Encryption]
### How to apply
[Provide step by step guide how to apply new, update or modified guides]
### How to collect evidence
[Describe how to collect evidence]

## Audit
### Summary
[Summary of the changes in Audit]
### How to apply
[Provide step by step guide how to apply new, update or modified guides]
### How to collect evidence
[Describe how to collect evidence]

"""

In [24]:
user_prompt_automate="""
--- Insights ---
{{results}}
------
"""

In [25]:
system_msg_automate = {"role":"system","content":system_message_automate}
user_msg_automate = {"role":"user","content":user_prompt_automate.replace("{{results}}", res_insights)}
prompt_automate = [system_msg_automate, user_msg_automate]
res_automate = utils.run(prompt_automate, temperature=1.0, max_tokens=2500, top_p=1.0)
display(Markdown(res_automate))

## Summary of the changes
The changes in the security policy addressed three main areas: Access control, encryption, and audit.

## Access Control
### Summary
The main update in the Access control policy is to extend control to views and stored procedures. 

### How to apply
To comply with the new policy, controlled access should be implemented on views and stored procedures. This can be done through the database management system or database service that is being used. Access control should be updated to include these levels and appropriately managed and monitored.

### How to collect evidence
Evidence of compliance can be demonstrated through database configurations and access control lists. Screen shots, system output, or other data showing that views and stored procedures are included in the access control policies will suffice.

## Encryption
### Summary
The updates in the Encryption policy are two-fold: use of AES256 algorithm for encryption and encryption of all PII data.

### How to apply
Upgrade your encryption methodologies to make use of AES256. This could mean updating the database encryption settings or application level encryption services. Additionally, ensure all Personally Identifiable Information (PII) is encrypted.

### How to collect evidence
Evidence can be collected by documenting the encryption mechanisms and settings being used by your applications and databases. Demonstrate that AES256 is being used and that PII data is encrypted.

## Audit
### Summary
The audit policy now requires logging of authentication and authorization activities, retention of logs for a minimum of 24 months, and storing logs in an immutable storage.

### How to apply
Modify your logging process to ensure it logs authentication and authorization activities. Implement a log retention policy to keep logs for at least 24 months. Store these logs in a secure, tamper-proof storage, such as write-once-read-many (WORM) devices or digital storage services such as Amazon S3 with object lock enabled.

### How to collect evidence
Evidence should include configuration files, settings and policies that indicate authentication and authorization activities are being logged, logs are kept for at least 24 months, and logs are stored in a tamper-proof method.
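This step runs with `temperature=1.0`, so the structure of the generated guidance may vary between runs. A small check that the headings requested in the system message are present can catch incomplete responses before they are shared. The sketch below is an assumption about how such a check could look; `missing_headings` and the sample text are hypothetical.

```python
REQUIRED_HEADINGS = (
    "## Summary of the changes",
    "## Access Control",
    "## Encryption",
    "## Audit",
)


def missing_headings(markdown, required=REQUIRED_HEADINGS):
    """Return the required headings that do not appear as a line in the response."""
    present = {line.strip() for line in markdown.splitlines() if line.startswith("#")}
    return [heading for heading in required if heading not in present]


sample_guidance = """## Summary of the changes
Three areas changed.

## Access Control
### Summary
Views and stored procedures are now covered.

## Audit
### Summary
Logs are kept for 24 months.
"""
print(missing_headings(sample_guidance))
# ['## Encryption']
```

An empty list means every required section is present; otherwise the request could be retried or flagged for manual review.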