<a href="https://colab.research.google.com/github/jamilu-as/itz-jameel/blob/master/2_Files_Vector_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Document Mapping

#### Libraries and Imports

In [1]:
import pandas as pd
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import torch

In [2]:
# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


### Load the data

In [3]:
file1 = pd.read_excel('SCF.xlsx')
file2 = pd.read_excel('aeIAS.xlsx')

### Display the first few rows to verify successful loading

In [4]:
print("File1 Data Sample:")
display(file1.head())

print("\nFile2 Data Sample:")
display(file2.head())

File1 Data Sample:


Unnamed: 0,SCF Domain,Title,SCF #,Description,Evidence Request List (ERL) #,Possible Solutions & Considerations\nMicro-Small Business (<10 staff)\nBLS Firm Size Classes 1-2,Possible Solutions & Considerations\nSmall Business (10-49 staff)\nBLS Firm Size Classes 3-4,Possible Solutions & Considerations\nMedium Business (50-249 staff)\nBLS Firm Size Classes 5-6,Possible Solutions & Considerations\nLarge Business (250-999 staff)\nBLS Firm Size Classes 7-8,"Possible Solutions & Considerations\nEnterprise (> 1,000 staff)\nBLS Firm Size Class 9",...,Saudi Arabia\nSACS-002,Saudi Arabia\nECC-12018,Saudi Arabia\nOTCC-1 2022,UAE\nDIFC Data Protection Law,UAE \nIAS,UAE \nNESA Control Name,UAE \nADSIC\nv2.0,UAE\nCBUAE Regulations,MITRE\nATT&CK,OWASP\nTop 10
0,Cybersecurity & Data Protection Governance,Cybersecurity & Data Protection Governance Pro...,GOV-01,facilitate the implementation of cybersecurity...,E-GOV-01\nE-GOV-02,∙ ComplianceForge - Cybersecurity & Data Prote...,∙ ComplianceForge - Cybersecurity & Data Prote...,∙ Steering committee\n∙ ComplianceForge - Digi...,∙ Steering committee\n∙ ComplianceForge - Digi...,∙ Steering committee\n∙ ComplianceForge - Digi...,...,TPC-25,1-2-1\n1-3-2,1-1,Sec 15\nSec 16,M1.4.1,RESOURCES,SG.1.9,,,
1,Cybersecurity & Data Protection Governance,Steering Committee & Program Oversight,GOV-01.1,"coordinate cybersecurity, data protection and ...",E-GOV-03,∙ Third-party advisors (subject matter experts),∙ Third-party advisors (subject matter experts),∙ Steering committee / advisory board,∙ Steering committee / advisory board,∙ Steering committee / advisory board,...,,,,,,,,,,
2,Cybersecurity & Data Protection Governance,Status Reporting To Governing Body,GOV-01.2,provide governance oversight reporting and rec...,E-CPL-05\nE-CPL-09\nE-GOV-03\nE-GOV-04\nE-GOV-...,∙ Quarterly Business Review (QBR),∙ Quarterly Business Review (QBR),∙ Quarterly Business Review (QBR),∙ Quarterly Business Review (QBR),∙ Quarterly Business Review (QBR),...,,,,,,,,,,
3,Cybersecurity & Data Protection Governance,Publishing Cybersecurity & Data Protection Doc...,GOV-02,"establish, maintain and disseminate cybersecur...",E-GOV-08\nE-GOV-09\nE-GOV-11,∙ ComplianceForge - Cybersecurity & Data Prote...,∙ ComplianceForge - Cybersecurity & Data Prote...,∙ ComplianceForge - Digital Security Program (...,∙ ComplianceForge - Digital Security Program (...,∙ ComplianceForge - Digital Security Program (...,...,TPC-25,1-3-1\n1-3-3,1-1\n1-1-1,,T9.1.1,INFORMATION SYSTEMS CONTINUITY MANAGEMENT POLICY,IC.1,,,
4,Cybersecurity & Data Protection Governance,Exception Management,GOV-02.1,"prohibit exceptions to standards, except when ...",,∙ Manual exception management process\n∙ SCFCo...,∙ Manual exception management process\n∙ Gover...,∙ Manual exception management process\n∙ Gover...,"∙ Governance, Risk & Compliance (GRC) solution...","∙ Governance, Risk & Compliance (GRC) solution...",...,,,,,,,,,,



File2 Data Sample:


Unnamed: 0,ID,Title,Description
0,M1,strategy and planning,To provide an adequate organizational environm...
1,M1.1,entity context and leadership,To establish leadership and a management frame...
2,M1.1.1,understanding the entity and its context,determine external and internal factors that ...
3,M1.1.1.1,understanding the entity and its context,determine interested parties that are relevan...
4,M1.1.1.2,understanding the entity and its context,determine the requirements of these intereste...


### Initialize the model and tokenizer

In [5]:
model_name = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)


### Function to generate embeddings for Title + Description

In [6]:
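def get_embedding(text):
    # Tokenize one text and move the tensors to the same device as the model
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one vector and return it as a CPU NumPy array
    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()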
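
Averaging `last_hidden_state` over all positions is fine when one text is embedded at a time, because a single sequence contains no padding. If the function were extended to embed several texts per call, padded positions would distort the average unless they are masked out. The sketch below illustrates masked mean pooling on plain NumPy arrays; the helper name `masked_mean_pool` and the toy data are hypothetical and only stand in for the model's token embeddings and the tokenizer's `attention_mask`.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden) array
    attention_mask:   (batch, seq_len) array, 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: two sequences, three token slots, hidden size 2.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask)
print(pooled)
```

The padded slot in the first sequence does not contribute to its average, which is the property a batched version of `get_embedding` would need.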

### Concatenate Title and Description for embedding generation

In [7]:
file1['text_combined'] = file1['Title'] + " " + file1['Description']
file2['text_combined'] = file2['Title'] + " " + file2['Description']
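
If either column contains empty cells, string concatenation yields `NaN` for that row, and the tokenizer would then receive a float instead of text. A minimal sketch of a more defensive concatenation follows; the helper name `combine_text` and the toy frame are hypothetical, and whether the real spreadsheets contain empty cells in these columns is an assumption.

```python
import pandas as pd

def combine_text(df, title_col="Title", desc_col="Description"):
    """Join two text columns with a space, treating missing cells as empty."""
    title = df[title_col].fillna("").astype(str)
    desc = df[desc_col].fillna("").astype(str)
    return (title + " " + desc).str.strip()

# Toy frame with one missing description and one missing title.
toy = pd.DataFrame({
    "Title": ["Exception Management", None],
    "Description": [None, "provide oversight"],
})

combined = combine_text(toy)
print(combined.tolist())
```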

In [8]:
print("File1 Data Sample with Combined Text:")
display(file1['text_combined'].head())

print("\nFile2 Data Sample with Combined Text:")
display(file2['text_combined'].head())

File1 Data Sample with Combined Text:


Unnamed: 0,text_combined
0,Cybersecurity & Data Protection Governance Pro...
1,Steering Committee & Program Oversight coordin...
2,Status Reporting To Governing Body provide gov...
3,Publishing Cybersecurity & Data Protection Doc...
4,Exception Management prohibit exceptions to st...



File2 Data Sample with Combined Text:


Unnamed: 0,text_combined
0,strategy and planning To provide an adequate o...
1,entity context and leadership To establish lea...
2,understanding the entity and its context dete...
3,understanding the entity and its context dete...
4,understanding the entity and its context dete...


### Generate embeddings for the combined text for each control

In [9]:
file1['embedding'] = file1['text_combined'].apply(lambda x: get_embedding(x))
file2['embedding'] = file2['text_combined'].apply(lambda x: get_embedding(x))


### Check that embeddings have the correct shape


In [10]:
for idx, emb in enumerate(file1['embedding']):
    print(f"File1 embedding at row {idx}: Shape {emb.shape}")
for idx, emb in enumerate(file2['embedding']):
    print(f"File2 embedding at row {idx}: Shape {emb.shape}")

File1 embedding at row 0: Shape (768,)
File1 embedding at row 1: Shape (768,)
File1 embedding at row 2: Shape (768,)
File1 embedding at row 3: Shape (768,)
File1 embedding at row 4: Shape (768,)
File1 embedding at row 5: Shape (768,)
File1 embedding at row 6: Shape (768,)
File1 embedding at row 7: Shape (768,)
File1 embedding at row 8: Shape (768,)
File1 embedding at row 9: Shape (768,)
File1 embedding at row 10: Shape (768,)
File1 embedding at row 11: Shape (768,)
File1 embedding at row 12: Shape (768,)
File1 embedding at row 13: Shape (768,)
File1 embedding at row 14: Shape (768,)
File1 embedding at row 15: Shape (768,)
File1 embedding at row 16: Shape (768,)
File1 embedding at row 17: Shape (768,)
File1 embedding at row 18: Shape (768,)
File1 embedding at row 19: Shape (768,)
File1 embedding at row 20: Shape (768,)
File1 embedding at row 21: Shape (768,)
File1 embedding at row 22: Shape (768,)
File1 embedding at row 23: Shape (768,)
File1 embedding at row 24: Shape (768,)
... (output truncated)
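
Printing one line per row becomes unwieldy for large files. As a possible alternative, the embeddings can be stacked into a single matrix and checked once. The sketch below uses a hypothetical helper, `stack_embeddings`, on toy vectors; the expected dimension of 768 is taken from the shapes printed above.

```python
import numpy as np

def stack_embeddings(embeddings, expected_dim):
    """Stack per-row vectors into one (n_rows, dim) matrix and validate it."""
    matrix = np.stack(list(embeddings))  # raises ValueError if shapes differ
    if matrix.ndim != 2 or matrix.shape[1] != expected_dim:
        raise ValueError(f"Unexpected embedding shape: {matrix.shape}")
    return matrix

# Toy stand-in for a column of embeddings such as file1['embedding'].
toy_embeddings = [np.zeros(768), np.ones(768)]
matrix = stack_embeddings(toy_embeddings, expected_dim=768)
print(matrix.shape)
```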

### Initialize columns for top 3 matches in file1


In [11]:
for i in range(1, 4):  # Top 3 matches
    file1[f'ID (Iso) - Match {i}'] = ""
    file1[f'Title (Iso) - Match {i}'] = ""
    file1[f'Similarity Score(Iso) - Match {i}'] = 0.0

### Find the top 3 matches for each control in file1

In [None]:
for idx1, row1 in file1.iterrows():
    similarities = []

    for idx2, row2 in file2.iterrows():
        sim_score = cosine_similarity([row1['embedding']], [row2['embedding']])[0][0]

        # Store each match with its similarity score, ID, and title
        similarities.append((sim_score, row2['ID'], row2['Title']))

    # Sort to get the top 3 matches
    top_matches = sorted(similarities, key=lambda x: x[0], reverse=True)[:3]

    # Append the top 3 matches to file1's new columns. This runs inside the
    # outer loop and uses the column names initialized earlier.
    for i, match in enumerate(top_matches):
        file1.at[idx1, f'ID (Iso) - Match {i + 1}'] = match[1]  # ID
        file1.at[idx1, f'Title (Iso) - Match {i + 1}'] = match[2]  # Title
        file1.at[idx1, f'Similarity Score(Iso) - Match {i + 1}'] = match[0]  # Similarity score
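
The nested loop calls `cosine_similarity` once per pair of rows, which can be slow for large files. One possible alternative is to compute the full similarity matrix in a single step and read the top matches per row from it. The sketch below does this with NumPy on toy vectors; the helper name `top_k_matches` is hypothetical, and it assumes no embedding is an all-zero vector.

```python
import numpy as np

def top_k_matches(query_matrix, candidate_matrix, k=3):
    """Return indices and cosine scores of the k best candidates per query row."""
    # Normalize rows to unit length so the dot product equals cosine similarity.
    q = query_matrix / np.linalg.norm(query_matrix, axis=1, keepdims=True)
    c = candidate_matrix / np.linalg.norm(candidate_matrix, axis=1, keepdims=True)
    sims = q @ c.T                              # shape: (n_queries, n_candidates)
    top_idx = np.argsort(-sims, axis=1)[:, :k]  # highest similarity first
    top_scores = np.take_along_axis(sims, top_idx, axis=1)
    return top_idx, top_scores

# Toy data: two query vectors and four candidate vectors.
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])

idx, scores = top_k_matches(queries, candidates, k=2)
print(idx)
print(scores)
```

In the notebook's setting, the two matrices would come from stacking the `embedding` columns of `file1` and `file2`, and the returned indices would be row positions in `file2` from which `ID` and `Title` can be looked up.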


### Drop intermediate columns and save the final output


In [None]:
file1.drop(columns=['text_combined', 'embedding'], inplace=True)
file1.to_excel('CSF-aeIASv6.xlsx', index=False)

In [None]:
output = pd.read_excel('CSF-aeIASv6.xlsx')
print("Output:")
display(output.head())