# 1. RAG with ITU's GCI report

**Code Reference:** https://colab.research.google.com/drive/1IVQkSGwS5kdTiKBwz85PO6vg_WaNx15c?usp=sharing

The GCI report is a PDF file containing a large amount of text and complex structures such as tables and graphs.

Parsing and processing the report with the LlamaIndex framework lets us perform Retrieval-Augmented Generation (RAG) over its contents.

OpenAI's "text-embedding-ada-002" is used as the embedding model to convert text into numerical representations.

The Llama3 model is used as the LLM to respond to prompts and queries.
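
To make the retrieval step concrete, here is a minimal sketch of how text can be turned into vectors and matched against a query. It is only an illustration: the `embed` function is a toy bag-of-words stand-in (an assumption, not the embedding model used in this notebook), and the three chunks are made-up sentences rather than text from the report.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

# Hypothetical chunks, written for this example only.
chunks = [
    "the index measures commitment across five pillars",
    "each country profile lists areas of relative strength",
    "tables list the global score and global rank of each country",
]

question = "global rank of a country"
context = retrieve(question, chunks, top_k=1)

# The retrieved chunk is placed in the prompt that would be sent to the LLM.
prompt = f"Context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```

A real embedding model replaces the word counts with dense vectors that capture meaning, but the retrieve-then-prompt flow is the same.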

In [1]:
import nest_asyncio
import os
from dotenv import load_dotenv

load_dotenv()
nest_asyncio.apply()

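# load_dotenv() has already copied the .env values into os.environ.
# Fail early with a clear message if a required key is missing
# (assigning None to os.environ would raise a TypeError instead).
for key in ("LLAMA_CLOUD_API_KEY", "OPENAI_API_KEY"):
    if key not in os.environ:
        raise RuntimeError(f"Missing required environment variable: {key}")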

In [2]:
from llama_parse import LlamaParse

pillars = "legal measures, technical measures, organizational measures, capacity development measures, and cooperative measures"

def parse_data(pdf_filepaths):    
    
    instruction = f"""
    The provided document is a report on Global Cybersecurity Index 2020, which contained text, graphics and tables.  
    Section 3.1 has a few tables representing Global scores and Global ranking of countries.  Each row of these tables contains a country name and its global score and global rank.
    Section 3.2 has a few tables representing Overall scores and Regional ranking of countries.  Each row of these tables contains a country name and its overall score and regional rank.
    Section 4 contains country profiles. Each country profile contains a country name first, then followed by
    'Area(s) of Relative Strength', which states one or many out of: {pillars}, then followed by
    'Area(s) of Potential Growth', which states one or many out of {pillars}.
    Your job is to index and organize the information appropriately.
    """
    
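    # NOTE: this overrides the instruction above, so the parser runs without
    # custom instructions. Remove this line to use the detailed instruction.
    instruction = ""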
    parser = LlamaParse(
        result_type="markdown",
        verbose=True,
        language="en",
        num_workers=1,
        parsing_instruction=instruction
    )
    
    documents = parser.load_data(pdf_filepaths)

    return documents

In [3]:
from llama_index.core import Settings
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.embeddings.ollama import OllamaEmbedding


ollama_llm = Ollama(model="llama3")
Settings.llm = ollama_llm
Settings.embed_model = HuggingFaceEmbedding(
    model_name="mixedbread-ai/mxbai-embed-large-v1"
)

# Settings.embed_model = HuggingFaceEmbedding(
#     model_name="BAAI/bge-small-en-v1.5"
# )

azure_llm = AzureOpenAI(
    model=os.environ.get("AZURE_OPENAI_MODEL"),
    # deployment_name=os.environ.get("AZURE_OPENAI_DEPLOYMENT_GPT35TURBO"),
    # api_key=os.environ.get("AZURE_OPENAI_KEY"),
    # azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    # deployment_name=os.environ.get("AZURE_OPENAI_DEPLOYMENT_GPT4O"),
    # api_key=os.environ.get("AZURE_OPENAI_KEY_GPT4O"),
    # azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT_GPT4O"),
    deployment_name=os.environ.get("AZURE_OPENAI_DEPLOYMENT_GPT4"),
    api_key=os.environ.get("AZURE_OPENAI_KEY"),
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_version=os.environ.get("AZURE_OPENAI_VERSION"),
)
# Settings.llm = azure_llm


# Settings.embed_model = AzureOpenAIEmbedding(
#     model=os.environ.get("AZURE_OPENAI_EMBEDDING_MODEL"),
#     deployment_name=os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT"),
#     api_key=os.environ.get("AZURE_OPENAI_KEY"),
#     azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
#     api_version=os.environ.get("AZURE_OPENAI_VERSION"),
# )



In [4]:
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.core import VectorStoreIndex
import pickle

def store_data(base_nodes, objects, recursive_index, filename):
    with open(filename, "wb") as f:
        pickle.dump({'base_nodes': base_nodes, 'objects': objects, 'recursive_index': recursive_index}, f)
        
def load_data(filename):
    with open(filename, "rb") as f:
        data = pickle.load(f)
    return data['base_nodes'], data['objects'], data['recursive_index']
    
def check_and_parse(pdf_filepaths, filename, model):
    """Check if parsed data exists, parse if not, and load otherwise."""
    if os.path.exists(filename):
        print(f"{filename} already exists. Existing parsed data is used.")
        return load_data(filename)
    else:
        documents = parse_data(pdf_filepaths)
        node_parser = MarkdownElementNodeParser(llm=model, num_workers=1)
        nodes = node_parser.get_nodes_from_documents(documents=documents)
        base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
        recursive_index = VectorStoreIndex(nodes=base_nodes+objects)
        store_data(base_nodes, objects, recursive_index, filename)
        return base_nodes, objects, recursive_index
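
The load-or-parse logic above is an instance of a general "compute once, then reuse from disk" pattern. The sketch below isolates that pattern with plain Python objects so it can be run without LlamaIndex; `load_or_compute` and `expensive_parse` are hypothetical names introduced for this example.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the cached result if cache_path exists, else compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive_parse():
    """Hypothetical stand-in for the parsing and indexing step."""
    calls.append(1)
    return {"base_nodes": ["n1", "n2"], "objects": ["table1"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "parsed.pkl")
    first = load_or_compute(path, expensive_parse)   # computes and writes the cache
    second = load_or_compute(path, expensive_parse)  # reads the cache, no recompute

print(len(calls))
```

As in `check_and_parse`, the cache is keyed only by the filename, so a changed PDF is not re-parsed until the pickle file is deleted or renamed.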

In [5]:
# Using the function to either load or parse the PDF
pdf_filepaths=["./ITU.pdf"]
storage_filename = "ITU-20240624.pkl"
base_nodes, objects, recursive_index = check_and_parse(pdf_filepaths, storage_filename, ollama_llm)

ITU-20240624.pkl already exists. Existing parsed data is used.


In [6]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

reranker = FlagEmbeddingReranker(
    top_n=10,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
    verbose=True
)
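
The query engine above retrieves candidates by embedding similarity and then passes them through a reranker. The following sketch shows that two-stage idea with made-up scores; `retrieve_then_rerank` and both score tables are assumptions for illustration, not LlamaIndex internals.

```python
def retrieve_then_rerank(candidates, first_stage, second_stage, similarity_top_k, top_n):
    """Shortlist by a cheap score, then reorder the shortlist by a more precise score."""
    shortlist = sorted(candidates, key=first_stage, reverse=True)[:similarity_top_k]
    return sorted(shortlist, key=second_stage, reverse=True)[:top_n]

# Hypothetical scores: 'cheap' plays the role of embedding similarity,
# 'precise' plays the role of the reranker score.
cheap = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
precise = {"a": 0.2, "b": 0.95, "c": 0.6, "d": 0.99}

same_size = retrieve_then_rerank(cheap, cheap.get, precise.get, similarity_top_k=3, top_n=3)
narrowed = retrieve_then_rerank(cheap, cheap.get, precise.get, similarity_top_k=3, top_n=1)

print(same_size)
print(narrowed)
```

Two things follow from this structure. When `top_n` equals `similarity_top_k`, as in the cell above, the reranker only reorders the retrieved nodes and drops none. And a node that the first stage misses (`"d"` here) can never be recovered by the reranker, however well it would have scored.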

The QA can be carried out using a local LLM, which saves costs.

A fine-tuned Llama3 model called "llama3-chatqa" is used.

In [7]:
# ollama_llm = Ollama(model="llama3-chatqa")
Settings.llm = azure_llm

country = "United States"

queries = ["Provide a brief introduction to Global Cybersecurity Index (GCI).", 
           "Summarise the five pillars of GCI.",
           f"What is the global score and global rank of {country}?", 
           f"What is the regional rank of {country}?", 
           f"What is the Development Level of {country}?", 
           f"What is the Area(s) of Relative Strength of {country}?", 
           f"What is the Area(s) of Potential Growth of {country}?",
           f"What is {country}'s overall score?",
           f"What are {country}'s scores in {pillars} respectively?"]

In [8]:
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
import logging

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

responses = []
# for query in queries:
#     response = recursive_query_engine.query(query)
#     responses.append(response)
#     # time.sleep(5)

def process_query(query):
    try:
        # time.sleep(15)
        response = recursive_query_engine.query(query)
        return response.response
    except Exception as e:
        logger.error(f"Error querying '{query}': {e}")
        return None

with ThreadPoolExecutor(max_workers=len(queries)) as executor:
    future_to_query = {
        executor.submit(process_query, query): query for query in queries
    }
    for future in as_completed(future_to_query):
        query = future_to_query[future]
        try:
            result = future.result()
            if result:
                responses.append(result)
        except Exception as e:
            logger.error(f"Error processing query '{query}': {e}")

for response in responses:
    print(response)

INFO:httpx:HTTP Request: POST https://karlopenai.openai.azure.com//openai/deployments/embedding/embeddings?api-version=2024-03-01-preview "HTTP/1.1 200 OK"

Retrieval entering 3d58c19c-7b73-4839-a328-86b77aef45e0: TextNode
Retrieving from object TextNode with query Provide a brief introduction to Global Cybersecurity Index (GCI).
Retrieval entering cfbaedaa-b856-46be-8aa2-44486ea36885: TextNode
Retrieving from object TextNode with query What is the Development Level of United States?
Retrieval entering 26840133-263b-4fa7-a073-b23b3107dc59: TextNode
Retrieving from object TextNode with query What is the Development Level of United States?
Retrieval entering 92090a89-b8a0-4e0c-8b88-54daa0ced9c5: TextNode
Retrieving from object TextNode with query What is the Development Level of United States?
Retrieval entering 0b198ee8-9ea9-4dee-8527-a71424afbb83: TextNode
Retrieving from object TextNode 

INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Legal.
Area(s) of Potential Growth: Organizational Measures
According to the provided context information, the table "Measures Overview" shows the following scores for the United States:

- Legal Measures: Score related to legal measures (not explicitly mentioned but can be inferred from other tables)
- Cooperative Measures: Score related to cooperative measures = 20.00
- Technical Measures: Score related to technical measures = 0.0
- Organizational Measures: Score related to organizational measures = 0.0
- Capacity Development: Score related to capacity development = 0.0

So, the United States's scores in legal measures, technical measures, organizational measures, capacity development measures, and cooperative measures are respectively:

20.00, 0.0, 0.0, 0.0, 20.00
According to the provided table, the Development Level of the United States is a Developed Country.
According to the given context information, the overall score of the United States of America is 100.0.
Based on the provi
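
Because the cell above collects results with `as_completed`, the answers are printed in completion order rather than query order, and failed or empty results are dropped, so an answer cannot be matched back to its question from the output alone. A minimal sketch of one way to keep them paired is shown below; `fake_query` is a hypothetical stand-in for `recursive_query_engine.query`, and the delays are invented to force out-of-order completion.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_query(query: str) -> str:
    """Hypothetical stand-in for a query engine call; the first query is slowest."""
    time.sleep(0.05 if query.startswith("slow") else 0.0)
    return f"answer to {query}"

demo_queries = ["slow question", "fast question"]

# executor.map yields results in the order of its inputs,
# regardless of which task finishes first.
with ThreadPoolExecutor(max_workers=len(demo_queries)) as executor:
    answers = list(executor.map(fake_query, demo_queries))

paired = dict(zip(demo_queries, answers))
for question, answer in paired.items():
    print(f"{question} -> {answer}")
```

An alternative that keeps `as_completed` is to store each result in a dictionary keyed by the query taken from `future_to_query`.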

# 2. RAG with NCSI website

The NCSI website provides the scores and ranks of countries.

Using an LLM, we can easily query useful data from the website. Again, "llama3-chatqa" is used here.

In [28]:
from crewai_tools import ScrapeWebsiteTool
from langchain_community.llms import Ollama as lc_Ollama
from langchain_core.messages import HumanMessage, SystemMessage

llm = lc_Ollama(model="llama3-chatqa")

# Initialize the tool with the website URL, so the agent can only scrape the content of the specified website
scrape_tool = ScrapeWebsiteTool(website_url='https://ncsi.ega.ee/ncsi-index/')

ncsi = scrape_tool.run()

def ncsi_table(queries):
    prompt = f'''You are given scraped content from the National Cyber Security Index (NCSI) website.
        The webpage presents a table containing data: Rank, Country, National Cyber Security Index, Digital Development Level, Difference
        'Rank' refers to the country's rank in the index.
        'National Cyber Security Index' refers to the country's index score.
        'Digital Development Level' refers to the country's score in Digital Development Level.
        'Difference' is simply the difference between the scores in 'National Cyber Security Index' and 'Digital Development Level'.
        Answer the question by referring to the scraped content as follows: \n {ncsi}'''
    responses = []
    # for query in queries:
    #     response = azure_gpt35turbo.invoke(
    #         [
    #             SystemMessage(content=prompt),
    #             HumanMessage(content=query),
    #         ])
    #     responses.append(response.content)
    # return responses            
    for query in queries:
        response = llm.invoke(
            [
                SystemMessage(content=prompt),
                HumanMessage(content=query),
            ]
        )
        responses.append(response)

    return responses     

Using Tool: Read website content


In [29]:
country = 'China'

q1 = f"What is the rank of {country}?"
q2 = f"What is {country}'s National Cyber Security Index?"
q3 = "What is the rank of Australia?"
q4 = "What is Australia's National Cyber Security Index?"
q5 = f"Does Australia or {country} score higher in National Cyber Security Index and what is the difference?"
q6 = f"Is Australia or {country} ranked higher, and what is the difference in their ranks?"

queries = [q1, q2, q3, q4, q5, q6]

responses = ncsi_table(queries)

for response in responses:
    print(response)

 China's rank in the index is 26, according to the scraped data from NCSI.
 China's national cyber security index score is 60.
 Australia's rank according to the scraped content on NCSI website is 3
 Australia has a national cyber security index of 87.5 out of 100
 Australia scores a 87.5 compared to Chinas score of 60, therefore the difference between them is 27.5 points.
 Australia is ranked higher than China.
Australia has a rank of 3 compared to China's rank of 26, so they have a difference of 23 places.
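
Comparison questions such as q5 and q6 ask the LLM to do arithmetic, which is easy to double-check deterministically once the ranks and scores are known. The sketch below does that check; the numbers are copied from the model's answers printed above and have not been verified against the NCSI website, and `compare` is a hypothetical helper.

```python
# Values taken from the LLM responses above (not independently verified).
rows = {
    "China": {"rank": 26, "score": 60.0},
    "Australia": {"rank": 3, "score": 87.5},
}

def compare(a: str, b: str, table: dict) -> dict:
    """Compare two countries by score and rank (a lower rank number is better)."""
    return {
        "higher_score": a if table[a]["score"] > table[b]["score"] else b,
        "score_gap": abs(table[a]["score"] - table[b]["score"]),
        "higher_rank": a if table[a]["rank"] < table[b]["rank"] else b,
        "rank_gap": abs(table[a]["rank"] - table[b]["rank"]),
    }

result = compare("Australia", "China", rows)
print(result)
```

For these inputs the computed gaps agree with the differences stated in the responses to q5 and q6.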
