# Project Development

How do I start out with this project? Let's take the following approach:
- Build a minimal, standard RAG system using LlamaIndex, running on some public Monetary Policy Reports (MPRs) and similar documents.
- Spend time benchmarking and testing the system with different models, datasets, etc.
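
To make the benchmarking step a little more concrete, here is a minimal sketch of two common retrieval metrics, hit rate and mean reciprocal rank (MRR). Everything in it is an assumption for illustration: the `evaluate_retriever` helper, the `(query, expected_id)` evaluation pairs and the toy rankings are hypothetical and not part of this project yet.

```python
from typing import Callable, Iterable, Sequence


def evaluate_retriever(
    retrieve: Callable[[str], Sequence[str]],
    eval_set: Iterable[tuple[str, str]],
) -> dict[str, float]:
    """Compute hit rate and MRR over (query, expected_id) pairs.

    `retrieve` must return document ids ranked best first.
    """
    hits = 0
    reciprocal_ranks = 0.0
    total = 0
    for query, expected_id in eval_set:
        total += 1
        ranked_ids = list(retrieve(query))
        if expected_id in ranked_ids:
            hits += 1
            # Rank is 1-based, so the first position contributes 1.0.
            reciprocal_ranks += 1.0 / (ranked_ids.index(expected_id) + 1)
    if total == 0:
        return {"hit_rate": 0.0, "mrr": 0.0}
    return {"hit_rate": hits / total, "mrr": reciprocal_ranks / total}


# Hypothetical toy retriever: a fixed ranking per query, for illustration only.
toy_rankings = {
    "unemployment": ["doc_labour", "doc_inflation"],
    "inflation target": ["doc_labour", "doc_inflation"],
    "bank rate": ["doc_gdp"],
}
toy_eval_set = [
    ("unemployment", "doc_labour"),         # expected id found at rank 1
    ("inflation target", "doc_inflation"),  # expected id found at rank 2
    ("bank rate", "doc_rates"),             # expected id not retrieved
]
scores = evaluate_retriever(lambda query: toy_rankings[query], toy_eval_set)
print(scores)
```

Because the retriever is just a callable, the same function could compare different embedding models or chunk sizes by swapping in a different `retrieve`.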

### Prompts for copilot

In [1]:
import glob
import PyPDF2

from pathlib import PurePosixPath
from llama_index.core import SimpleDirectoryReader
from collections import Counter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, VectorStoreIndex



# Load data

In [2]:


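def extract_text_from_pdfs(pdf_files):
    """Concatenate the extracted text of every page of every PDF in pdf_files."""
    text = ''
    for file_path in pdf_files:
        with open(file_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            for page in pdf_reader.pages:
                # Pages are joined without a separator, so the last line of one
                # page runs straight into the first line of the next.
                text += page.extract_text()
    return text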


pdf_files = glob.glob('/Users/lukasalemu/Documents/00. Bank of England/00. Degree/Dissertation/structured-rag/data/01_raw/*.pdf')
text = extract_text_from_pdfs(pdf_files)
print(text)

Bank of England
Monetary Policy Report
Monetary Policy Committee 
February 2024Monetary policy at the Bank of England
The objectives of monetary policy
The Bank’s Monetary Policy Committee (MPC) sets monetary policy to keep inflation low and stable,
which supports growth and jobs. Subject to maintaining price stability , the MPC is also required to
support the Government’s economic policy.
The Government has set the MPC a target for the 12-month increase in the Consumer Prices Index
of 2%.
The 2% inflation target is symmetric and applies at all times.
The MPC’s remit recognises, however, that the actual inflation rate will depart from its target as a
result of shocks and disturbances, and that attempts to keep inflation at target in these circumstances
may cause undesirable volatility in output. In exceptional circumstances, the appropriate horizon for
returning inflation to target can vary. The MPC will communicate how and when it intends to return
inflation to the target.
The instrum
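
The output above contains "February 2024Monetary policy" because the page texts are concatenated with no separator. A minimal sketch of one way around this is to collect the page texts and join them with a newline. The `join_pages` helper is hypothetical, and it works on plain strings so that it can be run without a PDF.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]], separator: str = "\n") -> str:
    """Join per-page text with a separator, skipping pages that have no text."""
    return separator.join(
        text.strip() for text in page_texts if text and text.strip()
    )


# Toy stand-ins for the values returned by page.extract_text().
pages = [
    "Monetary Policy Committee \nFebruary 2024",
    None,  # a page with no extractable text
    "Monetary policy at the Bank of England",
]
print(join_pages(pages))
```

In `extract_text_from_pdfs`, this would correspond to appending each `page.extract_text()` result to a list and joining once at the end.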

In [3]:
# Let's just use the llama_index utility to save all the boilerplate...


pdf_path = PurePosixPath(r"/Users/lukasalemu/Documents/00. Bank of England/00. Degree/Dissertation/structured-rag/data/01_raw")

reader = SimpleDirectoryReader(input_dir=pdf_path)
documents = reader.load_data()

In [4]:
for snippet in documents:
    print(snippet.text)

Bank of England
Monetary Policy Report
Monetary Policy Committee 
February 2024
Monetary policy at the Bank of England
The objectives of monetary policy
The Bank’s Monetary Policy Committee (MPC) sets monetary policy to keep inflation low and stable,
which supports growth and jobs. Subject to maintaining price stability , the MPC is also required to
support the Government’s economic policy.
The Government has set the MPC a target for the 12-month increase in the Consumer Prices Index
of 2%.
The 2% inflation target is symmetric and applies at all times.
The MPC’s remit recognises, however, that the actual inflation rate will depart from its target as a
result of shocks and disturbances, and that attempts to keep inflation at target in these circumstances
may cause undesirable volatility in output. In exceptional circumstances, the appropriate horizon for
returning inflation to target can vary. The MPC will communicate how and when it intends to return
inflation to the target.
The instru

## With metadata added to the documents

In [5]:

# We can add metadata to our documents by passing a function to the file_metadata parameter of SimpleDirectoryReader.
# Here is an example implementation which we could use.

def get_metadata(file_path: str):
    """Get extra metadata from the pdf file

    Args:
        file_path (str): The file path

    Returns:
        dict: Returns a dictionary containing the metadata as key-value pairs
    """
    metadata = {}

    # Read the document and extract metadata as a dictionary
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)

        # Get the number of characters in the pdf
        text = ''
        for page in pdf_reader.pages:
            text += page.extract_text()
        metadata['num_characters'] = len(text)

        # Get the number of words in the pdf
        words = text.split()
        metadata['num_words'] = len(words)

        # Get the most common 5 words in the pdf
        word_counts = Counter(words)
        metadata['most_common_words'] = dict(word_counts.most_common(5))

    return metadata

reader = SimpleDirectoryReader(input_dir=pdf_path, file_metadata=get_metadata)

documents = reader.load_data()

In [6]:
documents[0].metadata

{'page_label': 'Cover',
 'file_name': '/Users/lukasalemu/Documents/00. Bank of England/00. Degree/Dissertation/structured-rag/data/01_raw/monetary-policy-report-february-2024.pdf',
 'num_characters': 193118,
 'num_words': 30984,
 'most_common_words': {'the': 1741,
  'of': 1019,
  'to': 979,
  'in': 862,
  'and': 700}}
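
The five most common words above are all stopwords ('the', 'of', 'to', 'in', 'and'), so this field says little about the document. A minimal sketch of a more informative count follows. The `most_common_content_words` helper and the tiny `STOPWORDS` set are assumptions for illustration; a real stopword list would be much longer.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a complete one.
STOPWORDS = {"the", "of", "to", "in", "and", "a", "is", "for", "that", "at"}


def most_common_content_words(text: str, n: int = 5) -> dict[str, int]:
    """Count the n most frequent words, ignoring case, punctuation and stopwords."""
    # Lowercase first, then keep alphabetic words (allowing inner ' ’ or -).
    words = re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return dict(counts.most_common(n))


sample = "The inflation target is 2%. Inflation in the UK is the focus of the MPC."
print(most_common_content_words(sample, n=2))
```

Inside `get_metadata`, the same function could replace the `Counter(words)` step, with `text` being the extracted PDF text.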

# Construct the index

In [7]:

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

In [8]:
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])


384
[-0.0032757113222032785, -0.011690814979374409, 0.04155921936035156, -0.038148146122694016, 0.024183066561818123]
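
Retrieval works by comparing the embedding of a query with the embeddings of the stored chunks, typically using cosine similarity. The sketch below shows that comparison on toy 3-dimensional vectors rather than the 384-dimensional ones printed above. The `cosine_similarity` and `top_k` helpers and the toy vectors are hypothetical, not LlamaIndex internals.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: dict[str, Sequence[float]],
    k: int = 2,
) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, made up for illustration.
toy_chunks = {"a": [2.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [1.0, 1.0, 0.0]}
ranked = top_k([1.0, 0.0, 0.0], toy_chunks, k=2)
print(ranked)
```

With the real model, the vectors would come from calls such as `embed_model.get_text_embedding(...)`.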


In [9]:
# Could use this to create a knowledge graph from the documents:
# https://docs.llamaindex.ai/en/stable/examples/index_structs/knowledge_graph/KnowledgeGraphDemo.html

Settings.embed_model = embed_model
Settings.chunk_size = 512

# if running for the first time, will download model weights first!
index = VectorStoreIndex.from_documents(documents)
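
`Settings.chunk_size = 512` caps the size of each chunk that is embedded and indexed. As a rough illustration of the idea only, the sketch below splits text into overlapping windows of whitespace-separated words. This is not LlamaIndex's splitter, which counts tokens and tries to respect sentence boundaries; the `chunk_words` helper and its `overlap` parameter are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size words, sharing overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


toy_text = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(toy_text, chunk_size=4, overlap=1):
    print(chunk)
```

Smaller chunks give more precise matches but less context per retrieved passage, which makes chunk size a natural parameter to vary when benchmarking.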

In [10]:
type(index)

llama_index.core.indices.vector_store.base.VectorStoreIndex

In [11]:
index_path = PurePosixPath(r"/Users/lukasalemu/Documents/00. Bank of England/00. Degree/Dissertation/structured-rag/data/02_processed")
# index.save(index_path)

AttributeError: 'VectorStoreIndex' object has no attribute 'save'
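
The `AttributeError` above arises because `VectorStoreIndex` has no `save` method. In `llama_index.core`, an index is persisted through its storage context and reloaded with `StorageContext.from_defaults` and `load_index_from_storage`. The sketch below wraps those calls; the `persist_index` and `load_index` helper names are hypothetical, and the `llama_index` import is deferred so the helpers can be defined without the library installed.

```python
from pathlib import Path


def persist_index(index, persist_dir) -> Path:
    """Write the index's stores to persist_dir, creating the directory if needed."""
    persist_path = Path(persist_dir)
    persist_path.mkdir(parents=True, exist_ok=True)
    index.storage_context.persist(persist_dir=str(persist_path))
    return persist_path


def load_index(persist_dir):
    """Rebuild an index from a directory written by persist_index."""
    # Deferred import: only needed when an index is actually reloaded.
    from llama_index.core import StorageContext, load_index_from_storage

    storage_context = StorageContext.from_defaults(persist_dir=str(persist_dir))
    return load_index_from_storage(storage_context)
```

Usage would be `persist_index(index, index_path)` and later `index = load_index(index_path)`. Before querying a reloaded index, `Settings.embed_model` should presumably be set to the same embedding model that built it, so that query and chunk embeddings are comparable.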

# Testing

In [12]:
# Now we can retrieve the chunks most similar to a given query
query_text = "What is the state of unemployment"

results = index.as_retriever().retrieve(query_text)

print(results[0].text)
print(results[0].metadata)

The ONS’s alternative experimental statistic for the unemployment rate remained flat at 4.2% in the
three months to November, having increased by 0.7 percentage points from its trough in 2022.
Recent increases in unemployment appear to have been attributable largely to higher flows from
inactivity to unemployment, as some people have begun to look for work. This will not necessarily
have been picked up in business surveys and is unlikely to have af fected benefit claims materially. As
a result, there are limits to which these indicators can be drawn upon to track unemployment
(Broadbent (2023) ).
Estimates of the medium-term equilibrium unemployment rate have also been drifting up (Section 3).
This implies that not all of the increase in unemployment observed so far has been associated with a
looser labour market.
A range of evidence nevertheless points to labour market tightness continuing to ease to some
degree. One of the key indicators of labour market tightness, the ONS vacancies 
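
Only the first result is printed above. To judge retrieval quality it helps to see every result with its similarity score. A minimal sketch follows, assuming each result exposes `.score`, `.text` and `.metadata` as LlamaIndex's retrieved nodes do. The `summarise_results` helper is hypothetical, and the stand-in objects and their scores are made up so the example runs without an index.

```python
from types import SimpleNamespace


def summarise_results(results, max_chars: int = 80) -> list[str]:
    """Return one line per result: rank, score, page label and a text snippet."""
    lines = []
    for rank, result in enumerate(results, start=1):
        score = "n/a" if result.score is None else f"{result.score:.3f}"
        page = result.metadata.get("page_label", "?")
        # Collapse whitespace so each snippet stays on a single line.
        snippet = " ".join(result.text.split())[:max_chars]
        lines.append(f"{rank}. score={score} page={page} {snippet}")
    return lines


# Stand-in objects with made-up values, mimicking the attributes used above.
fake_results = [
    SimpleNamespace(score=0.5, text="Unemployment rate\nremained flat", metadata={"page_label": "10"}),
    SimpleNamespace(score=None, text="Labour market tightness", metadata={}),
]
for line in summarise_results(fake_results):
    print(line)
```

With the real index, this could be called as `summarise_results(index.as_retriever(similarity_top_k=5).retrieve(query_text))` to look at more than the default number of results.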