# Wiki - Central Bank Speeches

## Solution

We'll build a knowledge retrieval solution that embeds a corpus of knowledge (in our case a set of central bank speeches) and uses it to answer user questions.

- **Setup:** Initiate variables and connect to a vector database.
- **Storage:** Configure the database, prepare our data and store embeddings and metadata for retrieval.
- **Search:** Extract relevant documents back out with a basic search function and use an LLM to summarise results into a concise reply.
- **Answer:** Add a more sophisticated agent which will process the user's query and maintain a memory for follow-up questions.
- **Evaluate:** Take a sample of evaluated question/answer pairs from our service and plot them to scope out remedial action.

In [7]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Setup

Import libraries and set up a connection to a Redis vector database for our knowledge base.

You can substitute any other vectorstore or database for Redis. A [selection](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html) is supported natively by Langchain; for any others you will need to develop the connector yourself.

In [9]:
#!pip install redis
#!pip install openai
#!pip install tiktoken

In [2]:
from ast import literal_eval
import concurrent
import openai
import os
import pandas as pd
import numpy as np
from numpy import array, average
from tenacity import retry, wait_random_exponential, stop_after_attempt
import tiktoken
from tqdm import tqdm
from typing import List, Iterator

# Redis imports
from redis import Redis as r
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TextField,
    VectorField,
    NumericField
)
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)

# Langchain imports
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA

CHAT_MODEL = "gpt-3.5-turbo"

In [3]:
pd.set_option('display.max_colwidth', 0)

In [4]:
# load csv
cwd = os.getcwd()
df = pd.read_csv(cwd + '/CSV/us_2022.csv')

In [10]:
df.head(2)

Unnamed: 0,reference,country,date,title,author,is_gov,text
0,r220218a_FOMC,united states,18/2/22,Preparing for the Financial System of the Future,brainard,0,"The financial system is undergoing fast-moving changes associated with digitalization and decentralization. Some of these innovations hold considerable promise to reduce transaction costs and frictions, increase competition, and improve financial inclusion, but there are also potential risks. With technology driving profound change, it is important we prepare for the financial system of the future and not limit our thinking to the financial system of today. In recent years, there has been explosive growth in the development and adoption of new digital assets that leverage distributed ledger technologies and cryptography. The market capitalization of cryptocurrencies grew from less than $100 billion five years ago to a high of almost $3 trillion in November 2021 and is currently around $2 trillion. In parallel, we have seen rapid growth in the platforms that facilitate the crypto platforms facilitate a variety of activities, including lending, trading, and custodying crypto-assets, in some cases outside the traditional regulatory guardrails for investor and consumer protection, market integrity, and transparency. The growth in the crypto finance ecosystem is fueling demand for stablecoins-- digital assets that are intended to maintain stable value relative to reference assets, such as the U.S. dollar. Stablecoin supply grew nearly sixfold in 2021, from roughly $29 billion in January 2021 to $165 billion in January 2022. There is a high degree of concentration among a few dollar-pegged stablecoins: As of January 2022, the largest stablecoin by market capitalization made up almost half of the market, and the four largest stablecoins together made up almost 90 percent. Today, stablecoins are being used as collateral on DeFi and other crypto platforms, as well as in facilitating trading and monetization of cryptocurrency positions on and between crypto and other platforms. 
In the future, some issuers envision that stablecoins will also have an expanded reach in the payment system and be commonly used for everyday transactions, both domestic and cross-border. So it is important to have strong frameworks for the quality and sufficiency of reserves and risk management and governance. As noted in a recent report on stablecoins by the President's Working Group on Financial Markets, it is important to guard against run risk, whereby the prospect of an issuer not being able to promptly and adequately meet redemption requests for the stablecoin at par could result in a sudden surge in redemption demand. It is also important to address settlement risk, whereby funds settlement is not certain and final when expected, and systemic risk, whereby the failure or distress of a stablecoin provider could adversely affect the broader financial system. The prominence of crypto advertisements during the Super Bowl highlighted the growing engagement of retail investors in the crypto ecosystem. Research found that 16 percent of survey respondents reported having personally invested in, traded, or otherwise used a cryptocurrency--up from less than 1 percent of respondents in 2015. There is also rising interest among institutional investors. So it is perhaps not surprising that established financial intermediaries are undertaking efforts to expand the crypto services and products they offer. If the past year is any guide, the crypto financial system is likely to continue to grow and evolve in ways that increase interconnectedness with the traditional financial system. As a result, officials in many countries are undertaking efforts to understand and adapt to the transformation of the financial system. Many jurisdictions are making efforts to ensure statutory and regulatory frameworks apply like rules to like risks, and some jurisdictions are issuing or contemplating issuing central bank currency in digital form. 
Preparing for the Payment System of the Future The Federal Reserve needs to be preparing for the payment landscape of the future even as we continue to make improvements to meet today's needs. In light of the rapid digitalization of the financial system, the Federal Reserve has been thinking critically about whether there is a role for a potential U.S. central bank digital currency (CBDC) in the digital payment landscape of the future and about its potential properties, costs, and benefits. Our financial and payment system delivers important benefits today and is continuing to improve with developments like real-time payments. Nonetheless, certain challenges remain, such as a lack of access to digital banking and payment services for some Americans and expensive and slow cross-border payments. Growing interest in the digital financial ecosystem suggests that technology is enabling potential improvements that merit consideration. In addition, it is important to consider how new forms of crypto-assets and digital money may affect the Federal Reserve's responsibilities to maintain financial stability, a safe and efficient payment system, household and business access to safe central bank money, and maximum employment and price stability. It is prudent to explore whether there is a role for a CBDC to preserve some of the safe and effective elements of the financial system of the present in a way that is complementary to the private sector innovations transforming the financial landscape of the future. The public and private sector play important complementary roles within the over a century of experience working to improve the infrastructure of the U.S. payment system to provide a resilient and adaptable foundation for dynamic private sector activity. 
In parallel, private sector banks and nonbanks have competed to build the best possible products and services on top of that foundation and to meet the dollar- denominated needs of consumers and investors at home and around the world. The result is a resilient payment system that is responsive to the changing needs of businesses, consumers, and investors. While the official sector provides a stable currency, operates some important payment rails, and undertakes regulation and oversight of financial intermediaries and critical financial market infrastructures, the private sector brings competitive forces encouraging efficiency and new product offerings and driving innovation. Responsible innovation has the potential to increase financial inclusion and efficiency and to lower costs within guardrails that protect consumers and investors and safeguard financial stability. As we assess the range of future states of the financial system, it is prudent to consider how to preserve ready public access to government-issued, risk-free currency in the digital financial system--the digital equivalent of the Federal Reserve's issuance of physical currency. The Board recently issued a discussion paper that outlines the Federal Reserve's current thinking on the potential benefits, risks, and policy considerations of a The paper does not advance any specific policy outcome and does not signal that the Board will make any imminent decisions about the appropriateness of issuing a U.S. CBDC. It lays out four CBDC design principles that analysis to date suggests would best serve the needs of the United States if one were created. 
Those principles are that a potential CBDC should be privacy-protected, so consumer data and privacy are safeguarded; intermediated, such that financial intermediaries rather than the Federal Reserve interface directly with consumers; widely transferable, so the payment system is not fragmented; and identity-verified, so law enforcement can continue to combat money laundering and funding of terrorism. Given the Federal Reserve's mandate to promote financial stability, any consideration of a CBDC must include a robust evaluation of its impact on the stability of the financial system--not only as it exists today but also as it may evolve in the future. In consideration of the financial system today, it would be important to explore design features that would ensure complementarity with established financial intermediation. A CBDC--depending on its features--could be attractive as a store of value and means of payment to the extent it is seen as the safest form of money. This could make it attractive to risk-averse users, perhaps leading to increased demand for the CBDC at the expense of other intermediaries during times of stress. So it is important to undertake research regarding the tools and design features that could be introduced to limit such risks, such as offering a non-interest bearing CBDC and limiting the amount of CBDC an end user could hold or transfer. As I noted at the start, the digital asset and payment ecosystem is evolving at a rapid pace. Thus, it is also important to contemplate the potential role of a CBDC to promote financial stability in a future financial system in which a growing range of consumer payment and financial transactions would be conducted via digital currencies such as stablecoins. If current trends continue, the stablecoin market in the future could come to be dominated by just one or two issuers. 
Depending on the characteristics of these stablecoins, there could be large shifts in desired holdings between these stablecoins and deposits, leading to large-scale redemptions by risk-averse users at times of stress that could prove disruptive to financial stability. In such a future state, the coexistence of CBDC alongside stablecoins and commercial bank money could prove complementary, by providing a safe central bank liability in the digital financial ecosystem, much like cash currently coexists with commercial bank money. It is essential that policymakers, including the Federal Reserve, plan for the future of the payment system and consider the full range of possible options to bring forward the potential benefits of new technologies, while safeguarding stability. Analysis of the potential future state of the financial system is not limited to the domestic implications. The dollar is important to global financial markets: It is not only the predominant global reserve currency, but the dollar is also the most widely used currency in international payments. Decisions by other major jurisdictions to issue CBDCs could bring important changes to global financial markets that may prove more or less disruptive and that could influence the potential risks and benefits of a U.S. CBDC. Thus, it is wise to consider what the future states of global financial markets and transactions would look like both with and without a Federal Reserve-issued CBDC. For example, the People's Bank of China has been piloting the digital yuan, also known as e-CNY, in numerous Chinese cities over the past two years. The substantial early progress on the digital yuan may have implications for the evolution of cross-border payments and payment systems. And it may influence the development of norms and standards for cross-border digital financial transactions. It is prudent to consider how the potential absence or issuance of a U.S. 
CBDC could affect the use of the dollar in payments globally in future states where one or more major foreign currencies are issued in CBDC form. A U.S. CBDC may be one potential way to ensure that people around the world who use the dollar can continue to rely on the strength and safety of U.S. currency to transact and conduct business in the digital financial system. More broadly, it is important to consider how the United States can continue to play a lead role in the development of standards governing international digital financial transactions involving CBDCs consistent with norms such as privacy and security. Given the dollar's important role as a payment instrument across the world, it is essential that the United States be on the frontier of research and policy development regarding CBDC, as international developments related to CBDC can have implications for the global financial system. Given the range of possible future states with significant digitization of the financial system, it is important that the Federal Reserve is actively engaging with the underlying technologies. Our work to build 24x7x365 instant payments rails leverages lessons from some of today's most resilient, high-performing, and large-scale technology platforms across the globe. It is providing important insights on the clearing and settlement models associated with real time payments as well as on fraud, cyber resilience, cloud computing, and related technologies. In parallel with the Board's public consultation on CBDC, the Federal Reserve Bank of Boston, in collaboration with the Massachusetts Institute of Technology, has developed a theoretical high-performance transaction processor for CBDC. recently published the resulting software under an open-source license as a way of engaging with the broader technical community and promoting transparency and verifiability. 
Moreover, the Board is studying how innovations, such as distributed ledger technology, could improve the financial system. This work includes experimentation with stablecoin interoperability and testing of retail payments across multiple distributed payment ledger systems. The Federal Reserve Bank of New York recently established an Innovation Center, focused on validating, designing, building, and launching new financial technology products and services for the central bank community. These technology research and development initiatives are vital to our responsibilities to promote a safe and efficient payment system and financial stability, whatever the future may bring. The financial system is not standing still, and neither can we. The digital financial ecosystem is evolving rapidly and becoming increasingly connected with the traditional financial system. It is prudent for the Board to understand the evolving payment landscape, the technological advancements and consumer demands driving this evolution, and the consequent policy choices as it seeks to fulfill its congressionally- mandated role to promote a safe, efficient, and inclusive system for U.S. dollar transactions. To prepare for the financial system of the future, the Federal Reserve is engaging in research and experimentation with these new technologies and consulting closely with public and private sector partners."
1,r220221a_FOMC,united states,21/2/22,High Inflation and the Outlook for Monetary Policy,bowman,0,"Before we get to our conversation on community banking, I would like to briefly discuss my outlook for the U.S. economy and my view of appropriate monetary policy. As I see it, the main challenge for monetary policy now is to bring inflation down without harming the ongoing economic expansion. Inflation is much too high. Last year I noted that inflationary pressures associated with strong demand and constrained supply could take longer to subside than many expected. Since then, those problems have persisted and inflation has broadened, reaching the highest rate that Americans have faced in forty years. High inflation is a heavy burden for all Americans, but especially for those with limited means who are forced to pay more for everyday items, delay purchases, or put off saving for the future. I intend to support prompt and decisive action to lower inflation, and today I will explain how the Fed is pursuing this goal. In the near term, I expect that uncomfortably high inflation will persist at least through the first half of 2022. We may see signs of inflation easing in the second half of the year, but there is a substantial risk that high inflation could persist. In January, the Consumer Price Index rose to a 12-month rate of 7.5 percent, which, consistent with other recent monthly readings, was even higher than expected. Employment costs for businesses, as measured by average hourly wages, also rose last month. And continued tightness in the labor market indicates that upward pressure on wages and other employment compensation is not likely to moderate soon. My base case is that inflation will moderate later this year, which will depend, in wage growth lagging behind inflation for the past year, many families may find it challenging to make ends meet and continued rising home prices will likely prevent many from entering the housing market. 
In addition, rising costs and hiring difficulties continue to be burdens for small businesses. Turning to the labor market, which continues to tighten, indications are that the Omicron infection surge earlier this year has not left a negative imprint on the economy or slowed job creation. I expect to see continued strength in the job market this year, with further gains in employment, and my hope is that more Americans return to the labor force and find work. The strength in job creation is a big positive for those seeking employment and for their families. Even with the improving labor market, I still hear from businesses that qualified workers are difficult to find, and labor shortages remain a drag on hiring and on economic growth. Now let me turn to the implications of this outlook for monetary policy. In my view, conditions in the labor market have been and are currently consistent with the FOMC's goal of maximum employment, and as such, my focus has been on the persistently high inflation. In part, the high inflation reflects supply chain disruptions associated with the economic effects of the pandemic and efforts made to contain it. Unfortunately, monetary policy isn't well-suited to address supply issues. But strong demand and a very tight labor market have also contributed to inflation pressures, and the FOMC can help alleviate those pressures by removing the extraordinary monetary policy accommodation that is no longer needed. In our most recent monetary policy statement--which was released following our January meeting--we indicated that ""with inflation well above 2 percent and a strong labor market,"" we expected that it would ""soon be appropriate to raise the target range for the federal funds rate."" I fully supported that assessment, and the data we have seen since then have only increased the urgency to get on with the process of normalizing our interest rate stance and significantly reducing the size of the Federal Reserve's balance sheet. 
I support raising the federal funds rate at our next meeting in March and, if the economy evolves as I expect, additional rate increases will be appropriate in the coming months. I will be watching the data closely to judge the appropriate size of an increase at the March meeting. In early March, the FOMC will finally stop expanding the Federal Reserve's balance sheet. The resulting end of our pandemic asset purchases will remove another source of unneeded stimulus for the economy. In the coming months, we need to take the next step, which is to begin reducing the Fed's balance sheet by ceasing the reinvestment of maturing securities already held in the portfolio. Returning the balance sheet to an appropriate and manageable level will be an important additional step toward addressing high inflation. I expect that these steps will contribute to an easing in inflation pressures in the coming months, but further steps will likely be needed this year to tighten monetary policy. Looking beyond this spring, my views on the appropriate pace of interest rate increases and balance sheet reduction for this year and beyond will depend on how the economy evolves. I will be particularly focused on how much progress we make on bringing down inflation. My intent would be to take forceful action to help reduce inflation, bringing it back toward our 2 percent goal, while keeping the economy on track to continue creating jobs and economic opportunity for Americans. I appreciate the opportunity to share my views on monetary policy with you this morning. But since we are here to talk about community banking, let's get back to that important topic. Certainly prior to, but especially over the course of the pandemic, we have seen a heightened focus and urgency in incorporating technology and innovation into community banking. The adoption of technology and innovation is really at the heart of the major issues facing community banks. 
We see banks, fintech companies, and tech firms exploring various technologies to enhance their payments systems, expand consumer access, improve back-office operations, and create new financial products and services. This interest and the increasing interest in crypto- and digital assets have created a need to work together with the other federal banking agencies to give the industry better and more useful regulatory feedback as banks consider approaches to integrating crypto- and digital asset related activities into their service offerings. Given the popularity of these types of assets, and the growing interest of banks in participating in the market, it's increasingly necessary for regulators to be able to engage with the industry on these issues. Evolving financial services, a sharper focus on efficiency and timeliness in the industry, and the rapid increase in technology advances have also led the Federal Reserve to explore the potential benefits and risks of a central bank digital currency (CBDC). We recently issued a discussion paper as a first step in fostering a broad and transparent public dialogue about CBDCs. The paper is not intended to advance any specific policy outcome and no decisions have been made at this time. We are genuinely committed to hearing a wide range of voices on this issue. The paper was published earlier this year with a 120-day comment period. We encourage your comments and feedback-- generally, and in response to specific questions posed in the paper. As we engage in this dialogue and evaluation process, and throughout this initiative, I intend to keep an open mind about the usefulness of and potential business case for a CBDC. I strongly encourage community bankers and all of the other stakeholders who would be impacted by the creation of a CBDC to submit your comments and views to the Fed by May 20, the end of the scheduled public comment period. 
Another area of intense interest is the expansion of financial activities beyond the traditional chartered banking institution construct. We are seeing an increase in the proposal of novel charter types under consideration across the country. These changes, and the coming availability of the Fed Now instant payment service, have the potential to vastly change the landscape of financial services and opportunities in the market. In anticipation of this evolution, our Federal Reserve Banks are receiving an increased number of requests for membership and access to Reserve Bank master accounts from institutions with these novel charters. Recognizing the importance of clarity and transparency in this space, and to facilitate and evaluate these activities in a consistent manner, the Board is in the process of issuing clearer guidance around the application and review process for novel bank charters and account access at the Federal Reserve. I look forward to discussing these and other issues with you in just a few minutes, so I will stop there. It's such a pleasure to be in person with you again at the ABA's Conference for Community Banks, and I am looking forward to our conversation."


## Storage

Next, we initialise our vector database.

#### How much data to store
How much metadata do you want to include in the index? Metadata can be used to filter your queries or to bring back more information upon retrieval for your application to use, but larger indices will be slower, so there is a trade-off.

There are two common design patterns here:
- **All-in-one:** Store your metadata with the vector embeddings so you perform semantic search and retrieval on the same database. This is easier to set up and run, but can run into scaling issues when your index grows.
- **Vectors only:** Store just the embeddings and any IDs/references needed to locate the metadata that goes with the vector in a different database or location. In this pattern the vector database is only used to locate the most relevant IDs, then those are looked up from a different database. This can be more scalable if your vector database is going to be extremely large, or if you have large volumes of metadata with each vector.
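
To make the difference between the two patterns concrete, here is a minimal sketch that uses plain Python dictionaries as stand-ins for the databases. The names `all_in_one`, `vector_index` and `metadata_db`, the toy 3-dimensional vectors and the record contents are all illustrative assumptions; none of them are part of this notebook's pipeline.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pattern 1 (all-in-one): vectors and metadata live in the same store.
all_in_one = {
    "doc-0": {"vector": [1.0, 0.0, 0.0], "title": "Toy title A", "text": "Toy text A"},
    "doc-1": {"vector": [0.0, 1.0, 0.0], "title": "Toy title B", "text": "Toy text B"},
}

def search_all_in_one(query_vector):
    """One lookup returns both the best match and its metadata."""
    best_id = max(
        all_in_one,
        key=lambda k: cosine_similarity(query_vector, all_in_one[k]["vector"]),
    )
    record = all_in_one[best_id]
    return {"id": best_id, "title": record["title"], "text": record["text"]}

# Pattern 2 (vectors only): the vector store holds IDs and vectors,
# and the metadata is fetched from a separate store by ID.
vector_index = {k: v["vector"] for k, v in all_in_one.items()}
metadata_db = {k: {"title": v["title"], "text": v["text"]} for k, v in all_in_one.items()}

def search_vectors_only(query_vector):
    """First find the best ID, then look its metadata up elsewhere."""
    best_id = max(
        vector_index,
        key=lambda k: cosine_similarity(query_vector, vector_index[k]),
    )
    return {"id": best_id, **metadata_db[best_id]}

query = [0.1, 0.9, 0.0]
print(search_all_in_one(query))
print(search_vectors_only(query))
```

Both functions return the same record; the difference is only in where the metadata is held and how many stores are touched per query.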

You will need the full Redis Stack to enable use of RediSearch.

To set this up locally, you will need to:
- Install an appropriate version of [Docker](https://docs.docker.com/desktop/) for your OS
- Ensure Docker is running, e.g. by running ```docker run hello-world```
- Run the following command: ```docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest```.

The code used here draws heavily on [this repo](https://github.com/RedisAI/vecsim-demo).

After setting up the Docker instance of Redis Stack, follow the instructions below to initiate a Redis connection and create a Hierarchical Navigable Small World (HNSW) index for semantic search.

In [5]:
# Setup Redis

REDIS_HOST = 'localhost'
REDIS_PORT = '6379'
REDIS_DB = '0'

redis_client = r(host=REDIS_HOST, port=REDIS_PORT, db=REDIS_DB,decode_responses=False)

# Constants
VECTOR_DIM = 1536 # length of the vectors
PREFIX = "wiki" # prefix for the document keys
DISTANCE_METRIC = "COSINE" # distance metric for the vectors (ex. COSINE, IP, L2)
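
The `DISTANCE_METRIC` constant above names three options. As a rough illustration of how they differ, the sketch below computes the textbook versions of each on toy 2-dimensional vectors. The helper names are hypothetical, and the exact score RediSearch reports for each metric may be scaled or transformed differently, so treat this as intuition rather than a description of Redis internals.

```python
import math

def inner_product(a, b):
    """IP: sum of element-wise products; grows with vector length."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """L2: straight-line (Euclidean) distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """COSINE: inner product of the vectors after normalising their lengths."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the length
c = [0.0, 1.0]  # orthogonal to a

for name, other in (("b", b), ("c", c)):
    print(
        name,
        "cosine:", cosine_similarity(a, other),
        "ip:", inner_product(a, other),
        "l2:", l2_distance(a, other),
    )
```

Cosine similarity ignores the length difference between `a` and `b`, while the inner product and L2 distance do not.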

In [6]:
# Create search index

# Index
INDEX_NAME = "wiki-index"           # name of the search index
VECTOR_FIELD_NAME = 'content_vector'

# Define RediSearch fields for each of the columns in the dataset
# This is where you should add any additional metadata you want to capture
reference = TextField("reference")
country = TextField("country")

date = TextField("date") # stored as text for now; a different field type may suit dates better

title = TextField("title")
author = TextField("author")
is_gov = TextField("is_gov")

text_chunk = TextField("text")
file_chunk_index = NumericField("file_chunk_index")

# define RediSearch vector fields to use HNSW index

text_embedding = VectorField(VECTOR_FIELD_NAME,
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC
    }
)
# Add all our field objects to a list to be created as an index
fields = [reference,country,date,title,author,is_gov,text_chunk,file_chunk_index,text_embedding]

redis_client.ping()

True
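
The `date` column in the sample rows uses a day/month/two-digit-year format (for example `18/2/22`), and the index above stores it as text. If range filtering on dates were ever needed, one possible approach (an assumption, not something this notebook does) would be to convert each date to a Unix timestamp and index it in a `NumericField` instead. A minimal sketch of the conversion, with a hypothetical helper name:

```python
from datetime import datetime, timezone

def date_to_timestamp(date_string):
    """Convert a 'day/month/two-digit-year' string into a UTC Unix timestamp."""
    parsed = datetime.strptime(date_string, "%d/%m/%y").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

ts = date_to_timestamp("18/2/22")
print(ts, datetime.fromtimestamp(ts, tz=timezone.utc).date())
```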

Optional step to drop the index if it already exists

```redis_client.ft(INDEX_NAME).dropindex()```

If you want to clear the whole DB use:

```redis_client.flushall()```

In [7]:
# Check if index exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except Exception as e:
    print(e)
    # Create RediSearch Index
    print('Not there yet. Creating')
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )

Unknown Index name
Not there yet. Creating
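
The vector field in this index is declared with `TYPE` `FLOAT32` and `DIM` set to `VECTOR_DIM`, so each embedding written to a hash is generally expected to be a binary blob of float32 values with exactly that many elements. The sketch below shows one way to produce and check such a blob using only the standard library. The helper names are hypothetical and the embedding values are placeholders; with NumPy, `np.array(vector, dtype=np.float32).tobytes()` is a common equivalent.

```python
import struct

VECTOR_DIM = 1536  # same value as the constant defined in the setup cell

def to_float32_bytes(vector):
    """Pack a list of floats into little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)

def from_float32_bytes(blob):
    """Unpack float32 bytes back into a list of Python floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

embedding = [0.5] * VECTOR_DIM  # placeholder values, not a real embedding
blob = to_float32_bytes(embedding)

# Each float32 value occupies 4 bytes, so the blob length is a quick sanity check
print(len(blob) == VECTOR_DIM * 4)
```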


### Data preparation

The next step is to prepare your data. There are a few decisions to keep in mind here:

#### Chunking your data

In this context, "chunking" means cutting up the text into reasonable sizes so that the content will fit into the context length of the language model you choose. If your data is small enough or your LLM has a large enough context limit then you can proceed with no chunking, but in many cases you'll need to chunk your data. I'll share two main design patterns here:
- **Token-based:** Chunking your data based on some common token threshold, e.g. 300, 500 or 1000 depending on your use case. This approach works best with a grid-search evaluation to decide the optimal chunking logic over a set of evaluation questions. Variables to consider are whether chunks have overlaps, and whether you extend or truncate a section to keep full sentences and paragraphs together.
- **Deterministic:** Deterministic chunking uses some common delimiter, like a page break, paragraph end, section header etc. to chunk. This can work well if you have data of reasonably uniform structure, or if you can use GPT to help annotate the data first so you can guarantee common delimiters. However, it can be difficult to handle your chunks when you stuff them into the prompt, given you need to cater for many different lengths of content, so consider that in your application design.
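
The two patterns can be sketched in a few lines. This example is a simplified illustration: it uses whitespace-separated words as a stand-in for real tokens (the notebook itself uses `tiktoken`), and the function names `token_chunks` and `delimiter_chunks` are hypothetical.

```python
def token_chunks(tokens, size, overlap=0):
    """Split a token list into windows of `size` tokens.

    Consecutive windows share `overlap` tokens.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not tokens:
        return []
    step = size - overlap
    # Stop early enough that the final window is not just repeated overlap
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, last_start, step)]

def delimiter_chunks(text, delimiter="\n\n"):
    """Split text on a fixed delimiter, dropping empty pieces."""
    return [part.strip() for part in text.split(delimiter) if part.strip()]

# Whitespace "tokens" as a stand-in for tokenizer output
tokens = "one two three four five six seven eight nine ten".split()

print(token_chunks(tokens, 4))             # fixed-size windows, no overlap
print(token_chunks(tokens, 4, overlap=1))  # each window repeats one token
print(delimiter_chunks("Intro paragraph.\n\nSecond paragraph.\n\n"))
```

Token-based chunks have predictable sizes but can cut across sentences, whereas delimiter-based chunks respect the document's structure at the cost of varying lengths.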

#### Which vectors should you store

It is critical to think through the user experience you're building towards because this will inform both the number and content of your vectors. Here are two example use cases that show how these can pan out:
- **Tool Manual Knowledge Base:** We have a database of manuals that our customers want to search over. For this use case, we want a vector to allow the user to identify the right manual, before searching a different set of vectors to interrogate the content of the manual to avoid any cross-pollination of similar content between different manuals. 
    - **Title Vector:** Could include title, author name, brand and abstract.
    - **Content Vector:** Includes content only.
- **Investor Reports:** We have a database of investor reports that contain financial information about public companies. I want relevant snippets pulled out and summarised so I can decide how to invest. In this instance we want one set of content vectors, so that the retrieval can pull multiple entries on a company or industry, and summarise them to form a composite analysis.
    - **Content Vector:** Includes content only, or content supplemented by other features that improve search quality such as author, industry etc.
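
As a rough illustration of what would be embedded in each case, the sketch below builds the input text for a title vector and for a content vector. The record, its field names and both helper functions are hypothetical examples, not part of this notebook's dataset or code.

```python
def title_vector_text(doc):
    """Text for a 'title vector': identifying fields only, no body content."""
    fields = ("title", "author", "brand", "abstract")
    return "; ".join(f"{name}: {doc[name]}" for name in fields)

def content_vector_text(doc, extra_fields=()):
    """Text for a 'content vector': the content, optionally prefixed by extra fields."""
    prefix = "".join(f"{name}: {doc[name]};\n" for name in extra_fields)
    return prefix + doc["content"]

# Made-up record for illustration only
manual = {
    "title": "Cordless Drill Manual",
    "author": "Example Author",
    "brand": "ExampleBrand",
    "abstract": "Setup and safety instructions.",
    "content": "Charge the battery fully before first use.",
}

print(title_vector_text(manual))
print(content_vector_text(manual))
print(content_vector_text(manual, extra_fields=("title",)))
```

The last call is closest to what this walkthrough does, where each content chunk is embedded with its title as a prefix.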
    
For this walkthrough we'll go with 1000 token-based chunking of text content with no overlap, and embed them with the article title included as a prefix.

In [8]:
# We'll use 1000 token chunks with some intelligence to avoid splitting mid-sentence
TEXT_EMBEDDING_CHUNK_SIZE = 1000
EMBEDDINGS_MODEL = "text-embedding-ada-002"

In [11]:
## Chunking Logic

# Split a text into smaller chunks of size n, preferably ending at the end of a sentence
def chunks(text, n, tokenizer):
    """Yield successive chunks of roughly n tokens from text, as lists of token ids."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j
        
def get_unique_id_for_file_chunk(reference, chunk_index):
    return str(reference+"-!"+str(chunk_index))

def chunk_text(x, text_list):
    """Chunk one row of the DataFrame and append one record per chunk to text_list."""

    reference = x["reference"]
    country = x["country"]
    date = x["date"]
    title = x["title"]
    author = x["author"]
    is_gov = x["is_gov"]

    file_body_string = x['text']

    # Split the body into token chunks, then decode each chunk back to text below
    token_chunks = list(chunks(file_body_string, TEXT_EMBEDDING_CHUNK_SIZE, tokenizer))
    text_chunks = [f'Title: {title};\n'+ tokenizer.decode(chunk) for chunk in token_chunks]
    
    #embeddings_response = openai.Embedding.create(input=text_chunks, model=EMBEDDINGS_MODEL)

    #embeddings = [embedding["embedding"] for embedding in embeddings_response['data']]
    #text_embeddings = list(zip(text_chunks, embeddings))

    # Build one record per chunk: a unique id plus a metadata dict
    # (the embedding itself is attached later, after batch embedding)
    
    for i, text_chunk in enumerate(text_chunks):
        id = get_unique_id_for_file_chunk(reference, i)
        text_list.append(({'id': id
                         , 'metadata': {"reference": reference
                                        , "country": country
                                        , "date": date
                                        , "title": title
                                        , "author": author
                                        , "is_gov": is_gov
                                        , "content": text_chunk
                                        , "file_chunk_index": i}}))

In [None]:
## Chunking Logic using Langchain

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(text)


In [12]:
## Batch Embedding Logic

# Simple function to take in a list of text objects and return them as a list of embeddings
@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(10))
def get_embeddings(input: List):
    response = openai.Embedding.create(
        input=input,
        model=EMBEDDINGS_MODEL,
    )["data"]
    return [data["embedding"] for data in response]

def batchify(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx : min(ndx + n, l)]

# Function for batching and parallel processing the embeddings
def embed_corpus(
    corpus: List[str],
    batch_size=64,
    num_workers=8,
    max_context_len=8191,
):

    # Encode the corpus, truncating to max_context_len
    encoding = tiktoken.get_encoding("cl100k_base")
    encoded_corpus = [
        encoded_article[:max_context_len] for encoded_article in encoding.encode_batch(corpus)
    ]

    # Calculate corpus statistics: the number of inputs, the total number of tokens, and the estimated cost to embed
    num_tokens = sum(len(article) for article in encoded_corpus)
    cost_to_embed_tokens = num_tokens / 1_000 * 0.0004
    print(
        f"num_articles={len(encoded_corpus)}, num_tokens={num_tokens}, est_embedding_cost={cost_to_embed_tokens:.2f} USD"
    )

    # Embed the corpus
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        
        futures = [
            executor.submit(get_embeddings, text_batch)
            for text_batch in batchify(encoded_corpus, batch_size)
        ]

        with tqdm(total=len(encoded_corpus)) as pbar:
            for _ in concurrent.futures.as_completed(futures):
                pbar.update(batch_size)

        embeddings = []
        for future in futures:
            data = future.result()
            embeddings.extend(data)

        return embeddings

In [13]:
%%time
# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

# List to hold vectors
text_list = []

# Chunk each article and prepare it for embedding
x = article_df.apply(lambda x: chunk_text(x, text_list),axis = 1)

CPU times: user 1.04 s, sys: 131 ms, total: 1.17 s
Wall time: 1.2 s


In [14]:
text_list[0]

{'id': 'Photon-!0',
 'metadata': {'url': 'https://simple.wikipedia.org/wiki/Photon',
  'title': 'Photon',
  'content': 'Title: Photon;\nPhotons  (from Greek φως, meaning light), in many atomic models in physics,  are particles which transmit light. In other words, light is carried over space by photons. Photon is an elementary particle that is its own antiparticle. In quantum mechanics each photon has a characteristic quantum of energy that depends on frequency: A photon associated with light at a higher frequency will have more energy (and be associated with light at a shorter wavelength).\n\nPhotons have a rest mass of 0 (zero). However, Einstein\'s theory of relativity says that they do have a certain amount of momentum. Before the photon got its name, Einstein revived the proposal that light is separate pieces of energy (particles). These particles came to be known as photons. \n\nA photon is usually given the symbol γ (gamma),\n\nProperties \n\nPhotons are fundamental particles. A

In [15]:
# Batch embed our chunked text - this will cost you about $0.50
embeddings = embed_corpus([text["metadata"]['content'] for text in text_list])

num_articles=2693, num_tokens=1046988, est_embedding_cost=0.42 USD


2752it [00:10, 271.48it/s]                                                                               
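
The figures in the two outputs above can be reproduced by hand. This small sketch assumes the rate of 0.0004 USD per 1,000 tokens that is hard-coded in `embed_corpus` (check current pricing before relying on it) and the default `batch_size=64`.

```python
import math

NUM_CHUNKS = 2693            # num_articles reported above
NUM_TOKENS = 1_046_988       # num_tokens reported above
USD_PER_1K_TOKENS = 0.0004   # rate hard-coded in embed_corpus; an assumption about pricing
BATCH_SIZE = 64              # embed_corpus default

estimated_cost = NUM_TOKENS / 1_000 * USD_PER_1K_TOKENS
num_batches = math.ceil(NUM_CHUNKS / BATCH_SIZE)
progress_total = num_batches * BATCH_SIZE

print(f"{estimated_cost:.2f} USD, {num_batches} batches, progress bar ends at {progress_total}")
```

The progress bar ends above 2693 because `pbar.update(batch_size)` counts the final, partial batch as a full one.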


In [16]:
# Join up embeddings with our original list
embeddings_list = [{"embedding": v} for v in embeddings]
for i,x in enumerate(embeddings_list):
    text_list[i].update(x)
text_list[0]

{'id': 'Photon-!0',
 'metadata': {'url': 'https://simple.wikipedia.org/wiki/Photon',
  'title': 'Photon',
  'content': 'Title: Photon;\nPhotons  (from Greek φως, meaning light), in many atomic models in physics,  are particles which transmit light. In other words, light is carried over space by photons. Photon is an elementary particle that is its own antiparticle. In quantum mechanics each photon has a characteristic quantum of energy that depends on frequency: A photon associated with light at a higher frequency will have more energy (and be associated with light at a shorter wavelength).\n\nPhotons have a rest mass of 0 (zero). However, Einstein\'s theory of relativity says that they do have a certain amount of momentum. Before the photon got its name, Einstein revived the proposal that light is separate pieces of energy (particles). These particles came to be known as photons. \n\nA photon is usually given the symbol γ (gamma),\n\nProperties \n\nPhotons are fundamental particles. A

In [17]:
# Create a Redis pipeline to load all the vectors and their metadata
def load_vectors(client:r, input_list, vector_field_name):
    p = client.pipeline(transaction=False)
    for text in input_list:    
        #hash key
        key=f"{PREFIX}:{text['id']}"
        
        #hash values
        item_metadata = text['metadata']
        #
        item_keywords_vector = np.array(text['embedding'],dtype= 'float32').tobytes()
        item_metadata[vector_field_name]=item_keywords_vector
        
        # HSET
        p.hset(key,mapping=item_metadata)
            
    p.execute()

In [18]:
batch_size = 100  # how many vectors we insert at once

for i in tqdm(range(0, len(text_list), batch_size)):
    # find end of batch
    i_end = min(len(text_list), i+batch_size)
    meta_batch = text_list[i:i_end]
    
    load_vectors(redis_client,meta_batch,vector_field_name=VECTOR_FIELD_NAME)

100%|████████████████████████████████████████████████████████████████████| 27/27 [00:07<00:00,  3.40it/s]


In [19]:
redis_client.ft(INDEX_NAME).info()['num_docs']

'2693'
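
`load_vectors` stores each embedding as raw float32 bytes rather than as a list of numbers. The sketch below illustrates that byte layout with the standard-library `array` module as a stand-in for `np.array(..., dtype='float32').tobytes()`; it assumes the platform's C `float` is the usual 4-byte type, and the three-value vector is a toy example.

```python
from array import array

embedding = [0.25, -1.5, 3.0]  # toy vector; these values are exactly representable in float32

# Pack to bytes: 4 bytes per value, in the machine's native byte order
blob = array("f", embedding).tobytes()

# Unpack the bytes back into floats
restored = array("f")
restored.frombytes(blob)

print(len(blob), list(restored))
```

At 4 bytes per dimension, a 1536-dimension `text-embedding-ada-002` vector occupies 6,144 bytes in each hash.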

### Search

We can now use our knowledge base to bring back search results. This is one of the areas of highest friction in enterprise knowledge retrieval use cases, with the most common complaint being that the system does not retrieve what you intuitively think are the most relevant documents. There are a few ways of tackling this - I'll share a few options here, as well as some resources to take your research further:

#### Vector search, keyword search or a hybrid

Despite the strong capabilities that vector search gives out of the box, search is still not a solved problem. There are well-proven [Lucene-based](https://en.wikipedia.org/wiki/Apache_Lucene) search solutions such as Elasticsearch and Solr that use methods that work well for certain use cases, as well as the sparse vector methods of traditional NLP such as [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). If your retrieval is poor, the answer may be one of these in particular, or a combination:
- **Vector search:** Converts your text into vector embeddings which can be searched using KNN, SVM or some other model to return the most relevant results. This is the approach we take in this workbook, using a RediSearch vector DB which employs a KNN search under the hood.
- **Keyword search:** This method uses any keyword-based search approach to return a score - it could use Elasticsearch/Solr out-of-the-box, or a TF-IDF approach like BM25.
- **Hybrid search:** This last approach is a mix of the two, where you produce both a vector search and keyword search result, before using an ```alpha``` between 0 and 1 to weight the outputs. There is a great example of this explained by the Weaviate team [here](https://weaviate.io/blog/hybrid-search-explained).
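
A minimal sketch of the `alpha` weighting follows. The min-max normalisation and the convention that `alpha=1` means pure vector search are choices made for this example, and `hybrid_rank` is a hypothetical helper, not a function from Redis or Weaviate. It expects similarity scores where higher is better, so a distance such as the `vector_score` returned by the KNN query in this notebook would need converting first.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two result sets are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    vector_scores: Dict[str, float], keyword_scores: Dict[str, float], alpha: float = 0.5
) -> List[Tuple[str, float]]:
    v, k = min_max(vector_scores), min_max(keyword_scores)
    # A document missing from one result set contributes 0 for that component
    blended = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


# Toy scores: vector similarities and BM25-style keyword scores (higher is better)
vector_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
keyword_scores = {"b": 12.0, "c": 9.0, "d": 3.0}
print(hybrid_rank(vector_scores, keyword_scores, alpha=0.5))
```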

#### Hypothetical Document Embeddings (HyDE)

This is a novel approach from [this paper](https://arxiv.org/abs/2212.10496), which states that a hypothetical answer to a question is more semantically similar to the real answer than the question is. In practice this means that your search would use GPT to generate a hypothetical answer, then embed that and use it for search. I've seen success with this both as a pure search, and as a retry step if the initial retrieval fails to retrieve relevant content. A simple example implementation is here:
```
def answer_question_hyde(question,prompt):
    
    hyde_prompt = '''You are OracleGPT, a helpful expert who answers user questions to the best of their ability.
    Provide a confident answer to their question. If you don't know the answer, make the best guess you can based on the context of the question.

    User question: USER_QUESTION_HERE
    
    Answer:'''
    
    hypothetical_answer = openai.Completion.create(model=COMPLETIONS_MODEL,prompt=hyde_prompt.replace('USER_QUESTION_HERE',question))['choices'][0]['text']
    
    # get_redis_results also needs the index name
    search_results = get_redis_results(redis_client,hypothetical_answer,index_name=INDEX_NAME)
    
    return search_results
```
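
The "retry step" variant can be sketched as below: search with the raw question first, and only fall back to a hypothetical answer when the best hit is too far away. The search and generation functions are injected so the control flow can be read without any API calls; `search_with_hyde_fallback`, the `max_distance` threshold of 0.2 and the fake functions are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# A search function returns (text, distance) pairs sorted by distance, lower is closer
SearchFn = Callable[[str], List[Tuple[str, float]]]


def search_with_hyde_fallback(
    question: str,
    search: SearchFn,
    generate_hypothetical_answer: Callable[[str], str],
    max_distance: float = 0.2,
) -> Tuple[List[Tuple[str, float]], bool]:
    """Return (results, used_hyde)."""
    results = search(question)
    if results and results[0][1] <= max_distance:
        return results, False
    # Initial retrieval looks weak: embed a hypothetical answer instead of the question
    hypothetical = generate_hypothetical_answer(question)
    return search(hypothetical), True


# Fake components standing in for the vector search and the completion call
def fake_search(text: str) -> List[Tuple[str, float]]:
    return [("weak match", 0.5)] if text == "question" else [("strong match", 0.1)]


results, used_hyde = search_with_hyde_fallback(
    "question", fake_search, lambda q: "a confident hypothetical answer"
)
print(results, used_hyde)
```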

#### Fine-tuning embeddings

This next approach leverages the learning you gain from real question/answer pairs that your users will generate during the evaluation approach. It works by:
- Creating a dataset of positive (and optionally negative) question and answer pairs. Positive examples would be a correct retrieval to a question, while negative would be poor retrievals.
- Calculating the embeddings for both questions and answers and the cosine similarity between them.
- Training a model to optimize the embeddings matrix and testing retrieval, picking the best one.
- Performing a matrix multiplication of the base Ada embeddings by this new best matrix, creating a new fine-tuned embedding to use for retrieval.

There is a great walkthrough of both the approach and the code to perform it in [this cookbook](./Customizing_embeddings.ipynb).
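
The final step, multiplying base embeddings by a matrix, can be sketched in a few lines. The matrix below is hand-picked to show the effect and is not trained; in the real approach it would be learned from your question/answer pairs. The two-dimensional vectors are toy stand-ins for Ada embeddings.

```python
import math
from typing import List


def apply_matrix(embedding: List[float], matrix: List[List[float]]) -> List[float]:
    """Row vector times matrix: shape (d,) x (d, k) -> (k,)."""
    return [
        sum(e * row[j] for e, row in zip(embedding, matrix))
        for j in range(len(matrix[0]))
    ]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


question = [1.0, 1.0]
answer = [1.0, -1.0]

# Hypothetical "learned" matrix that down-weights the second dimension
matrix = [[1.0, 0.0],
          [0.0, 0.1]]

before = cosine_similarity(question, answer)
after = cosine_similarity(apply_matrix(question, matrix), apply_matrix(answer, matrix))
print(f"before={before:.3f}, after={after:.3f}")
```

Here the pair starts out orthogonal and becomes highly similar once the dimension on which they disagree is down-weighted, which is the kind of adjustment training is meant to find for positive pairs.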

#### Reranking

One other well-proven method from traditional search solutions that can be applied to any of the above approaches is reranking, where we over-fetch our search results, and then deterministically rerank based on a modifier or set of modifiers.

An example is investor reports again - it is highly likely that if we have 3 reports on Apple, we'll want to make our investment decisions based on the latest one. In this instance a ```recency``` modifier could be applied to the vector scores to sort them, giving us the latest one on the top even if it is not the most semantically similar to our search question. 
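
A minimal sketch of such a `recency` modifier is shown below. The exponential half-life decay, the 50/50 weighting and the field names are assumptions chosen for the example; the similarity scores and dates are invented.

```python
from datetime import date
from typing import Dict, List


def rerank_by_recency(
    results: List[Dict], today: date, half_life_days: float = 365.0, weight: float = 0.5
) -> List[Dict]:
    """Blend semantic similarity (higher is better) with a recency score in (0, 1]."""
    def score(result: Dict) -> float:
        age_days = (today - result["published"]).days
        recency = 0.5 ** (age_days / half_life_days)  # 1.0 today, 0.5 after one half-life
        return (1 - weight) * result["similarity"] + weight * recency

    return sorted(results, key=score, reverse=True)


# Over-fetched results: the older report is slightly more similar to the query
results = [
    {"title": "Apple 2021", "similarity": 0.92, "published": date(2021, 1, 31)},
    {"title": "Apple 2023", "similarity": 0.88, "published": date(2023, 1, 31)},
]
reranked = rerank_by_recency(results, today=date(2023, 6, 1))
print([r["title"] for r in reranked])
```

Setting `weight=0` recovers the pure semantic ordering, which makes the modifier easy to tune or switch off during evaluation.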

For this walkthrough we'll stick with a basic semantic search bringing back the top 5 chunks for a user question, and providing a summarised response using GPT.

In [20]:
# Make query to Redis
def query_redis(redis_conn,query,index_name, top_k=5):
    
    

    ## Creates embedding vector from user query
    embedded_query = np.array(openai.Embedding.create(
                                                input=query,
                                                model=EMBEDDINGS_MODEL,
                                            )["data"][0]['embedding'], dtype=np.float32).tobytes()

    #prepare the query
    q = Query(f'*=>[KNN {top_k} @{VECTOR_FIELD_NAME} $vec_param AS vector_score]').sort_by('vector_score').paging(0,top_k).return_fields('vector_score','url','title','content','text_chunk_index').dialect(2) 
    params_dict = {"vec_param": embedded_query}

    
    #Execute the query
    results = redis_conn.ft(index_name).search(q, query_params = params_dict)
    
    return results

# Get mapped documents from Redis results
def get_redis_results(redis_conn,query,index_name):
    
    # Get most relevant documents from Redis
    query_result = query_redis(redis_conn,query,index_name)
    
    # Extract info into a list
    query_result_list = []
    for i, result in enumerate(query_result.docs):
        result_order = i
        url = result.url
        title = result.title
        text = result.content
        score = result.vector_score
        query_result_list.append((result_order,url,title,text,score))
        
    # Display result as a DataFrame for ease of use
    result_df = pd.DataFrame(query_result_list)
    result_df.columns = ['id','url','title','result','certainty']
    return result_df

In [21]:
%%time

wiki_query='What is Thomas Dolby known for?'

result_df = get_redis_results(redis_client,wiki_query,index_name=INDEX_NAME)
result_df.head(2)

CPU times: user 7.1 ms, sys: 2.35 ms, total: 9.45 ms
Wall time: 495 ms


Unnamed: 0,id,url,title,result,certainty
0,0,https://simple.wikipedia.org/wiki/Thomas%20Dolby,Thomas Dolby,"Title: Thomas Dolby;\nThomas Dolby (born Thomas Morgan Robertson; 14 October 1958) is a British musican and computer designer. He is probably most famous for his 1982 hit, ""She Blinded me with Science"".\n\nHe married actress Kathleen Beller in 1988. The couple have three children together.\n\nDiscography\n\nSingles\n\nA Track did not chart in North America until 1983, after the success of ""She Blinded Me With Science"".\n\nAlbums\n\nStudio albums\n\nEPs\n\nReferences\n\nEnglish musicians\nLiving people\n1958 births\nNew wave musicians\nWarner Bros. Records artists",0.132723689079
1,1,https://simple.wikipedia.org/wiki/Synthesizer,Synthesizer,Title: Synthesizer;\nAudio technology,0.223129153252


In [22]:
# Build a prompt to provide the original query, the result and ask to summarise for the user
retrieval_prompt = '''Use the content to answer the search query the customer has sent.
If you can't answer the user's question, say "Sorry, I am unable to answer the question with the content". Do not guess.

Search query: 

SEARCH_QUERY_HERE

Content: 

SEARCH_CONTENT_HERE

Answer:
'''

def answer_user_question(query):
    
    results = get_redis_results(redis_client,query,INDEX_NAME)
    
    retrieval_prepped = retrieval_prompt.replace('SEARCH_QUERY_HERE',query).replace('SEARCH_CONTENT_HERE',results['result'][0])
    retrieval = openai.ChatCompletion.create(model=CHAT_MODEL,messages=[{'role':"user",'content': retrieval_prepped}],max_tokens=500)
    
    # Response provided by GPT-3.5
    return retrieval['choices'][0]['message']['content']

In [23]:
print(answer_user_question(wiki_query))

Thomas Dolby is known for his music, particularly his 1982 hit "She Blinded Me With Science". He is also a computer designer.


### Answer

We've now created a knowledge base that can answer user questions on Wikipedia. However, the user experience could be better, and this is where the Answer layer comes in, where an LLM Agent is used to interact with the user.

#### Choosing the user experience and architecture

There are different levels of complexity in building a knowledge retrieval experience leveraging an LLM; there is an experience vs. effort trade-off to consider when selecting the right type of interaction. There are many patterns, but I'll highlight a few of the most common here:
- **Q&A:** Your classic search engine use case, where the user inputs a question and your LLM gives them an answer either using its knowledge or, much more commonly, using a knowledge base that you prepare using the steps we've covered already. This simple use case assumes no memory of past queries is required, and no ability to clarify with the human or ask for more information.
- **Chat:** I think of Chat as being Q&A + memory - this is a slightly more sophisticated interaction where the LLM remembers what was previously asked and can delve deeper on something already covered.
- **Agent:** The most sophisticated is what LangChain calls an Agent. Agents leverage large language models to process and produce human-like results through a variety of tools, and will chain queries together dynamically until they have an answer that the LLM feels is appropriate to answer the user's question. However, for every "turn" you allow between Agent and user you increase the risks of loss of context, hallucination, or parsing errors, so be clear about the exact requirements your users have before embarking on building the Answer layer.

Q&A use cases are the simplest to implement, while Agents can give the most sophisticated user experience - in this notebook we'll build an Agent with memory and a single Tool to give an appreciation for the flexibility prompt chaining gives you in getting a more complete answer for your users.
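
To show what "Q&A + memory" means mechanically, here is a minimal sketch of a sliding-window memory that keeps only the last `k` turns. `WindowMemory` and its "Human:/AI:" history format are illustrative stand-ins for the idea, not LangChain's implementation.

```python
from collections import deque


class WindowMemory:
    """Keep the last k (user, assistant) turns and render them as prompt history."""

    def __init__(self, k: int = 2):
        self.turns = deque(maxlen=k)  # older turns fall off automatically

    def save(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def as_history(self) -> str:
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)


memory_demo = WindowMemory(k=2)
memory_demo.save("Who is Thomas Dolby?", "A British musician.")
memory_demo.save("What is he known for?", "His 1982 hit.")
memory_demo.save("When was he born?", "14 October 1958.")
print(memory_demo.as_history())
```

Only the two most recent turns survive, which bounds prompt length at the cost of forgetting earlier context.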

#### Ensuring reliability

The more complexity you add, the more chance your LLM will fail to respond correctly, or a response will come back in the wrong format and break your Answer pipeline. We'll share a few methods our customers have used elsewhere to help "channel" the Agent down a more deterministic path, and to deal with issues when they do crop up:
- **Prompt chaining:** Prompting the model to take a step-by-step approach and think aloud using a scratchpad has been proven to deliver more consistent results. It also means that as a developer you can break up one complex prompt into many simpler, more deterministic prompts, with the output of one prompt becoming the input for the next. This approach is known as Chain-of-Thought (CoT) reasoning - I'd suggest digging deeper as this is a dynamic new area of research, with a few of the key papers referenced here:
    - Chain of thought prompting [paper](https://arxiv.org/abs/2201.11903)
    - Self-reflecting agent [paper](https://arxiv.org/abs/2303.11366)
- **Self-referencing:** You can return references for the LLM's answer through either your application logic, or by prompt engineering it to return references. I would generally suggest doing it in your application logic, although if you have multiple chunks then a hybrid approach where you ask the LLM to return the key of the chunk it used could be advisable. I view this as a UX opportunity, where for many search use cases giving the "raw" output of the chunks retrieved as well as the summarised answer can give the user the best of both worlds, but please go with whatever is most appropriate for your users.
- **Discriminator models:** The best control for unwanted outputs is undoubtedly preventing them from happening in the first place with prompt engineering, prompt chaining and retrieval. However, when all of these fail, a discriminator model is a useful detective control. This is a classifier trained on past unwanted outputs that flags the Agent's response to the user as Safe or Not, enabling you to perform some business logic to either retry, pass to a human, or say it doesn't know. 
    - There is an example in our [Help Center](https://help.openai.com/en/articles/5528730-fine-tuning-a-classifier-to-improve-truthfulness).
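
The business logic around a discriminator can be sketched as a small guard function. The generator and the classifier are passed in as plain callables so the flow is visible without a trained model; `guarded_answer`, `max_retries` and the fallback message are assumptions made for this example.

```python
from typing import Callable


def guarded_answer(
    question: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
    max_retries: int = 1,
    fallback: str = "Sorry, I don't know.",
) -> str:
    """Return the first answer the discriminator accepts, else a fallback."""
    for _ in range(max_retries + 1):
        answer = generate(question)
        if is_safe(answer):
            return answer
    # Every attempt was flagged: decline (or hand over to a human) instead of replying
    return fallback


# Fake generator that produces a flagged draft first, then an acceptable one
drafts = iter(["UNSAFE draft", "Safe draft"])
answer = guarded_answer(
    "Any question",
    generate=lambda q: next(drafts),
    is_safe=lambda a: not a.startswith("UNSAFE"),
)
print(answer)
```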

This is a dynamic topic that has still not consolidated to a clear design that works best above all others, so for ease of implementation we will use LangChain, which supplies a framework with implementations for most of the concepts we've discussed above.

We'll create an Agent with access to our knowledge base, give it a prompt template and a custom parser for extracting the answers, set up a prompt chain and then let it answer our Wikipedia questions.

Our work here draws heavily on LangChain's great documentation, in particular [this guide](https://python.langchain.com/en/latest/modules/agents/agents/custom_llm_chat_agent.html).

In [24]:
from langchain.agents import Tool, AgentExecutor, LLMSingleActionAgent, AgentOutputParser
from langchain.prompts import BaseChatPromptTemplate
from langchain import SerpAPIWrapper, LLMChain
from langchain.chat_models import ChatOpenAI
from typing import List, Union
from langchain.schema import AgentAction, AgentFinish, HumanMessage
from langchain.memory import ConversationBufferWindowMemory
import re

In [25]:
def ask_gpt(query):
    response = openai.ChatCompletion.create(model=CHAT_MODEL,messages=[{"role":"user","content":"Please answer my question.\nQuestion: {}".format(query)}],temperature=0)
    return response['choices'][0]['message']['content']

In [26]:
# Define which tools the agent can use to answer user queries
tools = [
    Tool(
        name = "Search",
        func=answer_user_question,
        description="Useful for when you need to answer general knowledge questions. Input should be a fully formed question."
    ),
    Tool(
        name = "Knowledge",
        func = ask_gpt,
        description = "Useful for any other questions. Input should be a fully formed question."
    )
]

In [27]:
# Set up the base template
template = """You are WikiGPT, a helpful bot who answers questions using your tools or your own knowledge.
You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Previous conversation history:
{history}

New question: {input}
{agent_scratchpad}"""

In [28]:
# Set up a prompt template
class CustomPromptTemplate(BaseChatPromptTemplate):
    # The template to use
    template: str
    # The list of tools available
    tools: List[Tool]
    
    def format_messages(self, **kwargs) -> List[HumanMessage]:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        # Set the agent_scratchpad variable to that value
        kwargs["agent_scratchpad"] = thoughts
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join([f"{tool.name}: {tool.description}" for tool in self.tools])
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in self.tools])
        formatted = self.template.format(**kwargs)
        return [HumanMessage(content=formatted)]
    
    
class CustomOutputParser(AgentOutputParser):
    
    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                # Return values is generally always a dictionary with a single `output` key
                # It is not recommended to try anything else at the moment :)
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        # Parse out the action and action input
        regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)

In [29]:
prompt = CustomPromptTemplate(
    template=template,
    tools=tools,
    # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically
    # The history template includes "history" as an input variable so we can interpolate it into the prompt
    input_variables=["input", "intermediate_steps", "history"]
)

# Initiate the memory with k=2 to keep the last two turns
# Provide the memory to the agent
memory = ConversationBufferWindowMemory(k=2)

In [30]:
output_parser = CustomOutputParser()

llm = ChatOpenAI(temperature=0)

# LLM chain consisting of the LLM and a prompt
llm_chain = LLMChain(llm=llm, prompt=prompt)

tool_names = [tool.name for tool in tools]
agent = LLMSingleActionAgent(
    llm_chain=llm_chain, 
    output_parser=output_parser,
    stop=["\nObservation:"], 
    allowed_tools=tool_names
)

In [31]:
agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True, memory=memory)

In [32]:
agent_executor.run(wiki_query)



> Entering new AgentExecutor chain...
Thought: I'm not sure who Thomas Dolby is, I should probably search for more information.
Action: Search
Action Input: "What is Thomas Dolby known for?"

Observation: Thomas Dolby is known for being a British musician and computer designer, and his 1982 hit "She Blinded Me With Science".
Now that I know who Thomas Dolby is, I can answer the question.
Final Answer: Thomas Dolby is known for being a British musician and computer designer, and his 1982 hit "She Blinded Me With Science".

> Finished chain.


'Thomas Dolby is known for being a British musician and computer designer, and his 1982 hit "She Blinded Me With Science".'
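
To see how the output parser turns a trace like the one above into a tool call, the sketch below applies the same regular expression used in `CustomOutputParser` to a sample string. `parse_action` is a standalone helper written for this example; it returns a plain tuple instead of a LangChain `AgentAction`.

```python
import re

# Same pattern as in CustomOutputParser.parse
ACTION_REGEX = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"


def parse_action(llm_output: str):
    """Extract (tool_name, tool_input) from an Action / Action Input block."""
    match = re.search(ACTION_REGEX, llm_output, re.DOTALL)
    if not match:
        raise ValueError(f"Could not parse LLM output: `{llm_output}`")
    tool_name = match.group(1).strip()
    tool_input = match.group(2).strip(" ").strip('"')
    return tool_name, tool_input


sample = (
    "Thought: I should probably search for more information.\n"
    "Action: Search\n"
    'Action Input: "What is Thomas Dolby known for?"'
)
tool_name, tool_input = parse_action(sample)
print(tool_name, "|", tool_input)
```

Any output that contains neither "Final Answer:" nor a parsable Action block raises a `ValueError`, which is one of the parsing errors the reliability section warns about.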

In [33]:
agent_executor.run('What is 5 + 5')



> Entering new AgentExecutor chain...
Thought: This is a simple math question.
Action: Knowledge
Action Input: What is the sum of 5 and 5?

Observation: The sum of 5 and 5 is 10.
I now know the final answer.
Final Answer: The sum of 5 and 5 is 10.

> Finished chain.


'The sum of 5 and 5 is 10.'

### Evaluation

Last comes the not-so-fun bit that will make the difference between nifty prototype and production application - the process of evaluating and tuning your results. 

The key takeaway here is to make a framework that saves the results of each evaluation, as well as the parameters. Evaluation can be a difficult task that takes significant resources, so it is best to start prepared to handle multiple iterations. Some useful principles we've seen successful deployments use are:
- **Assign clear product ownership and metrics:** Ensure you have a team aligned from the start to annotate the outputs and determine whether they're bad or good. This may seem an obvious step, but too often the focus is on the engineering challenge of successfully retrieving content rather than the product challenge of providing retrieval results that are useful.
- **Log everything:** Store all requests and responses to and from your LLM and retrieval service if you can. This builds a great base for fine-tuning both the embeddings and any fine-tuned models or few-shot LLMs in future.
- **Use GPT-4 as a labeller:** When running evaluations, it can help to use GPT-4 as a gatekeeper for human annotation. Human annotation is costly and time-consuming, so doing an initial evaluation run with GPT-4 can help set a quality bar that needs to be met to justify human labelling. At this stage I would not suggest using GPT-4 as your only labeller, but it can certainly ease the burden.
    - This approach is outlined further in [this paper](https://arxiv.org/abs/2108.13487).

We'll use these principles to make a quick evaluation framework where we will:
- Use GPT-4 to make a list of hypothetical questions on our topic
- Ask our Agent the questions and save question/answer tuples
    - These two above steps simulate the actual users interacting with your application
- Get GPT-4 to evaluate whether the answers correctly respond to the questions
- Look at our results to measure how well the Agent answered the questions
- Plan remedial action
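
One lightweight way to save the results of each evaluation together with its parameters is an append-only JSON Lines file, sketched below. The file name, the record fields and the parameter names are assumptions for illustration, and the "Correct" label is a placeholder rather than a real evaluation result.

```python
import json
import os
import tempfile
import time
from typing import Dict, List


def log_evaluation_run(path: str, params: Dict, results: List[Dict]) -> None:
    """Append one evaluation run (parameters plus per-question results) as a JSON line."""
    record = {"timestamp": time.time(), "params": params, "results": results}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_runs(path: str) -> List[Dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Written to a temporary directory here; a real project would use a persistent location
log_path = os.path.join(tempfile.mkdtemp(), "evaluation_runs.jsonl")
log_evaluation_run(
    log_path,
    params={"chunk_size": 1000, "top_k": 5},
    results=[{
        "question": "What is the capital of Australia?",
        "answer": "The capital of Australia is Canberra.",
        "label": "Correct",  # placeholder label for illustration
    }],
)
runs = load_runs(log_path)
print(len(runs), runs[0]["params"])
```

Because every run carries its own parameters, later iterations can be compared without relying on memory of which settings produced which results.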

In [34]:
import time

# Build a prompt asking the model to generate hypothetical evaluation questions
evaluation_question_prompt = '''You are a helpful Wikipedia assistant who will generate a list of 10 creative general knowledge questions in markdown format.

Example:
- Explain how photons work
- What is Thomas Dolby known for?
- What are some key events of the 20th century?

Begin!
'''

try:
    # We'll use our model to generate 10 hypothetical questions to evaluate
    question = openai.ChatCompletion.create(model=CHAT_MODEL
                                            ,messages=[{"role":"user","content":evaluation_question_prompt}]
                                            ,temperature=0.9)
    evaluation_questions = question['choices'][0]['message']['content']
except Exception as e:
    print(e)


In [35]:
cleaned_questions = evaluation_questions.split('\n')
print(cleaned_questions)

['1. What is the difference between weather and climate?', '2. Who designed the Eiffel Tower?', '3. What is the capital of Australia?', '4. What is the chemical symbol for gold?', '5. Who invented the telephone?', '6. What is the largest organ in the human body?', '7. Which famous artist painted the Mona Lisa?', '8. What is the highest mountain in Africa?', '9. What famous building was destroyed during the September 11th attacks?', '10. Who wrote the novel "To Kill a Mockingbird"?']
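
The generated questions above still carry their list numbering, which is then passed to the Agent as part of each question. If you would rather strip it, a small helper along these lines could be used; `clean_questions` is a hypothetical function that handles numbered and bulleted lines and drops blank ones.

```python
import re
from typing import List


def clean_questions(raw: str) -> List[str]:
    """Split model output into questions, removing list markers and empty lines."""
    cleaned = []
    for line in raw.split("\n"):
        # Remove a leading "1.", "2)", "-" or "*" marker followed by whitespace
        line = re.sub(r"^\s*(?:[-*]|\d+[.)])\s+", "", line).strip()
        if line:
            cleaned.append(line)
    return cleaned


raw_output = (
    "1. What is the capital of Australia?\n"
    "\n"
    "- Who designed the Eiffel Tower?\n"
    "10. Who invented the telephone?"
)
questions_demo = clean_questions(raw_output)
print(questions_demo)
```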


In [36]:
# We'll use our agent to answer the generated questions to simulate users interacting with the system
question_answer_pairs = []

for question in cleaned_questions:
    memory = ConversationBufferWindowMemory(k=2)
    
    agent_executor = AgentExecutor.from_agent_and_tools(agent=agent
                                                        , tools=tools
                                                        , verbose=False
                                                        ,memory=memory)
    try:
        
        answer = agent_executor.run(question)
    except Exception as e:
        print(question)
        print(e)
        answer = 'Unable to answer question'
    question_answer_pairs.append((question,answer))
    time.sleep(2)

In [37]:
len(question_answer_pairs), question_answer_pairs[:5]

(10,
 [('1. What is the difference between weather and climate?',
   'Weather refers to short-term atmospheric conditions in a specific area, while climate refers to long-term patterns and trends of weather in a particular region over a period of time.'),
  ('2. Who designed the Eiffel Tower?',
   'Gustave Eiffel designed the Eiffel Tower.'),
  ('3. What is the capital of Australia?',
   'The capital of Australia is Canberra.'),
  ('4. What is the chemical symbol for gold?',
   'The chemical symbol for gold is Au.'),
  ('5. Who invented the telephone?',
   'Alexander Graham Bell invented the telephone.')])

In [38]:
# Build a system prompt that tells the evaluator how to grade each question/answer pair
gpt_evaluator_system = '''You are WikiGPT, a helpful Wikipedia expert.
You will be presented with general knowledge questions our users have asked.

Think about this step by step:
- You need to decide whether the answer adequately answers the question
- If it answers the question, you will say "Correct"
- If it doesn't answer the question, you will say one of the following:
    - If it couldn't answer at all, you will say "Unable to answer"
    - If the answer was provided but was incorrect, you will say "Incorrect" 
- If none of these rules are met, say "Unable to evaluate"

Evaluation can only be "Correct", "Incorrect", "Unable to answer", and "Unable to evaluate"

Example 1:

Question: What is the cost cap for the 2023 season of Formula 1?

Answer: The cost cap for 2023 is 95m USD.

Evaluation: Correct

Example 2:

Question: What is Thomas Dolby known for?

Answer: Inventing electricity

Evaluation: Incorrect

Begin!'''

# We'll provide our evaluator the questions and answers we've generated and get it to evaluate them as one of our four evaluation categories.
gpt_evaluator_message = '''
Question: {question}

Answer: {answer}

Evaluation:'''

In [39]:
evaluation_output = []

In [40]:
for pair in question_answer_pairs:
    
    message = gpt_evaluator_message.format(question=pair[0]
                                           ,answer=pair[1])
    evaluation = openai.ChatCompletion.create(model=CHAT_MODEL
                                              ,messages=[{"role":"system","content":gpt_evaluator_system}
                                                         ,{"role":"user","content":message}]
                                              ,temperature=0)
    
    evaluation_output.append((pair[0]
                              ,pair[1]
                              ,evaluation['choices'][0]['message']['content']))

In [41]:
# We'll collapse the evaluator's output into three labels for a simpler evaluation matrix
# In a real scenario we would take time to tune our prompt and add few-shot examples to ensure consistent output from the evaluation step
def collate_results(x):
    text = x.lower()
    
    if 'incorrect' in text:
        return 'incorrect'
    elif 'correct' in text:
        return 'correct'
    else: 
        return 'unable to answer'

In [42]:
eval_df = pd.DataFrame(evaluation_output)
eval_df.columns = ['question','answer','evaluation']
# Map every evaluation to "correct", "incorrect" or "unable to answer"
# (this folds any "unable to evaluate" results into "unable to answer")
eval_df['evaluation'] = eval_df['evaluation'].apply(collate_results)
eval_df.evaluation.value_counts()

correct    10
Name: evaluation, dtype: int64
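The collation step above folds evaluator failures into "unable to answer". If you want to keep all four labels from the evaluator prompt separate and report an accuracy figure, a sketch along these lines may help. The function name `normalise_label`, the `LABELS` tuple and the sample strings are assumptions made for illustration.

```python
from collections import Counter

# The four labels the evaluator prompt allows. "incorrect" is listed before
# "correct" because the string "incorrect" also contains "correct".
LABELS = ("unable to evaluate", "unable to answer", "incorrect", "correct")


def normalise_label(raw):
    """Map free-text evaluator output onto one of the four allowed labels.

    Hypothetical helper; anything unrecognised counts as "unable to evaluate".
    """
    text = raw.strip().lower()
    for label in LABELS:
        if label in text:
            return label
    return "unable to evaluate"


# Illustrative evaluator outputs, not results from the notebook
sample = ["Correct", "Evaluation: Incorrect", "Unable to answer", "???"]
counts = Counter(normalise_label(s) for s in sample)
accuracy = counts["correct"] / len(sample)
print(counts)
print(f"accuracy: {accuracy:.2f}")
```

Keeping "unable to evaluate" separate makes it easier to tell whether a problem lies with the agent or with the evaluation prompt itself.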

#### Analysis

Depending on how GPT did here, you may have received good responses across the board, as in the run above. In real-world use, however, you will almost certainly see some incorrect or unable-to-answer results, and will need to tune your search, your LLM or another aspect of the pipeline.

Your remediation plan could be as follows:
- **Incorrect answers:** Either use prompt engineering to help the model answer better (or switch to a larger model such as GPT-4), or optimise search to return more relevant chunks. Changes to chunking or embedding may also help: larger chunks may give more context, allowing the model to formulate a better answer.
- **Unable to answer:** This is either a retrieval problem, or the data doesn't exist in our knowledge base. We can prompt engineer to classify questions that are "out-of-bounds" and give the user a stock reply, or we can tune our search so the relevant data is returned.

This is the framework we'll build on to get our knowledge retrieval solution to production - again, log everything and store each run down to a question level so you can track regressions and iterate towards your production solution.
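One way to log each run at question level is to append one JSON line per question and compare runs afterwards. The sketch below assumes a local JSON Lines file and hypothetical helpers (`log_run`, `load_run`, `find_regressions`); the run IDs and sample tuples are made up for illustration and mirror the `(question, answer, evaluation)` shape of `evaluation_output`.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def log_run(results, log_path, run_id):
    """Append one JSON line per (question, answer, evaluation) tuple."""
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a", encoding="utf-8") as f:
        for question, answer, evaluation in results:
            record = {"run_id": run_id, "timestamp": timestamp,
                      "question": question, "answer": answer,
                      "evaluation": evaluation}
            f.write(json.dumps(record) + "\n")


def load_run(log_path, run_id):
    """Return {question: evaluation} for a single run."""
    evaluations = {}
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["run_id"] == run_id:
                evaluations[record["question"]] = record["evaluation"]
    return evaluations


def find_regressions(log_path, baseline_run, candidate_run):
    """List questions that were 'correct' in the baseline but not in the candidate."""
    baseline = load_run(log_path, baseline_run)
    candidate = load_run(log_path, candidate_run)
    return sorted(q for q, label in baseline.items()
                  if label == "correct" and q in candidate
                  and candidate[q] != "correct")


# Illustrative data: two runs over the same two questions
log_path = os.path.join(tempfile.mkdtemp(), "eval_log.jsonl")
log_run([("Q1", "A", "correct"), ("Q2", "B", "correct")], log_path, "run-1")
log_run([("Q1", "A", "correct"), ("Q2", "C", "incorrect")], log_path, "run-2")
print(find_regressions(log_path, "run-1", "run-2"))
```

An append-only file keeps the full history of every run, so a change to the prompt, the chunking or the search can be compared against any earlier baseline.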

## Conclusion

This concludes our Enterprise Knowledge Retrieval walkthrough. We hope you've found it useful, and that you're now in a position to build enterprise knowledge retrieval solutions, and have a few tricks to start you down the road of putting them into production.