# Demo (Basic Function)

This notebook demonstrates the basic operation of Knowledge Q&A.

There are two sections in this demo:

**Index the Dataset** demonstrates how to index the known knowledge.  For this demo, I downloaded a tiny set of 20 Wikipedia articles from [this Kaggle directory](https://www.kaggle.com/datasets/ltcmdrdata/plain-text-wikipedia-202011?resource=download), just enough to demonstrate the functionality.  I used Marqo as the index.

**RAG Q&A** demonstrates how to ask an LLM a question against the given knowledge.  You can see that the bot picked the correct article and gave the corresponding citation.  If the question is outside the scope of the given knowledge, the bot replies that it doesn't know and returns the citation marker "\[--\]".  This is useful later for triggering the bot to ask follow-up questions.

Also demonstrated is the ability to lift the restriction on the knowledge scope, either by asking the bot not to restrict itself or by removing the index reference altogether.  In both cases the bot can then answer from its pretrained knowledge.

**Note**: Performance evaluation will be added later.

## Prerequisites

- Marqo server running
- OpenAI key set in environment variable OPENAI_API_KEY
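The prerequisites above can be sanity-checked before running the rest of the notebook. This is a minimal sketch, not part of the `cjw.knowledgeqa` package: `check_prerequisites` is a hypothetical helper, and it only verifies that the key is present and that the Marqo URL is well formed. It does not confirm that the Marqo server is actually running.

```python
import os
from urllib.parse import urlparse


def check_prerequisites(marqo_url: str, env: dict) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper for illustration only.  It does not contact the server.
    """
    problems = []
    # The OpenAI key must be present and non-empty.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # The Marqo URL must at least look like an http(s) URL with a host.
    parsed = urlparse(marqo_url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        problems.append(f"Marqo URL looks malformed: {marqo_url!r}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites("http://localhost:8882", dict(os.environ)):
        print(f"WARNING: {problem}")
```

Passing the environment in as a dictionary keeps the check easy to try with made-up values.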


In [1]:
import os
import json

In [2]:
from cjw.knowledgeqa.indexer.Indexer import Indexer
from cjw.knowledgeqa.indexer.MarqoIndexer import MarqoIndexer
from cjw.knowledgeqa.bots.GptBot import GptBot

## Set up the Environment

Adjust these variables if necessary to suit the local environment.

In [3]:

HOME = os.path.expanduser("~")
PROJECT_DIR = f"{HOME}/IdeaProjects/knowledgeqa"
DATA_FILE = f"{PROJECT_DIR}/data/simple.json"

MARQO_SERVER = 'http://localhost:8882'
TEST_INDEX_NAME = "test_indexer"

RAG_KNOWLEDGE = 5       # Number of facts to retrieve from the index and feed to the LLM
GPT_TEMPERATURE = 0.6

## Index the Dataset

Read a tiny Wikipedia dataset.  It is the first few articles of the first file in [this Kaggle directory](https://www.kaggle.com/datasets/ltcmdrdata/plain-text-wikipedia-202011?resource=download).

In [4]:
try:
    # Create the index if not yet created
    index = MarqoIndexer.new(MARQO_SERVER, TEST_INDEX_NAME)
except Indexer.IndexExistError as e:
    # The index exists, so we just open it.
    print(f"Index exists: {e}")
    index = MarqoIndexer(MARQO_SERVER, TEST_INDEX_NAME)

Index exists: Index test_indexer exists


In [5]:
# Read the test data
with open(DATA_FILE, "r") as fd:
    data = json.load(fd)

# Put the data into the index.  Overwrite those with the same IDs.
status = await index.add(data, keyFields=["title", "text"], idField="id")
print(f"{len(status['items'])} items updated")

2024-01-07 17:00:02,764 logger:'marqo' INFO     add_documents batch 0: took 22.685s for Marqo to process & index 20 docs. Roundtrip time: 22.726s.


20 items updated
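The call above names `id` as the ID field and `title` and `text` as the key fields, so each record in the data file presumably carries at least those three keys. The sketch below checks a list of records for that shape before indexing. The record layout is an assumption inferred from those arguments, and `find_invalid_records` and the sample records are made up for illustration.

```python
import json

# Assumed from idField="id" and keyFields=["title", "text"]; not confirmed by the data file itself.
REQUIRED_FIELDS = ("id", "title", "text")


def find_invalid_records(records: list) -> list:
    """Return the positions of records that lack any of the required fields."""
    return [
        position
        for position, record in enumerate(records)
        if not all(field in record for field in REQUIRED_FIELDS)
    ]


# Made-up sample in the assumed shape; the second record is missing "text".
sample = json.loads("""
[
  {"id": "1", "title": "Example article", "text": "Body of the example article."},
  {"id": "2", "title": "Missing text field"}
]
""")
print(find_invalid_records(sample))
```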


In [6]:
def showResults(results: list):
    # Pretty-print the embedding search results, one hit per block
    for r in results:
        print(f"id={r['_id']} score={r['_score']} title={r['title']}\n{r['_highlights']}\n")

**Demonstrate that we can search based on embeddings.**

In [7]:
# Test semantic searching
nearestEmbedding = await index.search("What is M-137?", top=RAG_KNOWLEDGE)
showResults(nearestEmbedding)

id=7751000 score=0.81023204 title=M-137 (Michigan highway)
{'text': 'There M-137 ran almost due north before terminating at its connection with the rest of the state trunkline system at US 31 at Interlochen Corners. The roadway continues north of US 31 as South Long Lake Road after the M-137 designation ended.'}

id=7751062 score=0.6161333 title=Ghelamco Arena
{'text': 'Gent'}

id=7751190 score=0.58454525 title=Diego, Prince of Asturias
{'text': '==Ancestry== Category:1575 births Category:1582 deaths Category:16th-century House of Habsburg Category:Princes of Asturias Category:Dukes of Montblanc Category:Princes of Portugal Category:Spanish infantes Category:Portuguese infantes Category:Heirs apparent who never acceded'}

id=7751199 score=0.5838754 title=Union College, University of Queensland
{'title': 'Union College, University of Queensland'}

id=7751172 score=0.5790509 title=Racine Lutheran High School
{'title': 'Racine Lutheran High School'}
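The scores above come from comparing the embedding of the question with the embeddings of the indexed text. As a rough illustration of the idea, the sketch below ranks a few documents by cosine similarity, one common choice for this kind of comparison. The three-dimensional vectors are toy values I made up; a real index uses an embedding model with far more dimensions, and I am not claiming this is how Marqo computes its scores.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, documents, top):
    """Return the `top` (title, score) pairs, best match first.

    `documents` maps a title to its (toy) embedding vector.
    """
    scored = [(title, cosine_similarity(query_vector, vector))
              for title, vector in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]


# Toy embeddings, invented for this example only.
toy_documents = {
    "highway": [0.9, 0.1, 0.0],
    "stadium": [0.1, 0.8, 0.3],
    "college": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

for title, score in rank_by_similarity(toy_query, toy_documents, top=2):
    print(f"{title}: {score:.3f}")
```

As in the real output, the best match scores well above the rest, and the remaining hits are simply the least dissimilar documents available.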



In [8]:
# Print the top pick for reference in the later demo.
topPick = nearestEmbedding[0]
print(f"ID: {topPick['_id']}")
print(f"Title: {topPick['title']}")
print(f"Text: {topPick['text']}")

ID: 7751000
Title: M-137 (Michigan highway)
Text: M-137 was a state trunkline highway in the US state of Michigan that served as a spur route to the Interlochen Center for the Arts and Interlochen State Park. It started south of the park and ran north between two lakes in the area and through the community of Interlochen to US Highway 31 (US 31) in Grand Traverse County. The highway was first shown without a number label on maps in 1930 and labeled after an extension the next year. The highway's current routing was established in the 1950s. Jurisdiction of the roadway was transferred from the Michigan Department of Transportation (MDOT) to the Grand Traverse County Road Commission in June 2020, and the highway designation was decommissioned in the process; signage was removed by August 2020 to reflect the changeover. ==Route description== M-137 began at the southern end of Interlochen State Park at an intersection with Vagabond Lane. Farther south, the roadway continues toward Green La

## RAG Q&A

Use GPT-4 with the retrieved articles for RAG.

**First, we ask for something that was mentioned in the facts.**

Note that the sample was about the M-137 highway, but I asked the bot for points of interest.  Also note that the bot not only provided its answer but also attached a citation to the article.

In [9]:
# Create a bot (using GPT-4) and give it the indexed knowledge above.
# The bot picks the top RAG_KNOWLEDGE (5) candidate articles from which to derive its answer.
bot = GptBot.of("gpt4").withFacts(index, contentFields=["title", "text"], top=RAG_KNOWLEDGE)
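To give a feel for what "giving the bot facts" can mean, here is a minimal sketch of how retrieved articles might be folded into a prompt with citable IDs. This is not the `GptBot` implementation: `build_rag_prompt`, the prompt wording, and the `id`/`title`/`text` keys of each fact are all assumptions made for illustration.

```python
def build_rag_prompt(question: str, facts: list, restricted: bool = True) -> str:
    """Assemble a prompt that lists the facts with their IDs, then asks the question.

    Hypothetical sketch; the wording of the instructions is invented.
    """
    lines = ["Facts:"]
    for fact in facts:
        # Prefix every fact with its ID so the model has something to cite.
        lines.append(f"[{fact['id']}] {fact['title']}: {fact['text']}")
    if restricted:
        lines.append(
            "Answer only from the facts above and cite the ID of the fact used. "
            'If the facts do not contain the answer, reply "I don\'t know" and cite "--".'
        )
    else:
        lines.append(
            "Prefer the facts above and cite the ID when you use one; "
            "otherwise answer from general knowledge."
        )
    lines.append(f"Question: {question}")
    return "\n".join(lines)


example_facts = [
    {"id": "7751000", "title": "M-137 (Michigan highway)",
     "text": "M-137 was a state trunkline highway in the US state of Michigan."},
]
print(build_rag_prompt("What is M-137?", example_facts))
```

Keeping the ID next to each fact is what lets the answer carry a citation back to a specific article.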

In [10]:
# Ask a question that I know is covered by the indexed knowledge.
question = "Do you know any point of interests in Michigan?"
answer = await bot.ask(question)
print(f"{answer.content} [{answer.citation}]")

Token indices sequence length is longer than the specified maximum sequence length for this model (3539 > 1024). Running this sequence through the model will result in indexing errors


Yes, one point of interest in Michigan is the M-137, a state trunkline highway that served as a spur route to the Interlochen Center for the Arts and Interlochen State Park. However, as of June 2020, the highway designation was decommissioned and the jurisdiction of the roadway was transferred from the Michigan Department of Transportation to the Grand Traverse County Road Commission. [7751000]


**Then, we try to ask a question that is not mentioned in the facts.**

The default behavior is not to answer outside the scope of our indexed knowledge, so the bot said "I don't know".  In addition, the citation is now "--".  This is useful for triggering further actions in the system, such as making the bot ask a follow-up question.

In [11]:
unknown = "What is Euler identity?"
answer2 = await bot.ask(unknown)
print(f"{answer2.content} [{answer2.citation}]")

I don't know [--]
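The "--" marker makes it straightforward for calling code to decide what to do next. The sketch below shows one way to branch on the citation. The `Answer` dataclass is a hypothetical stand-in for whatever `bot.ask` returns, I am assuming the marker arrives as the string "--", and the action names are invented.

```python
from dataclasses import dataclass
from typing import Optional

NO_ANSWER_MARKER = "--"  # assumed to be returned as a plain string


@dataclass
class Answer:
    """Hypothetical stand-in with the two attributes used in this notebook."""
    content: str
    citation: Optional[str]


def next_action(answer: Answer) -> str:
    """Map the citation of an answer to a (made-up) follow-up action name."""
    if answer.citation == NO_ANSWER_MARKER:
        # The indexed knowledge did not cover the question.
        return "ask_follow_up"
    if answer.citation is None:
        # No citation at all: the answer did not come from the index.
        return "answered_without_index"
    return "answered_from_index"


print(next_action(Answer("I don't know", "--")))
```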


**We can ask it not to restrict itself to the given facts, but to use them to update its pretrained knowledge.**  `(restricted=False)`

Note that the bot now answered the question about Euler's identity (the citation is None).  It still answers from the given indexed knowledge when that knowledge is available.

In [12]:
unknown = "What is Euler identity?"
answer2 = await bot.ask(unknown, restricted=False)
print(f"{answer2.content} [{answer2.citation}]")

Euler's identity is a mathematical equation that establishes a deep relationship between several fundamental mathematical constants. It is often considered a remarkable and beautiful equation because it combines five of the most important numbers in mathematics into a single simple equation: the numbers 0, 1, pi, e and the imaginary unit i. The identity is written as e^(i*pi) + 1 = 0. This equation is a special case of Euler's formula, which states that for any real number x, e^(ix) = cos(x) + i*sin(x). [None]


In [13]:
question = "Do you know any point of interests in Michigan?"
answer = await bot.ask(question, restricted=False)
print(f"{answer.content} [{answer.citation}]")

One notable point of interest in Michigan is the Interlochen Center for the Arts and Interlochen State Park, which were served by the M-137 state trunkline highway. This highway ran north through the community of Interlochen until it was decommissioned in 2020 . [7751000]


**We can remove the facts and it will answer based on its pretrained knowledge.**

Here we removed the index, so the bot now answers from its pretrained knowledge alone.

In [14]:
question = "Do you know any point of interests in Michigan?"
bot.withFacts(None)
answer = await bot.ask(question, restricted=False)
print(f"{answer.content} [{answer.citation}]")

Yes, there are many points of interest in Michigan. Here are a few:

1. The Henry Ford Museum: This museum in Dearborn is dedicated to American innovation, particularly in the field of technology.

2. Detroit Institute of Arts: Home to one of the best art collections in the United States, the DIA offers a diverse range of exhibitions and educational programs.

3. Mackinac Island: Known for its historic sites, horse-drawn carriages, and fudge shops. The island is also home to Fort Mackinac, which was built by the British during the American Revolutionary War.

4. Pictured Rocks National Lakeshore: Located on the Lake Superior shoreline, this area offers stunning natural beauty, including sand dunes, waterfalls, and colorful mineral-stained cliffs.

5. Sleeping Bear Dunes National Lakeshore: This park features towering sand dunes, beautiful beaches, and clear blue water. The Sleeping Bear Dunes also offer excellent hiking and camping opportunities.

6. Detroit's Motown Museum: This museu