# Demo (Basic Function)

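This notebook demonstrates the basic operation of Knowledge Q&A: indexing a small dataset, searching it by embedding, and answering questions with retrieval-augmented generation (RAG).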

## Prerequisites

- Marqo server running
- OpenAI API key set in the environment variable `OPENAI_API_KEY`


In [1]:
import os
import json

In [2]:
from cjw.knowledgeqa.indexer.Indexer import Indexer
from cjw.knowledgeqa.indexer.MarqoIndexer import MarqoIndexer
from cjw.knowledgeqa.bots.GptBot import GptBot

## Set up the Environment

Adjust these variables to match your local setup if necessary.

In [3]:

HOME = os.path.expanduser("~")
PROJECT_DIR = f"{HOME}/IdeaProjects/knowledgeqa"
DATA_FILE = f"{PROJECT_DIR}/data/simple.json"

MARQO_SERVER = 'http://localhost:8882'
TEST_INDEX_NAME = "test_indexer"

RAG_KNOWLEDGE = 5       # Number of facts to retrieve from the index and pass to the LLM
GPT_TEMPERATURE = 0.6
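Before running the remaining cells, it can help to confirm that the environment variable listed under Prerequisites is actually set. The following is a minimal sketch; `missing_env_vars` is a hypothetical helper introduced here for illustration and is not part of the `knowledgeqa` package.

```python
import os

# Hypothetical helper (not part of knowledgeqa): report which required
# environment variables are missing or empty.
def missing_env_vars(names, env=None):
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```

The optional `env` argument only exists so the check can be tried against a plain dictionary instead of the real environment.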

## Index the Dataset

Read a tiny Wikipedia dataset. It contains the first few articles from the first file in [this Kaggle dataset](https://www.kaggle.com/datasets/ltcmdrdata/plain-text-wikipedia-202011?resource=download).

In [4]:
try:
    index = MarqoIndexer.new(MARQO_SERVER, TEST_INDEX_NAME)
except Indexer.IndexExistError as e:
    print(f"Index exists: {e}")
    index = MarqoIndexer(MARQO_SERVER, TEST_INDEX_NAME)

Index exists: Index test_indexer exists


In [5]:
# Insert test data
with open(DATA_FILE, "r") as fd:
    data = json.load(fd)

status = await index.add(data, keyFields=["title", "text"], idField="id")
print(f"{len(status['items'])} items updated")

2024-01-07 16:03:36,385 logger:'marqo' INFO     add_documents batch 0: took 29.998s for Marqo to process & index 20 docs. Roundtrip time: 30.025s.


20 items updated
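The `add` call above names `title` and `text` as key fields and `id` as the ID field, so every record is presumably expected to carry those three keys. A quick pre-flight check can catch incomplete records before they reach the index. This is only a sketch: it assumes the data file is a JSON list of objects, and `find_invalid_records` is a hypothetical helper, not part of `knowledgeqa`.

```python
# Hypothetical pre-flight check; assumes the data is a list of dictionaries.
def find_invalid_records(records, required=("id", "title", "text")):
    """Return (position, missing_fields) for each record lacking a required field."""
    problems = []
    for position, record in enumerate(records):
        missing = [field for field in required if field not in record]
        if missing:
            problems.append((position, missing))
    return problems

# Made-up sample records, for illustration only.
sample = [
    {"id": "1", "title": "First", "text": "Some text"},
    {"id": "2", "title": "Second"},  # "text" is missing
]
print(find_invalid_records(sample))  # prints [(1, ['text'])]
```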


In [6]:
def showResults(results: list):
    # Print each search hit (ID, score, title, highlights) in a readable form
    for r in results:
        print(f"id={r['_id']} score={r['_score']} title={r['title']}\n{r['_highlights']}\n")

In [7]:
nearestEmbedding = await index.search("What is M-137?", top=RAG_KNOWLEDGE)
showResults(nearestEmbedding)

id=7751000 score=0.81023204 title=M-137 (Michigan highway)
{'text': 'There M-137 ran almost due north before terminating at its connection with the rest of the state trunkline system at US 31 at Interlochen Corners. The roadway continues north of US 31 as South Long Lake Road after the M-137 designation ended.'}

id=7751062 score=0.6161333 title=Ghelamco Arena
{'text': 'Gent'}

id=7751190 score=0.58454525 title=Diego, Prince of Asturias
{'text': '==Ancestry== Category:1575 births Category:1582 deaths Category:16th-century House of Habsburg Category:Princes of Asturias Category:Dukes of Montblanc Category:Princes of Portugal Category:Spanish infantes Category:Portuguese infantes Category:Heirs apparent who never acceded'}

id=7751199 score=0.5838754 title=Union College, University of Queensland
{'title': 'Union College, University of Queensland'}

id=7751172 score=0.5790509 title=Racine Lutheran High School
{'title': 'Racine Lutheran High School'}
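The scores above are returned by Marqo, and this notebook does not show how they are computed. As a generic illustration of nearest-embedding search, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the top `k`. The vectors and names are made up for the example and do not come from the dataset.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, documents, k):
    """Return the k (doc_id, score) pairs with the highest similarity to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (illustrative values only).
toy_documents = {
    "highway": [0.9, 0.1, 0.0],
    "stadium": [0.2, 0.8, 0.1],
    "college": [0.1, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

for doc_id, score in top_k(toy_query, toy_documents, k=2):
    print(f"id={doc_id} score={score:.3f}")
```

As in the real output, a clearly relevant document stands out with a much higher score than the rest, while the remaining hits are returned only because `top` asks for a fixed number of results.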



In [8]:
# Print the top hit for reference later in the demo.
topPick = nearestEmbedding[0]
print(f"ID: {topPick['_id']}")
print(f"Title: {topPick['title']}")
print(f"Text: {topPick['text']}")

ID: 7751000
Title: M-137 (Michigan highway)
Text: M-137 was a state trunkline highway in the US state of Michigan that served as a spur route to the Interlochen Center for the Arts and Interlochen State Park. It started south of the park and ran north between two lakes in the area and through the community of Interlochen to US Highway 31 (US 31) in Grand Traverse County. The highway was first shown without a number label on maps in 1930 and labeled after an extension the next year. The highway's current routing was established in the 1950s. Jurisdiction of the roadway was transferred from the Michigan Department of Transportation (MDOT) to the Grand Traverse County Road Commission in June 2020, and the highway designation was decommissioned in the process; signage was removed by August 2020 to reflect the changeover. ==Route description== M-137 began at the southern end of Interlochen State Park at an intersection with Vagabond Lane. Farther south, the roadway continues toward Green La

## RAG Q&A

Use GPT-4 with the retrieved articles as context for RAG.

First, we ask about something that is mentioned in the facts.

In [9]:
bot = GptBot.of("gpt4").withFacts(index, contentFields=["title", "text"], top=RAG_KNOWLEDGE)


In [10]:
question = "Do you know any point of interests in Michigan?"
answer = await bot.ask(question)
print(f"{answer.content} [{answer.citation}]")

Token indices sequence length is longer than the specified maximum sequence length for this model (3539 > 1024). Running this sequence through the model will result in indexing errors


Yes, one point of interest in Michigan is the M-137, a state trunkline highway that served as a spur route to the Interlochen Center for the Arts and Interlochen State Park. [7751000]
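The warning in the output above indicates that some tokenized text was longer than a tokenizer's configured maximum (3539 versus 1024 tokens). The notebook does not show how `GptBot` deals with this. One common way to keep retrieved facts within a budget is to keep them in ranked order until the budget is used up, as in the sketch below. `trim_facts` is a hypothetical helper, and counting whitespace-separated words is only a crude stand-in for real token counting.

```python
# Hypothetical helper (not part of knowledgeqa): keep facts, in ranked order,
# until a rough word budget is used up. Words are a crude proxy for tokens.
def trim_facts(facts, max_words):
    kept, used = [], 0
    for fact in facts:
        words = len(fact.split())
        if used + words > max_words:
            break  # stop at the first fact that would exceed the budget
        kept.append(fact)
        used += words
    return kept

facts = ["one two three", "four five", "six seven eight nine"]
print(trim_facts(facts, max_words=5))  # prints ['one two three', 'four five']
```

Stopping at the first fact that does not fit, rather than skipping it and trying later ones, preserves the ranking: a lower-ranked fact never displaces a higher-ranked one.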


Next, we ask a question that the facts do not cover.

In [11]:
unknown = "What is Euler identity?"
answer2 = await bot.ask(unknown)
print(f"{answer2.content} [{answer2.citation}]")

I don't know [--]


We can also ask the bot not to restrict itself to the given facts, and instead to use them to supplement its pre-trained knowledge.

In [12]:
unknown = "What is Euler identity?"
answer2 = await bot.ask(unknown, restricted=False)
print(f"{answer2.content} [{answer2.citation}]")

Euler's Identity is a mathematical equation that beautifully connects several fundamental mathematical constants. The equation is e^(iπ) + 1 = 0. In this equation, e is the mathematical constant approximately equal to 2.71828, i is the imaginary unit, which satisfies the equation i^2 = -1, and π is the ratio of the circumference of a circle to its diameter, approximately equal to 3.14159. This identity is named after the Swiss mathematician Leonhard Euler. [None]
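The difference between the restricted and unrestricted answers can be pictured as a difference in the instructions sent to the model alongside the retrieved facts. The sketch below is one hypothetical way to assemble such a prompt; it is not the actual `GptBot` implementation, and `build_prompt`, its wording, and the fact layout are all assumptions made for illustration.

```python
# Hypothetical prompt builder; NOT the actual GptBot implementation.
def build_prompt(question, facts, restricted=True):
    lines = []
    if facts:
        lines.append("Facts:")
        # Prefix each fact with its ID so the model can cite it.
        lines.extend(f"[{f['id']}] {f['title']}: {f['text']}" for f in facts)
    if restricted:
        lines.append("Answer using only the facts above. "
                     "If they do not contain the answer, reply: I don't know.")
    elif facts:
        lines.append("Prefer the facts above where relevant; "
                     "otherwise answer from your own knowledge.")
    else:
        lines.append("Answer from your own knowledge.")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

# Example fact taken from the search result shown earlier in this notebook.
example_facts = [{
    "id": "7751000",
    "title": "M-137 (Michigan highway)",
    "text": "M-137 was a state trunkline highway in the US state of Michigan.",
}]
print(build_prompt("What is Euler identity?", example_facts, restricted=False))
```

Tagging each fact with its ID is what would let an answer carry a citation such as `[7751000]`, while an answer drawn from pre-trained knowledge has no fact to cite.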


In [13]:
question = "Do you know any point of interests in Michigan?"
answer = await bot.ask(question, restricted=False)
print(f"{answer.content} [{answer.citation}]")

Yes, one of the points of interest in Michigan was the M-137 state trunkline highway that served as a spur route to the Interlochen Center for the Arts and Interlochen State Park. It started south of the park and ran north between two lakes in the area and through the community of Interlochen to US Highway 31 (US 31) in Grand Traverse County. However, this highway was decommissioned and its signage was removed by August 2020. [7751000]


We can also remove the facts, and the bot will then answer from its pre-trained knowledge alone.

In [14]:
question = "Do you know any point of interests in Michigan?"
bot.withFacts(None)
answer = await bot.ask(question, restricted=False)
print(f"{answer.content} [{answer.citation}]")

Yes, there are numerous points of interest in Michigan. Here are a few:

1. Pictured Rocks National Lakeshore: This is a U.S. National Lakeshore on the shore of Lake Superior. It offers spectacular scenery, hiking trails, and kayaking.

2. Mackinac Island: It's a place known for its iconic 18th-century fort, stunning water views, historic sites and fudge. Automobiles are not allowed on the island, so horse-drawn carriages, bicycles, and walking are the main modes of transportation.

3. The Henry Ford Museum: This is a large indoor and outdoor history museum complex. It also includes the Henry Ford Museum of American Innovation and Greenfield Village.

4. Detroit Institute of Arts: This is one of the premier art museums in the United States and home to more than 65,000 works of art.

5. Sleeping Bear Dunes National Lakeshore: It is famous for its sand dunes that are as high as 460 feet above Lake Michigan.

6. Detroit's Motown Museum: The museum is located in the house where Berry Gordy