## Setup

Load environment variables from your `.env` file, such as the `OPENAI_API_KEY` that the OpenAI client reads later.

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

## Load Data Set

Load a data set that contains Amazon reviews for the [Ticket to Ride board game](https://www.amazon.com/Days-Wonder-DO7201-Ticket-Ride/dp/0975277324/ref=sr_1_1?crid=3H3EQBAOZMSHV&keywords=0975277324&qid=1694895498&sprefix=0975277324%2Caps%2C99&sr=8-1&th=1).

<img src="data/ticket_to_ride.jpg" width="500" height="500">

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', 800)
dataset = pd.read_json("data/ticket_to_ride_amazon_reviews.json.gz")
dataset.head(3)

index,rating,reviewText
0,5,"My 10 year old son and I are crazy about board games. Since we got this one the others have been gathering dust. We play at least once a week, and the rest of the week he's strategizing for the next time we'll play. Not only is it a fantastic lesson on urban planning but also geography. I highly recommend this game for a great family experience!"
1,5,"When I first heard about Days of Wonder's newest game, Ticket to Ride (Days of Wonder, 2004 - Alan Moon), I was excited. But how could I not be - for all of Days of Wonders games so far have been fabulous, and it seems that each successive game gets better and better. And Alan Moon with a train game (shades of Union Pacific) sounded like a winning combination. I had an opportunity to play the final version of the game, and was quite impressed with how the game looked.\n\nAnd is the game any good? The short answer is that once you play this game, you'll never play TransAmerica again. It's a fantastic medium-weight game - one that plays equally well with two to five players. The components are superb, the artwork is great, the game is downright fun (and nasty sometimes), and the total pa..."
2,5,"This is an excellent game for 2-5 players, and especially for kids. I was initially put off by the theme and the mentioning of winning the $1 million dollar prize on the advertisement. I immediately thought there would be paper money involved and it would be like Monopoly where I would have to be a banker,,,and yech. But no, I was WRONG. There is no fake money involved! Just little cards with different colors. Easy, but challenging. It is a game where you can be nice and not mess with your opponents supposed destination, or you can try and block them. You get points according to the tracks you lay and connections you make.\n But please, I love Amazon, but their Days of Wonder games are all overpriced. You can get this and their other games much cheaper on designated game sites...."


In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4166 entries, 0 to 4165
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   rating      4166 non-null   int64 
 1   reviewText  4166 non-null   object
dtypes: int64(1), object(1)
memory usage: 65.2+ KB


## Store Data in Vector Database

Instantiate an in-memory [ChromaDB](https://www.trychroma.com/) vector database client and create a new, empty collection.

In [4]:
import chromadb
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="ticket_to_ride_reviews")

Add reviews to the vector database. Since we aren't providing our own custom embeddings, ChromaDB uses the [Sentence Transformers](https://www.sbert.net/) `all-MiniLM-L6-v2` model to create the embeddings automatically.

In [5]:
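import time

# Attach each review's star rating as metadata so queries can filter on it later
metadata = [{"rating": x} for x in dataset.rating]

# Time the insert; ChromaDB computes an embedding for every document in this call
t0 = time.time()
collection.add(
    ids=[str(x) for x in dataset.index.tolist()],  # IDs must be strings
    documents=dataset.reviewText.tolist(),
    metadatas=metadata,
)
time.time() - t0  # elapsed seconds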

77.49635791778564
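
As a rough mental model of what the stored embeddings and rating metadata make possible, the toy sketch below filters a few documents by rating and then ranks them by cosine similarity to a query vector. Everything in it is an assumption for illustration: the 3-dimensional vectors, the `toy_store` list, and the `toy_query` helper are made up, and ChromaDB's real index and distance computation work differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical records: made-up 3-d "embeddings" standing in for real model output
toy_store = [
    {"id": "0", "rating": 5, "embedding": [0.9, 0.1, 0.0], "document": "Easy to learn."},
    {"id": "1", "rating": 1, "embedding": [0.8, 0.2, 0.1], "document": "Rules were confusing."},
    {"id": "2", "rating": 5, "embedding": [0.0, 0.2, 0.9], "document": "Beautiful artwork."},
]


def toy_query(query_embedding, min_rating, n_results):
    """Filter on the rating metadata, then return the most similar documents."""
    candidates = [r for r in toy_store if r["rating"] >= min_rating]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_embedding, r["embedding"]),
        reverse=True,
    )
    return [r["document"] for r in ranked[:n_results]]


# The 1-star review is excluded by the filter even though its vector is close to the query
top_documents = toy_query([1.0, 0.0, 0.0], min_rating=5, n_results=2)
print(top_documents)
```

The real `all-MiniLM-L6-v2` embeddings have 384 dimensions rather than 3, but the idea carries over: vectors that lie close together correspond to text with similar meaning.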

## Query Large Language Model

In [6]:
from langchain.llms import OpenAI
from langchain import PromptTemplate, LLMChain

llm = OpenAI()

Create the prompt template.

In [7]:
template = """This question is about the Ticket to Ride board game.

Question: {question}

Here are some Amazon reviews that are relevant to the question. Each review is separated by the text "<separator>":
{relevant_data}

Use your general knowledge and the information contained in these reviews to answer the question.
"""

prompt = PromptTemplate(template=template, input_variables=["question", "relevant_data"])
llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)
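
To see what the chain will send to the model, it can help to fill in the template by hand. The sketch below uses plain `str.format`, which should be roughly equivalent to what `PromptTemplate` does with its default format; the two sample reviews are hypothetical stand-ins for the documents that the vector database returns.

```python
# Same wording as the template above, repeated so this sketch is self-contained
sketch_template = """This question is about the Ticket to Ride board game.

Question: {question}

Here are some Amazon reviews that are relevant to the question. Each review is separated by the text "<separator>":
{relevant_data}

Use your general knowledge and the information contained in these reviews to answer the question.
"""

# Hypothetical reviews; in the notebook these come from the vector database query
sample_reviews = ["Easy to learn.", "Took one game to get the rules."]

# Join the reviews into one string, delimited the way the template describes
sample_relevant_data = "<separator>".join(sample_reviews)

sample_prompt = sketch_template.format(
    question="Is the game hard to understand?",
    relevant_data=sample_relevant_data,
)
print(sample_prompt)
```

Using an explicit separator string gives the model an unambiguous boundary between reviews, which matters because individual reviews can contain their own line breaks.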

Take the user's question, look up relevant reviews in the vector database, fill in the prompt, and send it to the OpenAI API.

In [13]:
question = "Is the game hard to understand?"
rating_expression = { "$gte": 5 } # try this as well: { "$lte": 1 }

# Query vector DB
vector_db_results = collection.query(
    query_texts=[question],
    n_results=20,
    where={ "rating": rating_expression }
)
relevant_data = "<separator>".join(vector_db_results["documents"][0])

# Send prompt to the OpenAI API and print the result
result = llm_chain.run(question=question, relevant_data=relevant_data)
print(result)



> Entering new LLMChain chain...
Prompt after formatting:
This question is about the Ticket to Ride board game.

Question: Is the game hard to understand?

Here are some Amazon reviews that are relevant to the question. Each review is separated by the text "<separator>":
This was a gift for my son and daughter-in-law. We played it while they were visiting and had a good time. It is not difficult to learn, but it can be challenging to outwit the other players. I would recommend this game.<separator>Got this as a present for my husband and he is in love with the game.. It requires a mix of strategy and luck.. A bit difficult to understand at first but once you start playing it becomes super easy. Played with my 12 year old brother who also loves the game.<separator>Got this game for the family and everyone loves it.  From kids to grand parents, it is easy to understand, fund to play and requires some stratigic thinking.<separator>The game is kinda easy but yet inte