# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

I chose a tourism dataset because it contains structured travel information
for each destination: best season, recommended audience, budget, and attractions.
The base model does not always cover this kind of domain-specific data well,
so supplying it as context lets the chatbot give more accurate and relevant
recommendations to users.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [4]:
import pandas as pd

df = pd.read_csv("data/tourism_dataset.csv")
df.head()

,text
0,Destination: Japan. Best season: spring (March...
1,Destination: Morocco. Best season: April to Ju...
2,Destination: Thailand. Best season: November t...
3,Destination: Italy. Best season: April to Octo...
4,Destination: Canada. Best season: September to...


In [5]:
len(df)

25

In [7]:
print(df.isnull().sum())

text    0
dtype: int64


In [8]:
df['text'] = df['text'].str.strip()
df.head()

,text
0,Destination: Japan. Best season: spring (March...
1,Destination: Morocco. Best season: April to Ju...
2,Destination: Thailand. Best season: November t...
3,Destination: Italy. Best season: April to Octo...
4,Destination: Canada. Best season: September to...


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [10]:
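import os
import openai

openai.api_base = "https://openai.vocareum.com/v1"
# Read the key from the environment rather than hard-coding a secret in the notebook
# (the variable name OPENAI_API_KEY is an assumption; use whatever name you export)
openai.api_key = os.environ["OPENAI_API_KEY"]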

In [11]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.head()

,text,embeddings
0,Destination: Japan. Best season: spring (March...,"[0.0071809724904596806, -0.023471377789974213,..."
1,Destination: Morocco. Best season: April to Ju...,"[0.003225389402359724, -0.014273237437009811, ..."
2,Destination: Thailand. Best season: November t...,"[-0.0004952811868861318, -0.03182801231741905,..."
3,Destination: Italy. Best season: April to Octo...,"[0.01141958124935627, -0.03318019211292267, 0...."
4,Destination: Canada. Best season: September to...,"[0.0035308916121721268, -0.028839368373155594,..."
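
With 25 rows and `batch_size = 100`, the loop above sends a single request. The sketch below isolates the batching pattern so it can be checked without calling the API. `batched`, `embed_all`, and `fake_embed_batch` are hypothetical names introduced only for illustration; `fake_embed_batch` stands in for the real embedding request and returns made-up one-dimensional vectors.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(texts, embed_batch, batch_size=100):
    """Embed `texts` batch by batch and return one flat list of vectors."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


def fake_embed_batch(batch):
    # Hypothetical stand-in for the API call: one made-up 1-D vector per text
    return [[float(len(text))] for text in batch]


texts = [f"row {i}" for i in range(25)]
vectors = embed_all(texts, fake_embed_batch, batch_size=10)
print(len(vectors))  # one vector per input row
```

Passing the embedding call in as a function keeps the batching logic testable offline; in the notebook, that function would wrap `openai.Embedding.create`.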


In [12]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
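
To make the `distances` column concrete, here is a minimal pure-Python sketch of cosine distance and the ranking it produces. It assumes that the `"cosine"` metric means one minus the cosine similarity, which is my understanding of what `distances_from_embeddings` computes. `cosine_distance` and `rank_rows` are hypothetical helper names, and the two-dimensional vectors are toy data rather than real embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_rows(query_vec, row_vecs):
    """Return row indices ordered from smallest to largest distance, plus the distances."""
    distances = [cosine_distance(query_vec, vec) for vec in row_vecs]
    order = sorted(range(len(row_vecs)), key=lambda i: distances[i])
    return order, distances


# Toy data: row 1 points the same way as the query, row 0 is orthogonal to it
query = [1.0, 0.0]
rows = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
order, distances = rank_rows(query, rows)
print(order)
```

A smaller distance means the row points in a direction closer to the question's embedding, which is why the dataframe is sorted in ascending order.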


In [13]:
question = "I am looking for a destination with a hight budget."
relevant_destinations = get_rows_sorted_by_relevance(question, df)

# Show the top 5 most relevant destinations
relevant_destinations[["text", "distances"]].head(5)

,text,distances
19,Destination: Switzerland. Best season: Decembe...,0.187535
15,Destination: Indonesia (Bali). Best season: Ma...,0.192354
14,Destination: United States (West). Best season...,0.194021
9,Destination: Greece. Best season: May to Septe...,0.194623
5,Destination: Spain. Best season: May to Septem...,0.195084


In [14]:
df.iloc[19]["text"]

'Destination: Switzerland. Best season: December to March (ski) or June to September (hiking). Recommended for: families, nature lovers. Budget: high. Attractions: Alps, Zurich, Geneva.'
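
The row above follows a `Key: value.` pattern, so it can be split back into fields when a structured view is useful, for example to check a row's budget. This is a rough sketch that assumes every row uses the same five labels seen in the Switzerland row; `FIELDS` and `parse_row` are hypothetical names and are not part of the notebook's pipeline.

```python
import re

# Field labels as they appear in the Switzerland row; assumed to hold for all rows
FIELDS = ["Destination", "Best season", "Recommended for", "Budget", "Attractions"]

# Split on any known label followed by a colon, keeping the label via the capture group
FIELD_PATTERN = re.compile("(" + "|".join(re.escape(f) for f in FIELDS) + r"):\s*")


def parse_row(text):
    """Split one 'Key: value.' row into a dict mapping label to value."""
    parts = FIELD_PATTERN.split(text)
    # parts looks like ['', label1, value1, label2, value2, ...]
    record = {}
    for label, value in zip(parts[1::2], parts[2::2]):
        record[label] = value.strip().rstrip(".")
    return record


row = (
    "Destination: Switzerland. Best season: December to March (ski) or "
    "June to September (hiking). Recommended for: families, nature lovers. "
    "Budget: high. Attractions: Alps, Zurich, Geneva."
)
record = parse_row(row)
print(record["Budget"])
```

Splitting on the known labels rather than on periods keeps values such as the season description intact even when they contain punctuation.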

In [15]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = (
        len(tokenizer.encode(prompt_template))
        + len(tokenizer.encode(question))
    )
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
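
The context-selection loop in `create_prompt` can be tried in isolation with a stand-in token counter. The sketch below mirrors that loop under stated assumptions: `count_tokens` counts whitespace-separated words rather than real `cl100k_base` tokens, and `select_context` and its `overhead` parameter (the token cost of the prompt template) are hypothetical names.

```python
def count_tokens(text):
    # Stand-in tokenizer: counts whitespace-separated words, not real model tokens
    return len(text.split())


def select_context(rows, question, overhead, max_token_count):
    """Keep rows, in relevance order, until adding the next one would exceed the budget."""
    used = overhead + count_tokens(question)
    selected = []
    for text in rows:
        used += count_tokens(text)
        if used > max_token_count:
            break
        selected.append(text)
    return selected


rows = ["a b c", "d e", "f g h i"]  # already sorted from most to least relevant
print(select_context(rows, "q1 q2", overhead=3, max_token_count=10))
```

Because the rows arrive sorted by relevance, stopping at the first row that does not fit keeps only the most relevant context. The sketch ignores the `###` separator that joins rows, as `create_prompt` itself appears to do, so a real prompt can run slightly over the budget.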
    

In [16]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

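def answer_question(question, df, max_prompt_tokens=1800, max_answer_tokens=150):
    """
    Answer a question using the most relevant dataset rows as context.
    Returns an empty string if the API call fails.
    """
    prompt = create_prompt(question, df, max_prompt_tokens)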
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""


In [24]:
def basic_answer_question(question, max_tokens=50):
    """
    Get a direct answer from the model without adding dataset context.
    """
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=question,
            max_tokens=max_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Performance comparison

In [25]:
# Test 1
q1 = "Suggest two destinations where I could eat delicious food."

print("Question 1:", q1)
print("\nBasic Answer:", basic_answer_question(q1))
print("\nCustom Answer:", answer_question(q1, df))
print("-" * 50)

Question 1: Suggest two destinations where I could eat delicious food.

Basic Answer: 1) Tokyo, Japan - known for its variety of gourmet restaurants and street food, Tokyo offers a wide range of delicious food such as sushi, ramen, okonomiyaki, and tempura. The city is also home to numerous Mich

Custom Answer: Spain and Thailand.
--------------------------------------------------


In [27]:
# Test 2
q2 = "I am wealthy and want to visit a luxurious and wealthy country. Which country would you suggest?"

print("Question 2:", q2)
print("\nBasic Answer:", basic_answer_question(q2))
print("\nCustom Answer:", answer_question(q2, df))
print("-" * 50)

Question 2: I am wealthy and want to visit a luxurious and wealthy country. Which country would you suggest?

Basic Answer: There are many luxurious and wealthy countries in the world, each with its own unique offerings. Here are five options that might interest you:

1. Monaco: With its glamorous lifestyle, picturesque Mediterranean coastline, and exclusive yachts and casinos, Monaco is

Custom Answer: Based on the context provided, I would suggest Dubai (United Arab Emirates) as it is recommended for shopping, luxury, and families, and has a high budget. Other possible options could include Switzerland or Australia.
--------------------------------------------------


In [28]:
print("Welcome to your custom Tourism Chatbot!")
print("Type 'exit' to quit.\n")

while True:
    # Ask the user for a question
    user_question = input("Enter your question: ")
    
    # Stop if the user types 'exit'
    if user_question.lower() == "exit":
        print("Goodbye!")
        break
    
    # Get the answer from the chatbot
    answer = answer_question(user_question, df)
    
    # Print the answer
    print("\nChatbot Answer:", answer)
    print("-" * 50)

Welcome to your custom Tourism Chatbot!
Type 'exit' to quit.

Enter your question: Suggest a low-cost destination with good weather.

Chatbot Answer: Vietnam or Thailand
--------------------------------------------------
Enter your question: Is Morocco an affordable country for tourists?

Chatbot Answer: Based on the context given, it is likely that Morocco is an affordable destination for tourists with a medium budget. The recommended season for visiting Morocco is in April to June or September to November and it is recommended for both families and history enthusiasts, indicating that it is not overly expensive and has something to offer for different types of travelers. However, without knowing the specific price range for accommodations, attractions, and other expenses, it cannot be determined for certain if Morocco is affordable for all tourists.
--------------------------------------------------
Enter your question: I want to visit a country in spring with lots of cultural attrac