# Custom Chatbot Project

The dataset comprises the abstracts of the popular books listed on [www.goodreads.com](https://www.goodreads.com) for the years 2022 to 2024.

In [3]:
import os

from openai import OpenAI

# Vocareum endpoint used in place of the default OpenAI base URL
OPENAI_BASE_URL = "https://openai.vocareum.com/v1"

# Read the API key from the environment variable
client = OpenAI(
    api_key=os.getenv("VOC_OPENAI_API_KEY"),
    base_url=OPENAI_BASE_URL,
)

## Data Wrangling

In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [4]:

import pandas as pd
import requests
from bs4 import BeautifulSoup

years = ["2022", "2023", "2024"]

# Download the HTML files (one "popular by date" page per year)
for year in years:
    resp = requests.get(f"https://www.goodreads.com/book/popular_by_date/{year}")
    resp.raise_for_status()  # stop early if the download failed
    with open(f"goodreads_{year}.html", "w", encoding='utf-8') as f:
        f.write(resp.text)


In [26]:

# read the data
books_abstracts = []
for year in years:
    with open(f"goodreads_{year}.html", "r", encoding='utf-8') as f:
        html = f.read()

    # parse the HTML
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.find_all('article', class_='BookListItem'):

        title = item.find('h3', class_='Text Text__title3 Text__umber').text
        authors = item.find('div', class_='BookListItem__authors').text
        abstract = item.find('div', class_='TruncatedContent').text
        books_abstracts.append(f"{title} by {authors} ({year})\n{abstract}")


# save the data
df = pd.DataFrame(books_abstracts, columns=["text"])
df.to_csv("goodreads_books.csv", index=False)

print(f"Number of books: {len(df)}")

print(df.head())

Number of books: 45
                                                text
0  The Housemaid (The Housemaid, #1) by Freida Mc...
1  It Starts with Us (It Ends with Us, #2) by Col...
2  Reminders of Him by Colleen Hoover (2022)\nA t...
3  Book Lovers by Emily Henry (2022)\nOne summer....
4  Tomorrow, and Tomorrow, and Tomorrow by Gabrie...


In [27]:
# code taken from the class notebook casestudy.ipynb
# Create a list to store the embeddings
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = client.embeddings.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        model=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data.embedding for data in response.data])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.to_csv("goodreads_books_embedded.csv", index=False)
df.head()

Unnamed: 0,text,embeddings
0,"The Housemaid (The Housemaid, #1) by Freida Mc...","[-0.009157824330031872, -0.02493787370622158, ..."
1,"It Starts with Us (It Ends with Us, #2) by Col...","[0.001358734560199082, -0.0032529558520764112,..."
2,Reminders of Him by Colleen Hoover (2022)\nA t...,"[-0.0019802830647677183, -0.01722043566405773,..."
3,Book Lovers by Emily Henry (2022)\nOne summer....,"[-0.010145358741283417, -0.01708478480577469, ..."
4,"Tomorrow, and Tomorrow, and Tomorrow by Gabrie...","[0.013451321050524712, -0.022161180153489113, ..."
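The batching loop above slices the rows into chunks of at most `batch_size` before each embeddings request. A minimal sketch of just that slicing, in plain Python and without any API call; the helper name `iter_batches` and the placeholder strings are assumptions made for illustration:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 45 placeholder texts, matching the row count printed earlier
texts = [f"book {i}" for i in range(45)]

batches_of_100 = list(iter_batches(texts, 100))  # batch size used in the notebook
batches_of_20 = list(iter_batches(texts, 20))    # smaller size, for comparison

print(len(batches_of_100), [len(b) for b in batches_of_100])
print(len(batches_of_20), [len(b) for b in batches_of_20])
```

With 45 rows and a batch size of 100, the whole dataset fits in a single request; the loop only matters for larger datasets.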


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [29]:
# code taken from the class notebook casestudy.ipynb and from https://github.com/openai/openai-python/blob/release-v0.28.0/openai/embeddings_utils.py
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def distances_from_embeddings(query_embedding,embeddings):
    distances = [
        1-cosine_similarity(query_embedding, embedding)
        for embedding in embeddings
    ]
    return distances

def get_embedding(text, engine=EMBEDDING_MODEL_NAME):
    return client.embeddings.create(
        input=[text],
        model=engine
    ).data[0].embedding

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
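To make the ranking concrete, here is a small sketch of cosine distance on toy two-dimensional vectors. It uses only the standard library rather than NumPy, and the vectors and labels are made-up illustrative values, not real embeddings:

```python
import math

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Toy 2-D "embeddings" (illustrative values only)
query = [1.0, 0.0]
rows = {
    "orthogonal": [0.0, 3.0],
    "same direction": [2.0, 0.0],
    "45 degrees off": [1.0, 1.0],
}

# Smaller distance means more relevant, so sort in ascending order
ranked = sorted(rows, key=lambda name: cosine_distance(query, rows[name]))
for name in ranked:
    print(f"{name}: {cosine_distance(query, rows[name]):.3f}")
```

The vector pointing in the same direction as the query comes first regardless of its length, because cosine distance ignores magnitude.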

In [30]:
ranking = get_rows_sorted_by_relevance("A book about love", df)
ranking.head()

Unnamed: 0,text,embeddings,distances
3,Book Lovers by Emily Henry (2022)\nOne summer....,"[-0.010145358741283417, -0.01708478480577469, ...",0.164597
30,Funny Story by Emily Henry (2024)\nA shimmerin...,"[0.0001619218965061009, -0.020004400983452797,...",0.170323
26,Hello Beautiful by Ann Napolitano (2023)\nAn e...,"[-0.023775337263941765, -0.0016680326079949737...",0.177523
21,"Divine Rivals (Letters of Enchantment, #1) by ...","[0.0008809841237962246, -0.028963107615709305,...",0.181153
27,The Seven Year Slip by Ashley Poston (2023)\nA...,"[0.00316825695335865, -0.008537117391824722, 0...",0.183407


In [None]:
# based on the class notebook casestudy.ipynb 

import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
You are an expert for popular books published in the years 2022, 2023 and 2024.
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
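The context selection in `create_prompt` is a greedy fill against a token budget. The sketch below isolates that logic so it can be run offline. It assumes a stand-in tokenizer that counts whitespace-separated words instead of `tiktoken` tokens, and `select_context` is a hypothetical helper name:

```python
def count_tokens(text):
    # Stand-in tokenizer: counts whitespace-separated words, NOT real tokens
    return len(text.split())

def select_context(ranked_texts, fixed_token_count, max_token_count):
    """Return the leading texts that fit in the budget next to the fixed part."""
    current = fixed_token_count  # tokens already used by template and question
    selected = []
    for text in ranked_texts:
        current += count_tokens(text)
        if current > max_token_count:
            break  # stop at the first row that overflows the budget
        selected.append(text)
    return selected

# Texts are assumed to be sorted from most to least relevant
ranked_texts = ["one two three", "four five", "six seven eight nine", "ten"]
selected = select_context(ranked_texts, fixed_token_count=4, max_token_count=10)
print(selected)
```

Note that the loop stops at the first row that does not fit, so a shorter, less relevant row later in the ranking (here `"ten"`) is not used to fill the remaining budget.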

In [10]:
# based on the class notebook casestudy.ipynb 

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = client.completions.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""
    
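def answer_question_with_model_only(
    question, max_answer_tokens=150
):
    """
    Baseline: given a question, return the answer from the OpenAI
    Completion model alone, without any retrieved context
    
    If the model produces an error, return an empty string
    """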
    try:
        response = client.completions.create(
            model=COMPLETION_MODEL_NAME,
            prompt="You are an expert for popular books published in the last 3 years. Answer the following question: " + question,
            max_tokens=max_answer_tokens
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
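One way to keep such a comparison tidy is to run the same question through every variant and collect the answers side by side. This is only a sketch: `compare_answers` is a hypothetical helper, and the lambdas are offline stubs standing in for calls such as `answer_question(question, df)` and `answer_question_with_model_only(question)`:

```python
def compare_answers(question, answerers):
    """Run `question` through each named callable and collect the answers."""
    return {name: answer(question) for name, answer in answerers.items()}

# Stub answerers so the sketch runs without an API key (assumed placeholders)
answerers = {
    "custom (with context)": lambda q: f"[context-based answer to: {q}]",
    "model only": lambda q: f"[model-only answer to: {q}]",
}

results = compare_answers("Did Britney write anything recently?", answerers)
for name, answer in results.items():
    print(f"{name}: {answer}")
```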

### Question 1

In [18]:
answer_question("Can you recommend a detective story book written between 2022 and 2024?", df)

"I don't know. The latest book in the context is from 2021."

In [12]:
# run with an empty dataframe
answer_question("Can you recommend a detective story book?", df[0:0])

'Yes, I would definitely recommend "The Girl on the Train" by Paula Hawkins. It was published in 2015 and has received widespread acclaim for its gripping plot and twists that keep readers on the edge of their seats.'

In [13]:
answer_question_with_model_only("Can you recommend a detective story book?")

'Absolutely! "The Silent Patient" by Alex Michaelides is an incredibly gripping and twisty detective story that was published in 2019. It follows a criminal psychotherapist as he tries to uncover the truth behind a woman\'s mysterious silence and the murder of her husband. It\'s full of well-developed characters, unexpected plot twists, and a haunting atmosphere that will keep you on the edge of your seat until the very end. Other highly recommended detective story books from the past 3 years include "The Guest List" by Lucy Foley, "The Girl on the Train" by Paula Hawkins, and "The Woman in the Window" by A.J. Finn.'

### Question 2

In [17]:
answer_question("Did Britney write anything recently?", df)

'No, these descriptions do not mention anything recent by Britney Spears, so the question cannot be answered based on the context.'

In [15]:
# run with an empty dataframe
answer_question("Did Britney write anything recently?", df[0:0])

"I don't know"

In [16]:
answer_question_with_model_only("Did Britney write anything recently?")

'I am an AI and do not possess information on contemporary popular books and authors. However, according to a quick search, Britney Spears, a popular singer and celebrity, released her autobiography "A Mother\'s Gift" in 2001. There is no record of her publishing any recent books in the last three years.'