# Custom Chatbot Project

The dataset comprises the abstracts of the most popular books on [www.goodreads.com](https://www.goodreads.com) for the years 2022 to 2024.

In [None]:
import os

from openai import OpenAI

# Create a client for the Vocareum OpenAI endpoint;
# the API key is read from the environment variable
client = OpenAI(
    api_key=os.getenv("VOC_OPENAI_API_KEY"),
    base_url="https://openai.vocareum.com/v1",
)

## Data Wrangling

In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [None]:

import pandas as pd
import requests
from bs4 import BeautifulSoup

years = ["2022", "2023", "2024"]

# Download the HTML files
for year in years:
    resp = requests.get(f"https://www.goodreads.com/book/popular_by_date/{year}")
    resp.raise_for_status()  # stop here rather than saving an error page
    with open(f"goodreads_{year}.html", "w", encoding='utf-8') as f:
        f.write(resp.text)


In [47]:
import re

# read the data
list_items = []
for year in years:
    with open(f"goodreads_{year}.html", "r", encoding='utf-8') as f:
        html = f.read()

    # parse the HTML
    soup = BeautifulSoup(html, 'html.parser')
    list_items.extend(soup.find_all('article', class_='BookListItem'))

books_abstracts = [item.get_text("\n") for item in list_items]

# cleanup the text
pattern = re.compile(r'\s*(\d+\.\d+)\s*(\d+k).*\s*Want to read\s*')
books_abstracts = [pattern.sub(' ', item) for item in books_abstracts]

# save the data
df = pd.DataFrame(books_abstracts, columns=["text"])
df.to_csv("goodreads_books.csv", index=False)

print(f"Number of books: {len(df)}")

print(df.head())

Number of books: 45
                                                text
0  #\n1\nThe Housemaid (The Housemaid, #1)\nFreid...
1  #\n2\nIt Starts with Us (It Ends with Us, #2)\...
2  #\n3\nReminders of Him\nColleen Hoover\n4.36\n...
3  #\n4\nBook Lovers\nEmily Henry\n4.12\n1m\n \nr...
4  #\n5\nTomorrow, and Tomorrow, and Tomorrow\nGa...


In [50]:
# code taken from the class notebook casestudy.ipynb
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
# Create a list to store the embeddings
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = client.embeddings.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        model=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data.embedding for data in response.data])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.to_csv("goodreads_books_embedded.csv", index=False)
df.head()

Unnamed: 0,text,embeddings
0,"#\n1\nThe Housemaid (The Housemaid, #1)\nFreid...","[-0.004109004512429237, -0.0223678145557642, -..."
1,"#\n2\nIt Starts with Us (It Ends with Us, #2)\...","[-0.005668665282428265, 0.004265830852091312, ..."
2,#\n3\nReminders of Him\nColleen Hoover\n4.36\n...,"[0.003345254110172391, -0.012298277579247952, ..."
3,#\n4\nBook Lovers\nEmily Henry\n4.12\n1m\n \nr...,"[-0.0031754060182720423, -0.013626998290419579..."
4,"#\n5\nTomorrow, and Tomorrow, and Tomorrow\nGa...","[0.010921248234808445, -0.01308033149689436, -..."
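Because the dataframe is saved with `to_csv`, the `embeddings` column is written out as the text form of each list. If the file is read back later (for example in a new session), each cell arrives as a string and has to be parsed before any distance computation. The following is a minimal sketch of that step, assuming the cells look like the output above; `parse_embedding` is a hypothetical helper and not part of the class notebook.

```python
import json


def parse_embedding(cell):
    """Turn the CSV text form of an embedding back into a list of floats."""
    # A cell that is already a list (the dataframe never left memory)
    # is passed through unchanged.
    if isinstance(cell, list):
        return cell
    # The text form of a list of floats, e.g. "[0.1, -0.2]", is valid JSON.
    return [float(x) for x in json.loads(cell)]


# Example with a shortened, made-up cell value
print(parse_embedding("[-0.0041, 0.0223, 0.5]"))
```

With pandas this could be applied as `df["embeddings"] = df["embeddings"].apply(parse_embedding)` right after `pd.read_csv("goodreads_books_embedded.csv")`.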


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [59]:
# code taken from the class notebook casestudy.ipynb and from https://github.com/openai/openai-python/blob/release-v0.28.0/openai/embeddings_utils.py
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def distances_from_embeddings(query_embedding, embeddings):
    distances = [
        1-cosine_similarity(query_embedding, embedding)
        for embedding in embeddings
    ]
    return distances

def get_embedding(text, engine=EMBEDDING_MODEL_NAME):
    return client.embeddings.create(
        input=[text],
        model=engine
    ).data[0].embedding

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [65]:
ranking = get_rows_sorted_by_relevance("A book about love", df)
ranking.head()

Unnamed: 0,text,embeddings,distances
3,#\n4\nBook Lovers\nEmily Henry\n4.12\n1m\n \nr...,"[-0.0031754060182720423, -0.013626998290419579...",0.181074
30,#\n1\nFunny Story\nEmily Henry\n4.24\n719k\n \...,"[0.0029939222149550915, -0.02424975298345089, ...",0.181479
27,#\n13\nThe Seven Year Slip\nAshley Poston\n4.2...,"[0.003417385509237647, -0.007594190072268248, ...",0.187588
21,"#\n7\nDivine Rivals (Letters of Enchantment, #...","[0.0022449044045060873, -0.028142593801021576,...",0.18819
4,"#\n5\nTomorrow, and Tomorrow, and Tomorrow\nGa...","[0.010921248234808445, -0.01308033149689436, -...",0.195343
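The `distances` column above holds cosine distances: 0 for vectors pointing in the same direction and 1 for orthogonal vectors, so a smaller value means a closer match. The sketch below reproduces the ranking idea on made-up two-dimensional vectors, written without `numpy` purely for illustration; the vectors and the name `cosine_distance` are assumptions, not data from the notebook.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


# Toy "embeddings" (made up for this example)
query = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "45 degrees": [1.0, 1.0],
    "orthogonal": [0.0, 1.0],
}

# Sort ascending by distance, so the most relevant row comes first
ranked = sorted(rows, key=lambda name: cosine_distance(query, rows[name]))
print(ranked)
```

Note that the length of a vector does not matter here: `[2.0, 0.0]` is at distance 0 from the query even though it is twice as long.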


In [66]:
# based on the class notebook casestudy.ipynb 

import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
You are an expert for popular books published in the last 3 years.
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = (
        len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    )
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
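The context selection in `create_prompt` is a greedy fill: rows are taken in order of relevance until the next row would push the prompt over the token budget. The sketch below isolates that loop so it can be tried without an API key. `select_context` and `count_words` are hypothetical names, and the word-based counter is only a stand-in for the `cl100k_base` tokenizer used above.

```python
def select_context(texts, base_tokens, max_tokens, count_tokens):
    """Greedy selection mirroring the loop in create_prompt."""
    used = base_tokens  # tokens already spent on the template and question
    selected = []
    for text in texts:
        used += count_tokens(text)
        if used > max_tokens:
            break  # the first row that does not fit ends the selection
        selected.append(text)
    return selected


def count_words(text):
    """Stand-in token counter: one token per whitespace-separated word."""
    return len(text.split())


rows = ["one two three", "four five six seven", "eight"]
print(select_context(rows, base_tokens=2, max_tokens=6, count_tokens=count_words))
```

Two properties of this approach are worth keeping in mind. The loop stops at the first row that does not fit, so a shorter but less relevant row further down is never considered. The `"\n\n###\n\n"` separators added when joining the context are also not counted, so the budget is approximate rather than exact.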

In [81]:
# based on the class notebook casestudy.ipynb 

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = client.completions.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""
    
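def answer_question_with_model_only(
    question, max_answer_tokens=150
):
    """
    Given a question, return the answer from the OpenAI Completion model
    alone, without any context from the dataset

    If the model produces an error, return an empty string
    """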
    try:
        response = client.completions.create(
            model=COMPLETION_MODEL_NAME,
            prompt="You are an expert for popular books published in the last 3 years. Answer the following question: " + question,
            max_tokens=max_answer_tokens
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
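One way to keep the two answers for a question together is a small helper that calls both functions and returns the results side by side. This is only a sketch: `compare_answers` is a hypothetical name, and the stub functions stand in for `answer_question` and `answer_question_with_model_only` so that the example runs without an API key.

```python
def compare_answers(question, custom_answer, baseline_answer):
    """Collect the custom and the baseline answer for one question."""
    return {
        "question": question,
        "custom": custom_answer(question),
        "baseline": baseline_answer(question),
    }


# Stub answer functions (placeholders, not real model output)
def fake_custom(question):
    return "answer using the dataset"


def fake_baseline(question):
    return "answer from the model alone"


result = compare_answers("Which book?", fake_custom, fake_baseline)
for key, value in result.items():
    print(f"{key}: {value}")
```

In this notebook the call could look like `compare_answers(q, lambda q: answer_question(q, df), answer_question_with_model_only)`.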

### Question 1

In [75]:
answer_question("Can you reccomend a detective story book?", df)

'"Listen for the Lie" by Amy Tintera or "Never Lie" by Freida McFadden could both be good options for a detective story reccomendation.'

In [None]:
# run with an empty dataframe
answer_question("Can you reccomend a detective story book?", df[0:0])

"I don't know."

In [82]:
answer_question_with_model_only("Can you reccomend a detective story book?")

'Yes, I can recommend "The Woman in the Window" by A.J. Finn. It is a gripping thriller about an agoraphobic woman who witnesses a crime from her window. As she tries to uncover the truth, she begins to question her own sanity. The twists and turns in this book will keep readers on the edge of their seats until the very end. It was published in 2018 and became an instant New York Times bestseller. It has also been optioned for a movie adaptation.'

### Question 2

In [None]:
answer_question("Did Britney write anything recently?", df)

'Yes, Britney Spears wrote "The Woman in Me" in the last 3 years.'

In [None]:
# run with an empty dataframe
answer_question("Did Britney write anything recently?", df[0:0])

"I don't know."

In [83]:
answer_question_with_model_only("Did Britney write anything recently?")

'No, Britney Spears has not published any books in the last 3 years. Her most recent book, "A Mother\'s Gift," was published in 2001. She has stated in interviews that she has no current plans to write any new books.'