# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task.

This custom chatbot answers questions about Dr Kenneth Y. Wertheim's achievements and current work. The data-wrangling code scrapes Dr Wertheim's personal website to build a custom dataset. The website is a suitable source because Dr Wertheim created and maintains it themselves.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

This part reuses code from the "Building a Simple Web Scraper" and "Build a Custom OpenAI Chatbot with ML-Driven Prompt Engineering" exercises.

In [2]:
#Import the web-scraping libraries.
import requests
from bs4 import BeautifulSoup

#This function takes a URL, fetches the page, and returns the response body.
def fetch_page(url: str) -> str:
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
    }
    #The timeout stops a stalled request from hanging the notebook.
    r = requests.get(url, headers=headers, timeout=30)
    if r.status_code != 200:
        print(url, r.status_code) #Report a failed request, but still return the body.
    return r.text
    
#A list of URLs to be passed to fetch_page.
data_url = ["https://www.kywertheim.com/", "https://www.kywertheim.com/bio.html",
            "https://kywertheim.com/credentials.html", "https://kywertheim.com/projects.html",
            "https://kywertheim.com/schelling.html", "https://kywertheim.com/MAT.html",
            "https://kywertheim.com/DAIM.html", "https://kywertheim.com/MYCN.html",
            "https://kywertheim.com/primage.html", "https://kywertheim.com/NBclonal.html",
            "https://kywertheim.com/crest.html", "https://kywertheim.com/immune.html",
            "https://kywertheim.com/MLcapstone.html", "https://kywertheim.com/lymph.html",
            "https://kywertheim.com/earlyprojects.html", "https://kywertheim.com/EDIcourse.html",
            "https://kywertheim.com/AlumMentor.html", "https://kywertheim.com/EDIShef.html",
            "https://kywertheim.com/teach.html", "https://kywertheim.com/review.html",
            "https://kywertheim.com/works.html", "https://kywertheim.com/news.html",
            "https://kywertheim.com/oldnews.html"
           ]

#Fetch the pages.
mywebsite = [fetch_page(url) for url in data_url]
print(mywebsite)

['<!DOCTYPE html>\n\n<html>\n\n\t<head>\n\t\t<title>Dr Kenneth Y. Wertheim</title>\n\t\t<link href="main.css" rel="stylesheet" tpye="text/css">\n\t</head>\n\n\t<body>\n\n\t\t<div class="wrapper">\n\n\t\t\t<div class="header">\n\t\t\t\t<h1>Dr Kenneth Y. Wertheim</h1>\n\t\t\t\t<h2>Also known as 11250205</h2>\n\t\t\t</div>\n\n\t\t\t<div class="main">\n\t\t\t\t<p>I am a systems theorist with expertise in mathematical modelling, scientific computing, and machine learning. Biological and social systems are my main interests, but I develop applications of artificial intelligence too. When I am not theorising, I advocate linguistic justice and teach yoga.</p>\n\t\t\t\t<p>Aracial, acultural, and agender, I am a global citizen without a home country, but I am currently based in Kingston upon Hull in the UK. My correct pronouns are they and them. I am also known as 11250205.</p>\n\t\t\t\t<p>If you want to contact me, please find my email address here: <a href="personal/cv.pdf">link</a>.</p>\n\t\t

In [3]:
#This parser function extracts the text components from the fetched pages.
def parse_page(html_doc:str):
    soup = BeautifulSoup(html_doc, 'html.parser')
    ptags = soup.find_all('p') #First, find all the <p> tags.
    ptags_stripped = []
    for ptag in ptags:
        ptags_stripped.append(ptag.get_text(strip=True)) #Second, only keep the text of each <p> tag.
    return ptags_stripped

#Turn each useful page of my website into a list of stripped p tags.
all_pages = []
for page in mywebsite:
    all_pages.append(parse_page(page))

#Sanity check.
len(data_url) == len(all_pages)

True

In [4]:
#Inspection. The line 'Copyright ©\xa02017–2024 Kenneth Y. Wertheim' is repetitive and not helpful.
print(all_pages)

[['I am a systems theorist with expertise in mathematical modelling, scientific computing, and machine learning. Biological and social systems are my main interests, but I develop applications of artificial intelligence too. When I am not theorising, I advocate linguistic justice and teach yoga.', 'Aracial, acultural, and agender, I am a global citizen without a home country, but I am currently based in Kingston upon Hull in the UK. My correct pronouns are they and them. I am also known as 11250205.', 'If you want to contact me, please find my email address here:link.', 'Copyright ©\xa02017–2024 Kenneth Y. Wertheim'], ['I am a neurodivergent (autism), a global citizen, and an agender person.', "Please do not make assumptions about me, but do acknowledge the reality I am in. To me, the concepts of nationality, race, gender, and age; cultures and religions; and the institutions of marriage and family are relics of humanity's tribal past. I do not approve of what creates conflicts and hin

In [5]:
#Merge the p tags from all pages into one corpus and remove the unhelpful line.
corpus = []
for page in all_pages:
    for ptag in page:
        if ptag != 'Copyright ©\xa02017–2024 Kenneth Y. Wertheim':
            corpus.append(ptag)
print(corpus)

['I am a systems theorist with expertise in mathematical modelling, scientific computing, and machine learning. Biological and social systems are my main interests, but I develop applications of artificial intelligence too. When I am not theorising, I advocate linguistic justice and teach yoga.', 'Aracial, acultural, and agender, I am a global citizen without a home country, but I am currently based in Kingston upon Hull in the UK. My correct pronouns are they and them. I am also known as 11250205.', 'If you want to contact me, please find my email address here:link.', 'I am a neurodivergent (autism), a global citizen, and an agender person.', "Please do not make assumptions about me, but do acknowledge the reality I am in. To me, the concepts of nationality, race, gender, and age; cultures and religions; and the institutions of marriage and family are relics of humanity's tribal past. I do not approve of what creates conflicts and hinders evolution, so I do not define myself in such t
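The exact-match filter above only works while the footer reads 2017–2024, and it leaves repeated lines in the corpus: 'Kenneth Y. Wertheim.' later shows up as both row 259 and row 273 and appears twice in the prompt context. Below is a minimal sketch of a more tolerant clean-up, assuming the footer always has the shape 'Copyright © year[–year] Kenneth Y. Wertheim'. The helper name `clean_corpus` and the regular expression are illustrative and not part of the original notebook.

```python
import re

# Assumed footer shape; \s also matches the non-breaking space (\xa0) in the scraped text.
COPYRIGHT_PATTERN = re.compile(
    r"^Copyright\s*©\s*\d{4}(?:\s*[–-]\s*\d{4})?\s+Kenneth Y\. Wertheim$"
)

def clean_corpus(paragraphs):
    """Drop footer lines, empty strings, and exact duplicates, keeping first-seen order."""
    kept = [p for p in paragraphs if p and not COPYRIGHT_PATTERN.match(p)]
    # dict.fromkeys preserves insertion order, so repeats are removed without reordering.
    return list(dict.fromkeys(kept))

# Small hand-made sample (not the scraped data) to show the behaviour.
sample = [
    "I am a systems theorist.",
    "Copyright ©\xa02017–2024 Kenneth Y. Wertheim",
    "Kenneth Y. Wertheim.",
    "Copyright ©\xa02017–2025 Kenneth Y. Wertheim",
    "Kenneth Y. Wertheim.",
]
print(clean_corpus(sample))
```

Whether to drop duplicates is a judgment call. Here the repeated lines look like captions that add no new information for retrieval, but they do consume part of the prompt's token budget.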

In [6]:
#Load the corpus into a pandas dataframe.
import pandas as pd
df = pd.DataFrame({"text": corpus})
#Drop empty rows, if any, and reset the index to a sequential one.
df = df[df["text"].str.len() > 0].reset_index(drop=True)
df

| | text |
|---|---|
| 0 | I am a systems theorist with expertise in math... |
| 1 | Aracial, acultural, and agender, I am a global... |
| 2 | If you want to contact me, please find my emai... |
| 3 | I am a neurodivergent (autism), a global citiz... |
| 4 | Please do not make assumptions about me, but d... |
| ... | ... |
| 271 | In the acknowledgements of my PhD thesis, I de... |
| 272 | I marked the beginning of the year by finishin... |
| 273 | Kenneth Y. Wertheim. |
| 274 | My Dearest Rose, oil on canvas, 2018. |


In [7]:
#Access OpenAI API
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "" #Replace the empty string with your own API key.

In [8]:
#Embed each row of the corpus by creating a semantic vector.
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
embed = openai.Embedding.create(
    input=df["text"].tolist(),
    model=EMBEDDING_MODEL_NAME
)

In [9]:
#Add the embeddings to the dataframe.
embeddings = [data["embedding"] for data in embed["data"]]
df["embeddings"] = embeddings
df

| | text | embeddings |
|---|---|---|
| 0 | I am a systems theorist with expertise in math... | [-0.01737857237458229, -0.013809639029204845, ... |
| 1 | Aracial, acultural, and agender, I am a global... | [-0.01827007159590721, -0.003775858087465167, ... |
| 2 | If you want to contact me, please find my emai... | [-0.049696292728185654, -0.00811275839805603, ... |
| 3 | I am a neurodivergent (autism), a global citiz... | [-0.028072498738765717, -0.006071896757930517,... |
| 4 | Please do not make assumptions about me, but d... | [-0.02093394845724106, -0.016574883833527565, ... |
| ... | ... | ... |
| 271 | In the acknowledgements of my PhD thesis, I de... | [-0.016483446583151817, -0.01144942082464695, ... |
| 272 | I marked the beginning of the year by finishin... | [-0.022733693942427635, -0.01655718870460987, ... |
| 273 | Kenneth Y. Wertheim. | [-0.007872112095355988, -0.011855024844408035,... |
| 274 | My Dearest Rose, oil on canvas, 2018. | [-0.019328245893120766, -0.03409375995397568, ... |


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

This part reuses code from the "Build a Custom OpenAI Chatbot with ML-Driven Prompt Engineering" exercise.

In [32]:
#This block of code compares a query with each row of the corpus and
#ranks the rows based on their relevance to the query.
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

get_rows_sorted_by_relevance("What did Dr Wertheim do when they were in Sheffield?", df)

| | text | embeddings | distances |
|---|---|---|---|
| 70 | Based at the University of Sheffield, I modell... | [0.0024402542039752007, 0.010176094248890877, ... | 0.169387 |
| 259 | Kenneth Y. Wertheim. | [-0.007918896153569221, -0.011881690472364426,... | 0.171954 |
| 273 | Kenneth Y. Wertheim. | [-0.007872112095355988, -0.011855024844408035,... | 0.171961 |
| 199 | Yesterday, I had my last day at the University... | [-0.0024456738028675318, -0.012336992658674717... | 0.178364 |
| 114 | Between November 2010 and March 2011, I volunt... | [-0.015073556452989578, 0.0020233874674886465,... | 0.187202 |
| ... | ... | ... | ... |
| 86 | The neural crest is a transient population of ... | [-0.020272349938750267, -0.0010255689267069101... | 0.310491 |
| 36 | Percentile rank: 98. The Cattell III B Intelli... | [-0.020977552980184555, -0.0005106421303935349... | 0.315698 |
| 223 | 2022 is Stupid, acrylic on canvas, 2022. | [-0.01979467272758484, -0.017956869676709175, ... | 0.315918 |
| 35 | Percentile rank: 99. The Cattell Culture Fair ... | [-0.030141182243824005, 0.0029151630587875843,... | 0.318165 |
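The `distances` column holds cosine distances, so a smaller value means the row points in nearly the same direction as the question. Below is a minimal pure-Python sketch of that metric, assuming the usual definition 1 - cos(theta). The function `cosine_distance` and the toy vectors are illustrative; the notebook itself relies on `distances_from_embeddings` rather than this function.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v): 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings"; real text-embedding-ada-002 vectors have 1536 dimensions.
question = [1.0, 0.0, 0.0]
rows = {
    "same direction": [2.0, 0.0, 0.0],
    "partly related": [1.0, 1.0, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}

# Sort ascending, as the notebook does: the smallest distance is the most relevant row.
ranked = sorted(rows, key=lambda name: cosine_distance(question, rows[name]))
print(ranked)
```

Cosine distance ignores vector length, which is why the vector `[2.0, 0.0, 0.0]` is at distance 0 from the question despite being twice as long.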


In [33]:
#This block of code combines a query with a context extracted from the corpus to create a prompt.
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know."

Context: 

{}

---

Question: {}

Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

print(create_prompt("What did Dr Wertheim do when they were in Sheffield?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know."

Context: 

Based at the University of Sheffield, I modelled two paediatric cancers at the cellular scale under the supervision of Doctor Dawn Walker.

###

Kenneth Y. Wertheim.

###

Kenneth Y. Wertheim.

###

Yesterday, I had my last day at the University of Sheffield. In the last four years, I had a great time doing cutting-edge research within a European project and supervising various exploratory projects. I am grateful to Dr Dawn Walker for this opportunity.

---

Question: What did Dr Wertheim do when they were in Sheffield?

Answer:
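The 200-token budget explains why only four rows reach the context: rows are added in relevance order until the next one would exceed the limit. The sketch below isolates that greedy loop. It assumes a hypothetical whitespace word counter (`count_words`) in place of `tiktoken`, and, unlike `create_prompt`, it also charges for the `###` separator between rows. `pack_context` is an illustrative name, not course code.

```python
SEPARATOR = "\n\n###\n\n"

def count_words(text):
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())

def pack_context(ranked_texts, budget, count_tokens=count_words):
    """Greedily keep the most relevant texts while the running cost stays within budget."""
    context = []
    used = 0
    for text in ranked_texts:
        cost = count_tokens(text)
        if context:
            # Every row after the first is preceded by a separator in the joined prompt.
            cost += count_tokens(SEPARATOR)
        if used + cost > budget:
            break  # Stop at the first row that does not fit, as create_prompt does.
        context.append(text)
        used += cost
    return context

# Texts are assumed to be sorted from most to least relevant already.
ranked_texts = ["a b c", "d e", "f g h i"]
print(pack_context(ranked_texts, budget=7))
print(SEPARATOR.join(pack_context(ranked_texts, budget=7)))
```

Stopping at the first row that does not fit keeps the context strictly in relevance order. Skipping that row and trying shorter ones instead would fill the budget more fully at the cost of admitting less relevant text.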


In [34]:
#This block of code takes a query to create a prompt and then calls a completion model to generate a response.
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=1500
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

print(answer_question("What did Dr Wertheim do when they were in Sheffield?", df))

Dr Wertheim did cutting-edge research within a European project and supervised various exploratory projects. They also established international collaborations, took bioinformatics courses, and taught a course on Modelling and Simulation of Natural Systems. They also supervised students and visited Tapton School to give a seminar talk.
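`openai.Completion.create` and `openai.embeddings_utils` belong to the pre-1.0 `openai` package, which is what this notebook assumes. Below is a rough sketch of the same completion call in the client-object style of `openai` 1.x. The helper name `complete_with_client` is hypothetical, the client is passed in so the function makes no network call by itself, and the wiring shown in the trailing comment is an assumption to adapt.

```python
def complete_with_client(client, prompt, model="gpt-3.5-turbo-instruct", max_tokens=1500):
    """
    Send a prompt to a completion model through an openai>=1.0 style client
    and return the stripped text, or an empty string if the call fails.
    """
    try:
        response = client.completions.create(
            model=model,
            prompt=prompt,
            max_tokens=max_tokens,
        )
        # In the 1.x client the response is an object, so fields are attributes, not dict keys.
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""

# Example wiring (assumed; requires the openai 1.x package and a valid API key):
#   from openai import OpenAI
#   client = OpenAI(api_key="...", base_url="https://openai.vocareum.com/v1")
#   question = "What did Dr Wertheim do when they were in Sheffield?"
#   print(complete_with_client(client, create_prompt(question, df, 1800)))
```

Passing the client in as an argument also makes the function easy to exercise with a stub object, without spending API credits.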


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [37]:
#Let's get a response without adaptation.
prompt1 = """
Question: "What did Dr Wertheim do during the PRIMAGE project?"

Answer:
"""

initial_prompt1_answer = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt1,
    max_tokens=1500
)["choices"][0]["text"].strip()

print(initial_prompt1_answer)
#The response below is not entirely correct.
#Dr Wertheim was responsible for computational modelling, not medical imaging processing.

Dr. Wertheim was one of the lead researchers in the PRIMAGE project, a European research project that aimed to develop an innovative software platform for personalized medicine in the field of childhood cancer. As a part of this project, Dr. Wertheim helped develop and implement advanced medical imaging processing techniques, including machine learning algorithms, to analyze and interpret medical images of cancer patients. He also collaborated with other members of the research team to establish clinical guidelines for the use of these techniques in clinical decision making, ultimately working towards improving treatment outcomes and quality of life for pediatric cancer patients.


In [39]:
#Let's get a response from the same completion model, this time with retrieval-augmented generation (RAG).
print(answer_question("What did Dr Wertheim do during the PRIMAGE project?", df))

He built the first multicellular computational model of neuroblastoma, which was integrated into a multiscale simulation framework and was later published in a paper. He also worked on a project to simulate tumour progression in silico and presented his research at various conferences. Additionally, he established international collaborations and secured research grants for spin-off projects related to PRIMAGE.


### Question 2

In [50]:
#Let's get a response without adaptation.
prompt2 = """
Question: "At the University of Hull, what's covered in the Applied AI module? What's Dr Wertheim's role?"

Answer:
"""

initial_prompt2_answer = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt2,
    max_tokens=1500
)["choices"][0]["text"].strip()

print(initial_prompt2_answer)
#The response below is not entirely correct.
#For example, the module doesn't cover robotics and data mining.

The Applied AI module at the University of Hull covers the practical applications of artificial intelligence in various fields, including business, healthcare, education, and social media. This includes topics such as machine learning, natural language processing, robotics, and data mining. Students will learn about the development and implementation of AI systems and algorithms, as well as the ethical and social implications of this technology.

Dr Wertheim's role in this module is likely that of an instructor or lecturer. They may be responsible for teaching certain topics, designing and grading assignments, and providing guidance and support to students. They may also have a research background in AI and be able to share their expertise and insights with students.


In [51]:
#Let's get a response from the same completion model, this time with retrieval-augmented generation (RAG).
print(answer_question("At the University of Hull, what's covered in the Applied AI module? What's Dr Wertheim's role?", df))

Dr Wertheim's role at the University of Hull is as the module leader for the Applied AI module. This module covers topics such as supervised, unsupervised, and reinforcement learning algorithms, as well as computer vision and natural language processing applications. It includes linear regression, decision trees, Naive Bayes, k-means clustering, hierarchical clustering, principal component analysis, feedforward neural networks, convolutional neural networks, autoencoders, recurrent neural networks, attention models, and Q-learning.
