# Custom Chatbot Project

For this project, I have chosen the Wikipedia article on "2023". It covers a wide range of recent events across politics, technology, science, culture, and sports. This breadth gives the chatbot up-to-date, credible material to draw on, which makes it useful for questions about current events, for educational purposes, and for general knowledge.

## Data Wrangling

### Loading the Data with `pandas`

In [71]:
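import requests
import pandas as pd
import numpy as np
from dateutil.parser import parse
# openai.Embedding, openai.Completion and openai.embeddings_utils
# are only available in the legacy openai<1.0 package.
import openai
from openai.embeddings_utils import get_embedding
from openai.embeddings_utils import distances_from_embeddings
import tiktoken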

In [72]:
response = requests.get("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2023&explaintext=1&formatversion=2&format=json")
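The request above hard-codes the whole query string. A minimal sketch of the same query built from a parameter dictionary, which can be easier to read and change; `API_URL` and `params` are names introduced here for illustration only:

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

# Same query parameters as the hard-coded URL, one per line.
params = {
    "action": "query",
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2023",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json",
}

# urlencode keeps the dictionary's insertion order.
url = f"{API_URL}?{urlencode(params)}"

original = (
    "https://en.wikipedia.org/w/api.php?action=query&prop=extracts"
    "&exlimit=1&titles=2023&explaintext=1&formatversion=2&format=json"
)
print(url == original)
```

With `requests`, the dictionary can also be passed directly as `requests.get(API_URL, params=params)`.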

In [73]:
response.json()["query"]["pages"][0]["extract"].split("\n")

['2023 (MMXXIII) was a common year starting on Sunday of the Gregorian calendar, the 2023rd year of the Common Era (CE) and Anno Domini (AD) designations, the 23rd  year of the 3rd millennium and the 21st century, and the  4th   year of the 2020s decade.  ',
 'The year 2023 saw the decline in severity of the COVID-19 pandemic, with the WHO (World Health Organization) ending its global health emergency status in May. Catastrophic natural disasters included the fifth-deadliest earthquake of the 21st century striking Turkey and Syria, leaving up to 62,000 people dead, Cyclone Freddy – the longest-lasting recorded tropical cyclone in history – leading to over 1,400 deaths in Malawi and Mozambique, Storm Daniel, which became the deadliest cyclone worldwide since Cyclone Nargis after killing at least 11,000 people in Libya, a major 6.8 magnitude earthquake striking western Morocco, killing 2,960 people, and a 6.3 magnitude quadruple earthquake striking western Afghanistan, killing over 1,400

In [74]:
df = pd.DataFrame()
df["text"] = response.json()["query"]["pages"][0]["extract"].split("\n")
df

Unnamed: 0,text
0,2023 (MMXXIII) was a common year starting on S...
1,The year 2023 saw the decline in severity of t...
2,The Russian invasion of Ukraine and Myanmar ci...
3,A banking crisis resulted in the collapse of n...
4,"In the realm of technology, 2023 saw the conti..."
...,...
295,"Physics – Pierre Agostini, Ferenc Krausz & Ann..."
296,Physiology or Medicine – Katalin Karikó & Drew...
297,
298,


In [75]:
# remove blank lines
df = df[df["text"].str.len() > 0]
df

Unnamed: 0,text
0,2023 (MMXXIII) was a common year starting on S...
1,The year 2023 saw the decline in severity of t...
2,The Russian invasion of Ukraine and Myanmar ci...
3,A banking crisis resulted in the collapse of n...
4,"In the realm of technology, 2023 saw the conti..."
...,...
293,"Literature – Jon Fosse, for his innovative pla..."
294,"Peace – Narges Mohammadi, for her works on the..."
295,"Physics – Pierre Agostini, Ferenc Krausz & Ann..."
296,Physiology or Medicine – Katalin Karikó & Drew...


In [76]:
# remove lines starting with "==" (section headings in the plain-text extract)
df = df[~df["text"].str.startswith("==")]
df

Unnamed: 0,text
0,2023 (MMXXIII) was a common year starting on S...
1,The year 2023 saw the decline in severity of t...
2,The Russian invasion of Ukraine and Myanmar ci...
3,A banking crisis resulted in the collapse of n...
4,"In the realm of technology, 2023 saw the conti..."
...,...
292,"Economics – Claudia Goldin, for her empirical ..."
293,"Literature – Jon Fosse, for his innovative pla..."
294,"Peace – Narges Mohammadi, for her works on the..."
295,"Physics – Pierre Agostini, Ferenc Krausz & Ann..."


In [77]:
# Combine the date and description into one line.

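prefix = ""
combined = []
for text in df["text"]:
    if " – " not in text:
        try:
            # A line that parses as a date (e.g. a day heading) becomes the prefix.
            parse(text)
            prefix = text
        except Exception:
            # Not a date: attach the most recent date heading.
            # Building a new list avoids mutating the row yielded by iterrows(),
            # which may be a copy and leave df unchanged.
            text = prefix + " – " + text
    combined.append(text)
df = df.assign(text=combined)
# Keep only "date – description" rows; bare date headings are dropped here.
df = df[df["text"].str.contains(" – ")]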
df.tail(10)

Unnamed: 0,text
277,December 21 – The world population on January ...
281,December 21 – The best-selling video game in 2...
282,December 21 – The highest-grossing movie in 20...
283,December 21 – The best-selling book in 2023 wa...
291,"Chemistry – Moungi Bawendi, Louis E. Brus & Al..."
292,"Economics – Claudia Goldin, for her empirical ..."
293,"Literature – Jon Fosse, for his innovative pla..."
294,"Peace – Narges Mohammadi, for her works on the..."
295,"Physics – Pierre Agostini, Ferenc Krausz & Ann..."
296,Physiology or Medicine – Katalin Karikó & Drew...
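
The date-combining step can be sketched on a plain list to make its three cases explicit. This is a simplified illustration, not the notebook's code: it assumes date headings look like a month name followed by a day number and detects them with a regular expression instead of `dateutil`. `attach_dates`, `DATE_LINE`, and the sample lines are made up for the example.

```python
import re

# Assumption: a date heading is exactly "<Month> <day>", e.g. "December 21".
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_LINE = re.compile(rf"^(?:{MONTHS}) \d{{1,2}}$")


def attach_dates(lines, sep=" – "):
    """Prefix undated lines with the most recent date heading."""
    result = []
    current_date = ""
    for line in lines:
        if sep in line:
            result.append(line)        # already "date – event"
        elif DATE_LINE.match(line):
            current_date = line        # remember the heading, do not emit it
        else:
            result.append(current_date + sep + line)
    return result


sample = [
    "January 1 – first event",
    "December 21",
    "second event",
    "third event",
]
for line in attach_dates(sample):
    print(line)
```

Lines that appear before any date heading get an empty prefix, which is why the first rows of the notebook's output start with a bare "–".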


In [78]:
df.reset_index(inplace=True, drop=True)
df

Unnamed: 0,text
0,– 2023 (MMXXIII) was a common year starting o...
1,The year 2023 saw the decline in severity of t...
2,– The Russian invasion of Ukraine and Myanmar...
3,– A banking crisis resulted in the collapse o...
4,"– In the realm of technology, 2023 saw the co..."
...,...
210,"Economics – Claudia Goldin, for her empirical ..."
211,"Literature – Jon Fosse, for his innovative pla..."
212,"Peace – Narges Mohammadi, for her works on the..."
213,"Physics – Pierre Agostini, Ferenc Krausz & Ann..."


In [79]:
df.to_csv("text.csv")

### Creating an Embeddings Index with `openai.Embedding`

In [80]:
import pandas as pd
df = pd.read_csv("text.csv", index_col=0)
df

Unnamed: 0,text
0,– 2023 (MMXXIII) was a common year starting o...
1,The year 2023 saw the decline in severity of t...
2,– The Russian invasion of Ukraine and Myanmar...
3,– A banking crisis resulted in the collapse o...
4,"– In the realm of technology, 2023 saw the co..."
...,...
210,"Economics – Claudia Goldin, for her empirical ..."
211,"Literature – Jon Fosse, for his innovative pla..."
212,"Peace – Narges Mohammadi, for her works on the..."
213,"Physics – Pierre Agostini, Ferenc Krausz & Ann..."


In [81]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
response = openai.Embedding.create(
    input=df["text"].tolist(),
    model=EMBEDDING_MODEL_NAME
)
embeddings = [data["embedding"] for data in response["data"]]
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,– 2023 (MMXXIII) was a common year starting o...,"[0.0036826389841735363, -0.01391774881631136, ..."
1,The year 2023 saw the decline in severity of t...,"[-0.02144140563905239, -0.0048804208636283875,..."
2,– The Russian invasion of Ukraine and Myanmar...,"[-0.02132827416062355, -0.01669287495315075, 0..."
3,– A banking crisis resulted in the collapse o...,"[-0.032874152064323425, -0.012347983196377754,..."
4,"– In the realm of technology, 2023 saw the co...","[-0.023189552128314972, -0.014148631133139133,..."
...,...,...
210,"Economics – Claudia Goldin, for her empirical ...","[-0.0173618383705616, -0.00786728784441948, -0..."
211,"Literature – Jon Fosse, for his innovative pla...","[-0.009454749524593353, 0.017640141770243645, ..."
212,"Peace – Narges Mohammadi, for her works on the...","[-0.013226707465946674, -0.012996620498597622,..."
213,"Physics – Pierre Agostini, Ferenc Krausz & Ann...","[-0.020426776260137558, 0.015005192719399929, ..."


In [82]:
df.to_csv("embeddings.csv")

### Finding Relevant Data with Cosine Similarity

In [83]:
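import ast

df = pd.read_csv("embeddings.csv", index_col=0)
# The CSV stores each embedding as a stringified list; literal_eval parses it
# without executing arbitrary code, unlike eval.
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)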

question = "Tell me about the notable AI innovations in 2023."
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
distances = distances_from_embeddings(question_embeddings, df["embeddings"].tolist(), distance_metric="cosine")
df["distances"] = distances
df

Unnamed: 0,text,embeddings,distances
0,– 2023 (MMXXIII) was a common year starting o...,"[0.0036826389841735363, -0.01391774881631136, ...",0.223111
1,The year 2023 saw the decline in severity of t...,"[-0.02144140563905239, -0.0048804208636283875,...",0.270183
2,– The Russian invasion of Ukraine and Myanmar...,"[-0.02132827416062355, -0.01669287495315075, 0...",0.251490
3,– A banking crisis resulted in the collapse o...,"[-0.032874152064323425, -0.012347983196377754,...",0.297747
4,"– In the realm of technology, 2023 saw the co...","[-0.023189552128314972, -0.014148631133139133,...",0.123513
...,...,...,...
210,"Economics – Claudia Goldin, for her empirical ...","[-0.0173618383705616, -0.00786728784441948, -0...",0.281268
211,"Literature – Jon Fosse, for his innovative pla...","[-0.009454749524593353, 0.017640141770243645, ...",0.286060
212,"Peace – Narges Mohammadi, for her works on the...","[-0.013226707465946674, -0.012996620498597622,...",0.288516
213,"Physics – Pierre Agostini, Ferenc Krausz & Ann...","[-0.020426776260137558, 0.015005192719399929, ...",0.264084
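
For intuition about the `distances` column: cosine distance is one minus the cosine similarity of two vectors, so smaller values mean the row points in a direction closer to the question. A minimal pure-Python sketch with toy 2-D vectors (real embeddings have far more dimensions; `cosine_distance` and the vectors below are illustrative, not part of the notebook):

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],   # length differs, direction matches
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

distances = {name: cosine_distance(query, vec) for name, vec in rows.items()}
for name, dist in distances.items():
    print(f"{name}: {dist:.1f}")
```

Because only direction matters, the vector with twice the length still has distance 0 from the query.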


In [84]:
df.to_csv("distances.csv")

## Find Shortest Distance

In [93]:
df = pd.read_csv("distances.csv", index_col=0)

current_shortest = df.iloc[0]["distances"]
current_shortest_index = 0
current_shortest, current_shortest_index

for index, distance in enumerate(df["distances"].values):
    if distance < current_shortest:
        current_shortest = distance
        current_shortest_index = index
current_shortest, current_shortest_index

df.iloc[current_shortest_index]["text"]

' – In the realm of technology, 2023 saw the continued rise of generative AI models, with increasing applications across various industries. These models, leveraging advancements in machine learning and natural language processing, had become capable of creating realistic and coherent text, images, and music. An AI arms race between private companies has continued since the late 2010s, with Microsoft-backed OpenAI and Google owner Alphabet today most dominant among firms.'
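
The manual loop above can also be expressed with built-in pandas operations. A short sketch on a toy frame (the rows and distances are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "text": ["row a", "row b", "row c"],
        "distances": [0.27, 0.12, 0.25],
    }
)

# Index label of the smallest distance, then the matching text.
closest_label = toy["distances"].idxmin()
closest_text = toy.loc[closest_label, "text"]

# The k closest rows, already ordered from nearest to farthest.
top_two = toy.nsmallest(2, "distances")["text"].tolist()

print(closest_label, closest_text, top_two)
```

`idxmin` returns an index label rather than a position, so it pairs with `.loc`; the two coincide only when the index is the default 0..n-1 range, as it is after `reset_index`.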

In [94]:
df.sort_values(by="distances").to_csv("distances_sorted.csv")

## Custom Query Completion

In [95]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    prompt_template =  """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    context = []
    for text in df["text"].values:
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        if current_token_count <= max_prompt_tokens:
            context.append(text)
        else:
            break
    prompt = prompt_template.format("\n\n###\n\n".join(context), question)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"]
    except Exception as e:
        print(e)
        return ""
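
The context-building loop inside `answer_question` is a greedy packing step: rows are added in order until the next one would exceed the token budget. A minimal sketch of just that step, assuming a whitespace word count as a stand-in for the `tiktoken` encoder so it runs without extra dependencies; `build_context`, `count_words`, and `reserved` are hypothetical names:

```python
def count_words(text):
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def build_context(texts, count_tokens, budget, reserved=0):
    """Greedily keep texts, in order, while the running total fits the budget.

    `reserved` plays the role of the tokens already used by the prompt
    template and the question.
    """
    context = []
    used = reserved
    for text in texts:
        used += count_tokens(text)
        if used > budget:
            break  # stop at the first text that does not fit
        context.append(text)
    return context


texts = ["one two three", "four five", "six seven eight nine"]
print(build_context(texts, count_words, budget=5))
```

Since the loop stops at the first row that does not fit, the order of `df` matters: the rows should already be sorted from most to least relevant before being passed in.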

## Basic Response

In [90]:
questions = [
    "What were the major political events in 2023?",
    "When and what important scientific discoveries or breakthroughs occurred in 2023?",
    "What were the top cultural or entertainment events in 2023?",
    "What were the significant healthcare advancements or medical discoveries in 2023?",
    "What were the major sports events and their outcomes in 2023?"
]

before_customization_responses = []

for question in questions:
    response = openai.Completion.create(
        model=COMPLETION_MODEL_NAME,
        prompt=question,
        max_tokens=100
    )
    before_customization_responses.append(response.choices[0].text.strip())

print("Basic Responses (Before Customization):")
for i, question in enumerate(questions):
    print(f"Q: {question}")
    print(f"A: {before_customization_responses[i]}")
    print("\n#######\n")

Basic Responses (Before Customization):
Q: What were the major political events in 2023?
A: It is impossible to accurately predict all major political events in 2023, as the political landscape is constantly changing. However, some potential events that could occur in 2023 include:

1. Presidential elections in various countries, including the United States, France, Germany, and Russia.

2. Continued Brexit negotiations between the United Kingdom and the European Union.

3. Potential further escalation of tensions between the United States and North Korea, as well as ongoing diplomatic efforts to denuclearize the Korean

#######

Q: When and what important scientific discoveries or breakthroughs occurred in 2023?
A: As a language model AI, I cannot accurately predict future events, including scientific discoveries or breakthroughs. However, here are some major advancements and developments that have occurred recently in the world of science and technology:

1. The first successful huma

## Custom Performance Demonstration


In [96]:
df = pd.read_csv("distances_sorted.csv", index_col=0)

custom_responses = []
for question in questions:
    custom_answer = answer_question(question, df)
    custom_responses.append(custom_answer)

print("Custom Responses (After Customization):")
for i, question in enumerate(questions):
    print(f"Q: {question}")
    print(f"A: {custom_responses[i]}")
    print("\n#######\n")

Custom Responses (After Customization):
Q: What were the major political events in 2023?
A: 
1. The first AI Safety Summit took place with 28 countries signing an agreement on how to manage the riskiest forms of artificial intelligence.

2. SAG-AFTRA announces a strike against major film and TV studios in protest of low compensation, ownership of work, and generative AI.

3. Large-scale spontaneous protests erupt in Israel in the wake of Prime Minister Benjamin Netanyahu firing his defense minister.

4. The European Parliament approves a ban on the sale of new petrol and diesel vehicles in the European Union from 2035 to combat climate change and promote electric vehicles.

5. The 2023 Chinese presidential election is held with Xi Jinping being unanimously re-elected for a third term.

6. Economist and former deputy prime minister Tharman

#######

Q: When and what important scientific discoveries or breakthroughs occurred in 2023?
A:  In March and April of 2023, the first AI Safety Su
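
Note that `distances_sorted.csv` was ranked against the single question from the similarity step, so every call above receives the same context regardless of what is asked. One possible adjustment is to re-rank the rows for each question before building the prompt. The sketch below shows only the re-ranking idea, using made-up 2-D vectors in place of real embeddings; `rank_rows` and all data are hypothetical:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def rank_rows(question_vec, rows):
    """Return row texts ordered from closest to farthest from the question."""
    ordered = sorted(rows, key=lambda row: cosine_distance(question_vec, row[1]))
    return [text for text, _ in ordered]


# Toy (text, embedding) pairs; real vectors would come from the embedding model.
rows = [
    ("row about politics", [1.0, 0.1]),
    ("row about science", [0.1, 1.0]),
]
question_vectors = {
    "politics question": [0.9, 0.2],
    "science question": [0.2, 0.9],
}

# Each question gets its own ordering instead of sharing one sorted file.
contexts = {q: rank_rows(vec, rows) for q, vec in question_vectors.items()}
for q, ordered in contexts.items():
    print(q, "->", ordered[0])
```

In the notebook, this would mean embedding each question and recomputing the distances before calling `answer_question`, at the cost of one extra embedding request per question.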

## Comparison of Basic and Custom Responses

In [99]:
print(f"Q: {questions[0]}")
print("\n#######\n")
print("Basic Response:")
print(f"A: {before_customization_responses[0]}")
print("\n#######\n")
print("Custom Response:")
print(f"A: {custom_responses[0]}")

Q: What were the major political events in 2023?

#######

Basic Response:
A: It is impossible to accurately predict all major political events in 2023, as the political landscape is constantly changing. However, some potential events that could occur in 2023 include:

1. Presidential elections in various countries, including the United States, France, Germany, and Russia.

2. Continued Brexit negotiations between the United Kingdom and the European Union.

3. Potential further escalation of tensions between the United States and North Korea, as well as ongoing diplomatic efforts to denuclearize the Korean

#######

Custom Response:
A: 
1. The first AI Safety Summit took place with 28 countries signing an agreement on how to manage the riskiest forms of artificial intelligence.

2. SAG-AFTRA announces a strike against major film and TV studios in protest of low compensation, ownership of work, and generative AI.

3. Large-scale spontaneous protests erupt in Israel in the wake of Prime 

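The custom responses improve on the basic ones because the prompt now incorporates detailed information from the Wikipedia article on 2023: the basic response speculates about events that "could occur", while the custom response lists events that actually took place. This shows how customizing the data source can significantly enhance the chatbot's performance and relevance to user queries.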