# Tennis Chatbot Project
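This project creates a **tennis ChatBot** which uses the OpenAI API to leverage ChatGPT knowledge. To cover more recent tennis news, we also give the ChatBot access to tennis news from 2022, 2023 and early 2024, collected from Wikipedia. The model itself is not retrained: the relevant news is added to the prompt at question time.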

The project is composed of two parts:
1. We create embeddings for the collected tennis news. We save those embeddings to a file so they can be reused later.
2. We capture the user's question, identify the most relevant records by comparing embeddings, and submit the question together with those records to ChatGPT.

To validate the project, we show the different responses provided by ChatGPT and the custom tennis ChatBot.

To use the tennis ChatBot, simply keep asking questions and type "quit" to leave the conversation.

## Datasets 
We'll scrape the following pages from Wikipedia to collect more recent tennis news:
* https://en.wikipedia.org/wiki/2022_in_tennis
* https://en.wikipedia.org/wiki/2023_ATP_Tour
* https://en.wikipedia.org/wiki/2022_ATP_Tour
* https://en.wikipedia.org/wiki/2022_WTA_Tour
* https://en.wikipedia.org/wiki/2022_US_Open_(tennis)
* https://en.wikipedia.org/wiki/2022_French_Open
* https://en.wikipedia.org/wiki/2022_Wimbledon_Championships
* https://en.wikipedia.org/wiki/2022_Australian_Open
* https://en.wikipedia.org/wiki/2023_Australian_Open
* https://en.wikipedia.org/wiki/2023_Wimbledon_Championships
* https://en.wikipedia.org/wiki/2023_French_Open_%E2%80%93_Men%27s_singles
* https://en.wikipedia.org/wiki/2023_French_Open_%E2%80%93_Women%27s_singles
* https://en.wikipedia.org/wiki/2024_Australian_Open_%E2%80%93_Men%27s_singles
* https://en.wikipedia.org/wiki/2024_Australian_Open_%E2%80%93_Women%27s_singles
    

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In the code below, we create a list of articles which should be parsed to collect the latest tennis news. We ensure the collected data doesn't include any empty record as it would be rejected by the OpenAI API.
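
Each page is fetched through the MediaWiki API rather than by parsing HTML. As a minimal sketch, the snippet below builds an equivalent request URL with the standard library only and does not send it, so the parameters can be inspected offline. The helper name `build_extract_url` is ours and purely illustrative; the parameters are the ones used in the cell that follows.

```python
from urllib.parse import urlencode

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"

def build_extract_url(title):
    """Return the URL requesting the plain-text extract of one page (hypothetical helper)."""
    query_params = {
        "action": "query",
        "prop": "extracts",
        "exlimit": 1,
        "titles": title,
        "explaintext": 1,   # ask for plain text instead of HTML
        "formatversion": 2,
        "format": "json",
    }
    # urlencode turns the dictionary into a query string, in insertion order
    return WIKIPEDIA_API + "?" + urlencode(query_params)

url = build_extract_url("2023_ATP_Tour")
print(url)
```

Printing the URL and opening it in a browser is a quick way to check what a given title returns before running the full loop.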

In [28]:
import requests
import pandas as pd
import re

titles = ["2022_in_tennis", 
          "2023_ATP_Tour", 
          "2022_ATP_Tour", 
          "2022_WTA_Tour", 
          "2022_US_Open_(tennis)", 
          "2022_French_Open", 
          "2022_Wimbledon_Championships",
          "2022_Australian_Open",
          "2023_Australian_Open",
          "2023_Wimbledon_Championships",
          "2022_US_Open_(tennis)",
          "2023 French Open – Men's singles",
          "2023 French Open – Women's singles",
          "2024 Australian Open – Women's singles",
          "2024 Australian Open – Men's singles" ]

# create list of dataframes which will be built through the for loop below.
data_list = []

for title in titles:
    print("Processing:", title)

    # extract year and tournament from the title so we can prepend them to the news
    year = title[0:4]
    tournement=""
    if "French" in title:
        tournement = "French Open"
    elif "Wimbledon" in title:
        tournement = "Wimbledon"
    elif "US" in title:
        tournement = "US Open"
    elif "Australian" in title:
        tournement = "Australian Open"
 
    # create dataframe to manage recent data from title page
    df = pd.DataFrame()

    # make sure the data doesn't get truncated in dataframe
    pd.set_option('display.max_colwidth', None)

    query_params = {
        "action": "query", 
        "prop": "extracts",
        "exlimit": 1,
        "titles": title,
        "explaintext": 1,
        "formatversion": 2,
        "format": "json"
    }

    resp = requests.get("https://en.wikipedia.org/w/api.php", params=query_params)
    response_dict = resp.json()
    # Load page text into a dataframe
    df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")
    # protect "No. " (as in "No. 1") with a "NoDot" placeholder so the sentence split below keeps it intact
    df["text"] = df["text"].str.replace(r"[nN]o\.\s", "NoDot", regex=True)
    # split rows into multiple sentences
    df['text'] = df['text'].str.split('.')
    # Convert list into multiple rows
    df = df.explode('text')
    # Clean up text to remove empty lines and headings
    df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]
    # restore the placeholder to "No. " (No. 1, No. 2, etc.)
    df["text"] = df["text"].str.replace("NoDot", "No. ", regex=False)
    # remove match-result fragments that end with "def" (left over from splitting "X def. Y")
    pattern_del = r"def$"
    ends_with_def = df["text"].str.contains(pattern_del)
    df = df[~ends_with_def]
    # clear whitespace-only rows which may result from splitting rows into multiples.
    df = df[df["text"].str.strip() != ""]
    # prepend year and tournament to each record to provide more contextual information:
    df['text'] = df['text'].apply(lambda x: "{}-{}-{}".format(year, tournement, x))
    # add new data to the list
    data_list.append(df)
# consolidate all dataframes into a single dataframe
dfAll = pd.concat(data_list)
# reset index since we merged multiple dataframes and removed some rows.
dfFinal = dfAll.reset_index(drop=True)
print(dfFinal)

Processing: 2022_in_tennis
Processing: 2023_ATP_Tour
Processing: 2022_ATP_Tour
Processing: 2022_WTA_Tour
Processing: 2022_US_Open_(tennis)
Processing: 2022_French_Open
Processing: 2022_Wimbledon_Championships
Processing: 2022_Australian_Open
Processing: 2023_Australian_Open
Processing: 2023_Wimbledon_Championships
Processing: 2022_US_Open_(tennis)
Processing: 2023 French Open – Men's singles
Processing: 2023 French Open – Women's singles
Processing: 2024 Australian Open – Women's singles
Processing: 2024 Australian Open – Men's singles
                                                                                                                                                                                                            text
0                                                                                                                                 2022--This page covers all the important events in the sport of tennis in 2022
1                                       

The data above could still be cleaned manually, as a few records don't add much value. However, the retrieval step only passes the records most relevant to each question to the model, so this shouldn't be much of a concern.
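
The trickiest part of the cleaning is splitting on periods without breaking rankings such as "No. 1". That step can be isolated as follows. This is a minimal sketch in plain Python rather than pandas; the function name `split_sentences` and the sample sentence are made up for illustration, and the `\b` word boundary is a small tightening that the notebook cell does not use.

```python
import re

PLACEHOLDER = "NoDot"  # same marker string as in the notebook cell

def split_sentences(paragraph):
    """Split on '.', but keep the abbreviation 'No. ' (as in 'No. 1') intact."""
    # Protect "No. " / "no. " so its period is not treated as a sentence end.
    protected = re.sub(r"\b[nN]o\.\s", PLACEHOLDER, paragraph)
    # Split on periods and drop fragments that are empty or whitespace-only.
    fragments = [part.strip() for part in protected.split(".")]
    fragments = [part for part in fragments if part]
    # Restore the protected abbreviation (always written back as "No. ").
    return [part.replace(PLACEHOLDER, "No. ") for part in fragments]

sample = "Player A reached world No. 1 in April. She then won the next event."
sentences = split_sentences(sample)
print(sentences)
```

Other abbreviations (for example "St." or "def.") are still split, which is one source of the fragment records mentioned above.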

We now create embeddings for the dataframe created above. To do so, we use the embeddings API provided by OpenAI.
Note that the data submitted to the API must satisfy the following requirements:
* The input parameter may not take a list longer than 2048 elements (chunks of text).
* The total number of tokens across all list elements of the input parameter cannot exceed 1,000,000 (because the rate limit is 1,000,000 tokens per minute).
* Each individual list element (chunk of text) cannot be more than 8191 tokens.
* No element of the input list (the list of paragraphs) may be blank, empty or null.

We address the first requirement by sending batches of 100 rows to the OpenAI API. We already addressed the last requirement by filtering out the empty strings when we created the dataframes. The other requirements didn't pose a problem for this project.
In the future, we may want to implement a validation method to check that the input meets those requirements.
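
One possible shape for such a validation method is sketched below. It is not part of the notebook: the names `validate_batch` and `rough_token_count` are hypothetical, the limits are simply the ones listed above, and the default token counter is a crude word count standing in for a real tokenizer.

```python
MAX_BATCH_ITEMS = 2048        # maximum list length per request
MAX_ITEM_TOKENS = 8191        # maximum tokens per chunk of text
MAX_BATCH_TOKENS = 1_000_000  # maximum tokens across the whole list

def rough_token_count(text):
    """Hypothetical stand-in: approximate tokens by whitespace-separated words."""
    return len(text.split())

def validate_batch(texts, count_tokens=rough_token_count):
    """Return a list of problems found in the batch; an empty list means it passes."""
    problems = []
    if len(texts) > MAX_BATCH_ITEMS:
        problems.append(f"batch has {len(texts)} items (limit {MAX_BATCH_ITEMS})")
    total_tokens = 0
    for position, text in enumerate(texts):
        # Blank, empty and null entries are all rejected.
        if not isinstance(text, str) or not text.strip():
            problems.append(f"item {position} is empty or not a string")
            continue
        tokens = count_tokens(text)
        total_tokens += tokens
        if tokens > MAX_ITEM_TOKENS:
            problems.append(f"item {position} has {tokens} tokens (limit {MAX_ITEM_TOKENS})")
    if total_tokens > MAX_BATCH_TOKENS:
        problems.append(f"batch has {total_tokens} tokens (limit {MAX_BATCH_TOKENS})")
    return problems

print(validate_batch(["a valid sentence", "   ", None]))
```

For real counts, the `num_tokens_from_string` helper defined later in this notebook could be passed as `count_tokens`.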

In [29]:
import os
from openai import OpenAI
openai_api_key = os.getenv("OPENAI_API_KEY")

# create session with openai
client = OpenAI(api_key=openai_api_key)

# This is the embedding model we'll be using.
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

batch_size = 100
embeddings = []
for i in range(0, len(dfFinal), batch_size):
    # Send one batch of text data to the OpenAI model to get embeddings
    batch = dfFinal.iloc[i:i+batch_size]["text"].tolist()

    try:
        response = client.embeddings.create(
            input=batch,
            model=EMBEDDING_MODEL_NAME
        )
    except Exception as e:
        # print the failing batch, then stop: continuing would reuse a stale
        # (or undefined) response and misalign the embeddings with their rows
        print(e)
        print(batch)
        raise

    # Add embeddings to embeddings list
    embeddings.extend([data.embedding for data in response.data])
# Add embeddings list to dataframe
dfFinal["embeddings"] = embeddings


In [30]:
# display first 20 embeddings
print(dfFinal["embeddings"].head(20))

0                         [-0.015601040795445442, -0.01319698803126812, 0.008800100535154343, 0.014955742284655571, -0.007414606865495443, 0.019232427701354027, -0.01996629498898983, 0.005285754334181547, -0.03757914900779724, -0.021180976182222366, 0.02110505849123001, 0.010976401157677174, -0.024002574384212494, 0.0005523787112906575, 0.02905108779668808, -0.03656691685318947, 0.02509072609245777, -0.004216582980006933, 0.018612435087561607, -0.007901743985712528, 0.009603560902178288, 0.008603980764746666, 0.0019675279036164284, -0.013209640979766846, 0.00044087492278777063, 0.024951543658971786, 0.0023977269884198904, -0.03995789587497711, 0.01570226438343525, -0.004409539978951216, 0.02148464508354664, 0.003005066653713584, -0.0026444587856531143, -0.031784117221832275, -0.0005049302708357573, -0.0036851607728749514, 0.004112196620553732, 0.0032676146365702152, 0.007515829987823963, -0.016081850975751877, 0.017448365688323975, 0.040160343050956726, -0.0054154465906322, -0.01553777

In [31]:
# save text and embeddings
dfFinal.to_csv("embeddings_tennis_2022_2024.csv")

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

We first load the necessary packages, embeddings and environment variables.



In [32]:
import os
import tiktoken
from openai import OpenAI
import pandas as pd
import numpy as np
import ast

# define models and constants
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME="gpt-3.5-turbo-instruct"
MAX_PROMPT_TOKENS=1800
MAX_RESPONSE_TOKENS=150
TIKTOKEN_ENCODING="cl100k_base"


# load dataframe with latest tennis news/data
df = pd.read_csv('embeddings_tennis_2022_2024.csv')

# read_csv loads the embeddings column as strings, so we convert each value back to a list of floats
df["embeddings"] = df["embeddings"].apply(ast.literal_eval)

# read the OpenAI API key from the OPENAI_API_KEY environment variable
openai_api_key = os.getenv("OPENAI_API_KEY")

# create session with openai
client = OpenAI(api_key=openai_api_key)
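
The `ast.literal_eval` conversion above is needed because a CSV file stores each embedding list as text. The round trip can be reproduced in isolation; the sketch below uses the standard `csv` module and a made-up three-value vector instead of pandas and real embeddings.

```python
import ast
import csv
import io

# Write one row the way a CSV export does: the list is saved as its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["sample sentence", str([0.25, -0.5, 1.0])])

# Read it back: the embeddings cell is now a plain string, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_value = row["embeddings"]
print(type(raw_value).__name__)

# ast.literal_eval parses the Python literal back into a list of floats
# without executing arbitrary code (unlike eval).
parsed = ast.literal_eval(raw_value)
print(parsed)
```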

We now create a few utility functions to help us with the project.

In [33]:
def num_tokens_from_string(string):
    """Helper function which returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(TIKTOKEN_ENCODING)
    num_tokens = len(encoding.encode(string))
    return num_tokens

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_embedding(text, model=EMBEDDING_MODEL_NAME): 
    return client.embeddings.create(input = [text], model=model).data[0].embedding

def get_rows_sorted_by_relevance(question, df):
    """
    Function which takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, model=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distance (1 - cosine similarity) between each row's
    # embeddings and the embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = df_copy["embeddings"].apply(lambda x: 1 - cosine_similarity(x, question_embeddings))

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
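
To make the ranking concrete, here is a small sketch with toy 2-dimensional vectors in place of real embeddings. It uses only the standard library, so `cosine_distance` below is an illustrative re-implementation and not the notebook's numpy-based function; the vectors and labels are made up.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, written with the standard library only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Toy "embeddings": only the direction of a vector matters, not its length.
rows = {
    "orthogonal": [0.0, 3.0],
    "diagonal": [1.0, 1.0],
    "same direction": [2.0, 0.0],
}
question_vector = [1.0, 0.0]

# Shorter distance = more relevant, so an ascending sort puts the best match first.
ranked = sorted(rows, key=lambda name: cosine_distance(rows[name], question_vector))
print(ranked)  # ['same direction', 'diagonal', 'orthogonal']
```

The vector pointing the same way as the question has distance 0 even though it is twice as long, which is why cosine distance is a common choice for comparing embeddings.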

We now create a function which will create a prompt based on a template.

In [34]:
def create_prompt(question, custom_model, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    if custom_model:

        # Prompt template; its tokens and the question's are counted against the budget below
        prompt_template = """
            You are a tennis expert. Answer the question and if the question
            can't be answered based on the context, provide a response which starts with "You can't be serious!".
            When the question is "quit", answer with a tennis joke.

            Context:

            {} 

            ---
            Question: {}
            Answer:
        """
        current_token_count = num_tokens_from_string(prompt_template) + num_tokens_from_string(question)

        # add context to the question, most relevant rows first
        context = []
        for text in get_rows_sorted_by_relevance(question, df)["text"].values:

            # Increase the counter based on the number of tokens in this row
            text_token_count = num_tokens_from_string(text)
            current_token_count += text_token_count

            # Add the row of text to the list if we haven't exceeded the max
            if current_token_count <= max_token_count:
                context.append(text)
            else:
                break

        prompt = prompt_template.format("\n\n###\n\n".join(context), question)
    else:
        # basic model: send the question as-is, without any context
        prompt = question
    return prompt
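
The context selection inside `create_prompt` is a greedy fill of a token budget. The sketch below isolates that logic so it can be tried without the API; `pack_context` and `count_words` are hypothetical names, and counting words is only a stand-in for the tiktoken-based count used in the notebook.

```python
def pack_context(ranked_texts, base_token_count, max_token_count, count_tokens):
    """Keep the most relevant texts, in order, until the token budget is spent."""
    selected = []
    used = base_token_count  # tokens already taken by the template and question
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_token_count:
            break  # stop at the first text that does not fit
        selected.append(text)
    return selected

def count_words(text):
    """Hypothetical stand-in for a real tokenizer."""
    return len(text.split())

ranked = ["one two three", "four five", "six seven eight nine", "ten"]
context = pack_context(ranked, base_token_count=2, max_token_count=8, count_tokens=count_words)
print(context)  # ['one two three', 'four five']
```

Note that the loop stops at the first text that overflows the budget, so the short last entry is not considered even though it would fit. This matches the behavior of the notebook function and keeps the context strictly ordered by relevance.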

We now create a function which answers the questions.

In [35]:
def answer_question(question, df, custom_model=True, max_prompt_tokens=MAX_PROMPT_TOKENS, max_answer_tokens=MAX_RESPONSE_TOKENS):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, print the error and return an error message
    """
    prompt = create_prompt(question, custom_model, df, max_prompt_tokens)

    try:
        response = client.completions.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return "Oops! Something went wrong..."

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [37]:
question1 = "Who won the Australian Open Men's Singles Final in 2023?"
print(question1)
# call answer question with basic model (e.g. custom_model=False)
print("Response with basic model:", answer_question(question1, df, False))
# call answer question with customized model (e.g. custom_model=True)
print("Response with customized model:", answer_question(question1, df))

Who won the Australian Open Men's Singles Final in 2023?
Response with basic model: It is impossible to accurately predict who will win the Australian Open Men's Singles Final in 2023 as it is several years in the future and the players participating in the tournament may change.
Response with customized model: Novak Djokovic won the men's singles title at the 2023 Australian Open.


The answer provided by our tennis ChatBot to Question 1 is correct. Novak Djokovic won the Australian Open Men's singles final in 2023. The basic model didn't know this information since it was trained before the event occurred.

### Question 2

In [38]:
question2 = "When and where did Roger Federer retire from tennis?"
print(question2)
# call answer question with basic model (e.g. custom_model=False)
print("Response with basic model:", answer_question(question2, df, False))
# call answer question with customized model (e.g. custom_model=True)
print("Response with customized model:", answer_question(question2, df))

When and where did Roger Federer retire from tennis?
Response with basic model: Roger Federer has not yet retired from tennis. He is still actively playing on the professional circuit and has expressed a desire to continue playing for several more years. As of 2021, he is ranked #8 in the world and has recently returned to the courts after a knee injury.
Response with customized model: You can't be serious! As a tennis expert, you should know that Roger Federer retired from professional tennis at the end of 2022 Laver Cup.


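The answer provided by the custom tennis ChatBot is correct again: Roger Federer officially retired at the Laver Cup event in 2022, where he played his last match with Rafa Nadal. Note, however, that the response opens with "You can't be serious!", the phrase the prompt reserves for questions that can't be answered from the context, so the model did not follow that instruction exactly. The basic model wasn't trained on this data since the event occurred at the end of 2022.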

### Question 3

In [40]:
question3 = "Who won the women's and men's final at the 2024 Australian Open?"
print(question3)
# call answer question with basic model (e.g. custom_model=False)
print("Response with basic model:", answer_question(question3, df, False))
# call answer question with customized model (e.g. custom_model=True)
print("Response with customized model:", answer_question(question3, df))

Who won the women's and men's final at the 2024 Australian Open?
Response with basic model: The Australian Open is held in January each year, so the 2023 Australian Open has not yet taken place. Therefore, it is impossible to accurately answer who won the women's and men's final at the 2024 Australian Open.
Response with customized model: Jannik Sinner defeated Daniil Medvedev in the final, 3–6, 3–6, 6–4, 6–4, 6–3 to win the men's singles tennis title at the 2024 Australian Open

Defending champion Aryna Sabalenka defeated Zheng Qinwen in the final, 6–3, 6–2 to win the women's singles tennis title at the 2024 Australian Open.


Question 3 was answered correctly by the custom tennis ChatBot: Jannik Sinner won the 2024 Australian Open Men's final and Aryna Sabalenka won the Women's final.

### Findings
For all three questions, the customized model correctly answered questions about events which happened after the basic model was trained (2021), while the basic model could not. So, overall, it meets the project requirements.

### Improvements
The dataset could be cleaned up, as it includes some irrelevant records. Some sentences were also split into multiple records, which breaks up their context. As a next step, we could create a new CSV file (embeddings_tennis_2022_2024_clean.csv) that addresses these issues. The cleanup is manual and follows the steps below:

1. Create a new dataframe with the index and text columns.
2. Remove and combine the records as needed.
3. Generate embeddings for the new records and add to new dataframe.
4. Save new dataframe as "embeddings_tennis_2022_2024_clean.csv"

### Wait! There is more...
To provide a more engaging experience, we also created a separate Python program (tennisbot.py) which can be run from the command line:
> python tennisbot.py

Type in your question or 'quit' to engage with the tennis bot. Be careful, it was trained with a personality!
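
The source of tennisbot.py is not shown here, so the following is only a sketch of how such a question loop might look. The function name `chat_loop` and its parameters are hypothetical. It assumes that "quit" is still sent to the bot before leaving, since the prompt template asks the model to answer "quit" with a tennis joke.

```python
def chat_loop(answer_fn, read_line=input, write_line=print):
    """Answer questions until the user types 'quit'; return how many were answered."""
    answered = 0
    while True:
        question = read_line("Question: ").strip()
        if not question:
            continue  # ignore empty input and ask again
        # Assumption: "quit" is also passed to the bot so it can sign off.
        write_line(answer_fn(question))
        if question.lower() == "quit":
            break
        answered += 1
    return answered

# Non-interactive demonstration with scripted input and an echoing stand-in bot.
scripted_input = iter(["Who won?", "", "quit"])
outputs = []
count = chat_loop(
    lambda q: f"echo:{q}",
    read_line=lambda prompt: next(scripted_input),
    write_line=outputs.append,
)
print(count, outputs)
```

In an interactive session, the loop could be started with `chat_loop(lambda q: answer_question(q, df))`, using the `answer_question` function and dataframe defined in this notebook.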