## Create your own Chatbot 

In today's rapidly evolving digital landscape, demand for intelligent, context-aware chatbots is high. Meeting that demand requires understanding how to leverage advanced AI models effectively.

<img src="img/superheroes.png" alt="Alt Text" width="500"/>

This project aims to build a sophisticated OpenAI Chatbot that utilizes a dataset of superheroes to demonstrate the enhanced performance achieved through custom prompts. By integrating detailed and context-specific information from our dataset, we aim to significantly improve the relevance and specificity of the chatbot's responses.

Our objectives are to showcase the chatbot's capabilities in understanding nuanced queries, highlight the effectiveness of custom prompts, and provide a comprehensive analysis of how dataset-driven approaches can elevate AI performance.

The dataset is available on Kaggle: https://www.kaggle.com/datasets/jonathanbesomi/superheroes-nlp-dataset?select=superheroes_nlp_dataset.csv

### Imports

In [1]:
import os

import pandas as pd
import numpy as np
import openai
from openai import OpenAI
import tiktoken
from sklearn.metrics.pairwise import cosine_similarity

### Function Definitions

The functions below handle data wrangling and the calls to OpenAI's language models:
- generate_embeddings(text) creates a vector embedding of the input text with the text-embedding-3-small model. Embeddings make it possible to compare texts by semantic similarity.
- get_openai_response(query) sends the query to the OpenAI API and returns the response generated by the gpt-3.5-turbo model.
- create_prompt(question, df, prompt_template, with_context=True) builds a prompt from the user's question and, optionally, context rows from the DataFrame. Rows are added in order until the token limit (2048 tokens, counted with a tiktoken tokenizer) would be exceeded.
- answer_my_question(question, df, prompt_template, with_context=True) ties the previous functions together: it embeds the question, ranks the rows of df by cosine similarity to it, builds the prompt, and returns the model's response. If the API call raises an exception, it prints the error and returns "An error occurred".

Together, these functions implement context-aware question answering over the dataset.
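To make the ranking step more concrete, here is a minimal sketch of ordering rows by cosine similarity to a query. It uses plain Python instead of scikit-learn, and the three-dimensional vectors are made-up stand-ins for real embeddings, which have many more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", chosen only for illustration.
rows = {
    "A-Bomb": [0.9, 0.1, 0.0],
    "Abe Sapien": [0.1, 0.9, 0.2],
    "Agent Bob": [0.0, 0.2, 0.9],
}
query_embedding = [0.2, 0.8, 0.1]  # stands in for the embedded question

# Most similar row first, as answer_my_question does with sort_values.
ranked = sorted(rows, key=lambda name: cosine(rows[name], query_embedding), reverse=True)
print(ranked)
```

The row whose vector points in nearly the same direction as the query comes first, so it is the first candidate to be added to the prompt as context.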

#### Data Processing Functions

In [2]:
def clean_df(df, column_names):
    # Keep only the requested columns, then drop rows with missing values
    df = df[column_names]
    df = df.dropna()
    return df

#### LLM & Embedding Functions

In [3]:
def generate_embeddings(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )

    return response.data[0].embedding

def batch_generate_embeddings(df, text_column, batch_size=100):
    embeddings = []
    for i in range(0, len(df), batch_size):
        batch_texts = df[text_column].iloc[i:i+batch_size].tolist()
        # The embeddings endpoint accepts a list of inputs and returns one item per input
        response = client.embeddings.create(
            input=batch_texts,
            model="text-embedding-3-small"
        )
        embeddings.extend(item.embedding for item in response.data)

    return embeddings

def get_openai_response(query):
    response = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": query},
      ]
    )
    return response.choices[0].message.content

def create_prompt(question, df, prompt_template, with_context=True):
    # Hardcoded configuration values
    MAX_PROMPT_TOKENS = 2048
    ENCODING = "cl100k_base"  # Encoding used by gpt-3.5-turbo

    # Create a tokenizer that matches the chat model, so token counts reflect the real prompt size
    tokenizer = tiktoken.get_encoding(ENCODING)

    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    context = []

    if with_context:
        for text in df["concatenated_text"].values:
            # Count tokens for each piece of context
            text_token_count = len(tokenizer.encode(text))
            current_token_count += text_token_count
            
            # Check if adding this context exceeds the max token limit
            if current_token_count <= MAX_PROMPT_TOKENS:
                context.append(text)
            else:
                break
        
        # Format the prompt with the included context
        prompt = prompt_template.format("\n\n###\n\n".join(context), question)
    else:
        # Format the prompt without context
        prompt = prompt_template.format("", question)

    return prompt

def answer_my_question(question, df, prompt_template, with_context=True):
    # Generate embeddings for the question
    query_embedding = generate_embeddings(question)

    # Compute cosine similarities
    df['similarity'] = df['embeddings'].apply(lambda x: cosine_similarity([x], [query_embedding])[0][0])
    df = df.sort_values(by='similarity', ascending=False)
    
    # Create the prompt
    prompt = create_prompt(question, df, prompt_template, with_context)
   
    try:
        # Call the OpenAI chat completions endpoint
        return get_openai_response(prompt)
    except Exception as e:
        print(e)
        return "An error occurred"
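The token-budget logic in create_prompt can be hard to follow inside the larger function. The sketch below isolates it under simplifying assumptions: pack_context and word_count are hypothetical names, and word_count is a stand-in that counts whitespace-separated words rather than real tiktoken tokens.

```python
def pack_context(texts, question, template, max_tokens, count_tokens):
    """Greedily add texts, in order, while the running total fits the budget."""
    used = count_tokens(template) + count_tokens(question)
    kept = []
    for text in texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # stop at the first text that would overflow the budget
        used += cost
        kept.append(text)
    return kept, used

def word_count(s):
    # Stand-in tokenizer: one "token" per whitespace-separated word.
    return len(s.split())

texts = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
kept, used = pack_context(
    texts, "who is alpha?", "Context: {} Question: {}", 12, word_count
)
print(kept, used)
```

Because the rows are sorted by similarity before this step, stopping at the first row that does not fit keeps the most relevant rows and drops the rest.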


The interactive_chat function facilitates an interactive chat session where users can ask questions and receive answers generated by a language model. The function repeatedly prompts the user for a question, optionally includes context from a DataFrame, and then prints the response.

In [4]:
def interactive_chat(df, prompt_template):
    while True:
        question = input("Please enter your question (or type 'exit' to quit): ")
        if question.lower() == 'exit':
            break
        
        with_context_input = input("Do you want to include context? (yes/no): ").strip().lower()
        with_context = with_context_input == 'yes'
        
        answer = answer_my_question(question, df, prompt_template, with_context)
        print(f"Answer: {answer}")
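For a side-by-side comparison without typing each question twice, a non-interactive helper may be convenient. The following is a sketch only: compare_answers is a hypothetical helper, and fake_answer is a stub that stands in for answer_my_question so the example runs without an API key.

```python
def compare_answers(question, answer_fn):
    """Return the answers obtained without and with context for one question."""
    return {
        "without_context": answer_fn(question, with_context=False),
        "with_context": answer_fn(question, with_context=True),
    }

def fake_answer(question, with_context):
    # Stub standing in for answer_my_question; it makes no API call.
    return f"[context={'on' if with_context else 'off'}] {question}"

result = compare_answers("what is Agent Bob's occupation?", fake_answer)
for mode, answer in result.items():
    print(f"{mode}: {answer}")
```

In the notebook, the stub could be replaced with `lambda q, with_context: answer_my_question(q, df, prompt_template, with_context)`.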

In [5]:
# Replace the placeholder with your own key; OpenAI() reads OPENAI_API_KEY from the environment
os.environ["OPENAI_API_KEY"] = 'YOUR API KEY'

### Data wrangling

In this section, we load our dataset of superheroes, drop unnecessary columns, and concatenate the relevant columns into a single comprehensive text column. This consolidated text is used to generate embeddings, which in turn are used to build context-specific prompts for our OpenAI chatbot.

Actions performed:
- Select only 7 columns
- Eliminate rows with NaNs
- Keep the first 20 superheroes (the full dataset is large)

This dataset is valuable because it contains detailed information about superheroes that OpenAI's model may not know or may recall only vaguely, which makes related questions hard to answer accurately without this data. The dataset is largely well-curated and requires minimal transformation or cleaning.

In [6]:
file_path = './data/superheroes_nlp_dataset.csv'
df = pd.read_csv(file_path)
df.rename(columns={'history_text': 'text'}, inplace=True)

In [7]:
selected_columns = ['name', 'real_name', 'full_name', 'overall_score', 'text', 'powers_text', 'occupation']
df = clean_df(df, selected_columns)

In [8]:
df = df[:20].copy()  # copy so that later column assignments do not act on a slice
df

Unnamed: 0,name,real_name,full_name,overall_score,text,powers_text,occupation
2,A-Bomb,Richard Milhouse Jones,Richard Milhouse Jones,20,"Richard ""Rick"" Jones was orphaned at a young ...","On rare occasions, and through unusual circu...","Musician, adventurer, author; formerly talk sh..."
6,Abe Sapien,Abraham Sapien,Abraham Sapien,10,"Sapien began life as Langdon Everett Caul, a ...",Abe is a humanoid amphibious creature. He has...,Paranormal Investigator
8,Abomination,Emil Blonsky,Emil Blonsky,22,"Formerly known as Emil Blonsky, a spy of Sovie...",'Blonsky''s transformation into the Abominatio...,Ex-Spy
9,Abra Kadabra (CW),Unknown,Unknown,13,"""Abra Kadabra"" was a criminal time traveler fr...",Abra Kadabra was augmented with various nanot...,Time-Travelling Criminal
11,Abraxas,Abraxas,Abraxas,∞,"Born within the abstract entity Eternity, Abra...","As antithesis to the cosmic entity Eternity, A...",Dimensional destroyer
12,Absorbing Man (MCU),Carl Creel,Carl Creel,8,"Carl ""Crusher"" Creel was an enhanced individua...",Carl Creel was able to duplicate at will the ...,"Government Agent, Bodyguard"
13,Absorbing Man,Carl Creel,Carl Creel,13,"Before he turned to crime, Creel fought as a...",The Absorbing Man possesses the ability to ...,Professional criminal; former professional boxer
16,Acidicus,Acidicus,Acidicus,10,"During the Serpentine Wars, Acidicus and the o...",Acidicus has the power just like The other Ven...,Leader Of The Venomari Tribe
19,Adam Strange,Adam Strange,Adam Strange,7,"While working on a dig in Caramanga, South Ame...","On Rann, the planet's gravity enhances Adam's...","Adventurer, archaelogist, ambassador"
22,Agent Bob,Bob,Bob,2,"Bob, Agent of Hydra, is a sidekick of Deadpo...",Bob is a normal guy in absolutely every way. ...,"Mercenary, janitor; former pirate, terrorist"


### Embedding Generation

The next cell adds a concatenated_text column to df. For each row, it combines the name, real_name, full_name, overall_score, text, powers_text, and occupation columns into a single sentence-like string.

In [9]:
df['concatenated_text'] = df.apply(
    lambda row: f"{row['name']} whose real name is {row['real_name']} and full name is {row['full_name']} "
                f"has an overall score of {row['overall_score']}. History: {row['text']} "
                f"with powers: {row['powers_text']}. Occupation: {row['occupation']}",
    axis=1
)
df

Unnamed: 0,name,real_name,full_name,overall_score,text,powers_text,occupation,concatenated_text
2,A-Bomb,Richard Milhouse Jones,Richard Milhouse Jones,20,"Richard ""Rick"" Jones was orphaned at a young ...","On rare occasions, and through unusual circu...","Musician, adventurer, author; formerly talk sh...",A-Bomb whose real name is Richard Milhouse Jon...
6,Abe Sapien,Abraham Sapien,Abraham Sapien,10,"Sapien began life as Langdon Everett Caul, a ...",Abe is a humanoid amphibious creature. He has...,Paranormal Investigator,Abe Sapien whose real name is Abraham Sapien a...
8,Abomination,Emil Blonsky,Emil Blonsky,22,"Formerly known as Emil Blonsky, a spy of Sovie...",'Blonsky''s transformation into the Abominatio...,Ex-Spy,Abomination whose real name is Emil Blonsky an...
9,Abra Kadabra (CW),Unknown,Unknown,13,"""Abra Kadabra"" was a criminal time traveler fr...",Abra Kadabra was augmented with various nanot...,Time-Travelling Criminal,Abra Kadabra (CW) whose real name is Unknown a...
11,Abraxas,Abraxas,Abraxas,∞,"Born within the abstract entity Eternity, Abra...","As antithesis to the cosmic entity Eternity, A...",Dimensional destroyer,Abraxas whose real name is Abraxas and full na...
12,Absorbing Man (MCU),Carl Creel,Carl Creel,8,"Carl ""Crusher"" Creel was an enhanced individua...",Carl Creel was able to duplicate at will the ...,"Government Agent, Bodyguard",Absorbing Man (MCU) whose real name is Carl Cr...
13,Absorbing Man,Carl Creel,Carl Creel,13,"Before he turned to crime, Creel fought as a...",The Absorbing Man possesses the ability to ...,Professional criminal; former professional boxer,Absorbing Man whose real name is Carl Creel an...
16,Acidicus,Acidicus,Acidicus,10,"During the Serpentine Wars, Acidicus and the o...",Acidicus has the power just like The other Ven...,Leader Of The Venomari Tribe,Acidicus whose real name is Acidicus and full ...
19,Adam Strange,Adam Strange,Adam Strange,7,"While working on a dig in Caramanga, South Ame...","On Rann, the planet's gravity enhances Adam's...","Adventurer, archaelogist, ambassador",Adam Strange whose real name is Adam Strange a...
22,Agent Bob,Bob,Bob,2,"Bob, Agent of Hydra, is a sidekick of Deadpo...",Bob is a normal guy in absolutely every way. ...,"Mercenary, janitor; former pirate, terrorist",Agent Bob whose real name is Bob and full name...


Applying the generate_embeddings function to the concatenated_text column produces one vector embedding per row, stored in the embeddings column. Because concatenated_text gathers all relevant character details into a single string, each embedding represents the whole record. These embeddings make it possible to rank rows by semantic similarity to a question and to build context-aware prompts from the best matches.

In [10]:
client = OpenAI()

In [11]:
df['embeddings'] = df['concatenated_text'].apply(generate_embeddings)

In [12]:
df['embeddings'][:5]  # Show only the first 5 embeddings as an example

2     [0.03193848207592964, 0.02529870718717575, 0.0...
6     [0.049436766654253006, 0.06654949486255646, 0....
8     [-0.001264746766537428, 0.05437133461236954, -...
9     [0.0024886061437427998, 0.005265010055154562, ...
11    [0.011677389964461327, 0.03187060356140137, 0....
Name: embeddings, dtype: object
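Applying generate_embeddings row by row issues one API request per row. For larger samples, grouping texts into batches may reduce the number of requests, since the embeddings endpoint accepts a list of inputs. The sketch below shows only the batching pattern: batched and embed_all are hypothetical helpers, and fake_embed_batch is a stub that replaces the real API call.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(texts, embed_batch, batch_size=100):
    """Embed texts batch by batch, preserving input order."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings

calls = []

def fake_embed_batch(batch):
    # Stub: records the batch size and returns one tiny vector per text.
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]

vectors = embed_all(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
print(len(calls), vectors)
```

With a real client, embed_batch could wrap `client.embeddings.create(input=batch, model="text-embedding-3-small")` and return `[item.embedding for item in response.data]`.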

### Custom Query Completions

Here is a list of names to help you ask about heroes if you're not familiar with the world of superheroes:

In [13]:
df['name'][:20]

2                    A-Bomb
6                Abe Sapien
8               Abomination
9         Abra Kadabra (CW)
11                  Abraxas
12      Absorbing Man (MCU)
13            Absorbing Man
16                 Acidicus
19             Adam Strange
22                Agent Bob
23             Agent Carter
28               Agent Zero
30               Air-Walker
33                    Akita
37        Alfred Pennyworth
45            Amanda Waller
53           Ando Masahashi
56    Angel Salvadore (FOX)
62                Annihilus
63               Ant-Man II
Name: name, dtype: object

#### Interactive Chat

In [21]:
# Define the prompt template
prompt_template = """
                    Context: {}
                    ---
                    Question: {}
                    """

In [None]:
interactive_chat(df, prompt_template)

Please enter your question (or type 'exit' to quit): what is Angel Salvadore main occupation
Do you want to include context? (yes/no): no
Answer: Angel Salvadore is a fictional character from the Marvel Comics universe. She is a mutant with insect-like wings and the ability to fly. In the comics, Angel has been affiliated with different groups such as the X-Men and the New Warriors. Her main occupation varies depending on the storyline, but she is often depicted as a superhero or a member of mutant teams fighting for justice and equality.
Please enter your question (or type 'exit' to quit): what is angel Salvadore main occupation
Do you want to include context? (yes/no): yes
Answer: Angel Salvadore's main occupation is an Exotic Dancer.
Please enter your question (or type 'exit' to quit): what are Ando Masahashi powers?
Do you want to include context? (yes/no): no
Answer: Ando Masahashi, a character from the television series "Heroes," does not possess any special abilities or powers. 

Providing context helps the language model focus on the most relevant information, leading to more accurate and specific answers. For Angel Salvadore, the answer without context is generic and says her occupation "varies depending on the storyline", whereas the answer with context returns the occupation recorded in the dataset. For Ando Masahashi, the model without context states that he has no powers; with context, it was able to recognize his powers and their effects, correcting that initial misconception.
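To see what the model actually receives in each mode, it can help to print the two prompts. The sketch below is illustrative: build_prompt is a hypothetical helper that mirrors the formatting in create_prompt without the token budget, the template is a compact version of the one above, and the context row is abbreviated.

```python
prompt_template = "Context: {}\n---\nQuestion: {}"

def build_prompt(question, context_rows):
    """Join retrieved rows with the separator used in create_prompt."""
    context = "\n\n###\n\n".join(context_rows)
    return prompt_template.format(context, question)

question = "what is Agent Bob's main occupation?"
# Abbreviated, illustrative row; real rows hold the full concatenated_text.
rows = ["Agent Bob whose real name is Bob ... Occupation: Mercenary, janitor"]

print(build_prompt(question, []))    # no context: the model relies on its own knowledge
print("=====")
print(build_prompt(question, rows))  # with context: the answer is present in the prompt
```

Without context the Context field is empty, so the model has only its training knowledge to draw on. With context, the relevant fact sits directly in the prompt.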