<a href="https://colab.research.google.com/github/m-azra3l/star-trek-chat-bot/blob/main/My_Copy_of_TrekBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

TrekBot Colab

Install a few necessary packages.

In [None]:
!pip install numpy
!pip install pandas
!pip install scikit-learn
# The code below uses the pre-1.0 openai interface (openai.Embedding, openai.Completion)
!pip install "openai<1"
!pip install python-dotenv

Next we import the packages we'll need.


**pandas** is a package that lets us conveniently store and manipulate data in a data structure known as a DataFrame. (This is similar to a data frame in R, for those familiar with R.) It’s a very common tool for anyone doing data science in Python.

**sklearn** is the package formally called “scikit-learn”, and contains a wide range of statistical and machine learning methods. It’s another very common package for data scientists in Python.

**numpy** is Python’s main numeric library, and allows us to do things like work with arrays, matrices, dot products, etc.

**json** is a package for reading and writing JSON files. Our data is formatted as a single JSON file, so this is useful for us here.

**os** helps us with file management and with reading environment variables.

**openai** is a package containing functions that allow us to easily make API calls to OpenAI’s models in Python.

We import **cosine_similarity** from sklearn, since it’s a specialized function that we need.

We import the **dotenv** module and load the environment variables using its **load_dotenv()** function.

Finally, we import **drive** from **google.colab** so that we can access Google Drive.

In [16]:
# Import the required libraries
import pandas as pd
import numpy as np
import json
import openai
from sklearn.metrics.pairwise import cosine_similarity
import os
import dotenv
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Load environment variables from the .env file
dotenv_path = '/content/drive/MyDrive/Colab Notebooks/.env'  # Replace with the correct path to your .env file
dotenv.load_dotenv(dotenv_path)

# Define constants
CHUNK_SIZE = 600
OVERLAP = 20

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Remember the OpenAI API key you created? Copy and paste it into your .env file, on a line of the form API_KEY = your API key.
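If the variable is missing from the .env file, `os.getenv` returns `None` and the problem only shows up later, when the first API call fails. A minimal sketch of an early check follows. The helper name `require_env` and the variable name `DEMO_API_KEY` are made up for illustration and are not part of the notebook or of any library.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper, shown only to illustrate an early check.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value


# Demonstration with a placeholder variable, so a real key is never overwritten.
os.environ["DEMO_API_KEY"] = "placeholder-key"
demo_key = require_env("DEMO_API_KEY")
```

In the notebook, the same check could be applied to `API_KEY` right after `dotenv.load_dotenv(dotenv_path)` has run.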

In [17]:
# Set the OpenAI API key from the environment variable
openai.api_key = os.getenv('API_KEY')  # Replace 'API_KEY' with the name of your environment variable


Here's what the model is doing: we have a long piece of text that we want the GPT model to be able to answer questions about. We first break that text up into chunks of 600 words (the code splits on whitespace, so these are words rather than the “tokens” the model counts), where each chunk overlaps 20 words with the following chunk. We then send these chunks to OpenAI to obtain their embeddings. When we ask a question about our text, we find the question’s embedding, and use cosine similarity to find the chunk of text that is closest to our question. We then send a query to the GPT model that includes our original question, as well as the chunk of text as context.
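To make the chunking step concrete, here is a small sketch that uses the same slicing idea as the notebook, with a chunk size of 4 and an overlap of 1 so the result is easy to read. The function name `chunk_words` is an assumption made for this example; the notebook does the same thing inline with a list comprehension.

```python
def chunk_words(words, chunk_size, overlap):
    """Split a list of words into chunks of up to `chunk_size` words,
    where consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]


words = "a b c d e f g h i j".split()
demo_chunks = chunk_words(words, chunk_size=4, overlap=1)
for demo_chunk in demo_chunks:
    print(demo_chunk)
```

Each chunk starts with the last word of the previous one. Note that the final chunk can be shorter than the others, and here it contains only the word `j`, which already appears at the end of the chunk before it.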

We loop over all the chunks, send each one to OpenAI, get back the embedding, and then write a new row to the DataFrame df. Note that we convert the embedding in the response (a list of floats) to a numpy array. We do this because we will be doing numerical operations on the embedding in just a moment.
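The conversion can be tried without calling the API. The sketch below uses a made-up dictionary that mimics only the part of the response the notebook reads (`['data'][0]['embedding']`). The three values are invented; real text-embedding-ada-002 embeddings contain 1536 numbers.

```python
import numpy as np

# Invented stand-in for the API response; only the keys the notebook reads are present.
fake_response = {"data": [{"embedding": [0.1, -0.2, 0.3]}]}

embedding_list = fake_response["data"][0]["embedding"]  # a plain Python list
embedding = np.array(embedding_list)                    # a 1-D numpy array

# Multiplying a list by 2 repeats it; multiplying an array by 2 scales each element.
repeated = embedding_list * 2
doubled = embedding * 2
length = np.linalg.norm(embedding)  # Euclidean length, used by cosine similarity
```

The difference between `repeated` and `doubled` is the reason for the conversion: only the array supports element-wise arithmetic.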

In [18]:
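# Load the scripts from the JSON file
# (not used further in this cell; the bot's knowledge comes from the text entered below)
with open("/content/drive/MyDrive/Colab Notebooks/data/all_scripts_raw.json", encoding='ascii') as scripts_file:  # Replace with your correct path
    scripts = json.load(scripts_file)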

# Prompt the user to enter the text to include for the bot's knowledge
text = input("Enter the text to include for the bot's knowledge: ")

# Split the text into a list of words
text_list = text.split()

# Divide the text into chunks
chunks = [text_list[i:i+CHUNK_SIZE] for i in range(0, len(text_list), CHUNK_SIZE-OVERLAP)]

# Create an empty DataFrame to store the chunks, GPT response, and embeddings
df = pd.DataFrame(columns=['chunk', 'gpt_raw', 'embedding'])

# Process each chunk
for chunk in chunks:
    # Generate the embedding for the chunk using OpenAI Embedding API
    f = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=" ".join(chunk),
    )

    # Store the chunk, GPT response, and embedding in the DataFrame
    df.loc[len(df.index)] = (chunk, f, np.array(f['data'][0]['embedding']))

Enter the text to include for the bot's knowledge: testing if the code didn't break


In [19]:
# Display the first few rows of the DataFrame
df.head()

| | chunk | gpt_raw | embedding |
|---|---|---|---|
| 0 | [testing, if, the, code, didn't, break] | {'object': 'list', 'data': [{'object': 'embedd... | [3.693080725497566e-05, 0.003650631522759795, ... |


Now, let’s define our query and get its embedding. We’ll customize the bot to include knowledge about the text the user provides. We’ll see that with the right chunk of text, identified by cosine similarity, the GPT model can answer correctly.

We calculate the cosine similarity between our query and each chunk, and save the chunk that is most similar to a variable called context_chunk.
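Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitudes. The sketch below computes it with numpy alone on invented two-dimensional vectors; the names `cosine_sim`, `toy_query` and `toy_chunks` are assumptions made for this example, and the notebook itself uses sklearn's `cosine_similarity`.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


toy_query = np.array([1.0, 0.0])
toy_chunks = [
    np.array([0.0, 1.0]),   # perpendicular to the query
    np.array([2.0, 0.1]),   # nearly the same direction, different length
    np.array([-1.0, 0.0]),  # opposite direction
]

scores = [cosine_sim(toy_query, chunk_vector) for chunk_vector in toy_chunks]
best_index = int(np.argmax(scores))  # position of the most similar chunk
```

The second vector wins even though it is twice as long as the query, because only direction matters. The notebook applies the same `np.argmax` step to pick `context_chunk`.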

Finally, we assemble the full query, including the chunk we identified, and send it to the GPT model via the API:

In [20]:
# Get user input for the query
query = input("Enter your query: ")

# Generate embedding for the query
f = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=query
)
query_embedding = np.array(f['data'][0]['embedding'])

# Calculate similarity scores between the query and every chunk in one call
embedding_matrix = np.vstack(df['embedding'].tolist())
similarity = cosine_similarity(query_embedding.reshape(1, -1), embedding_matrix)[0]
context_chunk = chunks[np.argmax(similarity)]

# Prepare query and context for OpenAI completion
query_to_send = "CONTEXT: " + " ".join(context_chunk) + "\n\n" + query
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=query_to_send,
    max_tokens=100,
    temperature=0
)

Enter your query: what did I test?


In [21]:
# Print the query and context sent to the GPT model
print(query_to_send)

CONTEXT: testing if the code didn't break

what did I test?


Let's test our bot. Did it get it right? Execute the cell below to find out!

In [22]:
# Print the generated response from GPT
print(response['choices'][0]['text'].strip())

You tested the code to see if it still works correctly after making changes or updates.
