### Build AI products with ChatGPT and your own data

This notebook is a production implementation of the [OpenAI Recipe](https://github.com/openai/openai-cookbook/blob/main/apps/chatbot-kickstarter/powering_your_products_with_chatgpt_and_your_data.ipynb) for building a custom AI product using your own data. Steps:
- **Setup:** 
    - Import libraries and get the custom data to load
- **Build database:**
    - Set up the vector database
    - Load database: read PDF files, split them into chunks for embedding, and store them in a Redis database
- **Build a Prompt based product:**
    - Ask a prompt and get the most relevant entries using semantic search
    - Pass results to GPT-3 for summarization
- **Build a Chat based product:**
    - Use ChatGPT to answer questions using semantic search context

By the end you'll have:
- Your data stored in a vector database
- A prompt API that can be used to build a prompt-based product
- A chat API that can be used to build a chat-based product

The original [openai-cookbook notebook](https://github.com/openai/openai-cookbook/blob/main/apps/chatbot-kickstarter/powering_your_products_with_chatgpt_and_your_data.ipynb) was presented with [these slides](https://drive.google.com/file/d/1dB-RQhZC_Q1iAsHkNNdkqtxxXqYODFYy/view?usp=share_link). We recommend reading both the notebook and the slides for full context.

In [1]:
%load_ext autoreload
%autoreload 2

## Setup

Import libraries and get the files to load

In [2]:
from typing import Iterator

import os
import openai
import requests
import numpy as np
import pandas as pd
import tiktoken
import textract
from numpy import array, average

from assistant.database import get_redis_connection
from assistant.settings import assistant_settings
from workspace.settings import ws_settings

In [3]:
data_dir = ws_settings.ws_root.joinpath("data")

pdf_files = sorted([x for x in data_dir.rglob("*.pdf")])
print(f"Found {len(pdf_files)} files")

Found 5 files


## Build database

### Set up vector database

We use Redis as the database for storing document contents and vector embeddings. Your workspace includes a redis-stack container that can be started using `phi ws up dev:docker:redis`.

After the `redis-stack-container` is running, run the following commands to create a `Redis` connection and search index.

In [4]:
from redis import Redis
from redis.commands.search.query import Query
from redis.commands.search.field import TextField, VectorField, NumericField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

redis_client = get_redis_connection()
print("Redis available: {}".format(redis_client.ping()))

Redis available: True


In [5]:
# Length of vectors
VECTOR_DIM = 1536
# Prefix for the documents
PREFIX = "sportsdoc"
# Distance metric for the vectors (ex. COSINE, IP, L2)
DISTANCE_METRIC = "COSINE"
# Name of the search index
INDEX_NAME = "f1-index"
# Name of search vector field
VECTOR_FIELD_NAME = "content_vector"

In [6]:
# Create a Hierarchical Navigable Small World (HNSW) index for semantic search.

# Define RediSearch fields for each of the columns in the dataset
# NOTE: This is where you should add any additional metadata you want to capture
filename = TextField("filename")
text_chunk = TextField("text_chunk")
file_chunk_index = NumericField("file_chunk_index")

# Define RediSearch vector fields to use HNSW index
text_embedding = VectorField(
    VECTOR_FIELD_NAME,
    "HNSW",
    {"TYPE": "FLOAT32", "DIM": VECTOR_DIM, "DISTANCE_METRIC": DISTANCE_METRIC},
)

# Add all field objects to a list to be created as an index
fields = [filename, text_chunk, file_chunk_index, text_embedding]

In [7]:
# Optional: Drop the index if it already exists
# redis_client.ft(INDEX_NAME).dropindex()

# Check index exists
try:
    _info = redis_client.ft(INDEX_NAME).info()
    print("Index exists")
except Exception as e:
    # Create RediSearch Index
    print("Creating RediSearch Index")
    redis_client.ft(INDEX_NAME).create_index(
        fields=fields,
        definition=IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH),
    )
    print("Index created")

Index exists
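
The index above declares `FLOAT32` vectors of length `VECTOR_DIM`. For `HASH` documents, RediSearch expects each vector as a raw byte string of that type and length. The sketch below shows one way to pack and unpack such a blob using only the standard library. The helper names are hypothetical, and how `assistant/transformers.py` actually serialises embeddings is not shown in this notebook. With NumPy, the usual equivalent is `np.array(vector, dtype=np.float32).tobytes()`.

```python
import struct

VECTOR_DIM = 1536  # same value as the notebook's VECTOR_DIM


def to_float32_bytes(vector):
    """Pack a sequence of floats into little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(blob):
    """Unpack little-endian float32 bytes back into a list of floats."""
    count = len(blob) // 4  # each float32 occupies 4 bytes
    return list(struct.unpack(f"<{count}f", blob))


# A made-up embedding, used only to show the blob size
embedding = [0.0] * VECTOR_DIM
embedding[0] = 0.5
blob = to_float32_bytes(embedding)
print(len(blob))  # 4 bytes per component
```

A blob whose length is not `4 * VECTOR_DIM` bytes does not match the field definition, so checking the length before writing is a cheap sanity test.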


### Ingestion

Steps to load data:
- Initiate the tokenizer
- Run a pipeline to:
    - Read pdf files into text
    - Split into chunks and create embeddings
    - Store in a Redis database
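
The chunking step in this pipeline can be sketched as follows. This is a minimal illustration, not the code in `assistant/transformers.py`: the function name, the chunk size and the overlap are assumptions, and whitespace-split words stand in for the token ids that `tiktoken` would produce.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a trailing fragment
        # that only repeats the overlap
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Stand-in "tokens": plain words instead of tiktoken ids
words = "one two three four five six seven".split()
chunks = chunk_tokens(words, chunk_size=3, overlap=1)
for chunk in chunks:
    print(chunk)
```

Overlapping windows are a common choice because a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks.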

In [8]:
# The assistant/transformers.py file contains all the transforming functions
# including ones to chunk, embed and load data
from assistant.transformers import handle_file_string

In [9]:
%%time
# Note: This step takes time as it loads the pdf_files

# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

# Process each PDF file and prepare for embedding
for pdf_file_path in pdf_files:
    print(f"Loading: {pdf_file_path}")

    # Extract the raw text from each PDF using textract
    text = textract.process(str(pdf_file_path), method="pdfminer")

    # Chunk each document, embed the contents and load to Redis
    handle_file_string(
        (pdf_file_path, text.decode("utf-8")),
        tokenizer,
        redis_client,
        VECTOR_FIELD_NAME,
        INDEX_NAME,
    )

Loading: /mnt/workspaces/ai-app/data/FIA Practice Directions - Competitor's Staff Registration System.pdf
[transformer] Created embedding for /mnt/workspaces/ai-app/data/FIA Practice Directions - Competitor's Staff Registration System.pdf
Loading: /mnt/workspaces/ai-app/data/fia_2022_formula_1_sporting_regulations_-_issue_9_-_2022-10-19_0.pdf
[transformer] Created embedding for /mnt/workspaces/ai-app/data/fia_2022_formula_1_sporting_regulations_-_issue_9_-_2022-10-19_0.pdf
Loading: /mnt/workspaces/ai-app/data/fia_2023_formula_1_technical_regulations_-_issue_4_-_2022-12-07.pdf
[transformer] Created embedding for /mnt/workspaces/ai-app/data/fia_2023_formula_1_technical_regulations_-_issue_4_-_2022-12-07.pdf
Loading: /mnt/workspaces/ai-app/data/fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf
[transformer] Created embedding for /mnt/workspaces/ai-app/data/fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf
Loading: /mnt/workspaces/ai-app/data/fia_formula_1_fin

In [10]:
# Check if documents have been inserted
redis_client.ft(INDEX_NAME).info()["num_docs"]

'1320'

## Build a Prompt based product

Build a prompt based product by:
- Getting results from Redis for a prompt using semantic search
- Passing the results to GPT-3 for summarisation
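
Semantic search ranks stored chunks by the distance between their embeddings and the embedding of the query. The sketch below illustrates that ranking with an exact, brute-force cosine distance over toy two-dimensional vectors. It is an assumption-laden illustration only: the index in this notebook uses HNSW, which is an approximate search, and the documents, vectors and function names here are made up.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity, so smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=2):
    """Return the k (distance, text) pairs closest to the query vector."""
    scored = [(cosine_distance(query_vector, vec), text) for text, vec in documents]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


# Toy "embeddings"; real ones would have VECTOR_DIM components
documents = [
    ("cost cap rules", [1.0, 0.0]),
    ("tyre allocation", [0.0, 1.0]),
    ("financial reporting", [1.0, 1.0]),
]
results = top_k([1.0, 0.0], documents, k=2)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```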

In [11]:
from assistant.database import get_redis_results

In [12]:
%%time

prompt_query = "what is the cost cap for a power unit in 2023"

result_df = get_redis_results(redis_client, prompt_query, index_name=INDEX_NAME)
print(f"Found {len(result_df)} matching results")
result_df.tail(2)

Found 2 matching results
CPU times: user 5.89 ms, sys: 3.55 ms, total: 9.45 ms
Wall time: 167 ms


| | id | result | certainty |
|---|---|---|---|
| 0 | 0 | 1 Each Power Unit Manufacturer must: (... | 0.160094380379 |
| 1 | 1 | 1 Each Power Unit Manufacturer must: (... | 0.160094380379 |


### Build a prompt that provides the original query, the results and asks GPT to summarize them.

In [13]:
summary_prompt = """Summarise this result in a bulleted list to answer the search query a customer has sent.
Search query: {}
Search result: {}
Summary:
""".format(
    prompt_query, result_df["result"][0]
)
# print(f"Prompt: {summary_prompt}")

In [14]:
summary = openai.Completion.create(
    engine=assistant_settings.completions_model, prompt=summary_prompt, max_tokens=500
)
# Response provided by GPT-3
print(summary["choices"][0]["text"])


- The Power Unit Cost Cap for 2023 is set at 95,000,000 US Dollars.
- The Cost Cap is adjusted for Indexation in the Full Year Reporting Periods ending on 31 December 2024 and 2025. 
- For the Full Year Reporting Period ending on 31 December 2026 and subsequent periods, the cost cap is set at 130,000,000 US Dollars. 
- Each Power Unit Manufacturer must demonstrate and cooperate with the Cost Cap Administration in its regulatory function, including providing information and documentation and acting in a spirit of Good Faith and cooperation.


### Test Prompt App

Now that we have our knowledge embedded and stored in Redis, we can create a Streamlit application to test prompts internally.

Run `phi ws up dev:docker:app` to run the Streamlit App on http://localhost:9095

Open the Streamlit app in your browser and click on the Prompt Product to ask questions from your embedded data.

__Example Questions__:
- what is the cost cap for a power unit in 2023
- what should competitors include on their application form

## Build a Chat based product

The Prompt product was useful, but fairly limited in the complexity of interaction. If the user asks a sub-optimal question, there is no assistance from the system to prompt them for more info or conversation to lead them down the right path.

Let's build a Chat product using the Chat Completions endpoint, which will:
- Accept instructions on how it should act and what the goals of its users are
- Accept some required information that it needs to collect
- Go back and forth with the customer until it has populated that information
- Say a trigger word that will kick off semantic search and summarisation of the response

For more details on our Chat Completions endpoint and how to interact with it, please check out the docs [here](https://platform.openai.com/docs/guides/chat).

### Framework

This section outlines a basic framework for working with the API and storing context of previous conversation "turns". Once this is established, we'll extend it to use our retrieval endpoint.
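
The Chat Completions endpoint does not remember earlier requests, so the caller has to resend previous turns with each new message. A minimal sketch of that bookkeeping is shown below. `ConversationLog` and `echo_responder` are hypothetical names, not the `Chatbot` class from `assistant.chatbot`, and the responder is a stub standing in for the API call.

```python
class ConversationLog:
    """Keep every turn so the full history can be sent with each request."""

    def __init__(self, responder):
        # responder: callable taking the full message list and returning reply text
        self.responder = responder
        self.history = []

    def ask(self, new_messages):
        """Append the new messages, get a reply over the full history, store it."""
        self.history.extend(new_messages)
        reply = {"role": "assistant", "content": self.responder(self.history)}
        self.history.append(reply)
        return reply


def echo_responder(history):
    """Stub for the model call: reports how many user turns it can see."""
    user_turns = [m for m in history if m["role"] == "user"]
    return f"Seen {len(user_turns)} user message(s); last: {user_turns[-1]['content']}"


log = ConversationLog(echo_responder)
log.ask([{"role": "user", "content": "What can you do to help me"}])
reply = log.ask([{"role": "user", "content": "Tell me more about option 2"}])
print(reply["content"])
```

Because the stub sees both user turns on the second call, a follow-up such as "option 2" can be resolved against the earlier answer.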

In [15]:
# A basic example of how to interact with the ChatCompletion endpoint
# It requires a list of "messages", consisting of a "role" (one of system, user or assistant) and "content"

question = "How can you help me"
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo", messages=[{"role": "user", "content": question}]
)
print(
    f"{completion['choices'][0]['message']['role']}: {completion['choices'][0]['message']['content']}"
)

assistant: As an AI language model, I can help you in many ways! Here are some examples:

1. Answering questions: I can provide you with quick and accurate answers to any questions you may have on a variety of topics such as science, history, geography, and more.

2. Generating ideas: I can provide you with ideas and suggestions for creative endeavors such as writing, art, or even gift ideas.

3. Writing assistance: If you need help with writing, I can offer suggestions for grammar, spelling, syntax, and formatting to help improve your work.

4. Entertainment: I can provide engaging conversation, play games with you or even offer fun anecdotes to brighten up your day.

5. Personal assistant: I can help you with scheduling, planning, setting reminders, and even alarms.

6. Language translation: I can assist you by translating phrases, sentences, or paragraphs to different languages.

There are so many ways that I can help you with everyday tasks or with specific projects. Just let me kn

In [16]:
from assistant.message import Message
from assistant.chatbot import Chatbot

In [17]:
# Initiate a Chatbot conversation
conversation = Chatbot()

# Create a list to hold the messages and insert:
# 1. A system message to guide behaviour
# 2. The first user question

messages = []
system_message = Message(
    "system", "You are a helpful business assistant who has innovative ideas"
)
user_message = Message("user", "What can you do to help me")
messages.append(system_message.message())
messages.append(user_message.message())
messages

[{'role': 'system',
  'content': 'You are a helpful business assistant who has innovative ideas'},
 {'role': 'user', 'content': 'What can you do to help me'}]

In [18]:
# Get back a response from the Chatbot to our question
response_message = conversation.ask_assistant(messages)
print(response_message["content"])

As a business assistant, I can help you in various ways. Here are some innovative ideas that I can offer:

1. Social media management: I can help you manage your social media accounts and create engaging content to increase your online presence.

2. Email marketing: I can help you create and send out email campaigns to your subscribers to promote your products or services.

3. Customer service: I can assist you in providing excellent customer service by responding to inquiries and resolving issues promptly.

4. Market research: I can conduct market research to help you identify new opportunities and stay ahead of your competitors.

5. Content creation: I can help you create high-quality content for your website, blog, or social media channels to attract and engage your target audience.

6. Project management: I can assist you in managing your projects, ensuring that they are completed on time and within budget.

7. Business strategy: I can help you develop a comprehensive business stra

In [19]:
next_question = "Tell me more about option 2"

# Initiate a fresh messages list and insert our next question
messages = []
user_message = Message("user", next_question)
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
print(response_message["content"])

Email marketing is a powerful tool that can help you reach out to your customers and promote your products or services. Here are some ways I can help you with email marketing:

1. Email campaign creation: I can help you create engaging email campaigns that are tailored to your target audience. This includes designing email templates, writing compelling copy, and creating eye-catching graphics.

2. Email list management: I can help you manage your email list by segmenting it based on demographics, interests, and behavior. This ensures that your emails are targeted and relevant to your subscribers.

3. Email automation: I can help you set up email automation workflows that send out emails based on triggers such as sign-ups, purchases, or abandoned carts. This saves you time and ensures that your subscribers receive timely and relevant emails.

4. Email analytics: I can help you track and analyze your email campaigns' performance, including open rates, click-through rates, and conversions

In [20]:
# Print out a log of our conversation so far

conversation.pretty_print_conversation_history()

user:
What can you do to help me
assistant:
As a business assistant, I can help you in various ways. Here are some innovative ideas that I can offer:

1. Social media management: I can help you manage your social media accounts and create engaging content to increase your online presence.

2. Email marketing: I can help you create and send out email campaigns to your subscribers to promote your products or services.

3. Customer service: I can assist you in providing excellent customer service by responding to inquiries and resolving issues promptly.

4. Market research: I can conduct market research to help you identify new opportunities and stay ahead of your competitors.

5. Content creation: I can help you create high-quality content for your website, blog, or social media channels to attract and engage your target audience.

6. Project management: I can assist you in managing your projects, ensuring that they are completed on time and within budget.

7. Business strategy: I c

### Knowledge retrieval

Now we'll extend the class to call a downstream service when the Chatbot emits a stop sequence.

The main changes are:
- The system message is more comprehensive, giving criteria for the Chatbot to advance the conversation
- Adding an explicit stop sequence for it to use when it has the info it needs
- Extending the class with a function `_get_search_results` which sources Redis results
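
The control flow behind these changes can be sketched as follows. This is a hedged illustration with hypothetical function names: it treats the phrase "searching for answers" from the system prompt as the stop sequence, and `fake_search` is a stub for the Redis lookup that `_get_search_results` performs in the real class, whose implementation is not shown here.

```python
TRIGGER_PHRASE = "searching for answers"  # the phrase the system prompt asks for


def needs_search(assistant_reply):
    """Return True when the reply contains the trigger phrase (case-insensitive)."""
    return TRIGGER_PHRASE in assistant_reply.lower()


def respond(assistant_reply, user_question, search_fn):
    """Pass the reply through, or swap in search results once the trigger appears."""
    if needs_search(assistant_reply):
        return search_fn(user_question)
    return assistant_reply


def fake_search(question):
    """Stub for the downstream retrieval service."""
    return f"[search results for: {question}]"


print(respond("Certainly, what year would you like this for?", "cost cap", fake_search))
print(respond("Searching for answers.", "cost cap 2023", fake_search))
```

Passing the search function in as an argument keeps the trigger logic testable without a running Redis instance.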

In [21]:
# Updated system prompt requiring Question and Year to be extracted from the user

system_prompt = """
You are a helpful Formula 1 knowledge base assistant. You need to capture a Question and Year from each customer.
The Question is their query on Formula 1, and the Year is the year of the applicable Formula 1 season.
If they haven't provided the Year, ask them for it again.
Once you have the Year, say "searching for answers".

Example 1:

User: I'd like to know the cost cap for a power unit

Assistant: Certainly, what year would you like this for?

User: 2023 please.

Assistant: Searching for answers.
"""

In [22]:
# Create another Chatbot conversation
conversation = Chatbot()

# Create a list to hold the messages and insert:
# 1. A system_prompt to guide behaviour
# 2. The first user question
messages = []
system_message = Message("system", system_prompt)
user_message = Message("user", "How can a competitor be disqualified from competition")
messages.append(system_message.message())
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
response_message

{'role': 'assistant', 'content': 'Sure, what year would you like this for?'}

In [23]:
messages = []
user_message = Message("user", "For 2023 please.")
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
# response_message

In [24]:
conversation.pretty_print_conversation_history()

user:
How can a competitor be disqualified from competition
assistant:
Sure, what year would you like this for?
user:
For 2023 please.
assistant:
According to the FIA Sporting Regulations for the 2023 Formula One season, a competitor can be disqualified from the competition if they breach the FIA regulations. The International Tribunal (IT) will investigate and establish the existence of the breach and impose any sanction upon the person and competitor concerned. The President of the FIA, in its capacity as prosecuting authority, will ask for the imposition of a suspension upon Competitor’s Staff Certificate of Registration holders who have contravened the FIA Code of Good Standing or the withdrawal of the Competitor’s Staff Certificate of Registration. The person and/or competitor sanctioned may bring an appeal before the ICA against the IT’s decision.


### Test Chatbot App

Run `phi ws up dev:docker:app` to run the Streamlit App on http://localhost:9095

Open the Streamlit app in your browser and click on the Chat Product to chat with your embedded data.

__Example Questions__:
- what is the cost cap for a power unit in 2023
- what should competitors include on their application form
- how can a competitor be disqualified

### Summary

By now you have:
- Your data embedded in a vector database.
- A prompt application to answer basic questions on that data.
- A chat application to chat with your data.

These are the foundational building blocks for any Prompt or Chat based products using OpenAI - these are your starting point, and we look forward to seeing what you build with them!