<a href="https://colab.research.google.com/github/moadtahri/gpt3/blob/master/Chatbot_with_custom_knowledge_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction

In this notebook we are going to create custom knowledge base to analyse a transcript using GPT-3.

#Install main dependencies

In [27]:
#%pip install llama-index
#%pip install langchain
#%pip install openai

# Some helper functions
The following code defines the functions we need to construct indexation of chunks of our transcript in order to overcome the limitations offered by OpenAI

This section, the parameters are still subjects to fine-tuning.

In [13]:
from llama_index import SimpleDirectoryReader, GPTListIndex, readers, GPTSimpleVectorIndex, LLMPredictor, PromptHelper, ServiceContext
from langchain import OpenAI
import sys
import os
from IPython.display import Markdown, display

def construct_index(directory_path):
    # set maximum input size
    max_input_size = 4096
    # set number of output tokens
    num_outputs = 2000
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600 

    # define prompt helper
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    # define LLM model_name="text-davinci-003"
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.5, model_name="text-babbage-001", max_tokens=num_outputs))
 
    documents = SimpleDirectoryReader(directory_path).load_data()
    
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

    index.save_to_disk('index.json')

    return index

def consult_canyas():
        
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    res=True
    while res:
            try:
                query = input("Please enter you request:")
                response = index.query(query)
                print(response)
                display(Markdown(f"Response: <b>{response.response}</b>"))
            except:
                 print('empty request')
                 res=False
                 

        

# OpenAI API Key
You need an Openai API key to be able to run this code.

If you don't have one yet, get it by [signing up](https://platform.openai.com/overview). Then click your account icon on the top right of the screen and select "View API Keys". Create an API key.

Then run the code below and paste your API key into the text input.

In [3]:
os.environ["OPENAI_API_KEY"] = input("Paste your OpenAI key here and hit enter:")

# Indexation

Now we are ready to construct the index. This will take every file in the folder 'data', split it into chunks, and embed it with OpenAI's embeddings API.

**Notice:** running this code will cost you credits on your OpenAPI account ($0.02 for every 1,000 tokens). If you've just set up your account, the free credits that you have should be more than enough for this experiment.

In [11]:
construct_index("knowledge_base/data")

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 13944 tokens


<llama_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex at 0x2517f7af4c0>

# run some thematic coding and text mining analysis

Here are a few queries you can run:
1. Who are the individuals participating in this interview?
2. What the main topics discussed?
3. what are the main themes?
4. what are the main relevant words with their counts?
5. what are the pains of Rens Erwin van in bullet points?

In [14]:
consult_canyas()

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3053 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 2 tokens


about PKI spotlight?

Muneer Mubashir and Rens, Erwin van talk about PKI spotlight.


Response: <b>about PKI spotlight?

Muneer Mubashir and Rens, Erwin van talk about PKI spotlight.</b>

INFO:openai:error_code=None error_message="[''] is not valid under any of the given schemas - 'input'" error_param=None error_type=invalid_request_error message='OpenAI API error received' stream_error=False
INFO:openai:error_code=None error_message="[''] is not valid under any of the given schemas - 'input'" error_param=None error_type=invalid_request_error message='OpenAI API error received' stream_error=False
INFO:openai:error_code=None error_message="[''] is not valid under any of the given schemas - 'input'" error_param=None error_type=invalid_request_error message='OpenAI API error received' stream_error=False
INFO:openai:error_code=None error_message="[''] is not valid under any of the given schemas - 'input'" error_param=None error_type=invalid_request_error message='OpenAI API error received' stream_error=False
INFO:openai:error_code=None error_message="[''] is not valid under any of the given schemas - 'input'" error_param=None error_type=invalid_request_error message='OpenAI

empty request
