<a href="https://colab.research.google.com/github/moadtahri/gpt3/blob/master/Chatbot_with_custom_knowledge_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook we build a custom knowledge base from an interview transcript and use GPT-3 to analyse it.

# Install main dependencies

In [21]:
#%pip install llama-index
#%pip install langchain

# Some helper functions
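The following code defines the functions we need to split the transcript into chunks and index them. Chunking is how we work around the input-size limit of OpenAI's models: the whole transcript does not fit into a single prompt, so only the relevant chunks are sent with each question.

The parameters in this section are still subject to tuning.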

In [23]:
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, LLMPredictor, PromptHelper, ServiceContext
from langchain import OpenAI
import os
from IPython.display import Markdown, display

def construct_index(directory_path):
    # set maximum input size
    max_input_size = 4096
    # set number of output tokens
    num_outputs = 2000
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600 

    # define prompt helper
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    # define LLM
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.5, model_name="text-davinci-003", max_tokens=num_outputs))
 
    documents = SimpleDirectoryReader(directory_path).load_data()
    
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

    index.save_to_disk('index.json')

    return index

def consult_canyas():
    # load the saved index and answer questions until the user interrupts the kernel
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    while True:
        query = input("What do you want to ask? ")
        response = index.query(query)
        display(Markdown(f"Response: <b>{response.response}</b>"))
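To make the roles of `chunk_size_limit` and `max_chunk_overlap` more concrete, here is a minimal sketch of chunking with overlap. It is not the splitter llama-index uses internally: it counts words instead of tokens, and the function name `split_into_chunks` is hypothetical. It only illustrates how consecutive chunks share a few items, so that a sentence cut at a boundary still appears whole in one of them.

```python
def split_into_chunks(words, chunk_size, overlap):
    """Split a list of words into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words. Illustration only: real
    splitters count tokens, not words.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# toy input: ten placeholder words w0 .. w9
words = [f"w{i}" for i in range(10)]
chunks = split_into_chunks(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With these toy settings, the last word of each chunk is repeated as the first word of the next one.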

# OpenAI API Key
You need an OpenAI API key to run this code.

If you don't have one yet, get it by [signing up](https://platform.openai.com/overview). Then click your account icon on the top right of the screen and select "View API Keys". Create an API key.

Then run the code below and paste your API key into the text input.

In [16]:
os.environ["OPENAI_API_KEY"] = input("Paste your OpenAI key here and hit enter:")
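Because `input` echoes the key and leaves it visible in the saved notebook output, a possible alternative is to read it with the standard library's `getpass`. The helper name `set_api_key` and its `prompt` and `env` parameters are assumptions made for this sketch; they exist so the function can be exercised without typing anything.

```python
import getpass
import os


def set_api_key(prompt=getpass.getpass, env=os.environ):
    """Read the API key without echoing it and store it in `env`."""
    key = prompt("Paste your OpenAI key here and hit enter: ").strip()
    if not key:
        raise ValueError("No API key entered")
    env["OPENAI_API_KEY"] = key
    return key
```

In the notebook, calling `set_api_key()` with no arguments prompts for the key and sets `OPENAI_API_KEY` as the cell above does.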

# Indexation

Now we are ready to construct the index. This takes every file in the folder `knowledge_base/data`, splits it into chunks, and embeds each chunk with OpenAI's embeddings API.

**Notice:** running this code will cost you credits on your OpenAI account ($0.02 for every 1,000 tokens). If you've just set up your account, your free credits should be more than enough for this experiment.
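The idea behind embedding the chunks can be sketched in a few lines: each chunk becomes a vector, the question is embedded the same way, and the chunks whose vectors are closest to the question are passed to the model. The three-dimensional vectors and chunk labels below are made up for illustration (real embeddings have far more dimensions), and `top_k` is a hypothetical helper, not part of llama-index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=1):
    """Return the labels of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# made-up embeddings for three chunks
toy_index = {
    "chunk about PKI": [0.9, 0.1, 0.0],
    "chunk about pricing": [0.1, 0.9, 0.0],
    "chunk about logons": [0.0, 0.2, 0.9],
}

# made-up embedding of a question that is mostly about PKI
best = top_k([1.0, 0.0, 0.1], toy_index, k=1)
print(best)
```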

In [24]:
construct_index("knowledge_base/data")

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 13944 tokens


<llama_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex at 0x1c2ddcf1400>
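The logged token count gives a rough idea of what the indexing step costs. The sketch below applies the rate quoted in the notice ($0.02 per 1,000 tokens) to the 13944 embedding tokens reported in the log. That rate is an assumption carried over from the notice, so the actual charge depends on the model's current pricing.

```python
def estimate_cost(tokens, usd_per_1k_tokens=0.02):
    """Estimate the cost in USD for a given number of tokens."""
    return tokens / 1000 * usd_per_1k_tokens


embedding_tokens = 13944  # from the build_index_from_nodes log line
cost = estimate_cost(embedding_tokens)
print(f"~${cost:.2f}")  # prints ~$0.28
```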

# Run some thematic coding and text mining analysis

Here are a few queries you can run:
1. Who are the individuals participating in this interview?
2. What are the main topics discussed?
3. What are the main themes?
4. What are the main relevant words with their counts?
5. What are the pains of Rens Erwin van in bullet points?
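If you would rather run a fixed list of questions in one go than type them one at a time, a small helper along these lines might do. It relies only on the `index.query(...)` call and the `.response` attribute used in the helper functions earlier. `run_queries`, `FakeIndex`, and `FakeResponse` are hypothetical names, with the fake classes standing in for the real index so the sketch runs without an API key.

```python
def run_queries(index, queries):
    """Ask each question once and collect the answers in a dict."""
    answers = {}
    for question in queries:
        response = index.query(question)
        answers[question] = response.response.strip()
    return answers


class FakeResponse:
    """Stand-in for the response object returned by the index."""

    def __init__(self, text):
        self.response = text


class FakeIndex:
    """Stand-in for the vector index; echoes the question back."""

    def query(self, question):
        return FakeResponse(f"\n(answer to: {question})")


queries = [
    "Who are the individuals participating in this interview?",
    "What are the main themes?",
]
answers = run_queries(FakeIndex(), queries)
for question, answer in answers.items():
    print(question, "->", answer)
```

In the notebook, you would pass the index loaded with `GPTSimpleVectorIndex.load_from_disk('index.json')` instead of `FakeIndex()`.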

In [27]:
consult_canyas()

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3055 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 9 tokens


Response: <b>
The individuals participating in this interview are Muneer Mubashir and Rens Erwin van.</b>

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3076 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 6 tokens


Response: <b>
The main topics discussed are PKI Spotlight, best practices for PKI, NTLM authentication, threat analysis engines, template misconfiguration checks, and the need for organizations to be serious about their PKI environment.</b>

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4069 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 6 tokens


Response: <b>

The main themes discussed in the context information are cyber security, compliance, PKI, threat landscape, attack surface, risk assessment, encryption, smart card logons, BitLocker encryption, and simplifying the use of PKI certificates for end user logons.</b>

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4278 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 10 tokens


Response: <b>
PKI - 5
Certificates - 5
Security - 5
Data - 5
Risk - 5
Protection - 4
Attack - 3
Recovery - 3
Active Directory - 3
Smart Card - 3
Logons - 3
Encryption - 3
Internet - 2
Vendors - 2
Vouchers - 2
Threat - 2
Landscape - 2
Business - 2
Access - 2
Systems - 2
BitLocker - 1
Recovery Keys - 1
Smart Card Logons - 1
PKI Generated Certificates - 1
Simplifying - 1</b>

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4216 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 11 tokens


Response: <b>

The pains of Rens Erwin van are that organizations do not understand the value of a PKI and the importance of protecting the keys. He also sees that organizations are choosing the cheapest certificate vendors instead of the most trustworthy, and that the Dutch government stopped their PKI hierarchy. He also believes that organizations lack the awareness of misconfigurations and vulnerabilities in their DCS that can expose the entire organization, as well as the lack of an internal certificate policy, trust policy, or guidelines on how to request and obtain certificates, and what can and cannot be used in a certificate.</b>

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4252 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 14 tokens


Response: <b>

• Difficulty in demonstrating the value of PKI to organizations
• Organizations opting for cheaper certificates instead of trusted certificates
• Lack of government decree requiring government organizations to use trusted certificates
• Difficulty in showing the difference in value between trusted and untrusted certificates
• Expensive HSM solutions
• Lack of visibility and understanding of misconfigurations and vulnerabilities in PKI systems
• Lack of organizational maturity and awareness of misconfigurations and vulnerabilities in PKI systems
• Difficulty in finding organizations with an internal certificate policy or trust policy
• Difficulty in obtaining certificates with proper guidelines on how to request and use them</b>

KeyboardInterrupt: Interrupted by user