# SAP by IEM: Data Science in Use - Demo 3: Chatbot
This notebook creates a very basic chatbot that lets you ask questions about books and articles (stored in HTM and PDF files). <br>**Author**: marton.szel@lynxanalytics.com

In [None]:
# Imports

import pandas as pd
from src.preprocess_utils import page_preprocessor, knowledge_base_maker
from src.gpt_utils import answer_question

# setting up autoreload
%reload_ext autoreload
%autoreload 2

In [None]:
# Parameters
_foldername = 'book_robot'

## Building up the knowledge base
Create the knowledge base from which the chatbot retrieves context for its answers.

### Preprocessing the files
The first step is to run a converter that performs basic cleaning on the files. **TODO**: You can improve it by adding more preprocessors or by switching to the LangChain library.
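
To make "basic cleaning" concrete, here is a minimal sketch of one possible preprocessor for HTM files. It is not the code behind `page_preprocessor`; the names `_TextExtractor` and `clean_html_text` are hypothetical, and the sketch only assumes that cleaning means dropping markup, scripts, and styles and collapsing whitespace.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        return " ".join(self._parts)


def clean_html_text(raw_html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return re.sub(r"\s+", " ", parser.text()).strip()


sample = "<html><body><h1>Title</h1>\n<p>Hello &amp;  world</p></body></html>"
print(clean_html_text(sample))
```

A real pipeline would probably also need PDF text extraction and removal of headers, footers, and navigation text, which this sketch does not attempt.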

In [None]:
page_preprocessor(
    path_in_folder=f'./data/01_raw/{_foldername}/', 
    path_out_folder=f'./data/02_preprocessed/{_foldername}/',
    _encoding='utf-8', _verbose=False)

### Create embeddings
The second step is to split the files into chunks and vectorize them; the resulting embeddings form the knowledge base. **TODO**: You can add a summary of each page and prepend it to every chunk from that page to get better retrieval results when using the bot. You can also use much better text splitters.
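
The splitting step can be sketched as a greedy sentence packer bounded by `min_token_size` and `max_token_size`. This is an illustration, not the logic inside `knowledge_base_maker`: the functions `count_tokens` and `split_into_chunks` are hypothetical, and whitespace-separated words stand in for model tokens (a real tokenizer would give different counts).

```python
import re


def count_tokens(text):
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def split_into_chunks(text, min_token_size=128, max_token_size=512):
    """Greedily pack sentences into chunks of at most max_token_size tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Hard-split any sentence that is longer than max_token_size on its own.
    pieces = []
    for sentence in sentences:
        words = sentence.split()
        for start in range(0, len(words), max_token_size):
            pieces.append(" ".join(words[start:start + max_token_size]))

    chunks, current, current_len = [], [], 0
    for piece in pieces:
        piece_len = count_tokens(piece)
        # Flush the current chunk if adding this piece would exceed the limit.
        if current and current_len + piece_len > max_token_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(piece)
        current_len += piece_len

    if current:
        tail = " ".join(current)
        if chunks and current_len < min_token_size:
            # A short remainder is merged into the previous chunk,
            # which can push that chunk above max_token_size.
            chunks[-1] = chunks[-1] + " " + tail
        else:
            chunks.append(tail)
    return chunks


text = " ".join(f"Sentence number {i} ends here." for i in range(100))
chunks = split_into_chunks(text, min_token_size=20, max_token_size=50)
print(len(chunks), [count_tokens(c) for c in chunks])
```

Merging the short tail keeps every chunk above the minimum size at the cost of occasionally exceeding the maximum; dropping the tail or keeping it as a small chunk are equally reasonable choices.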

In [None]:
knowledge_base_maker(
    path_in_folder=f'./data/02_preprocessed/{_foldername}/', 
    path_out_folder=f'./data/03_knowledge_base/{_foldername}/',
    _encoding='latin1', min_token_size=128, max_token_size=512, 
    model='openai', _sleeptime=0.1, _verbose=True)

pdf_knowledge_base = pd.read_pickle(
    f'./data/03_knowledge_base/{_foldername}/knowledge_base.pickle')

pdf_knowledge_base.head(2)

In [None]:
pdf_knowledge_base.n_tokens.plot(kind='hist', bins=20)

## Asking questions of the knowledge base using OpenAI
Convert the question to a vector and collect the most relevant entries from the knowledge base. Then ask ChatGPT to answer the question using that context. **TODO**: You can improve it by:
 * handling question/answer history
 * adding better bot prompts
 * writing better functions for retrieving the right information
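
The retrieval idea described above can be sketched as ranking chunks by cosine similarity to the question embedding and filling the context up to a token budget. This is not the implementation of `answer_question`. The functions `cosine_similarity` and `build_context` are hypothetical, and so are the `text` and `embedding` field names; only `n_tokens` appears in this notebook. With a pandas knowledge base, records of this shape could be obtained with `pdf_knowledge_base.to_dict('records')`, assuming matching column names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_context(question_embedding, records, max_context_len=1500):
    """Concatenate the most similar chunks without exceeding max_context_len tokens."""
    ranked = sorted(
        records,
        key=lambda r: cosine_similarity(question_embedding, r["embedding"]),
        reverse=True,
    )
    selected, used = [], 0
    for record in ranked:
        # Stop at the first chunk that no longer fits into the budget.
        if used + record["n_tokens"] > max_context_len:
            break
        selected.append(record["text"])
        used += record["n_tokens"]
    return "\n\n".join(selected)


# Toy two-dimensional embeddings, for illustration only.
records = [
    {"text": "attention", "embedding": [1.0, 0.0], "n_tokens": 10},
    {"text": "params", "embedding": [0.0, 1.0], "n_tokens": 10},
    {"text": "mixed", "embedding": [1.0, 1.0], "n_tokens": 10},
]
print(build_context([1.0, 0.1], records, max_context_len=20))
```

The returned string would then be placed in the prompt ahead of the user question. Supporting question/answer history would mean passing earlier turns along as well, which this sketch leaves out.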

In [None]:
# Example questions; uncomment the one you want to ask.
# user_question = "How many parameters does GPT-3 have?"
user_question = "What is the attention layer?"

In [None]:
_bot_answer = answer_question(
    path_in_knowledge_base=f'./data/03_knowledge_base/{_foldername}/knowledge_base.pickle', 
    in_question=user_question, project_name=_foldername, model='gpt3.5', 
    max_context_len=1500, _verbose=True)

In [None]:
print(_bot_answer)

**F.I.N.**