# Chatbot Implementation

## Source Preparation

Before developing the chatbot, the sources it will retrieve information from need to be converted to a plain-text format it can read. The code files (ipynb notebooks) and the latest report (Milestone 2, a pdf) will be converted to txt files and serve as the chatbot's main sources of information. "data_access_info.txt" and the GitHub README will also be included. All of the material will be stored in a folder called "Chatbot_Knowledge".

We will first convert the ipynb notebooks to txt files.

In [1]:
import nbformat
import os

def convert_ipynb_to_txt(ipynb_path, txt_path):

    """ Converts ipynb to txt file """

    # Load ipynb file
    with open(ipynb_path, 'r', encoding='utf-8') as notebook_file:
        notebook_content = nbformat.read(notebook_file, as_version=4)
    
    # Write content to txt file
    with open(txt_path, 'w', encoding='utf-8') as txt_file:

        # Write name of original file
        txt_file.write(f'\n\nThis file is the content of "{ipynb_path}"\n\n')

        for cell in notebook_content['cells']:
            # Handle code cells
            if cell['cell_type'] == 'code':
                txt_file.write('Code Cell:\n')
                txt_file.write('```python\n')
                txt_file.write(''.join(cell['source']))
                txt_file.write('\n```\n\n')
            
            # Handle markdown cells
            elif cell['cell_type'] == 'markdown':
                txt_file.write('Markdown Cell:\n')
                txt_file.write('## Markdown Content:\n')
                txt_file.write(''.join(cell['source']))
                txt_file.write('\n\n')
        
        print(f"{ipynb_path} content successfully written to '{txt_path}'")

# Convert ipynb notebooks to txt
storage_folder = '../Chatbot_Knowledge/'

# Create the storage folder if it does not exist yet
os.makedirs(storage_folder, exist_ok=True)

notebooks = ['data_collection.ipynb', 'data_processing.ipynb', 'training.ipynb', 'testing.ipynb']

for ipynb_path in notebooks:
    # splitext drops only the final extension, so names containing dots stay intact
    root_name = os.path.splitext(ipynb_path)[0]
    txt_path = root_name + '.txt'
    convert_ipynb_to_txt(ipynb_path, storage_folder + txt_path)

data_collection.ipynb content successfully written to '../Chatbot_Knowledge/data_collection.txt'
data_processing.ipynb content successfully written to '../Chatbot_Knowledge/data_processing.txt'
training.ipynb content successfully written to '../Chatbot_Knowledge/training.txt'
testing.ipynb content successfully written to '../Chatbot_Knowledge/testing.txt'
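
Since an ipynb file is plain JSON, the cell layout that the conversion relies on can be illustrated with the standard library alone. The following is only a sketch under that assumption, not the conversion used above: the helper name `notebook_json_to_text` and the toy notebook are hypothetical, and it assumes the version 4 layout in which each cell has a `cell_type` and a `source` stored either as a string or as a list of lines.

```python
import json

FENCE = '`' * 3  # three backticks, built here to keep this example easy to embed

def notebook_json_to_text(notebook_json):
    """Render the code and markdown cells of a v4 notebook (given as a JSON string) as text."""
    notebook = json.loads(notebook_json)
    parts = []
    for cell in notebook.get('cells', []):
        source = cell.get('source', '')
        # On disk, "source" may be a single string or a list of lines
        if isinstance(source, list):
            source = ''.join(source)
        if cell.get('cell_type') == 'code':
            parts.append(f'Code Cell:\n{FENCE}python\n{source}\n{FENCE}\n\n')
        elif cell.get('cell_type') == 'markdown':
            parts.append(f'Markdown Cell:\n## Markdown Content:\n{source}\n\n')
        # Other cell types (e.g. raw) are skipped, as in the function above
    return ''.join(parts)

# Hypothetical two-cell notebook used only for illustration
toy_notebook = json.dumps({
    'nbformat': 4,
    'nbformat_minor': 5,
    'metadata': {},
    'cells': [
        {'cell_type': 'markdown', 'metadata': {}, 'source': ['# Title\n', 'Some notes']},
        {'cell_type': 'code', 'metadata': {}, 'source': 'x = 1',
         'outputs': [], 'execution_count': None},
    ],
})

toy_text = notebook_json_to_text(toy_notebook)
print(toy_text)
```

Cell outputs are ignored here as well, so only the code and the markdown text end up in the knowledge base.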


Now we will convert the latest report (pdf) to a txt file.

In [2]:
import pdfplumber

latest_report_path = '../Report/Milestone2.pdf'

with pdfplumber.open(latest_report_path) as pdf:

    text = ''

    # Extract text from all pages
    for page in pdf.pages:

        # Guard against pages with no extractable text, which may yield None
        text += page.extract_text() or ''

    # Write to txt
    root_name = latest_report_path.split('/')[-1].split('.pdf')[0]
    txt_path = storage_folder + root_name + '.txt'

    # Use utf-8 explicitly, matching how the notebook txt files are written
    with open(txt_path, 'w', encoding='utf-8') as txt_file:
        txt_file.write(f'\n\nThis file is the content of "{latest_report_path}"\n\n')
        txt_file.write(text)

    print(f'{latest_report_path} content successfully written to {txt_path}')

CropBox missing from /Page, defaulting to MediaBox


../Report/Milestone2.pdf content successfully written to ../Chatbot_Knowledge/Milestone2.txt
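
One detail of the extraction loop is that page text is appended with no separator, so the last word of one page runs straight into the first word of the next. A possible variant is sketched below. It is not what was run above (changing it would also shift the token estimates slightly); the helper `join_page_texts` and the page strings are made up for illustration, with `None` standing in for a page that has no extractable text.

```python
def join_page_texts(page_texts, separator='\n'):
    """Join per-page text, treating missing text as empty and separating pages."""
    return separator.join(text or '' for text in page_texts)

# Made-up page contents; None stands for a page with no extractable text
pages = ['Introduction ends here', None, 'Methods start here']
joined = join_page_texts(pages)
print(repr(joined))
```

With pdfplumber, the list passed in could be built from `page.extract_text()` for each page in `pdf.pages`.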


We will now list the files currently in the knowledge base. Some were generated by the code above; others, such as "data_access_info.txt" and "requirements.txt", were copied into the folder manually. We can also estimate the number of tokens in each txt file using the T5 tokenizer. T5 is a free and open source model on Hugging Face, and its tokenizer can process long texts. OpenAI models use a different tokenizer, so these counts are only a rough estimate of what the API will see.

In [3]:
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('google-t5/t5-large')

# Show files in knowledge base and tokens
print('Files in Knowledge Base')
print('---------------------------')

for i, source_name in enumerate(os.listdir(storage_folder)):
    source_path = storage_folder + source_name

    with open(source_path, 'r', encoding='utf-8') as txt_file:
        content = txt_file.read()
        num_tokens = len(tokenizer(content)['input_ids'])
        print(f'{i+1}. {source_name}: {num_tokens} tokens')

Files in Knowledge Base
---------------------------
1. data_access_info.txt: 645 tokens
2. data_collection.txt: 1316 tokens
3. data_processing.txt: 9342 tokens
4. Milestone2.txt: 8839 tokens
5. README.txt: 812 tokens
6. requirements.txt: 141 tokens
7. testing.txt: 6769 tokens
8. training.txt: 16367 tokens


The content in all these files can then be combined into one txt file.

In [4]:
import os

# Load all text
storage_folder = '../Chatbot_Knowledge/'
combined_name = 'combined_knowledge.txt'
knowledge_base = ''

for source_name in os.listdir(storage_folder):

    # Skip the output of a previous run so it is not folded back into itself
    if source_name == combined_name:
        continue

    source_path = storage_folder + source_name

    with open(source_path, 'r', encoding='utf-8') as file:
        content = file.read()
        knowledge_base += content

# Combine all knowledge into one file
combined_path = storage_folder + combined_name

with open(combined_path, 'w', encoding='utf-8') as file:
    file.write(knowledge_base)

# Show number of tokens
num_tokens = len(tokenizer(knowledge_base)['input_ids'])
print(f'Combined knowledge base was successfully stored in "{combined_path}" and is equivalent to about {num_tokens} tokens')

Combined knowledge base was successfully stored in "../Chatbot_Knowledge/combined_knowledge.txt" and is equivalent to about 44226 tokens
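
As a quick consistency check, the per-file estimates printed earlier can be added up and compared with the combined figure. The snippet below is a small sketch that hard-codes the counts reported above (the name `reported_counts` is just for illustration). A small gap is to be expected, presumably because each separately tokenized file receives its own end-of-sequence token and text at file boundaries can tokenize differently once concatenated.

```python
# Token counts copied from the "Files in Knowledge Base" output above
reported_counts = {
    'data_access_info.txt': 645,
    'data_collection.txt': 1316,
    'data_processing.txt': 9342,
    'Milestone2.txt': 8839,
    'README.txt': 812,
    'requirements.txt': 141,
    'testing.txt': 6769,
    'training.txt': 16367,
}

# Combined estimate copied from the output of the previous cell
combined_estimate = 44226

total = sum(reported_counts.values())
difference = total - combined_estimate

print(f'Sum of per-file estimates: {total}')
print(f'Difference from combined estimate: {difference}')
```

The two figures agree to within a handful of tokens, which suggests the combined file contains each source exactly once.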


## Using the OpenAI API

To use the OpenAI API to answer questions about the knowledge base, the combined knowledge base first needs to be loaded back in. The API can then be called with that content and the user's question.

A simple Flask app serves as the user interface. To access it, run *chatbot_app.py* in *Scripts*. It starts a locally hosted website where you can interact with the chatbot. You can change the script to use a different port if desired. Behind the scenes, the app calls a Google Cloud function that contacts the OpenAI API and returns responses to the user's prompts. Google Cloud was used to protect the OpenAI API key, so the code that is hosted there is shown here (API key not shown):

In [None]:
import functions_framework
from openai import OpenAI

@functions_framework.http
def ask_question(request):

    """ User asks a question and chatbot's response is returned """

    if request.method == 'POST':
        try:
            # Read request (silent=True yields None instead of raising on a non-JSON body)
            data = request.get_json(silent=True) or {}
            user_prompt = data.get('message', 'Sorry ... no message recognized.')
            
            # API key: DO NOT SHARE!!!
            API_KEY = ''

            # Load knowledge base
            combined_path = 'combined_knowledge.txt'
            knowledge_base = ''

            with open(combined_path, 'r', encoding='utf-8') as file:
                knowledge_base = file.read()

            # Contact API with key
            client = OpenAI(api_key=API_KEY)

            # Set up chatbot's expected behavior and feed it the knowledge base of the project
            behavior_msg = "Michael Calderin did a project using machine learning to classify transits from NASA's Kepler mission as true planets or not. The work is hosted on GitHub at https://github.com/michaelcalderin/cap5771sp25-project. Your job is to answer questions related to the project. The content in the code files, written report, etc. will be the following so use this material to answer questions and do not stray from this task:"
            system_content = f'{behavior_msg}\n{knowledge_base}'

            # Fetch the response
            completion = client.chat.completions.create(
                model='gpt-4.1-nano',
                messages=[
                    {'role': 'system', 'content': system_content},
                    {'role': 'user', 'content': user_prompt}
                ],
                max_completion_tokens=300
            )

            response = completion.choices[0].message.content
            return response
        
        except Exception as e:
            return f'Error: {str(e)}'

    return ''
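
For reference, a client such as the Flask app could call this function roughly as follows. This is a minimal sketch using only the standard library, not the contents of *chatbot_app.py*: the URL is a placeholder, and the helper names `build_request` and `ask_chatbot` are hypothetical. The only details taken from the function above are that it expects a POST request whose JSON body carries the prompt under the key `message`, and that it replies with plain text.

```python
import json
import urllib.request

# Placeholder: replace with the URL of the deployed Cloud Function
FUNCTION_URL = 'https://example.com/ask_question'

def build_request(message, url=FUNCTION_URL):
    """Build a POST request carrying the prompt as JSON under the 'message' key."""
    body = json.dumps({'message': message}).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )

def ask_chatbot(message, url=FUNCTION_URL, timeout=60):
    """Send the prompt to the function and return its plain-text reply."""
    with urllib.request.urlopen(build_request(message, url), timeout=timeout) as response:
        return response.read().decode('utf-8')

# Building the request does not touch the network
example_request = build_request('Which model performed best?')
print(example_request.get_method(), example_request.full_url)
```

Calling `ask_chatbot('...')` would perform the actual network request once `FUNCTION_URL` points at a real deployment. Because the function returns error messages as ordinary text, a client may want to check for replies that start with `Error:`.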