# Easy-Train Chatbot

## Installing Dependencies
When you first run this project, you need to install the packages it depends on. To simplify this, the install command below uses pip's `-r` flag to read `requirements.txt`. This file lists each package and its version number on its own line, so we can install everything from one place and update the file as we see fit.

In [None]:
# Run this block to install the necessary dependencies
%pip install -r requirements.txt

## Importing Libraries
Next, we need to import the packages into the program. Installing them means that we have downloaded them to our computer; importing them means our code can access the contents of those packages when we call on them later.

In [2]:
# Use this block to import the packages that will be used in this program.
    # Note: import PACKAGE_NAME makes the whole package available under its name,
    #       from PACKAGE_NAME import THING makes only that specific thing available by name.
    #  Python still loads the package in both cases, so the difference is mainly in which names your code can use, not in speed.
import numpy as np
import pandas as pd
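
A quick way to check what the two import forms do is to try both on a small standard-library module. The sketch below uses `json` purely as an illustration (it is not one of this project's dependencies) and inspects `sys.modules`, where Python caches every module it has loaded.

```python
import sys

# Form 1: bind the whole module to a name.
import json

# Form 2: bind a single function from the module to a name.
from json import dumps

# Either form loads the full module; it is cached in sys.modules.
module_loaded = "json" in sys.modules

# Both names refer to the same function object.
same_function = dumps is json.dumps

print(module_loaded, same_function)
```

The choice between the two forms is therefore mostly about readability and which names end up in the notebook's namespace.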

In [None]:
dataset = pd.read_csv('datasets/demoDataset.csv')
dataset.head() # Previews the top 5 rows

# Data Parsing
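This is where we preprocess the data so the model can take it in. The model works on numbers rather than text, so we need to convert the text to numbers. To do this, we can use techniques like number mapping (giving each distinct response an integer) or vectorizing (turning each question into a row of word counts).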

In [39]:
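# Caching a copy of the dataset to be modified
# (.copy() is needed; plain assignment would only create a second name for the same DataFrame)
parsedDb = dataset.copy()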

In [43]:
# Getting a list of all unique responses; each response's position in this list acts as its numeric label
# Note: the column name ' Response' is written with a leading space
AnswerKey = dataset[' Response'].unique()

print(AnswerKey)
print(f"The answer key has {len(AnswerKey)} unique keys.")

18

In [None]:
# Using the scikit-learn CountVectorizer to turn each question into a vector of word counts
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
tokenized_questions = vectorizer.fit_transform(dataset['Question'])
tokenized_questions = tokenized_questions.toarray()

# Appending the vectorized questions to the dataset
tokenized_questions_df = pd.DataFrame(tokenized_questions, columns=vectorizer.get_feature_names_out())

parsedDb['question_tokenized'] = tokenized_questions_df.values.tolist()

parsedDb

In [42]:
# Adding a column with the index of each answer in the AnswerKey
# Build a lookup from each response to its position in AnswerKey
answerIndex = {answer: i for i, answer in enumerate(AnswerKey)}

# Map every response in the dataset to its index
parsedDb['AnswerKey'] = dataset[' Response'].map(answerIndex)

parsedDb

Unnamed: 0,Question,Response,question_tokenized,AnswerKey
0,Where is the ILC?,On Union and Division,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0
1,Where is Mithcel?,University and Union,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1
2,Where is Clark Hall?,Fifth Street,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...",2
3,Where is Biosci?,Barrie Street,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3
4,Where is Victoria Hall?,Bader Lane,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",4
5,Where is Stauffer?,University and Union,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1
6,Where is Goodes Hall?,Union and Alfred,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",5
7,Where is Douglas Library?,University and Union,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",1
8,Where is Stirling Hall?,Bader Lane,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",4
9,Where is Dunning?,University and Union,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",1


In [None]:
# Save the newly parsed dataset to a CSV file to load in next time
dataset = parsedDb
# Use the default comma separator; index=False avoids writing an extra unnamed index column
dataset.to_csv('Parsed_Demo_Dataset.csv', index=False)

In [None]:
# Load in existing dataset file
dataset = pd.read_csv('datasets/FILENAME.csv')
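
One thing to watch when reloading: CSV stores every cell as text, so a column of Python lists such as `question_tokenized` comes back as strings like `"[0, 1, 0]"`. The sketch below shows one possible way to convert them back, using `ast.literal_eval` on a hypothetical two-row frame that round-trips through an in-memory buffer instead of a real file.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row frame standing in for the parsed dataset.
frame = pd.DataFrame({
    "question_tokenized": [[0, 1, 0], [1, 0, 0]],
    "AnswerKey": [0, 1],
})

# Round-trip through CSV text in memory instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After loading, each cell is a string, not a list.
was_string = isinstance(reloaded["question_tokenized"][0], str)

# ast.literal_eval parses the string back into a Python list.
reloaded["question_tokenized"] = reloaded["question_tokenized"].apply(ast.literal_eval)

print(was_string, reloaded["question_tokenized"].tolist())
```

Without a step like this, passing the reloaded column to a model's `fit` method would fail, because the model would receive strings rather than numeric vectors.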

# Building the Model

This is where we will build and implement different models that relate our inputs to our outputs.

`Feel free to make multiple code blocks to test out how different models perform on the same dataset.`

In [53]:
# Importing the model from scikit-learn
from sklearn.naive_bayes import GaussianNB

# Splitting the data into the inputs (x) and outputs (y).
x = dataset['question_tokenized'].to_list()
y = dataset['AnswerKey'].to_list()

# Creating an instance of the model
model = GaussianNB()
# Fitting the model to the data
model.fit(x,y)

# Running a model prediction on the third question in the dataset
print(model.predict([x[2]]))

[2]
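
To compare several models on the same data, one option is to fit each candidate in a loop. The sketch below is only an illustration: `x_demo` and `y_demo` are made-up stand-ins for `x` and `y`, and `MultinomialNB` is an alternative picked here because scikit-learn describes it as suited to word counts, not a model this notebook has chosen.

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Made-up word-count rows and labels, standing in for x and y.
x_demo = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y_demo = [0, 1, 2, 0]

# Candidate models to compare; the names are only dictionary keys.
candidates = [("GaussianNB", GaussianNB()), ("MultinomialNB", MultinomialNB())]

# Fit each candidate on the same data and collect its predictions.
predictions = {}
for name, candidate in candidates:
    candidate.fit(x_demo, y_demo)
    predictions[name] = candidate.predict(x_demo).tolist()

print(predictions)
```

Predicting on the rows used for training only shows that each model runs; it says nothing about how well the model handles questions it has not seen.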


## Evaluating the Performance of the Model
Once the previous block has produced a model, we need to evaluate it against some metrics as quality control. We will define these metrics as a group, and they will help us gauge how well our program is working.

In [None]:
# Create methods of testing the model against some metrics to see how well it is performing. 
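
One common starting point is hold-out accuracy: train on part of the data and measure the share of correct predictions on the rest. The sketch below assumes made-up data (`x_demo`, `y_demo`) and a 25% test split. The metric is only a suggestion, since the group has not yet decided which metrics to use.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Made-up data: three classes with four identical word-count rows each.
x_demo = [[1, 0, 0]] * 4 + [[0, 1, 0]] * 4 + [[0, 0, 1]] * 4
y_demo = [0] * 4 + [1] * 4 + [2] * 4

# Hold out a quarter of the rows; stratify keeps every class in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Fit on the training rows only, then score on the held-out rows.
demo_model = GaussianNB().fit(x_train, y_train)
accuracy = accuracy_score(y_test, demo_model.predict(x_test))

print(f"Hold-out accuracy: {accuracy:.2f}")
```

With the project's data, a response that appears only once cannot be stratified, so the split would need adjusting (for example by dropping `stratify` or collecting more examples per response).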


## Testing Interface
The testing interface is for debugging and interaction. This is where we, as the programmers, can give the model prompts and see its responses. For now it will run within the notebook, but if we have time we can develop a separate GUI that lets us chat with the bot in a more user-friendly manner.

In [None]:
# Take in user input for the prompt, pass it through the model, and display the output
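
A minimal sketch of such an interface follows. The helper names `reply` and `chat` are hypothetical, and the `demo_` objects are made-up stand-ins for the notebook's fitted `vectorizer`, `model`, and `AnswerKey`, so the block can run on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Made-up stand-ins for the notebook's questions, AnswerKey, vectorizer and model.
demo_questions = ["where is the library", "where is the gym"]
demo_answer_key = ["Main Street", "Park Road"]
demo_vectorizer = CountVectorizer()
demo_x = demo_vectorizer.fit_transform(demo_questions).toarray()
demo_model = GaussianNB().fit(demo_x, [0, 1])


def reply(prompt, vectorizer, model, answer_key):
    """Vectorize one prompt, predict its answer index, and look up the text."""
    # transform (not fit_transform) reuses the vocabulary learned in training.
    features = vectorizer.transform([prompt]).toarray()
    index = int(model.predict(features)[0])
    return answer_key[index]


def chat(vectorizer, model, answer_key):
    """Hypothetical notebook loop: type a question, or 'quit' to stop."""
    while True:
        prompt = input("You: ")
        if prompt.strip().lower() == "quit":
            break
        print("Bot:", reply(prompt, vectorizer, model, answer_key))


demo_reply = reply("Where is the gym?", demo_vectorizer, demo_model, demo_answer_key)
print(demo_reply)
```

In the notebook, calling `chat(vectorizer, model, AnswerKey)` would start the loop with the real objects. Words in the prompt that the vectorizer never saw during fitting are ignored.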
