# Introduction to Chatbots

## **A. What are Chatbots?**
**Chatbots** are conversational programs that automate interactions. They are software applications, often built with artificial intelligence (AI), designed to simulate conversation with human users, typically through text or voice. Chatbots are used to automate customer support, provide information, and even entertain. They interact with users by responding to their questions, giving helpful information, or carrying out tasks based on the input.

- **Examples**: 
  - A chatbot on a bank's website that helps with account inquiries.
  - A chatbot on an e-commerce site that tracks orders or provides product recommendations.
  - Virtual assistants like **Siri** and **Alexa** are advanced chatbots.

## **B. Difference Between Chatbots and Bots**

**Chatbots** are a subset of bots. They are specifically designed for conversation, meaning they are programmed to interact using natural language processing (NLP) to simulate human conversation. 

**Bots**, on the other hand, are more general-purpose programs designed to automate tasks. They don’t necessarily interact with users in natural language, but they perform specific functions like web scraping, sending reminders, or managing social media posts.

- **Chatbot**: Focuses on conversation (e.g., answering customer queries).
- **Bot**: Focuses on automating repetitive tasks (e.g., posting scheduled tweets).

## **C. Types of Chatbots**

Chatbots are generally classified into three categories based on how they respond to user input:

### 1. **Rule-Based Chatbots**:
- **How they work**: Think of rule-based chatbots like a robot that follows a set of instructions or rules. If you say something it recognizes, it will respond with a pre-written answer. For example, if you ask "What’s your name?", the bot might always reply, "I’m Botty!". It works by looking for specific keywords or patterns in what you say and then picking the correct response from its list.
- **Limitation**: The problem is, if you ask something it wasn’t programmed for, like "What’s your favorite color?", it might get confused or give a response that doesn’t make sense. It’s like only being able to talk to someone about a few topics—if you go off-script, the conversation won’t flow.
- **Example**: Imagine a chatbot for a pizza place. If you ask, "What are your hours?", it will answer something like, "We’re open from 10 AM to 10 PM." But if you ask, "What’s your favorite pizza?", it might just say, "I don’t understand."

### 2. **Retrieval-Based Chatbots**:
- **How they work**: These chatbots are a bit smarter than rule-based ones. Instead of giving a fixed reply, they search through a bunch of pre-written responses and try to find the best one based on what you said. It’s like going through a library to find the book that most closely answers your question. 
- **Techniques Used**:
  - **Jaccard Similarity**: Imagine you ask a question like, "What’s the weather today?" The bot checks which of its stored answers have the most words in common with your question. The more words they share, the more likely it is to pick that answer.
  - **Cosine Similarity**: This is like comparing two texts using math. It turns each text into a list of numbers (a vector) and measures the angle between the two vectors. The smaller the angle, the more similar the texts, and the more likely the stored answer is a good fit.
  - **Machine learning models like `Naive Bayes`**: This is where the bot starts to guess what you’re talking about by learning from past examples. If it’s been trained to answer questions about sports, it’ll know that when you ask about “football”, it should probably give a sports-related response.
- **Example**: Think of a customer service chatbot. If you type "I need help with my order", the bot searches for similar phrases it knows, like "I have a problem with my order", and then provides the best response, like "Please provide your order number so I can help."

#### Let's break down Retrieval-Based methods a bit more

##### **Jaccard Similarity**:

- **What it is**: Jaccard similarity compares two sets of words and checks how similar they are by looking at how many words they share. It’s like checking how much two circles overlap when placed on top of each other. The more overlap, the more similar they are.
  
- **Formula**:  
  $
  \text{Jaccard Similarity} = \frac{|A \cap B|}{|A \cup B|}
  $
  Where:
  - $|A \cap B|$ is the number of words that both sets (A and B) have in common.
  - $|A \cup B|$ is the total number of unique words in both sets combined.

- **Example**:
  If you have two sentences:
  - Sentence 1: "I love cats"
  - Sentence 2: "I love dogs"

  The Jaccard similarity would be calculated by comparing the words:
  - Common words (intersection): "I", "love" (2 words)
  - Total unique words (union): "I", "love", "cats", "dogs" (4 words)

  Jaccard Similarity = $\frac{2}{4} = 0.5$

In [1]:
# Code example

# Function to compute the Jaccard Similarity between two collections of words
def jaccard_similarity(set1, set2):
    # Convert the inputs to sets so duplicate words are counted once
    a, b = set(set1), set(set2)

    # Number of elements the two sets share (intersection)
    intersection = len(a & b)

    # Number of distinct elements across both sets (union)
    union = len(a | b)

    # Return intersection over union; two empty inputs are treated as having no similarity
    return intersection / union if union else 0.0

# Example: Calculate Jaccard Similarity between two sentences
# The sentences are split into words (tokens), which are compared
sentence1 = "I love cats".split()  # Split sentence1 into ['I', 'love', 'cats']
sentence2 = "I love dogs".split()  # Split sentence2 into ['I', 'love', 'dogs']

# Print the Jaccard Similarity between the two sentences (based on word overlap)
print(jaccard_similarity(sentence1, sentence2))  

0.5


##### Cosine Similarity:

- **What it is**: Cosine similarity compares two sentences (or documents) by turning them into vectors (a way to represent words as numbers) and measuring the angle between them. If the angle is small, the sentences are more similar. Think of it like checking how close two arrows point in the same direction.
  
- **Formula**:
  $
  \text{Cosine Similarity} = \frac{A \cdot B}{||A|| \times ||B||}
  $
  Where:
  - $A \cdot B$ is the dot product of the two vectors (basically multiplying each pair of numbers from both vectors and adding them up).
  - $||A||$ and $||B||$ are the magnitudes (or lengths) of the vectors.

- **Example**:
  If we use a simple example with word counts, where each word represents a dimension:
  - Sentence 1: "I love cats" → [1, 1, 1, 0] (This is a vector with: 1 "I", 1 "love", 1 "cats", 0 "dogs") 
  - Sentence 2: "I love dogs" → [1, 1, 0, 1] (1 "I", 1 "love", 0 "cats", 1 "dogs")

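  Cosine similarity measures how closely these two lists of numbers (vectors) align. Here the dot product is $1 + 1 + 0 + 0 = 2$ and each vector has length $\sqrt{3}$, so the similarity is $\frac{2}{\sqrt{3} \times \sqrt{3}} = \frac{2}{3} \approx 0.67$. (A library tokenizer may drop one-letter words such as "I", which changes the vectors and therefore the score.)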
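
The two example vectors above can be checked with a few lines of plain Python. This is a minimal sketch rather than a library implementation; the helper name `cosine` is made up for illustration.

```python
import math

def cosine(a, b):
    # Dot product: multiply matching positions and add the results
    dot = sum(x * y for x, y in zip(a, b))
    # Magnitude (length) of each vector
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat an all-zero vector as having no similarity to anything
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

cats = [1, 1, 1, 0]  # "I love cats" over the vocabulary: I, love, cats, dogs
dogs = [1, 1, 0, 1]  # "I love dogs" over the same vocabulary
score = cosine(cats, dogs)
print(round(score, 3))  # 0.667
```

The score depends entirely on which words end up in the vocabulary, so a different tokenizer can give a different number for the same two sentences.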

In [5]:
# Code Example

# Import CountVectorizer to convert text data into vectors (token counts)
from sklearn.feature_extraction.text import CountVectorizer

# Import cosine_similarity to compute similarity between vectors
from sklearn.metrics.pairwise import cosine_similarity

# Example sentences to compare
sentences = ["I love cats", "I love dogs"]

# Convert sentences to vectors based on word counts (Bag-of-Words model)
vectorizer = CountVectorizer()  # Initialize CountVectorizer
vector_matrix = vectorizer.fit_transform(sentences).toarray()  
# fit_transform: Tokenizes the sentences and counts the frequency of each word, 
# resulting in a vector representation of each sentence.
# toarray(): Converts the resulting sparse matrix into a dense numpy array.

# Compute Cosine Similarity between the vectors of the sentences
cosine_sim = cosine_similarity(vector_matrix)
# cosine_similarity calculates the cosine of the angle between the vectors,
# giving a similarity measure between 0 (no similarity) and 1 (identical).

# Print the Cosine Similarity matrix
print(cosine_sim) # Output: [[1. 0.5], [0.5 1.]]

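# In this case, the cosine similarity between "I love cats" and "I love dogs" is 0.5, meaning they're somewhat similar.
# Note: CountVectorizer lowercases the text and, with its default token pattern, drops one-letter tokens such as "I",
# so the vectors here are built over ['cats', 'dogs', 'love'] and the two sentences share only "love".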

[[1.  0.5]
 [0.5 1. ]]


#### Summary of Differences:
- **Jaccard Similarity** is simpler: it only checks which words two texts share, so repeated words and word counts do not affect the score.
- **Cosine Similarity** works on word-count vectors, so it takes into account how often each word appears. Because it compares the direction of the vectors rather than their length, it is less affected by one text being longer than the other.
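
To make the difference concrete, here is a small sketch that scores the same pair of texts both ways. The helper names `jaccard` and `cosine_counts` are illustrative, and the example sentences are invented to show what happens when a word repeats.

```python
import math
from collections import Counter

def jaccard(a_words, b_words):
    # Sets discard repeats, so only the presence of a word matters
    a, b = set(a_words), set(b_words)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine_counts(a_words, b_words):
    # Counters keep how many times each word appears
    a, b = Counter(a_words), Counter(b_words)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

s1 = "cats cats cats dogs".split()
s2 = "cats dogs".split()
j = jaccard(s1, s2)
c = cosine_counts(s1, s2)
print(j)            # 1.0
print(round(c, 3))  # 0.894
```

Jaccard sees the two texts as identical because they use the same two words, while cosine similarity notices that "cats" is repeated in one of them.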

##### Naive Bayes for Classification:

Naive Bayes is a type of algorithm that helps computers make decisions or **classify** things. When you use it in a chatbot, it can help figure out what the user is asking about (the user's intent), then pick the right answer from a set of possible replies.

##### **How It Works**:

1. **Training the Bot**:
   You first need to give the chatbot some **training data**. This is like teaching it by showing examples of questions and their categories, or what they're asking for. For example:
   - "What time is it?" would be categorized as a **time question**.
   - "Tell me a joke" would be categorized as a **joke request**.

   Each example teaches the chatbot what types of questions belong to each category.

2. **Classifying the User's Question**:
   Once the chatbot is trained, it can take a new question from a user and use the **Naive Bayes** algorithm to guess what category that question falls into. For example, if someone asks, "What's the time?", the chatbot might classify it as a **time question**.

3. **Finding the Answer**:
   After it figures out what the question is about, the chatbot then **retrieves** (picks) the right answer from a list of possible responses. For example:
   - If the chatbot classifies the question as a **time question**, it will respond with something like, "It's 3 PM."

##### **In Simple Terms**:
Naive Bayes helps the chatbot **understand** what type of question you're asking. Even though it uses the Naive Bayes algorithm to figure out the question type, the chatbot still pulls the answer from a **pre-written list** of answers—it doesn’t make up answers on its own.
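
The three steps above (train, classify, retrieve) can be sketched in plain Python. This is a toy version written from scratch for illustration, not the API of any library. The training sentences, the intent labels, and the joke response are all made up, and add-one (Laplace) smoothing is an assumed design choice so that unseen words do not zero out a score.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training examples: (question, intent label)
training_data = [
    ("what time is it", "time"),
    ("tell me the time", "time"),
    ("do you have the time", "time"),
    ("tell me a joke", "joke"),
    ("make me laugh", "joke"),
    ("say something funny", "joke"),
]

# Pre-written answers, one per intent (the retrieval part)
responses = {
    "time": "It's 3 PM.",
    "joke": "Why did the chatbot cross the road? To get to the other site.",
}

# Step 1 (training): count words per intent and examples per intent
word_counts = defaultdict(Counter)
intent_counts = Counter()
for question, intent in training_data:
    intent_counts[intent] += 1
    word_counts[intent].update(question.lower().split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def classify(question):
    # Step 2 (classification): words never seen in training are skipped
    words = [w for w in question.lower().split() if w in vocabulary]
    best_intent, best_score = None, -math.inf
    for intent in intent_counts:
        # Start from the log prior, log P(intent)
        score = math.log(intent_counts[intent] / len(training_data))
        total = sum(word_counts[intent].values())
        for w in words:
            # Add log P(word | intent) with add-one smoothing
            score += math.log((word_counts[intent][w] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent

# Step 3 (retrieval): look up the pre-written answer for the predicted intent
intent = classify("what is the time now")
print(intent)             # time
print(responses[intent])  # It's 3 PM.
```

The classifier only decides the category. The reply itself still comes from the fixed `responses` dictionary.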

### 3. **Generative Chatbots**:
- **How they work**: These are the most advanced chatbots. Instead of pulling from a list of pre-written answers, they create their own responses based on what you said. It’s like having a conversation with someone who thinks on the spot and makes up their answers. 
    - They use advanced machine learning models, typically deep learning models like RNNs, LSTMs, or transformers (like GPT), to generate new sentences based on the input.

- **Limitation**: Generative chatbots need a lot of training data and computing power to get good at answering questions. Because they generate text rather than look it up, they can produce answers that sound fluent but are wrong or don't make sense, especially on topics they weren't trained well on.

- **Example**: ChatGPT is a generative chatbot. When you ask it something like, "What’s your favorite book?", it doesn’t pick from a list. Instead, it thinks about the question and creates an answer based on patterns it learned from reading lots of text. So, you might get something like, "I don't have favorites, but I’ve read a lot about Harry Potter!"

### An example illustrating rule-based, retrieval-based, and generative chatbots using a simple customer service scenario related to order tracking


##### Scenario 
The user asks: **"Where is my order?"**

**1. Rule-Based Chatbot Example:**
In a rule-based chatbot, predefined keywords like "order" and "track" are used to trigger specific responses.

In [6]:
# Define a function for a simple rule-based chatbot
def rule_based_chatbot(user_input):
    # Check if the user input contains the words "track" or "order"
    if "track" in user_input.lower() or "order" in user_input.lower():
        # Respond with a prompt to provide an order number
        return "Please provide your order number to track your order."
    
    # Check if the user input contains the word "refund"
    elif "refund" in user_input.lower():
        # Respond with information about the refund policy
        return "For a refund, please visit our refund policy page."
    
    # If the input doesn't match any of the predefined rules
    else:
        # Respond with a message indicating the chatbot doesn't understand the query
        return "I'm sorry, I didn't understand that. Can you try again?"

# Example user input
user_query = "Where is my order?"

# Call the rule-based chatbot function with the user's query
print(rule_based_chatbot(user_query))

# Output: "Please provide your order number to track your order."

Please provide your order number to track your order.


In [9]:
rule_based_chatbot("I need a refund")

'For a refund, please visit our refund policy page.'

**How it works**: It looks for the keywords **"track"** or **"order"** and returns a fixed response.

#### **2. Retrieval-Based Chatbot Example (Jaccard Similarity):**
In a retrieval-based chatbot, the bot looks for similar sentences in a predefined set of responses.

In [1]:
# This cell needs the NLTK resources (tokenizer data and stopwords) from Section E, Step 1.
# If they are not installed yet, run that step first.

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import string

# A predefined set of possible responses for the chatbot
responses = [
    "Please provide your order number to track your order.",
    "For a refund, please visit our refund policy page.",
    "Our customer service is available 24/7."
]

# Load a set of English stopwords (common words that may be removed in text preprocessing)
stop_words = set(stopwords.words('english'))

# Preprocess function to clean and prepare text data
def preprocess(text):
    # Tokenize the input text into individual words and convert them to lowercase
    words = word_tokenize(text.lower())
    
    # Remove stopwords (e.g., 'the', 'is') and punctuation
    words = [word for word in words if word not in stop_words and word not in string.punctuation]
    
    # Return the cleaned list of words
    return words

# Function to calculate Jaccard similarity between two sentences
def jaccard_similarity(query, sentence):
    # Preprocess the query and the sentence
    query_set = set(preprocess(query))
    sentence_set = set(preprocess(sentence))
    
    # Calculate the intersection and union of the sets and return the Jaccard similarity score
    return len(query_set.intersection(sentence_set)) / len(query_set.union(sentence_set))

# Function to find the most relevant response based on user input
def retrieval_based_chatbot(user_input):
    best_response = ""  # Placeholder for the best matching response
    highest_similarity = 0  # Keep track of the highest Jaccard similarity score
    
    # Loop through each predefined response and calculate the Jaccard similarity with the user input
    for response in responses:
        similarity = jaccard_similarity(user_input, response)
        
        # Update the best response if the current response has a higher similarity score
        if similarity > highest_similarity:
            highest_similarity = similarity
            best_response = response
    
    # Return the best response if found, otherwise return a fallback message
    return best_response if best_response else "I'm sorry, I couldn't find a relevant response."

# Example user input
user_query = "Where is my order?"

# Call the chatbot function with the user's query and print the response
print(retrieval_based_chatbot(user_query))

# Output: "Please provide your order number to track your order."

Please provide your order number to track your order.


**How it works**: It compares the user's query with predefined responses and returns the most similar one using **Jaccard similarity**.

#### **3. Generative Chatbot**
In a generative chatbot, the response is generated dynamically using a machine learning model (like GPT). In a real scenario, this would involve training a deep learning model or using a pre-trained one.

Example response to a scenario where the user asks: **"Where is my order?"**:

```
"To track your order, please provide your order number or check the tracking link sent to your email."
```
- **How it works**: The generative chatbot creates a new response based on the user input, generating an original sentence that wasn't pre-programmed or retrieved from a predefined list.

## **D. Common Terms in Natural Language Processing (NLP)**

### 1. **Natural Language Processing (NLP)**
NLP is a way for computers to understand, interpret, and respond to human language. Think about how you talk to your friends through texting, and imagine if a computer could understand and respond to those texts. With NLP, computers can read, listen, and even reply like humans! It’s used in many things, like virtual assistants (Siri, Alexa), chatbots, and even spell checkers.

---

### 2. **Tokenization**
Tokenization is like breaking down a sentence into smaller pieces that a computer can understand. Think of it as cutting a big cake (the sentence) into small slices (the words). Each of these slices is called a token, and it can be a word or even a punctuation mark. For example, in the sentence "I love pizza!", the tokens would be "I", "love", "pizza", and "!".
- **Why is it important?** Tokenization helps the computer to focus on individual words or parts of a sentence to figure out what it means.
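
As a rough sketch of the idea, a regular expression can split the example sentence into word and punctuation tokens. The function name `simple_tokenize` is made up here, and a real tokenizer such as NLTK's handles many more cases (contractions, abbreviations, and so on).

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters or digits; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I love pizza!")
print(tokens)  # ['I', 'love', 'pizza', '!']
```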

---

### 3. **Lemmatization**
Lemmatization is when the computer changes words to their simplest form, called the **lemma**. For example, the word "running" changes to "run" or "better" changes to "good". This helps the computer group similar words together and understand the overall meaning of a sentence.
- **Why is it important?** Lemmatization helps computers understand the meaning of words even when they’re written in different forms (like "ran" and "running").

---

### 4. **Stemming**
Stemming is when the computer cuts off the ends of words to get the base form, or **stem**. For example, "playing", "played", and "plays" all become "play". This is different from lemmatization because it’s more about quickly chopping off word endings, even if it doesn’t always create a real word.
- **Why is it important?** Stemming helps computers group words with similar meanings together by chopping off extra endings.

##### Stemming vs. Lemmatization
- Both **stemming** and **lemmatization** help to find the basic form of a word (root word), but they do it differently.
- **Stemming** is like cutting off the end of a word to get the root. For example, "running" becomes "run" by removing the "-ing", but sometimes it cuts too much, making words that don’t look right, like "studies" becoming "studi."
- **Lemmatization** is smarter. It looks words up in a vocabulary and uses the word's part of speech (noun, verb, adjective) to find the right base form. So, if you have "better," lemmatization knows it should turn into "good" because that’s the correct form.
- **Lemmatization** is more accurate, but it takes longer because it has to think more about the words. Still, it’s better at keeping the meaning of the words correct in different sentences.
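
The contrast can be illustrated with two toy functions. Both are deliberately simplistic and assumed for illustration only: `toy_stem` is not the Porter stemmer, and `TOY_LEMMAS` is a tiny hand-written lookup standing in for a real dictionary such as WordNet.

```python
def toy_stem(word):
    # Strip the first matching suffix, with no check that the result is a real word
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical, hand-written lookup table playing the role of a lemma dictionary
TOY_LEMMAS = {"studies": "study", "better": "good", "ran": "run"}

def toy_lemmatize(word):
    # Return the dictionary form if known, otherwise leave the word unchanged
    return TOY_LEMMAS.get(word, word)

stems = [toy_stem(w) for w in ["playing", "played", "plays"]]
print(stems)                                          # ['play', 'play', 'play']
print(toy_stem("studies"), toy_lemmatize("studies"))  # studi study
print(toy_stem("better"), toy_lemmatize("better"))    # better good
```

The stemmer is fast but produces "studi", which is not a word, and it cannot connect "better" to "good". The lookup gets both right, but only because someone supplied the dictionary.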


### 5. **Stopwords**
Stopwords are very common words, like "the", "is", "and", "in", that computers often ignore when analyzing a sentence. These words don’t add much meaning to the sentence and are usually just "fillers."
- **Why is it important?** By skipping these stopwords, the computer can focus on the important words in a sentence to understand what you’re really saying.
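
A minimal sketch of stopword removal, assuming a tiny hand-picked stopword set (real lists, such as the one shipped with NLTK, are much longer):

```python
# A tiny hand-picked stopword list, for illustration only
STOPWORDS = {"the", "is", "and", "in", "a", "on"}

def remove_stopwords(words):
    # Compare in lowercase so "The" and "the" are treated the same
    return [w for w in words if w.lower() not in STOPWORDS]

kept = remove_stopwords("The cat is in the garden".split())
print(kept)  # ['cat', 'garden']
```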

---

### 6. **Corpora**
A **corpus** (plural: corpora) is a large collection of written or spoken texts that computers use to learn and analyze language. It’s like giving the computer lots of books to read and study from. This is where NLP models get their training—by reading through corpora to understand how humans write or speak.
- **Why is it important?** Corpora help computers get better at understanding language by giving them real examples of how words and sentences are used.

---

### 7. **Bag of Words (BoW)**
Bag of Words is a simple way for computers to represent text. It works by counting how many times each word appears in a text, without caring about the order of the words. Imagine you have a bag and throw all the words from a sentence into it; the computer only knows how many of each word you have, not the sequence.
- **Why is it important?** BoW helps computers recognize which words are important by counting how often they show up, even though it doesn’t consider the sentence structure.
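
Python's built-in `collections.Counter` is enough to sketch the idea. The example sentences are made up, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

sentence = "the cat chased the other cat"
bag = Counter(sentence.split())
print(bag["the"], bag["cat"], bag["chased"])  # 2 2 1

# Word order is lost: a reordered sentence produces exactly the same bag
same = Counter("the other cat chased the cat".split())
print(bag == same)  # True
```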

---

### 8. **TF-IDF (Term Frequency-Inverse Document Frequency)**
TF-IDF is a more advanced version of Bag of Words. It doesn’t just count how often a word appears in a text (like BoW); it also checks how rare that word is across many documents. For example, a common word like "the" appears in almost every document, so it gets a very low score, while a word like "pizza" that appears in only a few recipes of a collection gets a higher one.
- **Why is it important?** TF-IDF helps computers figure out which words are important in a group of texts by focusing on less common, more meaningful words.
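
Here is a small sketch of one common textbook form of the calculation: term frequency (count divided by document length) times the logarithm of (number of documents divided by the number of documents containing the word). Libraries typically use smoothed variants, so their numbers will differ. The three "recipe" documents and the function name `tf_idf` are invented for illustration.

```python
import math

# Three made-up documents, already split into words
docs = [
    "the pizza dough needs the yeast".split(),
    "the soup needs salt".split(),
    "the salad needs the dressing".split(),
]

def tf_idf(word, doc, docs):
    # Term frequency: share of this document's words that are `word`
    tf = doc.count(word) / len(doc)
    # Document frequency: how many documents contain `word`
    df = sum(1 for d in docs if word in d)
    # Inverse document frequency: rarer words get a larger value
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

the_score = tf_idf("the", docs[0], docs)
pizza_score = tf_idf("pizza", docs[0], docs)
print(the_score)              # 0.0
print(round(pizza_score, 3))  # 0.183
```

"the" appears in every document, so its score drops to zero even though it is the most frequent word in the first document, while "pizza" stands out because only one document mentions it.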

---

### 9. **Bot Frameworks (e.g., Rasa, Microsoft Bot Framework)**
Bot frameworks are tools that help people build chatbots. It’s like using a set of Lego blocks to quickly build your own chatbot without starting from scratch. These frameworks provide all the basic tools to create, train, and deploy chatbots that can understand and respond to people’s messages.
- **Why is it important?** Bot frameworks make it easier for people to create chatbots that can talk with users, answer questions, and perform tasks.

---

### 10. **Transformers**
Transformers are a special kind of model in NLP that help computers understand language better. They can look at a sentence as a whole instead of just one word at a time. Famous transformer models include **BERT** (used for understanding text) and **GPT** (used for generating text). Transformers have made it possible for computers to have deeper conversations and understand complex language.
- **Why is it important?** Transformers help computers process entire sentences, making them better at answering questions, generating stories, and even holding conversations.

## **E. Possible Workflow for Building a Simple Chatbot using NLTK (Natural Language Toolkit)**

Here’s a simplified workflow for building a basic retrieval-based chatbot using NLTK (a popular Python library for text processing):


#### **Step 1: Install NLTK and Download Resources**

In [None]:
# Install NLTK and download necessary resources such as tokenizers and stopwords.
#!pip install nltk  # Uncomment to install

In [1]:
import nltk

In [None]:
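# First time only
nltk.download('punkt')  # Sentence and word tokenizer
nltk.download('stopwords')  # Common words to exclude (e.g., 'the', 'is')
nltk.download('wordnet')  # Lexical database for lemmatization
# Note: newer NLTK releases may look for 'punkt_tab' instead of 'punkt'.
# If a LookupError mentions it, run nltk.download('punkt_tab') as well.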

In [None]:
# Alternatively
# nltk.download()

#### **Step 2: Load the Data**

In [2]:
# Assume you want to create a chatbot based on the text of a book (e.g., "Alice in Wonderland").

# Load text file
with open('alice_in_wonderland.txt', 'r', encoding='utf-8') as f:
    text = f.read().replace('\n', ' ')

#### **Step 3: Preprocess the Data**

In [3]:
# Preprocessing involves tokenizing, removing stopwords, and lemmatizing (reducing words to their base form).
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocess each sentence
def preprocess(sentence):
    tokens = word_tokenize(sentence.lower())
    tokens = [word for word in tokens if word not in stop_words and word not in string.punctuation]
    return [lemmatizer.lemmatize(token) for token in tokens]

# Tokenize text into sentences
sentences = nltk.sent_tokenize(text)
corpus = [preprocess(sentence) for sentence in sentences]

#### **Step 4: Implement Jaccard Similarity for Response Matching**
Now, we implement the **Jaccard Similarity** to find the most relevant response to a user’s query.

In [18]:
def jaccard_similarity(query, sentence):
    # query is raw text; sentence is an already-preprocessed list of tokens from corpus
    query_set = set(preprocess(query))
    sentence_set = set(sentence)
    union = query_set.union(sentence_set)
    # Guard against division by zero when both sets are empty (e.g., only stopwords)
    if not union:
        return 0.0
    return len(query_set.intersection(sentence_set)) / len(union)

def get_response(query):
    max_similarity = 0
    best_response = "I couldn't find a relevant answer."  # Fallback when nothing overlaps
    for i, sentence in enumerate(corpus):
        similarity = jaccard_similarity(query, sentence)
        if similarity > max_similarity:
            max_similarity = similarity
            best_response = sentences[i]  # Return the original sentence, stopwords included
    return best_response

In [19]:
# Example query
user_query = "Who does Alice meet first in Wonderland?"
response = get_response(user_query)
print(response)

He had been looking at Alice for some time with great curiosity, and this was his first speech.


#### **Step 5: Testing the Chatbot**
You can now interact with your chatbot by entering different queries.
- Sample questions:
    1. Who does Alice meet first in Wonderland?
    2. What is the Cheshire Cat's famous line?
    3. How does Alice enter Wonderland?
    4. What is the Queen of Hearts known for?
    5. Why did Alice follow the White Rabbit?
    6. What was Alice's reaction to the Mad Hatter's tea party?
    7. What advice does the Caterpillar give Alice?
    8. What is the significance of the bottle labeled 'Drink Me'?
    9. How does the story of Alice in Wonderland end?
    10. What game does the Queen of Hearts play with Alice?

In [15]:
while True:
    print("Type in 'quit' to quit")
    user_input = input("You: ")
    if user_input.lower() == "quit":
        break
    response = get_response(user_input)
    print("Bot:", response)

Type in 'quit' to quit


You:  What advice does the Caterpillar give Alice?


Bot: Poor Alice!
Type in 'quit' to quit


You:  What is the significance of the bottle labeled 'Drink Me'?


Bot: The poor little Lizard, Bill, was in the middle, being held up by two guinea-pigs, who were giving it something out of a bottle.
Type in 'quit' to quit


You:  quit


## F. Build a Chatbot in Streamlit

In [None]:
# Create the file chatbot_wonderland.py in write mode
with open("chatbot_wonderland.py", "w") as file:
    # Writing the Streamlit code into the file
    file.write('''
    
##### Let's build a beginner-friendly chatbot in Streamlit #####
# This project will build a chatbot that reads a text file, processes it, and returns relevant answers based on user input.

# Importing necessary libraries

# nltk (Natural Language Toolkit) library for various text processing tasks
import nltk
import streamlit as st  # Streamlit is used for building interactive web applications
from nltk.tokenize import word_tokenize, sent_tokenize  # Tokenizers for splitting text into words and sentences
from nltk.corpus import stopwords  # List of common words (stopwords) that are usually removed from text (like "is", "the", "and")
from nltk.stem import WordNetLemmatizer  # Lemmatizer to reduce words to their base form (e.g., 'cats' -> 'cat')
import string  # Python's built-in library for handling strings and punctuation

# Uncomment to download necessary NLTK resources if not downloaded already
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')

# Load stopwords and initialize lemmatizer
stop_words = set(stopwords.words('english'))  # Load a set of common English stopwords to filter out later
lemmatizer = WordNetLemmatizer()  # Initialize a lemmatizer to reduce words to their base form

# Define a function to preprocess text (tokenizing, removing stopwords and punctuation, lemmatizing)
def preprocess(sentence):
    # Tokenize the sentence into words and convert to lowercase
    words = word_tokenize(sentence.lower())
    
    # Remove stopwords and punctuation from the list of words
    words = [word for word in words if word not in stop_words and word not in string.punctuation]
    
    # Lemmatize each word to its base form; with no part-of-speech tag, words are treated as nouns (e.g., 'cats' -> 'cat')
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # Return the list of processed words
    return words

# Load the text file (Alice in Wonderland)
def load_text():
    try:
        # Path to the text file (expected in the directory you run Streamlit from)
        file_path = 'alice_in_wonderland.txt'
        
        # Open the file, read its content, and replace newline characters with spaces
        with open(file_path, 'r', encoding='utf-8') as file:
            return file.read().replace('\\n', ' ')
    
    # Handle case where the file is not found and display an error message in Streamlit
    except FileNotFoundError:
        st.error("Text file not found.")
        return ""

# Tokenize the text into sentences and preprocess them
def prepare_corpus(text):
    # Tokenize the text into individual sentences using sent_tokenize
    sentences = sent_tokenize(text)
    
    # Preprocess each sentence (tokenizing, removing stopwords/punctuation, and lemmatizing)
    return [preprocess(sentence) for sentence in sentences]

# Calculate Jaccard similarity between two sets
def jaccard_similarity(query, sentence):
    # Convert both the query and sentence to sets (unique words)
    query_set = set(query)
    sentence_set = set(sentence)
    
    # If the union of both sets is zero, return 0 to avoid division by zero
    if len(query_set.union(sentence_set)) == 0:
        return 0
    
    # Calculate the Jaccard similarity as the size of intersection divided by the size of union
    return len(query_set.intersection(sentence_set)) / len(query_set.union(sentence_set))

# Find the most relevant sentence using Jaccard similarity
def get_most_relevant_sentence(query, corpus, original_sentences):
    # Preprocess the user query (tokenization, stopword removal, etc.)
    query = preprocess(query)
    
    # Initialize variables to store the maximum similarity and best matching sentence
    max_similarity = 0
    best_sentence = "I couldn't find a relevant answer."  # Default response if no match is found
    
    # Iterate over the corpus of preprocessed sentences to find the best match
    for i, sentence in enumerate(corpus):
        # Calculate the Jaccard similarity between the user query and the current sentence
        similarity = jaccard_similarity(query, sentence)
        
        # If the similarity score is higher than the current maximum, update the best sentence
        if similarity > max_similarity:
            max_similarity = similarity
            best_sentence = original_sentences[i]  # Retrieve the original sentence (before preprocessing)
    
    # Return the most relevant sentence (or the default response if no match is found)
    return best_sentence

# Main function to create the chatbot interface in Streamlit
def main():
    # Title for the app
    st.title("Wonderland's Novice Chatbot")
    
    # A brief description of the chatbot's purpose
    st.write("Hello! Ask me anything related to Alice in Wonderland!")
    
    # Add a dropdown (expander) for suggested questions
    with st.expander("Click me for suggestions"):
        st.write("""
        1. Who does Alice meet first in Wonderland?
        2. What is the Cheshire Cat's famous line?
        3. How does Alice enter Wonderland?
        4. What is the Queen of Hearts known for?
        5. Why did Alice follow the White Rabbit?
        6. What was Alice's reaction to the Mad Hatter's tea party?
        7. What advice does the Caterpillar give Alice?
        8. What is the significance of the bottle labeled 'Drink Me'?
        9. How does the story of Alice in Wonderland end?
        10. What game does the Queen of Hearts play with Alice?
        """)
    # Load and prepare text corpus
    text = load_text()  # Load the text from the file (Alice in Wonderland)
    if text:
        # Preprocess the text to create a corpus of tokenized sentences
        corpus = prepare_corpus(text)  # Prepares the text into a list of preprocessed sentences
        original_sentences = sent_tokenize(text)  # Tokenizes the original text into sentences for later reference

        # Get user input from the Streamlit interface
        user_input = st.text_input("Enter your question:")  # Input field for the user's question

        # If the user clicks the submit button
        if st.button("Submit"):
            if user_input:
                # Get the most relevant sentence from the corpus based on the user's input
                response = get_most_relevant_sentence(user_input, corpus, original_sentences)
                st.write(f"Chatbot: {response}")  # Display the chatbot's response
            else:
                st.write("Please enter a question.")  # Prompt user to enter a question if the input is empty

# Run the Streamlit app
if __name__ == "__main__":
    main()  # Call the main function to run the Streamlit app
    ''')

print("chatbot_wonderland.py creation executed successfully!")

### **Steps to Run This Chatbot in Streamlit**

1. **Install the required libraries**:
   Open your terminal or command prompt and run:
   ```bash
   pip install streamlit nltk
   ```

2. **Place the text file**:
   Download the text of **Alice in Wonderland** (or any other text) and save it as `alice_in_wonderland.txt` in the same directory as the Python file.

3. **Run the Streamlit app**:
   In the terminal, navigate to the directory where the script is saved and run the following command:
   ```bash
   streamlit run chatbot_wonderland.py
   ```
   This will open a new window in your browser where you can interact with the chatbot.

---
_**Your Dataness**_,  
**`Obinna Oliseneku`** (_**Hybraid**_)  
**[LinkedIn](https://www.linkedin.com/in/obinnao/)** | **[GitHub](https://github.com/hybraid6)**  