<a href="https://colab.research.google.com/github/ilirjanahyseni/data-science-chatbot/blob/main/DataScience_Chatbot_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook builds a chatbot that scrapes the Data Science Wikipedia page and answers user questions about it. It combines TF-IDF sentence retrieval with a pre-trained transformer model for question answering. Its main components are:

1. Library Imports: The script begins by importing necessary Python libraries for HTTP requests, HTML parsing, natural language processing, vectorization, and machine learning.

2. NLTK Data Download: It ensures that essential NLTK data packages (punkt and wordnet) are available, which are crucial for text tokenization and lemmatization.

3. Question-Answering Pipeline: The script initializes a question-answering pipeline using Hugging Face's Transformers library, which loads a pre-trained extractive model that selects an answer span from a supplied context.

4. Web Scraping: It fetches and parses the content from the Data Science Wikipedia page using requests and BeautifulSoup, storing the text in a lowercased format for further processing.

5. Text Preprocessing: The script sets up functions for tokenizing and normalizing text, including lemmatization and punctuation removal, to prepare the data for analysis.

6. Response Generation: The response function uses TF-IDF vectorization and cosine similarity to generate responses based on the user's input, leveraging the scraped Wikipedia content.

7. Enhanced Question Answering: For specific questions (detected by keywords like "what is" or "explain"), the script uses the pre-trained model to extract a short answer from the scraped text instead of returning a whole sentence.

8. Interaction Loop: Finally, the script enters a loop where it interacts with the user, processing input, generating responses using either the traditional method or the pre-trained model, and continuing the conversation until the user opts to exit.


In [68]:
import requests # Used to make HTTP requests to fetch the Wikipedia page
from bs4 import BeautifulSoup # Parses the HTML content of the Wikipedia page to extract text
import nltk # The Natural Language Toolkit, used for text processing like tokenization and lemmatization
import numpy as np # Numerical operations
import random # Generating random choices
import string # Provides a list of punctuation characters for text cleaning

# Convert text to TF-IDF vectors and compute similarity between text segments
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Provides access to the question-answering pipeline and pre-trained models
from transformers import pipeline

In [69]:
# Downloading NLTK packages
nltk.download('punkt', quiet=True) # for sentence tokenization
nltk.download('wordnet', quiet=True) # for lemmatization

True

**Question-Answering Pipeline Initialization**:

- Initializes a pipeline for question-answering tasks using a pre-trained model.
- When you call pipeline('question-answering') without specifying a model, Hugging Face's Transformers library falls back to a default pre-trained model (distilbert-base-cased-distilled-squad), a DistilBERT model fine-tuned on the SQuAD dataset for question answering.

In [70]:
# Initialize the question-answering pipeline with a pre-trained model
qa_pipeline = pipeline('question-answering')

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


To specify a different model for the task, you can do so by providing the model's identifier when you initialize the pipeline. For example:

qa_pipeline = pipeline('question-answering', model='bert-large-uncased-whole-word-masking-finetuned-squad')

This would initialize the pipeline with a different BERT variant that's also fine-tuned on the SQuAD dataset.


**Web Scraping:**

- The script fetches the Data Science Wikipedia page using requests.
- BeautifulSoup is used to parse the HTML and extract all paragraph text (p tags), which is then converted to lowercase.
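
The paragraph extraction can also be tried offline on a small HTML string. The sketch below is not the notebook's method: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages, and `ParagraphExtractor` and the sample markup are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self._inside_p = False   # True while the parser is inside a <p> element
        self._buffer = []        # text fragments of the current paragraph
        self.paragraphs = []     # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = []
            self._inside_p = False

    def handle_data(self, data):
        # Text inside nested inline tags such as <b> is still part of the paragraph
        if self._inside_p:
            self._buffer.append(data)

def extract_paragraph_text(html):
    """Return the lowercased, space-joined text of all paragraphs in html."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return " ".join(parser.paragraphs).lower()

sample = "<div><p>Data <b>science</b> is fun.</p><span>skip</span><p>Second.</p></div>"
print(extract_paragraph_text(sample))
```

The result mirrors what the scraping cell produces: one lowercased string built only from paragraph text, with navigation and other markup left out.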

In [71]:
# Web scraping to fetch the content
url = "https://en.wikipedia.org/wiki/Data_science"
page = requests.get(url, timeout=10)
page.raise_for_status() # Stop early if the page could not be fetched
soup = BeautifulSoup(page.content, "html.parser")
text = " ".join([p.text for p in soup.find_all('p')]).lower()
sent_tokens = nltk.sent_tokenize(text)

**Text Preprocessing Functions**:

- Lemmatization: The script initializes a WordNetLemmatizer to convert words to their base form.
- Tokenization and Cleaning: Functions are defined to tokenize text and remove punctuation, preparing it for vectorization.
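
The punctuation-removal step can be illustrated without NLTK. The following is a minimal sketch that assumes plain whitespace tokenization and skips lemmatization entirely; `normalize` and `PUNCT_TABLE` are hypothetical names, not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    return text.lower().translate(PUNCT_TABLE).split()

print(normalize("What is Data Science, exactly?"))
```

In the notebook itself, the equivalent step additionally passes each token through the WordNet lemmatizer.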

In [72]:
# Text preprocessing functions
lemmer = nltk.stem.WordNetLemmatizer()
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

**Response Generation (Traditional Method)**:

- The response function uses TF-IDF to vectorize the text and calculates cosine similarity to find the most similar sentence in the scraped content to the user's input.
- If the best similarity score is exactly zero (the input shares no terms with any scraped sentence), it returns an apology; otherwise, it returns the most similar sentence.
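
To make the retrieval idea concrete, here is a simplified pure-Python sketch of TF-IDF weighting and cosine similarity. It assumes whitespace tokenization and a smoothed IDF formula, so its scores will not match scikit-learn's exactly; `tfidf_vectors`, `cosine`, and `best_match` are hypothetical helpers, not functions from the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumption of this sketch)
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_match(sentences, query):
    """Return (sentence, score) for the closest sentence, or (None, 0.0)."""
    docs = [s.lower().split() for s in sentences + [query]]
    vecs = tfidf_vectors(docs)
    # Compare the query (last vector) with every corpus sentence
    scores = [cosine(vecs[-1], v) for v in vecs[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return (sentences[best], scores[best]) if scores[best] > 0 else (None, 0.0)

corpus = [
    "data science uses statistics",
    "machine learning builds models",
    "wikipedia is an encyclopedia",
]
print(best_match(corpus, "statistics in data science"))
print(best_match(corpus, "hello there"))
```

Unlike the notebook's function, this sketch keeps the query out of the comparison set, so there is no need to skip the top-ranked self-match.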

In [73]:
# Response generation function
def response(user_response):
    # Append the query so it is vectorized together with the corpus;
    # the caller removes it again after printing the reply.
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # Similarity of the query (last row) to every row, including itself
    vals = cosine_similarity(tfidf[-1], tfidf)
    # The top score is normally the query matched with itself, so take the runner-up
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        # No term overlap with any scraped sentence
        return "I am sorry! I don't understand you."
    return sent_tokens[idx]

**Enhanced Question Answering**:

- For queries that are identified as questions (containing phrases like "what is" or "explain"), the script uses the question-answering pipeline.
- get_answer_from_model sends the user's query and the scraped text as context to the pipeline, which then returns a specific answer.
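
The question-answering pipeline returns a dictionary containing `score`, `start`, `end`, and `answer`, and the notebook uses only `answer`. As a hedged sketch of one possible refinement, the confidence score could gate the reply. Here `answer_or_fallback`, the `min_score` threshold of 0.2, and the `fake_qa` stand-in are all assumptions made for illustration; `fake_qa` exists only so the example runs without downloading a model.

```python
def answer_or_fallback(qa, question, context, min_score=0.2):
    """Return the model's answer, or a fallback message when confidence is low."""
    result = qa(question=question, context=context)
    if result["score"] < min_score:
        return "I am not sure about that."
    return result["answer"]

def fake_qa(question, context):
    """Offline stand-in that mimics the shape of the pipeline's output."""
    found = "data science" in context.lower()
    return {
        "score": 0.9 if found else 0.05,
        "start": 0,
        "end": 0,
        "answer": "an interdisciplinary field" if found else "",
    }

print(answer_or_fallback(fake_qa, "what is data science?",
                         "Data science is an interdisciplinary field."))
print(answer_or_fallback(fake_qa, "what is data science?", "Unrelated text."))
```

With a real pipeline, `qa_pipeline` would be passed in place of `fake_qa`; a suitable threshold would need to be chosen empirically.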

In [74]:
# Enhancing the chatbot's ability to answer specific questions
def get_answer_from_model(question, context):
    return qa_pipeline(question=question, context=context)['answer']

**Main Interaction Loop**:

- The chatbot introduces itself and invites the user to ask questions.
- In a loop, the chatbot lowercases the user's input and checks whether the user wants to exit ("bye") or is thanking it ("thanks" or "thank you"); both end the conversation. Otherwise, it chooses between the traditional response generation method and the enhanced question-answering pipeline based on the type of query.
- The chatbot prints the generated answer, a "you're welcome" reply, or a farewell message, and the loop continues until the user exits.
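
The branching described above can be separated from the input loop so that it is easy to test. This is a sketch under stated assumptions rather than the notebook's code: `route` and the constant names are hypothetical, and the added `strip()` call (which tolerates stray whitespace) is not in the original loop.

```python
EXIT_WORDS = {"bye"}
THANKS_WORDS = {"thanks", "thank you"}
QA_TRIGGERS = ("what is", "explain")

def route(user_input):
    """Classify one line of input as 'exit', 'thanks', 'qa', or 'tfidf'."""
    text = user_input.strip().lower()
    if text in EXIT_WORDS:
        return "exit"
    if text in THANKS_WORDS:
        return "thanks"
    # Substring match, so "please explain pandas" also goes to the QA model
    if any(trigger in text for trigger in QA_TRIGGERS):
        return "qa"
    return "tfidf"

for line in ["What is data science?", "wow", "thank you", "Bye"]:
    print(line, "->", route(line))
```

A loop built on this function would call the question-answering pipeline for `'qa'`, the TF-IDF `response` function for `'tfidf'`, and stop on `'exit'` or `'thanks'`.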

In [75]:
# Main interaction loop
flag = True
print("BOT: My name is DataBot. I can answer your questions about data science. If you want to exit, type Bye!")
while flag:
    user_response = input("You: ")
    user_response = user_response.lower()
    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("BOT: You're welcome!")
        else:
            if 'what is' in user_response or 'explain' in user_response:
                print("BOT: ", end="")
                print(get_answer_from_model(user_response, context=text))
            else:
                print("BOT: ", end="")
                print(response(user_response))
                sent_tokens.pop() # Drop the query that response() appended to the end
    else:
        flag = False
        print("BOT: Goodbye!")

BOT: My name is DataBot. I can answer your questions about data science. If you want to exit, type Bye!
You: What is data science?
BOT: the sexiest job of the 21st century
You: wow
BOT: I am sorry! I don't understand you.
You: What is data science ?
BOT: the sexiest job of the 21st century
You: bye
BOT: Goodbye!




---

