# <font color='red'> SciBot trained with Wikipedia, a selected text file, a PDF folder, or SciBERT 

### <font color='blue'> * NLTK to process the text
    
### <font color='blue'> * Scikit-learn or TensorFlow to train model 

### <font color='blue'> * Chatbot framework (ChatterBot or Rasa) to receive user input and generate responses.
    
####    References: 
####         https://www.kaggle.com/code/rajkumarl/wiki-ir-chatbot
####         https://github.com/allenai/scibert
####         https://arxiv.org/abs/1903.10676
####         https://arxiv.org/pdf/1908.08835.pdf

# <font color='green'> 1. If training the chatbot with SciBERT 
    
SciBERT is a pre-trained transformer-based model that can be fine-tuned for various natural language understanding tasks, such as text classification, named entity recognition and question answering. 

Here is a Python skeleton that loads a SciBERT model for a text classification task using the Hugging Face Transformers library, as the starting point for fine-tuning:


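from transformers import AutoModelForSequenceClassification, AutoTokenizer

### Load the SciBERT model and its tokenizer
# "allenai/scibert_scivocab_cased" is the Hugging Face Hub id of the cased SciBERT checkpoint.
# Set num_labels to the number of classes in your dataset (2 is only a placeholder).
model_name = "allenai/scibert_scivocab_cased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

tokenizer = AutoTokenizer.from_pretrained(model_name)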

### Prepare the training data
### load, preprocess the training data

### Fine-tune the SciBERT model on the training data

### Integrate the fine-tuned model into a chatbot framework such as ChatterBot or Rasa
 
 
This code uses the Hugging Face Transformers library to load the pre-trained SciBERT model and its tokenizer, which can then be fine-tuned on the training data. After fine-tuning, the model can be integrated into a chatbot framework such as ChatterBot or Rasa to handle user input and generate responses.

Please note that this is only skeleton code: you still need to load and preprocess the training dataset, fine-tune the model, and integrate it into a chatbot framework. Also adjust the model name to the one you want to use (the cased SciBERT checkpoint in this example) and set the number of classes to match your dataset.
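
The number of classes mentioned above reaches the model through the `num_labels` argument of `from_pretrained`. The following is a minimal sketch of how string labels could be turned into integer ids first. The helper names `build_label_maps` and `load_classifier`, and the label values, are hypothetical and chosen only for illustration.

```python
def build_label_maps(labels):
    """Map each distinct label string to an integer id and back."""
    unique = sorted(set(labels))
    label2id = {label: idx for idx, label in enumerate(unique)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


def load_classifier(model_name, label2id, id2label):
    """Load a classification model whose output layer matches the label set.

    transformers is imported here so the label helper above
    can be used without the library installed.
    """
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id,
    )


# Hypothetical labels, for illustration only
train_labels = ["method", "result", "background", "result"]
label2id, id2label = build_label_maps(train_labels)
encoded = [label2id[label] for label in train_labels]
print(label2id)
print(encoded)
```

Calling `load_classifier("allenai/scibert_scivocab_cased", label2id, id2label)` would then download the checkpoint and attach a classification head with one output per label.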

# <font color='green'> 2. If using a selected folder of PDF files to train

import os

import PyPDF2

from nltk.stem import WordNetLemmatizer

from nltk.tokenize import word_tokenize

### Define the folder path where the PDF files are located
folder_path = 'training_data_folder'

### Initialize an empty string to store the text from all PDF files
all_text = ""

### Iterate over all PDF files in the folder
    
for filename in os.listdir(folder_path):
    if filename.endswith('.pdf'):
        # Open the PDF file
        with open(os.path.join(folder_path, filename), 'rb') as file:
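            pdf_reader = PyPDF2.PdfReader(file)

            # Extract the text from each page of the PDF file
            # (PdfReader, .pages and .extract_text() replace the PdfFileReader,
            # numPages, getPage and extractText names removed in PyPDF2 3.0)
            for page in pdf_reader.pages:
                # fall back to an empty string if a page yields no text
                text = page.extract_text() or ""

                # Add the text to the overall text from all PDF files,
                # with a newline so words on adjacent pages do not run together
                all_text += text + "\n"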

### Preprocess the text data
lemmatizer = WordNetLemmatizer()
words = word_tokenize(all_text)
words = [lemmatizer.lemmatize(word) for word in words]

### Use the preprocessed text to train a chatbot model (using a library like NLTK or scikit-learn)
# ... (model training code here)

### Integrate the trained model into a chatbot framework (like ChatterBot or Rasa)
### ... (chatbot framework code here)
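
Text extracted from PDF pages usually carries layout artifacts such as words hyphenated across line breaks and hard line breaks inside sentences. A minimal sketch of a cleanup step that could run on `all_text` before tokenizing is shown here. The function name `clean_pdf_text` and the sample string are hypothetical, and the regular expressions are a simple heuristic rather than a complete solution.

```python
import re


def clean_pdf_text(raw_text):
    """Undo common PDF extraction artifacts before tokenizing."""
    # Rejoin words hyphenated across a line break: "implemen-\ntation"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw_text)
    # Collapse line breaks and runs of spaces into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Hypothetical sample of raw extracted text
sample = "The implemen-\ntation reads every\npage  of the file."
print(clean_pdf_text(sample))
```

One limitation of this heuristic: a genuinely hyphenated compound that happens to break at the end of a line is joined as well, so it loses its hyphen.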

# <font color='green'> 3. If using a selected text file to train SciBot

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

### Reading text file
file_path = 'chatbot_training_data.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

### Preprocessing the text 
lemmatizer = WordNetLemmatizer()
words = word_tokenize(text)
words = [lemmatizer.lemmatize(word) for word in words]

### Then use NLTK or scikit-learn to train a chatbot
 
### Integrate the trained model into ChatterBot or Rasa

# <font color='green'> 4. If using Wikipedia API

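import wikipediaapi

### Creating API object
# Recent Wikipedia-API releases require a user_agent string;
# the value below is a placeholder, replace it with your own project and contact.
wiki = wikipediaapi.Wikipedia(user_agent='SciBot (example@example.com)',
                              language='en',
                              extract_format=wikipediaapi.ExtractFormat.WIKI)
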
### Extracting text                              
page_title = 'Chatbot'
page = wiki.page(page_title)
if page.exists():
    print(page.text)
else:
    print("Page does not exist")

### Then use NLTK or scikit-learn to train a chatbot
 
### Integrate the trained model into ChatterBot or Rasa
 

# <font color='green'> 5. Scraping without using Wikipedia API


# To scrape Wikipedia
from bs4 import BeautifulSoup
# To access contents from URLs
import requests
# to preprocess text
import nltk
# to handle punctuations
from string import punctuation
# TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# cosine similarity score
from sklearn.metrics.pairwise import cosine_similarity 
# to do array operations
import numpy as np
# to have sleep option
from time import sleep 

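# One-time NLTK data downloads (uncomment on first run);
# newer NLTK releases may need 'punkt_tab' instead of 'punkt'
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('stopwords')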

### <font color='green'> A multicomponent chatbot class 
### a. Initialization
### b. Greetings from SciBot
### c. Chat function (called by user) that controls inputs, responses, data scraping, preprocessing, modeling.
### d. Receive input from the user 
### e. Response from SciBot
### f. Extracting information from sources
### g. Text Preprocessing
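
For the "more" request handled in component c to work, each sentence has to remember which paragraph it came from. The bookkeeping can be sketched as follows. The regex splitter `split_sentences` is a naive stand-in for NLTK's `sent_tokenize`, used here only to keep the example dependency-free, and the paragraphs are invented sample data.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def index_sentences(paragraphs):
    """Return all sentences plus, for each one, its source paragraph index."""
    sentences, para_indices = [], []
    for i, para in enumerate(paragraphs):
        found = split_sentences(para)
        sentences.extend(found)
        # one paragraph index per sentence found in this paragraph
        para_indices.extend([i] * len(found))
    return sentences, para_indices


# Invented sample paragraphs
paragraphs = [
    "Chatbots answer questions. Some use retrieval.",
    "TF-IDF weighs rare words more heavily.",
]
sentences, para_indices = index_sentences(paragraphs)

current_sent_idx = 1  # the sentence last shown to the user
print(sentences[current_sent_idx])                 # short answer
print(paragraphs[para_indices[current_sent_idx]])  # "more": the whole paragraph
```

Because `sentences` and `para_indices` always have the same length, the index of the last retrieved sentence is enough to look up its full paragraph.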

class ChatBot():
    
    # initialize bot
    def __init__(self):
        # flag whether to end chat
        self.end_chat = False
        # flag whether topic is found in wikipedia
        self.got_topic = False
        # flag whether to call respond()
        # in some cases the response has already been made
        self.do_not_respond = True
        
        # wikipedia title
        self.title = None
        # wikipedia scraped para and description data
        self.text_data = []
        # data as sentences
        self.sentences = []
        # to keep track of paragraph indices
        # corresponding to all sentences
        self.para_indices = []
        # currently retrieved sentence id
        self.current_sent_idx = None
        
        # a punctuation dictionary
        self.punctuation_dict = str.maketrans({p:None for p in punctuation})
        # wordnet lemmatizer for preprocessing text
        self.lemmatizer = nltk.stem.WordNetLemmatizer()
        # collection of stopwords
        self.stopwords = nltk.corpus.stopwords.words('english')
        # initialize chatting
        self.greeting()

    # greeting method - to be called internally
    # chatbot initializing chat on screen with greetings
    def greeting(self):
        print("Initializing ChatBot ...")
        # some time to get user ready
        sleep(2)
        # chat ending tags
        print('Type "bye" or "quit" or "exit" to end chat')
        sleep(2)
        # chatbot descriptions
        print('\nEnter your topic of interest when prompted. \
        \nChatBot will access Wikipedia, prepare itself to \
        \nrespond to your queries on that topic. \n')
        sleep(3)
        print('ChatBot will respond with short info. \
        \nIf you input "more", it will give you detailed info \
        \nYou can also jump to next query')
        # give time to read what has been printed
        sleep(3)
        print('-'*50)
        # Greet and introduce
        greet = "Hello, Great day! Please give me a topic of your interest. "
        print("ChatBot >>  " + greet)
        
    # chat method - should be called by user
    # chat method controls inputs, responses, data scraping, preprocessing, modeling.
    # once an instance of ChatBot class is initialized, chat method should be called
    # to do the entire chatting on one go!
    def chat(self):
        # continue chat
        while not self.end_chat:
            # receive input
            self.receive_input()
            # finish chat if opted by user
            if self.end_chat:
                print('ChatBot >>  See you soon! Bye!')
                sleep(2)
                print('\nQuitting ChatBot ...')
            # if data scraping successful
            elif self.got_topic:
                # in case not already responded
                if not self.do_not_respond:
                    self.respond()
                # clear flag so that bot can respond next time
                self.do_not_respond = False
    
    # receive_input method - to be called internally
    # receives input from user and makes preliminary decisions
    def receive_input(self):
        # receive input from user
        text = input("User    >> ")
        # end conversation if user wishes so
        if text.lower().strip() in ['bye', 'quit', 'exit']:
            # turn flag on 
            self.end_chat=True
        # if user needs more information 
        elif text.lower().strip() == 'more':
            # respond here itself
            self.do_not_respond = True
            # if at least one query has been received 
            if self.current_sent_idx is not None:
                response = self.text_data[self.para_indices[self.current_sent_idx]]
            # prompt user to start querying
            else:
                response = "Please input your query first!"
            print("ChatBot >>  " + response)
        # if topic is not chosen
        elif not self.got_topic:
            self.scrape_wiki(text)
        else:
            # add user input to sentences, so that we can vectorize in whole
            self.sentences.append(text)
                
    # respond method - to be called internally
    def respond(self):
        # tf-idf-modeling
        vectorizer = TfidfVectorizer(tokenizer=self.preprocess)
        # fit data and obtain tf-idf vector
        tfidf = vectorizer.fit_transform(self.sentences)
        # calculate cosine similarity scores
        scores = cosine_similarity(tfidf[-1],tfidf) 
        # identify the closest sentence; the highest score is the query
        # matched against itself, so take the second highest
        self.current_sent_idx = scores.argsort()[0][-2]
        # find the corresponding score value
        scores = scores.flatten()
        scores.sort()
        value = scores[-2]
        # if there is matching sentence
        if value != 0:
            print("ChatBot >>  " + self.sentences[self.current_sent_idx]) 
        # if no sentence is matching the query
        else:
            print("ChatBot >>  I am not sure. Sorry!" )
        # remove the user query from sentences
        del self.sentences[-1]
        
    # scrape_wiki method - to be called internally.
    # called when user inputs topic of interest.
    # employs requests to access Wikipedia via URL.
    # employs BeautifulSoup to scrape paragraph tagged data
    # and h1 tagged article heading.
    # employs NLTK to tokenize data
    def scrape_wiki(self,topic):
        # process topic as required by Wikipedia URL system
        topic = topic.lower().strip().capitalize().split(' ')
        topic = '_'.join(topic)
        try:
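            # create a URL
            link = 'https://en.wikipedia.org/wiki/'+ topic
            # access contents via URL; a missing page (HTTP 404) raises
            # an exception so that the except branch below is reached
            response = requests.get(link, timeout=10)
            response.raise_for_status()
            data = response.content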
            # parse data as soup object
            soup = BeautifulSoup(data, 'html.parser')
            # extract all paragraph data
            # scrape strings with html tag 'p'
            p_data = soup.find_all('p')
            # scrape strings with html tag 'dd'
            dd_data = soup.find_all('dd')
            # scrape strings with html tag 'li'
            #li_data = soup.findAll('li')
            p_list = [p for p in p_data]
            dd_list = [dd for dd in dd_data]
            #li_list = [li for li in li_data]
            # iterate over all data
            for tag in p_list+dd_list: #+li_list:
                # a bucket to collect processed data
                a = []
                # iterate over para, desc data and list items contents
                for i in tag.contents:
                    # exclude references, superscripts, formattings
                    if i.name != 'sup' and i.string is not None:
                        stripped = ' '.join(i.string.strip().split())
                        # collect data pieces
                        a.append(stripped)
                # with collected string pieces formulate a single string
                # each string is a paragraph
                self.text_data.append(' '.join(a))
            
            # obtain sentences from paragraphs
            for i,para in enumerate(self.text_data):
                sentences = nltk.sent_tokenize(para)
                self.sentences.extend(sentences)
                # for each sentence, its para index must be known
                # it will be useful in case user prompts "more" info
                index = [i]*len(sentences)
                self.para_indices.extend(index)
            
            # extract h1 heading tag from soup object
            self.title = soup.find('h1').string
            # turn respective flag on
            self.got_topic = True
            # tell the user that the chatbot is ready now
            print('ChatBot >>  Topic is "Wikipedia: {}". Let\'s chat!'.format(self.title)) 
        # in case of unavailable topics
        except Exception as e:
            print('ChatBot >>  Error: {}. '
                  'Please input some other topic!'.format(e))
        
    # preprocess method - to be called internally by Tf-Idf vectorizer
    # text preprocessing, stopword removal, lemmatization, word tokenization
    def preprocess(self, text):
        # remove punctuations
        text = text.lower().strip().translate(self.punctuation_dict) 
        # tokenize into words
        words = nltk.word_tokenize(text)
        # remove stopwords
        words = [w for w in words if w not in self.stopwords]
        # lemmatize 
        return [self.lemmatizer.lemmatize(w) for w in words]
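
The `respond` method above ranks every stored sentence by TF-IDF cosine similarity to the user's query and answers with the best match. A dependency-free sketch of the same idea is given here to make the arithmetic visible. It is not the scikit-learn implementation: the tokenizer is a simple whitespace splitter, stopword removal and lemmatization are left out, the function names are hypothetical, and the sentences are invented sample data. The inverse document frequency uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's default.

```python
import math
from collections import Counter
from string import punctuation


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(punctuation).lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(sentences, query):
    """Vectorize sentences together with the query, then rank the sentences."""
    vectors = tfidf_vectors(sentences + [query])
    query_vec = vectors[-1]
    scores = [cosine(query_vec, v) for v in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Invented sample sentences
sentences = [
    "Machine learning is a field of study in artificial intelligence.",
    "Neural networks are inspired by biological brains.",
    "Decision trees split data on feature values.",
]
idx, score = best_match(sentences, "what are neural networks")
if score > 0:
    print("ChatBot >>  " + sentences[idx])
else:
    print("ChatBot >>  I am not sure. Sorry!")
```

The class appends the query to `self.sentences` and takes the second-highest score, because the highest is the query matched against itself. The sketch excludes the query from the candidates instead, which gives the same ranking.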


### <font color='green'> Start a Chat on Machine Learning Now ...

wiki = ChatBot()
# call chat method
wiki.chat()

Initializing ChatBot ...
Type "bye" or "quit" or "exit" to end chat

Enter your topic of interest when prompted.         
ChatBot will access Wikipedia, prepare itself to         
respond to your queries on that topic. 

ChatBot will respond with short info.         
If you input "more", it will give you detailed info         
You can also jump to next query
--------------------------------------------------
ChatBot >>  Hello, Great day! Please give me a topic of your interest. 


User    >>  machine learning


ChatBot >>  Topic is "Wikipedia: Machine learning". Let's chat!


User    >>  ok


ChatBot >>  I am not sure. Sorry!


User    >>  what is machine learning


ChatBot >>  A representative book on research into machine learning during the 1960s was Nilsson's book on Learning Machines, dealing mostly with machine learning for pattern classification.


User    >>  how old is machine learning


ChatBot >>  A representative book on research into machine learning during the 1960s was Nilsson's book on Learning Machines, dealing mostly with machine learning for pattern classification.


User    >>  branches of machine learning


ChatBot >>  The computational analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory via the Probably Approximately Correct Learning (PAC) model.
