# Asking a chatbot about chosen information


## Libraries or concepts used in the process
 
*   NLTK - a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

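*   TF-IDF - a statistical measure of how important a word is to one document in a collection. The weight grows with the word's frequency in that document and shrinks with the number of documents that contain the word.

*   Cosine similarity - a measure of similarity between two vectors (here, the TF-IDF vectors of two sentences), computed as the cosine of the angle between them.

*   WordNet - a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members.

*   WordNetLemmatizer - lemmatizes using WordNet's built-in morphy function. Lemmatization reduces a word to its dictionary form (lemma), for example "studies" to "study".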

*   Wikipedia - Python library that makes it easy to access and parse data from Wikipedia.
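
To make the TF-IDF and cosine similarity entries concrete, here is a minimal pure-Python sketch. It is not how scikit-learn computes the weights (`TfidfVectorizer` smooths the IDF term and normalizes each vector by default), and the helper names `tfidf_vectors` and `cosine` as well as the three sample sentences are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (plain tf * idf, no smoothing)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical mini-corpus
docs = [
    "linguistics studies language",
    "philosophy of language studies meaning",
    "chemistry studies molecules",
]
vecs = tfidf_vectors(docs)

print(vecs[0]["studies"])        # "studies" occurs in every document, so its weight is 0
print(cosine(vecs[0], vecs[1]))  # positive: the two sentences share "language"
print(cosine(vecs[0], vecs[2]))  # 0.0: the only shared word carries no weight
```

A word that appears in every document contributes nothing, which is why two sentences only count as similar when they share comparatively rare words.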



## Installs libraries

In [23]:
!pip install wikipedia

Collecting wikipedia
  Downloading https://files.pythonhosted.org/packages/67/35/25e68fbc99e672127cc6fbb14b8ec1ba3dfef035bf1e4c90f78f24a80b7d/wikipedia-1.4.0.tar.gz
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-cp36-none-any.whl size=11686 sha256=e33375fd23bf36a300a9e2d54d5f794bedc353444849a361231d2c01b019dd64
  Stored in directory: /root/.cache/pip/wheels/87/2a/18/4e471fd96d12114d16fe4a446d00c3b38fb9efcb744bd31f4a
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


## Imports libraries

In [24]:
import random
import re
import string
import unicodedata
import urllib.request
import warnings
from collections import defaultdict

import nltk
import wikipedia as wk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

warnings.filterwarnings("ignore")

# Download the NLTK resources used below: POS tagger, sentence tokenizer, WordNet
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


## Gets text data from url and cleans it

In [0]:
uf = urllib.request.urlopen("https://plato.stanford.edu/entries/linguistics/")
html = uf.read()

In [0]:
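def cleanhtml(raw_html):
    # Matches newlines, HTML tags, and character entities such as &amp; or &#160;
    cleanr = re.compile('\n|<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    # Matches are deleted outright, so text on either side of a tag is joined without a space
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext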

In [0]:
html = html.decode("utf-8") 
html = cleanhtml(html)
raw = html.lower()
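
The effect of the cleaning regex is easiest to see on a tiny input. The sketch below repeats `cleanhtml` so it runs on its own; the `sample` string and the space-inserting variant are assumptions added for illustration and are not part of the notebook.

```python
import re

def cleanhtml(raw_html):
    cleanr = re.compile('\n|<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return re.sub(cleanr, '', raw_html)

# Hypothetical HTML fragment with two paragraphs and one entity
sample = "<p>Syntax &amp; semantics.</p>\n<p>Next sentence.</p>"

cleaned = cleanhtml(sample)
print(repr(cleaned))   # the two paragraphs are fused: '...semantics.Next sentence.'

# Possible variant: replace matches with a space, then collapse runs of whitespace
cleanr = re.compile('\n|<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
spaced = " ".join(re.sub(cleanr, ' ', sample).split())
print(repr(spaced))    # 'Syntax semantics. Next sentence.'
```

Because matches are replaced with an empty string, sentences separated only by tags or newlines end up glued together, which can be seen in the tokenized sentences later in the notebook (for example `reference.as with`). Substituting a space keeps the sentence boundary visible to the tokenizer, at the cost of extra whitespace to collapse.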

## Shows the beginning of the text

In [61]:
raw[:300]

'                  -->  philosophy of linguistics (stanford encyclopedia of philosophy)                      stanford encyclopedia of philosophy                                              menu                                            browse                                      table of contents  '

## Sentence tokenizer

In [0]:
sent_tokens = nltk.sent_tokenize(raw)

In [56]:
sent_tokens[1:5]

['this differentiates it sharply from the philosophy of language, traditionally concerned with matters of meaning and reference.as with the philosophy of other special sciences, there are general topics relating to matters like methodology and explanation (e.g., the status of statistical explanations in psychology and sociology, or the physics-chemistry relation in philosophy of chemistry), and more specific philosophical issues that come up in the special science at issue (simultaneity for philosophy of physics; individuation of species and ecosystems for the philosophy of biology).',
 'general topics of the first type in the philosophy of linguistics include:what the subject matter is,what the theoretical goals are,what form theories should take, andwhat counts as data.specific topics include issues in language learnability, language change, the competence-performance distinction, and the expressive power of linguistic theories.there are also topics that fall on the borderline betwee

## Text normalisation

*   Word tokenization - Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded.

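*   Remove non-ASCII characters - each token is normalized to Unicode NFKD form and any character without an ASCII equivalent is dropped.
*   Remove tags - leftover HTML tags are stripped from the tokens.
*   POS tagging and lemmatization - each token is tagged with its part of speech, and the tag is used to reduce the token to its lemma with WordNetLemmatizer.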
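
The first two steps can be tried without any NLTK data. In this minimal sketch, `str.split()` stands in for `nltk.word_tokenize`, the helper names `strip_punctuation` and `to_ascii` are made up for illustration, and POS tagging and lemmatization are left out because they need the downloaded NLTK resources.

```python
import string
import unicodedata

def strip_punctuation(text):
    # Map every punctuation character to None so translate() deletes it
    table = {ord(ch): None for ch in string.punctuation}
    return text.translate(table)

def to_ascii(token):
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks
    return unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode("ascii")

# Hypothetical input sentence
sentence = "Chomsky's naïve café-goers: what counts as data?"
tokens = [to_ascii(t) for t in strip_punctuation(sentence.lower()).split()]
print(tokens)
# ['chomskys', 'naive', 'cafegoers', 'what', 'counts', 'as', 'data']
```

Note that deleting punctuation before tokenizing also fuses hyphenated and possessive forms (`café-goers` becomes `cafegoers`), which is a side effect of this ordering.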

In [0]:
def Normalize(text):
    # Map every punctuation character to None so translate() deletes it
    remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
    # Word tokenization on the lowercased, punctuation-free text
    word_token = nltk.word_tokenize(text.lower().translate(remove_punct_dict))
    # Remove non-ASCII characters
    new_words = []
    for word in word_token:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    # Remove leftover HTML tags (punctuation was already stripped above,
    # so this is only a safeguard)
    rmv = []
    for w in new_words:
        rmv.append(re.sub("</?.*?>", "", w))

    # POS tagging and lemmatization: the first letter of the Penn Treebank tag
    # selects the WordNet part of speech, defaulting to noun
    tag_map = defaultdict(lambda : wn.NOUN)
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB
    tag_map['R'] = wn.ADV
    lmtzr = WordNetLemmatizer()
    lemma_list = []
    # Drop tokens that became empty during cleaning
    rmv = [i for i in rmv if i]
    for token, tag in nltk.pos_tag(rmv):
        lemma = lmtzr.lemmatize(token, tag_map[tag[0]])
        lemma_list.append(lemma)
    return lemma_list

In [0]:
# TfidfVec = TfidfVectorizer(tokenizer=Normalize, stop_words='english')
# TfidfVec.fit_transform(sent_tokens)

## Creating greeting responses

In [0]:
# defining welcome input that will be recognized by bot
welcome_input = ("hello", "hi", "greetings", "sup", "what's up","hey",)
# defining welcome output from bot 
welcome_response = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def welcome(user_response):
    for word in user_response.split():
        if word.lower() in welcome_input:
            return random.choice(welcome_response)
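
A quick check of how `welcome` behaves, repeated here so the snippet runs on its own; the three test phrases are assumptions chosen for illustration.

```python
import random

welcome_input = ("hello", "hi", "greetings", "sup", "what's up", "hey")
welcome_response = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def welcome(user_response):
    for word in user_response.split():
        if word.lower() in welcome_input:
            return random.choice(welcome_response)

print(welcome("Hey there") in welcome_response)  # True: "hey" is a known greeting
print(welcome("what is syntax"))                 # None: no greeting word found
print(welcome("what's up"))                      # None: matching is word by word
```

Two details follow from this. The function returns `None` when nothing matches, so a caller is safest testing the result with `is not None`. And because the input is split on whitespace, a multi-word entry such as `"what's up"` can never match.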

## Generating response for the knowledge question

In [0]:
def generateResponse(user_response):
    robo_response=''
    # The query is appended so it is vectorized together with the corpus;
    # the caller removes it again after the response is generated
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=Normalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # TF-IDF rows are L2-normalized by default, so linear_kernel gives cosine similarity
    vals = linear_kernel(tfidf[-1], tfidf)
    # The highest score is the query matched against itself, so take the second highest
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0 or "tell me about" in user_response:
        print("Checking Wikipedia")
        return wikipedia_data(user_response)
    robo_response = robo_response + sent_tokens[idx]
    return robo_response

# Wikipedia search
def wikipedia_data(query):
    reg_ex = re.search('tell me about (.*)', query)
    # Without a "tell me about <topic>" phrase there is nothing to look up
    if not reg_ex:
        return "No content has been found"
    try:
        topic = reg_ex.group(1)
        return wk.summary(topic, sentences=3)
    except Exception:
        return "No content has been found"
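
The index `-2` in `generateResponse` is worth a closer look. The sketch below replays the selection step on plain Python lists; the score values are made up, and `best_match` is a hypothetical helper, not a function from the notebook.

```python
def best_match(scores):
    """Index and score of the best corpus sentence, assuming the query is last."""
    # Indices ordered from lowest to highest score, like numpy's argsort
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # order[-1] is the query compared with itself (score 1.0), so skip it
    idx = order[-2]
    return idx, scores[idx]

# Hypothetical similarities of the query to three sentences, then to itself
scores = [0.10, 0.00, 0.42, 1.00]
idx, score = best_match(scores)
print(idx, score)          # 2 0.42: the third sentence is the answer

# When no sentence shares a weighted word with the query, the runner-up is 0
no_overlap = [0.0, 0.0, 0.0, 1.0]
print(best_match(no_overlap)[1] == 0)   # True: this is the Wikipedia fallback case
```

Since the user's question is appended to `sent_tokens` before vectorizing, its similarity to itself is always the maximum. The second-highest score is therefore the best real match, and a value of 0 there means nothing in the text is related to the question.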

## Running the bot while True

In [60]:
flag=True
print("My name is Chatterbot and I'm a chatbot. If you want to exit, type Bye!")
while(flag==True):
    user_response = input()
    # changing input to lowercase
    user_response=user_response.lower()
    # checking if the user wants to exit
    if(user_response not in ['bye','shutdown','exit', 'quit']):
        if(user_response=='thanks' or user_response=='thank you' ):
            flag=False
            print("Chatterbot : You are welcome..")
        else:
            # welcome() returns None when no greeting word is found; call it once
            # so the reply that is tested is the reply that is printed
            greeting = welcome(user_response)
            if greeting is not None:
                print("Chatterbot : "+greeting)
            else:
                print("Chatterbot : ",end="")
                print(generateResponse(user_response))
                sent_tokens.remove(user_response)
    else:
        flag=False
        print("Chatterbot : Bye!!! ")

My name is Chatterbot and I'm a chatbot. If you want to exit, type Bye!
what is externalists
Chatterbot : it divides nearly all emergentists and externalists from most essentialists.


KeyboardInterrupt: ignored

In [0]:
Source: https://towardsdatascience.com/lets-build-an-intelligent-chatbot-7ea7f215ada6