# Chatbot built from a Wikipedia page
We walk step by step through the example from [this article](https://medium.com/analytics-vidhya/building-a-simple-chatbot-in-python-using-nltk-7c8c8215ac6e) to illustrate a chatbot that automatically retrieves its answers from an external source of information.

In [2]:
import nltk
import numpy as np
import random
import string
import re

We load the text and apply a few clean-ups:

In [4]:
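# read the whole file; the context manager closes it afterwards
with open('chatbot.txt', 'r', errors='ignore', encoding='utf8') as f:
    raw = f.read()
raw = raw.lower()
# a few clean-ups:
# drop the byte-order mark, then bracketed markers of 1 or 2 characters such as [1] or [12]
raw = re.sub(r"\ufeff", "", raw)
raw = re.sub(r"\[.{1,2}\]", "", raw)
raw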

'a chatbot is a piece of software that conducts a conversation via auditory or textual methods. such programs are often designed to convincingly simulate how a human would behave as a conversational partner, although as of 2019, they are far short of being able to pass the turing test. chatbots are typically used in dialog systems for various practical purposes including customer service or information acquisition. some chatbots use sophisticated natural language processing systems, but many simpler ones scan for keywords within the input, then pull a reply with the most matching keywords, or the most similar wording pattern, from a database.\n\nthe term "chatterbot" was originally coined by michael mauldin (creator of the first verbot, julia) in 1994 to describe these conversational programs. today, most chatbots are accessed via virtual assistants such as google assistant and amazon alexa, via messaging apps such as facebook messenger or wechat, or via individual organizations\' apps
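To see what the two substitutions actually remove, here is a minimal sketch on a made-up string (the `sample` text is hypothetical, not taken from `chatbot.txt`). It suggests that the pattern `\[.{1,2}\]` only strips bracketed markers of one or two characters, so longer ones such as `[citation needed]` are left in place.

```python
import re

# Hypothetical sample imitating text copied from a Wikipedia page
sample = "\ufeffA chatbot[1] is software[12] that chats.[citation needed]"

cleaned = sample.lower()
cleaned = re.sub(r"\ufeff", "", cleaned)      # drop the byte-order mark
cleaned = re.sub(r"\[.{1,2}\]", "", cleaned)  # drop markers such as [1] or [12]

print(cleaned)
```

If longer markers matter for your source page, the quantifier would need widening, at the risk of deleting legitimate bracketed text.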

We split the text into sentences and into words:

In [5]:
nltk.download('punkt')
nltk.download('wordnet')
sent_tokens = nltk.sent_tokenize(raw)
sent_tokens

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\antoi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\antoi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['a chatbot is a piece of software that conducts a conversation via auditory or textual methods.',
 'such programs are often designed to convincingly simulate how a human would behave as a conversational partner, although as of 2019, they are far short of being able to pass the turing test.',
 'chatbots are typically used in dialog systems for various practical purposes including customer service or information acquisition.',
 'some chatbots use sophisticated natural language processing systems, but many simpler ones scan for keywords within the input, then pull a reply with the most matching keywords, or the most similar wording pattern, from a database.',
 'the term "chatterbot" was originally coined by michael mauldin (creator of the first verbot, julia) in 1994 to describe these conversational programs.',
 "today, most chatbots are accessed via virtual assistants such as google assistant and amazon alexa, via messaging apps such as facebook messenger or wechat, or via individual or

In [6]:
# for reference:
nltk.word_tokenize(raw)

['a',
 'chatbot',
 'is',
 'a',
 'piece',
 'of',
 'software',
 'that',
 'conducts',
 'a',
 'conversation',
 'via',
 'auditory',
 'or',
 'textual',
 'methods',
 '.',
 'such',
 'programs',
 'are',
 'often',
 'designed',
 'to',
 'convincingly',
 'simulate',
 'how',
 'a',
 'human',
 'would',
 'behave',
 'as',
 'a',
 'conversational',
 'partner',
 ',',
 'although',
 'as',
 'of',
 '2019',
 ',',
 'they',
 'are',
 'far',
 'short',
 'of',
 'being',
 'able',
 'to',
 'pass',
 'the',
 'turing',
 'test',
 '.',
 'chatbots',
 'are',
 'typically',
 'used',
 'in',
 'dialog',
 'systems',
 'for',
 'various',
 'practical',
 'purposes',
 'including',
 'customer',
 'service',
 'or',
 'information',
 'acquisition',
 '.',
 'some',
 'chatbots',
 'use',
 'sophisticated',
 'natural',
 'language',
 'processing',
 'systems',
 ',',
 'but',
 'many',
 'simpler',
 'ones',
 'scan',
 'for',
 'keywords',
 'within',
 'the',
 'input',
 ',',
 'then',
 'pull',
 'a',
 'reply',
 'with',
 'the',
 'most',
 'matching',
 'keywords'
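To make the two levels of splitting concrete, here is a rough standard-library approximation. The helpers `naive_sent_split` and `naive_word_split` are hypothetical names introduced for illustration, and they are not what NLTK does internally: they only show the idea of cutting on sentence-ending punctuation and then separating words from punctuation marks.

```python
import re

def naive_sent_split(text):
    """Cut after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_split(sentence):
    """Return words and individual punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "chatbots are used in dialog systems, e.g. for customer service. some use nlp."
sentences = naive_sent_split(text)
for sentence in sentences:
    print(sentence)
print(naive_word_split(sentences[-1]))
```

The naive rule wrongly cuts after the abbreviation "e.g.", which is the kind of case a trained sentence tokenizer is designed to handle better.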

We build a function that lemmatizes the tokens. It will later be passed as a parameter when the TF-IDF matrix is created.

In [7]:
lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    # lemmatize each token (tokens are treated as nouns by default)
    return [lemmer.lemmatize(token) for token in tokens]

# translation table that deletes every ASCII punctuation character
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # lowercase, strip punctuation, tokenize, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

# an example:
LemNormalize("you are pretty sure they are cats, aren't you? God you're so stupid")

['you',
 'are',
 'pretty',
 'sure',
 'they',
 'are',
 'cat',
 'arent',
 'you',
 'god',
 'youre',
 'so',
 'stupid']
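The output above contains `arent` and `youre` because punctuation is deleted before tokenization. The sketch below isolates that step with the standard library only, reusing the same translation table as the cell above. It stops before lemmatization, so `cats` is still plural here. In the real output `are` stays unchanged because `WordNetLemmatizer.lemmatize` treats tokens as nouns unless a part of speech is given.

```python
import string

# Same table as in the cell above: every ASCII punctuation mark maps to None
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

sentence = "you are pretty sure they are cats, aren't you?"
stripped = sentence.lower().translate(remove_punct_dict)

print(stripped)
print(stripped.split())
```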

This is the heart of the code:  
- We build the TF-IDF matrix, which measures how important a term is within a sentence, relative both to that sentence and to all the other sentences in our corpus.  
- We compute the similarity between the sentence entered by the user and every sentence in the corpus, based on this matrix.  
- We return the closest sentence as the answer to the user.  
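
The three steps above can be sketched in plain Python on a toy corpus. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the weighting is raw term frequency times `log(N / df) + 1`, the tokenizer is a plain `split()`, and the names `idf_table`, `tfidf_vector`, `cosine` and `best_match` as well as the three sentences are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the sentences of the wiki page
corpus = [
    "a chatbot is a piece of software",
    "a chatbot can handle customer service",
    "the term chatterbot was coined in 1994",
]

def tokenize(text):
    return text.lower().split()

def idf_table(docs):
    """Inverse document frequency: terms found in few sentences weigh more."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unknown to the corpus are ignored."""
    counts = Counter(tokenize(text))
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

idf = idf_table(corpus)
doc_vectors = [tfidf_vector(doc, idf) for doc in corpus]

def best_match(question):
    """Return the most similar sentence, or None when nothing overlaps."""
    q_vec = tfidf_vector(question, idf)
    scores = [cosine(q_vec, d_vec) for d_vec in doc_vectors]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return corpus[best] if scores[best] > 0 else None

print(best_match("what is a chatbot"))
print(best_match("who coined the term chatterbot"))
print(best_match("pizza"))
```

The last call returns `None` because the question shares no term with the corpus, which corresponds to the "no match" case of the bot.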
  
  
If you want to dig deeper into TF-IDF matrices and cosine similarity, you can read [this article](https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/).  If you would like a source in French, [this article](https://www.quentinfily.fr/tf-idf-pertinence-lexicale/) is decent but less complete.  
To go further on why a cosine distance is used here rather than the usual Euclidean distance, see [this article](https://cmry.github.io/notes/euclidean-v-cosine) (not a priority).
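
As a small illustration of the cosine versus Euclidean question, the sketch below compares three hypothetical term-count vectors. `long_doc` has the same word proportions as `short_doc` but is ten times longer, while `other_doc` uses mostly different words. The vectors and function names are made up for the example.

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short_doc = [1.0, 2.0, 0.0]    # hypothetical counts for three terms
long_doc = [10.0, 20.0, 0.0]   # same proportions, ten times longer
other_doc = [0.0, 1.0, 2.0]    # mostly different vocabulary

print("cosine    short/long :", round(cosine_sim(short_doc, long_doc), 3))
print("cosine    short/other:", round(cosine_sim(short_doc, other_doc), 3))
print("euclidean short/long :", round(euclidean(short_doc, long_doc), 3))
print("euclidean short/other:", round(euclidean(short_doc, other_doc), 3))
```

Cosine similarity rates `long_doc` as the closer vector because it depends only on direction, whereas Euclidean distance rates `other_doc` as closer simply because its length is similar. For sentences of varying length, the first behaviour is usually the one we want.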

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed set-up (not shown in the original cell): fit the TF-IDF model on the
# corpus sentences, using LemNormalize as the tokenizer as announced earlier.
# stop_words='english' is an assumption.
tfidf = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
phrases_tf = tfidf.fit_transform(sent_tokens)

def response(user_response):
    # the string has to be wrapped in a list:
    phrase_user = [user_response]
    # compute the TF-IDF values for the user's sentence
    user_tf = tfidf.transform(phrase_user)
    # compute the similarity between the user's question
    # and every sentence of the wiki page
    similarity = cosine_similarity(user_tf, phrases_tf).flatten()
    # get the index of the most similar sentence
    index_max_sim = np.argmax(similarity)
    # a maximum similarity of 0 means no match was found
    if similarity[index_max_sim] == 0:
        robo_response = "I didn't find this info, sorry"
    # otherwise, return the best-matching sentence:
    else:
        robo_response = sent_tokens[index_max_sim]
    return robo_response
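
The selection step at the end of `response` can be looked at in isolation. The sketch below uses a hand-written similarity list instead of a real TF-IDF computation, and adds an optional `min_score` threshold. Both the helper `pick_sentence` and the threshold are hypothetical additions, not part of the original notebook. The idea is that a test against exactly 0 only rejects questions sharing no term at all with the corpus, so weak matches still get returned.

```python
def pick_sentence(similarity, sentences, min_score=0.0,
                  fallback="I didn't find this info, sorry"):
    """Return the sentence with the highest score, or the fallback
    when that score does not exceed min_score."""
    best = max(range(len(similarity)), key=lambda i: similarity[i])
    if similarity[best] <= min_score:
        return fallback
    return sentences[best]

# Hypothetical sentences and similarity scores
sentences = ["sentence a", "sentence b", "sentence c"]
scores = [0.05, 0.12, 0.0]

print(pick_sentence(scores, sentences))                 # default: any non-zero score wins
print(pick_sentence(scores, sentences, min_score=0.2))  # stricter: weak matches are rejected
```

A suitable threshold value depends on the corpus and would have to be tuned by hand.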

We also create some stock replies for the case where the user simply greets the bot (we could add as many as we like):

In [9]:
GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]
def greeting(sentence):
    # return a random greeting if any word of the sentence is a known greeting
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
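
One limitation of `greeting` as written: it compares whitespace-separated words, so "hello!" (with punctuation attached) and the two-word entry "what's up" can never match. The variant below is a possible workaround, not the original author's code. `greeting_robust` is a hypothetical name, and keeping apostrophes while stripping other punctuation is a design choice made for this sketch.

```python
import random
import string

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello",
                      "I am glad! You are talking to me"]

# Strip punctuation but keep apostrophes so that "what's" survives
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("'", ""))

def greeting_robust(sentence):
    """Return a random greeting if the sentence contains a known greeting,
    including multi-word ones; otherwise return None."""
    words = sentence.lower().translate(_PUNCT_TABLE).split()
    for greet in GREETING_INPUTS:
        greet_words = greet.split()
        n = len(greet_words)
        # slide a window of n words over the sentence
        if any(words[i:i + n] == greet_words for i in range(len(words) - n + 1)):
            return random.choice(GREETING_RESPONSES)
    return None

print(greeting_robust("Hello!"))
print(greeting_robust("what's up?"))
print(greeting_robust("this is a chatbot"))
```

The comparison stays at word level, so "this" does not trigger a match on "hi".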

Once these functions are defined, all that is left is to create our bot, as we did in the simple examples:

In [None]:
flag = True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")
while flag:
    user_response = input()
    user_response = user_response.lower()
    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("ROBO: You are welcome..")
        else:
            # call greeting() only once: it picks a random reply on each call
            greet = greeting(user_response)
            if greet is not None:
                print("ROBO: " + greet)
            else:
                print("ROBO: " + response(user_response))
    else:
        flag = False
        print("ROBO: Bye! take care..")

ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!
hi robot
ROBO: hey
what is a chatbot?


ROBO: hello barbie is an internet-connected version of the doll that uses a chatbot provided by the company toytalk, which previously used the chatbot for a range of smartphone-based characters for children.
who created chatbots?


ROBO: dbpedia created a chatbot during the gsoc of 2017. and can communicate through facebook messenger.
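
The dialogue loop mixes decision logic with `input()` and `print()`, which makes it awkward to test. One possible restructuring, sketched below, moves the decisions into a function that returns the reply and a flag telling the caller whether to continue. `make_reply`, `fake_greeting` and `fake_response` are hypothetical names. The two stand-ins replace `greeting` and `response` so that the example runs without NLTK or scikit-learn.

```python
def make_reply(greeting_fn, response_fn):
    """Build a reply(user_input) function returning (text, keep_going)."""
    def reply(user_input):
        text = user_input.lower().strip()
        if text == "bye":
            return "Bye! take care..", False
        if text in ("thanks", "thank you"):
            return "You are welcome..", False
        greet = greeting_fn(text)
        if greet is not None:
            return greet, True
        return response_fn(text), True
    return reply

# Stand-ins for the notebook's greeting() and response() functions
def fake_greeting(sentence):
    return "hey" if "hi" in sentence.split() else None

def fake_response(sentence):
    return "a chatbot is a piece of software that conducts a conversation."

reply = make_reply(fake_greeting, fake_response)

for line in ["hi robot", "what is a chatbot?", "Bye"]:
    answer, keep_going = reply(line)
    print("ROBO: " + answer)
    if not keep_going:
        break
```

In the notebook, `make_reply(greeting, response)` would plug in the real functions, and the `while` loop would shrink to reading input and printing the returned text.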
