Practicing NLTK by building a retrieval-based chatbot. The chatbot's responses are drawn from a selection of Reddit comments taken from Kaggle.

In [122]:
import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
import pandas as pd
import re
import string

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mike\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Mike\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Mike\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [123]:
reddit_text_df = pd.read_csv('kaggle_RC_2019-05.csv')

# Examining data
print(reddit_text_df.head())
print(len(reddit_text_df))
print(reddit_text_df.subreddit.nunique())
print(reddit_text_df.subreddit.unique())


       subreddit                                               body  \
0  gameofthrones  Your submission has been automatically removed...   
1            aww  Dont squeeze her with you massive hand, you me...   
2         gaming  It's pretty well known and it was a paid produ...   
3           news  You know we have laws against that currently c...   
4       politics  Yes, there is a difference between gentle supp...   

   controversiality  score  
0                 0      1  
1                 0     19  
2                 0      3  
3                 0     10  
4                 0      1  
1000000
40
['gameofthrones' 'aww' 'gaming' 'news' 'politics' 'dankmemes'
 'relationship_advice' 'nba' 'worldnews' 'AskReddit' 'AmItheAsshole'
 'SquaredCircle' 'The_Donald' 'leagueoflegends' 'hockey' 'videos'
 'teenagers' 'gonewild' 'movies' 'funny' 'pics' 'marvelstudios' 'memes'
 'soccer' 'freefolk' 'MortalKombat' 'todayilearned' 'apexlegends' 'asoiaf'
 'Market76' 'Animemes' 'FortNiteBR' 'nfl' 't

The data set contains 1,000,000 Reddit comments drawn from 40 subreddits. That is probably too much data for the chatbot to respond reasonably quickly; moreover, some of these subreddits contain a lot of offensive content. I will therefore use responses taken solely from r/relationship_advice, making this a relationship advice chatbot!

In [124]:
relad_df = reddit_text_df[reddit_text_df['subreddit'] == 'relationship_advice']

print(len(relad_df))
print(relad_df.head())
print(relad_df.body.iloc[0])

25000
               subreddit                                               body  \
6    relationship_advice  I would be less worried about how he fucked up...   
18   relationship_advice  I would actually just like to say that there a...   
209  relationship_advice  Do you find it relevant when it happens or if ...   
300  relationship_advice  Don't bother giving her the power to make a ch...   
342  relationship_advice  If the relationship is not wholly fulfilling, ...   

     controversiality  score  
6                   0      7  
18                  0      0  
209                 0      1  
300                 0      2  
342                 0      1  
I would be less worried about how he fucked up in the past and more worried that I still can't get a straight answer about it now.


25,000 responses is far too many, especially since most of them are longer than a single sentence, so we keep only the first 100. We then transfer the responses from the pandas DataFrame into a list and filter for offensive terms, dropping any response that contains one.

In [125]:
text = relad_df.body.iloc[0:100].tolist()  # first 100 responses as a plain list
print(text[0:10])

["I would be less worried about how he fucked up in the past and more worried that I still can't get a straight answer about it now.", "I would actually just like to say that there are other possibilities at play here.\n\nI laughed when I read this, because just today I downloaded tinder because we were discussing dating at our age, and the single people were saying  it's difficult to find cool people.    So I downloaded it, swiped through, and deleted all within 10 minutes.\n\nI have 0 interest in other woman, and I love my wife, and there's no way I'm cheating on her....but I still downloaded it.   Btw my wife knows, and thinks it's harmless fun.   \n\n So anyway, she might be cheating, but no need to go crazy just yet.", 'Do you find it relevant when it happens or if the topic happens to come up?', "Don't bother giving her the power to make a choice. Walk away from her and never look back that way you can keep your dignity! You will thank yourself later in life for standing up for y

In [126]:
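# Raw string so that the \w escapes reach the regex engine unchanged
profanity_pattern = re.compile(r"cu+nt|\w*shi+t\w*|fu+ck\w*|ni+gg\w*")

clean_text = []
for item in text:
    # Keeping only the responses with no match for the pattern
    if not profanity_pattern.search(item.lower()):
        clean_text.append(item)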

print(len(clean_text))
print(clean_text[0:2])
        

85
["I would actually just like to say that there are other possibilities at play here.\n\nI laughed when I read this, because just today I downloaded tinder because we were discussing dating at our age, and the single people were saying  it's difficult to find cool people.    So I downloaded it, swiped through, and deleted all within 10 minutes.\n\nI have 0 interest in other woman, and I love my wife, and there's no way I'm cheating on her....but I still downloaded it.   Btw my wife knows, and thinks it's harmless fun.   \n\n So anyway, she might be cheating, but no need to go crazy just yet.", 'Do you find it relevant when it happens or if the topic happens to come up?']


With profanity removed and the text placed in a list, I will now begin processing the text for use with the chatbot. We will use scikit-learn's TfidfVectorizer and cosine similarity to determine which response is the most appropriate for the chatbot to return. However, as we want to keep the original version of each response so that the chatbot can return it, we will keep the data in two forms: its original version and its lemmatized version.
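
To make the matching idea concrete, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity. It is not the notebook's pipeline: it tokenizes on whitespace only, the function names and the toy corpus are hypothetical, and the smoothed IDF formula is an assumption meant to mirror scikit-learn's documented default settings.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times a smoothed IDF (assumed scikit-learn default)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(vec_a, vec_b):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    return sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())

# Hypothetical toy corpus: two candidate replies plus a user question (last item)
corpus = [
    "talk to your partner about trust",
    "buy a new phone charger",
    "how do i rebuild trust with my partner",
]
vectors = tfidf_vectors(corpus)
scores = [cosine(vectors[-1], vec) for vec in vectors[:-1]]
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, corpus[best])
```

The first reply shares the terms "trust" and "partner" with the question and so gets a positive score, while the second shares no terms and scores zero. That zero case is the one the chatbot has to handle with a fallback message.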

In [127]:
# Using string.punctuation to remove punctuation

def remove_punct(text):
    punctuation_dictionary = dict((ord(item), None) for item in string.punctuation)
    new_text = [item.lower().translate(punctuation_dictionary) for item in text]
    return new_text
    
filtered_text = remove_punct(clean_text)
print(filtered_text[0:2])

# Using NLTK to tokenize the responses, remove stopwords and tag the parts of speech (for more accurate lemmatization)

stop_words = stopwords.words('english') 

# Function to translate pos_tag tags into the form needed for WordNet

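def get_wordnet_pos(sentence):
    # Mapping the first letter of each Penn Treebank tag to a WordNet POS constant
    tag_dict = {
        "R": wordnet.ADV,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "J": wordnet.ADJ
    }
    tag_list = []
    # Tagging the whole token list once instead of re-tagging it for every word
    for word, tag in pos_tag(sentence):
        # Any other part of speech falls back to noun
        tag_list.append((word, tag_dict.get(tag[0].upper(), wordnet.NOUN)))
    return tag_list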

# Determining parts of speech and converting into the appropriate form for WordNet
def pos_words(text):
    new_text = []
    for item in text: # Accessing each response
        new_item = []
        for word in word_tokenize(item): # Tokenising each response
            if word not in stop_words: # Checking the tokenised words against stopwords
                new_item.append(word)
        pos_new_item = get_wordnet_pos(new_item) # Tagging the part of speech for each word
        new_text.append(pos_new_item)
    return new_text

filtered_text = pos_words(filtered_text)
print(filtered_text[0:2])



['i would actually just like to say that there are other possibilities at play here\n\ni laughed when i read this because just today i downloaded tinder because we were discussing dating at our age and the single people were saying  its difficult to find cool people    so i downloaded it swiped through and deleted all within 10 minutes\n\ni have 0 interest in other woman and i love my wife and theres no way im cheating on herbut i still downloaded it   btw my wife knows and thinks its harmless fun   \n\n so anyway she might be cheating but no need to go crazy just yet', 'do you find it relevant when it happens or if the topic happens to come up']
[[('would', 'n'), ('actually', 'r'), ('like', 'v'), ('say', 'n'), ('possibilities', 'n'), ('play', 'v'), ('laughed', 'v'), ('read', 'v'), ('today', 'n'), ('downloaded', 'v'), ('tinder', 'n'), ('discussing', 'v'), ('dating', 'v'), ('age', 'n'), ('single', 'a'), ('people', 'n'), ('saying', 'v'), ('difficult', 'a'), ('find', 'a'), ('cool', 'a'), 
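
The tag translation can be looked at in isolation. NLTK's default tagger returns Penn Treebank tags such as `VBD` or `NNS`, while the lemmatizer expects one of WordNet's single-letter codes. The sketch below is a standalone illustration with a hypothetical helper name; it hard-codes the letters 'n', 'v', 'a' and 'r' that appear in the tagged output above rather than importing the WordNet constants.

```python
# WordNet POS codes as single letters, matching the 'n', 'v', 'a', 'r'
# values visible in the tagged output above
WORDNET_POS = {"J": "a", "N": "n", "R": "r", "V": "v"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag such as 'VBD' to a WordNet POS letter.

    Hypothetical helper: only the first letter of the tag is inspected,
    and any other part of speech falls back to noun.
    """
    return WORDNET_POS.get(tag[:1].upper(), "n")

for tag in ["NNS", "VBD", "JJ", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The noun fallback explains entries such as `('would', 'n')` in the output: modal verbs carry the tag `MD`, which matches none of the four letters.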

In [128]:
# Lemmatizing our tokenized words

lemmatizer = WordNetLemmatizer()

def lemmatize_text(pos_text):
    new_text = []
    for response in pos_text:
        new_response = []
        for word_tag in response:
            new_response.append(lemmatizer.lemmatize(word_tag[0], word_tag[1]))
        new_text.append(new_response)
    return new_text

lemma_text = lemmatize_text(filtered_text)
print(lemma_text[0:2])

lemma_text = [" ".join(word) for word in lemma_text[0:3]]
print(lemma_text[0:3])

[['would', 'actually', 'like', 'say', 'possibility', 'play', 'laugh', 'read', 'today', 'download', 'tinder', 'discuss', 'date', 'age', 'single', 'people', 'say', 'difficult', 'find', 'cool', 'people', 'download', 'swipe', 'delete', 'within', '10', 'minute', '0', 'interest', 'woman', 'love', 'wife', 'there', 'way', 'im', 'cheat', 'herbut', 'still', 'download', 'btw', 'wife', 'know', 'think', 'harmless', 'fun', 'anyway', 'might', 'cheat', 'need', 'go', 'crazy', 'yet'], ['find', 'relevant', 'happens', 'topic', 'happen', 'come']]
['would actually like say possibility play laugh read today download tinder discuss date age single people say difficult find cool people download swipe delete within 10 minute 0 interest woman love wife there way im cheat herbut still download btw wife know think harmless fun anyway might cheat need go crazy yet', 'find relevant happens topic happen come', 'dont bother give power make choice walk away never look back way keep dignity thank later life stand']


In [129]:
# Combining the previous steps into one function which can tokenise and lemmatize text
def lemmafunction(text):
    de_punct_text = remove_punct(text)
    pos_text = pos_words(de_punct_text)
    lemmad_text = lemmatize_text(pos_text)
    lemma_response = [" ".join(response) for response in lemmad_text]
    return lemma_response

print(lemmafunction(clean_text[0:2]))



['would actually like say possibility play laugh read today download tinder discuss date age single people say difficult find cool people download swipe delete within 10 minute 0 interest woman love wife there way im cheat herbut still download btw wife know think harmless fun anyway might cheat need go crazy yet', 'find relevant happens topic happen come']


With a lemmatizing function built, I can now set up a TfidfVectorizer from scikit-learn. This vectorizer takes in the user's message together with our candidate replies.

In [130]:
# Importing cosine similarities
from sklearn.metrics.pairwise import cosine_similarity
# Building Tfidf Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


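# TfidfVectorizer passes the tokenizer one document (a string) at a time and expects
# a list of tokens back, so we wrap lemmafunction, which works on a list of responses
def lemma_tokenizer(document):
    return lemmafunction([document])[0].split()

tfidfVec = TfidfVectorizer(tokenizer=lemma_tokenizer)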
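
A detail worth checking when a callable is passed as `tokenizer`: scikit-learn calls it with one document string at a time, whereas `remove_punct` above iterates over a list of responses. The standalone sketch below copies the logic of `remove_punct` (the sample string is made up) to show what happens in each case.

```python
import string

def remove_punct(text):
    # Same logic as the notebook's remove_punct: expects a list of strings
    table = {ord(ch): None for ch in string.punctuation}
    return [item.lower().translate(table) for item in text]

document = "Hi, there!"

# A bare string is iterated character by character
as_string = remove_punct(document)
# Wrapping the string in a list treats it as a single document
as_list = remove_punct([document])

print(as_string)
print(as_list)
```

In the first call every character becomes its own "response", so any function built on `remove_punct` should receive a list, even when it holds only one document.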


We now have everything we need on the vectorizing side. Now I will build the chatbot.

In [142]:
import random # using random so that it will not always be the same response

class Chatbot():
    def __init__(self):
        self.exit_commands = ['end', 'exit', 'goodbye', 'quit', 'stop']
        
    # Function to handle conversation opening with user    
    def intro(self):
        print('Welcome to the relationship advice chatbot!')
        user_input = input('Are you able to chat now? (y/n)\n> ').lower()
        if user_input in ['y', 'yes', 'yeah']:
            self.chatting()
        else:
            print('Okay, talk to you later!')
    
    # Function to handle general conversation
    def chatting(self):
        print('\nAsk me a question about relationships!')
        user_input = input('> ')
        while user_input not in self.exit_commands:
            user_input = input(self.generate_response(user_input))
        print('Goodbye then!')
    
    def generate_response(self, user_input):
        # Appending the user input to the end of our cleaned text
        new_text = clean_text + [user_input]
        tfidf_text = tfidfVec.fit_transform(new_text)

        # Comparing the user input (last row) against every document
        values = cosine_similarity(tfidf_text[-1], tfidf_text)
        # Picking one of the four best matches; position -1 is the user input itself
        random_idx = random.choice(range(-2, -6, -1))
        idx = values.argsort()[0][random_idx]
        # Looking up the similarity score of the match that was actually chosen
        req_tfidf = values.flatten()[idx]
        print('req_tfidf: ', req_tfidf, idx)  # debug output
        if req_tfidf == 0:
            return 'Sorry, I can\'t help you.\n> '
        return clean_text[idx] + '\n> '
        
        
    
chatbot = Chatbot()
chatbot.intro()
        

Welcome to the relationship advice chatbot!
Are you able to chat now? (y/n)
> y

Ask me a question about relationships!
> exit
Goodbye then!
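
The reply selection inside `generate_response` can be sketched without NumPy or scikit-learn. The helper below is hypothetical and simplified: it assumes the similarity list already excludes the user's own message, so there is no need to skip the top-ranked entry as the notebook does. The point it illustrates is that the score must be read back with the index of the chosen reply, not with its rank.

```python
import random

def pick_reply(similarities, replies, top_k=4, rng=random):
    """Pick one of the top_k most similar replies (hypothetical helper).

    similarities holds one score per reply, in the same order as replies.
    """
    # Indices ordered from least to most similar, as argsort would return them
    order = sorted(range(len(similarities)), key=similarities.__getitem__)
    idx = rng.choice(order[-top_k:])
    # Reading the score by the chosen index, not by its position in the ranking
    if similarities[idx] == 0:
        return "Sorry, I can't help you."
    return replies[idx]

# Made-up scores for five candidate replies
replies = ["reply a", "reply b", "reply c", "reply d", "reply e"]
scores = [0.0, 0.7, 0.1, 0.4, 0.2]
print(pick_reply(scores, replies))
```

Choosing at random among the few best matches keeps the bot from repeating itself, at the cost of sometimes returning a weaker match; lowering `top_k` trades variety for relevance.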
