This notebook practices NLTK by building a retrieval-based chatbot. The chatbot's responses are drawn from a selection of Reddit comments taken from Kaggle.

In [125]:
import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd
import re
import string

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mike\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Mike\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Mike\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


In [126]:
reddit_text_df = pd.read_csv('kaggle_RC_2019-05.csv')

# Examining data
print(reddit_text_df.head())
print(len(reddit_text_df))
print(reddit_text_df.subreddit.nunique())
print(reddit_text_df.subreddit.unique())


       subreddit                                               body  \
0  gameofthrones  Your submission has been automatically removed...   
1            aww  Dont squeeze her with you massive hand, you me...   
2         gaming  It's pretty well known and it was a paid produ...   
3           news  You know we have laws against that currently c...   
4       politics  Yes, there is a difference between gentle supp...   

   controversiality  score  
0                 0      1  
1                 0     19  
2                 0      3  
3                 0     10  
4                 0      1  
1000000
40
['gameofthrones' 'aww' 'gaming' 'news' 'politics' 'dankmemes'
 'relationship_advice' 'nba' 'worldnews' 'AskReddit' 'AmItheAsshole'
 'SquaredCircle' 'The_Donald' 'leagueoflegends' 'hockey' 'videos'
 'teenagers' 'gonewild' 'movies' 'funny' 'pics' 'marvelstudios' 'memes'
 'soccer' 'freefolk' 'MortalKombat' 'todayilearned' 'apexlegends' 'asoiaf'
 'Market76' 'Animemes' 'FortNiteBR' 'nfl' 't

The data set contains 1,000,000 Reddit comments drawn from 40 subreddits. That is probably too much data for the chatbot to respond reasonably quickly; further, some of these subreddits contain a lot of offensive content. I will therefore use responses taken solely from r/relationship_advice, making this a relationship advice chatbot!

In [127]:
relad_df = reddit_text_df[reddit_text_df['subreddit'] == 'relationship_advice']

print(len(relad_df))
print(relad_df.head())
print(relad_df.body.iloc[0])

25000
               subreddit                                               body  \
6    relationship_advice  I would be less worried about how he fucked up...   
18   relationship_advice  I would actually just like to say that there a...   
209  relationship_advice  Do you find it relevant when it happens or if ...   
300  relationship_advice  Don't bother giving her the power to make a ch...   
342  relationship_advice  If the relationship is not wholly fulfilling, ...   

     controversiality  score  
6                   0      7  
18                  0      0  
209                 0      1  
300                 0      2  
342                 0      1  
I would be less worried about how he fucked up in the past and more worried that I still can't get a straight answer about it now.


25,000 responses is far too many, especially since many of them are longer than a single sentence, so we keep only the first 2,500. We then transfer the responses from the pandas DataFrame into a list and filter for offensive terms, dropping any response that contains one.

In [128]:
text = relad_df.body.iloc[0:2500].tolist()
print(text[0:10])

["I would be less worried about how he fucked up in the past and more worried that I still can't get a straight answer about it now.", "I would actually just like to say that there are other possibilities at play here.\n\nI laughed when I read this, because just today I downloaded tinder because we were discussing dating at our age, and the single people were saying  it's difficult to find cool people.    So I downloaded it, swiped through, and deleted all within 10 minutes.\n\nI have 0 interest in other woman, and I love my wife, and there's no way I'm cheating on her....but I still downloaded it.   Btw my wife knows, and thinks it's harmless fun.   \n\n So anyway, she might be cheating, but no need to go crazy just yet.", 'Do you find it relevant when it happens or if the topic happens to come up?', "Don't bother giving her the power to make a choice. Walk away from her and never look back that way you can keep your dignity! You will thank yourself later in life for standing up for y

In [129]:
clean_text = []
for item in text:
    # Raw string so that \w reaches the regex engine as a word-character class
    if re.search(r"cu+nt|\w*shi+t\w*|fu+ck\w*|ni+gg\w*", item.lower()):
        continue
    else:
        clean_text.append(item)

print(len(clean_text))
print(clean_text[0:10])
        

2263
["I would actually just like to say that there are other possibilities at play here.\n\nI laughed when I read this, because just today I downloaded tinder because we were discussing dating at our age, and the single people were saying  it's difficult to find cool people.    So I downloaded it, swiped through, and deleted all within 10 minutes.\n\nI have 0 interest in other woman, and I love my wife, and there's no way I'm cheating on her....but I still downloaded it.   Btw my wife knows, and thinks it's harmless fun.   \n\n So anyway, she might be cheating, but no need to go crazy just yet.", 'Do you find it relevant when it happens or if the topic happens to come up?', "Don't bother giving her the power to make a choice. Walk away from her and never look back that way you can keep your dignity! You will thank yourself later in life for standing up for yourself", "If the relationship is not wholly fulfilling, it's probably a sign that the two of you aren't meant to be together. Al

With profanity removed and the text placed in a list, I will now begin processing the text for use with the chatbot. We will use scikit-learn's TfidfVectorizer and cosine similarity to determine which response is most appropriate for a given input. However, because we want to keep the original version of each response so that the chatbot can return it, we will keep the data in two forms: its original version and its lemmatized version.
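As a rough illustration of the retrieval step described above, the sketch below hand-rolls TF-IDF weights and cosine similarity using only the standard library. It is a stand-in for a library vectorizer such as scikit-learn's `TfidfVectorizer`, not a copy of it: the function names (`tfidf_vectors`, `cosine_similarity`, `best_response`) are hypothetical, the smoothed IDF formula is one common choice, and the inputs are assumed to be already tokenized.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed inverse document frequency (an assumed, commonly used variant)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in Counter(doc).items()}
            for doc in tokenized_docs]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_response(user_tokens, response_tokens, responses):
    """Return the original response whose tokens are most similar to the input."""
    # Vectorize the user input together with the candidate responses
    vectors = tfidf_vectors(response_tokens + [user_tokens])
    user_vec = vectors[-1]
    scores = [cosine_similarity(user_vec, vec) for vec in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return responses[best], scores[best]
```

Keeping `responses` (the original text) separate from `response_tokens` (the processed text) mirrors the split described above: similarity is computed on the processed form, while the untouched original is what gets returned.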

In [130]:
# Using string.punctuation to remove punctuation

def remove_punct(text):
    punctuation_dictionary = dict((ord(item), None) for item in string.punctuation)
    new_text = [item.lower().translate(punctuation_dictionary) for item in text]
    return new_text
    
filtered_text = remove_punct(clean_text)
print(filtered_text[0:2])

# Using NLTK to tokenize the responses, remove stopwords and tag the parts of speech (for more accurate lemmatization)

stop_words = stopwords.words('english') 

def pos_words(text):
    new_text = []
    for item in text: # Accessing each response
        new_item = []
        for word in word_tokenize(item): # Tokenising each response
            if word not in stop_words: # Checking the tokenised words against stopwords
                new_item.append(word)
        pos_new_item = pos_tag(new_item) # Tagging the part of speech for each word
        new_text.append(pos_new_item)
    return new_text

filtered_text = pos_words(filtered_text)
print(filtered_text[0:2])

['i would actually just like to say that there are other possibilities at play here\n\ni laughed when i read this because just today i downloaded tinder because we were discussing dating at our age and the single people were saying  its difficult to find cool people    so i downloaded it swiped through and deleted all within 10 minutes\n\ni have 0 interest in other woman and i love my wife and theres no way im cheating on herbut i still downloaded it   btw my wife knows and thinks its harmless fun   \n\n so anyway she might be cheating but no need to go crazy just yet', 'do you find it relevant when it happens or if the topic happens to come up']
[[('would', 'MD'), ('actually', 'RB'), ('like', 'VB'), ('say', 'NN'), ('possibilities', 'NNS'), ('play', 'VBP'), ('laughed', 'VBN'), ('read', 'VBP'), ('today', 'NN'), ('downloaded', 'VBD'), ('tinder', 'NN'), ('discussing', 'VBG'), ('dating', 'VBG'), ('age', 'NN'), ('single', 'JJ'), ('people', 'NNS'), ('saying', 'VBG'), ('difficult', 'JJ'), ('f
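The first printed response above shows a side effect of deleting punctuation outright: `her....but` collapses into the single token `herbut`. A possible variation, sketched here under the assumption that replacing punctuation with spaces is acceptable for this data, keeps such words apart. The name `remove_punct_spaced` is hypothetical and not part of the original notebook.

```python
import string

# Map every punctuation character to a space instead of deleting it
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def remove_punct_spaced(text):
    """Lower-case each response, turn punctuation into spaces, collapse whitespace."""
    return [' '.join(item.lower().translate(PUNCT_TO_SPACE).split())
            for item in text]


print(remove_punct_spaced(["cheating on her....but I still downloaded it."]))
# ['cheating on her but i still downloaded it']
```

The trade-off is that contractions are split as well (`it's` becomes `it s`), so whether this is an improvement depends on how the tokens are used afterwards.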

In [131]:
# Lemmatizing our tokenized words

lemmatizer = WordNetLemmatizer()

def lemmatize_text(pos_text):
    new_text = []
    for response in pos_text:
        new_response = []
        for word_tag in response:
            new_response.append(lemmatizer.lemmatize(word_tag[0], word_tag[1]))
        new_text.append(new_response)
    return new_text

print(lemmatize_text(filtered_text))

KeyError: 'MD'
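
The `KeyError: 'MD'` most likely arises because `pos_tag` returns Penn Treebank tags such as `MD`, `NN` and `VBD`, whereas `WordNetLemmatizer.lemmatize` expects one of WordNet's own part-of-speech codes (`'n'`, `'v'`, `'a'`, `'r'`, `'s'`). One possible fix, sketched below, maps each Penn Treebank tag to a WordNet code by its prefix and falls back to noun for everything else. The helper name `penn_to_wordnet` and the noun fallback are assumptions for illustration; the lemmatizer is passed in as an argument so the sketch itself does not depend on NLTK data being installed.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper)."""
    if tag.startswith('J'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('V'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS
    # Assumption: treat every other tag (including 'MD') as a noun
    return 'n'


def lemmatize_text(pos_text, lemmatizer):
    """Lemmatize a list of responses, each a list of (word, Penn tag) pairs."""
    new_text = []
    for response in pos_text:
        new_response = [lemmatizer.lemmatize(word, penn_to_wordnet(tag))
                        for word, tag in response]
        new_text.append(new_response)
    return new_text
```

With the objects defined in the cells above, this would be called as `lemmatize_text(filtered_text, lemmatizer)`.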