
In [0]:
import nltk
import os
import pandas as pd

In [0]:
# Download some text from NLTK
# http://www.nltk.org/nltk_data/
# I downloaded the Project Gutenberg Selections to use Lewis Carroll's "Alice's Adventures in Wonderland"

In [3]:
# Fetch text
from google.colab import files
uploaded = files.upload()
print("len(uploaded.keys()):", len(uploaded.keys()))

# Load text into an object called "text"
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
  with open(fn, 'r', encoding='utf8', errors='ignore') as f:
    text = f.read()


Saving carroll-alice.txt to carroll-alice.txt
len(uploaded.keys()): 1
User uploaded file "carroll-alice.txt" with length 144395 bytes


In [4]:
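# Preprocess text
# Note: recent NLTK releases may look for 'punkt_tab' instead of 'punkt';
# if sent_tokenize raises a LookupError, download the resource it names
nltk.download('punkt')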

# Split text into sentences
sentences = nltk.sent_tokenize(text)
for i in range(1,5):
    print(sentences[i] + "\n")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!

Oh dear!



In [5]:
# Split into words
tokens = nltk.word_tokenize(text)
for i in range(1, 20):
    print(tokens[i])

Alice
's
Adventures
in
Wonderland
by
Lewis
Carroll
1865
]
CHAPTER
I
.
Down
the
Rabbit-Hole
Alice
was
beginning


In [6]:
# Make everything lower-case
tokens = [w.lower() for w in tokens]

for i in range(1, 20):
   print(tokens[i])

alice
's
adventures
in
wonderland
by
lewis
carroll
1865
]
chapter
i
.
down
the
rabbit-hole
alice
was
beginning


In [7]:
# Remove all punctuation (we only want real words)
import string 

translation = str.maketrans('', '', string.punctuation)
real_words = [w.translate(translation) for w in tokens]

for i in range(1, 20):
    print(real_words[i])

alice
s
adventures
in
wonderland
by
lewis
carroll
1865

chapter
i

down
the
rabbithole
alice
was
beginning
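
Stripping punctuation this way also glues hyphenated words together (`rabbit-hole` becomes `rabbithole`) and leaves empty strings wherever a token was pure punctuation. The following is a minimal sketch of one possible alternative that uses only the standard library: split on hyphens first, then drop tokens that end up empty. The helper name `strip_punct` and the sample tokens are illustrative assumptions, not part of the notebook.

```python
import string

# Same kind of translation table as above: deletes every ASCII punctuation character
translation = str.maketrans('', '', string.punctuation)

def strip_punct(tokens, split_hyphens=True):
    """Remove punctuation from tokens and drop tokens that end up empty.

    Hypothetical helper for illustration. With split_hyphens=True,
    'rabbit-hole' yields 'rabbit' and 'hole' instead of 'rabbithole'.
    """
    cleaned = []
    for token in tokens:
        parts = token.split('-') if split_hyphens else [token]
        for part in parts:
            word = part.translate(translation)
            if word:  # skip tokens that were pure punctuation
                cleaned.append(word)
    return cleaned

sample = ['alice', "'s", 'rabbit-hole', ']', '.', 'down']
print(strip_punct(sample))                       # hyphenated words are split
print(strip_punct(sample, split_hyphens=False))  # hyphenated words are glued
```

Whether splitting or gluing is better depends on the vocabulary you want downstream; splitting keeps `rabbit` countable as its own word.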


In [8]:
# Remove stop-words and empty strings
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
words = [w for w in real_words if w not in stop_words and w != '']
for i in range(1, 20):
    print(words[i])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
adventures
wonderland
lewis
carroll
1865
chapter
rabbithole
alice
beginning
get
tired
sitting
sister
bank
nothing
twice
peeped
book
sister


In [9]:
# Stemming Example
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]
for i in range(1, 20):
    print(stemmed[i])

adventur
wonderland
lewi
carrol
1865
chapter
rabbithol
alic
begin
get
tire
sit
sister
bank
noth
twice
peep
book
sister


In [10]:
# Lemmatize Example
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')
  
lemmatizer = WordNetLemmatizer() 
lemma = [lemmatizer.lemmatize(word) for word in words]

for i in range(1,20):
    print(lemma[i])

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
adventure
wonderland
lewis
carroll
1865
chapter
rabbithole
alice
beginning
get
tired
sitting
sister
bank
nothing
twice
peeped
book
sister


In [11]:
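# Part of Speech (POS) tagging
# Note: the tagger uses neighbouring words as context, so tagging this lowercased,
# stop-word-free list gives unreliable tags (e.g. 'wonderland' tagged as a verb)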
nltk.download('averaged_perceptron_tagger')
pos = nltk.pos_tag(words)

for i in range(1,10):
    print(pos[i])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
('adventures', 'NNS')
('wonderland', 'VBP')
('lewis', 'RB')
('carroll', 'JJ')
('1865', 'CD')
('chapter', 'NN')
('rabbithole', 'JJ')
('alice', 'NN')
('beginning', 'VBG')


In [12]:
# Prepare sentences for bag of words (BOW)
print('original:')
print(sentences[1])
print("\n")

# lowercase
lower = [s.lower() for s in sentences]
print('lowercase:')
print(lower[1])
print("\n")

# remove punctuation
no_punct = [s.translate(translation) for s in lower]
print('remove punctuation:')
print(no_punct[1])


original:
Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'


lowercase:
down the rabbit-hole

alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought alice 'without pictures or
conversation?'


remove punctuation:
down the rabbithole

alice was beginning to get very tired of sitting by her sister on the
bank and of having nothing to do once or twice she had peeped into the
book her sister was reading but it had no pictures or conversations in
it and what is the use of a book thought alice without pictures or
conversation

In [13]:
# Bag of Words (BOW)
from sklearn.feature_extraction.text import CountVectorizer

CountVec = CountVectorizer() #see documentation for options!
bow = CountVec.fit_transform(no_punct)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X = pd.DataFrame(bow.toarray(), columns = CountVec.get_feature_names_out(), dtype='float32')

# overall structure
print(X.head())

# show us the rows (sentences) where 'rabbit' occurs
print(X[X['rabbit'] > 0]['rabbit'].head())

   1865  abide  able  about  above  ...  yourself  youth  youve  zealand  zigzag
0   1.0    0.0   0.0    0.0    0.0  ...       0.0    0.0    0.0      0.0     0.0
1   0.0    0.0   0.0    0.0    0.0  ...       0.0    0.0    0.0      0.0     0.0
2   0.0    0.0   0.0    0.0    0.0  ...       0.0    0.0    0.0      0.0     0.0
3   0.0    0.0   0.0    0.0    0.0  ...       0.0    0.0    0.0      0.0     0.0
4   0.0    0.0   0.0    0.0    0.0  ...       0.0    0.0    0.0      0.0     0.0

[5 rows x 2744 columns]
2     1.0
3     1.0
6     2.0
50    1.0
52    1.0
Name: rabbit, dtype: float32
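
To make the matrix above less of a black box, here is a rough standard-library sketch of what a bag-of-words count matrix contains: one row per sentence, one column per vocabulary word, each cell a raw count. It imitates `CountVectorizer`'s default behaviour of lowercasing and keeping tokens of two or more word characters; the function name `bag_of_words` and the two sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of strings.

    Illustrative stand-in for CountVectorizer's defaults: lowercase the
    text and keep tokens made of two or more word characters.
    """
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # Columns are the sorted set of all tokens seen in any document
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = [
    "the rabbit ran past the rabbit hole",
    "alice saw a white rabbit",
]
vocab, counts = bag_of_words(docs)
print(vocab)
for row in counts:
    print(row)
```

Note that the single-character word `a` is dropped by the token pattern, which is also why `i` never appears as a column in the matrix above.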


In [14]:
# TF-IDF: Term Frequency-Inverse Document Frequency
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVec = TfidfVectorizer() #see documentation for options!
tfidf = TfidfVec.fit_transform(no_punct)

X = pd.DataFrame(tfidf.toarray(), columns = TfidfVec.get_feature_names_out(), dtype='float32')

# overall structure
print(X.head())

# show us the rows (sentences) where 'rabbit' occurs
print(X[X['rabbit'] > 0]['rabbit'].head())

       1865  abide  able  about  above  ...  yourself  youth  youve  zealand  zigzag
0  0.404084    0.0   0.0    0.0    0.0  ...       0.0    0.0    0.0      0.0     0.0
1  0.000000    0.0   0.0    0.0    0.0  ...       0.0    0.0    0.0      0.0     0.0
2  0.000000    0.0   0.0    0.0    0.0  ...       0.0    0.0    0.0      0.0     0.0
3  0.000000    0.0   0.0    0.0    0.0  ...       0.0    0.0    0.0      0.0     0.0
4  0.000000    0.0   0.0    0.0    0.0  ...       0.0    0.0    0.0      0.0     0.0

[5 rows x 2744 columns]
2     0.125458
3     0.199265
6     0.161356
50    0.148478
52    0.151654
Name: rabbit, dtype: float32
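
The weights above come from multiplying each raw count by an inverse document frequency and then scaling every row to unit length. The sketch below reproduces that calculation by hand, assuming the smoothed formula scikit-learn documents for `TfidfVectorizer`'s defaults, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation. The function name `tfidf_rows` and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """TF-IDF with smoothed IDF and L2-normalised rows.

    Sketch under the assumption of TfidfVectorizer's default settings:
    tf is the raw count, idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    n_docs = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # guard empty docs
        rows.append([w / norm for w in weights])
    return vocabulary, rows

docs = [["white", "rabbit"], ["rabbit", "hole"], ["alice", "fell"]]
vocab, rows = tfidf_rows(docs)
# 'rabbit' occurs in two documents, 'white' in one, so 'white' gets more weight
print(rows[0][vocab.index("rabbit")], rows[0][vocab.index("white")])
```

This is why `rabbit` has a different weight in every sentence above: the count and the IDF are the same, but each row is divided by its own length.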


In [15]:
# Cosine similarity: Example
# transform user query so we can compare it to our matrix of TFIDF weighted word features
query = TfidfVec.transform(['Well hello there little rabbit! How curious!'])
print(query)

  (0, 2614)	0.39403466673912846
  (0, 2356)	0.3762611521737101
  (0, 1831)	0.43285798098243383
  (0, 1341)	0.33570156409033447
  (0, 1113)	0.3955553558017727
  (0, 507)	0.49662723992582514


In [16]:
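# How similar is the query to each sentence in the text?
# TfidfVectorizer L2-normalizes every row by default, so this dot product is the
# cosine similarity; argmax() gives the index of the best-matching sentence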
cosine_sim = query.dot(X.T)
print(cosine_sim)
print(cosine_sim.argmax())

[[0.         0.         0.09930625 ... 0.03181921 0.         0.14246259]]
12
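
The dot product in the cell above works as a cosine similarity only because both the query and the sentence rows are unit-length TF-IDF vectors. For vectors that are not normalised, the full formula divides by both lengths. A minimal standard-library sketch follows; `cosine_similarity`, `best_match`, and the toy vectors are names and data assumed for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query, rows):
    """Return (index, score) of the row most similar to the query."""
    scores = [cosine_similarity(query, row) for row in rows]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy "sentence" vectors and a query that shares a feature only with the second row
rows = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [0.0, 1.0, 0.0]
index, score = best_match(query, rows)
print(index, score)
```

Returning the score alongside the index lets a chatbot fall back to a default reply when even the best match is weak, which matters for very short sentences that match on a single word.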


In [17]:
# plug index of closest vector into our array of SENTENCES
bot_response = sentences[cosine_sim.argmax()]
print(bot_response)

'Well!'
