# Everything I learned (and did not memorize) from my 3-month Pro trial on Codecademy

Here I'm going to write all my notes and paste all the topics I'm probably not going to remember a while after my subscription has ended, so here we go.

# 1. Intro to NLP

## 1.1 Text Preprocessing

> "You never know what you have... until you clean your data."
*~ Unknown (or possibly made up)*

Cleaning and preparation are crucial for many tasks, and NLP is no exception. Text preprocessing is usually the first step you’ll take when faced with an NLP task.

Without preprocessing, your computer interprets "the", "The", and "<p>The" as entirely different words. There is a LOT you can do here, depending on the formatting you need. Lucky for you, Regex and NLTK will do most of it for you! Common tasks include:

**Noise removal** — stripping text of formatting (e.g., HTML tags).

**Tokenization** — breaking text into individual words.

**Normalization** — cleaning text data in any other way:

* Stemming is a blunt axe to chop off word prefixes and suffixes. “booing” and “booed” become “boo”, but “sing” may become “s” and “sung” would remain “sung.”

* Lemmatization is a scalpel to bring words down to their root forms. For example, NLTK’s savvy lemmatizer knows “am” and “are” are related to “be.”

* Other common tasks include lowercasing, stopword removal, spelling correction, etc.

In [None]:
# regex for removing punctuation!
import re
# nltk preprocessing magic
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# grabbing a part of speech function:
from part_of_speech import get_part_of_speech

text = "So many squids are jumping out of suitcases these days that you can barely go anywhere without seeing one burst forth from a tightly packed valise. I went to the dentist the other day, and sure enough I saw an angry one jump out of my dentist's bag within minutes of arriving. She hardly even noticed."

# raw string so that \W is passed to the regex engine untouched
cleaned = re.sub(r'\W+', ' ', text)
tokenized = word_tokenize(cleaned)

stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]

## -- CHANGE these -- ##
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]

print("Stemmed text:")
print(stemmed)
print("\nLemmatized text:")
print(lemmatized)
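
The cell above depends on NLTK data and on Codecademy's `part_of_speech` helper. As a dependency-free sketch of the same three steps (noise removal, normalization by lowercasing, tokenization), something like the following should show why "the", "The", and "<p>The" end up as the same token. The `preprocess` name and the regex patterns are my own assumptions, not part of the course material, and there is no stemming or lemmatization here.

```python
import re

def preprocess(text):
    # Noise removal: replace anything that looks like an HTML tag with a space
    no_tags = re.sub(r"<[^>]+>", " ", text)
    # Normalization: lowercase so "The" and "the" become the same token
    lowered = no_tags.lower()
    # Tokenization: keep runs of letters, digits, and apostrophes
    return re.findall(r"[a-z0-9']+", lowered)

tokens = preprocess("<p>The squid saw the other squid.</p> The end!")
print(tokens)
```

All three spellings of "the" are counted together once the tags and the casing are gone.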

## 1.2 Parsing Text

You now have a preprocessed, clean list of words. Now what? It may be helpful to know how the words relate to each other and the underlying syntax (grammar). ***Parsing*** is a stage of NLP concerned with segmenting text based on syntax.

You probably do not want to be doing any parsing by hand and NLTK has a few tricks up its sleeve to help you out:

***Part-of-speech tagging (POS tagging)*** identifies parts of speech (verbs, nouns, adjectives, etc.). NLTK can do it faster (and maybe more accurately) than your grammar teacher.

***Named entity recognition (NER)*** helps identify the proper nouns (e.g., “Natalia” or “Berlin”) in a text. This can be a clue as to the topic of the text and NLTK captures many for you.

***Dependency grammar trees*** help you understand the relationship between the words in a sentence. It can be a tedious task for a human, so the Python library spaCy is at your service, even if it isn’t always perfect.

In English we leave a lot of ambiguity, so syntax can be tough, even for a computer program. Take a look at the following sentence:

> I saw a cow under a tree with binoculars.  

Do I have the binoculars? Does the cow have binoculars? Does the tree have binoculars?

Regex parsing, using Python’s re library, allows for a bit more nuance. When coupled with POS tagging, you can identify specific phrase chunks. On its own, it can find you addresses, emails, and many other common patterns within large chunks of text.
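
To make the "find emails in a large chunk of text" idea concrete, here is a minimal sketch using only `re`. The pattern is deliberately simple and is my own assumption; real email addresses can be more varied than this allows, and the sample text and addresses are made up.

```python
import re

# Simplified email pattern: local part, "@", then a domain with at least one dot
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

text = "Write to squid.fan@example.com or valise_support@example.org today."
emails = email_pattern.findall(text)
print(emails)
```

Because the group is non-capturing (`(?:...)`), `.findall()` returns the full matched addresses rather than just the last group.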

In [None]:
import spacy
from nltk import Tree
from squids import squids_text

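# spaCy 3+ dropped the 'en' shortcut; this assumes the en_core_web_sm model is installed
dependency_parser = spacy.load('en_core_web_sm')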

parsed_squids = dependency_parser(squids_text)

# Assign my_sentence a new value:
my_sentence = "I saw a spider under a tree with a pretty skirt."
my_parsed_sentence = dependency_parser(my_sentence)

def to_nltk_tree(node):
  if node.n_lefts + node.n_rights > 0:
    parsed_child_nodes = [to_nltk_tree(child) for child in node.children]
    return Tree(node.orth_, parsed_child_nodes)
  else:
    return node.orth_

for sent in parsed_squids.sents:
  to_nltk_tree(sent.root).pretty_print()
  
for sent in my_parsed_sentence.sents:
  to_nltk_tree(sent.root).pretty_print()

## 1.3 Language Models - Bag-of-Words Approach

How can we help a machine make sense of a bunch of word tokens? We can help computers make predictions about language by training a ***language model*** on a corpus (a bunch of example text).

**Language models** are probabilistic computer models of language. We build and use these models to figure out the likelihood that a given sound, letter, word, or phrase will be used. Once a model has been trained, it can be tested out on new texts.

One of the most common language models is the **unigram** model, a statistical language model commonly known as **bag-of-words**. As its name suggests, bag-of-words does not have much order to its chaos! What it does have is a tally count of each instance for each word. Consider the following text example:

> The squids jumped out of the suitcases.

Provided some initial preprocessing, bag-of-words would result in a mapping like:

```
{"the": 2, "squid": 1, "jump": 1, "out": 1, "of": 1, "suitcase": 1}
```

Now look at this sentence and mapping: “Why are your suitcases full of jumping squids?”

```
{"why": 1, "be": 1, "your": 1, "suitcase": 1, "full": 1, "of": 1, "jump": 1, "squid": 1}
```

You can see how even with different word order and sentence structures, “jump,” “squid,” and “suitcase” are shared topics between the two examples. Bag-of-words can be an excellent way of looking at language when you want to make predictions concerning topic or sentiment of a text. When grammar and word order are irrelevant, this is probably a good model to use.

In [None]:
# importing regex and nltk
import re, nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# importing Counter to get word counts for bag of words
from collections import Counter
# importing a passage from Through the Looking Glass
from looking_glass import looking_glass_text
# importing part-of-speech function for lemmatization
from part_of_speech import get_part_of_speech

# Change text to another string:
text = "So, that's it, I'm here writing this text. This is a very, very nice text, you see. I should say it's the very best text that I've ever wrote in my life. THE very best text. That's it, so, well, bye. That's the end of my text."

# raw string so that \W is passed to the regex engine untouched
cleaned = re.sub(r'\W+', ' ', text).lower()
tokenized = word_tokenize(cleaned)

stop_words = stopwords.words('english')
filtered = [word for word in tokenized if word not in stop_words]

normalizer = WordNetLemmatizer()
normalized = [normalizer.lemmatize(token, get_part_of_speech(token)) for token in filtered]
# Comment out the print statement below
# print(normalized)

# Define bag_of_looking_glass_words & print:
bag_of_looking_glass_words = Counter(normalized)

print(bag_of_looking_glass_words)
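
To check the two squid mappings from the prose by hand, here is a small sketch that needs nothing beyond the standard library. The `LEMMAS` dictionary is a hypothetical stand-in for a real lemmatizer and only covers the words in these two sentences; `bag_of_words` is my own helper name.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real lemmatizer, covering only these examples
LEMMAS = {"squids": "squid", "jumped": "jump", "jumping": "jump",
          "suitcases": "suitcase", "are": "be"}

def bag_of_words(text):
    # Lowercase, tokenize on letters, then map each token to its lemma
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(LEMMAS.get(token, token) for token in tokens)

bag_one = bag_of_words("The squids jumped out of the suitcases.")
bag_two = bag_of_words("Why are your suitcases full of jumping squids?")

# Words that appear in both bags, regardless of order or sentence structure
shared = sorted(set(bag_one) & set(bag_two))
print(bag_one)
print(shared)
```

Note that "of" also shows up as shared, which is exactly the kind of word stopword removal would filter out before looking for topics.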

## 1.4 Language Models - N-Grams and NLM

For parsing entire phrases or conducting language prediction, you will want to use a model that pays attention to each word’s neighbors. Unlike bag-of-words, the ***n-gram*** model considers a sequence of some number (n) of units and calculates the probability of each unit in a body of language given the preceding n - 1 units. Because of this, n-gram probabilities with larger n values can be impressive at language prediction.

Take a look at our revised squid example: “The squids jumped out of the suitcases. The squids were furious.”

A bigram model (where n is 2) might give us the following count frequencies:

```
{('', 'the'): 2, ('the', 'squids'): 2, ('squids', 'jumped'): 1, ('jumped', 'out'): 1, ('out', 'of'): 1, ('of', 'the'): 1, ('the', 'suitcases'): 1, ('suitcases', ''): 1, ('squids', 'were'): 1, ('were', 'furious'): 1, ('furious', ''): 1}
```
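
The counts above can be reproduced with a few lines of standard-library Python. This is a sketch under one assumption of mine: each sentence is padded with an empty string on both sides to mark its start and end, which is how I read the `''` entries in the mapping.

```python
from collections import Counter

sentences = ["the squids jumped out of the suitcases", "the squids were furious"]

bigram_counts = Counter()
for sentence in sentences:
    # Pad with empty strings to mark the sentence start and end
    tokens = [""] + sentence.split() + [""]
    # zip pairs each token with its right-hand neighbor
    bigram_counts.update(zip(tokens, tokens[1:]))

print(bigram_counts)
```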

There are a couple of problems with the n-gram model:

How can your language model make sense of the sentence “The cat fell asleep in the mailbox” if it’s never seen the word “mailbox” before? Once training is over, your model will probably come across words in new text that it never encountered in the training corpus (this issue also pertains to bag-of-words). A tactic known as language smoothing can help adjust probabilities for unknown words, but it isn’t always ideal.

For a model that more accurately predicts human language patterns, you want n (your sequence length) to be as large as possible. That way, you will have more natural sounding language, right? Well, as the sequence length grows, the number of examples of each sequence within your training corpus shrinks. With too few examples, you won’t have enough data to make many predictions.
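
The notes only say "language smoothing" without naming a method, so as one possible illustration here is add-one (Laplace) smoothing on a bigram model. The `<unk>` token, the helper name `smoothed_probability`, and the choice of add-one smoothing are all my assumptions; other smoothing schemes exist and behave differently.

```python
from collections import Counter

sentences = [
    "the squids jumped out of the suitcases".split(),
    "the squids were furious".split(),
]

bigram_counts = Counter()
context_counts = Counter()
vocabulary = {"<unk>"}  # placeholder token for words never seen in training
for tokens in sentences:
    vocabulary.update(tokens)
    for previous, word in zip(tokens, tokens[1:]):
        bigram_counts[(previous, word)] += 1
        context_counts[previous] += 1

def smoothed_probability(previous, word):
    """Estimate P(word | previous) with add-one smoothing."""
    if word not in vocabulary:
        word = "<unk>"
    numerator = bigram_counts[(previous, word)] + 1
    denominator = context_counts[previous] + len(vocabulary)
    return numerator / denominator

print(smoothed_probability("the", "squids"))   # seen bigram
print(smoothed_probability("the", "mailbox"))  # unseen word, but not zero
```

Without the `+ 1`, the unseen word "mailbox" would get probability zero after "the"; with it, every word in the vocabulary keeps a small share of the probability mass.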

Enter ***neural language models (NLM)***! Much recent work within NLP has involved developing and training neural networks to approximate the approach our human brains take towards language. This deep learning approach allows computers a much more adaptive tack to processing human language.

In [None]:
import nltk, re
from nltk.tokenize import word_tokenize
# importing ngrams module from nltk
from nltk.util import ngrams
from collections import Counter
from looking_glass import looking_glass_full_text

# raw string so that \W is passed to the regex engine untouched
cleaned = re.sub(r'\W+', ' ', looking_glass_full_text).lower()
tokenized = word_tokenize(cleaned)

# Change the n value to 2:
looking_glass_bigrams = ngrams(tokenized, 2)
looking_glass_bigrams_frequency = Counter(looking_glass_bigrams)

# Change the n value to 3:
looking_glass_trigrams = ngrams(tokenized, 3)
looking_glass_trigrams_frequency = Counter(looking_glass_trigrams)

# Change the n value to a number greater than 3:
looking_glass_ngrams = ngrams(tokenized, 8)
looking_glass_ngrams_frequency = Counter(looking_glass_ngrams)

print("Looking Glass Bigrams:")
print(looking_glass_bigrams_frequency.most_common(10))

print("\nLooking Glass Trigrams:")
print(looking_glass_trigrams_frequency.most_common(10))

print("\nLooking Glass n-grams:")
print(looking_glass_ngrams_frequency.most_common(10))

## 1.5 Topic Models

We’ve touched on the idea of finding topics within a body of language. But what if the text is long and the topics aren’t obvious?

***Topic modeling*** is an area of NLP dedicated to uncovering latent, or hidden, topics within a body of language. For example, one Codecademy curriculum developer used topic modeling to discover patterns within Taylor Swift songs related to love and heartbreak over time.

A common technique is to deprioritize the most common words and prioritize less frequently used terms as topics in a process known as **term frequency-inverse document frequency (tf-idf)**. Say what?! This may sound counter-intuitive at first. Why would you want to give more priority to less-used words? Well, when you’re working with a lot of text, it makes a bit of sense if you don’t want your topics filled with words like “the” and “is.” The Python libraries gensim and sklearn have modules to handle tf-idf.
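
To see why less-used words end up with more weight, here is a hand-rolled sketch using the plain textbook formula (term frequency times the log of total documents over documents containing the term). This is an assumption on my part for illustration: sklearn's `TfidfVectorizer` uses a smoothed idf and normalizes the vectors, so its numbers will differ. The document names and the `tf_idf` helper are hypothetical.

```python
import math

documents = {
    "doc1": "the squids jumped out of the suitcases".split(),
    "doc2": "the squids were furious".split(),
    "doc3": "why are your suitcases full of jumping squids".split(),
}

def tf_idf(term, doc_name):
    tokens = documents[doc_name]
    # Term frequency: share of this document's tokens that are the term
    term_frequency = tokens.count(term) / len(tokens)
    # Document frequency: how many documents contain the term at all
    docs_with_term = sum(1 for doc in documents.values() if term in doc)
    if docs_with_term == 0:
        return 0.0
    inverse_doc_frequency = math.log(len(documents) / docs_with_term)
    return term_frequency * inverse_doc_frequency

for term in ["squids", "the", "furious"]:
    print(term, tf_idf(term, "doc2"))
```

"squids" appears in every document, so its score drops to zero, while "furious" (only in one document) scores highest in `doc2`.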

Whether you use your plain bag of words (which will give you term frequency) or run it through tf-idf, the next step in your topic modeling journey is often latent Dirichlet allocation (LDA). LDA is a statistical model that takes your documents and determines which words keep popping up together in the same contexts (i.e., documents). We’ll use sklearn to tackle this for us.

If you have any interest in visualizing your newly minted topics, word2vec is a great technique to have up your sleeve. word2vec can map out your topic model results spatially as vectors so that similarly used words are closer together. In the case of a language sample consisting of “The squids jumped out of the suitcases. The squids were furious. Why are your suitcases full of jumping squids?”, we might see that “suitcase”, “jump”, and “squid” were words used within similar contexts. This word-to-vector mapping is known as a word embedding.

In [None]:
import nltk, re
from sherlock_holmes import bohemia_ch1, bohemia_ch2, bohemia_ch3, boscombe_ch1, boscombe_ch2, boscombe_ch3
from preprocessing import preprocess_text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# preparing the text
corpus = [bohemia_ch1, bohemia_ch2, bohemia_ch3, boscombe_ch1, boscombe_ch2, boscombe_ch3]
preprocessed_corpus = [preprocess_text(chapter) for chapter in corpus]

# Update stop_list:
stop_list = ['say', 'know', 'see', 'man', 'well', 'upon', 'one', 'could', 'come', 'may', 'would']
# filtering topics for stop words
def filter_out_stop_words(corpus):
  no_stops_corpus = []
  for chapter in corpus:
    no_stops_chapter = " ".join([word for word in chapter.split(" ") if word not in stop_list])
    no_stops_corpus.append(no_stops_chapter)
  return no_stops_corpus
filtered_for_stops = filter_out_stop_words(preprocessed_corpus)

# creating the bag of words model
bag_of_words_creator = CountVectorizer()
bag_of_words = bag_of_words_creator.fit_transform(filtered_for_stops)

# creating the tf-idf model
tfidf_creator = TfidfVectorizer(min_df = 0.2)
tfidf = tfidf_creator.fit_transform(preprocessed_corpus)

# creating the bag of words LDA model
lda_bag_of_words_creator = LatentDirichletAllocation(learning_method='online', n_components=10)
lda_bag_of_words = lda_bag_of_words_creator.fit_transform(bag_of_words)

# creating the tf-idf LDA model
lda_tfidf_creator = LatentDirichletAllocation(learning_method='online', n_components=10)
lda_tfidf = lda_tfidf_creator.fit_transform(tfidf)

print("~~~ Topics found by bag of words LDA ~~~")
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
bag_of_words_features = bag_of_words_creator.get_feature_names_out()
for topic_id, topic in enumerate(lda_bag_of_words_creator.components_):
  message = "Topic #{}: ".format(topic_id + 1)
  # argsort()[:-5:-1] picks the four highest-weighted words in the topic
  message += " ".join([bag_of_words_features[i] for i in topic.argsort()[:-5:-1]])
  print(message)

print("\n\n~~~ Topics found by tf-idf LDA ~~~")
tfidf_features = tfidf_creator.get_feature_names_out()
for topic_id, topic in enumerate(lda_tfidf_creator.components_):
  message = "Topic #{}: ".format(topic_id + 1)
  message += " ".join([tfidf_features[i] for i in topic.argsort()[:-5:-1]])
  print(message)

## 1.6 Text Similarity

Most of us have a good autocorrect story. Our phone’s messenger quietly swaps one letter for another as we type and suddenly the meaning of our message has changed (to our horror or pleasure). Addressing text similarity, including spelling correction, is a major challenge within natural language processing.

Addressing word similarity and misspelling for spellcheck or autocorrect often involves considering the Levenshtein distance or minimal edit distance between two words. The distance is calculated through the minimum number of insertions, deletions, and substitutions that would need to occur for one word to become another. For example, turning “bees” into “beans” would require one substitution (“a” for “e”) and one insertion (“n”), so the Levenshtein distance would be two.

Phonetic similarity is also a major challenge within speech recognition. English-speaking humans can easily tell from context whether someone said “euthanasia” or “youth in Asia,” but it’s a far more challenging task for a machine! More advanced autocorrect and spelling correction technology additionally considers key distance on a keyboard and phonetic similarity (how much two words or phrases sound the same).

It’s also helpful to find out if texts are the same to guard against plagiarism, which we can identify through lexical similarity (the degree to which texts use the same vocabulary and phrases). Meanwhile, semantic similarity (the degree to which documents contain similar meaning or topics) is useful when you want to find (or recommend) an article or book similar to one you recently finished.
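
The notes don't say how lexical similarity is measured, so as one simple possibility here is Jaccard similarity over word sets (shared vocabulary divided by combined vocabulary). Picking Jaccard is my assumption, as is the helper name; it ignores word order and repeated words entirely.

```python
import re

def jaccard_similarity(text_a, text_b):
    # Compare the sets of distinct lowercase words in each text
    words_a = set(re.findall(r"[a-z']+", text_a.lower()))
    words_b = set(re.findall(r"[a-z']+", text_b.lower()))
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

score = jaccard_similarity(
    "The squids jumped out of the suitcases.",
    "Why are your suitcases full of jumping squids?",
)
print(score)
```

A score of 1.0 means identical vocabulary and 0.0 means no words in common. Without lemmatization, "jumped" and "jumping" count as different words here.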

In [None]:
import nltk
# NLTK has a built-in function
# to check Levenshtein distance:
from nltk.metrics import edit_distance

def print_levenshtein(string1, string2):
  print("The Levenshtein distance from '{0}' to '{1}' is {2}!".format(string1, string2, edit_distance(string1, string2)))

# Check the distance between
# any two words here!
print_levenshtein("fart", "target")

# Assign passing strings here:
three_away_from_code = "coding"

two_away_from_chunk = "chuml"

print_levenshtein("code", three_away_from_code)
print_levenshtein("chunk", two_away_from_chunk)
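
For my own understanding of what `edit_distance` is computing, here is a from-scratch sketch of the classic dynamic-programming approach, with insertions, deletions, and substitutions each costing 1. The function name `levenshtein` is mine; this is not NLTK's implementation.

```python
def levenshtein(source, target):
    # previous_row[j] holds the distance between the source prefix handled
    # so far and the first j characters of target
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current_row = [i]  # turning i source characters into "" takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # substitution or match
            ))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein("bees", "beans"))
print(levenshtein("code", "coding"))
```

Only two rows of the table are kept at a time, so memory grows with the length of `target` rather than with the product of both lengths.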

## 1.7 Language Prediction & Text Generation

How does your favorite search engine complete your search queries? How does your phone’s keyboard know what you want to type next? Language prediction is an application of NLP concerned with predicting text given preceding text. Autosuggest, autocomplete, and suggested replies are common forms of language prediction.

Your first step to language prediction is picking a language model. Bag of words alone is generally not a great model for language prediction; no matter what the preceding word was, you will just get one of the most commonly used words from your training corpus.

If you go the n-gram route, you will most likely rely on Markov chains to predict the statistical likelihood of each following word (or character) based on the training corpus. Markov chains are memory-less and make statistical predictions based entirely on the current n-gram on hand.

For example, let’s take a sentence beginning, “I ate so many grilled cheese”. Using a trigram model (where n is 3), a Markov chain would predict the following word as “sandwiches” based on the number of times the sequence “grilled cheese sandwiches” has appeared in the training data out of all the times “grilled cheese” has appeared in the training data.
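
The ratio described above can be sketched with plain counters. The toy corpus below is made up by me for illustration, and `predict_next` is a hypothetical helper name; a real model would be trained on far more text.

```python
from collections import Counter, defaultdict

corpus = ("i ate grilled cheese sandwiches "
          "she made grilled cheese sandwiches "
          "he likes grilled cheese toast").split()

# For each pair of consecutive words, count which word follows it
followers = defaultdict(Counter)
for first, second, third in zip(corpus, corpus[1:], corpus[2:]):
    followers[(first, second)][third] += 1

def predict_next(first, second):
    options = followers.get((first, second))
    if not options:
        return None, 0.0
    word, count = options.most_common(1)[0]
    # Probability = times this trigram appeared / times the two-word context appeared
    return word, count / sum(options.values())

print(predict_next("grilled", "cheese"))
```

Here "grilled cheese" is followed by "sandwiches" twice and "toast" once, so "sandwiches" wins with probability 2/3.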

A more advanced approach, using a neural language model, is the Long Short-Term Memory (LSTM) model. An LSTM uses deep learning with a network of artificial “cells” that manage memory, making it better suited for text prediction than traditional neural networks.

In [None]:
import nltk, re, random
from nltk.tokenize import word_tokenize
from collections import defaultdict, deque
from document1 import training_doc1
from document2 import training_doc2
from document3 import training_doc3

class MarkovChain:
  def __init__(self):
    self.lookup_dict = defaultdict(list)
    self._seeded = False
    self.__seed_me()

  def __seed_me(self, rand_seed=None):
    if self._seeded is not True:
      try:
        if rand_seed is not None:
          random.seed(rand_seed)
        else:
          random.seed()
        self._seeded = True
      except NotImplementedError:
        self._seeded = False
    
  def add_document(self, text):
    # "text" instead of "str" so the built-in str type is not shadowed
    preprocessed_list = self._preprocess(text)
    pairs = self.__generate_tuple_keys(preprocessed_list)
    for pair in pairs:
      self.lookup_dict[pair[0]].append(pair[1])
  
  def _preprocess(self, text):
    cleaned = re.sub(r'\W+', ' ', text).lower()
    tokenized = word_tokenize(cleaned)
    return tokenized

  def __generate_tuple_keys(self, data):
    if len(data) < 1:
      return

    for i in range(len(data) - 1):
      yield [ data[i], data[i + 1] ]
      
  def generate_text(self, max_length=50):
    context = deque()
    output = []
    if len(self.lookup_dict) > 0:
      self.__seed_me(rand_seed=len(self.lookup_dict))
      chain_head = [list(self.lookup_dict)[0]]
      context.extend(chain_head)
      
      while len(output) < (max_length - 1):
        next_choices = self.lookup_dict[context[-1]]
        if len(next_choices) > 0:
          next_word = random.choice(next_choices)
          context.append(next_word)
          output.append(context.popleft())
        else:
          break
      output.extend(list(context))
    return " ".join(output)

my_markov = MarkovChain()
my_markov.add_document(training_doc1)
my_markov.add_document(training_doc2)
my_markov.add_document(training_doc3)
generated_text = my_markov.generate_text()
print(generated_text)

## 1.8 Advanced NLP Topics

Believe it or not, you’ve just scratched the surface of natural language processing. There is a slew of advanced topics and applications of NLP, many of which rely on deep learning and neural networks.

* ***Naive Bayes classifiers*** are supervised machine learning algorithms that leverage a probabilistic theorem to make predictions and classifications. They are widely used for sentiment analysis (determining whether a given block of language expresses negative or positive feelings) and spam filtering.

* We’ve made enormous gains in machine translation, but even the most advanced translation software using neural networks and LSTM still has far to go in accurately translating between languages.

* Some of the most life-altering applications of NLP are focused on improving language accessibility for people with disabilities. Text-to-speech functionality and speech recognition have improved rapidly thanks to neural language models, making digital spaces far more accessible places.

* NLP can also be used to detect bias in writing and speech. Feel like a political candidate, book, or news source is biased but can’t put your finger on exactly how? Natural language processing can help you identify the language at issue.

In [None]:
from reviews import counter, training_counts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Add your review:
review = "This course have been so nice so far. I think all the codecademy courses are utterly amazing. The thing I dislike the least (in other words, the thing I do not dislike most) is that every lesson is practical, and this is just amazing."
review_counts = counter.transform([review])

classifier = MultinomialNB()
training_labels = [0] * 1000 + [1] * 1000

classifier.fit(training_counts, training_labels)
  
neg = (classifier.predict_proba(review_counts)[0][0] * 100).round()
pos = (classifier.predict_proba(review_counts)[0][1] * 100).round()

if pos > 50:
  print("Thank you for your positive review!")
elif neg > 50:
  print("We're sorry this hasn't been the best possible lesson for you! We're always looking to improve.")
else:
  print("Naive Bayes cannot determine if this is negative or positive. Thank you or we're sorry?")

  
print("\nAccording to our trained Naive Bayes classifier, the probability that your review was negative was {0}% and the probability it was positive was {1}%.".format(neg, pos))
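
The cell above relies on Codecademy's `reviews` module, which I won't have later. To remember what a multinomial Naive Bayes classifier is doing under the hood, here is a tiny from-scratch sketch with add-one smoothing. The four training sentences, the labels, and the function names are all made up by me; this is not how sklearn's `MultinomialNB` is implemented internally, only the same idea in miniature.

```python
import math
from collections import Counter

training = [
    ("i love this amazing course", "pos"),
    ("what a great practical lesson", "pos"),
    ("i hate this boring course", "neg"),
    ("what a terrible confusing lesson", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for text, label in training:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = set(word_counts["pos"]) | set(word_counts["neg"])

def log_score(text, label):
    # Start from the log prior: how common this label is in the training data
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    total_words = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocabulary:
            continue  # ignore words never seen during training
        # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score
        likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
        score += math.log(likelihood)
    return score

def classify(text):
    return max(word_counts, key=lambda label: log_score(text, label))

print(classify("this lesson is amazing"))
print(classify("boring and terrible"))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer reviews.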

# 2. Regex Review!

* ***Regular expressions*** are special sequences of characters that describe a pattern of text that is to be matched

* We can use ***literals*** to match the exact characters that we desire

* ***Alternation***, using the pipe symbol |, allows us to match the text preceding or following the |

* ***Character sets***, denoted by a pair of brackets [], let us match one character from a series of characters

* ***Wildcards***, represented by the period or dot ., will match any single character (letter, number, symbol or whitespace)

* ***Ranges*** allow us to specify a range of characters in which we can make a match

* ***Shorthand character classes*** like \w, \d and \s stand for word characters, digit characters, and whitespace characters, respectively

* ***Groupings***, denoted with parentheses (), group parts of a regular expression together, and allow us to limit alternation to part of a regex

* ***Fixed quantifiers***, represented with curly braces {}, let us indicate the exact quantity or a range of quantity of a character we wish to match

* ***Optional quantifiers***, indicated by the question mark ?, allow us to indicate a character in a regex is optional, or can appear either 0 times or 1 time

* The ***Kleene star***, denoted with the asterisk *, is a quantifier that matches the preceding character 0 or more times

* The ***Kleene plus***, denoted by the plus +, matches the preceding character 1 or more times

* The ***anchor*** symbols hat ^ and dollar sign $ are used to match text at the start and end of a string, respectively
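
A quick cheat sheet of the concepts in the list above, one pattern per idea, using Python's `re` module. The example strings and variable names are my own choices for illustration.

```python
import re

# Grouping + alternation: the pipe only applies inside the parentheses
animals = re.compile(r"I love (baboons|gorillas)")
# Character set: one character out of s or c at each bracket
consensus = re.compile(r"con[sc]en[sc]us")
# Range + fixed quantifier: exactly three lowercase letters
three_letters = re.compile(r"[a-z]{3}")
# Shorthand classes + fixed quantifiers: three digits, a dash, four digits
phone = re.compile(r"\d{3}-\d{4}")
# Optional quantifier: the "u" may appear 0 or 1 times
humour = re.compile(r"humou?r")
# Kleene star (0 or more "o") versus Kleene plus (1 or more "o")
star = re.compile(r"meo*w")
plus = re.compile(r"meo+w")
# Wildcard + anchors: the whole string must be exactly 7 characters
seven_chars = re.compile(r"^.{7}$")

print(bool(animals.fullmatch("I love gorillas")))
print(phone.search("call 555-1234 now").group(0))
print(bool(star.fullmatch("mew")), bool(plus.fullmatch("mew")))
print(bool(seven_chars.search("Dorothy")), bool(seven_chars.search("Henry")))
```

`fullmatch` requires the pattern to cover the entire string, which makes the difference between the star and plus quantifiers easy to see on "mew".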

# 3. Text Preprocessing

## 3.0 What we'll see:

* ***Text preprocessing*** is all about cleaning and prepping text data so that it’s ready for other NLP tasks.

* ***Noise removal*** is a text preprocessing step concerned with removing unnecessary formatting from our text.

* ***Tokenization*** is a text preprocessing step devoted to breaking up text into smaller units (usually words or discrete terms).

* ***Normalization*** is the name we give most other text preprocessing tasks, including stemming, lemmatization, upper and lowercasing, and stopword removal.

* ***Stemming*** is the normalization preprocessing task focused on removing word affixes.

* ***Lemmatization*** is the normalization preprocessing task that more carefully brings words down to their root forms.

## 3.1 Noise removal

We'll see how to remove text noise with Python's *re* library.

In [None]:
import re

headline_one = '<h1>Nation\'s Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini</h1>'

tweet = '@fat_meats, veggies are better than you think.'

headline_no_tag = re.sub(r'<.?h1>', '', headline_one)

tweet_no_at = re.sub(r'@', '', tweet)

## 3.2 Tokenization

We can use Python's NLTK to split a text into sentences or words, a step called ***tokenization***.

In [None]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

ecg_text = 'An electrocardiogram is used to record the electrical conduction through a person\'s heart. The readings can be used to diagnose cardiac arrhythmias.'

tokenized_by_word = word_tokenize(ecg_text)

tokenized_by_sentence = sent_tokenize(ecg_text)

## 3.3 Normalization

Tokenization and noise removal are staples of almost all text pre-processing pipelines. However, some data may require further processing through text normalization. Text normalization is a catch-all term for various text pre-processing tasks.

Steps to normalize a text:

* **Upper** or **lower** casing

* **Stopword** removal

* **POS Tagging** and **Lemmatization**

* **Stemming**

In [None]:
#Upper or lower casing

brands = 'Salvation Army, YMCA, Boys & Girls Club of America'

brands_lower = brands.lower()

brands_upper = brands.upper()

In [None]:
#Stopword removal

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

survey_text = 'A YouGov study found that American\'s like Italian food more than any other country\'s cuisine.'

stop_words = set(stopwords.words('english'))

tokenized_survey = word_tokenize(survey_text)

text_no_stops = [a for a in tokenized_survey if a not in stop_words]

In [None]:
#Stemming

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer

populated_island = 'Java is an Indonesian island in the Pacific Ocean. It is the most populated island in the world, with over 140 million people.'

stemmer = PorterStemmer()

island_tokenized = word_tokenize(populated_island)

stemmed = [stemmer.stem(word) for word in island_tokenized]

In [None]:
#POS Tagging and Lemmatization

import nltk
from nltk.corpus import wordnet
from collections import Counter

def get_part_of_speech(word):
  probable_part_of_speech = wordnet.synsets(word)
  
  pos_counts = Counter()

  pos_counts["n"] = len(  [ item for item in probable_part_of_speech if item.pos()=="n"]  )
  pos_counts["v"] = len(  [ item for item in probable_part_of_speech if item.pos()=="v"]  )
  pos_counts["a"] = len(  [ item for item in probable_part_of_speech if item.pos()=="a"]  )
  pos_counts["r"] = len(  [ item for item in probable_part_of_speech if item.pos()=="r"]  )
  
  most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
  return most_likely_part_of_speech

# Lemmatize each token, using the part of speech estimated by get_part_of_speech above

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

populated_island = 'Indonesia was founded in 1945. It contains the most populated island in the world, Java, with over 140 million people.'

lemmatizer = WordNetLemmatizer()

tokenized_string = word_tokenize(populated_island)

lemmatized_words = [lemmatizer.lemmatize(word, get_part_of_speech(word)) for word in tokenized_string]

In [None]:
# A more reliable way to do POS tagging:

import nltk
from nltk import pos_tag
from word_tokenized_oz import word_tokenized_oz

# save and print the sentence stored at index 100 in word_tokenized_oz here
witches_fate = word_tokenized_oz[100]
print(witches_fate)

# create a list to hold part-of-speech tagged sentences here
pos_tagged_oz = []

# loop through each word-tokenized sentence, tag it, and append it to pos_tagged_oz
for sentence in word_tokenized_oz:
  pos_tagged_oz.append(pos_tag(sentence))

# store and print the part-of-speech tagged sentence at index 100 (the 101st sentence)
witches_fate_pos = pos_tagged_oz[100]
print(witches_fate_pos)



# 4. Language Parsing

## 4.1 Compile, match, search and find

The first method you will explore is ```.compile()```. This method takes a regular expression pattern as an argument and compiles the pattern into a regular expression object, which you can later use to find matching text. 

Regular expression objects have a ```.match()``` method that takes a string of text as an argument and looks for a single match to the regular expression that starts at the beginning of the string.

If ```.match()``` finds a match that starts at the beginning of the string, it will return a match object. The match object lets you know what piece of text the regular expression matched, and at what index the match begins and ends. If there is no match, ```.match()``` will return None.

With the match object stored in result, you can access the matched text by calling ```result.group(0)```. If you use a regex containing capture groups, you can access these groups by calling ```.group()``` with the appropriately numbered capture group as an argument.

You can make your regular expression matches even more dynamic with the help of the ```.search()``` method. Unlike ```.match()```, which will only find matches at the start of a string, ```.search()``` will look left to right through an entire piece of text and return a match object for the first match to the regular expression given. If no match is found, ```.search()``` will return None.

Given a regular expression as its first argument and a string as its second argument, ```.findall()``` will return a list of all non-overlapping matches of the regular expression in the string. 
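
Here is a compact side-by-side of the four methods described above, including capture groups and match positions. The sample sentence and the pattern are my own; everything else is the standard `re` API.

```python
import re

pattern = re.compile(r"(\w+) (lion|witch)")
text = "the cowardly lion met the wicked witch"

# .match() only succeeds if the pattern matches at the very start of the string
at_start = pattern.match(text)

# .search() scans left to right and returns the first match anywhere
first = pattern.search(text)

# .findall() returns every non-overlapping match; with capture groups,
# each result is a tuple of the captured groups
everything = pattern.findall(text)

print(at_start)
print(first.group(0), first.group(1), first.group(2))
print(first.start(), first.end())
print(everything)
```

`.group(0)` is the whole match, while `.group(1)` and `.group(2)` are the first and second parenthesized groups.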

In [None]:
import re

# characters are defined
character_1 = "Dorothy"
character_2 = "Henry"

# compile your regular expression here

regular_expression = re.compile('.{7}')

# check for a match to character_1 here

result_1 = regular_expression.match(character_1)

# store and print the matched text here

match_1 = result_1.group(0)
print(match_1)

# use a regular expression that matches 7 word characters and check for a match to character_2 here

result_2 = re.match(r'\w{7}', character_2)
print(result_2)



import re

# import L. Frank Baum's The Wonderful Wizard of Oz
# (the with block closes the file once it has been read)
with open("the_wizard_of_oz_text.txt", encoding='utf-8') as oz_file:
  oz_text = oz_file.read().lower()

# search oz_text for an occurrence of 'wizard' here
found_wizard = re.search("wizard", oz_text)
print(found_wizard)

# find all the occurrences of 'lion' in oz_text here
all_lions = re.findall("lion", oz_text)

# store and print the length of all_lions here
number_lions = len(all_lions)
print(number_lions)



## 4.2 Chunking

Given your part-of-speech tagged text, you can now use regular expressions to find patterns in sentence structure that give insight into the meaning of a text. This technique of grouping words by their part-of-speech tag is called chunking.

With chunking in nltk, you can define a pattern of parts-of-speech tags using a modified notation of regular expressions. You can then find non-overlapping matches, or chunks of words, in the part-of-speech tagged sentences of a text.

One such type of chunking is NP-chunking, or noun phrase chunking. A noun phrase is a phrase that contains a noun and operates, as a unit, as a noun.

A popular form of noun phrase begins with a determiner DT, which specifies the noun being referenced, followed by any number of adjectives JJ, which describe the noun, and ends with a noun NN.
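
To see what that DT/JJ/NN pattern actually matches, here is a simplified sketch that uses only the standard library instead of nltk. The sentence and its tags are hand-written by me (not produced by a tagger), and writing the tag sequence out as a string is just my way of illustrating the idea, not nltk's real implementation.

```python
import re

# Hand-tagged sentence (illustrative tags, not tagger output)
tagged = [("the", "DT"), ("cowardly", "JJ"), ("lion", "NN"),
          ("followed", "VBD"), ("a", "DT"), ("yellow", "JJ"),
          ("brick", "NN"), ("road", "NN")]

# Encode the tag sequence as a string like "<DT><JJ><NN>..."
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Optional determiner, any number of adjectives, then a noun
np_pattern = re.compile(r"(<DT>)?(<JJ>)*<NN>")

np_chunks = []
for m in np_pattern.finditer(tag_string):
    # Convert character offsets back to token positions by counting "<"
    start = tag_string[:m.start()].count("<")
    end = start + m.group(0).count("<")
    np_chunks.append([word for word, _ in tagged[start:end]])

print(np_chunks)
```

Note that the pattern ends at the first noun, so "brick" closes one chunk and "road" becomes a chunk of its own.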

Another popular type of chunking is VP-chunking, or verb phrase chunking. A verb phrase is a phrase that contains a verb and its complements, objects, or modifiers.

Verb phrases can take a variety of structures, and here you will consider two. The first structure begins with a verb VB of any tense, followed by a noun phrase, and ends with an optional adverb RB of any form. The second structure switches the order of the verb and the noun phrase, but also ends with an optional adverb.

Another option you have to find chunks in your text is chunk filtering. Chunk filtering lets you define what parts of speech you do not want in a chunk and remove them.
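
Chunk filtering can be pictured as "start with the whole sentence as one chunk, then cut it wherever an unwanted tag shows up". Below is a rough standard-library sketch of that idea on a hand-tagged sentence of my own; the choice to filter out verbs and prepositions is an assumption for the example.

```python
import re

# Hand-tagged sentence (illustrative tags, not tagger output)
tagged = [("the", "DT"), ("dancers", "NNS"), ("twirled", "VBD"),
          ("in", "IN"), ("green", "JJ"), ("shoes", "NNS")]

# Tags to filter out: any verb form, or a preposition
filter_pattern = re.compile(r"VB.*|IN")

chunks, current = [], []
for word, tag in tagged:
    if filter_pattern.fullmatch(tag):
        # A filtered tag closes the chunk built so far
        if current:
            chunks.append(current)
            current = []
    else:
        current.append(word)
if current:
    chunks.append(current)

print(chunks)
```

What remains after filtering here are the two noun-phrase-like pieces on either side of the verb and preposition.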

In [None]:
#chunking basics

from nltk import RegexpParser, Tree
from pos_tagged_oz import pos_tagged_oz

# define adjective-noun chunk grammar here

chunk_grammar = "AN: {<JJ><NN>}"

# create RegexpParser object here

chunk_parser = RegexpParser(chunk_grammar)

# chunk the pos-tagged sentence at index 282 in pos_tagged_oz here

scaredy_cat = chunk_parser.parse(pos_tagged_oz[282])
print(scaredy_cat)

# pretty_print the chunked sentence here
Tree.fromstring(str(scaredy_cat)).pretty_print()

In [None]:
#NP Chunking

from nltk import RegexpParser
from pos_tagged_oz import pos_tagged_oz
from np_chunk_counter import np_chunk_counter

# define noun-phrase chunk grammar here

chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

# create RegexpParser object here

chunk_parser = RegexpParser(chunk_grammar)

# create a list to hold noun-phrase chunked sentences
np_chunked_oz = list()

# create a for loop through each pos-tagged sentence in pos_tagged_oz here
for sentence in pos_tagged_oz:
  # chunk each sentence and append to np_chunked_oz here
  np_chunked_oz.append(chunk_parser.parse(sentence))

# store and print the most common np-chunks here

most_common_np_chunks = np_chunk_counter(np_chunked_oz)

print(most_common_np_chunks)

In [None]:
#VP Chunking

from nltk import RegexpParser
from pos_tagged_oz import pos_tagged_oz
from vp_chunk_counter import vp_chunk_counter

# define verb phrase chunk grammar here

chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

# create RegexpParser object here

chunk_parser = RegexpParser(chunk_grammar)

# create a list to hold verb-phrase chunked sentences
vp_chunked_oz = list()

# create for loop through each pos-tagged sentence in pos_tagged_oz here
for p in pos_tagged_oz:
  # chunk each sentence and append to vp_chunked_oz here
  vp_chunked_oz.append(chunk_parser.parse(p))
  
# store and print the most common vp-chunks here
most_common_vp_chunks = vp_chunk_counter(vp_chunked_oz)
print(most_common_vp_chunks)

In [None]:
#Chunk filtering

from nltk import RegexpParser, Tree
from pos_tagged_oz import pos_tagged_oz

# define chunk grammar to chunk an entire sentence together
grammar = "Chunk: {<.*>+}"

# create RegexpParser object
parser = RegexpParser(grammar)

# chunk the pos-tagged sentence at index 230 in pos_tagged_oz
chunked_dancers = parser.parse(pos_tagged_oz[230])
print(chunked_dancers)

# define noun phrase chunk grammar using chunk filtering here

chunk_grammar = """NP: {<.*>+}
}<VB.*|IN>+{"""

# create RegexpParser object here

chunk_parser = RegexpParser(chunk_grammar)

# chunk and filter the pos-tagged sentence at index 230 in pos_tagged_oz here

filtered_dancers = chunk_parser.parse(pos_tagged_oz[230])
print(filtered_dancers)

# pretty_print the chunked and filtered sentence here
Tree.fromstring(str(filtered_dancers)).pretty_print()

# 5. Language models

## 5.1 Bag-of-Words

Bag-of-words (BoW) is a statistical language model based on word count. Bag-of-words does not give a flying fish about word starts or word order though; its sole concern is word count — how many times each word appears in a document.

One of the most common ways to implement the BoW model in Python is as a dictionary with each key set to a word and each value set to the number of times that word appears. For statistical models, we call the text that we use to build the model our training data.
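
As a quick reminder of the "dictionary of counts" idea, here is a minimal sketch using `collections.Counter` (a dict subclass). The preprocessing is deliberately crude (lowercase, letters only), and the training sentence is made up.

```python
from collections import Counter
import re

training_text = "Five fantastic fish flew off to find five faraway functions."

# Very rough preprocessing: lowercase and keep runs of letters only
tokens = re.findall(r"[a-z]+", training_text.lower())

# Each key is a word, each value is how many times it appears
bag_of_words = Counter(tokens)

print(bag_of_words["five"])
print(bag_of_words["fish"])
```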

### **BoW Vectors**

A feature vector is a numeric representation of an item’s important features. Each feature has its own column. If the feature exists for the item, you could represent that with a 1. If the feature does not exist for that item, you could represent that with a 0. Turning text into a BoW vector is known as feature extraction or vectorization.

But how do we know which vector index corresponds to which word? When building BoW vectors, we generally create a features dictionary of all vocabulary in our training data (usually several documents) mapped to indices.
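
Here is a self-contained sketch of the word-to-index mapping and the resulting count vector. It assumes the documents are already lowercased and whitespace-tokenized (no real preprocessing), and `to_bow_vector` is a name I made up for the example.

```python
documents = ["five fish find five functions", "find another fish"]

# Map every distinct word in the training documents to a vector index
features_dictionary = {}
for document in documents:
    for token in document.split():
        if token not in features_dictionary:
            features_dictionary[token] = len(features_dictionary)

def to_bow_vector(text):
    vector = [0] * len(features_dictionary)
    for token in text.split():
        # Words never seen in training have no index, so they are skipped
        if token in features_dictionary:
            vector[features_dictionary[token]] += 1
    return vector

print(features_dictionary)
bow = to_bow_vector("another fish another squid")
print(bow)
```

The unseen word "squid" simply does not show up in the vector, which is one known limitation of BoW.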

In [None]:
# Creating a features dictionary (you can do this with less code below)

from preprocessing import preprocess_text
# Define create_features_dictionary() below:

def create_features_dictionary(documents):
  features_dictionary = dict()
  merged = " ".join(documents)
  tokens = preprocess_text(merged)
  index = 0
  for token in tokens:
    if(token not in features_dictionary):
      features_dictionary[token] = index
      index += 1
  return features_dictionary, tokens

training_documents = ["Five fantastic fish flew off to find faraway functions.", "Maybe find another five fantastic fish?", "Find my fish with a function please!"]

print(create_features_dictionary(training_documents)[0])

In [None]:
# Creating a BoW vector (you can do this with less code below)

from preprocessing import preprocess_text
# Define text_to_bow_vector() below:

def text_to_bow_vector(some_text, features_dictionary):
  bow_vector = [0] * len(features_dictionary)
  tokens = preprocess_text(some_text)
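  for token in tokens:
    # skip tokens that are not in the features dictionary (unseen words)
    if token in features_dictionary:
      feature_index = features_dictionary[token]
      bow_vector[feature_index] += 1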
  return bow_vector, tokens

features_dictionary = {'function': 8, 'please': 14, 'find': 6, 'five': 0, 'with': 12, 'fantastic': 1, 'my': 11, 'another': 10, 'a': 13, 'maybe': 9, 'to': 5, 'off': 4, 'faraway': 7, 'fish': 2, 'fly': 3}

text = "Another five fish find another faraway fish."
print(text_to_bow_vector(text, features_dictionary)[0])

In [None]:
# Creating a Naive Bayes classifier (you can do this with less code below)

from spam_data import training_spam_docs, training_doc_tokens, training_labels, test_labels, test_spam_docs, training_docs, test_docs
from sklearn.naive_bayes import MultinomialNB

def create_features_dictionary(document_tokens):
  features_dictionary = {}
  index = 0
  for token in document_tokens:
    if token not in features_dictionary:
      features_dictionary[token] = index
      index += 1
  return features_dictionary

def tokens_to_bow_vector(document_tokens, features_dictionary):
  bow_vector = [0] * len(features_dictionary)
  for token in document_tokens:
    if token in features_dictionary:
      feature_index = features_dictionary[token]
      bow_vector[feature_index] += 1
  return bow_vector

# Define bow_sms_dictionary:

bow_sms_dictionary = create_features_dictionary(training_doc_tokens)

# Define training_vectors:

training_vectors = [tokens_to_bow_vector(training_doc, bow_sms_dictionary) for training_doc in training_spam_docs]

# Define test_vectors:

test_vectors = [tokens_to_bow_vector(test_doc, bow_sms_dictionary) for test_doc in test_spam_docs]

spam_classifier = MultinomialNB()

def spam_or_not(label):
  return "spam" if label else "not spam"

# Train the classifier on the training vectors and labels:
spam_classifier.fit(training_vectors, training_labels)

# .score() returns the mean accuracy on the test data
predictions = spam_classifier.score(test_vectors, test_labels)

# Note: the labels printed below are the true test labels, not model predictions
print("The predictions for the test data were {0}% accurate.\n\nFor example, '{1}' was classified as {2}.\n\nMeanwhile, '{3}' was classified as {4}.".format(predictions * 100, test_docs[0], spam_or_not(test_labels[0]), test_docs[10], spam_or_not(test_labels[10])))

In [None]:
# And here's how you do it with less code!

from spam_data import training_spam_docs, training_doc_tokens, training_labels, test_labels, test_spam_docs, training_docs, test_docs
from sklearn.naive_bayes import MultinomialNB

# Import CountVectorizer from sklearn:

from sklearn.feature_extraction.text import CountVectorizer

# Define bow_vectorizer:

bow_vectorizer = CountVectorizer()

# Define training_vectors:

training_vectors = bow_vectorizer.fit_transform(training_docs)

# Define test_vectors:

test_vectors = bow_vectorizer.transform(test_docs)

spam_classifier = MultinomialNB()

def spam_or_not(label):
  return "spam" if label else "not spam"

# Train the classifier on the training vectors and labels:
spam_classifier.fit(training_vectors, training_labels)

# .score() returns the mean accuracy on the test data
predictions = spam_classifier.score(test_vectors, test_labels)

# Note: the labels printed below are the true test labels, not model predictions
print("The predictions for the test data were {0}% accurate.\n\nFor example, '{1}' was classified as {2}.\n\nMeanwhile, '{3}' was classified as {4}.".format(predictions * 100, test_docs[7], spam_or_not(test_labels[7]), test_docs[35], spam_or_not(test_labels[35])))

## 5.2 TF-IDF

Term frequency-inverse document frequency is a numerical statistic used to indicate how important a word is to each document in a collection of documents, or a corpus.

When applying tf-idf to a corpus, each word is given a tf-idf score for each document, representing the relevance of that word to the particular document. A higher tf-idf score indicates a term is more important to the corresponding document.

The first component of tf-idf is term frequency, or how often a word appears in a document within the corpus.

The inverse document frequency component of the tf-idf score penalizes terms that appear more frequently across a corpus. The intuition is that words that appear more frequently in the corpus give less insight into the topic or meaning of an individual document, and should thus be deprioritized.

$$\text{tfidf}(t, d) = \text{tf}(t, d) \times \text{idf}(t, \text{corpus})$$
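
To make the formula concrete, here is a by-hand sketch on a toy corpus of three made-up, pre-tokenized documents. The idf variant used, ln((1 + N) / (1 + df)) + 1, is an assumption on my part: it is one common smoothed version (to my knowledge the one scikit-learn applies by default), and other definitions exist.

```python
import math

corpus = [
    ["success", "is", "counted", "sweetest"],
    ["victory", "is", "sweet"],
    ["the", "distant", "strains", "of", "triumph"],
]

def tf(term, document):
    # Raw count of the term in one document
    return document.count(term)

def idf(term, corpus):
    # Smoothed variant (assumed convention): ln((1 + N) / (1 + df)) + 1
    df = sum(1 for document in corpus if term in document)
    return math.log((1 + len(corpus)) / (1 + df)) + 1

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

score_is = tfidf("is", corpus[0], corpus)
score_success = tfidf("success", corpus[0], corpus)
print(score_is, score_success)
```

"is" appears in two of the three documents, so it gets a lower score than "success", which appears in only one.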

In [None]:
# Calculate Term Frequency

import codecademylib3_seaborn
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from preprocessing import preprocess_text

poem = '''
Success is counted sweetest
By those who ne'er succeed.
To comprehend a nectar
Requires sorest need.

Not one of all the purple host
Who took the flag to-day
Can tell the definition,
So clear, of victory,

As he, defeated, dying,
On whose forbidden ear
The distant strains of triumph
Break, agonized and clear!'''

# define clear_count:
clear_count = 2

# preprocess text
processed_poem = preprocess_text(poem)

# initialize and fit CountVectorizer
vectorizer = CountVectorizer()
term_frequencies = vectorizer.fit_transform([processed_poem])

# get vocabulary of terms
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())

feature_names = vectorizer.get_feature_names_out()

# create pandas DataFrame with term frequencies
df_term_frequencies = pd.DataFrame(term_frequencies.T.todense(), index=feature_names, columns=['Term Frequency'])
print(df_term_frequencies)

In [None]:
# Calculate the Inverse Document Frequency

import codecademylib3_seaborn
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from term_frequency import term_frequencies, feature_names, df_term_frequencies

# display term-document matrix of term frequencies

print(df_term_frequencies)

# initialize and fit TfidfTransformer

transformer = TfidfTransformer()
transformer.fit(term_frequencies)
idf_values = transformer.idf_

# create pandas DataFrame with inverse document frequencies
df_idf = pd.DataFrame(idf_values, index = feature_names, columns=['Inverse Document Frequency'])
print(df_idf)

In [None]:
# Calculating everything

import codecademylib3_seaborn
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from poems import poems
from preprocessing import preprocess_text

# preprocess documents
processed_poems = [preprocess_text(poem) for poem in poems]

# initialize and fit TfidfVectorizer

vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(processed_poems)

# get vocabulary of terms
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())

feature_names = vectorizer.get_feature_names_out()

# get corpus index
corpus_index = [f"Poem {i+1}" for i in range(len(poems))]

# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=corpus_index)
print(df_tf_idf)

In [None]:
# BoW to TF-IDF transformation

import codecademylib3_seaborn
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from term_frequency import bow_matrix, feature_names, df_bag_of_words, corpus_index

# display term-document matrix of term frequencies (bag-of-words)

print(df_bag_of_words)

# initialize and fit TfidfTransformer, transform bag-of-words matrix

transformer = TfidfTransformer(norm=None)
tfidf_scores = transformer.fit_transform(bow_matrix)

# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index = feature_names, columns=corpus_index)
print(df_tf_idf)

# 6. Word Embeddings

## 6.1 Word Embeddings

Word embeddings are vector representations of a word.

They allow us to take all the information that is stored in a word, like its meaning and its part of speech, and convert it into a numeric form that is more understandable to a computer.

The key at the heart of word embeddings is distance. We'll cover Manhattan, Euclidean, and cosine distance. Cosine distance is usually the preferred choice for embeddings because it depends only on the angle between two vectors, not on their lengths.
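
Here are the three distances written out by hand on two toy vectors, so the formulas are not hidden behind a library. The vectors are made up: `b` points in the same direction as `a` but is twice as long, which shows how cosine distance ignores vector length.

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length

def manhattan(u, v):
    # Sum of absolute differences along each dimension
    return sum(abs(x - y) for x, y in zip(u, v))

def euclidean(u, v):
    # Straight-line distance between the two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_distance(u, v):
    # 1 minus the cosine of the angle between the vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1 - dot / (norm_u * norm_v)

print(manhattan(a, b))
print(euclidean(a, b))
print(cosine_distance(a, b))
```

Manhattan and Euclidean both report a clear gap between `a` and `b`, while the cosine distance is (up to floating-point rounding) zero.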

Word2vec is a statistical learning algorithm that develops word embeddings from a corpus of text. With either the continuous bag-of-words or continuous skip-grams representations as training data, word2vec then uses a shallow, 2-layer neural network to come up with the values that place words with a similar context in vectors near each other and words with different contexts in vectors far apart from each other.

When we want to train our own word2vec model on a corpus of text, we can use the gensim package!

In [None]:
# Basic word embeddings with spacy

import spacy

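# load word embedding model
# (the 'en' shortcut was removed in spaCy 3; this assumes the en_core_web_md model is installed)

nlp = spacy.load('en_core_web_md')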

# define word embedding vectors

happy_vec = nlp('happy').vector
sad_vec = nlp('sad').vector
angry_vec = nlp('angry').vector

#print(happy_vec)

# find vector length here

vector_length = len(happy_vec)
print(vector_length)

In [None]:
# Finding different distances

import numpy as np
from scipy.spatial.distance import cityblock, euclidean, cosine
import spacy

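# load word embedding model
# (the 'en' shortcut was removed in spaCy 3; this assumes the en_core_web_md model is installed)
nlp = spacy.load('en_core_web_md')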

# define word embedding vectors
happy_vec = nlp('happy').vector
sad_vec = nlp('sad').vector
angry_vec = nlp('angry').vector

# calculate Manhattan distance

man_happy_sad = cityblock(happy_vec, sad_vec)
man_sad_angry = cityblock(sad_vec, angry_vec)

print(man_happy_sad)
print(man_sad_angry)

# calculate Euclidean distance

euc_happy_sad = euclidean(happy_vec, sad_vec)
euc_sad_angry = euclidean(sad_vec, angry_vec)

print(euc_happy_sad)
print(euc_sad_angry)

# calculate cosine distance

cos_happy_sad = cosine(happy_vec, sad_vec)
cos_sad_angry = cosine(sad_vec, angry_vec)

print(cos_happy_sad)
print(cos_sad_angry)


In [None]:
# Word2Vec

from sklearn.feature_extraction.text import CountVectorizer

sentence = "It was the best of times, it was the worst of times."
print(sentence)

# preprocessing
sentence_lst = [word.lower().strip(".") for word in sentence.split()]

# set context_length
context_length = 4

# function to get cbows
def get_cbows(sentence_lst, context_length):
  cbows = list()
  for i, val in enumerate(sentence_lst):
    if i < context_length:
      pass
    elif i < len(sentence_lst) - context_length:
      context = sentence_lst[i-context_length:i] + sentence_lst[i+1:i+context_length+1]
      vectorizer = CountVectorizer()
      vectorizer.fit_transform(context)
      # get_feature_names() was removed in scikit-learn 1.2
      context_no_order = vectorizer.get_feature_names_out().tolist()
      cbows.append((val,context_no_order))
  return cbows

# define cbows here:

cbows = get_cbows(sentence_lst, context_length)

# function to get skip-grams
def get_skip_grams(sentence_lst, context_length):
  skip_grams = list()
  for i, val in enumerate(sentence_lst):
    if i < context_length:
      pass
    elif i < len(sentence_lst) - context_length:
      context = sentence_lst[i-context_length:i] + sentence_lst[i+1:i+context_length+1]
      skip_grams.append((val, context))
  return skip_grams

# define skip_grams here:

skip_grams = get_skip_grams(sentence_lst, context_length)

print('\nContinuous Bag of Words')
for cbow in cbows:
  print(cbow)

print('\nSkip Grams')
for skip_gram in skip_grams:
  print(skip_gram)

In [None]:
# Gensim

import gensim
from nltk.corpus import stopwords
from romeo_juliet import romeo_and_juliet

# load stop words
stop_words = stopwords.words('english')

# preprocess text
romeo_and_juliet_processed = [[word for word in romeo_and_juliet.lower().split() if word not in stop_words]]

# view inner list of romeo_and_juliet_processed

#print(romeo_and_juliet_processed[0][:20])

# train word embeddings model
# (gensim 4 API assumed: size is now vector_size, and lookups go through model.wv)

model = gensim.models.Word2Vec(romeo_and_juliet_processed, vector_size = 100, window = 5, min_count = 1, workers = 2, sg = 1)

# view vocabulary

vocabulary = list(model.wv.key_to_index.items())
#print(vocabulary)

# similar to romeo

similar_to_romeo = model.wv.most_similar("romeo", topn = 20)
print(similar_to_romeo)

# one is not like the others

not_star_crossed_lover = model.wv.doesnt_match(["romeo", "juliet", "mercutio"])
print(not_star_crossed_lover)

# 8. Deep Learning

## 8.0. Preprocessing for seq2seq
Noise removal depends on your use case — do you care about casing or punctuation? For many tasks they are probably not important enough to justify the additional processing. This might be the time to make changes.

We’ll need the following for our Keras implementation:

- vocabulary sets for both our input (English) and target (Spanish) data
- the total number of unique word tokens we have for each set
- the maximum sentence length we’re using for each language

We also need to mark the start and end of each document (sentence) in the target samples so that the model recognizes where to begin and end its text generation (no book-long sentences for us!). One way to do this is adding "<START>" at the beginning and "<END>" at the end of each target document (in our case, this will be our Spanish sentences). For example, "Estoy feliz." becomes "<START> Estoy feliz. <END>".
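
A small sketch of the marking step, assuming we also separate punctuation from words first (so the period ends up as its own token). `mark_target` is a helper name I made up.

```python
import re

def mark_target(sentence):
    # Split words from punctuation, then re-join with single spaces
    tokens = re.findall(r"[\w']+|[^\s\w]", sentence)
    # Spaces around the markers keep them as separate tokens after .split()
    return "<START> " + " ".join(tokens) + " <END>"

marked = mark_target("Estoy feliz.")
print(marked)
print(marked.split())
```

Without the spaces next to the markers, `.split()` would glue "<START>" onto the first word, and the model would never see a clean start token.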



In [None]:
from tensorflow import keras
import re
# Importing our translations
data_path = "span-eng.txt"
# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:
  lines = f.read().split('\n')

# Building empty lists to hold sentences
input_docs = []
target_docs = []
# Building empty vocabulary sets
input_tokens = set()
target_tokens = set()

for line in lines:
  # Skip blank lines (e.g. a trailing newline at the end of the file)
  if not line.strip():
    continue
  # Input and target sentences are separated by tabs
  input_doc, target_doc = line.split('\t')
  # Appending each input sentence to input_docs
  input_docs.append(input_doc)
  # Splitting words from punctuation
  target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
  # Redefine target_doc with start and end markers
  # (note the spaces, so the markers stay separate tokens)
  # and append it to target_docs:
  target_doc = "<START> " + target_doc + " <END>"
  target_docs.append(target_doc)

  # Now we split up each sentence into words
  # and add each unique word to our vocabulary set
  # (a set ignores duplicates, so no membership check is needed)
  for token in re.findall(r"[\w']+|[^\s\w]", input_doc):
    input_tokens.add(token)

  for token in target_doc.split():
    target_tokens.add(token)

input_tokens = sorted(list(input_tokens))
target_tokens = sorted(list(target_tokens))

# Create num_encoder_tokens and num_decoder_tokens:
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)

# Longest sentence (in tokens) for each language;
# target docs are space-separated, so .split() matches how they are tokenized later
max_encoder_seq_length = max([len(re.findall(r"[\w']+|[^\s\w]", input_doc)) for input_doc in input_docs])
max_decoder_seq_length = max([len(target_doc.split()) for target_doc in target_docs])

In [None]:
#data:
'''
We'll see.	Después veremos.
We'll see.	Ya veremos.
We'll try.	Lo intentaremos.
We've won!	¡Hemos ganado!
Well done.	Bien hecho.
What's up?	¿Qué hay?
Who cares?	¿A quién le importa?
Who drove?	¿Quién condujo?
Who drove?	¿Quién conducía?
Who is he?	¿Quién es él?
Who is it?	¿Quién es?
'''

## 9.1 Training Setup
For each sentence, Keras expects a NumPy matrix containing one-hot vectors for each token. What’s a one-hot vector? In a one-hot vector, every token in our set is represented by a 0 except for the current token which is represented by a 1. For example given the vocabulary ["the", "dog", "licked", "me"], a one-hot vector for “dog” would look like [0, 1, 0, 0].
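
The "dog" example can be reproduced in a few lines of plain Python (lists instead of NumPy, just to show the idea). `one_hot` is a helper name of my own.

```python
vocabulary = ["the", "dog", "licked", "me"]

# Features dictionary: token -> index
features_dict = {token: i for i, token in enumerate(vocabulary)}

def one_hot(token):
    # All zeros except a single 1 at the token's index
    vector = [0] * len(vocabulary)
    vector[features_dict[token]] = 1
    return vector

dog_vector = one_hot("dog")
print(dog_vector)

# One row per token gives the matrix for a whole sentence
sentence_matrix = [one_hot(token) for token in ["the", "dog", "licked", "me"]]
print(sentence_matrix)
```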

In order to vectorize our data and later translate it from vectors, it’s helpful to have a features dictionary (and a reverse features dictionary) to easily translate between all the 1s and 0s and actual words. We’ll build out the following:

- a features dictionary for English
- a features dictionary for Spanish
- a reverse features dictionary for English (where the keys and values are swapped)
- a reverse features dictionary for Spanish

Once we have all of our features dictionaries set up, it’s time to vectorize the data! We’re going to need vectors to input into our encoder and decoder, as well as a vector of target data we can use to train the decoder.

Because each matrix is almost all zeros, we’ll use numpy.zeros() from the NumPy library to build them out.

At this point we need to fill out the 1s in each vector. We can loop over each English-Spanish pair in our training sample using the features dictionaries to add a 1 for the token in question. 

Keras will fit — or train — the seq2seq model using these matrices of one-hot vectors:

- the encoder input data
- the decoder input data
- the decoder target data

Hang on a second, why build two matrices of decoder data? Aren’t we just encoding and decoding?

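The reason has to do with a technique known as teacher forcing that most seq2seq models employ during training. Here’s the idea: at each timestep, the decoder is given the correct Spanish token from the previous timestep as input (rather than its own earlier prediction) and is trained to predict the current timestep’s target token. That is why the decoder target data is the decoder input data shifted ahead by one timestep.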
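
A quick way to picture the two decoder sequences is to line up each input token with the token the model should predict next. This sketch uses plain strings instead of one-hot vectors, and the sentence is one of the sample pairs with start and end markers added.

```python
target_doc = "<START> Ya veremos . <END>"
tokens = target_doc.split()

# Pair each decoder input token with the target token one timestep ahead
pairs = list(zip(tokens, tokens[1:]))

for step, (given, expected) in enumerate(pairs):
    print(step, given, "->", expected)
```

The target side never contains the start token, and the last target is the end token, which is how the model learns when to stop.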

In [None]:
from tensorflow import keras
import numpy as np
from preprocessing import input_docs, target_docs, input_tokens, target_tokens, num_encoder_tokens, num_decoder_tokens, max_encoder_seq_length, max_decoder_seq_length

print('Number of samples:', len(input_docs))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

input_features_dict = dict(
    [(token, i) for i, token in enumerate(input_tokens)])

# Build out target_features_dict:
target_features_dict = dict(
    [(token, i) for i, token in enumerate(target_tokens)])

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_features_dict = dict(
    (i, token) for token, i in input_features_dict.items())

# Build out reverse_target_features_dict:
reverse_target_features_dict = dict(
    (i, token) for token, i in target_features_dict.items())

encoder_input_data = np.zeros(
    (len(input_docs), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
print("\nHere's the first item in the encoder input matrix:\n", encoder_input_data[0], "\n\nThe number of columns should match the number of unique input tokens and the number of rows should match the maximum sequence length for input sentences.")

# Build out the decoder_input_data matrix:
decoder_input_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
print("\nHere's the first item in the decoder input matrix:\n", decoder_input_data[0], "\n\nThe number of columns should match the number of unique target tokens and the number of rows should match the maximum sequence length for target sentences.")

# Build out the decoder_target_data matrix:
decoder_target_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
print("\nHere's the first item in the decoder target matrix:\n", decoder_target_data[0], "\n\nThe number of columns should match the number of unique target tokens and the number of rows should match the maximum sequence length for target sentences.")

In [None]:
from tensorflow import keras
import numpy as np
import re
from preprocessing import input_docs, target_docs, input_tokens, target_tokens, num_encoder_tokens, num_decoder_tokens, max_encoder_seq_length, max_decoder_seq_length

input_features_dict = dict(
    [(token, i) for i, token in enumerate(input_tokens)])
target_features_dict = dict(
    [(token, i) for i, token in enumerate(target_tokens)])

reverse_input_features_dict = dict(
    (i, token) for token, i in input_features_dict.items())
reverse_target_features_dict = dict(
    (i, token) for token, i in target_features_dict.items())

encoder_input_data = np.zeros(
    (len(input_docs), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

for line, (input_doc, target_doc) in enumerate(zip(input_docs, target_docs)):

  for timestep, token in enumerate(re.findall(r"[\w']+|[^\s\w]", input_doc)):

    print("Encoder input timestep & token:", timestep, token)
    print(input_features_dict[token])
    # Assign 1. for the current line, timestep, & word
    # in encoder_input_data:
    encoder_input_data[line, timestep, input_features_dict[token]] = 1

  for timestep, token in enumerate(target_doc.split()):

    print("Decoder input timestep & token:", timestep, token)
    # Assign 1. for the current line, timestep, & word
    # in decoder_input_data:
    decoder_input_data[line, timestep, target_features_dict[token]] = 1

    if timestep > 0:
      # decoder_target_data is ahead by 1 timestep
      # and doesn't include the start token.
      print("Decoder target timestep:", timestep)
      # Assign 1. for the current line, previous timestep, & word
      # in decoder_target_data:
      decoder_target_data[line, timestep - 1, target_features_dict[token]] = 1


## 9.2 Encoder Training Setup

Deep learning models in Keras are built in layers, where each layer is a step in the model.

Our encoder requires two layer types from Keras:

- An input layer, which defines a matrix to hold all the one-hot vectors that we’ll feed to the model.
- An LSTM layer, with some output dimensionality.

Next, we set up the input layer, which requires some number of dimensions that we’re providing. In this case, we know that we’re passing in all the encoder tokens, but we don’t necessarily know our batch size (how many ~chocolate chip cookies~ sentences we’re feeding the model at a time). Fortunately, we can say None because the code is written to handle varying batch sizes, so we don’t need to specify that dimension.

For the LSTM layer, we need to select the dimensionality (the size of the LSTM’s hidden states, which helps determine how closely the model molds itself to the training data — something we can play around with) and whether to return the state.

Remember, the only thing we want from the encoder is its final states. encoder_outputs isn’t really important for us, so we can just discard it. However, the states, we’ll save in a list. There is a lot to take in here, but there’s no need to memorize any of this — you got this.💪

In [None]:
from prep import num_encoder_tokens

from tensorflow import keras
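# import from tensorflow.keras so we don't mix it with a standalone keras install
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model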

# Create the input layer:
encoder_inputs = Input(shape = (None, num_encoder_tokens))

# Create the LSTM layer:
encoder_lstm = LSTM(256, return_state = True)

# Retrieve the outputs and states:
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)

# Put the states together in a list:
encoder_states = [state_hidden, state_cell]

## 9.3 Decoder Training Setup

The decoder looks a lot like the encoder (phew!), with an input layer and an LSTM layer that we use together. However, with our decoder, we pass in the state data from the encoder, along with the decoder inputs. This time, we’ll keep the output instead of the states. 

We also need to run the output through a final activation layer, using the softmax function, that will give us the probability distribution (where all probabilities sum to one) over the target tokens at each timestep. The final layer also transforms our LSTM output from whatever dimensionality we gave it (256 in the code below) to the number of unique target tokens in our vocabulary.
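
Softmax itself is small enough to write by hand. The sketch below only shows the softmax step, not the Dense layer's weights; the tiny vocabulary and the raw scores are invented for illustration.

```python
import math

def softmax(scores):
    # Subtract the max first for numerical stability
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

target_tokens = ["<END>", "<START>", "veremos", "ya"]   # hypothetical vocabulary
scores = [0.5, -1.0, 2.0, 1.0]                          # made-up raw outputs

probabilities = softmax(scores)
best = target_tokens[probabilities.index(max(probabilities))]
print(probabilities)
print(best)
```

The token with the highest raw score also gets the highest probability, and all probabilities add up to one.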

Keras’s implementation could work with several layer types, but Dense is the least complex, so we’ll go with that.

In [None]:
from prep import num_encoder_tokens, num_decoder_tokens

from tensorflow import keras
# Add Dense to the imported layers
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# Encoder training setup
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(256, return_state=True)
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)
encoder_states = [state_hidden, state_cell]

# The decoder input and LSTM layers:
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(256, return_sequences=True, return_state=True)

# Retrieve the LSTM outputs and states:
decoder_outputs, decoder_state_hidden, decoder_state_cell = decoder_lstm(decoder_inputs, initial_state = encoder_states)

# Build a final Dense layer:
decoder_dense = Dense(num_decoder_tokens, activation = "softmax")

# Filter outputs through the Dense layer:
decoder_outputs = decoder_dense(decoder_outputs)


## 9.4 Build and Train seq2seq

First, we define the seq2seq model using the Model() function we imported from Keras. To make it a seq2seq model, we feed it the encoder and decoder inputs, as well as the decoder output.

Finally, our model is ready to train. First, we compile everything. Keras models demand two arguments to compile:

- An optimizer (we’re using RMSprop, which is a fancy version of the widely-used gradient descent) to help minimize our error rate (how bad the model is at guessing the true next word given the previous words in a sentence).

- A loss function (we’re using the logarithm-based cross-entropy function) to determine the error rate.

Because we care about accuracy, we’re adding that into the metrics to pay attention to while training.
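
For a single timestep, categorical cross-entropy boils down to the negative log of the probability the model gave to the correct token. Here is a hand-rolled sketch with made-up probabilities; the real Keras loss also averages over timesteps and samples and guards against log(0), which I leave out.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    # Only the position of the true token contributes, since the rest are 0
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, predicted_probs))

target = [0, 0, 1, 0]                     # the true next token is index 2
confident = [0.05, 0.05, 0.85, 0.05]      # made-up model outputs
unsure = [0.25, 0.25, 0.25, 0.25]

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)
print(loss_confident, loss_unsure)
```

The more probability the model puts on the right token, the smaller the loss, which is what the optimizer pushes toward.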

Next we need to fit the compiled model. To do this, we give the .fit() method the encoder and decoder input data (what we pass into the model), the decoder target data (what we expect the model to return given the data we passed in), and some numbers we can adjust as needed:

- batch size (smaller batch sizes mean more time, and for some problems, smaller batch sizes will be better, while for other problems, larger batch sizes are better)

- the number of epochs or cycles of training (more epochs mean a model that is more trained on the dataset, and that the process will take more time)

- validation split (what percentage of the data should be set aside for validating, and for determining when to stop training your model, rather than for training)

Keras will take it from here to get you a (hopefully) nicely trained seq2seq model.
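
As a rough back-of-the-envelope sketch, this is how those three numbers interact. The dataset size of 1000 sentence pairs is an assumption for illustration, and training_plan() is a hypothetical helper, not part of Keras. (Keras takes the validation samples from the end of the data, before shuffling.)

```python
import math

def training_plan(num_samples, batch_size, epochs, validation_split):
    """Estimate how much work .fit() will do for the given settings."""
    # Hold out the requested fraction for validation; train on the rest.
    num_val = round(num_samples * validation_split)
    num_train = num_samples - num_val
    # One weight update per batch; the last batch may be smaller.
    steps_per_epoch = math.ceil(num_train / batch_size)
    return {
        "train_samples": num_train,
        "validation_samples": num_val,
        "steps_per_epoch": steps_per_epoch,
        "total_updates": steps_per_epoch * epochs,
    }

# Assumed dataset size; the other values match the settings used in these notes.
plan = training_plan(num_samples=1000, batch_size=50, epochs=50, validation_split=0.2)
print(plan)
```

Halving the batch size doubles the number of weight updates per epoch, which is why smaller batches take longer.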

In [None]:
from prep import num_encoder_tokens, num_decoder_tokens, encoder_input_data, decoder_input_data, decoder_target_data

from tensorflow import keras
# Add Dense to the imported layers
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# Encoder training setup
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(256, return_state=True)
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)
encoder_states = [state_hidden, state_cell]

# Decoder training setup:
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, decoder_state_hidden, decoder_state_cell = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Building the training model:
training_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

print("Model summary:\n")
training_model.summary()
print("\n\n")

# Compile the model:
training_model.compile(optimizer = 'rmsprop',
             loss = 'categorical_crossentropy', 
             metrics = ['accuracy'])

# Choose the batch size
# and number of epochs:
batch_size = 50
epochs = 50

print("Training the model:\n")
# Train the model:
training_model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size = batch_size, epochs = epochs, validation_split = 0.2)


##9.5 Setup for Testing

Now our model is ready for testing! Yay! However, to generate some original output text, we need to redefine the seq2seq architecture in pieces. Wait, didn’t we just define and train a model?

Well, yes. But the model we used for training our network only works when we already know the target sequence. This time, we have no idea what the Spanish should be for the English we pass in! So we need a model that will decode step-by-step instead of using teacher forcing. To do this, we need a seq2seq network in individual pieces.
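
To see why the training model cannot be reused as-is, here is a small sketch of what teacher forcing looks like in terms of data. During training the decoder is fed the correct previous token at every timestep and asked to predict the next one, so the whole target sentence has to be known in advance. The Spanish tokens and the helper name are made up for illustration.

```python
def teacher_forcing_pairs(target_tokens):
    """Pair each decoder input token with the token it should predict."""
    decoder_input = target_tokens[:-1]   # everything except the last token
    decoder_target = target_tokens[1:]   # the same sequence shifted by one step
    return list(zip(decoder_input, decoder_target))

# Hypothetical target sentence with start and end markers.
spanish = ["<START>", "hola", "mundo", "<END>"]
pairs = teacher_forcing_pairs(spanish)

for fed_in, expected in pairs:
    print(f"decoder sees {fed_in!r:10} and should predict {expected!r}")
```

At test time only the "<START>" token is known, so each later input has to come from the decoder's own previous prediction.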

To start, we’ll build an encoder model with our encoder inputs and the placeholders for the encoder’s output states.

Next up, we need placeholders for the decoder’s input states, which we can build as input layers and store together. Why? We don’t know what we want to decode yet or what hidden state we’re going to end up with, so we need to do everything step-by-step. We need to pass the encoder’s final hidden state to the decoder, sample a token, and get the updated hidden state back. Then we’ll be able to (manually) pass the updated hidden state back into the network.

Using the decoder LSTM and decoder dense layer (with the activation function) that we trained earlier, we’ll create new decoder states and outputs.

Finally, we can set up the decoder model. This is where we bring together:

- the decoder inputs (the decoder input layer)

- the decoder input states (the final states from the encoder)

- the decoder outputs (the NumPy matrix we get from the final output layer of the decoder)

- the decoder output states (the memory throughout the network from one word to the next)


##9.6 The Test Function

Finally, we can get to testing our model! To do this, we need to build a function that:

- accepts a NumPy matrix representing the test English sentence input

- uses the encoder and decoder we’ve created to generate Spanish output

Inside the test function, we’ll run our new English sentence through the encoder model. The .predict() method takes in new input (as a NumPy matrix) and gives us output states that we can pass on to the decoder.

Next, we’ll build an empty NumPy array for our Spanish translation, giving it three dimensions.

Luckily, we already know the first value in our Spanish sentence: "<START>"! So we can give "<START>" a value of 1 at the first timestep.

Before we get decoding, we’ll need a string to which we can add our translation, word by word.

At long last, it’s translation time. Inside the test function, we’ll decode the sentence word by word using the output state that we retrieved from the encoder (which becomes our decoder’s initial hidden state). We’ll also update the decoder hidden state after each word so that we use previously decoded words to help decode new ones.

To tackle one word at a time, we need a while loop that will run until one of two things happens (we don’t want the model generating words forever):

- The current token is "<END>".

- The decoded Spanish sentence length hits the maximum target sentence length.

Inside the while loop, the decoder model can use the current target sequence (beginning with the "<START>" token) and the current state (initially passed to us from the encoder model) to get a bunch of possible next words and their corresponding probabilities.

Next, we can use NumPy’s .argmax() method to determine the token (word) with the highest probability and add it to the decoded sentence.

Our final step is to update a few values for the next word in the sequence.
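
Before looking at the Keras version, the loop described above can be sketched with a toy stand-in for the decoder. Everything here is hypothetical: toy_decoder_step() plays the role of decoder_model.predict() and simply favors the next word in a four-word vocabulary, and the state is just a counter. The control flow is the point: predict, pick the most probable token, check the stop conditions, then feed the token and the new state back in.

```python
VOCAB = ["<START>", "hola", "mundo", "<END>"]

def toy_decoder_step(token_index, state):
    """Hypothetical stand-in for decoder_model.predict()."""
    # Put most of the probability on the next vocabulary entry.
    next_index = min(token_index + 1, len(VOCAB) - 1)
    probs = [0.1 / (len(VOCAB) - 1)] * len(VOCAB)
    probs[next_index] = 0.9
    return probs, state + 1

def greedy_decode(initial_state, max_len):
    token_index = VOCAB.index("<START>")
    state = initial_state
    decoded = []
    while True:
        probs, state = toy_decoder_step(token_index, state)
        # argmax: index of the most probable token
        token_index = max(range(len(probs)), key=probs.__getitem__)
        decoded.append(VOCAB[token_index])
        # Stop on the end token or when the sentence is long enough.
        if VOCAB[token_index] == "<END>" or len(decoded) >= max_len:
            break
    return decoded, state

full_sentence, final_state = greedy_decode(initial_state=0, max_len=10)
cut_short, _ = greedy_decode(initial_state=0, max_len=2)
print(full_sentence, cut_short)
```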

And now we can test it all out!

In [None]:
from training import encoder_inputs, decoder_inputs, encoder_states, decoder_lstm, decoder_dense, encoder_input_data, num_decoder_tokens

from prep import target_features_dict, reverse_target_features_dict, max_decoder_seq_length, input_docs, target_docs, target_tokens

from tensorflow import keras
from keras.layers import Input, LSTM, Dense
from keras.models import Model, load_model
import numpy as np

training_model = load_model('training_model.h5')
encoder_inputs = training_model.input[0]
encoder_outputs, state_h_enc, state_c_enc = training_model.layers[2].output
encoder_states = [state_h_enc, state_c_enc]
encoder_model = Model(encoder_inputs, encoder_states)

latent_dim = 256
decoder_state_input_hidden = Input(shape=(latent_dim,))
decoder_state_input_cell = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_hidden, decoder_state_input_cell]
decoder_outputs, state_hidden, state_cell = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_hidden, state_cell]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

def decode_sequence(test_input):
  # Encode the input as state vectors:
  encoder_states_value = encoder_model.predict(test_input)
  # Set decoder states equal to encoder final states
  decoder_states_value = encoder_states_value

  # Generate empty target sequence of length 1:
  target_seq = np.zeros((1, 1, num_decoder_tokens))
  
  # Populate the first token of target sequence with the start token:
  target_seq[0, 0, target_features_dict['<START>']] = 1.
  
  decoded_sentence = ''

  return decoded_sentence

for seq_index in range(10):
  test_input = encoder_input_data[seq_index: seq_index + 1]
  decoded_sentence = decode_sequence(test_input)
  print('-')
  print('Input sentence:', input_docs[seq_index])
  print('Decoded sentence:', decoded_sentence)

In [None]:
from training import encoder_inputs, decoder_inputs, encoder_states, decoder_lstm, decoder_dense, encoder_input_data, num_decoder_tokens

from prep import target_features_dict, reverse_target_features_dict, max_decoder_seq_length, input_docs, target_docs, target_tokens

from tensorflow import keras
from keras.layers import Input, LSTM, Dense
from keras.models import Model, load_model
import numpy as np

training_model = load_model('training_model.h5')
encoder_inputs = training_model.input[0]
encoder_outputs, state_h_enc, state_c_enc = training_model.layers[2].output
encoder_states = [state_h_enc, state_c_enc]
encoder_model = Model(encoder_inputs, encoder_states)

latent_dim = 256
decoder_state_input_hidden = Input(shape=(latent_dim,))
decoder_state_input_cell = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_hidden, decoder_state_input_cell]
decoder_outputs, state_hidden, state_cell = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_hidden, state_cell]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

def decode_sequence(test_input):
  encoder_states_value = encoder_model.predict(test_input)
  decoder_states_value = encoder_states_value
  target_seq = np.zeros((1, 1, num_decoder_tokens))
  target_seq[0, 0, target_features_dict['<START>']] = 1.
  decoded_sentence = ''
  
  stop_condition = False
  while not stop_condition:
    # Run the decoder model to get possible 
    # output tokens (with probabilities) & states
    output_tokens, new_decoder_hidden_state, new_decoder_cell_state = decoder_model.predict(
      [target_seq] + decoder_states_value)

    # Choose token with highest probability
    sampled_token_index = np.argmax(output_tokens[0, -1, :])
    sampled_token = reverse_target_features_dict[sampled_token_index]
    decoded_sentence += " " + sampled_token

    # Exit condition: either hit max length
    # (counted in tokens, not characters)
    # or find stop token.
    if (sampled_token == '<END>' or len(decoded_sentence.split()) > max_decoder_seq_length):
      stop_condition = True

    # Update the target sequence (of length 1).
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, sampled_token_index] = 1.

    # Update states
    decoder_states_value = [new_decoder_hidden_state, new_decoder_cell_state]

  return decoded_sentence

for seq_index in range(11):
  test_input = encoder_input_data[seq_index: seq_index + 1]
  decoded_sentence = decode_sequence(test_input)
  print('-')
  print('Input sentence:', input_docs[seq_index])
  print('Decoded sentence:', decoded_sentence)

# 9. Rule-based chatbots

## 9.0 Rule-based chatbots

Basically, they need a closed domain and can understand only a closed set of phrases. Regular expressions help here: one pattern can match many phrasings of the same request, which widens the range of user input the bot understands.
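
As a minimal sketch of that idea, the snippet below maps different phrasings onto an intent with a regular expression and captures the account number when there is one. The intent names mirror the ones used in the bot that follows; the patterns and the match_intent() helper are illustrative assumptions, not the course's code.

```python
import re

# Hypothetical patterns: each intent is recognized by one compiled regex.
INTENTS = {
    "pay_bill": re.compile(
        r".*\bpay\b.*\bbill\b.*account number is (\d+)", re.IGNORECASE),
    "how_to_pay_bill": re.compile(
        r".*\bhow\b.*\bpay\b.*\bbills?\b", re.IGNORECASE),
}

def match_intent(utterance):
    """Return (intent_name, captured_groups) or (None, ()) if nothing matches."""
    for intent, pattern in INTENTS.items():
        found = pattern.match(utterance)
        if found:
            return intent, found.groups()
    return None, ()

print(match_intent("I want to pay my bill, my account number is 12345"))
print(match_intent("How do I pay my bills?"))
print(match_intent("What's the weather like?"))
```

Anything outside the patterns falls through to the "not understood" case, which is exactly the closed-domain limitation described above.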

In [None]:
import re
import random

class SupportBot:
  negative_responses = ("nothing", "don't", "stop", "sorry")

  exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later")

  def __init__(self):
    self.matching_phrases = {
      'how_to_pay_bill': [r'.*how.*pay bills.*', r'.*how.*pay my bill.*'],
      'pay_bill': [r'.*want.*pay.*my.*bill.*account.*number.*is (\d+)',
                   r'.*need.*pay my bill.*account.*number.*is (\d+)']
    }

  def welcome(self):
    name = input("Hi, I'm a customer support representative. Welcome to Codecademy Bank. Before we can help you, I need some information from you. What is your first name and last name? ")
    
    will_help = input(f"Ok {name}, what can I help you with? ")
    
    if will_help in self.negative_responses:
      print("Ok, have a great day!")
      return
    
    self.handle_conversation(will_help)
  
  def handle_conversation(self, reply):
    while not self.make_exit(reply):
      reply = self.match_reply(reply)
      
  def make_exit(self, reply):
    for exit_command in self.exit_commands:
      if exit_command in reply:
        print("Ok, have a great day!")
        return True
      
    return False
  
  def match_reply(self, reply):
    for key, values in self.matching_phrases.items():
      for regex_pattern in values:
        found_match = re.match(regex_pattern, reply)
        if found_match and key == 'how_to_pay_bill':
          return self.how_to_pay_bill_intent()
        elif found_match and key == 'pay_bill':
          return self.pay_bill_intent(found_match.groups()[0])
        
    return input("I did not understand you. Can you please ask your question again?")
  
  def how_to_pay_bill_intent(self):
    return input("You can pay your bill a couple of ways. 1) online at bill.codecademybank.com or 2) you can pay your bill right now with me. Can I help you with anything else?")
  
  def pay_bill_intent(self, account_number=None):
    ACCOUNTNUMBER = account_number
    return input(f"The account with number {ACCOUNTNUMBER} was paid off. What else can I help you with?")
  
# Create a SupportBot instance
SupportConversation = SupportBot()
# Call the .welcome() method
SupportConversation.welcome()

#10. Retrieval-based chatbots

##10.0 Retrieval-based chatbots

# 11. Generative chatbots

##11.0 Choosing the right dataset

One of the trickiest challenges in building a deep learning chatbot program is choosing a dataset to use for training.

In [None]:
#data

'''
L451317 +++$+++ u7151 +++$+++ m480 +++$+++ MOTHER +++$+++ You are sick, that's why he's here.
L451316 +++$+++ u7153 +++$+++ m480 +++$+++ THE KID +++$+++ Mom, can't you tell him that I'm sick?
L451315 +++$+++ u7151 +++$+++ m480 +++$+++ MOTHER +++$+++ Your grandfather's here.
L451314 +++$+++ u7153 +++$+++ m480 +++$+++ THE KID +++$+++ What?
L451313 +++$+++ u7151 +++$+++ m480 +++$+++ MOTHER +++$+++ Guess what.
L451312 +++$+++ u7153 +++$+++ m480 +++$+++ THE KID +++$+++ A little bit.
L451311 +++$+++ u7151 +++$+++ m480 +++$+++ MOTHER +++$+++ You feeling any better?
L451417 +++$+++ u7153 +++$+++ m480 +++$+++ THE KID +++$+++ What?
L451345 +++$+++ u7153 +++$+++ m480 +++$+++ THE KID +++$+++ -hold it, hold it-
L451326 +++$+++ u7146 +++$+++ m480 +++$+++ GRANDFATHER +++$+++ That's right.
L451325 +++$+++ u7153 +++$+++ m480 +++$+++ THE KID +++$+++ A book?
L451324 +++$+++ u7146 +++$+++ m480 +++$+++ GRANDFATHER +++$+++ Open it up.
L451323 +++$+++ u7153 +++$+++ m480 +++$+++ THE KID +++$+++ What is it?
L451322 +++$+++ u7146 +++$+++ m480 +++$+++ GRANDFATHER +++$+++ I brought you a special present.
L451464 +++$+++ u7148 +++$+++ m480 +++$+++ INIGO +++$+++ I do not suppose you could speed things up?
L451463 +++$+++ u7149 +++$+++ m480 +++$+++ MAN IN BLACK +++$+++ Thank you.
L451462 +++$+++ u7148 +++$+++ m480 +++$+++ INIGO +++$+++ Sorry.
L452098 +++$+++ u7146 +++$+++ m480 +++$+++ GRANDFATHER +++$+++ As you wish...
L452097 +++$+++ u7153 +++$+++ m480 +++$+++ THE KID +++$+++ Maybe you could come over and read it again to me tomorrow.
L452095 +++$+++ u7146 +++$+++ m480 +++$+++ GRANDFATHER +++$+++ Okay. Okay. Okay. All right. So long.
L452094 +++$+++ u7153 +++$+++ m480 +++$+++ THE KID +++$+++ Okay.
L452093 +++$+++ u7146 +++$+++ m480 +++$+++ GRANDFATHER +++$+++ Now I think you ought to go to sleep.
L452088 +++$+++ u7153 +++$+++ m480 +++$+++ THE KID +++$+++ What? What?
L451438 +++$+++ u7155 +++$+++ m480 +++$+++ VIZZINI +++$+++ Inconceivable!
L451437 +++$+++ u7148 +++$+++ m480 +++$+++ INIGO +++$+++ He's climbing the rope. And he's gaining on us.
L451793 +++$+++ u7145 +++$+++ m480 +++$+++ FEZZIK +++$+++ Yeah?
L451792 +++$+++ u7148 +++$+++ m480 +++$+++ INIGO +++$+++ Perhaps not. I feel fine.
L451791 +++$+++ u7145 +++$+++ m480 +++$+++ FEZZIK +++$+++ You don't look so good.  You don't smell so good either.
L451790 +++$+++ u7145 +++$+++ m480 +++$+++ FEZZIK +++$+++ True!
L451789 +++$+++ u7148 +++$+++ m480 +++$+++ INIGO +++$+++ It's you.
L451788 +++$+++ u7145 +++$+++ m480 +++$+++ FEZZIK +++$+++ Hello.
L451915 +++$+++ u7150 +++$+++ m480 +++$+++ MIRACLE MAX +++$+++ A good hour. Yeah.
L451914 +++$+++ u7154 +++$+++ m480 +++$+++ VALERIE +++$+++ Yeah, an hour.
L451913 +++$+++ u7150 +++$+++ m480 +++$+++ MIRACLE MAX +++$+++ An hour.
'''

In [None]:
import more_itertools as mit

data_path = "pb.txt"

# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:
  raw_lines = f.read().split('\n')

raw_lines.reverse()
lines = []

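for line in raw_lines:
    # skip blank lines (e.g. a trailing newline at the end of the file)
    if not line.strip():
        continue
    # split line into its fields
    line_split = line.split(' +++$+++ ')
    # the first field looks like "L451317"; drop the "L" to get the line number
    line_num = int(line_split[0][1:])
    # the fifth field is the spoken line itself
    current_line = line_split[4].strip()
    # append tuple of line number and line
    lines.append((line_num, current_line))
# make sure the lines are in order
lines = sorted(lines, key=lambda x: x[0])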

# group lines by scene
by_scene = [list(group) for group in mit.consecutive_groups(lines, lambda x: x[0])]

dialog_only = [[dialog_line[1] for dialog_line in dialog_group] 
                for dialog_group in by_scene]

dialog_combos_nested = [list(map(list, zip(dialog_group, dialog_group[1:]))) for dialog_group in dialog_only]

dialog_combos = [combo for combos in dialog_combos_nested for combo in combos]

# print dialog combos:
print(dialog_combos)
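
The grouping step above relies on more_itertools.consecutive_groups() to split the sorted lines wherever the line numbers jump, on the assumption that a gap marks a new scene. Here is a sketch of the same idea using only the standard library, with a handful of made-up line numbers. group_consecutive() is a hypothetical helper, not the library's implementation.

```python
from itertools import groupby

# (line number, spoken line); the gap between 3 and 7 marks a new scene.
numbered_lines = [
    (1, "Guess what."),
    (2, "What?"),
    (3, "Your grandfather's here."),
    (7, "Open it up."),
    (8, "A book?"),
]

def group_consecutive(items):
    """Split items into runs whose line numbers increase by exactly 1."""
    groups = []
    # Within a run, (position - line number) stays constant.
    keyed = groupby(enumerate(items), key=lambda pair: pair[0] - pair[1][0])
    for _, run in keyed:
        groups.append([item for _, item in run])
    return groups

scenes = group_consecutive(numbered_lines)

# Pair each line with the reply that follows it, never across scenes.
pairs = [[prompt[1], reply[1]]
         for scene in scenes
         for prompt, reply in zip(scene, scene[1:])]
print(pairs)
```

Each pair is one (input, response) training example for the chatbot.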


##11.1 Setting up the bot

Just as we built a chatbot class to handle the methods for our rule-based and retrieval-based chatbots, we’ll build a chatbot class for our generative chatbot.

Inside, we’ll add a greeting method and a set of exit commands, just like we did for our closed-domain chatbots.

However, in this case, we’ll also import the seq2seq model we’ve built and trained on chat data for you, as well as other information we’ll need to generate a response.

As it happens, many cutting-edge chatbots blend a combination of rule-based, retrieval-based, and generative approaches in order to easily handle some intents using predefined responses and offload other inputs to a natural language generation system.

In [None]:
class ChatBot:
  
  negative_responses = ("no", "nope", "nah", "naw", "not a chance", "sorry")

  exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")
  
  def start_chat(self):
    user_response = input("Hi, I'm a chatbot trained on dialog from The Princess Bride. Would you like to chat with me?\n")
    
    if user_response in self.negative_responses:
      print("Ok, have a great day!")
      return
    
    self.chat(user_response)
  
  def chat(self, reply):
    while not self.make_exit(reply):
      # change this line below:
      reply = input(self.generate_response(reply))
    
  # define .generate_response():
  def generate_response(self, user_input):
    return "Cool!\n"
  
  def make_exit(self, reply):
    for exit_command in self.exit_commands:
      if exit_command in reply:
        print("Ok, have a great day!")
        return True
      
    return False
  
# instantiate your ChatBot below:
chatty_mcchatface = ChatBot()
print(chatty_mcchatface.generate_response("Hey."))

##11.2 Generating Chatbot Responses

As you may have noticed, a fundamental change from one chatbot architecture to the next is how the method that handles conversation works. In rule-based and retrieval-based systems, this method checks for various user intents that will trigger corresponding responses. In the case of generative chatbots, the seq2seq test function we built for the machine translation will do most of the heavy lifting for us!

For our chatbot we’ve renamed decode_sequence() to .generate_response(). As a reminder, this is where response generation and selection take place:

1. The encoder model encodes the user input

2. The encoder model generates an embedding (the last hidden state values)

3. The embedding is passed from the encoder to the decoder

4. The decoder generates an output matrix of possible words and their probabilities

5. We use NumPy to help us choose the most probable word (according to our model)

6. Our chosen word gets translated back from a NumPy matrix into human language and added to the output sentence

In [None]:
import numpy as np
from seq2seq import encoder_model, decoder_model, num_decoder_tokens, target_features_dict, reverse_target_features_dict, max_decoder_seq_length

class ChatBot:
  
  negative_responses = ("no", "nope", "nah", "naw", "not a chance", "sorry")

  exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")
  
  def start_chat(self):
    user_response = input("Hi, I'm a chatbot trained on dialog from The Princess Bride. Would you like to chat with me?\n")
    
    if user_response in self.negative_responses:
      print("Ok, have a great day!")
      return
    
    self.chat(user_response)
  
  def chat(self, reply):
    while not self.make_exit(reply):
      reply = input(self.generate_response(reply))
    
  # update .generate_response():
  def generate_response(self, user_input):
    states_value = encoder_model.predict(user_input)
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_features_dict['<START>']] = 1.
    
    chatbot_response = ''

    stop_condition = False
    while not stop_condition:
      output_tokens, hidden_state, cell_state = decoder_model.predict(
        [target_seq] + states_value)

      sampled_token_index = np.argmax(output_tokens[0, -1, :])
      sampled_token = reverse_target_features_dict[sampled_token_index]
      chatbot_response += " " + sampled_token
      
      # compare the number of tokens (not characters) to the maximum length
      if (sampled_token == '<END>' or len(chatbot_response.split()) > max_decoder_seq_length):
        stop_condition = True
        
      target_seq = np.zeros((1, 1, num_decoder_tokens))
      target_seq[0, 0, sampled_token_index] = 1.
      
      states_value = [hidden_state, cell_state]
      
    return chatbot_response
  
  def make_exit(self, reply):
    for exit_command in self.exit_commands:
      if exit_command in reply:
        print("Ok, have a great day!")
        return True
      
    return False
  
chatty_mcchatface = ChatBot()
# call .generate_response():
chatty_mcchatface.generate_response("I'd love to chat.")



##11.3 Handling user input

Hmm… why can’t our chatbot chat? Right now our .generate_response() method only works with preprocessed data that’s been converted into a NumPy matrix of one-hot vectors. That won’t do for our chatbot; we don’t just want to use test data for our output. We want the .generate_response() method to accept new user input.

Luckily, we can address this by building a method that translates user input into a NumPy matrix. Then we can call that method inside .generate_response() on our user input.
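
Here is a standalone sketch of that conversion, assuming a tiny made-up vocabulary and a maximum length of 6 tokens. The names mirror the ones imported from seq2seq, but the values are hypothetical, and truncating over-long input is a choice made for this sketch.

```python
import re
import numpy as np

# Hypothetical stand-ins for the values that come from the trained model.
input_features_dict = {"I": 0, "love": 1, "the": 2, "princess": 3}
num_encoder_tokens = len(input_features_dict)
max_encoder_seq_length = 6

def string_to_matrix(user_input):
    """Turn a sentence into a (1, timesteps, vocabulary) one-hot matrix."""
    tokens = re.findall(r"[\w']+|[^\s\w]", user_input)
    matrix = np.zeros(
        (1, max_encoder_seq_length, num_encoder_tokens), dtype="float32")
    # Truncate input that is longer than the encoder can take.
    for timestep, token in enumerate(tokens[:max_encoder_seq_length]):
        matrix[0, timestep, input_features_dict[token]] = 1.0
    return matrix

matrix = string_to_matrix("I love the princess")
print(matrix.shape)
print(matrix[0])
```

Each row is one timestep with a single 1 in the column of its word; unused timesteps stay all zeros.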

We said it before, and we’ll say it again: we’re adding deep learning in now, so running the code may take a bit more time again.

In [None]:
import numpy as np
import re
from seq2seq import encoder_model, decoder_model, num_decoder_tokens, num_encoder_tokens, input_features_dict, target_features_dict, reverse_target_features_dict, max_decoder_seq_length, max_encoder_seq_length

class ChatBot:
  
  negative_responses = ("no", "nope", "nah", "naw", "not a chance", "sorry")

  exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")
  
  def start_chat(self):
    user_response = input("Hi, I'm a chatbot trained on dialog from The Princess Bride. Would you like to chat with me?\n")
    
    if user_response in self.negative_responses:
      print("Ok, have a great day!")
      return
    
    self.chat(user_response)
  
  def chat(self, reply):
    while not self.make_exit(reply):
      reply = input(self.generate_response(reply))
    
  # define .string_to_matrix() below:
  def string_to_matrix(self, user_input):
    tokens = re.findall(r'[\w]+|[^\s\w]', user_input)
    user_input_matrix = np.zeros((1, max_encoder_seq_length, num_encoder_tokens), dtype='float32')
    for timestep, token in enumerate(tokens):
      user_input_matrix[0, timestep, input_features_dict[token]] = 1
    return user_input_matrix
  
  def generate_response(self, user_input):
    # change user_input into a NumPy matrix:
    user_input = self.string_to_matrix(user_input)
    # update argument for .predict():
    states_value = encoder_model.predict(user_input)
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_features_dict['<START>']] = 1.
    
    chatbot_response = ''

    stop_condition = False
    while not stop_condition:
      output_tokens, hidden_state, cell_state = decoder_model.predict(
        [target_seq] + states_value)

      sampled_token_index = np.argmax(output_tokens[0, -1, :])
      sampled_token = reverse_target_features_dict[sampled_token_index]
      chatbot_response += " " + sampled_token
      
      # compare the number of tokens (not characters) to the maximum length
      if (sampled_token == '<END>' or len(chatbot_response.split()) > max_decoder_seq_length):
        stop_condition = True
        
      target_seq = np.zeros((1, 1, num_decoder_tokens))
      target_seq[0, 0, sampled_token_index] = 1.
      
      states_value = [hidden_state, cell_state]
      
    return chatbot_response
  
  def make_exit(self, reply):
    for exit_command in self.exit_commands:
      if exit_command in reply:
        print("Ok, have a great day!")
        return True
      
    return False
  
chatty_mcchatface = ChatBot()
# call .generate_response():
print(chatty_mcchatface.generate_response("I love the princess"))

##11.4 Handling unknown words

Nice work! Our chatbot now knows how to accept user input. But there is a pretty large caveat here: our chatbot only knows the vocabulary from our training data. What if a user uses a word that the chatbot has never seen before?

With our current code, we’ll get a KeyError. This is because in .string_to_matrix() we look up each token in input_features_dict.

Currently, if the token doesn’t exist in the input_features_dict dictionary (which keeps track of all words in the training data), our program has no way of handling it.

Here are a few popular approaches to tackle unknown words:

- Tell the chatbot to ignore them. This is the simplest fix for smaller datasets, but it means the chatbot can never generate those words as output. (Can you imagine scenarios when this could be a problem?)

- Pause the chat process and have the chatbot ask what the entire utterance means. This requires the user to rephrase the entire utterance. This causes issues when working with a fairly limited dataset, since we may end up with the chatbot repeatedly asking the user to rephrase each input statement.

- Add in a step for the chatbot to register any unknown word as a '<UNK>' token. This is generally more complicated than the other two solutions. It would require that the training data is built out with '<UNK>' tokens and requires several extra manual steps.
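
The code in this section takes the first approach (ignoring unknown words). For comparison, here is a minimal sketch of the third approach, assuming the vocabulary was built with a '<UNK>' entry. The vocabulary and the helper name are hypothetical.

```python
# Hypothetical vocabulary that reserves index 0 for unknown words.
vocab = {"<UNK>": 0, "I": 1, "love": 2, "the": 3, "princess": 4}

def tokens_to_indices(tokens, features_dict, unk_token="<UNK>"):
    """Look up each token, falling back to the <UNK> index if it is missing."""
    unk_index = features_dict[unk_token]
    return [features_dict.get(token, unk_index) for token in tokens]

known = tokens_to_indices(["I", "love", "the", "princess"], vocab)
with_unknown = tokens_to_indices(["I", "love", "the", "dragon"], vocab)
print(known, with_unknown)
```

Using dict.get() with a default avoids the KeyError, but the model only benefits if it saw '<UNK>' tokens during training.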

In [None]:
import numpy as np
import re
from seq2seq import encoder_model, decoder_model, num_decoder_tokens, num_encoder_tokens, input_features_dict, target_features_dict, reverse_target_features_dict, max_decoder_seq_length, max_encoder_seq_length

class ChatBot:
  
  negative_responses = ("no", "nope", "nah", "naw", "not a chance", "sorry")

  exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")
  
  def start_chat(self):
    user_response = input("Hi, I'm a chatbot trained on dialog from The Princess Bride. Would you like to chat with me?\n")
    
    if user_response in self.negative_responses:
      print("Ok, have a great day!")
      return
    
    self.chat(user_response)
  
  def chat(self, reply):
    while not self.make_exit(reply):
      reply = input(self.generate_response(reply))
    
  # define .string_to_matrix() below:
  def string_to_matrix(self, user_input):
    tokens = re.findall(r"[\w']+|[^\s\w]", user_input)
    user_input_matrix = np.zeros(
      (1, max_encoder_seq_length, num_encoder_tokens),
      dtype='float32')
    for timestep, token in enumerate(tokens):
      # add an if clause to handle user input:
      if token in input_features_dict:
        user_input_matrix[0, timestep, input_features_dict[token]] = 1.
    return user_input_matrix
  
  def generate_response(self, user_input):
    input_matrix = self.string_to_matrix(user_input)
    states_value = encoder_model.predict(input_matrix)
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_features_dict['<START>']] = 1.
    
    chatbot_response = ''

    stop_condition = False
    while not stop_condition:
      output_tokens, hidden_state, cell_state = decoder_model.predict(
        [target_seq] + states_value)

      sampled_token_index = np.argmax(output_tokens[0, -1, :])
      sampled_token = reverse_target_features_dict[sampled_token_index]
      chatbot_response += " " + sampled_token
      
      # compare the number of tokens (not characters) to the maximum length
      if (sampled_token == '<END>' or len(chatbot_response.split()) > max_decoder_seq_length):
        stop_condition = True
        
      target_seq = np.zeros((1, 1, num_decoder_tokens))
      target_seq[0, 0, sampled_token_index] = 1.
      
      states_value = [hidden_state, cell_state]
      
    return chatbot_response
  
  def make_exit(self, reply):
    for exit_command in self.exit_commands:
      if exit_command in reply:
        print("Ok, have a great day!")
        return True
      
    return False
  
chatty_mcchatface = ChatBot()
# call .generate_response():

print(chatty_mcchatface.generate_response("Now can I say love?"))

# Appendix: Web Scraping with BeautifulSoup

##A.1 Requests

In order to get the HTML of the website, we need to make a request to get the content of the webpage.

Python has a requests library that makes getting content really easy. All we have to do is import the library, and then feed in the URL we want to GET.

In [None]:
import requests

webpage_response = requests.get("https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/shellter.html")

webpage = webpage_response.content

print(webpage)

##A.2 The BeautifulSoup Object

When we printed out all of that HTML from our request, it seemed pretty long and messy. How could we pull out the relevant information from that long string?

BeautifulSoup is a Python library that makes it easy for us to traverse an HTML page and pull out the parts we’re interested in.

"html.parser" is one option for parsers we could use. There are other options, like "lxml" and "html5lib" that have different advantages and disadvantages, but for our purposes we will be using "html.parser" throughout.

With the requests skills we just learned, we can fetch a page hosted online and pass its HTML to BeautifulSoup.

In [None]:
import requests
from bs4 import BeautifulSoup

webpage_response = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content

# The parser is an argument to BeautifulSoup, not to requests.get()
soup = BeautifulSoup(webpage, 'html.parser')

print(soup)

##A.3 Object types

BeautifulSoup breaks the HTML page into several types of objects.

A Tag corresponds to an HTML tag in the original document. Accessing a tag as an attribute of the BeautifulSoup object (for example, soup.div) returns the first tag of that type on the page. You can get the name of the tag using .name and a dictionary representing the attributes of the tag using .attrs.

NavigableStrings are the pieces of text that are in the HTML tags on the page. You can get the string inside of the tag by calling .string.
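
A small offline sketch of these object types, using an inline HTML string instead of a downloaded page (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a downloaded page.
html = ('<div id="banner" class="hero">'
        '<h1>Turtle Shellter</h1><p>Adopt <b>today</b>!</p></div>')
soup = BeautifulSoup(html, "html.parser")

first_div = soup.div              # a Tag: the first <div> on the page
tag_name = first_div.name         # the tag's name as a string
tag_attrs = first_div.attrs       # dict of attributes; class is a list

heading_text = soup.h1.string     # NavigableString inside <h1>
mixed_text = soup.p.string        # None: <p> holds text AND a nested tag
full_text = soup.p.get_text()     # all the text inside <p>, nested tags included

print(tag_name, tag_attrs)
print(heading_text, mixed_text, full_text)
```

Note that .string only returns text when the tag has a single child; for tags with mixed content, .get_text() is the safer choice.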


In [None]:
import requests
from bs4 import BeautifulSoup

webpage_response = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

print(soup.p.string)

##A.4 Navigating by tags

To navigate through a tree, we can call the tag names themselves. If we made a soup object out of an HTML page, we have seen that we can get the first h1 element by calling:

print(soup.h1)

We can get the children of a tag by accessing the .children attribute. We can also navigate up the tree of a tag by accessing the .parents attribute.
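Here is a rough sketch of moving down and up the tree. The HTML string is a hypothetical example, written on one line so that no whitespace-only text nodes show up among the children.

```python
from bs4 import BeautifulSoup

# Hypothetical page, kept on one line to avoid whitespace text nodes.
html = "<html><body><div><h1>Turtles</h1><p>Adopt one today</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .children yields the direct children of a tag.
child_names = [child.name for child in soup.div.children]
print(child_names)  # ['h1', 'p']

# .parents walks upward from a tag until it reaches the soup object itself.
parent_names = [parent.name for parent in soup.p.parents]
print(parent_names[:3])  # ['div', 'body', 'html']
```

On a real page with indentation, `.children` will usually also yield the newline strings between tags, so filtering on `child.name` is often useful.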

In [None]:
import requests
from bs4 import BeautifulSoup

webpage_response = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

for child in soup.div.children:
  print(child)

##A.5 Find all

If we want to find all of the occurrences of a tag, instead of just the first one, we can use .find_all().

This function can take in just the name of a tag and returns a list of all occurrences of that tag.

.find_all() is far more flexible than just accessing elements directly through the soup object. With .find_all(), we can use regexes, attributes, or even functions to select HTML elements more intelligently.

What if we want every <ol> and every <ul> that the page contains? We can select both of these types of elements with a regex in our .find_all().

What if we want all of the heading tags (h1 through h6) that the page contains? Regex to the rescue again!

We can also just specify all of the elements we want to find by supplying the function with a list of the tag names we are looking for.
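The regex and list selections described above might look like this. The HTML is a made-up sample, and the patterns are anchored with `^` and `$` because BeautifulSoup searches for the pattern anywhere in the tag name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page with two list types and two heading levels.
html = """
<h1>Recipes</h1>
<ol><li>Mix</li></ol>
<h2>Notes</h2>
<ul><li>Bake</li></ul>
<p>Enjoy</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <ol> and <ul>, selected with a regex on the tag name.
lists = soup.find_all(re.compile(r"^[ou]l$"))

# Every heading from h1 to h6, again with a regex.
headings = soup.find_all(re.compile(r"^h[1-6]$"))

# The same headings, selected with an explicit list of tag names.
same_headings = soup.find_all(["h1", "h2"])

print([tag.name for tag in lists])     # ['ol', 'ul']
print([tag.name for tag in headings])  # ['h1', 'h2']
```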

We can also try to match the elements with relevant attributes. We can pass a dictionary to the attrs parameter of find_all with the desired attributes of the elements we’re looking for.

If our selection starts to get really complicated, we can separate out all of the logic that we’re using to choose a tag into its own function. Then, we can pass that function into .find_all()!
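A sketch of the attribute and function forms, assuming a made-up page with `banner` classes; the helper name `has_banner_class_and_id` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class and id values are invented for the example.
html = """
<div class="banner" id="jumbotron">Welcome</div>
<div class="banner">Sale</div>
<p class="banner" id="footer-note">Thanks</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Match on attributes: every element with this class AND this id.
by_attrs = soup.find_all(attrs={"class": "banner", "id": "jumbotron"})

# Match with a function: it receives each Tag and returns True to keep it.
def has_banner_class_and_id(tag):
    return tag.has_attr("id") and "banner" in tag.get("class", [])

by_function = soup.find_all(has_banner_class_and_id)

print([tag.get_text() for tag in by_attrs])     # ['Welcome']
print([tag.get_text() for tag in by_function])  # ['Welcome', 'Thanks']
```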

In [None]:
import requests
from bs4 import BeautifulSoup

webpage_response = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

turtle_links = soup.find_all("a")

print(turtle_links)

##A.6 Select for CSS Selectors

Another way to capture your desired elements with the soup object is to use CSS selectors. The .select() method will take in all of the CSS selectors you normally use in a .css file!

If we wanted to select all of the elements that have the class 'recipeLink', we could use the command:

soup.select(".recipeLink")

If we wanted to select the element that has the id 'selected', we could use the command:

soup.select("#selected")
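To see what these calls return, here is a small sketch on a made-up search-results page (the markup is an assumption that borrows the class and id names used above). Note that `.select()` always returns a list, even when only one element matches.

```python
from bs4 import BeautifulSoup

# Hypothetical search-results markup.
html = """
<div class="recipeLink"><a href="spaghetti.html">Funfetti Spaghetti</a></div>
<div class="recipeLink" id="selected"><a href="lasagna.html">Lasagna de Funfetti</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

recipe_divs = soup.select(".recipeLink")        # list with both divs
selected = soup.select("#selected")             # list with one div
first_link = soup.select_one(".recipeLink > a") # a single Tag, or None if nothing matches

print(len(recipe_divs))       # 2
print(len(selected))          # 1
print(first_link.get_text())  # Funfetti Spaghetti
```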

Let’s say we wanted to loop through all of the links to these funfetti recipes that we found from our search.

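for link in soup.select(".recipeLink > a"):
  # link is a Tag; its URL lives in link["href"] (it must be an absolute URL for requests)
  webpage = requests.get(link["href"])
  new_soup = BeautifulSoup(webpage.content, "html.parser")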

This loop will go through each link in each .recipeLink div and create a soup object out of the webpage it links to. So, it would first make soup out of <a href="spaghetti.html">Funfetti Spaghetti</a>, then <a href="lasagna.html">Lasagna de Funfetti</a>, and so on.
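Links like `spaghetti.html` are relative, so they usually need to be combined with the address of the page they came from before they can be requested. One possible sketch uses `urljoin` from the standard library; the `base_url` below is a hypothetical placeholder, and no request is actually sent.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical address of the search-results page.
base_url = "https://example.com/recipes/"

# Hypothetical search-results markup with relative links.
html = """
<div class="recipeLink"><a href="spaghetti.html">Funfetti Spaghetti</a></div>
<div class="recipeLink"><a href="lasagna.html">Lasagna de Funfetti</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Read each href and resolve it against the page it was found on.
recipe_urls = [urljoin(base_url, link["href"])
               for link in soup.select(".recipeLink > a")]

print(recipe_urls)
# ['https://example.com/recipes/spaghetti.html', 'https://example.com/recipes/lasagna.html']
```

Each entry in `recipe_urls` could then be passed to `requests.get()` inside the loop.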

In [None]:
import requests
from bs4 import BeautifulSoup

prefix = "https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/"
webpage_response = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

turtle_links = soup.find_all("a")
links = []
#go through all of the a tags and get the links associated with them:
for a in turtle_links:
    links.append(prefix+a["href"])
    
#Define turtle_data:
turtle_data = {}

#follow each link:
for link in links:
  webpage = requests.get(link)
  turtle = BeautifulSoup(webpage.content, "html.parser")
  #Add your code here:
  turtle_name = turtle.select(".name")[0]
  turtle_data[turtle_name] = []

##A.7 Reading Text

When we use BeautifulSoup to select HTML elements, we often want to grab the text inside of the element, so that we can analyze it. We can use .get_text() to retrieve the text inside of whatever tag we want to call it on.

When a tag contains nested tags, .get_text() combines all of their text: called on an h1 that contains a span, it returns the h1's own text and the span's text as one longer string. If we want to keep the text from different tags apart, we can pass a separator string, for example .get_text("|"), which places a | character between the pieces.
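A short sketch of the difference, using a made-up heading with a nested span:

```python
from bs4 import BeautifulSoup

# Hypothetical heading with a nested <span>.
html = '<h1 class="results">Search Results for: <span class="searchTerm">Funfetti</span></h1>'
soup = BeautifulSoup(html, "html.parser")

combined = soup.h1.get_text()      # text of the h1 and the span joined together
separated = soup.h1.get_text("|")  # same pieces, with "|" between them

print(combined)   # Search Results for: Funfetti
print(separated)  # Search Results for: |Funfetti
```

Splitting the separated string on `"|"` gives back one list entry per piece of text, which is the trick the cell below uses for the turtle stats.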



In [23]:
import requests
from bs4 import BeautifulSoup

prefix = "https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/"
webpage_response = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

turtle_links = soup.find_all("a")
links = []
#go through all of the a tags and get the links associated with them:
for a in turtle_links:
    links.append(prefix+a["href"])
    
#Define turtle_data:
turtle_data = {}

#follow each link:
for link in links:
  webpage = requests.get(link)
  turtle = BeautifulSoup(webpage.content, "html.parser")
  turtle_name = turtle.select(".name")[0].get_text()
  
  stats = turtle.find("ul")
  stats_text = stats.get_text("|")
  turtle_data[turtle_name] = stats_text.split("|")

print(turtle_data)

