# Understanding natural language
>  Here, you'll use machine learning to turn natural language into structured data using spaCy, scikit-learn, and Rasa NLU. You'll start with a refresher on the theoretical foundations and then move on to building models using the ATIS dataset, which contains thousands of sentences from real people interacting with a flight booking system.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp]
- image: images/datacamp/___

> Note: This is a summary of the chapter 2 exercises from the course "Building Chatbots in Python" at DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 8)

import re
import spacy

## Understanding intents and entities

### Intent classification with regex I

<div class=""><p>You'll begin by implementing a very simple technique to recognize intents - looking for the presence of keywords.</p>
<p>A dictionary, <code>keywords</code>, has already been defined. It has the intents <code>"greet"</code>, <code>"goodbye"</code>, and <code>"thankyou"</code> as keys,
and lists of keywords as the corresponding values. For example, <code>keywords["greet"]</code> is set to <code>["hello", "hi", "hey"]</code>.</p>
<p>Also defined is a second dictionary, <code>responses</code>, indicating how the bot should respond to each of these intents.
It also has a default response with the key <code>"default"</code>.</p>
<p>The function <code>send_message()</code>, along with the bot and user templates, has also already been defined. Your job in this exercise is to create a dictionary with the intents as keys and regex objects as values.</p></div>

In [1]:
keywords = {'goodbye': ['bye', 'farewell'],
 'greet': ['hello', 'hi', 'hey'],
 'thankyou': ['thank', 'thx']}

Instructions
<ul>
<li>Iterate over the <code>keywords</code> dictionary, using <code>intent</code> and <code>keys</code> as your iterator variables.</li>
<li>Use <code>'|'.join(keys)</code> to create regular expressions to match at least one of the keywords and pass it to <code>re.compile()</code> to compile the regular expressions into pattern objects. Store the result as the value of the <code>patterns</code> dictionary.</li>
</ul>

In [4]:
# Define a dictionary of patterns
patterns = {}

# Iterate over the keywords dictionary
for intent, keys in keywords.items():
    # Create regular expressions and compile them into pattern objects
    patterns[intent] = re.compile('|'.join(keys))
    
# Print the patterns
print(patterns)

{'goodbye': re.compile('bye|farewell'), 'greet': re.compile('hello|hi|hey'), 'thankyou': re.compile('thank|thx')}
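
One thing to keep in mind with this approach: a plain alternation such as `hello|hi|hey` matches anywhere in the string, including inside longer words. The sketch below illustrates this and shows one possible tightening with word boundaries. The `loose` and `strict` names and the test sentences are illustrative additions, not part of the course exercise.

```python
import re

keywords = {
    "greet": ["hello", "hi", "hey"],
    "goodbye": ["bye", "farewell"],
    "thankyou": ["thank", "thx"],
}

# Plain alternation matches anywhere, including inside other words
loose = re.compile("|".join(keywords["greet"]))
print(bool(loose.search("this is great")))  # True: "hi" occurs inside "this"

# Hypothetical variant: escape each keyword and require word boundaries
strict = {
    intent: re.compile(
        r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b", re.IGNORECASE
    )
    for intent, keys in keywords.items()
}
print(bool(strict["greet"].search("this is great")))  # False
print(bool(strict["greet"].search("Hi there")))       # True
```

The stricter patterns trade recall for precision: `\bthank\b` no longer matches "thanks", so the keyword lists would need to be extended accordingly.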


### Intent classification with regex II

<p>With your <code>patterns</code> dictionary created, it's now time to define a function to find the intent of a message.</p>

In [5]:
patterns = {'goodbye': re.compile(r'bye|farewell', re.UNICODE),
 'greet': re.compile(r'hello|hi|hey', re.UNICODE),
 'thankyou': re.compile(r'thank|thx', re.UNICODE)}

In [6]:
responses = {'default': 'default message',
 'goodbye': 'goodbye for now',
 'greet': 'Hello you! :)',
 'thankyou': 'you are very welcome'}

In [10]:
user_template = 'USER : {0}'
bot_template = 'BOT : {0}'
def send_message(message):
    print(user_template.format(message))
    response = respond(message)
    print(bot_template.format(response))

Instructions
<ul>
<li>Iterate over the <code>intent</code>s and <code>pattern</code>s in the <code>patterns</code> dictionary using its <code>.items()</code> method.</li>
<li>Use the <code>.search()</code> method of <code>pattern</code> to look for keywords in the <code>message</code>.</li>
<li>If there is a match, return the corresponding <code>intent</code>.</li>
<li>Call your <code>match_intent()</code> function inside <code>respond()</code> with <code>message</code> as the argument and then hit 'Submit Answer' to see how the bot responds to the provided messages.</li>
</ul>

In [11]:
# Define a function to find the intent of a message
def match_intent(message):
    matched_intent = None
    for intent, pattern in patterns.items():
        # Check if the pattern occurs in the message using the pattern's .search() method
        if pattern.search(message):
            matched_intent = intent
    return matched_intent

# Define a respond function
def respond(message):
    # Call the match_intent function
    intent = match_intent(message)
    # Fall back to the default response
    key = "default"
    if intent in responses:
        key = intent
    return responses[key]

# Send messages
send_message("hello!")
send_message("bye byeee")
send_message("thanks very much!")

USER : hello!
BOT : Hello you! :)
USER : bye byeee
BOT : goodbye for now
USER : thanks very much!
BOT : you are very welcome


### Entity extraction with regex

<div class=""><p>Now you'll use another simple method, this time for finding a person's name in a sentence, such as "hello, my name is David Copperfield".</p>
<p>You'll look for the keywords <code>"name"</code> or <code>"call(ed)"</code>, and find capitalized words using regex and assume those are names. Your job in this exercise is to define a <code>find_name()</code> function to do this.</p></div>

Instructions
<ul>
<li>Use <code>re.compile()</code> to create a pattern for checking if <code>"name"</code> or <code>"call"</code> keywords occur.</li>
<li>Create a pattern for finding capitalized words.</li>
<li>Use the <code>.findall()</code> method on <code>name_pattern</code> to retrieve all matching words in <code>message</code>.</li>
<li>Call your <code>find_name()</code> function inside <code>respond()</code> and then hit 'Submit Answer' to see how the bot responds to the provided messages.</li>
</ul>

In [12]:
# Define find_name()
def find_name(message):
    name = None
    # Create a pattern for checking if the keywords occur
    name_keyword = re.compile('name|call')
    # Create a pattern for finding capitalized words
    name_pattern = re.compile('[A-Z]{1}[a-z]*')
    if name_keyword.search(message):
        # Get the matching words in the string
        name_words = name_pattern.findall(message)
        if len(name_words) > 0:
            # Return the name if the keywords are present
            name = ' '.join(name_words)
    return name

# Define respond()
def respond(message):
    # Find the name
    name = find_name(message)
    if name is None:
        return "Hi there!"
    else:
        return "Hello, {0}!".format(name)

# Send messages
send_message("my name is David Copperfield")
send_message("call me Ishmael")
send_message("People call me Cassandra")

USER : my name is David Copperfield
BOT : Hello, David Copperfield!
USER : call me Ishmael
BOT : Hello, Ishmael!
USER : People call me Cassandra
BOT : Hello, People Cassandra!


**You just built a simple entity recognizer using regex. However, as the final output of send_message() shows ("Hello, People Cassandra!"), treating every capitalized word as part of a name has its limitations.**
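
One way to reduce this kind of false positive is to capture only the capitalized words that directly follow the cue phrase. The following is a minimal sketch under that assumption; the `name_after_cue` pattern and `find_name_after_cue()` are hypothetical and not part of the course solution.

```python
import re

# Hypothetical pattern: a cue phrase followed by one or more capitalized words
name_after_cue = re.compile(
    r"(?:name is|call(?:ed)?(?: me)?)\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"
)

def find_name_after_cue(message):
    # Return only the capitalized words that come right after the cue
    match = name_after_cue.search(message)
    return match.group(1) if match else None

print(find_name_after_cue("my name is David Copperfield"))  # David Copperfield
print(find_name_after_cue("call me Ishmael"))               # Ishmael
print(find_name_after_cue("People call me Cassandra"))      # Cassandra
```

This still relies on capitalization and a fixed set of cue phrases, so it remains a heuristic rather than a general solution.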

## Word vectors

### Word vectors with spaCy

<div class=""><p>In this exercise you'll get your first experience with word vectors!
You're going to use the ATIS dataset, which contains thousands of sentences from real people
interacting with a flight booking system.</p>
<p>The user utterances are available in the list <code>sentences</code>, and the corresponding intents in <code>labels</code>.</p>
<p>Your job is to create a 2D array <code>X</code> with as many rows as there are sentences in the dataset, where each row is a vector describing that sentence.</p></div>

In [4]:
sentences = [' i want to fly from boston at 838 am and arrive in denver at 1110 in the morning',
 ' what flights are available from pittsburgh to baltimore on thursday morning',
 ' what is the arrival time in san francisco for the 755 am flight leaving washington',
 ' cheapest airfare from tacoma to orlando',
 ' round trip fares from pittsburgh to philadelphia under 1000 dollars',
 ' i need a flight tomorrow from columbus to minneapolis',
 ' what kind of aircraft is used on a flight from cleveland to dallas',
 ' show me the flights from pittsburgh to los angeles on thursday',
 ' all flights from boston to washington',
 ' what kind of ground transportation is available in denver',
 ' show me the flights from dallas to san francisco',
 ' show me the flights from san diego to newark by way of houston',
 ' what is the cheapest flight from boston to bwi',
 ' all flights to baltimore after 6 pm',
 ' show me the first class fares from boston to denver',
 ' show me the ground transportation in denver',
 ' all flights from denver to pittsburgh leaving after 6 pm and before 7 pm',
 ' i need information on flights for tuesday leaving baltimore for dallas dallas to boston and boston to baltimore',
 ' please give me the flights from boston to pittsburgh on thursday of next week',
 ' i would like to fly from denver to pittsburgh on united airlines',
 ' show me the flights from san diego to newark',
 ' please list all first class flights on united from denver to baltimore',
 ' what kinds of planes are used by american airlines',
 " i'd like to have some information on a ticket from denver to pittsburgh and atlanta",
 " i'd like to book a flight from atlanta to denver",
 ' which airline serves denver pittsburgh and atlanta',
 " show me all flights from boston to pittsburgh on wednesday of next week which leave boston after 2 o'clock pm",
 ' atlanta ground transportation',
 ' i also need service from dallas to boston arriving by noon',
 ' show me the cheapest round trip fare from baltimore to dallas']

Instructions
<ul>
<li>Load the <code>spaCy</code> English model by calling <code>spacy.load()</code> with argument <code>'en'</code>.</li>
<li>Calculate the length of <code>sentences</code> using <code>len()</code> and the dimensionality of the word vectors using <code>nlp.vocab.vectors_length</code>. </li>
<li>For each sentence, call the <code>nlp</code> object with the <code>sentence</code> as the sole argument. Store the result as <code>doc</code>.</li>
<li>Use the <code>.vector</code> attribute of <code>doc</code> to get the vector representation of each sentence, and store this vector in the appropriate row of <code>X</code>.</li>
</ul>

In [None]:
%%capture
!python -m spacy download en_core_web_lg
# Restart the Colab runtime after the download so the model can be loaded

In [7]:
# Load the spacy model: nlp
#nlp = spacy.load('en')
nlp = spacy.load("en_core_web_lg")

# Calculate the length of sentences
n_sentences = len(sentences)

# Calculate the dimensionality of nlp
embedding_dim = nlp.vocab.vectors_length

# Initialize the array with zeros: X
X = np.zeros((n_sentences, embedding_dim))

# Iterate over the sentences
for idx, sentence in enumerate(sentences):
    # Pass each sentence to the nlp object to create a document
    doc = nlp(sentence)
    # Save the document's .vector attribute to the corresponding row in X
    X[idx, :] = doc.vector
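
By default, spaCy's `doc.vector` is the average of the document's token vectors. The sketch below reproduces that idea with NumPy only, so it runs without a spaCy model. The 3-dimensional `toy_vectors` table is invented purely for illustration and has nothing to do with the real embeddings.

```python
import numpy as np

# Toy 3-dimensional embeddings, invented purely for illustration
toy_vectors = {
    "flights": np.array([0.9, 0.1, 0.0]),
    "fares": np.array([0.7, 0.3, 0.1]),
    "from": np.array([0.0, 0.5, 0.5]),
    "boston": np.array([0.1, 0.9, 0.2]),
    "denver": np.array([0.2, 0.8, 0.3]),
}

def sentence_vector(sentence):
    # Average the vectors of known words; unknown words are skipped
    vecs = [toy_vectors[w] for w in sentence.split() if w in toy_vectors]
    if not vecs:
        return np.zeros(3)
    return np.mean(vecs, axis=0)

def cosine(a, b):
    # Cosine similarity between two non-zero vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

toy_sentences = ["flights from boston", "fares from denver"]
X_toy = np.vstack([sentence_vector(s) for s in toy_sentences])
print(X_toy.shape)  # (2, 3)
print(cosine(X_toy[0], X_toy[1]))
```

Sentences built from similar words end up with similar averaged vectors, which is what makes these rows usable as features for a classifier.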

## Intents and classification

### Intent classification with sklearn

<div class=""><p>An array <code>X</code> containing vectors describing each of the sentences in the ATIS dataset has been created for you, along with a 1D array <code>y</code> containing the labels. The labels are integers corresponding to the intents in the dataset. For example, label <code>0</code> corresponds to the intent <code>atis_flight</code>.</p>
<p>Now, you'll use the <code>scikit-learn</code> library to train a classifier on this same dataset. Specifically, you will fit and evaluate a support vector classifier.</p></div>

In [15]:
%%capture
!wget https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/building-chatbots-in-python/datasets/ATIS_preprocessed.npz
data = np.load('ATIS_preprocessed.npz')
X_train = data['X_train']
y_train = data['y_train']
X_test = data['X_test']
y_test = data['y_test']

Instructions
<ul>
<li>Import the <code>SVC</code> class from <code>sklearn.svm</code>.</li>
<li>Instantiate a classifier <code>clf</code> by calling <code>SVC</code> with a single keyword argument <code>C</code> with value <code>1</code>.</li>
<li>Fit the classifier to the training data <code>X_train</code> and <code>y_train</code>.</li>
<li>Predict the labels of the test set, <code>X_test</code>.</li>
</ul>

In [17]:
# Import SVC
from sklearn.svm import SVC

# Create a support vector classifier
clf = SVC(C=1)

# Fit the classifier using the training data
clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Count the number of correct predictions
n_correct = 0
for i in range(len(y_test)):
    if y_pred[i] == y_test[i]:
        n_correct += 1

print("Predicted {0} correctly out of {1} test examples".format(n_correct, len(y_test)))

Predicted 182 correctly out of 201 test examples
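
The counting loop above can also be written as a vectorized comparison, and the count can be turned into an accuracy. This is a small sketch with made-up labels (`y_true_demo` and `y_pred_demo` are not the ATIS data); the last line applies the same ratio to the counts reported above.

```python
import numpy as np

# Made-up labels for illustration; not the ATIS data
y_true_demo = np.array([0, 0, 1, 2, 1, 0])
y_pred_demo = np.array([0, 1, 1, 2, 1, 0])

# Element-wise comparison gives a boolean array; summing it counts the hits
n_correct_demo = int(np.sum(y_pred_demo == y_true_demo))
accuracy_demo = n_correct_demo / len(y_true_demo)
print(n_correct_demo, round(accuracy_demo, 3))  # 5 0.833

# Applying the same ratio to the counts reported above
print(round(182 / 201, 3))  # 0.905
```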


## Entity extraction

### Using spaCy's entity recognizer

<div class=""><p>In this exercise, you'll use <code>spaCy</code>'s built-in entity recognizer to extract names, dates, and organizations from search queries. The <code>spaCy</code> library has been imported for you, and its English model has been loaded as <code>nlp</code>.</p>
<p>Your job is to define a function called <code>extract_entities()</code>, which takes in a single argument <code>message</code> and returns a dictionary with the included entity types as keys, and the extracted entities as values. The included entity types are contained in a list called <code>include_entities</code>.</p></div>

Instructions
<ul>
<li>Create a dictionary called <code>ents</code> to hold the entities by calling <code>dict.fromkeys()</code> with <code>include_entities</code> as the sole argument.</li>
<li>Create a <code>spacy</code> document called <code>doc</code> by passing the <code>message</code> to the <code>nlp</code> object.</li>
<li>Iterate over the entities in the document (<code>doc.ents</code>).</li>
<li>Check whether the entity's <code>.label_</code> is one we are interested in. If so, assign the entity's <code>.text</code> attribute to the corresponding key in the <code>ents</code> dictionary.</li>
</ul>

In [19]:
# Define include_entities
include_entities = ['DATE', 'ORG', 'PERSON']

# Define extract_entities()
def extract_entities(message):
    # Create a dict to hold the entities
    ents = dict.fromkeys(include_entities)
    # Create a spacy document
    doc = nlp(message)
    for ent in doc.ents:
        if ent.label_ in include_entities:
            # Save interesting entities
            ents[ent.label_] = ent.text
        else:
            print(ent.label_)
    return ents

print(extract_entities('friends called Mary who have worked at Google since 2010'))
print(extract_entities('people who graduated from MIT in 1999'))

{'DATE': '2010', 'ORG': 'Google', 'PERSON': 'Mary'}
{'DATE': '1999', 'ORG': 'MIT', 'PERSON': None}
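
Because `extract_entities()` stores one string per label, a second entity with the same label overwrites the first. A possible variation collects a list per label instead. The sketch below uses hand-written `(text, label)` pairs as a stand-in for `(ent.text, ent.label_)` from `doc.ents`, so it runs without a spaCy model; `collect_entities()` and the sample pairs are hypothetical.

```python
include_entities = ["DATE", "ORG", "PERSON"]

def collect_entities(pairs):
    # pairs: iterable of (text, label) tuples, standing in for
    # (ent.text, ent.label_) taken from doc.ents
    ents = {label: [] for label in include_entities}
    for text, label in pairs:
        if label in ents:
            ents[label].append(text)
    return ents

# Hand-written pairs, not real spaCy output
pairs = [
    ("Mary", "PERSON"),
    ("John", "PERSON"),
    ("Google", "ORG"),
    ("2010", "DATE"),
    ("London", "GPE"),
]
print(collect_entities(pairs))
# {'DATE': ['2010'], 'ORG': ['Google'], 'PERSON': ['Mary', 'John']}
```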


### Assigning roles using spaCy's parser

<div class=""><p>In this exercise you'll use <code>spaCy</code>'s powerful syntax parser to assign <em>roles</em> to the entities in your users' messages. To do this, you'll define two functions, <code>find_parent_item()</code> and <code>assign_colors()</code>. In doing so, you'll use a parse tree to assign roles, similar to how Alan did in the video.</p>
<p>Recall that you can access the ancestors of a word using its <code>.ancestors</code> attribute.</p></div>

In [23]:
def entity_type(word):
    _type = None
    if word.text in colors:
        _type = "color"
    elif word.text in items:
        _type = "item"
    return _type
colors = ['black', 'red', 'blue']
items = ['shoes', 'handbag', 'jacket', 'jeans']

Instructions
<ul>
<li>Create a <code>spacy</code> document called <code>doc</code> by passing the message <code>"let's see that jacket in red and some blue jeans"</code> to the <code>nlp</code> object. </li>
<li>In the <code>find_parent_item(word)</code> function, iterate over the <code>ancestors</code> of each <code>word</code> until an <code>entity_type()</code> of <code>"item"</code> is found. </li>
<li>In the <code>assign_colors(doc)</code> function, iterate over the <code>doc</code> until an <code>entity_type</code> of <code>"color"</code> is found. Then, find the parent item of this <code>word</code>.</li>
<li>Pass in the <code>spacy</code> document to the <code>assign_colors()</code> function.</li>
</ul>

In [24]:
# Create the document
doc = nlp("let's see that jacket in red and some blue jeans")

# Iterate over parents in parse tree until an item entity is found
def find_parent_item(word):
    # Iterate over the word's ancestors
    for parent in word.ancestors:
        # Check for an "item" entity
        if entity_type(parent) == "item":
            return parent.text
    return None

# For all color entities, find their parent item
def assign_colors(doc):
    # Iterate over the document
    for word in doc:
        # Check for "color" entities
        if entity_type(word) == "color":
            # Find the parent
            item = find_parent_item(word)
            print("item: {0} has color : {1}".format(item, word))

# Assign the colors
assign_colors(doc) 

item: jacket has color : red
item: jeans has color : blue
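
To see why walking up the ancestors works, it can help to trace the idea on a tiny hand-built tree. In the sketch below, `Word` is a minimal stand-in for a spaCy token, and the head assignments are written by hand as an assumption about the parse; they are not real parser output.

```python
class Word:
    """Minimal stand-in for a spaCy token: text plus a pointer to its head."""

    def __init__(self, text, head=None):
        self.text = text
        self.head = head

    @property
    def ancestors(self):
        # Walk up the tree from the direct head to the root
        node = self.head
        while node is not None:
            yield node
            node = node.head

items = ["shoes", "handbag", "jacket", "jeans"]

# Hand-assigned heads (an assumption, not real parser output)
see = Word("see")
jacket = Word("jacket", head=see)
in_ = Word("in", head=jacket)
red = Word("red", head=in_)
jeans = Word("jeans", head=jacket)
blue = Word("blue", head=jeans)

def find_parent_item(word):
    # Return the closest ancestor that is a known item
    for parent in word.ancestors:
        if parent.text in items:
            return parent.text
    return None

for word in (red, blue):
    print(word.text, "->", find_parent_item(word))
# red -> jacket
# blue -> jeans
```

The closest item ancestor wins, which is why "blue" attaches to "jeans" even though "jacket" sits further up the same tree.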


## Robust language understanding with Rasa NLU

### Rasa NLU

<p>In this exercise, you'll use Rasa NLU to create an <code>interpreter</code>, which parses incoming user messages and returns a set of entities. Your job is to train an <code>interpreter</code> using the MITIE entity recognition model in Rasa NLU.</p>

In [None]:
%%capture
!pip install rasa

In [33]:
from rasa_nlu.training_data import load_data #old way -> from rasa_nlu.converters import load_data
from rasa_nlu.config import RasaNLUConfig

Instructions
<ul>
<li>Create a dictionary called <code>args</code> with a single key <code>"pipeline"</code> with value <code>"spacy_sklearn"</code>.</li>
<li>Create a <code>config</code> by calling <code>RasaNLUConfig()</code> with the single argument <code>cmdline_args</code> with value <code>args</code>.</li>
<li>Create a <code>trainer</code> by calling <code>Trainer()</code> using the configuration as the argument.</li>
<li>Create an <code>interpreter</code> by calling <code>trainer.train()</code> with the <code>training_data</code>.</li>
</ul>

In [None]:
# Import necessary modules
from rasa_nlu.training_data import load_data  # older releases: from rasa_nlu.converters import load_data
from rasa_nlu.config import RasaNLUConfig
from rasa_nlu.model import Trainer

# Create args dictionary
args = {"pipeline": "spacy_sklearn"}

# Create a configuration and trainer
config = RasaNLUConfig(cmdline_args=args)
trainer = Trainer(config)

# Load the training data
training_data = load_data("./training_data.json")

# Create an interpreter by training the model
interpreter = trainer.train(training_data)

# Test the interpreter
print(interpreter.parse("I'm looking for a Mexican restaurant in the North of town"))
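
The contents of `training_data.json` are not shown here. As a rough sketch, and assuming the legacy Rasa NLU JSON layout (a top-level `rasa_nlu_data` object holding `common_examples`), a single training example could be assembled as follows. The intent name, the entity names, and the `make_example()` helper are assumptions made for illustration.

```python
import json

def make_example(text, intent, spans):
    # spans maps an entity name to the substring that should be labelled
    entities = []
    for entity, value in spans.items():
        start = text.index(value)  # raises ValueError if the value is missing
        entities.append(
            {"start": start, "end": start + len(value), "value": value, "entity": entity}
        )
    return {"text": text, "intent": intent, "entities": entities}

example = make_example(
    "I'm looking for a Mexican restaurant in the North of town",
    "restaurant_search",                          # assumed intent name
    {"cuisine": "Mexican", "location": "North"},  # assumed entity names
)

# Assumed legacy layout: examples live under rasa_nlu_data -> common_examples
training = {"rasa_nlu_data": {"common_examples": [example]}}

# Sanity check: every offset pair must slice out exactly the labelled value
for ent in example["entities"]:
    assert example["text"][ent["start"]:ent["end"]] == ent["value"]

print(json.dumps(training, indent=2))
```

Computing the character offsets from the text, rather than typing them by hand, avoids off-by-one errors in the entity annotations.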

### Data-efficient entity recognition

<div class=""><p>Most systems for extracting entities from text are built to extract 'Universal' things like names, dates, and places.
But you probably don't have enough training data for your bot to make these systems perform well!</p>
<p>In this exercise, you'll activate the MITIE entity recognizer inside Rasa to extract restaurant-related entities using a very small amount of training data. A dictionary <code>args</code> has already been defined for you, along with a <code>training_data</code> object.</p></div>

Instructions
<ul>
<li>Create a <code>config</code> by calling <code>RasaNLUConfig()</code> with a single argument <code>cmdline_args</code> with value <code>{"pipeline": pipeline}</code>.</li>
<li>Create a <code>trainer</code> and use it to create an <code>interpreter</code>, just as you did in the previous exercise.</li>
</ul>

In [None]:
# Import necessary modules
from rasa_nlu.config import RasaNLUConfig
from rasa_nlu.model import Trainer

pipeline = [
    "nlp_spacy",
    "tokenizer_spacy",
    "ner_crf"
]

# Create a config that uses this pipeline
config = RasaNLUConfig(cmdline_args={"pipeline": pipeline})

# Create a trainer that uses this config
trainer = Trainer(config)

# Create an interpreter by training the model
interpreter = trainer.train(training_data)

# Parse some messages
print(interpreter.parse("show me Chinese food in the centre of town"))
print(interpreter.parse("I want an Indian restaurant in the west"))
print(interpreter.parse("are there any good pizza places in the center?"))