# Learning Objectives
In this lab we are going to:
- Explore using semantic features to improve intent classification
- Explore word sense disambiguation (WSD) with WordNet/NLTK
- Perform WSD using the Lesk algorithm


# Setup 
Please run all the cells in this section before starting the exercises. This section installs the necessary packages and downloads the data needed for all exercises below.

In [1]:
!pip install datasets > /dev/null
!pip install swifter > /dev/null

import nltk
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from nltk import word_tokenize

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

print("finished installing")

finished installing


[nltk_data] Downloading package wordnet to /Users/zhejing/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/zhejing/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to /Users/zhejing/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/zhejing/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /Users/zhejing/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
# Grab Clinc data and convert them into data frames
import urllib.request, json 
with urllib.request.urlopen("https://raw.githubusercontent.com/clinc/oos-eval/master/data/data_small.json") as url:
    data = json.loads(url.read().decode())

import pandas as pd 

train  = pd.DataFrame(data["train"], columns=["text", "intent"])
train["split"] = "train"

test   = pd.DataFrame(data["test"], columns=["text", "intent"])
test["split"] = "test"

print(f"dataset split sizes: train - {len(train)} | test - {len(test)}")

# Combine datasets into single dataframe for easier use
dataset = pd.concat([train, test])

# Filter intents 
labels = ["application_status", "alarm", "apr", "are_you_a_bot", 
          "balance", "calendar_update", "calories", "carry_on", "change_accent",
          "change_ai_name", "change_language", "change_volume", "exchange_rate",
          "expiration_date", "find_phone", "freeze_account", "greeting",
          "insurance_change", "jump_start",
          "interest_rate",
          "smart_home",
          "schedule_meeting",
          "user_name",
          "w2",
          "weather",
          "what_is_your_name",
          "what_song",
          "who_do_you_work_for",
          "yes"
          ]
dataset = dataset.query("intent in @labels")

print("finished")

dataset split sizes: train - 7500 | test - 4500
finished


# Overview

In this lab we'll be learning more about semantic features through the applied task of intent classification. Intent classification is a common feature of most modern conversational AI and chatbot systems. It aims to understand what a user is telling the bot and map the user's utterance to one of a prespecified set of intent categories. Let's imagine we're building a chatbot for a bank to handle common banking actions like opening a savings account (open-savings) or resetting a PIN (reset-pin). We can create a set of intents like open-savings and reset-pin to represent those actions to the bot. Then when a user says "I want to open a savings account" or "How can I sign up for a savings account", our classifier model will predict the "open-savings" intent and the bot can respond to the user accordingly.

We can use machine learning to build a classifier that automatically maps user utterances to intents. For this lab we'll explore creating a simple model for intent classification and using semantic features (e.g. synsets from WordNet and word sense disambiguation) to improve the performance of our baseline model.

## Data
We'll be using the Clinc-150 dataset which consists of 150 common chatbot intents and training utterances. More information about this dataset can be found in their paper [An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction](https://aclanthology.org/D19-1131/). 

For simplicity, we have gone ahead and downloaded the data for you and stored it in a pandas dataframe. The variable `dataset` is a pandas dataframe consisting of three columns:
- text: the user utterance 
- intent: intent label 
- split: identifies if the row is from the train or test split


We'll use a smaller subset of the 150 intents so that we can experiment more quickly. As the focus of the lab is semantic analysis, we've provided the baseline model for you. However, you are encouraged to explore the full dataset and other models once you've finished the exercises below.


In [3]:
# Overview of dataset variable
dataset

Unnamed: 0,text,intent,split
50,what am i allowed to carry on for american air...,carry_on,train
51,what are the carry on rules for united,carry_on,train
52,what can't i carry-on to delta,carry_on,train
53,tell me united's carry on policy,carry_on,train
54,what is the carry on limit,carry_on,train
...,...,...,...
4285,what do i need to do now that my battery is dead,jump_start,test
4286,i have to jump start my car,jump_start,test
4287,what shall i do now that my battery is dead,jump_start,test
4288,i gotta jump start my car,jump_start,test


## Baseline Model Preprocessing and setup

Recall from previous lectures that building a machine learning NLP model consists of several steps. 

- First we encode our categorical labels into numerical values that our classifier will predict. To accomplish this we use sklearn's `LabelEncoder` (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). We generate two variables, `y_train` and `y_test`, which contain the encoded labels for the train and test splits.

- Next we create input features for our model. As the input to our model is user utterances (text), we'll represent each sentence as a vector of `tf-idf` scores across a learned vocabulary. We use the `TfidfVectorizer` (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to fit and transform our user utterances into tfidf features.
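
To make the tf-idf step more concrete, here is a minimal pure-Python sketch of how such a vector can be computed. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is our understanding of the `TfidfVectorizer` defaults, and it tokenises by whitespace only, which is a simplification. The function name `tfidf_vectors` and the toy sentences are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute smoothed, L2-normalised tf-idf vectors for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

# Toy utterances, not taken from the dataset
docs = ["what is my balance", "what is the weather", "check my account balance"]
vocab, vectors = tfidf_vectors(docs)

# Print the non-zero weights of the first utterance
for term, weight in zip(vocab, vectors[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

Terms that occur in fewer utterances (such as "weather" here) receive a higher weight than terms shared by many utterances (such as "what"), which is what lets the classifier focus on intent-specific words.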



In [4]:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Encode intent label and transform into enumerated values
le = LabelEncoder()
le.fit(dataset["intent"])

dataset["encoded_label"] = le.transform(dataset["intent"])

# break out the encoded labels by train / test split
y_train = dataset.query("split=='train'")["encoded_label"]  
y_test  = dataset.query("split=='test'")["encoded_label"]

# Generate a vocabulary from text and generate tfidf features 
tfidf = TfidfVectorizer()
tfidf.fit(dataset["text"])

# create train and test input features
X_train = tfidf.transform(dataset.query("split=='train'")["text"])
X_test =  tfidf.transform(dataset.query("split=='test'")["text"])

print("finished")

finished


## Baseline Model Implementation

Finally we go ahead and train our model. We'll use Gaussian Naive Bayes (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) with the default hyperparameters. Feel free to read the documentation, experiment with the hyperparameters, and see how they affect the model.


Then we'll evaluate the model on our test set by first generating predictions using the trained model and then comparing them against our gold labels `y_test`. We use sklearn's `accuracy_score` to get the accuracy of our model.

In [5]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load model and fit to training data
clf = GaussianNB()
clf.fit(X_train.toarray(), y_train)

# Generate predictions
preds = clf.predict(X_test.toarray())

# Calculate Accuracy metrics
acc = accuracy_score(y_test, preds)
print(f"Accuracy - {acc}")

Accuracy - 0.7988505747126436
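
A single accuracy number does not show which intents the model gets wrong. The sketch below computes accuracy per label on a few made-up labels (they are not results from this lab); `per_class_accuracy` is a hypothetical helper name.

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's examples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold_label, pred_label in zip(y_true, y_pred):
        total[gold_label] += 1
        if gold_label == pred_label:
            correct[gold_label] += 1
    return {label: correct[label] / total[label] for label in total}

# Toy labels for illustration only
gold = ["balance", "balance", "weather", "weather", "alarm"]
pred = ["balance", "apr", "weather", "weather", "alarm"]

# Overall accuracy is the fraction of predictions that match the gold label
overall = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"overall accuracy: {overall:.2f}")

for label, acc in sorted(per_class_accuracy(gold, pred).items()):
    print(f"{label}: {acc:.2f}")
```

With the notebook's variables, the same helper could be called on `le.inverse_transform(y_test)` and `le.inverse_transform(preds)` to get readable intent names.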


# Semantic features with Wordnet

Let's get to the main exercises for this week. Recall that semantics involves understanding the meaning of words and concepts. Semantic analysis of text explores the specific meaning of words in context (word sense), words that are related through common definitions (synonyms), words that are diametrically opposed to each other (antonyms), and many other relations. In addition to the slides, a useful introduction can be found here: [Overview and Semantic Issues of Text Mining](https://sigmodrecord.org/publications/sigmodRecord/0709/p23.cesar-andritsos.pdf)

## Wordnet Synsets
We'll be using WordNet as the source of our semantic features and tools. 


"WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, and adjectives are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets." [1]

The NLTK library has a useful set of APIs to access Wordnet and perform common semantic analysis tasks. More information can be found in the documentation here: https://www.nltk.org/howto/wordnet.html.

In this exercise we'll be generating synonyms from WordNet synsets. The goal is to augment the input to the model with additional synonyms for the key nouns, giving the model additional semantic clues as to the meaning of the input.

A synset is an NLTK object that represents a group of synonymous words related to a specific concept.

The synset object has several useful properties and methods. Each synset contains a dictionary definition, related synonyms, and many other properties that can be found in the documentation.

Let's take a closer look in the code below:

[1] Introduction to WordNet: An On-line Lexical Database (https://wordnetcode.princeton.edu/5papers.pdf)


In [6]:
from nltk.corpus import wordnet as wn

# Let's take a look at the synsets for the word "balance" in WordNet

# wn.synsets returns a collection of synset objects. Here we print the first 5
# synsets related to the word "balance"
for synset in wn.synsets("balance")[:5]:
  print(synset)
  print("Definition: ",synset.definition())
  print("Synonym lemmas: ", synset.lemma_names() )
  print("-----------------------------------")

Synset('balance.n.01')
Definition:  a state of equilibrium
Synonym lemmas:  ['balance']
-----------------------------------
Synset('balance.n.02')
Definition:  equality between the totals of the credit and debit sides of an account
Synonym lemmas:  ['balance']
-----------------------------------
Synset('proportion.n.05')
Definition:  harmonious arrangement or relation of parts or elements within a whole (as in a design); - John Ruskin
Synonym lemmas:  ['proportion', 'proportionality', 'balance']
-----------------------------------
Synset('balance.n.04')
Definition:  equality of distribution
Synonym lemmas:  ['balance', 'equilibrium', 'equipoise', 'counterbalance']
-----------------------------------
Synset('remainder.n.01')
Definition:  something left after other parts have been taken away
Synonym lemmas:  ['remainder', 'balance', 'residual', 'residue', 'residuum', 'rest']
-----------------------------------


# Exercise 1

The overall goal of this exercise is to augment each input sentence with a set of related synonyms for the nouns found in the sentence. At a high level we can accomplish this with the following steps:

1. Extract all nouns in the sentence
2. For each noun look up synonyms (if they exist) from the Wordnet synsets
3. Add the extracted synonyms to the end of the input

## Exercise 1a. Extract all nouns in a sentence
For this exercise you will write a function that extracts all the nouns in a sentence. 

For example:

- Sentence: I want to check the balance of my bank account
- Nouns: balance, bank, account 

To identify nouns we can use an NLTK POS tagger to tag each word in the sentence with a POS tag. We can then extract all words whose POS tags mark them as nouns. Below we show how you can get POS tags from NLTK for a given sentence.

In [7]:
import nltk
from nltk import word_tokenize

ex1 = "I want to check the balance of my bank accounts"

# 1. Tokenize text
toks = word_tokenize(ex1)

# 2. Generate tagged tokens
tagged = nltk.pos_tag(toks)

# 3. Loop over tagged tokens and print out word and tag
for word, pos_tag in tagged:
  print(word, pos_tag)

I PRP
want VBP
to TO
check VB
the DT
balance NN
of IN
my PRP$
bank NN
accounts NNS


Go ahead and write the code to extract the nouns based on POS tags. For a full list of POS tags see: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/ 


Hint: the POS tag for singular nouns is NN. Can you find the remaining tags for plural nouns, singular proper nouns, and plural proper nouns?

In [10]:
def extract_nouns(text):
  """
  Take a sentence and return a list of the nouns extracted from it. If no nouns
  are found, an empty list is returned.
  """
  # 1. Tokenize text
  toks = nltk.word_tokenize(text)

  # 2. Loop over tagged tokens and extract nouns. Add extracted nouns to the
  # extracted nouns list
  noun_tags = ['NN', 'NNS', 'NNP', 'NNPS']
  extracted_nouns = []

  # YOUR CODE BELOW
  tagged = nltk.pos_tag(toks)
  for word, tag in tagged:
    if tag in noun_tags:
      extracted_nouns.append(word)

  return extracted_nouns

extract_nouns("I want to check the balance of my bank accounts")

['balance', 'bank', 'accounts']
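
Since all four noun tags share the prefix `NN`, the tag check can also be written as a prefix test. The sketch below works on the tagged tokens copied from the POS-tagging output above, so it runs without NLTK; `nouns_from_tagged` is a hypothetical helper name.

```python
# Tagged tokens copied from the POS-tagging output shown above
tagged = [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("check", "VB"),
          ("the", "DT"), ("balance", "NN"), ("of", "IN"), ("my", "PRP$"),
          ("bank", "NN"), ("accounts", "NNS")]

def nouns_from_tagged(tagged_tokens):
    """Keep words whose Penn Treebank tag starts with 'NN' (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

print(nouns_from_tagged(tagged))  # ['balance', 'bank', 'accounts']
```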

## Exercise 1b. Extract synonyms from Wordnet 
Next you will create a function to extract all the synonyms for a provided word from WordNet. In the function below, write your code to get synonyms from WordNet.



In [15]:
def extract_synonyms(noun):
  """
  Given a noun, return all the synonyms (if any) found in the synsets for
  that noun. Return an empty list if none are found. 
  """
  synonyms = []

  # Loop over the synsets for a given word and add each synonym 
  # to the synonyms list. 
  for synset in wn.synsets(noun):
    synonyms.extend(synset.lemma_names())

  return synonyms

extract_synonyms("balance")  

['balance',
 'balance',
 'proportion',
 'proportionality',
 'balance',
 'balance',
 'equilibrium',
 'equipoise',
 'counterbalance',
 'remainder',
 'balance',
 'residual',
 'residue',
 'residuum',
 'rest',
 'balance',
 'Libra',
 'Balance',
 'Libra',
 'Libra_the_Balance',
 'Balance',
 'Libra_the_Scales',
 'symmetry',
 'symmetricalness',
 'correspondence',
 'balance',
 'counterweight',
 'counterbalance',
 'counterpoise',
 'balance',
 'equalizer',
 'equaliser',
 'balance_wheel',
 'balance',
 'balance',
 'balance',
 'equilibrate',
 'equilibrize',
 'equilibrise',
 'balance',
 'poise',
 'balance',
 'balance']
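
The list above contains many repeats because the same lemma appears in several synsets. If an ordered list without repeats is wanted, one option is `dict.fromkeys`, sketched here on the first few entries of the output above (`unique_in_order` is a hypothetical helper name).

```python
def unique_in_order(lemmas):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(lemmas))

# First few entries of the output above
raw = ["balance", "balance", "proportion", "proportionality", "balance",
       "balance", "equilibrium", "equipoise", "counterbalance"]

print(unique_in_order(raw))
# ['balance', 'proportion', 'proportionality', 'equilibrium', 'equipoise', 'counterbalance']
```

The comparison is case-sensitive, so `'balance'` and `'Balance'` would both be kept.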

## Exercise 1c. Add extracted synonyms to end of input

Finally, you will create a function that combines the functions you wrote in 1a and 1b. In this function, you'll take all the synonyms found and add them to the end of the input.


In [16]:
def add_synonyms_to_text(text):
  all_synonyms = []

  # 1. Extract nouns
  extracted_nouns = extract_nouns(text)

  # Edge case: no nouns are found
  if not extracted_nouns:
    return text

  # 2. For each noun, extract synonyms and add them to our list
  for noun in extracted_nouns:
    all_synonyms.extend(extract_synonyms(noun))

  # 3. Keep only the unique synonyms
  unique_synonyms = set(all_synonyms)

  # 4. Return input with extracted synonyms appended at the end
  return text + " " + " ".join(unique_synonyms)

print(f"original text: {ex1}")
print(f"new text: {add_synonyms_to_text(ex1)}")

original text: I want to check the balance of my bank accounts
new text: I want to check the balance of my bank accounts counterbalance swear camber equilibrium equaliser rest balance_wheel chronicle residue rely correspondence report counterweight banking_company depository_financial_institution poise coin_bank proportion explanation trust equilibrize Libra_the_Balance story accounting calculate symmetricalness Libra_the_Scales banking_concern residuum score equilibrise symmetry money_box news_report remainder savings_bank equipoise history account counterpoise deposit equalizer cant Balance write_up account_statement residual bank business_relationship invoice Libra proportionality bank_building describe answer_for bill equilibrate balance


## Evaluation of Synonym features

Now we're ready to see what effect these semantic features have on our model. We want to apply the `add_synonyms_to_text` method to all the text in our dataset. Doing this sequentially in a for loop is very inefficient, so we provide code below that uses `swifter` to parallelize the operation where possible. We also provide the code to retrain the model. After running the cells below, go to Exercise 1d to wrap up Exercise 1.

In [17]:
import swifter

# Apply the text with synonyms function to text column in datasets
dataset["text_with_synonyms"] = dataset["text"].swifter.apply(lambda x: add_synonyms_to_text(x)) 



Pandas Apply:   0%|          | 0/2320 [00:00<?, ?it/s]

In [18]:
# Generate inputs 
tfidf_w_synonyms = TfidfVectorizer()
tfidf_w_synonyms.fit(dataset["text_with_synonyms"])

X_train = tfidf_w_synonyms.transform(dataset.query("split=='train'")["text_with_synonyms"])
X_test =  tfidf_w_synonyms.transform(dataset.query("split=='test'")["text_with_synonyms"])

In [19]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

clf = GaussianNB()
clf.fit(X_train.toarray(), y_train)

preds = clf.predict(X_test.toarray())
print(accuracy_score(y_test, preds))

0.803448275862069


## Exercise 1d: Evaluation
Did adding synonyms to the input improve performance? What is a potential downside of this approach?

Write your answer here:

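Yes, slightly: accuracy rose from roughly 0.799 to 0.803. A potential downside is that we add synonyms for every sense of each noun, so many of the added words have a different meaning from the one intended and can change the context of the utterance.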

# Improving Synonym Selection with WSD and Lesk

One limitation of the approach above was that we were naively adding all the synonyms we found in a synset for a given word. However, words can have different meanings in different contexts. Recall the concept of word sense and the problem of word sense disambiguation. Word sense disambiguation aims to identify the specific meaning of a word in context. 

The Lesk algorithm aims to disambiguate a word based on the overlap between the context around the word and the dictionary definitions of its senses. The sense whose definition has the highest overlap with the context is returned.
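
The overlap idea can be sketched without NLTK. The code below is a simplified illustration, not NLTK's implementation: `simplified_lesk`, the sense names, and the hand-written glosses are all hypothetical and do not use WordNet's wording.

```python
def simplified_lesk(context_tokens, senses):
    """Pick the sense whose gloss shares the most words with the context.

    `senses` maps a sense name to its gloss (definition) string.
    Returns a (sense name, overlap count) pair.
    """
    context = {tok.lower() for tok in context_tokens}
    best_sense, best_overlap = None, -1
    for name, gloss in senses.items():
        # Overlap = number of distinct words shared by context and gloss
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = name, overlap
    return best_sense, best_overlap

# Hypothetical, hand-written glosses for illustration only
toy_senses = {
    "bank_financial": "an institution where people deposit money and cash a check",
    "bank_river": "sloping land beside a river or lake",
}

sentence = "He cashed a check at the bank".split()
print(simplified_lesk(sentence, toy_senses))  # ('bank_financial', 2)
```

In this sketch function words such as "a" count toward the overlap just like content words, which is one reason a purely overlap-based method can select an unexpected sense when the context is short.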

NLTK has an implementation of the lesk algorithm which we explore below.

In [20]:
from nltk.wsd import lesk

sent = 'He cashed a check at the bank.'
sent_toks = nltk.word_tokenize(sent)
ambiguous = 'bank'

syn = lesk(sent_toks, ambiguous, pos='n')
print(sent)
print(syn)
print("Definition: ", syn.definition())
print("Synonyms: ", syn.lemma_names())

print("\n------------------------------------------\n")
sent2 = 'He saw the road bank left.'
sent2_toks = nltk.word_tokenize(sent2)
ambiguous = 'bank'
syn2 = lesk(sent2_toks, "bank")

print(sent2)
print(syn2)
print("Definition: ", syn2.definition())
print("Synonyms: ", syn2.lemma_names())

He cashed a check at the bank.
Synset('savings_bank.n.02')
Definition:  a container (usually with a slot in the top) for keeping money at home
Synonyms:  ['savings_bank', 'coin_bank', 'money_box', 'bank']

------------------------------------------

He saw the road bank left.
Synset('bank.n.07')
Definition:  a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Synonyms:  ['bank', 'cant', 'camber']


## Exercise 2

Let's create a function `lesk_synonyms` that provides synonyms based on the disambiguated noun. 

In [27]:

def lesk_synonyms(text, ambiguous):
  """
  Given a context and an ambiguous noun, return the synonyms from the disambiguated
  synset. Return empty list if none are found.
  """
  # 1. Get disambiguated synset
  syn = lesk(nltk.word_tokenize(text), ambiguous, pos='n') # your code here

  # Edge case, if synset is empty return empty list
  if syn is None:
    return []

  # Strip out '_' found in lemmas that contain compound words.
  cleaned_synonyms = [lemma.replace("_", " ") for lemma in syn.lemma_names()] 
  return cleaned_synonyms

lesk_synonyms("I went to the bank today.", "bank")

['bank', 'cant', 'camber']
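
Stripping the `_` changes how compound lemmas are tokenized downstream. The sketch below assumes the regular expression `(?u)\b\w\w+\b`, which is our understanding of the default `token_pattern` of `TfidfVectorizer`, and applies it with Python's `re` module to lemmas taken from the Lesk output above. The helper name `tokens` is made up.

```python
import re

# Assumption: this mirrors the default token pattern of TfidfVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokens(text):
    """Lowercase the text and split it into tokens of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())

# With underscores, each compound lemma stays a single token
print(tokens("bank savings_bank money_box"))
# ['bank', 'savings_bank', 'money_box']

# With underscores replaced by spaces, the parts become separate tokens
print(tokens("bank savings bank money box"))
# ['bank', 'savings', 'bank', 'money', 'box']
```

Under this assumption, a compound such as `savings_bank` would only match other occurrences of the exact same compound, whereas the split form shares the tokens `savings` and `bank` with ordinary user utterances.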

We have updated `add_synonyms_to_text` to use the Lesk synonyms and retrain the model below. After running the cells below, go to Exercise 2a.

In [28]:
def add_synonyms_to_text(text):

  # 1. Extract nouns
  extracted_nouns = extract_nouns(text)

  # Edge case: no nouns are found
  if not extracted_nouns:
    return text

  all_synonyms = []
  for noun in extracted_nouns:
    # 2. Get synonyms for the disambiguated sense and add them to our list
    all_synonyms.extend(lesk_synonyms(text, noun))

  # 3. Keep only the unique synonyms
  unique_synonyms = set(all_synonyms)

  if not unique_synonyms:
    return text
  return text + " " + " ".join(unique_synonyms)

print("He went to the bank")
print(add_synonyms_to_text("He went to the bank"))

He went to the bank
He went to the bank bank cant camber


In [29]:
dataset["text_with_lesk_synonyms"] = dataset["text"].swifter.apply(lambda x: add_synonyms_to_text(x)) 



Pandas Apply:   0%|          | 0/2320 [00:00<?, ?it/s]

In [30]:
# Generate inputs 
tfidf_w_synonyms = TfidfVectorizer()
tfidf_w_synonyms.fit(dataset["text_with_lesk_synonyms"])

X_train = tfidf_w_synonyms.transform(dataset.query("split=='train'")["text_with_lesk_synonyms"])
X_test =  tfidf_w_synonyms.transform(dataset.query("split=='test'")["text_with_lesk_synonyms"])

print("finished")

finished


In [31]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
clf = GaussianNB()
clf.fit(X_train.toarray(), y_train)

preds = clf.predict(X_test.toarray())
print(accuracy_score(y_test, preds))

0.8333333333333334


## Exercise 2a. Evaluation
What effect did replacing synonyms with lesk synonyms have on the model's accuracy?

It improved the model's accuracy, from roughly 0.803 to 0.833. Because Lesk disambiguates each noun, only the synonyms of the selected sense are added, so fewer unrelated synonyms end up in the input.