 # Intent inference



 author: Steeve Laquitaine
 purpose: predict intent class
 approach:
   - preprocessing
       - Constituency parsing
       - Filtering
           - complexity: keep intents w/ N sentences
           - mood
           - syntax similarity
   - inference
       - cluster and add labels

 TABLE OF CONTENTS

 * Packages
 * Parameters
 * Load data
 * Constituency parsing
 * Filtering
   * by query complexity
   * by grammatical mood
   * by syntactical similarity
 * Intent parsing
 * Label inference

 Prerequisites

   * cfg..xlsx
   * sim_matrix.xlsx

 Observations:

   * So far the best parameters are:

       SEED            = " VB NP" <br>
       THRES_NUM_SENT  = 1 <br>
       NUM_SENT        = 1 <br>
       THRES_SIM_SCORE = 1 <br>
       FILT_MOOD       = ("ask",) <br>

 [TODO]:
  - refactor and abstract pipeline
  - link raw dataset with inference dataset with an index (primary key)

 # PACKAGES

In [1]:
# set project path
import os
from collections import defaultdict

proj_path = "/Users/steeve_laquitaine/desktop/CodeHub/intent/"
os.chdir(proj_path)

from time import time

# import packages
import pandas as pd
import spacy
import yaml
from collections import Counter
import numpy as np


# import custom nodes
from intent.src.intent.nodes import (
    features,
    inference,
    parsing,
    preprocess,
    retrieval,
    similarity,
)
from intent.src.intent.pipelines.parsing import Cfg
from intent.src.intent.pipelines.similarity import Lcs
from intent.src.tests import test_run

# shortcuts
todf = pd.DataFrame

# display
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", None)

Warming up PyWSD (takes ~10 secs)... took 8.985882997512817 secs.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/steeve_laquitaine/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/steeve_laquitaine/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# PARAMETERS

In [2]:
SEED = " VB NP"  # seed constituent pattern that queries are compared against
NUM_SENT = 1  # keep queries with exactly this many sentences
THRES_SIM_SCORE = 1  # similarity threshold: keep queries syntactically similar to the seed
FILT_MOOD = ("ask",)  # moods to keep, among ("state", "wish-or-excl", "ask"); here only "ask" (questions)
DIST_THRES = 5  # distance threshold for clustering; lower values -> more clusters
with open(proj_path + "intent/conf/base/parameters.yml") as file:
    prms = yaml.safe_load(file)


# LOAD DATA

## Raw data

In [3]:
t0 = time()
corpus_path = proj_path + "intent/data/01_raw/banking77/train.csv"
corpus = pd.read_csv(corpus_path)

# PREPROCESSING

## Constituency parsing

In [4]:
# [warning] this is slow
cfg = Cfg(corpus, prms).do()

Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
(Instantiation) took 31.89 secs
Time to completion: 25.63
Time to completion: 35.19
Time to completion: 32.11
Time to completion: 32.91
Time to completion: 31.87
Time to completion: 31.34
Time to completion: 29.52
Time to completion: 29.02
Time to completion: 27.53
Time to completion: 28.47
Time to completion: 28.17
38.94
  sample["VP"] = np.asarray(VPs)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  annots_df["annots"][annots_df["VP"].isnull()] = np.nan


## Filtering

### filter complexity

 Keep only the queries that contain exactly `NUM_SENT` sentences (here, one).

In [5]:
cfg_cx = preprocess.filter_n_sent_eq(cfg, NUM_SENT, verbose=True)

There are 100 original queries.
88 after filtering = 1 sentence queries.
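
 The implementation of `preprocess.filter_n_sent_eq` is not shown in this notebook. As a rough illustration of the idea only, the sketch below counts sentences with a naive punctuation-based split and keeps queries with exactly `n_sent` sentences. The names `count_sentences` and `filter_n_sent_eq_sketch` and the two example queries are hypothetical.

```python
import re


def count_sentences(text: str) -> int:
    """Count sentences by splitting after '.', '!' or '?' followed by whitespace.

    Naive on purpose: abbreviations such as "e.g. " would be miscounted.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def filter_n_sent_eq_sketch(queries, n_sent):
    """Return (position, query) pairs for queries with exactly n_sent sentences."""
    return [(ix, q) for ix, q in enumerate(queries) if count_sentences(q) == n_sent]


# hypothetical example queries
queries = [
    "How can I convert currencies?",
    "My card has not arrived. Can I track it?",
]
kept = filter_n_sent_eq_sketch(queries, n_sent=1)
print(f"There are {len(queries)} original queries.")
print(f"{len(kept)} after filtering.")
```

 Keeping the original position next to each query makes it possible to map the filtered rows back to the raw dataset later.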


### filter mood

In [6]:
cfg_mood = preprocess.filter_in_only_mood(cfg_cx, FILT_MOOD)

In [7]:
tag = parsing.from_cfg_to_constituents(cfg_mood["cfg"])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cfg[cfg.isnull()] = ""


### filter syntax similarity

In [8]:
# calculate similarity
similarity_matrix = Lcs().do(cfg_mood)
test_run.test_len_similarity_matx(cfg_mood, similarity_matrix)
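
 The internals of `Lcs` are not shown here. Assuming the name stands for longest common subsequence, the sketch below shows one way to score the syntactic similarity of two constituent-tag strings. The normalisation by the longer sequence and the names `lcs_length` and `lcs_similarity` are assumptions, not the pipeline's actual code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(tags_a: str, tags_b: str) -> float:
    """Similarity in [0, 1]: LCS length divided by the length of the longer tag sequence."""
    a, b = tags_a.split(), tags_b.split()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


seed = " VB NP"
candidates = ["VB NP", "VB NP PP", "MD NP VB"]  # hypothetical tag strings
scores = {c: lcs_similarity(seed, c) for c in candidates}
print(scores)
```

 Under this normalisation a score of 1 means the two tag sequences are identical, so a threshold of 1 would keep only queries whose pattern matches the seed exactly.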

In [9]:
sim_ranked = similarity.rank_nearest_to_seed(similarity_matrix, seed=SEED, verbose=True)
posting_list = retrieval.create_posting_list(tag)
ranked = similarity.print_ranked_VPs(cfg_mood, posting_list, sim_ranked)
filtered = similarity.filter_by_similarity(ranked, THRES_SIM_SCORE)
# test [TODO]
test_run.test_rank_nearest_to_seed(similarity_matrix, seed=SEED)
test_run.test_posting_list(posting_list, similarity_matrix, seed=SEED)
test_run.test_get_posting_index(cfg_mood, posting_list, sim_ranked)


0 duplicated cfgs were dropped.
9 querie(s) is(are) left after filtering.
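
 A posting list is an inverted index. Assuming `retrieval.create_posting_list` maps each distinct constituent string to the positions of the queries that carry it (its code is not shown here), a minimal sketch looks like this. The function name and the tag strings are hypothetical.

```python
from collections import defaultdict


def create_posting_list_sketch(tags):
    """Map each distinct tag string to the list of query positions where it occurs."""
    posting = defaultdict(list)
    for ix, tag in enumerate(tags):
        posting[tag.strip()].append(ix)
    return dict(posting)


# hypothetical constituent strings, one per query
tags = ["VB NP", "MD NP VB", "VB NP", "VB NP PP"]
posting_list = create_posting_list_sketch(tags)
print(posting_list["VB NP"])  # [0, 2]
```

 With such an index, every query sharing a given syntactic pattern can be retrieved in one lookup once the patterns have been ranked against the seed.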


In [10]:
# map back to raw intent indices
raw_ix = cfg_mood["index"]
filtered_raw_ix = raw_ix.values[filtered.index.values]

### Intent parsing

 1. Apply dependency parsing to each query
 2. Apply named-entity recognition (NER)
 3. Retrieve the triple: intent (the `ROOT` token), `intendeed` (the direct object, `dobj`) and entities (from NER)
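
 Step 3 can be sketched independently of the parser. The example below works on hand-annotated tokens instead of real parser output, so the `Token` tuple, the function `extract_intent` and the annotations are hypothetical. With spaCy, the same fields would come from `token.dep_` and `token.ent_type_`.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    dep: str       # dependency label, e.g. "ROOT" or "dobj"
    ent_type: str  # named-entity type, "" when the token is not an entity


def extract_intent(tokens):
    """Collect the ROOT token(s), the direct object(s) and the entities of one query."""
    return {
        "intent": [t.text for t in tokens if t.dep == "ROOT"],
        "intendeed": [t.text for t in tokens if t.dep == "dobj"],
        "entities": [(t.text, t.ent_type) for t in tokens if t.ent_type],
    }


# hand-annotated parse of "Can I track the card?" (not produced by a parser)
tokens = [
    Token("Can", "aux", ""),
    Token("I", "nsubj", ""),
    Token("track", "ROOT", ""),
    Token("the", "det", ""),
    Token("card", "dobj", ""),
]
print(extract_intent(tokens))
```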

In [11]:
intents = parsing.parse_intent(filtered)

In [12]:
# show (intent, intendeed)
cfg_mood.index = cfg_mood["index"]
cfg_mood.merge(todf(intents, index=filtered_raw_ix), left_index=True, right_index=True)[
    ["index", "text", "intent", "intendeed"]
]

| | index | text | intent | intendeed |
|---|---|---|---|---|
| 1375 | 1375 | How can I change currency type? | [type] | |
| 83 | 83 | Can I track the card that was just sent to me? | [track] | [card] |
| 1432 | 1432 | How can I exchange currencies with this app? | [currencies] | |
| 27 | 27 | When will I recieve my new card? | [recieve] | [card] |
| 1449 | 1449 | How can I change the currency I'm exchanging from AUD to GBP? | [change] | [currency, exchanging] |
| 63 | 63 | Can I track the card you sent to me? | [track] | [card] |
| 1470 | 1470 | Can this app exchange American and English currency? | [currency] | |
| 116 | 116 | How do I track the card you sent to me? | [track] | [card] |
| 1392 | 1392 | How can I convert currencies? | [convert] | [currencies] |


# LABEL INFERENCE

 1. Filter words not in Wordnet
 2. Apply verb phrase hierarchical clustering
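
 The clustering inside `inference.label_queries` is not shown here. To illustrate how a distance threshold controls the number of clusters, the sketch below swaps in a much simpler setup: Jaccard distance over word sets instead of a WordNet-based similarity, and single-linkage merging, which is equivalent to cutting a single-linkage dendrogram at the threshold. All names and the threshold values are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """1 - |intersection| / |union| of the two phrases' word sets."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def cluster_by_threshold(phrases, dist_thres):
    """Single-linkage clustering: phrases closer than dist_thres end up in one cluster."""
    parent = list(range(len(phrases)))

    def find(i):
        # follow parent links to the cluster representative, compressing the path
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(phrases)), 2):
        if jaccard_distance(phrases[i], phrases[j]) <= dist_thres:
            parent[find(i)] = find(j)

    # relabel cluster representatives as 1, 2, 3, ... in order of appearance
    roots = {}
    return [roots.setdefault(find(i), len(roots) + 1) for i in range(len(phrases))]


phrases = [
    "track card sent me",
    "track new card sent me",
    "change currency another",
    "change another currency",
]
print(cluster_by_threshold(phrases, 0.5))  # [1, 1, 2, 2]
print(cluster_by_threshold(phrases, 0.1))  # [1, 2, 3, 3]
```

 As with `DIST_THRES`, lowering the threshold yields more clusters, and raising it far enough merges everything into a single one.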

In [13]:
filtered_corpus = preprocess.filter_words(cfg_mood["VP"], "not_in_wordnet")

In [14]:
filtered_corpus = preprocess.drop_empty_queries(filtered_corpus)

In [15]:
# [warning] this is very slow
tic = time()
labels = inference.label_queries(tuple(filtered_corpus), DIST_THRES)

  final_similarity = dot / tow


In [16]:
labels.index = filtered_corpus.index
print(f"{round(time() - tic, 2)} secs")
print(f"Total: {round(time() - t0, 2)} secs")
labelled = labels.sort_values(by=["label"])


216.65 secs
Total: 292.72 secs


# EVALUATION

## Map clusters to true labels for evaluation

In [17]:
true_labels = corpus.loc[labelled.index]["category"]
labelled["true_labels"] = true_labels
labelled

| index | query | label | true_labels |
|---|---|---|---|
| 136 | is tracking number card | 1 | card_arrival |
| 1363 | change | 1 | exchange_via_app |
| 44 | tell me why I have received new card | 1 | card_arrival |
| 73 | was supposed arrive | 1 | card_arrival |
| 1359 | change another currency | 1 | exchange_via_app |
| 90 | track new card sent me | 1 | card_arrival |
| 1384 | change currency another | 1 | exchange_via_app |
| 63 | track card sent me | 1 | card_arrival |
| 145 | gotten new card | 1 | card_arrival |
| 12 | track card me | 1 | card_arrival |


In [18]:
# assign the most likely label to each cluster

In [19]:
unique_labels = labelled["label"].unique()
predicted_labels_all = []
proba_predicted_all = []
n_labelled = len(labelled)

# for each cluster, assign its most frequent true label as the predicted label
# (rows are sorted by cluster label, so per-cluster results can be appended in order)
for this_label in unique_labels:

    # find the row positions of this cluster's intents
    this_label_ix = np.where(labelled["label"] == this_label)[0]
    nb_label = len(this_label_ix)

    # find its most frequent true label and that label's count
    predicted = Counter(
        labelled["true_labels"].iloc[this_label_ix].tolist()
    ).most_common()[0]

    # assign this true label as the predicted label, with its proportion
    # within the cluster as the probability of the prediction
    predicted_labels_all += [predicted[0]] * nb_label
    proba_predicted_all += [predicted[1] / nb_label] * nb_label
labelled["predicted"] = predicted_labels_all
labelled["proba_predicted (ratio)"] = proba_predicted_all
labelled

| index | query | label | true_labels | predicted | proba_predicted (ratio) |
|---|---|---|---|---|---|
| 136 | is tracking number card | 1 | card_arrival | exchange_via_app | 0.512821 |
| 1363 | change | 1 | exchange_via_app | exchange_via_app | 0.512821 |
| 44 | tell me why I have received new card | 1 | card_arrival | exchange_via_app | 0.512821 |
| 73 | was supposed arrive | 1 | card_arrival | exchange_via_app | 0.512821 |
| 1359 | change another currency | 1 | exchange_via_app | exchange_via_app | 0.512821 |
| 90 | track new card sent me | 1 | card_arrival | exchange_via_app | 0.512821 |
| 1384 | change currency another | 1 | exchange_via_app | exchange_via_app | 0.512821 |
| 63 | track card sent me | 1 | card_arrival | exchange_via_app | 0.512821 |
| 145 | gotten new card | 1 | card_arrival | exchange_via_app | 0.512821 |
| 12 | track card me | 1 | card_arrival | exchange_via_app | 0.512821 |


## Calculate performance metrics

In [20]:
nb_TP = sum(labelled["predicted"] == labelled["true_labels"])
accuracy = nb_TP / n_labelled
print("Task info:")
print("- number of classes:", labelled["true_labels"].nunique())
print("\nMetrics:")
print("- accuracy:", accuracy)


Task info:
- number of classes: 2

Metrics:
- accuracy: 0.5128205128205128
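
 The evaluation above (majority true label per cluster, then accuracy) can also be written without positional indexing. The sketch below is a plain-Python version on made-up toy data, and the function names are hypothetical. It is only meant to make the two steps explicit, not to replace the cells above.

```python
from collections import Counter, defaultdict


def majority_label_per_cluster(cluster_ids, true_labels):
    """For each cluster, return (most frequent true label, its share within the cluster)."""
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster[cid].append(label)
    result = {}
    for cid, labels in by_cluster.items():
        label, count = Counter(labels).most_common(1)[0]
        result[cid] = (label, count / len(labels))
    return result


def cluster_accuracy(cluster_ids, true_labels):
    """Share of items whose true label equals their cluster's majority label."""
    majority = majority_label_per_cluster(cluster_ids, true_labels)
    hits = sum(majority[c][0] == t for c, t in zip(cluster_ids, true_labels))
    return hits / len(true_labels)


# made-up toy data: two clusters, two classes
cluster_ids = [1, 1, 1, 2, 2]
true_labels = [
    "card_arrival",
    "card_arrival",
    "exchange_via_app",
    "exchange_via_app",
    "exchange_via_app",
]
print(majority_label_per_cluster(cluster_ids, true_labels))
print("accuracy:", cluster_accuracy(cluster_ids, true_labels))  # accuracy: 0.8
```

 When all queries fall into a single cluster, as in the run above, this accuracy reduces to the frequency of the most common class, so it says little about the quality of the clustering.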
