# INTENT PARSING

* **Purpose** :
  * Test intent parsing with AllenNLP

 * TABLE OF CONTENTS
 * SETUP
   * paths
 * PARAMETERS
 * PARSING
   * AllenNLP VP parsing
   * Parsing performance
   * Focus on the class well parsed
 * ANNOTATION


# SETUP

In [312]:
import os
from datetime import datetime
from time import time

import numpy as np
import pandas as pd
import yaml
from nltk.tree import ParentedTree
from pigeon import annotate

proj_path = "/Users/steeve_laquitaine/desktop/CodeHub/intent/"
os.chdir(proj_path)
# in root
from intent.src.intent.nodes import mood, parsing

# dataframe display
pd.set_option("display.max_colwidth", 100)
pd.set_option("display.max_rows", 1000)

# pd.set_option('display.notebook_repr_html', True)

# to display df w/ nbconvert to pdf
# def _repr_latex_(self):
    # return "\centering{%s}" % self.to_latex()
# pd.DataFrame._repr_latex_ = _repr_latex_  # monkey patch pandas DataFrame

## paths

In [313]:
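# load catalog and parameters
# (safe_load: recent PyYAML versions require an explicit Loader for yaml.load)
with open(proj_path+"intent/conf/base/catalog.yml") as file:
    catalog = yaml.safe_load(file)
with open(proj_path+"intent/conf/base/parameters.yml") as file:
    prms = yaml.safe_load(file)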
tr_data_path = proj_path + "intent/data/01_raw/banking77/train.csv"
test_data_path = proj_path + "intent/data/01_raw/banking77/test.csv"

# PARAMETERS

In [314]:
prm = dict()
prm["sample"] = 100
prm["mood"] = ["declarative"]
prm["intent_class"] = "card_arrival"  # good parsing performance


In [315]:
# read queries data
tr_data = pd.read_csv(tr_data_path)

In [316]:
# select data for an input class
# copy so that a column can be added later without a SettingWithCopyWarning
data = tr_data[tr_data["category"].eq(prm["intent_class"])].copy()

In [317]:
data.head(5)

Unnamed: 0,text,category
0,I am still waiting on my card?,card_arrival
1,What can I do if my card still hasn't arrived after 2 weeks?,card_arrival
2,I have been waiting over a week. Is the card still coming?,card_arrival
3,Can I track my card while it is in the process of delivery?,card_arrival
4,"How do I know if I will get my card, or if it is lost?",card_arrival


In [318]:
sample = data["text"].iloc[0]

# PARSING

## ALLENNLP VP PARSING

In [319]:
tic = time()
al_prdctor = parsing.init_allen_parser()
print(f"(Instantiation) took {round(time()-tic,2)} secs")

(Instantiation) took 30.8 secs


In [320]:
tic = time()
output = al_prdctor.predict(sentence=sample)
parsed_txt = output["trees"]
print(f"(Inference) took {round(time()-tic,2)} secs")
print(f"Parsed sample:\n{parsed_txt}")

(Inference) took 0.3 secs
Parsed sample:
(S (NP (PRP I)) (VP (VBP am) (ADVP (RB still)) (VP (VBG waiting) (PP (IN on) (NP (PRP$ my) (NN card))))) (. ?))


In [321]:
tree = ParentedTree.fromstring(parsed_txt)
assert len(parsing.extract_VP(al_prdctor, "I want coffee")) > 0, "VP is Empty"
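
For readers without access to `parsing.extract_VP`, here is a minimal sketch of pulling a verb phrase out of the bracketed parse printed above. It assumes the outermost `VP` is the one of interest; the project's own `extract_VP` may select a different constituent. `first_vp_text` is a hypothetical helper name.

```python
from nltk.tree import ParentedTree


def first_vp_text(bracketed_parse: str) -> str:
    """Return the words under the first VP met in pre-order, or '' if none."""
    tree = ParentedTree.fromstring(bracketed_parse)
    # subtrees() walks the tree in pre-order, so the first match is the outermost VP
    for subtree in tree.subtrees(filter=lambda t: t.label() == "VP"):
        return " ".join(subtree.leaves())
    return ""


# bracketed parse copied from the inference output above
parse = (
    "(S (NP (PRP I)) (VP (VBP am) (ADVP (RB still)) (VP (VBG waiting) "
    "(PP (IN on) (NP (PRP$ my) (NN card))))) (. ?))"
)
vp = first_vp_text(parse)
```

Returning an empty string when no `VP` exists mirrors the notebook's own check that the extracted VP is not empty.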

In [322]:
# To speed up: currently about 1 hour per 10K queries
VPs = parsing.extract_all_VPs(prm, data, al_prdctor)
assert (
    len(VPs) == len(data) or len(VPs) == prm["sample"]
), '''VP's length does not match "data"'''

Time to completion: 79.28
Time to completion: 60.47
Time to completion: 51.51
Time to completion: 51.73
Time to completion: 47.52
Time to completion: 46.61
Time to completion: 49.31
Time to completion: 46.95
Time to completion: 49.87
Time to completion: 48.46
Time to completion: 46.92
38.02
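
One possible way to cut the per-query overhead is batched inference. The sketch below assumes `al_prdctor` is an AllenNLP `Predictor`, which exposes `predict_batch_json`; this is not confirmed by the notebook, since `parsing.init_allen_parser` is project code. `chunked`, `predict_in_batches` and `EchoPredictor` are hypothetical names, and the stub predictor exists only so that the sketch runs without loading a model.

```python
from typing import Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_batches(predictor, sentences: List[str], batch_size: int = 32) -> List[Dict]:
    """Send sentences to the predictor batch by batch and collect the outputs."""
    outputs: List[Dict] = []
    for batch in chunked(sentences, batch_size):
        # assumption: the predictor accepts a list of {"sentence": ...} inputs
        outputs.extend(predictor.predict_batch_json([{"sentence": s} for s in batch]))
    return outputs


class EchoPredictor:
    """Stand-in for a real predictor (hypothetical); returns a fake tree per input."""

    def predict_batch_json(self, inputs: List[Dict]) -> List[Dict]:
        return [{"trees": f"(S {item['sentence']})"} for item in inputs]


results = predict_in_batches(EchoPredictor(), ["a", "b", "c"], batch_size=2)
```

Whether batching actually helps depends on the model and hardware, so it is worth timing on a small sample first.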


In [323]:
# augment dataset with VPs
data["VP"] = pd.DataFrame(VPs)
data.iloc[: prm["sample"]]

Unnamed: 0,text,category,VP
0,I am still waiting on my card?,card_arrival,take for my new card to arrive in them mail
1,What can I do if my card still hasn't arrived after 2 weeks?,card_arrival,check the delivery of the card you sent
2,I have been waiting over a week. Is the card still coming?,card_arrival,track when my card will be delivered
3,Can I track my card while it is in the process of delivery?,card_arrival,tell me where my card is ? I ordered it 2 weeks ago
4,"How do I know if I will get my card, or if it is lost?",card_arrival,gets here
5,When did you send me my new card?,card_arrival,have a tracking number for the card I was sent
6,Do you have info about the card on delivery?,card_arrival,need to do to get my new card which I have requested 2 weeks ago
7,What do I do if I still have not received my new card?,card_arrival,been lost in delivery
8,Does the package with my card have tracking?,card_arrival,'s been a week since you issued me a card and I still did n't get it . Should I keep waiting
9,I ordered my card but it still isn't here,card_arrival,send me my new card


 Write parsed data

In [324]:
data.to_excel(catalog['parsed'])

In [325]:
# verb_p[0].pretty_print()

## PARSING PERFORMANCE

 * **The parser works in 62% of cases for "card_arrival" and never for the other classes**

   * See 2a_eda_parsing.py.
   * We will analyse why later.
   * We now focus on the well-parsed class: "card_arrival".

## FOCUS ON THE CLASS WELL PARSED

 moods = mood.classify_sentence_type(data["text"])
 moods
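
As a rough illustration of sentence mood, the root label of a Penn Treebank-style constituency parse already separates questions (`SQ`, `SBARQ`) from declarative clauses (`S`, `SINV`). The sketch below is a crude heuristic under that assumption and is not how `mood.classify_sentence_type` works; imperatives are also rooted at `S`, so they would be mislabelled. `mood_from_parse` and `ROOT_LABEL_TO_MOOD` are hypothetical names.

```python
from nltk.tree import Tree

# Penn Treebank clause labels mapped to a coarse mood (heuristic assumption)
ROOT_LABEL_TO_MOOD = {
    "S": "declarative",
    "SINV": "declarative",
    "SQ": "interrogative",
    "SBARQ": "interrogative",
}


def mood_from_parse(bracketed_parse: str) -> str:
    """Classify a sentence by the label at the root of its bracketed parse."""
    root_label = Tree.fromstring(bracketed_parse).label()
    return ROOT_LABEL_TO_MOOD.get(root_label, "other")


# parse of "I am still waiting on my card?" copied from the PARSING section
sample_parse = (
    "(S (NP (PRP I)) (VP (VBP am) (ADVP (RB still)) (VP (VBG waiting) "
    "(PP (IN on) (NP (PRP$ my) (NN card))))) (. ?))"
)
sample_mood = mood_from_parse(sample_parse)
```

Note that the sample query ends with a question mark yet is rooted at `S`, so a punctuation-based rule and a syntax-based rule would disagree on it.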

### ANNOTATE

 1. Annotate VPs that look like an intent vs. those that do not
 2. Look at what makes them different
 3. Test a few hypotheses:
   - mood: declarative vs. interrogative syntax?
   - tense: present vs. past?
   - lexical: some verbs and not others?
   - semantics: direct vs. indirect object?
   - something else?
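
Each hypothesis above amounts to comparing the share of "yes" annotations across the levels of one factor. A minimal sketch with pandas follows; the rows are made up purely for illustration and are not results from this notebook, and `yes_rate_by` is a hypothetical helper.

```python
import pandas as pd

# Made-up rows for illustration only; not data from this notebook.
toy = pd.DataFrame({
    "mood": ["declarative", "declarative", "interrogative", "interrogative"],
    "annot": ["yes", "no", "yes", "yes"],
})


def yes_rate_by(df: pd.DataFrame, factor: str) -> pd.Series:
    """Share of 'yes' annotations within each level of `factor`."""
    return df["annot"].eq("yes").groupby(df[factor]).mean()


rates = yes_rate_by(toy, "mood")
```

The same helper could be reused for tense or verb lemma once those columns exist in the annotated dataframe.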

In [326]:
annots = []
# split the annotation path into stem and extension (reused when writing)
myfile, myext = os.path.splitext(catalog['annots'])
if prms['annotation'] == 'do':
    annots = annotate(data["VP"], options=["yes", "no"])
elif prms['annotation'] == 'load':
    annot_path = os.path.split(catalog['annots'])[0]
    files = [
        os.path.join(annot_path, file)
        for file in os.listdir(annot_path)
        if file.startswith('annots')
    ]
    # os.listdir order is arbitrary, so pick the most recently modified file
    latest_version = max(files, key=os.path.getmtime)
    annots = pd.read_excel(latest_version)
else:
    print('WARNING: you must either "load" or "do" annotations')

 Write annots

In [327]:
if prms['annotation']=='do' and not os.path.isfile(catalog['annots']):
    # add current time to filename
    now = datetime.now().strftime("%d_%m_%Y_%H_%M_%S")
    annots_df = pd.DataFrame(annots)
    annots_df.to_excel(f'{myfile}_{now}{myext}')
else:
    print('WARNING: Annots was not written. To write, delete existing and rerun.')
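
Because the timestamp in the filename is day-first, sorting annotation filenames alphabetically does not order them chronologically. A minimal sketch of the issue follows, with a year-first format shown as one hypothetical alternative; `stamp`, `DAY_FIRST` and `YEAR_FIRST` are illustrative names.

```python
from datetime import datetime


def stamp(moment: datetime, fmt: str) -> str:
    """Format a datetime as a filename-friendly timestamp."""
    return moment.strftime(fmt)


earlier = datetime(2020, 12, 31, 9, 0, 0)
later = datetime(2021, 1, 1, 9, 0, 0)

DAY_FIRST = "%d_%m_%Y_%H_%M_%S"   # same layout as the timestamp written above
YEAR_FIRST = "%Y_%m_%d_%H_%M_%S"  # hypothetical alternative

# with day-first stamps the alphabetically last entry is the OLDER file
day_first_sorted = sorted([stamp(earlier, DAY_FIRST), stamp(later, DAY_FIRST)])
# with year-first stamps alphabetical order matches chronological order
year_first_sorted = sorted([stamp(earlier, YEAR_FIRST), stamp(later, YEAR_FIRST)])
```

Switching formats would change the names of newly written files, so existing annotation files would need renaming for the ordering to hold across all of them.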



In [328]:
annots_df = annots.rename(columns={0:'text',1:'annot'})
# use .loc rather than chained indexing so the assignment reliably hits annots_df
annots_df.loc[annots_df['text'].isnull(), 'annot'] = np.nan

 **Fig. Queries are sorted by annotation result below.**

In [329]:
annots_df = annots_df.sort_values(by='annot', ascending=False)

In [330]:
annots_df

Unnamed: 0.1,Unnamed: 0,text,annot
78,78,help me track my card,yes
40,40,know when my new card is going to arrive,yes
22,22,track my card for me,yes
23,23,to track the card that was just sent to me,yes
25,25,"know if I will get my card , or if it is lost",yes
77,77,to track my card,yes
73,73,get my new card,yes
70,70,tell me where my card is ? I ordered it 2 weeks ago,yes
43,43,want to find out what happened to my new card,yes
86,86,to track the card you sent,yes


 **Fig. Proportion of "Good" intents.**

In [331]:
n_total = len(annots_df)
n_null = annots_df['annot'].isnull().sum()
n_yes = annots_df['annot'].eq('yes').sum()
n_no = annots_df['annot'].eq('no').sum()
stats = pd.DataFrame({
    'annot': ['null', 'yes', 'no','Total'], 
    'count': [n_null, n_yes, n_no, n_total],
    '%': [n_null/n_total, n_yes/n_total, n_no/n_total, 1]
    })
stats

Unnamed: 0,annot,count,%
0,null,53,0.346405
1,yes,26,0.169935
2,no,74,0.48366
3,Total,153,1.0


In [332]:
# jupyter nbconvert --no-input --to=pdf 2_Intent_parsing.ipynb



In [336]:
n_total = len(annots_df)
n_null = annots_df['annot'].isnull().sum()
n_yes = annots_df['annot'].eq('yes').sum()
n_no = annots_df['annot'].eq('no').sum()
stats = pd.DataFrame({
    'annot': ['null', 'yes', 'no','Total'], 
    'count': [n_null, n_yes, n_no, n_total],
    '%': [n_null/n_total*100, n_yes/n_total*100, n_no/n_total*100, 100]
    })
stats

Unnamed: 0,annot,count,%
0,null,53,34.640523
1,yes,26,16.993464
2,no,74,48.366013
3,Total,153,100.0
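
Building the labels and counts from a single `value_counts` call keeps them aligned by construction, which avoids ordering mistakes between parallel lists. A minimal sketch follows, using the counts reported above (26 "yes", 74 "no", 53 null) to rebuild the annotation column; `annotation_stats` is a hypothetical helper.

```python
import pandas as pd


def annotation_stats(annot: pd.Series) -> pd.DataFrame:
    """Count each annotation value (missing shown as 'null') with its percentage."""
    counts = annot.fillna("null").value_counts()
    stats = counts.rename_axis("annot").reset_index(name="count")
    stats["%"] = stats["count"] / stats["count"].sum() * 100
    return stats


# annotation column rebuilt from the counts in the table above
annot = pd.Series(["yes"] * 26 + ["no"] * 74 + [None] * 53, dtype=object)
stats = annotation_stats(annot)
pct = dict(zip(stats["annot"], stats["%"]))
```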
