## The dataset

The [E2E NLG challenge dataset](https://github.com/tuetschek/e2e-dataset) consists of 50k pairs of natural language texts (NLs) and their meaning representations (MRs). For example:

- MR:

```
name[The Eagle],
eatType[coffee shop],
food[French],
priceRange[moderate],
customerRating[3/5],
area[riverside],
kidsFriendly[yes],
near[Burger King]
```

- NL:


```
The three star coffee shop, The Eagle, gives families a mid-priced dining experience featuring a variety of wines and cheeses. Find The Eagle near Burger King.
```

This example is taken from the page on the [E2E NLG challenge](http://www.macs.hw.ac.uk/InteractionLab/E2E/). The objective of this challenge is to generate a text given its meaning representation. 

## Objective of this notebook

In this notebook, we train a classifier to label a text with the MR attribute-values it expresses. Given the text above as input, the classifier should output `eatType[coffee shop]`, `food[French]`, `priceRange[moderate]`, `customerRating[3/5]` and `kidsFriendly[yes]`. It should not output `area[riverside]`, because that attribute is not verbalized (the human writers sometimes omitted some information).
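
To make the prediction target concrete, here is a minimal sketch of how a comma-separated MR can be turned into a multi-hot label vector. The helper names `build_classes` and `encode` are illustrative and are not part of this notebook, which lets fastai build the labels from the `mr` column.

```python
def build_classes(mr_strings):
    """Collect the sorted set of attribute-value labels seen in the MRs."""
    labels = set()
    for mr in mr_strings:
        labels.update(part.strip() for part in mr.split(",") if part.strip())
    return sorted(labels)


def encode(mr, classes):
    """Return a multi-hot vector: 1 where the label appears in the MR, else 0."""
    present = {part.strip() for part in mr.split(",")}
    return [1 if label in present else 0 for label in classes]


mrs = [
    "eatType[coffee shop],food[French]",
    "eatType[pub],food[French],kidsFriendly[yes]",
]
classes = build_classes(mrs)
vectors = [encode(mr, classes) for mr in mrs]
for mr, vector in zip(mrs, vectors):
    print(vector, mr)
```

Each position in the vector corresponds to one attribute-value pair, so a text can carry any number of labels at once.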

## Aligning a text with its meaning representation in Natural Language Generation: what for?

According to the E2E NLG challenge organizers (see [that paper](https://aclweb.org/anthology/W18-6539)), the best results were achieved by [this system](https://aclweb.org/anthology/N18-1014). To improve their results, the authors of that approach rerank the NLG outputs by looking for slots in the MR that are missing from the output text (false negatives) and slots in the output text that are missing from the input MR (false positives).

This is done by a slot aligner that aligns each sentence in the output with a subset of MR slots. The slot alignment approach uses heuristics based on a gazetteer, a set of handwritten rules, and access to WordNet to augment the gazetteer with related terms (e.g., "italian" and "pasta"). The authors also use the slot alignment to generate new data by taking individual sentences and their aligned slots as new input pairs.

In this notebook, we align whole texts with the MRs, not individual sentences. Sentence-level alignment is left for future work.
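
The reranking idea described above can be sketched as follows. This is only an illustration under stated assumptions: the slots of each candidate text are taken as already extracted, missing and spurious slots are weighted equally, and the function names `slot_errors` and `rerank` are made up for this example. The cited system's actual scoring may differ.

```python
def slot_errors(mr_slots, text_slots):
    """Count slots missing from the text plus slots not licensed by the MR."""
    missing = mr_slots - text_slots    # false negatives
    spurious = text_slots - mr_slots   # false positives
    return len(missing) + len(spurious)


def rerank(mr_slots, candidates):
    """Order (text, slots) candidates by ascending slot error count."""
    return sorted(candidates, key=lambda cand: slot_errors(mr_slots, cand[1]))


mr = {"eatType[coffee shop]", "food[French]", "area[riverside]"}
candidates = [
    ("Xxx is a coffee shop.", {"eatType[coffee shop]"}),
    ("Xxx is a French coffee shop by the river.",
     {"eatType[coffee shop]", "food[French]", "area[riverside]"}),
    ("Xxx is a cheap French coffee shop.",
     {"eatType[coffee shop]", "food[French]", "priceRange[cheap]"}),
]
best_text, best_slots = rerank(mr, candidates)[0]
print(best_text)
```

A classifier like the one trained in this notebook could supply the `text_slots` side of this comparison.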

## Results and discussion

The approach in this notebook relies on the [fastai](https://www.fast.ai/) library and its transfer-learning approach to text classification, and gets an **f-score of 89-90% over all the labels on the test set**.

The texts were delexicalized for venue names, which raised the test set f-score (from 85% to 90%), since we don't want labels to depend on the names of venues. On the other hand, the venue type `restaurant` has a very low f-score on the test set, whilst being high in both the training and validation sets. It is not clear why this happens.

Qualitative analysis of the text and MR pairs with the most incorrect predictions in the training set reveals 3 main issues with the dataset: some MRs are not verbalized, some texts verbalize different MRs, and some MRs are so close in meaning that they are indistinguishable in the text.

Note: This notebook is based on [this notebook on GitHub](https://github.com/krasing/multilabel-ULMFiT/blob/master/asrs_new-factors-clean.ipynb), which does multi-label classification.

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import pandas as pd
import re
import numpy as np
from fastai import * # notebook was run with fastai 1.0.51
from fastai.text import *
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.metrics import classification_report

pd.set_option('display.max_colwidth', -1)

# by setting a random seed number, we'll ensure that when doing language model, same training-validation split is used.
np.random.seed(42) 


In [None]:
path = Path('../input')

Given the E2E NLG challenge dataset, we want to detect which content (MR attribute-values) a text expresses.

## Getting the data

In [None]:
df = pd.read_csv(path/"trainset.csv")
df_test = pd.read_csv(path/"testset_w_refs.csv")
df_dev = pd.read_csv(path/"devset.csv")
print(df.shape)
print(df_dev.shape)
print(df_test.shape)
df.head()

We remove the values of `name` and `near` from the features, as they are open-ended and normally appear in the text as a strict string match.

We drop the `name` feature-value altogether, as it is in every MR and string-based. We replace the `near` feature value with `near[yes]` to indicate when it is verbalized.

In [None]:
import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

In [None]:
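def delexicalize(attribute,value,new_value,new_row,row):
    # Replace each casing/accent variant of `value` in the text with the placeholder `new_value`.
    # re.escape makes the venue name match literally, even if it contains regex metacharacters.
    new_row["ref"] = re.sub(re.escape(value),new_value,new_row["ref"])
    new_row["ref"] = re.sub(re.escape(value.lower()),new_value.lower(),new_row["ref"])
    new_row["ref"] = re.sub(re.escape(strip_accents(value.lower())),new_value.lower(),new_row["ref"])
    new_row["ref"] = re.sub(re.escape(strip_accents(value)),new_value,new_row["ref"])
    # first letter kept as is, rest lowercased
    value0=value[0]+value[1:].lower()
    new_row["ref"] = re.sub(re.escape(value0),new_value,new_row["ref"])
    new_row["ref"] = re.sub(re.escape(strip_accents(value0)),new_value,new_row["ref"])
    # first letter lowercased, rest kept as is
    value0=value[0].lower()+value[1:]
    new_row["ref"] = re.sub(re.escape(value0),new_value,new_row["ref"])
    new_row["ref"] = re.sub(re.escape(strip_accents(value0)),new_value,new_row["ref"])
    return new_row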
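
As an illustration of what delexicalization does to a text, here is a compact, self-contained variant. It is a simplification and not the function used in this notebook: it matches case-insensitively and always inserts the placeholder in its original casing, and the name `delexicalize_text` is made up for this sketch.

```python
import re
import unicodedata


def strip_accents(s):
    """Remove combining accent marks, e.g. 'Café' -> 'Cafe'."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")


def delexicalize_text(text, value, placeholder):
    """Replace accented and unaccented forms of `value`, ignoring case."""
    for variant in {value, strip_accents(value)}:
        # re.escape treats the venue name as a literal string
        text = re.sub(re.escape(variant), placeholder, text, flags=re.IGNORECASE)
    return text


text = "The Eagle is near Café Rouge. Families like the eagle."
text = delexicalize_text(text, "The Eagle", "Xxx")
text = delexicalize_text(text, "Café Rouge", "Yyy")
print(text)
```

After this step, the classifier sees `Xxx` and `Yyy` instead of venue names, so its labels cannot depend on them.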

In [None]:
from nltk import sent_tokenize
def process_features(df):
    rows = []
    for i,row in df.iterrows():
        row0 = row.to_dict()
        row0["ref"] = re.sub("  +"," ",row0["ref"])
        row0["mr"] = re.sub("  +"," ",row0["mr"])
        name = re.sub(r"^.*name\[([^\]]+)\].*$",r"\1",row0["mr"].strip())
        near = re.sub(r"^.*near\[([^\]]+)\].*$",r"\1",row0["mr"].strip())
        name = re.sub("  +"," ",name)
        near = re.sub("  +"," ",near)
        row0 = delexicalize("name",name,"Xxx",row0,row)
        row0 = delexicalize("near",near,"Yyy",row0,row)
        row0["mr"] = re.sub(r"name\[[^\]]+\](, *| *$)","",row0["mr"].strip())
        row0["mr"] = re.sub(r"near\[[^\]]+\](, *| *$)",r"near[yes]\1",row0["mr"].strip())
        row0["mr"] = re.sub(r", *$","",row0["mr"].strip())
        row0["mr"] = re.sub(r" *, *",",",row0["mr"].strip())
        row0["mr"] = row0["mr"].strip()
        if row["ref"]==row0["ref"]:
            continue
        rows.append(row0)
    return pd.DataFrame(rows)
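
To see what the MR rewriting inside `process_features` produces, the same regular expressions can be applied to a single MR string. The wrapper name `normalize_mr` is introduced here only for illustration.

```python
import re


def normalize_mr(mr):
    """Drop name[...], replace near[...] by near[yes], and tidy separators."""
    mr = re.sub(r"  +", " ", mr).strip()
    mr = re.sub(r"name\[[^\]]+\](, *| *$)", "", mr).strip()
    mr = re.sub(r"near\[[^\]]+\](, *| *$)", r"near[yes]\1", mr).strip()
    mr = re.sub(r", *$", "", mr).strip()      # trailing comma left by a removal
    return re.sub(r" *, *", ",", mr).strip()  # no spaces around separators


example = "name[The Eagle], eatType[coffee shop], food[French], near[Burger King]"
print(normalize_mr(example))
# eatType[coffee shop],food[French],near[yes]
```

The resulting string can be split on `,` to obtain the labels of a text.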

In [None]:
df=process_features(df)
df_dev=process_features(df_dev)
df_test=process_features(df_test)
print(df.shape)
print(df_dev.shape)
print(df_test.shape)
df.head()

## Some statistics about the data

In [None]:
from nltk.tokenize import sent_tokenize
rows=[]
for i,row in df.iterrows():
    mrs = row["mr"].split(",")
    sents = sent_tokenize(row["ref"])
    for mr in mrs:
        row[mr]=1
        if not mr.startswith("near") and not mr.startswith("name"):
            feature_name = re.sub(r"^([^\[]+)\[.*$",r"\1",mr.strip())
            row[feature_name]=1
    row["num_mrs"]=len(mrs)
    row["num_sents"]=len(sents)
    rows.append(row)

In [None]:
df_stats = pd.DataFrame(rows)
df_stats = df_stats.fillna(0)
df_stats.head(5)

In [None]:
stats = {}
df_sample = df_stats
rows =[]
for col in df_sample.columns:
    row={}
    if df_stats[col].dtype == np.float64:
        if "[" not in col:
            row["feature"]="_"+col
        else:
            row["feature"]=col
        row["num"]=df_sample[col].sum()
        row["mean"]=df_sample[col].mean()
        row["std"]=df_sample[col].std()
        rows.append(row)
    elif df_sample[col].dtype == np.int64:
        row["feature"]="__"+col
        row["num_1"] = (df_sample.loc[df_sample[col]==1]).shape[0]
        row["num"]=df_sample[col].sum()
        row["mean"]=df_sample[col].mean()
        row["min"]=df_sample[col].min()
        row["max"]=df_sample[col].max()
        row["std"]=df_sample[col].std()
        row["median"]=df_sample[col].median()
        rows.append(row)
df_stats0 = pd.DataFrame(rows)
df_stats0 = df_stats0.sort_values(by="feature")
df_stats0

## Fine tuning the language model

For fine-tuning the language model, we use the training, validation and test sets, since this step does not use the labels.

In [None]:
df_all = pd.concat([df, df_dev,df_test], ignore_index=True)
df_all.shape

In [None]:
bs = 56

In [None]:
df_all.sample(5)

In [None]:
data_lm = (TextList.from_df(df_all, ".", cols='ref')
                .split_by_rand_pct(0.1)
                .label_for_lm()
                .databunch(bs=bs))

In [None]:
data_lm.show_batch()

In [None]:
learn_lm = language_model_learner(data_lm, arch=AWD_LSTM, drop_mult=1e-7)

In [None]:
learn_lm.freeze()

In [None]:
learn_lm.lr_find()
learn_lm.recorder.plot()

In [None]:
learn_lm.fit_one_cycle(1, 1e-02, moms=(0.8,0.7))

In [None]:
learn_lm.unfreeze()

In [None]:
learn_lm.lr_find()
learn_lm.recorder.plot(suggestion=True)

In [None]:
learn_lm.fit_one_cycle(4, 1e-03, moms=(0.8,0.7),wd=0.3)

In [None]:
learn_lm.recorder.plot_losses()

In [None]:
learn_lm.save('fine_tuned')
learn_lm.save_encoder('fine_tuned_enc')

We get the generative language model to work, giving it some beginning of text for it to complete:

In [None]:
TEXT = "Near"
N_WORDS = 50
N_TEXTS = 2

In [None]:
print("\n".join(learn_lm.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_TEXTS)))

## Classification

In [None]:
bs = 56

In [None]:
def precision(log_preds, targs, thresh=0.5, epsilon=1e-8):
    pred_pos = (log_preds > thresh).float()
    tpos = torch.mul((targs == pred_pos).float(), targs.float())
    return (tpos.sum()/(pred_pos.sum() + epsilon))#.item()

In [None]:
def recall(log_preds, targs, thresh=0.5, epsilon=1e-8):
    pred_pos = (log_preds > thresh).float()
    tpos = torch.mul((targs == pred_pos).float(), targs.float())
    return (tpos.sum()/(targs.sum() + epsilon))
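
For reference, the quantities these two metrics compute can be written in plain Python. The sketch below is not the notebook's code. It adds a hypothetical `from_logits` switch because a threshold of 0.5 is only meaningful on probabilities; whether the metric functions receive probabilities or raw logits depends on the fastai version and loss function and is not verified here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def micro_precision_recall(scores, targets, thresh=0.5, from_logits=False):
    """Micro-averaged precision and recall over all (example, label) cells."""
    tp = fp = fn = 0
    for row_scores, row_targets in zip(scores, targets):
        for score, target in zip(row_scores, row_targets):
            prob = sigmoid(score) if from_logits else score
            predicted = prob > thresh
            if predicted and target:
                tp += 1
            elif predicted:
                fp += 1
            elif target:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


scores = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]]
targets = [[1, 0, 0], [0, 1, 1]]
precision_value, recall_value = micro_precision_recall(scores, targets)
print(precision_value, recall_value)
```

Here two of the three predicted labels are correct and two of the three true labels are found, so both values are 2/3.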

In [None]:
data_clas = TextClasDataBunch.from_df(".", train_df=df, valid_df=df_dev, 
                                  vocab=data_lm.vocab, 
                                  text_cols='ref', 
                                  label_cols='mr',
                                  label_delim=',',
                                  bs=bs)

In [None]:
data_clas.show_batch()

In [None]:
print(len(data_clas.valid_ds.classes))
data_clas.valid_ds.classes

In [None]:
learn = text_classifier_learner(data_clas, arch=AWD_LSTM,drop_mult=1e-7)
learn.metrics = [accuracy_thresh, precision, recall]
learn.load_encoder('fine_tuned_enc')

In [None]:
learn.freeze()

In [None]:
learn.lr_find()
learn.recorder.plot(suggestion=True)

In [None]:
learn.fit_one_cycle(10, 1E-02, moms=(0.8,0.7),wd=1e-7)

In [None]:
learn.recorder.plot_losses()

In [None]:
learn.save("stage1")

In [None]:
learn = text_classifier_learner(data_clas, arch=AWD_LSTM,drop_mult=1e-7)
learn = learn.load("stage1")
learn.metrics = [accuracy_thresh, precision, recall]

Next, we unfreeze the whole model and train some more. I did not find that unfreezing the last 2 layers first made any improvement.

In [None]:
learn.unfreeze()

In [None]:
learn.lr_find()
learn.recorder.plot(suggestion=True)

In [None]:
learn.fit_one_cycle(5, slice(1E-03/(2.6**4),1E-03), moms=(0.8,0.7), wd=0.5)
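
The `slice(1E-03/(2.6**4),1E-03)` argument requests discriminative learning rates: lower rates for the early layer groups and higher rates for the later ones. As an illustration only, and assuming geometric spacing over five layer groups (both the spacing rule and the number of groups are assumptions about fastai's behaviour that this notebook does not verify), the per-group rates would be:

```python
def geometric_lrs(start, stop, n_groups):
    """Spread learning rates geometrically from start to stop over n_groups."""
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]


lrs = geometric_lrs(1e-3 / 2.6 ** 4, 1e-3, 5)
for group, lr in enumerate(lrs):
    print(f"layer group {group}: lr = {lr:.2e}")
```

Under these assumptions, each layer group trains with a rate 2.6 times larger than the group before it.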

In [None]:
learn.save("classifier_model",return_path=True, with_opt=True)

In [None]:
learn.recorder.plot_losses()

## Quantitative evaluation

In [None]:
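def make_predictions(model_name,df_train,df_valid,vocab,bs):
    # Build a databunch whose "validation" set is the split we want predictions for.
    # If `vocab` is given, it is reused so that token ids match those seen during training;
    # with vocab=None, fastai builds a new vocabulary from df_train.
    data_clas = TextClasDataBunch.from_df(".", train_df=df_train, valid_df=df_valid, 
                                      vocab=vocab,
                                      text_cols='ref', 
                                      label_cols='mr',
                                      label_delim=',',
                                      bs=bs)
    learn = text_classifier_learner(data_clas, arch=AWD_LSTM)
    learn.load(model_name)
    learn.data = data_clas
    # ordered=True returns predictions in the same order as the rows of df_valid
    preds, y = learn.get_preds(ordered=True)
    return learn,preds,y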

In [None]:
learn_train,preds_train,y_train = make_predictions("classifier_model",df,df,None,bs)
learn_valid,preds_valid,y_valid = make_predictions("classifier_model",df,df_dev,None,bs)
learn_test,preds_test,y_test = make_predictions("classifier_model",df,df_test,None,bs)

In [None]:
f1_train = f1_score(y_train, preds_train>0.5, average='micro')
f1_valid = f1_score(y_valid, preds_valid>0.5, average='micro')
f1_test = f1_score(y_test, preds_test>0.5, average='micro')
f1_train,f1_valid,f1_test

We can also look in more detail at the performance on each label:

In [None]:
y_true_train = y_train.numpy()
scores_train = preds_train.numpy()
report = classification_report(y_true_train, scores_train>0.5, target_names=data_clas.valid_ds.classes)
print(report)

In [None]:
y_true_valid = y_valid.numpy()
scores_valid = preds_valid.numpy()
report = classification_report(y_true_valid, scores_valid>0.5, target_names=data_clas.valid_ds.classes)
print(report)

In [None]:
y_true_test = y_test.numpy()
scores_test = preds_test.numpy()
report = classification_report(y_true_test, scores_test>0.5, target_names=data_clas.valid_ds.classes)
print(report)

The f-score on the test set is near 90%. For some reason, the classifier is poor at finding `eatType[restaurant]` mentions, with an f-score of 2%.

In the validation set, there are no instances of `eatType[restaurant]` and `eatType[pub]`, nor of `food[Fast food],food[French],food[Indian],food[Italian],food[Japanese]`.

On the training set, the f-score is 95%, with high f-scores on individual labels.

Why is detection of eating venue type so poor on the test set?
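
One way to investigate such gaps is to count how often each label occurs in each split. The sketch below uses toy MR strings, and the helper names `label_support` and `labels_missing_from` are illustrative. It does not answer the question above, it only shows how the label distributions could be compared.

```python
from collections import Counter


def label_support(mr_strings):
    """Count how many MRs contain each attribute-value label."""
    counts = Counter()
    for mr in mr_strings:
        counts.update(part.strip() for part in mr.split(",") if part.strip())
    return counts


def labels_missing_from(reference_counts, other_counts):
    """Labels that occur in the reference split but never in the other one."""
    return sorted(label for label in reference_counts if other_counts[label] == 0)


train_mrs = ["eatType[restaurant],food[French]", "eatType[pub],food[Italian]"]
dev_mrs = ["food[French]", "food[Italian],eatType[pub]"]
train_counts = label_support(train_mrs)
dev_counts = label_support(dev_mrs)
print(labels_missing_from(train_counts, dev_counts))
```

With the dataframes of this notebook, the inputs would be the `mr` columns, for example `label_support(df["mr"])` and `label_support(df_test["mr"])`.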

## Qualitative evaluation

We merge the training examples with their predictions in the same dataframe and sort the rows by ascending f-score, so that we can view the worst predictions first:

In [None]:
learn,preds,y = make_predictions("classifier_model",df,df,None,bs)

In [None]:
f1_score(y, preds>0.5, average='micro')

In [None]:
def set_row_metrics(row,true_mrs,predicted_mrs):
        tp=0
        fp=0
        tn=0
        fn=0
        for mr in predicted_mrs:
            if mr in true_mrs:
                tp+=1
            else:
                fp+=1
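        for mr in true_mrs:
            if mr not in predicted_mrs:
                fn+=1
            else:
                # NOTE: this counts labels present in both lists, so it duplicates tp;
                # real true negatives would need the full label set, which is not passed in.
                tn+=1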
        row["tp"]=tp
        row["fp"]=fp
        row["fn"]=fn
        row["tn"]=tn
        row["precision"]=0
        row["recall"]=0
        row["fscore"]=0
        if tp+fp>0:
            row["precision"]=float(tp)/(tp+fp)
        if tp+fn>0:
            row["recall"]=float(tp)/(tp+fn)
        if row["precision"]+row["recall"]>0:
            row["fscore"]= 2*((row["precision"]*row["recall"])/(row["precision"]+row["recall"]))
        return row

In [None]:
def set_labels(df,preds,classes):
    preds_true = (preds>0.5)
    counter=0
    rows=[]
    for i,row in df.iterrows():
        # labels whose predicted score exceeds the threshold for this row
        indices = [j for j in range(len(preds_true[counter])) if preds_true[counter][j]==True]
        row_labels = [classes[j] for j in indices]
        row["mr_predict"]=",".join(sorted(row_labels))
        true_mrs = row["mr"].split(",")
        row["mr"]=",".join(sorted(true_mrs))
        # gold labels first, predictions second, so that precision and recall are not swapped
        row = set_row_metrics(row,true_mrs,row_labels)
        rows.append(row)
        counter=counter+1
    return pd.DataFrame(rows)

In [None]:
learn.data.valid_ds.classes

In [None]:
df_preds = set_labels(df,preds,learn.data.valid_ds.classes)

We find that over a third of training instances are a perfect match:

In [None]:
df_preds[df_preds["fscore"]==1].shape[0]/df_preds.shape[0]

We sort the dataframe containing the true MRs, the predicted MRs and the corresponding texts, together with the f-score and other metrics, in ascending order of f-score, so as to do some error analysis:

In [None]:
df_preds = df_preds.sort_values(by=["fscore","precision","recall"],ascending=True)

In [None]:
df_preds[(df_preds["fscore"]<1) & (df_preds["fscore"]>0)].head(5)

When looking at the texts with low f-score prediction, the following problems appear:

1. Some MRs are altogether incorrectly verbalized or sloppily verbalized. For example we have the following true MRs:
```
area[riverside],customer rating[1 out of 5],food[Fast food],near[yes],priceRange[high]	
```
It gets verbalized as:
```
Alimentum is a one star restaurant near the Yippee Noodle Bar
```
So `fast food` has been verbalized as `restaurant` which is technically true (we can say "a fast food restaurant").

2. Some MRs are not verbalized, as in the example above, where the fact that the venue is by the riverside is not mentioned.
3. Quantitative customer ratings and price ranges like `customerRating[1 out of 5]` are sometimes verbalized qualitatively, which the classifier then analyses as `customerRating[low]`. So it seems that one can generate from those MRs, but for evaluation more flexible MRs should be considered: if the text says 'low customer rating', then the MR can be either `customerRating[low]` or `customerRating[1 out of 5]`. For example, we have the following true MRs:
```
eatType[restaurant],familyFriendly[yes],food[Japanese],priceRange[less than £20]
```
It gets verbalized as:
```
Loch Fyne is a cheap family friendly Japanese restaurant.
```
which gets classified as:
```
eatType[restaurant],familyFriendly[yes],food[Japanese],priceRange[cheap]
```
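
The more flexible evaluation suggested in point 3 could be sketched by mapping near-synonymous labels to a shared canonical form before comparing. The equivalence table below is a hypothetical assumption made for illustration and does not come from the dataset or the challenge.

```python
# Hypothetical equivalences: which labels count as "the same" is an assumption.
EQUIVALENT = {
    "priceRange[less than £20]": "priceRange[cheap]",
    "customer rating[1 out of 5]": "customer rating[low]",
}


def canonical(label):
    """Map a label to its canonical form, leaving unknown labels unchanged."""
    return EQUIVALENT.get(label, label)


def lenient_match(true_mrs, predicted_mrs):
    """True when both label lists agree after canonicalization."""
    return {canonical(m) for m in true_mrs} == {canonical(m) for m in predicted_mrs}


true_mrs = ["eatType[restaurant]", "familyFriendly[yes]", "food[Japanese]",
            "priceRange[less than £20]"]
predicted_mrs = ["eatType[restaurant]", "familyFriendly[yes]", "food[Japanese]",
                 "priceRange[cheap]"]
print(lenient_match(true_mrs, predicted_mrs))
```

With such a mapping, the Loch Fyne example above would count as a correct prediction rather than an error.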


For each prediction, we can count how many attributes of the original MR are missing, how many are mismatched (the original says Italian food and the prediction says French food), and how many attributes the prediction added:

In [None]:
def convert_to_dict(features):
    d = {}
    features = features.split(",")
    for f in features:
        name = (re.sub(r"^([^\[]+)\[([^\]]+)\]$",r"\1",f)).strip()
        value = (re.sub(r"^([^\[]+)\[([^\]]+)\]$",r"\2",f)).strip()
        if name not in d.keys():
            d[name]=set()
        d[name].add(value)
    return d

In [None]:
rows=[]
for i,row in df_preds.iterrows():
    row0=row
    mrs = convert_to_dict(row["mr"])
    mrs_predict = convert_to_dict(row["mr_predict"])
    missing=0
    mismatch=0
    added=0
    for feature in mrs.keys():
        if feature not in mrs_predict.keys():
            missing+=1
        else:
            for value in mrs[feature]:
                if value not in mrs_predict[feature]:
                    mismatch+=1
                    break
    for feature in mrs_predict.keys():
        if feature not in mrs.keys():
            added+=1
    row0["missing"]=missing
    row0["mismatch"]=mismatch
    row0["added"]=added
    rows.append(row0)
    
pd_preds0 = pd.DataFrame(rows)
pd_preds0.sample(5)
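
The three counts can be checked on a single pair of MR strings with a compact, self-contained variant of the logic above. The function names `to_dict` and `compare` are made up for this sketch. Unlike `convert_to_dict`, it parses with `re.findall`, so an empty prediction yields an empty dictionary.

```python
import re
from collections import defaultdict


def to_dict(mr):
    """Map each attribute name to the set of its values."""
    d = defaultdict(set)
    for name, value in re.findall(r"([^\[,]+)\[([^\]]+)\]", mr):
        d[name.strip()].add(value.strip())
    return d


def compare(true_mr, predicted_mr):
    """Return (missing, mismatch, added) attribute counts."""
    true_d, pred_d = to_dict(true_mr), to_dict(predicted_mr)
    missing = sum(1 for name in true_d if name not in pred_d)
    mismatch = sum(1 for name in true_d
                   if name in pred_d and not true_d[name] <= pred_d[name])
    added = sum(1 for name in pred_d if name not in true_d)
    return missing, mismatch, added


true_mr = "eatType[restaurant],familyFriendly[yes],food[Japanese],priceRange[less than £20]"
predicted_mr = "eatType[restaurant],familyFriendly[yes],food[Japanese],priceRange[cheap]"
print(compare(true_mr, predicted_mr))
# (0, 1, 0)
```

The Loch Fyne example therefore counts as one mismatch (the price range), with nothing missing and nothing added.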