Introduction to NLG's Narrative API
===================================

This notebook introduces Gramex NLG's Narrative API. We will learn how to create data-driven narratives with the NLG module by going over the building blocks of the API.

Getting Started
---------------

If the NLG module is not installed, install it as follows:

```bash
$ pip install nlg
```

Test the installation by running the following cell:

In [1]:
from nlg.search import templatize
import pandas as pd

Next, let's load some data. For this tutorial, we will use the [actors dataset](https://raw.githubusercontent.com/gramener/gramex-nlg/master/nlg/tests/data/actors.csv). Please download the file and load it as a pandas dataframe.

In [2]:
# Replace the path with wherever you have downloaded the dataset.
df = pd.read_csv('../nlg/tests/data/actors.csv')
df

Unnamed: 0,category,name,rating,votes
0,Actors,Humphrey Bogart,0.570197,109
1,Actors,Cary Grant,0.438602,142
2,Actors,James Stewart,0.988374,120
3,Actors,Marlon Brando,0.102045,108
4,Actors,Fred Astaire,0.208877,84
5,Actresses,Katharine Hepburn,0.039188,63
6,Actresses,Bette Davis,0.282807,14
7,Actresses,Audrey Hepburn,0.120197,94
8,Actresses,Ingrid Bergman,0.29614,52
9,Actors,Spencer Tracy,0.466311,192


Let us now sort the dataframe by the `rating` column. NLG is designed to work with Gramex's [FormHandler](https://learn.gramener.com/guide/formhandler), so we will use FormHandler's own DSL to apply any transformation to the dataset.

In [3]:
from gramex.data import filter as gfilter  # aliased to avoid shadowing Python's built-in `filter`
sort_args = {'_sort': ['-rating']}

Note that the `_sort` key in the dictionary above tells Gramex to sort a dataframe by the given columns. The value of the key is a _list_, which means a dataframe can be sorted by multiple columns. The hyphen before the column name indicates that the sorting is _descending_.
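
To make the convention concrete, here is a minimal plain-Python sketch of how a `_sort` list could be interpreted. It is not Gramex's implementation; the helper names `parse_sort` and `sort_rows` are hypothetical, and the three rows are copied from the table above.

```python
def parse_sort(sort_spec):
    """Turn ['-rating', 'name'] into [('rating', False), ('name', True)]."""
    parsed = []
    for key in sort_spec:
        if key.startswith('-'):
            parsed.append((key[1:], False))  # leading hyphen: descending
        else:
            parsed.append((key, True))       # no hyphen: ascending
    return parsed


def sort_rows(rows, sort_spec):
    """Sort a list of dicts by several columns, honouring each direction."""
    result = list(rows)
    # Python's sort is stable, so apply the keys from last to first.
    for column, ascending in reversed(parse_sort(sort_spec)):
        result.sort(key=lambda row: row[column], reverse=not ascending)
    return result


rows = [
    {'name': 'Humphrey Bogart', 'rating': 0.570197, 'votes': 109},
    {'name': 'Cary Grant', 'rating': 0.438602, 'votes': 142},
    {'name': 'James Stewart', 'rating': 0.988374, 'votes': 120},
]
top = sort_rows(rows, ['-rating'])[0]['name']
print(top)
```

With `['-rating']`, the row with the largest rating comes first, which mirrors what the hyphen asks FormHandler to do.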

In [4]:
xdf = gfilter(df, sort_args.copy())

In [5]:
xdf.head()

Unnamed: 0,category,name,rating,votes
2,Actors,James Stewart,0.988374,120
0,Actors,Humphrey Bogart,0.570197,109
9,Actors,Spencer Tracy,0.466311,192
1,Actors,Cary Grant,0.438602,142
8,Actresses,Ingrid Bergman,0.29614,52


Now, let's write something about this dataset. It is apparent that James Stewart has the highest rating.

In [6]:
from nlg.utils import load_spacy_model
nlp = load_spacy_model()

text = nlp("James Stewart is the actor with the highest rating.")

The entry-point into the NLG module is the [`nlg.search.templatize`](https://github.com/gramener/gramex-nlg/blob/dev/nlg/search.py#L478) function. This function uses:
* a dataframe
* operations on the dataframe (as FormHandler arguments)
* some text about the dataset

to create a [`Nugget`](https://github.com/gramener/gramex-nlg/blob/dev/nlg/narrative.py#L102) object. To learn more about the `Nugget` object and its methods, see the [README](https://github.com/gramener/gramex-nlg/tree/dev#glossary-grammar-of-data-driven-narratives).

In [7]:
nugget = templatize(text, sort_args, df)


In [8]:
nugget

{% set fh_args = {"_sort": ["-rating"]}  %}
{% set df = U.gfilter(orgdf, fh_args.copy()) %}
{% set fh_args = U.sanitize_fh_args(fh_args, orgdf) %}
{# Do not edit above this line. #}
{{ df["name"].iloc[0] }} is the {{ G.singular(df["category"].iloc[-2]).lower() }} with the highest rating.

As we see, a nugget has an underlying [Tornado template](https://www.tornadoweb.org/en/stable/template.html) which has been auto-generated by the `templatize` function. Let's see how well this template re-renders on the dataset.

In [9]:
print(nugget.render(df))

b'    James Stewart is the actor with the highest rating.'
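
The rendered value is a `bytes` object with leading whitespace. Assuming the bytes are UTF-8 encoded, they can be turned into a clean string as sketched below. The literal is copied from the output above rather than produced by NLG, so the snippet runs on its own.

```python
# Literal copied from the rendered output above; not produced by NLG here.
raw = b'    James Stewart is the actor with the highest rating.'

# Assumption: the rendered bytes are UTF-8 encoded.
sentence = raw.decode('utf-8').strip()  # decode to str, drop the padding
print(sentence)
```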


The text above is identical to the input text, but this time it is generated from a template. We can pass any dataframe to the [`.render`](https://github.com/gramener/gramex-nlg/blob/dev/nlg/narrative.py#L190) method of the nugget object, and the text will be rendered in the context of that data. To test this, let's create a copy of the dataframe and give all the artists a random rating.

In [10]:
import numpy as np
np.random.seed(12345)

fake_ratings = df.copy()
fake_ratings['rating'] = np.random.rand(df.shape[0])

Let's see who the top rated artist is in this new, fake dataset.

In [11]:
fake_ratings.sort_values('rating', ascending=False).head()

Unnamed: 0,category,name,rating,votes
6,Actresses,Bette Davis,0.964515,14
0,Actors,Humphrey Bogart,0.929616,109
8,Actresses,Ingrid Bergman,0.748907,52
10,Actors,Charlie Chaplin,0.747715,76
9,Actors,Spencer Tracy,0.65357,192


Now, let's see if our original nugget is able to adapt to this new dataset.

In [12]:
nugget.render(fake_ratings)

b'    Bette Davis is the actor with the highest rating.'

Clearly, that is false. Bette Davis is the _actress_ with the highest rating. To see what went wrong, let's take a look at the template again.

In [13]:
print(nugget.template)

{% set fh_args = {"_sort": ["-rating"]}  %}
{% set df = U.gfilter(orgdf, fh_args.copy()) %}
{% set fh_args = U.sanitize_fh_args(fh_args, orgdf) %}
{# Do not edit above this line. #}
{{ df["name"].iloc[0] }} is the {{ G.singular(df["category"].iloc[-2]).lower() }} with the highest rating.


As we can see, neither 'actor' nor 'actress' appears in the template. This means that the template generator has correctly figured out that this word depends on the transformed dataset. However, it has not managed to determine the correct expression for it.

Any token in the input text that is data-dependent is called a [`Variable`](https://github.com/gramener/gramex-nlg/blob/dev/nlg/narrative.py#L27). To see which words in a nugget are variables, take a look at the `.variables` attribute of the nugget.

In [14]:
nugget.variables

{James Stewart: {{ df["name"].iloc[0] }},
 actor: {{ G.singular(df["category"].iloc[-2]).lower() }}}

Two tokens from the original text, `"James Stewart"` and `"actor"`, have been identified as variables. However, the Python _expression_ for the second one is wrong: it reads the category from the second-to-last row (`.iloc[-2]`). Whether the highest rated artist is an actor or an actress needs to be found from the `"category"` column of the first row.
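
The failure mode can be reproduced without NLG. In the sketch below, the rows are hypothetical placeholders (not the actors dataset) and are assumed to be already sorted by rating in descending order. Position `-2` happens to give the right category for one ordering and the wrong one for another, while position `0` always describes the top row.

```python
def describe(sorted_rows, position):
    """Build the sentence using the category found at `position`."""
    name = sorted_rows[0]['name']
    category = sorted_rows[position]['category']
    return f"{name} is the {category} with the highest rating."


# Hypothetical rows, sorted by rating in descending order.
original = [
    {'name': 'A', 'category': 'actor'},
    {'name': 'B', 'category': 'actress'},
    {'name': 'C', 'category': 'actor'},
    {'name': 'D', 'category': 'actress'},
]
# The same people after the ratings change: an actress is now on top.
reshuffled = [original[1], original[0], original[2], original[3]]

print(describe(original, -2))    # right only by coincidence
print(describe(reshuffled, -2))  # wrong: category comes from another row
print(describe(reshuffled, 0))   # right: category comes from the top row
```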

To fix this, we can use the [`.set_expr`](https://github.com/gramener/gramex-nlg/blob/dev/nlg/narrative.py#L58) method of the respective variable. The `.set_expr` method accepts any valid Python expression as a string.

In [15]:
var = nugget.get_var('actor')

In [16]:
var.set_expr('df["category"].iloc[0]')

In [17]:
var

{{ G.singular(df["category"].iloc[0]).lower() }}

Now that we have fixed the variable, let's re-render the nugget on the fake dataset.

In [18]:
nugget.render(fake_ratings)

b'    Bette Davis is the actress with the highest rating.'

----

There is scope for yet more automation. Note that the last word in the text, "rating", matches the name of the column by which the dataframe has been sorted. Therefore, even that can be turned into a variable. Essentially, we want the template to render the name of whichever column is used to sort the data, in place of the word "rating".
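
The idea can be sketched in plain Python. The helpers `sorted_column` and `sentence` below are hypothetical, and the sketch simply assumes that the leading hyphen is dropped before the column name is shown. It is not the code behind `U.sanitize_fh_args`.

```python
def sorted_column(fh_args):
    """Return the first sort column without its direction marker."""
    # Assumption: a leading hyphen only marks descending order.
    return fh_args['_sort'][0].lstrip('-')


def sentence(name, category, fh_args):
    """Fill the sentence, taking the last word from the sort arguments."""
    return f"{name} is the {category} with the highest {sorted_column(fh_args)}."


print(sentence('James Stewart', 'actor', {'_sort': ['-rating']}))
print(sentence('Spencer Tracy', 'actor', {'_sort': ['-votes']}))
```

Because the last word is derived from the sort arguments, changing the sort column changes the sentence without touching the text itself.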

New variables can be added to a nugget using the [`.add_var`](https://github.com/gramener/gramex-nlg/blob/dev/nlg/narrative.py#L236) method of the nugget object, as follows:

In [19]:
var_token = text[-2]  # The spacy token corresponding to "rating"

In [20]:
var_expr = 'fh_args["_sort"][0]'  # The Python expression to detect the sorted column

In [21]:
nugget.add_var(var_token, expr=var_expr)
nugget

{% set fh_args = {"_sort": ["-rating"]}  %}
{% set df = U.gfilter(orgdf, fh_args.copy()) %}
{% set fh_args = U.sanitize_fh_args(fh_args, orgdf) %}
{# Do not edit above this line. #}
{{ df["name"].iloc[0] }} is the {{ G.singular(df["category"].iloc[0]).lower() }} with the highest {{ fh_args["_sort"][0] }}.

----
Let us now test a scenario where we sort the dataframe by votes.

In [22]:
nugget.fh_args = {'_sort': ['-votes']}
nugget

{% set fh_args = {"_sort": ["-votes"]}  %}
{% set df = U.gfilter(orgdf, fh_args.copy()) %}
{% set fh_args = U.sanitize_fh_args(fh_args, orgdf) %}
{# Do not edit above this line. #}
{{ df["name"].iloc[0] }} is the {{ G.singular(df["category"].iloc[0]).lower() }} with the highest {{ fh_args["_sort"][0] }}.

In [23]:
nugget.render(df)

b'    Spencer Tracy is the actor with the highest votes.'

---

Now we know how to create templates from raw text, and how to assign tokens within the text as data-dependent variables. In forthcoming examples, we will explore:

1. how to design more complex variable expressions, especially those that cannot be defined as short and simple Python strings
2. how to create longer narratives by putting together different nuggets.