# How Huggingface ```map``` behaves with ```batched=True```


This notebook uses the Amazon US Reviews dataset, ```Wireless_v1_00``` subset.

* Prompts developed for the Huggingface [Amazon US reviews dataset](https://huggingface.co/datasets/amazon_us_reviews), Wireless category, for multi-label classification.

* [bigscience/promptsource](https://huggingface.co/spaces/bigscience/promptsource)

Promptsource provides multiple templates per dataset (selectable in the **prompt name** box) for different ML tasks, e.g. multi-label classification and summarization.

<img src="./image/huggngface_promptsource.png" align="left"/>

In [1]:
from typing import (
    List,
    Dict,
)
import multiprocessing

import pandas as pd
from datasets import load_dataset
from promptsource.templates import (
    DatasetTemplates
)
from datasets.iterable_dataset import (
    IterableDataset
)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option("max_colwidth", None)
pd.set_option("max_seq_items", None)

In [2]:
NUM_CORES: int = multiprocessing.cpu_count()

# Huggingface Dataset for Prompts

In [3]:
DATASET_NAME: str = "amazon_us_reviews"
SUBSET_NAME: str = "Wireless_v1_00"

In [4]:
train: IterableDataset = load_dataset(
    path=DATASET_NAME, 
    name=SUBSET_NAME,
    split="train",
    streaming=True
)

In [5]:
example = list(train.take(1))[0]
print(example)

{'marketplace': 'US', 'customer_id': '16414143', 'review_id': 'R3W4P9UBGNGH1U', 'product_id': 'B00YL0EKWE', 'product_parent': '852431543', 'product_title': 'LG G4 Case Hard Transparent Slim Clear Cover for LG G4', 'product_category': 'Wireless', 'star_rating': 2, 'helpful_votes': 1, 'total_votes': 3, 'vine': 0, 'verified_purchase': 1, 'review_headline': 'Looks good, functions meh', 'review_body': "2 issues  -  Once I turned on the circle apps and installed this case,  my battery drained twice as fast as usual.  I ended up turning off the circle apps, which kind of makes the case just a case...  with a hole in it.  Second,  the wireless charging doesn't work.  I have a Motorola 360 watch and a Qi charging pad. The watch charges fine but this case doesn't. But hey, it looks nice.", 'review_date': '2015-08-31'}


In [29]:
columns = list(example.keys())
print(columns)

['marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 'product_category', 'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase', 'review_headline', 'review_body', 'review_date']


# Prompt Templates for the Huggingface Dataset

In [7]:
template_name: str = f"{DATASET_NAME}/{SUBSET_NAME}" if SUBSET_NAME is not None else DATASET_NAME
templates = DatasetTemplates(
    # The dataset_name should be known/accepted Huggingface dataset name.
    dataset_name=template_name   
)  
templates.all_template_names

['Generate review based on rating and category',
 'Generate review headline based on rating',
 'Generate review headline based on review body',
 'Given the review body return a categorical rating',
 'Given the review headline return a categorical rating']

In [8]:
template = templates['Given the review body return a categorical rating']
template.get_name()

'Given the review body return a categorical rating'

In [9]:
print(template.jinja)

Given the following review:
{{review_body}}
predict the associated rating from the following choices (1 being lowest and 5 being highest)
- {{ answer_choices | join('\n- ') }} 
|||
{{answer_choices[star_rating-1]}}


In [10]:
template.answer_choices

'1 ||| 2 ||| 3 ||| 4 ||| 5'
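
The ```|||``` separator appears twice above: in the Jinja template it separates the prompt from the target, and in ```answer_choices``` it separates the options. As a rough sketch in plain Python (not promptsource's actual implementation; ```pick_answer``` is a hypothetical helper introduced only for illustration), the target expression ```answer_choices[star_rating-1]``` amounts to something like:

```python
ANSWER_CHOICES: str = "1 ||| 2 ||| 3 ||| 4 ||| 5"  # value of template.answer_choices above


def pick_answer(answer_choices: str, star_rating: int) -> str:
    """Hypothetical helper mimicking {{answer_choices[star_rating-1]}}."""
    # Split the options on the "|||" separator and trim surrounding spaces.
    choices = [choice.strip() for choice in answer_choices.split("|||")]
    # star_rating is 1-based, list indices are 0-based.
    return choices[star_rating - 1]


print(pick_answer(ANSWER_CHOICES, star_rating=2))
```

For the example review shown earlier (```star_rating``` of 2), this selects the choice ```2```.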

# Chat (Prompt/Response)

Generate a chat by applying the template. 

**NOTE**: ```template.apply(example: Dict)``` takes a single dictionary, i.e. one example.

In [12]:
prompt, response = template.apply(example)

In [13]:
df = pd.DataFrame([(prompt, response)])
df.columns = ['prompt', 'response']
df

Unnamed: 0,prompt,response
0,"Given the following review:\n2 issues - Once I turned on the circle apps and installed this case, my battery drained twice as fast as usual. I ended up turning off the circle apps, which kind of makes the case just a case... with a hole in it. Second, the wireless charging doesn't work. I have a Motorola 360 watch and a Qi charging pad. The watch charges fine but this case doesn't. But hey, it looks nice.\npredict the associated rating from the following choices (1 being lowest and 5 being highest)\n- 1\n- 2\n- 3\n- 4\n- 5",2


In [14]:
print(prompt)

Given the following review:
2 issues  -  Once I turned on the circle apps and installed this case,  my battery drained twice as fast as usual.  I ended up turning off the circle apps, which kind of makes the case just a case...  with a hole in it.  Second,  the wireless charging doesn't work.  I have a Motorola 360 watch and a Qi charging pad. The watch charges fine but this case doesn't. But hey, it looks nice.
predict the associated rating from the following choices (1 being lowest and 5 being highest)
- 1
- 2
- 3
- 4
- 5


# Generate chat from dataset WITHOUT batch

With ```map(batched=False)```, the map function receives a single dictionary (one example) as its argument. Hence you **can apply** ```template.apply``` directly.

* [map()](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.map)

> If **batched is ```False```**, then the function **takes 1 example** in and should return 1 example.  
> An example is a dictionary, e.g. ```{"text": "Hello there !"}```.

In [88]:
def map_data_to_chat(row: Dict) -> Dict:
    """Huggingface Dataset map function to map data/row to chat. Must return a dictionary
    Return: dictionary
    """
    prompt, response = template.apply(row)
    return {        
        "chat": f"PROMPT:{prompt}\nRESPONSE:{response}\n\n"
    }

In [89]:
chats: IterableDataset = train.map(
    function=map_data_to_chat,
    batched=False,
    remove_columns=columns
)

In [90]:
list(chats.take(1))

[{'chat': "PROMPT:Given the following review:\n2 issues  -  Once I turned on the circle apps and installed this case,  my battery drained twice as fast as usual.  I ended up turning off the circle apps, which kind of makes the case just a case...  with a hole in it.  Second,  the wireless charging doesn't work.  I have a Motorola 360 watch and a Qi charging pad. The watch charges fine but this case doesn't. But hey, it looks nice.\npredict the associated rating from the following choices (1 being lowest and 5 being highest)\n- 1\n- 2\n- 3\n- 4\n- 5\nRESPONSE:2\n\n"}]

# Generate chat from dataset WITH batch

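With ```map(batched=True)```, the map function receives one dictionary of columns, where each value is a list with one entry per example, rather than one dictionary per example. Hence you **cannot apply** ```template.apply``` to it directly.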

* [map()](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.map)

> If **batched is ```True```** and ```batch_size``` is ```n > 1```, then the function takes **a batch of n examples** as input and can return a batch with n examples, or with an arbitrary number of examples. Note that the last batch may have less than n examples.  
> 
> A batch is a dictionary, e.g. a batch of n examples is ```{"text": ["Hello there !"] * n}```.

NOTE: The shape ```{"text": ["Hello there !"] * n}``` is easy to misread.

A batch is not a list of examples, on which we could use the Python built-in [map](https://docs.python.org/3/library/functions.html#map) to call ```template.apply``` on each dictionary:
```
[
    {
        "marketplace":"US",
        "customer_id":"16414143",
        "review_id":"R3W4P9UBGNGH1U",
        "product_id":"B00YL0EKWE",
        "product_parent":"852431543",
        ...
    },
    ...
]
```

Huggingface datasets passes the column-oriented dictionary below, to which we cannot apply ```template.apply``` directly:
```
{
    "marketplace": ["US", "US", "US", ....],
    "customer_id": ['16414143', '50800750', '15184378', ...],
    "review_id": ["R3W4P9UBGNGH1U",...],
    "product_id": ["B00YL0EKWE", ...],
    "product_parent": ["852431543", ...],
    ...
}
```
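
The relationship between the two shapes can be sketched in plain Python. This is only an illustration, not something the datasets library provides: ```batch_to_rows``` is a hypothetical helper, and the batch below is a truncated copy of a few columns from the dataset.

```python
from typing import Any, Dict, List


def batch_to_rows(batch: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a column-oriented batch into a list of per-example dictionaries."""
    keys = list(batch.keys())
    # zip(*columns) walks the columns in lockstep, yielding one tuple per example.
    return [dict(zip(keys, values)) for values in zip(*batch.values())]


# Truncated illustration using values from the dataset above.
batch = {
    "marketplace": ["US", "US"],
    "customer_id": ["16414143", "50800750"],
    "star_rating": [2, 4],
}
rows = batch_to_rows(batch)
print(rows[0])
```

Each element of ```rows``` then has the single-example shape that ```template.apply``` expects.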


In [83]:
def map_data_to_chat_batched(rows: Dict[str, List]) -> Dict[str, List[str]]:
    """Huggingface Dataset map function for batched=True. Prints each column of the batch
    to show its shape. Must return a dictionary of lists.
    """
    for key, value in rows.items():
        # Skip the long text columns to keep the output readable.
        if key not in ['review_body', 'product_title']:
            print(f"{key}:{value}")
    return {"text": ["dummy"]}

In [84]:
chats: IterableDataset = train.map(
    function=map_data_to_chat_batched,
    batched=True,
    batch_size=5,
    remove_columns=columns
)

In [85]:
list(chats.take(1))

marketplace:['US', 'US', 'US', 'US', 'US']
customer_id:['16414143', '50800750', '15184378', '10203548', '488280']
review_id:['R3W4P9UBGNGH1U', 'R15V54KBMTQWAY', 'RY8I449HNXSVF', 'R18TLJYCKJFLSR', 'R1NK26SWS53B8Q']
product_id:['B00YL0EKWE', 'B00XK95RPQ', 'B00SXRXUKO', 'B009V5X1CE', 'B00D93OVF0']
product_parent:['852431543', '516894650', '984297154', '279912704', '662791300']
product_category:['Wireless', 'Wireless', 'Wireless', 'Wireless', 'Wireless']
star_rating:[2, 4, 5, 5, 5]
helpful_votes:[1, 0, 0, 0, 0]
total_votes:[3, 0, 0, 0, 0]
vine:[0, 0, 0, 0, 0]
verified_purchase:[1, 0, 1, 1, 1]
review_headline:['Looks good, functions meh', 'A fun little gadget', 'Five Stars', 'Great charger', 'Five Stars']
review_date:['2015-08-31', '2015-08-31', '2015-08-31', '2015-08-31', '2015-08-31']


[{'text': 'dummy'}]

# Conclusion

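To use ```template.apply```, call ```map``` with ```batched=False```, because with ```batched=True``` the function receives a dictionary of lists rather than a single example.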
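
If batching is still wanted, one possible workaround is to rebuild per-example dictionaries inside the batched function and collect the results back into lists. The sketch below is an illustration under assumptions, not part of the original notebook: ```make_batched``` is a hypothetical helper, and ```fake_apply``` stands in for ```template.apply``` so that the block runs without downloading anything.

```python
from typing import Any, Callable, Dict, List, Tuple


def make_batched(
    apply_fn: Callable[[Dict[str, Any]], Tuple[str, str]]
) -> Callable[[Dict[str, List[Any]]], Dict[str, List[str]]]:
    """Wrap a per-example function so it accepts a column-oriented batch."""

    def map_batch_to_chat(batch: Dict[str, List[Any]]) -> Dict[str, List[str]]:
        keys = list(batch.keys())
        chats: List[str] = []
        # Rebuild one dictionary per example, then apply the per-example function.
        for values in zip(*batch.values()):
            prompt, response = apply_fn(dict(zip(keys, values)))
            chats.append(f"PROMPT:{prompt}\nRESPONSE:{response}\n\n")
        # A batched map function must return a dictionary of lists.
        return {"chat": chats}

    return map_batch_to_chat


def fake_apply(row: Dict[str, Any]) -> Tuple[str, str]:
    """Stand-in for template.apply, used only to keep this sketch runnable."""
    return f"Review: {row['review_headline']}", str(row["star_rating"])


batch = {
    "review_headline": ["Looks good, functions meh", "A fun little gadget"],
    "star_rating": [2, 4],
}
result = make_batched(fake_apply)(batch)
print(len(result["chat"]))
```

With the notebook's objects, the wrapper would presumably be passed as ```train.map(function=make_batched(template.apply), batched=True, batch_size=5, remove_columns=columns)```; that call was not run here.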