In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate

# Summarization

*Text summarization* condenses a long document into a shorter text that preserves its key points.

In this section, we will train a bilingual model for English and Spanish.

## Preparing a multilingual corpus

We will use the *Multilingual Amazon Reviews Corpus* to create our bilingual summarizer. This corpus consists of Amazon product reviews in six languages.

In [None]:
from datasets import load_dataset

spanish_dataset = load_dataset('amazon_reviews_multi', 'es')
english_dataset = load_dataset('amazon_reviews_multi', 'en')

In [None]:
spanish_dataset

In [None]:
english_dataset

In [9]:
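def show_samples(dataset, num_samples=3, seed=101):
    """Print the title and body of a few randomly chosen training reviews."""
    # Shuffle with a fixed seed so the same reviews are shown on every run
    sample = dataset['train'].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n>> Title: {example['review_title']}")
        print(f">> Review: {example['review_body']}")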

In [None]:
show_samples(spanish_dataset)

In [None]:
show_samples(english_dataset)

To get a feel for which domains we can choose from, we can convert `english_dataset` to a `pandas.DataFrame` and count the reviews per product category:

In [None]:
english_dataset.set_format(type='pandas')
english_df = english_dataset['train'][:]

# Show review counts for the 20 most frequent product categories
english_df['product_category'].value_counts()[:20]
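
To make clear what `value_counts()` computes here, the same per-category tally can be sketched in plain Python. This is only an illustration on made-up labels: `count_categories` and `toy_categories` are hypothetical names, not part of the corpus or of the `datasets` API.

```python
from collections import Counter


def count_categories(categories, top_k=20):
    """Return the `top_k` most frequent categories as (name, count) pairs."""
    # Counter tallies each label; most_common sorts by descending count,
    # which mirrors the default ordering of pandas' value_counts().
    return Counter(categories).most_common(top_k)


# Made-up labels standing in for the `product_category` column.
toy_categories = ["home", "book", "home", "wireless", "book", "home"]

top = count_categories(toy_categories, top_k=2)
print(top)
```

Any iterable of category strings should work as input, so the helper does not depend on the dataset being in `pandas` format.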

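To focus on book reviews, we define a filter function that keeps only examples from the `book` and `digital_ebook_purchase` categories:

In [None]: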
def filter_books(example):
    return (
        example['product_category'] == 'book'
        or example['product_category'] == 'digital_ebook_purchase'
    )
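
A predicate like this can be checked on a few hand-written examples before running it over a full dataset. The sketch below uses set membership instead of chained `or` comparisons, which may be easier to extend if more categories are added later. `BOOK_CATEGORIES`, `is_book_review`, and the toy reviews are hypothetical names and made-up data used only for illustration.

```python
# Categories treated as "books" (the same two values used by filter_books).
BOOK_CATEGORIES = {"book", "digital_ebook_purchase"}


def is_book_review(example):
    """Return True if the example belongs to one of the book categories."""
    return example["product_category"] in BOOK_CATEGORIES


# Made-up examples that mimic the shape of a review record.
toy_reviews = [
    {"product_category": "book", "review_title": "Great read"},
    {"product_category": "home", "review_title": "Sturdy shelf"},
    {"product_category": "digital_ebook_purchase", "review_title": "Page-turner"},
]

kept = [review for review in toy_reviews if is_book_review(review)]
print([review["review_title"] for review in kept])
```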

Before applying the filter, we need to switch the format from `pandas` back to `arrow`:

In [None]:
english_dataset.reset_format()

In [None]:
spanish_books = spanish_dataset.filter(filter_books)
english_books = english_dataset.filter(filter_books)

show_samples(english_books)