# Part 0 - Data preparation

In this notebook we download the Amazon Review dataset, apply some light preprocessing, and save the result to S3. The preprocessing keeps only the columns we need, filters out reviews that are too short, and limits the size of each split.

To read more, please check out https://towardsdatascience.com/setting-up-a-text-summarisation-project-introduction-526622eea4a8.

## Data download

We download the English subset of the dataset from https://huggingface.co/datasets/amazon_reviews_multi and load each split (train, validation, test) into a pandas DataFrame.

In [None]:
%pip install datasets

In [None]:
from datasets import load_dataset
train_ds = load_dataset("amazon_reviews_multi", "en", split='train')
val_ds = load_dataset("amazon_reviews_multi", "en", split='validation')
test_ds = load_dataset("amazon_reviews_multi", "en", split='test')

In [None]:
import pandas as pd
df_train = pd.DataFrame(train_ds)
df_val = pd.DataFrame(val_ds)
df_test = pd.DataFrame(test_ds)

In [None]:
df_train.head()

## Filtering the dataset

We discard examples whose title or body is too short (fewer than 5 words for the title, fewer than 20 words for the body), so that our model can produce more interesting summaries.

In [None]:
cutoff_title = 5
cutoff_body = 20

In [None]:
def keep_long_reviews(df):
    title_ok = df['review_title'].apply(lambda x: len(x.split()) >= cutoff_title)
    body_ok = df['review_body'].apply(lambda x: len(x.split()) >= cutoff_body)
    return df[title_ok & body_ok]  # keep rows that pass both word-count cutoffs

df_train, df_val, df_test = (keep_long_reviews(df) for df in (df_train, df_val, df_test))
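
To make the effect of the cutoffs concrete, here is a minimal sketch on a toy DataFrame. The rows and the names `toy` and `toy_filtered` are made up for illustration; only the column names and cutoff values come from the cells above.

```python
import pandas as pd

cutoff_title = 5
cutoff_body = 20

# Hypothetical toy data: only the second row passes both cutoffs.
toy = pd.DataFrame({
    "review_title": [
        "Great",
        "Works well for the price",
        "Not what I expected at all",
    ],
    "review_body": [
        "Loved it.",
        " ".join(["word"] * 25),
        "Too short to keep.",
    ],
})

# Count whitespace-separated words, as in the notebook's filter.
title_words = toy["review_title"].apply(lambda x: len(x.split()))
body_words = toy["review_body"].apply(lambda x: len(x.split()))

toy_filtered = toy[(title_words >= cutoff_title) & (body_words >= cutoff_body)]
print(len(toy), "->", len(toy_filtered))
```

Note that a row is dropped if either the title or the body falls below its cutoff, so a long review with a one-word title is discarded as well.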

## Limiting the size of the datasets

We limit the size of the datasets so that model training finishes in a reasonable amount of time. We might revisit this decision in the experimentation phase if we want to improve the model's performance.

In [None]:
print(len(df_train), len(df_val), len(df_test))

In [None]:
df_train = df_train.sample(20000, random_state=42)
df_val = df_val.sample(1000, random_state=42)
df_test = df_test.sample(1000, random_state=42)
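
With the default `replace=False`, `DataFrame.sample` raises a `ValueError` when asked for more rows than the frame contains. If the cutoffs are tightened later, a split could fall below the requested size. A hedged sketch of a guard, using a hypothetical helper `sample_up_to` that is not part of the original notebook:

```python
import pandas as pd


def sample_up_to(df, n, seed=42):
    """Return a reproducible random sample of at most n rows (hypothetical helper)."""
    return df.sample(n=min(n, len(df)), random_state=seed)


# Toy frame standing in for a filtered split.
small = pd.DataFrame({"review_body": ["a", "b", "c"]})
print(len(sample_up_to(small, 2)))   # requested size fits
print(len(sample_up_to(small, 10)))  # capped at the number of available rows
```

Fixing `random_state` keeps the sample identical across runs, which matters when comparing experiments later.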

## Save the data as CSV files and upload them to S3

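We keep only the `review_body` and `review_title` columns, write each split to a CSV file, and upload the files to S3 so that they are available when we train the model at a later point.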

In [None]:
import os
os.makedirs('data', exist_ok=True)  # to_csv does not create missing directories
df_train.to_csv('data/train.csv', index=False, columns=['review_body', 'review_title'])
df_val.to_csv('data/val.csv', index=False, columns=['review_body', 'review_title'])
df_test.to_csv('data/test.csv', index=False, columns=['review_body', 'review_title'])
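
Before uploading, it can be worth confirming that a written file contains only the two intended columns. The following is a small sketch on a made-up frame written to a temporary directory; the `review_id` column and the variable names are illustrative assumptions, not taken from the notebook's data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with an extra column that should not be written.
df = pd.DataFrame({
    "review_id": ["r1", "r2"],
    "review_body": ["Body one", "Body two"],
    "review_title": ["Title one", "Title two"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.csv")
    # Same arguments as in the notebook: no index, two columns only.
    df.to_csv(path, index=False, columns=["review_body", "review_title"])
    df_check = pd.read_csv(path)

print(list(df_check.columns))
```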

In [None]:
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()

In [None]:
!aws s3 cp data/train.csv s3://$bucket/summarization/data/train.csv
!aws s3 cp data/val.csv s3://$bucket/summarization/data/val.csv
!aws s3 cp data/test.csv s3://$bucket/summarization/data/test.csv
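
The same upload can also be done from Python rather than through the AWS CLI. The sketch below assumes `Session.upload_data` from the SageMaker Python SDK and valid AWS credentials in the environment; `s3_uri` and `upload_splits` are hypothetical helpers, and the bucket name in the final line is a placeholder.

```python
def s3_uri(bucket, filename, prefix="summarization/data"):
    """Build the destination URI used by the CLI commands above (hypothetical helper)."""
    return f"s3://{bucket}/{prefix}/{filename}"


def upload_splits(files=("train.csv", "val.csv", "test.csv"), local_dir="data"):
    """Upload each CSV with the SageMaker SDK; needs AWS credentials to run."""
    import sagemaker  # imported here so the URI helper works without the SDK

    sess = sagemaker.Session()
    bucket = sess.default_bucket()
    for name in files:
        # upload_data stores the file under s3://<bucket>/<key_prefix>/<name>
        sess.upload_data(f"{local_dir}/{name}", bucket=bucket, key_prefix="summarization/data")
        print("uploaded", s3_uri(bucket, name))


# Placeholder bucket name, only to show the resulting path layout.
print(s3_uri("my-example-bucket", "train.csv"))
```

Calling `upload_splits()` performs the actual transfer; it is left uncalled here so the snippet can run without AWS access.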