# Part 0 - Data preparation

In this notebook we download the arXiv dataset, preprocess it lightly, and save it to S3. The preprocessing keeps only the columns we need, filters out abstracts and titles that are too short, and limits the size of the datasets.

To read more, please check out https://towardsdatascience.com/setting-up-a-text-summarisation-project-introduction-526622eea4a8.

First, we make sure that the relevant libraries are installed on this machine:

In [None]:
!pip install "sagemaker>=2.48.0" "transformers==4.6.1" "datasets[s3]==1.6.2" --upgrade

In [None]:
!pip install rouge_score

We will download the dataset directly from the Kaggle website, so we need to install the Kaggle Python package and provide our Kaggle credentials.

In [None]:
!pip install kaggle

In [None]:
import os
os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"
os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"
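
The `kaggle` package authenticates as soon as it is imported, so both variables must be set before the `import kaggle` cell runs. As a minimal sketch, the check below fails early when a variable is missing or still holds the `<...>` placeholder. The helper name `require_kaggle_credentials` is hypothetical and not part of the Kaggle package.

```python
import os

def require_kaggle_credentials(env=None):
    """Return (username, key); raise if either is unset or still a placeholder."""
    env = os.environ if env is None else env
    missing = []
    for name in ("KAGGLE_USERNAME", "KAGGLE_KEY"):
        value = env.get(name, "")
        # Treat the "<your-...>" template text as "not filled in yet"
        if not value or value.startswith("<"):
            missing.append(name)
    if missing:
        raise RuntimeError("Missing Kaggle credentials: " + ", ".join(missing))
    return env["KAGGLE_USERNAME"], env["KAGGLE_KEY"]
```

Calling `require_kaggle_credentials()` with no arguments checks the real environment of the notebook kernel.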

In [None]:
import kaggle
kaggle.api.authenticate()

In [None]:
kaggle.api.dataset_download_files('Cornell-University/arxiv', path=".")

In [None]:
!unzip arxiv.zip

In [None]:
!mkdir -p data raw_data

In [None]:
!mv arxiv.zip raw_data/

In [None]:
!mv arxiv-metadata-oai-snapshot.json raw_data/
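
Before loading the full snapshot, it can help to look at a few raw records. The sketch below assumes the snapshot stores one JSON object per line; `peek_records` is a hypothetical helper that parses only the first `n` lines of any iterable of strings, such as an open file.

```python
import json
from itertools import islice

def peek_records(lines, n=3):
    """Parse the first n JSON lines without reading the rest of the input."""
    return [json.loads(line) for line in islice(lines, n)]

# Usage, assuming the file was moved to raw_data/ as above:
# with open("raw_data/arxiv-metadata-oai-snapshot.json", encoding="utf-8") as f:
#     for record in peek_records(f):
#         print(sorted(record.keys()))
```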

In [None]:
from datasets import load_dataset
dataset = load_dataset("arxiv_dataset", data_dir='./raw_data/', split='train', ignore_verifications=True)

The original dataset is larger than we need, so we shuffle it and limit the number of records to 25,000.

In [None]:
dataset = dataset.shuffle(seed=42)
dataset = dataset.select(range(25000))
dataset

In [None]:
import pandas as pd
df = pd.DataFrame(dataset)

In [None]:
# Keep only the required columns and rename them: the abstract is the input text, the title is the target summary
df = df[['abstract', 'title']]
df = df.rename(columns={"abstract": "text", "title": "summary"})

In [None]:
df = df.replace(r'\n',' ', regex=True)
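
The cell above replaces every newline inside the text fields with a space, so each record fits on one logical line. A small sketch on made-up data shows the effect; the `toy` frame is purely illustrative.

```python
import pandas as pd

# Made-up rows with hard line breaks, standing in for real abstracts and titles
toy = pd.DataFrame({
    "text": ["We study\nsparse models."],
    "summary": ["Sparse\nmodels"],
})

# With regex=True the pattern is substituted inside each string value
cleaned = toy.replace(r"\n", " ", regex=True)
print(cleaned.loc[0, "text"])
print(cleaned.loc[0, "summary"])
```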

In [None]:
pd.options.display.max_colwidth = 100

In [None]:
df.head()

## Filtering the dataset

We want to discard abstracts and titles that are too short, so that our model can produce more interesting summaries. Length is measured in whitespace-separated words.

In [None]:
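# Minimum number of whitespace-separated words for each field
cutoff_summary = 5
cutoff_text = 20
long_enough_summary = df['summary'].apply(lambda x: len(x.split()) >= cutoff_summary)
long_enough_text = df['text'].apply(lambda x: len(x.split()) >= cutoff_text)
df = df[long_enough_summary & long_enough_text]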
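
To see which rows such a filter keeps, here is a minimal sketch on made-up data. It uses the vectorized `.str.split().str.len()` form, which counts words the same way as the `apply` version, and much smaller cutoffs than the notebook so the toy rows are easy to check by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["one two three four", "too short", "a b c d e"],
    "summary": ["good title here", "fine title too", "tiny"],
})

# Illustrative cutoffs only; the notebook uses 5 and 20
cutoff_summary, cutoff_text = 2, 3

summary_len = toy["summary"].str.split().str.len()
text_len = toy["text"].str.split().str.len()

# A row survives only if both fields are long enough
kept = toy[(summary_len >= cutoff_summary) & (text_len >= cutoff_text)]
print(kept)
```

Only the first row is kept: the second has a two-word text and the third has a one-word summary.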

In [None]:
len(df)

## Limiting the size of the datasets and splitting

We want to limit the size of the datasets so that training of the model can finish in a reasonable amount of time. This is a decision that we might want to revisit in the experimentation phase if we want to increase the performance of the model. We then split the dataset into training (80%), validation (10%), and test (10%) sets.

In [None]:
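# Sample at most 20,000 rows; min() avoids an error if fewer rows survived the filter
df = df.sample(min(20000, len(df)), random_state=43)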

In [None]:
import numpy as np
# split the dataset into train, val, and test
df_train, df_val, df_test = np.split(df.sample(frac=1, random_state=44), [int(0.8*len(df)), int((0.9)*len(df))])
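
The two cut points passed to `np.split` are the row positions where the training set ends (80%) and where the validation set ends (90%). Newer pandas releases can emit a deprecation warning when `np.split` is applied to a DataFrame, so the sketch below shows an equivalent split by position with `iloc`. The helper name `split_frame` and the toy data are assumptions for illustration.

```python
import pandas as pd

def split_frame(df, cuts=(0.8, 0.9), seed=44):
    """Shuffle df, then cut it into train/val/test at the given fractions."""
    shuffled = df.sample(frac=1, random_state=seed)
    train_end = int(cuts[0] * len(shuffled))
    val_end = int(cuts[1] * len(shuffled))
    return (
        shuffled.iloc[:train_end],
        shuffled.iloc[train_end:val_end],
        shuffled.iloc[val_end:],
    )

toy = pd.DataFrame({"text": [f"t{i}" for i in range(100)],
                    "summary": [f"s{i}" for i in range(100)]})
train, val, test = split_frame(toy)
print(len(train), len(val), len(test))  # 80 10 10
```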

## Save the data as CSV files and upload them to S3

We need to upload the data to S3 in order to train the model at a later point. First we write each split to a CSV file in the local `data` directory.

In [None]:
df_train.to_csv('data/train.csv', index=False)
df_val.to_csv('data/val.csv', index=False)
df_test.to_csv('data/test.csv', index=False)

In [None]:
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()

In [None]:
!aws s3 cp data/train.csv s3://$bucket/summarization/data/train.csv
!aws s3 cp data/val.csv s3://$bucket/summarization/data/val.csv
!aws s3 cp data/test.csv s3://$bucket/summarization/data/test.csv
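
The same upload can be done from Python with the `upload_data` method of the SageMaker session, which avoids depending on the AWS CLI being installed. This is a sketch under assumptions: `upload_splits` is a hypothetical helper, and the key prefix mirrors the paths used in the `aws s3 cp` commands above.

```python
from pathlib import Path

def upload_splits(session, bucket, local_dir="data",
                  key_prefix="summarization/data",
                  names=("train.csv", "val.csv", "test.csv")):
    """Upload each split via session.upload_data and collect the returned S3 URIs."""
    uris = []
    for name in names:
        local_path = str(Path(local_dir) / name)
        # upload_data stores the file under key_prefix and returns its S3 URI
        uris.append(session.upload_data(path=local_path, bucket=bucket,
                                        key_prefix=key_prefix))
    return uris

# Usage, reusing `sess` and `bucket` from the cell above:
# upload_splits(sess, bucket)
```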