# Part 0 - Data preparation

In this notebook we will download the arXiv dataset and save it to S3. We will also do some light data preprocessing: keeping only the columns we need, filtering out abstracts and titles that are too short, and limiting the size of the dataset.

To read more, please check out https://towardsdatascience.com/setting-up-a-text-summarisation-project-introduction-526622eea4a8.

First, we make sure that the relevant libraries are installed on this machine:

In [1]:
!pip install "sagemaker>=2.48.0" "transformers==4.6.1" "datasets[s3]==1.6.2" --upgrade

Collecting sagemaker>=2.48.0
  Downloading sagemaker-2.81.1.tar.gz (519 kB)
Collecting transformers==4.6.1
  Downloading transformers-4.6.1-py3-none-any.whl (2.2 MB)
Collecting datasets[s3]==1.6.2
  Downloading datasets-1.6.2-py3-none-any.whl (221 kB)
Collecting tqdm>=4.27
  Downloading tqdm-4.63.1-py2.py3-none-any.whl (76 kB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp39-cp39-macosx_10_11_x86_64.whl (2.2 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)


In [2]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Collecting nltk
  Using cached nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting absl-py
  Downloading absl_py-1.0.0-py3-none-any.whl (126 kB)
Installing collected packages: nltk, absl-py, rouge-score
Successfully installed absl-py-1.0.0 nltk-3.7 rouge-score-0.0.4


We will download the dataset directly from the Kaggle website, so we need to install the Kaggle Python package.

In [3]:
!pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
Collecting python-slugify
  Downloading python_slugify-6.1.1-py2.py3-none-any.whl (9.1 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73051 sha256=d8b3341c62e5979a445cd5b85752249f41737d679d52321875e840e72ff15b40
  Stored in directory: /Users/robbdunlap/Library/Caches/pip/wheels/ac/b2/c3/fa4706d469b5879105991d1c8be9a3c2ef329ba9fe2ce5085e
Successfully built kaggle
Installing collected packages: text-unidecode, python-slugify, kaggle
Successfully installed kaggle-1.5.12 python-slugify-6.1.1 text-unidecode-1.3
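
Before moving on, it can help to confirm which versions actually ended up in the environment. The sketch below uses the standard-library `importlib.metadata` for this. `installed_versions` is a hypothetical helper name (not part of any of the libraries above), and the distribution names passed to it are assumed to match the ones shown in the `pip` output.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Distribution names assumed from the pip output above.
print(installed_versions(["sagemaker", "transformers", "datasets", "rouge-score", "kaggle"]))
```

A `None` entry means the corresponding install cell should be re-run before continuing.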


In [None]:
# https://github.com/Kaggle/kaggle-api
# Follow the instructions at the link above to export your Kaggle username and key to your local
# environment, so the credentials never appear in the notebook (and are not exposed on GitHub).

# Only run the lines below if you cannot export these variables through your shell.
# import os
# os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"
# os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"
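
As a rough sanity check before authenticating, the sketch below looks for credentials in the two places the Kaggle API documentation describes: the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables and a `kaggle.json` file in `~/.kaggle`. `kaggle_credentials_available` is a hypothetical helper written for this notebook, not part of the `kaggle` package, and it only checks that credentials exist, not that they are valid.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, config_dir=None):
    """Return True if Kaggle credentials appear to be configured.

    env: mapping to read variables from (defaults to os.environ).
    config_dir: directory expected to hold kaggle.json (defaults to ~/.kaggle).
    """
    env = os.environ if env is None else env
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    return (config_dir / "kaggle.json").is_file()


if not kaggle_credentials_available():
    print("No Kaggle credentials found; see https://github.com/Kaggle/kaggle-api")
```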

In [4]:
import kaggle
kaggle.api.authenticate()

In [5]:
kaggle.api.dataset_download_files('Cornell-University/arxiv', path=".")

In [6]:
!unzip arxiv.zip

Archive:  arxiv.zip
  inflating: arxiv-metadata-oai-snapshot.json  


In [7]:
!mkdir -p data


In [3]:
!mkdir -p raw_data
!mv arxiv.zip raw_data/

In [4]:
!mv arxiv-metadata-oai-snapshot.json raw_data/

In [5]:
from datasets import load_dataset
dataset = load_dataset("arxiv_dataset", data_dir='./raw_data/', split='train', ignore_verifications=True)

Using custom data configuration default-data_dir=.%2Fraw_data%2F


Downloading and preparing dataset arxiv_dataset/default (download: Unknown size, generated: 2.09 GiB, post-processed: Unknown size, total: 2.09 GiB) to /Users/robbdunlap/.cache/huggingface/datasets/arxiv_dataset/default-data_dir=.%2Fraw_data%2F/1.1.0/242eb95c95350194872f5be3fb00e7938e53b0944442e85f45a5d2240328f370...



Dataset arxiv_dataset downloaded and prepared to /Users/robbdunlap/.cache/huggingface/datasets/arxiv_dataset/default-data_dir=.%2Fraw_data%2F/1.1.0/242eb95c95350194872f5be3fb00e7938e53b0944442e85f45a5d2240328f370. Subsequent calls will reuse this data.


None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


The full dataset is larger than we need, so we shuffle it and keep only 25,000 records.

In [6]:
dataset = dataset.shuffle(seed=42)
dataset = dataset.select(range(25000))
dataset

Dataset({
    features: ['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'update_date'],
    num_rows: 25000
})

In [7]:
import pandas as pd
df = pd.DataFrame(dataset)

In [8]:
# keep only the columns we need
df = df[['abstract', 'title']]
df = df.rename(columns={"abstract": "text", "title": "summary"})

In [9]:
df = df.replace(r'\n',' ', regex=True)
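
To illustrate what the newline replacement does, here is a minimal sketch on a toy DataFrame. The rows are made up for illustration and are not taken from the arXiv data.

```python
import pandas as pd

# Toy rows (invented) with line breaks inside the strings.
toy = pd.DataFrame(
    {
        "text": ["first line\nsecond line"],
        "summary": ["A title\nwrapped over two lines"],
    }
)

# With regex=True the pattern is applied inside each string,
# so every newline becomes a single space.
cleaned = toy.replace(r"\n", " ", regex=True)
print(cleaned.loc[0, "text"])  # first line second line
```

Without `regex=True`, `replace` would only match cells whose entire value equals the pattern, so the line breaks would be left in place.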

In [10]:
pd.options.display.max_colwidth = 100

In [11]:
df.head()

| | text | summary |
|---|---|---|
| 0 | The rest-frame UV spectra of three recent tidal disruption events (TDEs), ASASSN-14li, PTF15af... | Carbon and Nitrogen Abundance Ratio In the Broad Line Region of Tidal Disruption Events |
| 1 | Inspired by the success of transformer-based pre-training methods on natural language tasks an... | Survey: Transformer based Video-Language Pre-training |
| 2 | Pandharipande-Pixton have used the geometry of the moduli space of stable quotients to produce... | Tautological relations in moduli spaces of weighted pointed curves |
| 3 | Suppose X is a projective toric scheme defined over a commutative ring R equipped with an ampl... | A splitting result for the algebraic K-theory of projective toric schemes |
| 4 | We introduce an approach that accurately reconstructs 3D human poses and detailed 3D full-body... | Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies from Single RGB Images |


## Filtering the dataset

We want to discard abstracts and titles that are too short, so that our model can produce more interesting summaries.

In [None]:
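cutoff_summary = 5  # minimum number of words in a title
cutoff_text = 20  # minimum number of words in an abstract
summary_long_enough = df['summary'].str.split().str.len() >= cutoff_summary
text_long_enough = df['text'].str.split().str.len() >= cutoff_text
df = df[summary_long_enough & text_long_enough]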

In [None]:
len(df)
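
The word-count filter above can be checked on a small example. This is a sketch on invented rows; `filter_short` is a hypothetical helper name, and its default cutoffs simply mirror the values used in this notebook.

```python
import pandas as pd


def filter_short(df, cutoff_summary=5, cutoff_text=20):
    """Keep rows whose summary and text each have at least the given number of words."""
    summary_words = df["summary"].str.split().str.len()
    text_words = df["text"].str.split().str.len()
    return df[(summary_words >= cutoff_summary) & (text_words >= cutoff_text)]


# Invented rows: only the first one passes both cutoffs.
toy = pd.DataFrame(
    {
        "text": ["word " * 25, "too short", "word " * 30],
        "summary": [
            "A title with five words",
            "Another sufficiently long title here",
            "Short",
        ],
    }
)

kept = filter_short(toy)
print(len(kept))  # 1
```

The second row is dropped because its text has only two words, and the third because its summary has only one.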

## Limiting the size of the datasets and splitting

We limit the size of the dataset so that model training finishes in a reasonable amount of time. We might revisit this decision in the experimentation phase if we want to improve the model's performance. We then split the dataset into train (80%), validation (10%), and test (10%) sets.

In [None]:
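# sample at most 20,000 rows; min() guards against fewer rows remaining after filtering
df = df.sample(min(20000, len(df)), random_state=43)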

In [None]:
import numpy as np
# split the dataset into train, val, and test
df_train, df_val, df_test = np.split(df.sample(frac=1, random_state=44), [int(0.8*len(df)), int((0.9)*len(df))])
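
The same 80/10/10 split can also be written with plain `iloc` slicing, which does not rely on `np.split` accepting a DataFrame. The sketch below is one possible alternative rather than the method this notebook uses. `train_val_test_split` is a hypothetical helper name, and the toy DataFrame is made up to show the resulting sizes.

```python
import pandas as pd


def train_val_test_split(df, train_frac=0.8, val_frac=0.1, seed=44):
    """Shuffle df and cut it into train/validation/test parts by position."""
    shuffled = df.sample(frac=1, random_state=seed)
    train_end = int(train_frac * len(shuffled))
    val_end = train_end + int(val_frac * len(shuffled))
    return (
        shuffled.iloc[:train_end],
        shuffled.iloc[train_end:val_end],
        shuffled.iloc[val_end:],  # whatever is left becomes the test set
    )


toy = pd.DataFrame({"text": [f"abstract {i}" for i in range(100)],
                    "summary": [f"title {i}" for i in range(100)]})
toy_train, toy_val, toy_test = train_val_test_split(toy)
print(len(toy_train), len(toy_val), len(toy_test))  # 80 10 10
```

Because the three parts are consecutive slices of one shuffled frame, no row can appear in more than one split.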

In [None]:
df_train.to_csv('data/train.csv', index=False)
df_val.to_csv('data/val.csv', index=False)
df_test.to_csv('data/test.csv', index=False)
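
Titles and abstracts often contain commas and quotation marks, so it is worth confirming that the CSV round trip preserves them. The following sketch writes an invented row to a temporary directory and reads it back. It uses only `to_csv` and `read_csv` with pandas' default quoting.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented row containing a comma and double quotes.
toy = pd.DataFrame(
    {
        "text": ['An abstract, with a comma and a "quoted" phrase.'],
        "summary": ["Title, with comma"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.csv"
    toy.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    restored = pd.read_csv(path)

print(list(restored.columns))
print(restored.loc[0, "summary"])
```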

## Save the data as CSV files and upload them to S3

We need to upload the data to S3 in order to train the model at a later point.

In [None]:
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()

In [None]:
!aws s3 cp data/train.csv s3://$bucket/summarization/data/train.csv
!aws s3 cp data/val.csv s3://$bucket/summarization/data/val.csv
!aws s3 cp data/test.csv s3://$bucket/summarization/data/test.csv
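
If the AWS CLI is not available, the upload can probably also be done from Python with the SageMaker session's `upload_data` method, which returns the S3 URI of the uploaded file. The sketch below is an alternative under that assumption. `upload_splits` is a hypothetical helper name, and the key prefix mirrors the paths used in the `aws s3 cp` commands above.

```python
def upload_splits(session, bucket, local_dir="data",
                  key_prefix="summarization/data",
                  filenames=("train.csv", "val.csv", "test.csv")):
    """Upload each split with session.upload_data and return {filename: S3 URI}."""
    uris = {}
    for name in filenames:
        uris[name] = session.upload_data(
            path=f"{local_dir}/{name}",
            bucket=bucket,
            key_prefix=key_prefix,
        )
    return uris


# Usage (requires AWS credentials, so it is left commented out here):
# import sagemaker
# sess = sagemaker.Session()
# print(upload_splits(sess, sess.default_bucket()))
```

Keeping the returned URIs in a dictionary makes it easy to pass the training and validation locations to a later training job.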