# I. Project Team Members

| Prepared by | Email | Prepared for |
| :-: | :-: | :-: |
| **Hardefa Rogonondo** | hardefarogonondo@gmail.com | **Research Paper Summarization Engine** |

# II. Notebook Target Definition

This Jupyter notebook is the first step in building the Research Paper Summarization Engine project and focuses on data preparation. It loads the SciTLDR dataset from the Hugging Face Hub and explores its structure: the shape of each split, the column data types, and the meaning of each column. Because the dataset already ships with training, test, and validation splits, the notebook ends by exporting each split to a `.pkl` file for use in later phases of the project.

# III. Notebook Setup

## III.A. Import Libraries

In [1]:
from datasets import load_dataset
import pandas as pd
import pickle

# Show every column and row when a DataFrame is displayed.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## III.B. Import Data

In [2]:
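# Load SciTLDR from the Hugging Face Hub; trust_remote_code=True allows the dataset's loading script to run.
dataset = load_dataset('allenai/scitldr', trust_remote_code=True)

# Convert each predefined split to a pandas DataFrame.
train_df = pd.DataFrame(dataset["train"])
test_df = pd.DataFrame(dataset["test"])
validation_df = pd.DataFrame(dataset["validation"])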

In [3]:
train_df.head()

| | source | source_labels | rouge_scores | paper_id | target |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | [Due to the success of deep learning to solvin... | [0, 0, 0, 0, 1, 0] | [0.30188679695129395, 0.3720930218696594, 0.60... | SysEexbRb | [We provide necessary and sufficient analytica... |
| 1 | [The backpropagation (BP) algorithm is often t... | [0, 0, 0, 1, 0, 0, 0, 0] | [0.0, 0.0, 0.1304347813129425, 0.1428571343421... | SygvZ209F7 | [Biologically plausible learning algorithms, p... |
| 2 | [We introduce the 2-simplicial Transformer, an... | [0, 1] | [0.3333333432674408, 0.8888888955116272] | rkecJ6VFvr | [We introduce the 2-simplicial Transformer and... |
| 3 | [We present Tensor-Train RNN (TT-RNN), a novel... | [0, 0, 0, 1, 0, 0] | [0.06666666269302368, 0.06451612710952759, 0.0... | HJJ0w--0W | [Accurate forecasting over very long time hori... |
| 4 | [Recent efforts on combining deep models with ... | [0, 1, 0, 0, 0, 0, 0] | [0.277777761220932, 0.5714285373687744, 0.0952... | HyH9lbZAW | [We propose a variational message-passing algo... |


In [4]:
test_df.head()

| | source | source_labels | rouge_scores | paper_id | target |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | [Incremental class learning involves sequentia... | [0, 0, 0, 0, 1, 0, 0, 0, 0] | [0.2857142686843872, 0.1818181723356247, 0.227... | SJ1Xmf-Rb | [FearNet is a memory efficient neural-network,... |
| 1 | [Multi-view learning can provide self-supervis... | [1, 0, 0, 0, 0, 0] | [0.20000000298023224, 0.0, 0.15789473056793213... | S1xzyhR9Y7 | [Multi-view learning improves unsupervised sen... |
| 2 | [We show how discrete objects can be learnt in... | [1, 0, 0, 0, 0] | [0.978723406791687, 0.3333333432674408, 0.4150... | HJDUjKeA- | [We show how discrete objects can be learnt in... |
| 3 | [Most recent gains in visual recognition have ... | [0, 0, 1, 0, 0, 0] | [0.11764705181121826, 0.1463414579629898, 0.19... | BJgLg3R9KQ | [A large-scale dataset for training attention ... |
| 4 | [In recent years, deep neural networks have de... | [0, 0, 1, 0, 0, 0, 0, 0] | [0.0, 0.05882352590560913, 0.2702702581882477,... | BklpOo09tQ | [We proposed a time-efficient defense method a... |


In [5]:
validation_df.head()

| | source | source_labels | rouge_scores | paper_id | target |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | [Mixed precision training (MPT) is becoming a ... | [0, 0, 0, 1, 0, 0] | [0.23999999463558197, 0.260869562625885, 0.199... | rJlnfaNYvB | [We devise adaptive loss scaling to improve mi... |
| 1 | [Many real-world problems, e.g. object detecti... | [0, 0, 1, 0, 0] | [0.05405404791235924, 0.2926829159259796, 0.97... | rJVoEiCqKQ | [We present a novel approach for learning to p... |
| 2 | [Foveation is an important part of human visio... | [0, 0, 1, 0, 0] | [0.11764705181121826, 0.11764705181121826, 0.3... | rkldVXKU8H | [We compare object recognition performance on ... |
| 3 | [We explore the concept of co-design in the co... | [0, 1, 0, 0, 0, 0] | [0.1249999925494194, 0.4888888895511627, 0.204... | BJfIVjAcKm | [We develop methods to train deep neural model... |
| 4 | [Batch Normalization (BatchNorm) has shown to ... | [0, 0, 1, 0, 0, 0] | [0.19999998807907104, 0.23999999463558197, 0.4... | BJlEEaEFDS | [Investigation of how BatchNorm causes adversa... |


# IV. Data Preparation

## IV.A. Data Shape Inspection

In [6]:
train_df.shape, test_df.shape, validation_df.shape

((1992, 5), (618, 5), (619, 5))
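
To put these shapes in perspective, each split's share of the data can be computed from the row counts. The following is a minimal sketch that hard-codes the counts reported above instead of reloading the dataset; `split_rows` and `split_shares` are illustrative names that do not appear in the notebook.

```python
# Row counts copied from the shape output above (assumption: they match the
# dataset version you load).
split_rows = {"train": 1992, "test": 618, "validation": 619}


def split_shares(rows):
    """Return each split's share of the total row count, in percent."""
    total = sum(rows.values())
    return {name: round(100 * count / total, 1) for name, count in rows.items()}


shares = split_shares(split_rows)
for name, share in shares.items():
    print(f"{name}: {split_rows[name]} rows ({share}%)")
```

With these counts the split works out to roughly 62% training, 19% test, and 19% validation.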

## IV.B. Data Information Inspection

In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1992 entries, 0 to 1991
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   source         1992 non-null   object
 1   source_labels  1992 non-null   object
 2   rouge_scores   1992 non-null   object
 3   paper_id       1992 non-null   object
 4   target         1992 non-null   object
dtypes: object(5)
memory usage: 77.9+ KB


In [8]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 618 entries, 0 to 617
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   source         618 non-null    object
 1   source_labels  618 non-null    object
 2   rouge_scores   618 non-null    object
 3   paper_id       618 non-null    object
 4   target         618 non-null    object
dtypes: object(5)
memory usage: 24.3+ KB


In [9]:
validation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 619 entries, 0 to 618
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   source         619 non-null    object
 1   source_labels  619 non-null    object
 2   rouge_scores   619 non-null    object
 3   paper_id       619 non-null    object
 4   target         619 non-null    object
dtypes: object(5)
memory usage: 24.3+ KB
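
The `info()` output reports every column as `object`, which only says that pandas is storing generic Python objects; it does not show whether a cell holds a string or a list. One way to check is sketched below on a small hand-made frame that mimics the layout seen in the previews. The values in `toy_df` and the `python_types` helper are invented for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical two-row frame shaped like the SciTLDR splits; all values are made up.
toy_df = pd.DataFrame(
    {
        "source": [["Sentence one.", "Sentence two."], ["Only sentence."]],
        "source_labels": [[0, 1], [1]],
        "rouge_scores": [[0.1, 0.4], [0.9]],
        "paper_id": ["id_a", "id_b"],
        "target": [["Summary A."], ["Summary B1.", "Summary B2."]],
    }
)


def python_types(df):
    """Map each column to the sorted Python type names found in its cells."""
    return {
        col: sorted({type(value).__name__ for value in df[col]})
        for col in df.columns
    }


# Print the per-column Python types, e.g. list for source and str for paper_id.
print(python_types(toy_df))
```

Applying the same helper to `train_df`, `test_df`, and `validation_df` would show which columns are list-valued before any further processing.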


## IV.C. Data Definition

| Column | Definition |
| :-: | :-: |
| source | The Abstract, Introduction, and Conclusion (AIC) or full text of the paper, stored as a list with one sentence per element. |
| source_labels | Binary label (0 or 1) for each sentence in `source`; 1 marks the oracle sentence. |
| rouge_scores | Precomputed ROUGE baseline score for each sentence in `source`. |
| paper_id | Paper identifier, given as an arXiv paper ID (sampled values such as `SysEexbRb` do not follow the arXiv numbering format). |
| target | Reference summaries of the paper, stored as a list with one summary per element. |
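
The definitions above suggest that `source`, `source_labels`, and `rouge_scores` are parallel lists with one entry per sentence. The sketch below shows how they would line up, assuming that alignment holds. The record, its sentences and scores, and both helper functions are hypothetical and not drawn from SciTLDR.

```python
# Invented record that follows the column layout described above.
record = {
    "source": [
        "We study summarization of papers.",
        "Our model produces one-sentence summaries.",
        "Experiments show consistent gains.",
    ],
    "source_labels": [0, 1, 0],
    "rouge_scores": [0.12, 0.57, 0.30],
}


def oracle_sentences(rec):
    """Return the sentences labelled 1, after checking the lists are aligned."""
    n = len(rec["source"])
    if not (len(rec["source_labels"]) == len(rec["rouge_scores"]) == n):
        raise ValueError("source, source_labels and rouge_scores differ in length")
    return [s for s, label in zip(rec["source"], rec["source_labels"]) if label == 1]


def best_by_rouge(rec):
    """Return the sentence with the highest precomputed ROUGE score."""
    scores = rec["rouge_scores"]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return rec["source"][best_index]


print(oracle_sentences(record))
print(best_by_rouge(record))
```

Row 2 of the training preview is consistent with this reading: its label list `[0, 1]` marks the second sentence, which also has the higher score (about 0.89 versus 0.33). Whether the labelled sentence is always the top-scoring one is not verified here.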

## IV.D. Export Data

In [10]:
# Save each split as a pickle file so later notebooks can reload it unchanged.
train_df.to_pickle('../../data/processed/train_df.pkl')
test_df.to_pickle('../../data/processed/test_df.pkl')
validation_df.to_pickle('../../data/processed/validation_df.pkl')
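
Because the exported frames contain list-valued cells, pickle keeps them as Python lists, whereas a text format such as CSV would write them out as strings. The sketch below shows a minimal round trip with `pd.read_pickle`. It assumes a temporary directory in place of the project's `../../data/processed/` folder and uses a made-up one-row frame, `toy_df`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame with list-valued cells, standing in for one of the splits.
toy_df = pd.DataFrame(
    {"source": [["First sentence.", "Second sentence."]], "source_labels": [[0, 1]]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "toy_df.pkl"
    toy_df.to_pickle(pkl_path)          # same call the notebook uses for export
    restored = pd.read_pickle(pkl_path)  # how a later notebook could reload it

# Compare contents column by column and confirm the cells are still lists.
same_content = restored.to_dict(orient="list") == toy_df.to_dict(orient="list")
print(same_content)
print(type(restored.loc[0, "source"]).__name__)
```

A later notebook could reload the real splits the same way by pointing `pd.read_pickle` at the exported paths.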