# Instructions
* Run the cells from 'Dataset Creation' up to (but not including) 'Synthetic Data Generation'. You only need to do this once.
* After that, the 'Synthetic Data Generation' section contains everything you need to generate synthetic data from the parsed dataset.

# Dataset Creation
1) Download the CNN/DailyMail dataset into the folder `datasets/`. The downloaded folder should already be named `cnn_dailymail`, so the train `.csv` ends up at `datasets/cnn_dailymail/train.csv` (otherwise, change `dataset_dir_in` below as needed).
2) Create the directory `datasets/cnn_parsed`.
3) Run this section of the notebook. It removes all Daily Mail and duplicate articles and saves the new train/val/test split into `datasets/cnn_parsed`.
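
The folder setup in steps 1 and 2 can also be checked from Python before running the cells. This is only a sketch: `prepare_dirs` is a hypothetical helper that is not part of this notebook, and the three file names (`train`, `validation`, `test`) are taken from the `load_and_parse` calls below.

```python
from pathlib import Path


def prepare_dirs(dir_in="datasets/cnn_dailymail", dir_out="datasets/cnn_parsed"):
    """Create the output folder and return the input CSVs that are missing.

    Hypothetical helper: the default paths mirror dataset_dir_in and
    dataset_dir_out used in this notebook.
    """
    dir_in, dir_out = Path(dir_in), Path(dir_out)
    # Step 2: create datasets/cnn_parsed (no error if it already exists).
    dir_out.mkdir(parents=True, exist_ok=True)
    # Step 1: the raw CSVs this notebook reads.
    expected = [dir_in / f"{name}.csv" for name in ("train", "validation", "test")]
    return [str(path) for path in expected if not path.is_file()]
```

Calling `prepare_dirs()` from the notebook's working directory should return an empty list once the download is in place; any path it returns still needs to be downloaded or moved.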

In [21]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

dataset_dir_in = 'datasets/cnn_dailymail/'
dataset_dir_out = 'datasets/cnn_parsed/'

In [22]:
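# Load one split of the raw dataset and keep only unique CNN articles
def load_and_parse(dataset='train'):
    df = pd.read_csv(dataset_dir_in + dataset + '.csv')
    # Keep articles whose text contains 'CNN'; na=False treats missing text as no match
    df = df[df.article.str.contains('CNN', na=False)]
    # Drop exact duplicate articles, keeping the first occurrence
    df = df.drop_duplicates('article')
    return df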

df_train = load_and_parse('train')
df_test = load_and_parse('test')
df_val = load_and_parse('validation')
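
To see what the filter in `load_and_parse` keeps, the same two steps can be run on a small hand-made frame. The toy rows below are invented for illustration and are not taken from the dataset. Note that the substring match is a heuristic: a Daily Mail article that happens to mention CNN would also be kept.

```python
import pandas as pd

# Invented toy rows, using the same column names as the real CSV.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "article": [
        "(CNN) -- First story.",
        "By Daily Mail Reporter. Second story.",
        "(CNN) -- First story.",  # exact duplicate of row "a"
        None,                     # missing article text
    ],
})

# na=False makes rows with missing text count as "no match",
# so the mask stays purely boolean and can be used for indexing.
is_cnn = toy["article"].str.contains("CNN", na=False)

# Same two steps as load_and_parse: filter, then drop duplicate articles.
parsed = toy[is_cnn].drop_duplicates("article")

print(parsed["id"].tolist())  # ['a']
```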

In [65]:
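# Pool the original train and validation sets, then re-split them 70/30
df = pd.concat([df_train, df_val])
# Fixed seed (the value itself is arbitrary) so reruns give the same rows, and therefore the same split numbers
df_train, df_val = train_test_split(df, test_size=0.3, random_state=0)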


In [70]:
num_articles_per_split = 128

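def create_splits(df, num_articles_per_split):
    # Sort by id so the split assignment does not depend on the current row order
    df = df.sort_values('id')
    # Consecutive chunks of num_articles_per_split rows get split numbers 0, 1, 2, ...
    df['split'] = np.arange(df.shape[0]) // num_articles_per_split
    return df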

df_train, df_val = create_splits(df_train, num_articles_per_split), create_splits(df_val, num_articles_per_split)
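
The integer-division trick behind `create_splits` is easier to see on a tiny example. The sketch below uses a hypothetical stand-in named `assign_splits` and an invented five-row frame; it follows the same sort-then-divide logic but is not the notebook's own function.

```python
import numpy as np
import pandas as pd


def assign_splits(df, num_articles_per_split):
    """Sort by id and label consecutive chunks of rows 0, 1, 2, ..."""
    df = df.sort_values("id").copy()
    # Row position // chunk size gives the chunk (split) number.
    df["split"] = np.arange(len(df)) // num_articles_per_split
    return df


# Invented ids in scrambled order.
toy = pd.DataFrame({"id": ["e", "b", "a", "d", "c"]})
labelled = assign_splits(toy, num_articles_per_split=2)

print(labelled["id"].tolist())     # ['a', 'b', 'c', 'd', 'e']
print(labelled["split"].tolist())  # [0, 0, 1, 1, 2]
```

The last split can be smaller than the others, so the number of splits is the row count divided by `num_articles_per_split`, rounded up.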

In [71]:
df_train.to_csv(dataset_dir_out + 'train.csv')
df_val.to_csv(dataset_dir_out + 'val.csv')
df_test.to_csv(dataset_dir_out + 'test.csv')

# Synthetic Data Generation

* Below, make sure the directories match the ones you created.
* Then set `splits` to choose which splits to generate data for. Kerem: 0-150, Lillian: 151-300, Emma: 301-450.
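
Instead of typing each split number by hand, an assigned range can be expanded into a list. This is a minimal sketch: `split_range` is a hypothetical helper, not part of the `data` module, and the total of 451 is the `total_splits` value printed further down in this notebook.

```python
def split_range(first, last, total_splits):
    """Return the split indices first..last (inclusive) after a bounds check."""
    if not 0 <= first <= last < total_splits:
        raise ValueError(
            f"range {first}-{last} is outside the valid splits 0-{total_splits - 1}"
        )
    return list(range(first, last + 1))


# Example: the 0-150 range from the list above.
splits = split_range(0, 150, total_splits=451)
print(len(splits), splits[0], splits[-1])  # 151 0 150
```

The resulting list can be passed as the `splits` argument in the cell that calls `data.process_splits`.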

In [1]:
import data
import pandas as pd
import matplotlib.pyplot as plt
# test the summary generator

df_train = pd.read_csv('datasets/cnn_parsed/train.csv')

task = 'summary' # summary or qna
synthetic_data_dir = f'datasets/synthetic/{task}/'

if task == 'summary':
    generator = data.SummaryGenerator()
else:
    generator = data.QnAGenerator()

total_splits = df_train.split.max() + 1

total_splits

451

In [2]:
import time

splits = [5,6,7,8,9]
data.process_splits(df_train, generator, splits, synthetic_data_dir = synthetic_data_dir, mode = 'train')

----- Parsing split 5 -----
----- Number of articles to parse: 128 -----
----- ELAPSED TIME -----
208.9 seconds
----- Parsing split 6 -----
----- Number of articles to parse: 128 -----
----- ELAPSED TIME -----
227.9 seconds
----- Parsing split 7 -----
----- Number of articles to parse: 128 -----
----- ELAPSED TIME -----
201.8 seconds
----- Parsing split 8 -----
----- Number of articles to parse: 128 -----
----- ELAPSED TIME -----
198.6 seconds
----- Parsing split 9 -----
----- Number of articles to parse: 128 -----
----- ELAPSED TIME -----
222.0 seconds
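
The elapsed times in the log above give a rough idea of how long a full share of splits takes. The sketch below only does arithmetic on those five logged values and assumes the remaining splits run at about the same speed, which may not hold.

```python
# Elapsed times in seconds, copied from the log for splits 5 to 9.
elapsed = [208.9, 227.9, 201.8, 198.6, 222.0]

mean_seconds = sum(elapsed) / len(elapsed)

# 151 is the size of the 0-150 share; assumes later splits take about as long.
splits_per_person = 151
estimated_hours = mean_seconds * splits_per_person / 3600

print(f"{mean_seconds:.1f} s per split, "
      f"about {estimated_hours:.1f} h for {splits_per_person} splits")
# 211.8 s per split, about 8.9 h for 151 splits
```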
