# ROCStories Preprocessing

This notebook preprocesses the ROCStories dataset and pushes it to the HuggingFace Hub as a Dataset.

We make a few changes.

1. Column names were renamed to match across splits. For example, the `test` split originally has `InputSentenceX`, which was renamed to `sentenceX` to match the `train` split.

2. Columns missing from a split were added to it with empty values. For example, the `train` split contains a `storytitle` column; this column was added to the `test` and `validation` splits, but its content is empty. Likewise, `sentenceE` was added to `train` as an empty column.

3. The `test`/`validation` splits had three extra columns: `RandomFifthSentenceQuiz[1|2]` and `AnswerRightEnding`. For each row, the ending marked as correct by `AnswerRightEnding` was stored in `sentence5`, and the other ending was stored in `sentenceE` (Error).

4. The test set of the 2018 dataset does not contain the `AnswerRightEnding` column. For the sake of completeness, we took the first random sentence as the correct one and the second as the wrong one.


Most of these changes were made to work around a limitation of HuggingFace DatasetDicts, since all splits need to have the same columns.

The original ROCStories papers are:
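
To make change 3 concrete, here is a minimal sketch on a toy frame. The sentence contents are made up, and the vectorized `Series.where` approach is only an illustration of the rule, not necessarily how the cells in this notebook implement it.

```python
import pandas as pd

# Toy rows using the original cloze-test column names (contents are invented).
toy = pd.DataFrame({
    "RandomFifthSentenceQuiz1": ["right A", "wrong B"],
    "RandomFifthSentenceQuiz2": ["wrong A", "right B"],
    "AnswerRightEnding": [1, 2],
})

# True where the first quiz sentence is the correct ending.
first_is_right = toy["AnswerRightEnding"] == 1

# Series.where keeps the value where the condition holds and takes `other` elsewhere.
toy["sentence5"] = toy["RandomFifthSentenceQuiz1"].where(
    first_is_right, toy["RandomFifthSentenceQuiz2"]
)
toy["sentenceE"] = toy["RandomFifthSentenceQuiz2"].where(
    first_is_right, toy["RandomFifthSentenceQuiz1"]
)

print(toy[["sentence5", "sentenceE"]])
```

After this step both correct endings sit in `sentence5` and both wrong endings in `sentenceE`, regardless of which quiz column they came from.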

"Tackling The Story Ending Biases in The Story Cloze Test". Rishi Sharma, James Allen, Omid Bakhshandeh, Nasrin Mostafazadeh. In Proceedings of the 2018 Conference of the Association for Computational Linguistics (ACL), 2018

"A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories". Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli and James Allen. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), 2016

In [1]:
import datasets
import pandas as pd
from pathlib import Path

In [2]:
rocstories_files = {
    "2016": {
        "train":      "ROCStories_spring2016 - ROCStories_spring2016.csv",
        "test":       "cloze_test_test__spring2016 - cloze_test_ALL_test.csv",
        "validation": "cloze_test_val__spring2016 - cloze_test_ALL_val.csv"
    },
    "2018": {
        "train":      "ROCStories_winter2017 - ROCStories_winter2017.csv",
        "test":       "cloze_test_test__winter2018-cloze_test_ALL_test.csv",
        "validation": "cloze_test_val__winter2018-cloze_test_ALL_val.csv"
    }
}

In [3]:
corpora_base_path = Path("../corpora/")
rocstories_base_path = corpora_base_path / "ROCStories"

In [4]:
dfs = {}
for year, year_splits in rocstories_files.items():
    dfs[year] = {}
    for split, filename in year_splits.items():
        dfs[year][split] = pd.read_csv(rocstories_base_path / filename)

In [5]:
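def add_answers(sample):
    """Store the correct ending in `sentence5` and the wrong one in `sentenceE`."""
    # AnswerRightEnding == 1 means RandomFifthSentenceQuiz1 is the correct ending;
    # any other value is treated as RandomFifthSentenceQuiz2 being correct.
    right_ans = sample['AnswerRightEnding']
    if right_ans == 1:
        sample['sentence5'] = sample['RandomFifthSentenceQuiz1']
        sample['sentenceE'] = sample['RandomFifthSentenceQuiz2']
    else:
        sample['sentence5'] = sample['RandomFifthSentenceQuiz2']
        sample['sentenceE'] = sample['RandomFifthSentenceQuiz1']

    return sample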

In [6]:
dfs['2016']['test'] = dfs['2016']['test'].apply(add_answers, axis=1)
dfs['2016']['validation'] = dfs['2016']['validation'].apply(add_answers, axis=1)
dfs['2018']['validation'] = dfs['2018']['validation'].apply(add_answers, axis=1)
# The 2018 test set has no label column, so we assume that answer 1
# is always correct and answer 2 is always wrong.
dfs['2018']['test']['sentence5'] = dfs['2018']['test']['RandomFifthSentenceQuiz1']
dfs['2018']['test']['sentenceE'] = dfs['2018']['test']['RandomFifthSentenceQuiz2']

In [7]:
# Rename columns to match in all splits
columns_to_rename = {
    'InputStoryid': 'storyid',
    'InputSentence1': 'sentence1',
    'InputSentence2': 'sentence2',
    'InputSentence3': 'sentence3',
    'InputSentence4': 'sentence4',
}

dfs['2016']['test'] = dfs['2016']['test'].rename(columns=columns_to_rename)
dfs['2016']['validation'] = dfs['2016']['validation'].rename(columns=columns_to_rename)

dfs['2018']['test'] = dfs['2018']['test'].rename(columns=columns_to_rename)
dfs['2018']['validation'] = dfs['2018']['validation'].rename(columns=columns_to_rename)

In [8]:
# Remove columns that are no longer needed
columns_to_remove = ['RandomFifthSentenceQuiz1', 'RandomFifthSentenceQuiz2', 'AnswerRightEnding']

dfs['2016']['test'] = dfs['2016']['test'].drop(columns=columns_to_remove)
dfs['2016']['validation'] = dfs['2016']['validation'].drop(columns=columns_to_remove)

# errors='ignore' because the 2018 test set has no AnswerRightEnding column
dfs['2018']['test'] = dfs['2018']['test'].drop(columns=columns_to_remove, errors='ignore')
dfs['2018']['validation'] = dfs['2018']['validation'].drop(columns=columns_to_remove)

In [9]:
# Adding empty columns
dfs['2016']['train']['sentenceE'] = ""
dfs['2016']['test']['storytitle'] = ""
dfs['2016']['validation']['storytitle'] = ""

dfs['2018']['train']['sentenceE'] = ""
dfs['2018']['test']['storytitle'] = ""
dfs['2018']['validation']['storytitle'] = ""

In [10]:
# Reordering columns
columns = ['storyid', 'storytitle', 'sentence1', 'sentence2', 'sentence3', 'sentence4', 'sentence5', 'sentenceE']

dfs['2016']['train'] = dfs['2016']['train'][columns]
dfs['2016']['test'] = dfs['2016']['test'][columns]
dfs['2016']['validation'] = dfs['2016']['validation'][columns]

dfs['2018']['train'] = dfs['2018']['train'][columns]
dfs['2018']['test'] = dfs['2018']['test'][columns]
dfs['2018']['validation'] = dfs['2018']['validation'][columns]
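
Since every split is meant to end up with identical columns, a small check before building the `DatasetDict` can catch a missed rename early. The following is a minimal sketch on toy frames; `check_columns` is a hypothetical helper introduced here for illustration, not part of pandas or `datasets`.

```python
import pandas as pd

expected_columns = ['storyid', 'storytitle', 'sentence1', 'sentence2',
                    'sentence3', 'sentence4', 'sentence5', 'sentenceE']

def check_columns(splits, expected):
    """Raise ValueError if any split's columns differ from `expected` (order included)."""
    for name, df in splits.items():
        actual = list(df.columns)
        if actual != expected:
            raise ValueError(f"split {name!r} has columns {actual}, expected {expected}")

# Toy stand-ins for the real splits: empty frames with the target columns.
toy_splits = {
    "train": pd.DataFrame(columns=expected_columns),
    "test": pd.DataFrame(columns=expected_columns),
    "validation": pd.DataFrame(columns=expected_columns),
}

check_columns(toy_splits, expected_columns)  # passes silently
```

In this notebook the same helper could presumably be called as `check_columns(dfs['2016'], columns)` and `check_columns(dfs['2018'], columns)` once the reordering cell has run.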


In [11]:
# Build dataset
rocstories_2016 = datasets.DatasetDict({
    'train': datasets.Dataset.from_pandas(dfs['2016']['train']),
    'test': datasets.Dataset.from_pandas(dfs['2016']['test']),
    'validation': datasets.Dataset.from_pandas(dfs['2016']['validation'])
})

In [12]:
rocstories_2018 = datasets.DatasetDict({
    'train': datasets.Dataset.from_pandas(dfs['2018']['train']),
    'test': datasets.Dataset.from_pandas(dfs['2018']['test']),
    'validation': datasets.Dataset.from_pandas(dfs['2018']['validation'])
})

In [13]:
dataset2016 = "igormorgado/ROCStories2016"
rocstories_2016.push_to_hub(dataset2016)

CommitInfo(commit_url='https://huggingface.co/datasets/igormorgado/ROCStories2016/commit/a2deb1f67410a67faccac1b8ab353f3be65adc99', commit_message='Upload dataset', commit_description='', oid='a2deb1f67410a67faccac1b8ab353f3be65adc99', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/igormorgado/ROCStories2016', endpoint='https://huggingface.co', repo_type='dataset', repo_id='igormorgado/ROCStories2016'), pr_revision=None, pr_num=None)

In [15]:
dataset2018 = "igormorgado/ROCStories2018"
rocstories_2018.push_to_hub(dataset2018)

CommitInfo(commit_url='https://huggingface.co/datasets/igormorgado/ROCStories2018/commit/cf9d25d619345d75051b9430adae7818af65c587', commit_message='Upload dataset', commit_description='', oid='cf9d25d619345d75051b9430adae7818af65c587', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/igormorgado/ROCStories2018', endpoint='https://huggingface.co', repo_type='dataset', repo_id='igormorgado/ROCStories2018'), pr_revision=None, pr_num=None)

In [18]:
README_2016 = """
# ROCStories 2016 Dataset

This is the HuggingFace version of the ROCStories dataset.

We make a few changes.

1. Column names were renamed to match across splits. For example, the `test` split originally has `InputSentenceX`, which was renamed to `sentenceX` to match the `train` split.

2. Columns missing from a split were added to it with empty values. For example, the `train` split contains a `storytitle` column; this column was added to the `test` and `validation` splits.

3. The `test`/`validation` splits had three extra columns: `RandomFifthSentenceQuiz[1|2]` and `AnswerRightEnding`. For each row, the ending marked as correct by `AnswerRightEnding` was stored in `sentence5`, and the other ending was stored in `sentenceE` (Error).

Most of these changes were made to work around a limitation of HuggingFace DatasetDicts, since all splits need to have the same columns.

Original Paper

"A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories". Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli and James Allen. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), 2016
"""

In [17]:
README_2018 = """
# ROCStories 2018 Dataset

This is the HuggingFace version of the ROCStories dataset.

We make a few changes.

1. Column names were renamed to match across splits. For example, the `test` split originally has `InputSentenceX`, which was renamed to `sentenceX` to match the `train` split.

2. Columns missing from a split were added to it with empty values. For example, the `train` split contains a `storytitle` column; this column was added to the `test` and `validation` splits.

3. The `test`/`validation` splits had three extra columns: `RandomFifthSentenceQuiz[1|2]` and `AnswerRightEnding`. For each row, the ending marked as correct by `AnswerRightEnding` was stored in `sentence5`, and the other ending was stored in `sentenceE` (Error).

WARNING: The test set of the 2018 dataset does not contain the `AnswerRightEnding` column. For the sake of completeness, we took `RandomFifthSentenceQuiz1` as the correct one and `RandomFifthSentenceQuiz2` as the wrong one. If you know where to find the correct answers, please let me know and I will fix the dataset.

Most of these changes were made to work around a limitation of HuggingFace DatasetDicts, since all splits need to have the same columns.

Original Paper

"Tackling The Story Ending Biases in The Story Cloze Test". Rishi Sharma, James Allen, Omid Bakhshandeh, Nasrin Mostafazadeh. In Proceedings of the 2018 Conference of the Association for Computational Linguistics (ACL), 2018
"""

In [23]:
from huggingface_hub import HfApi, login
import io
api = HfApi()

In [24]:
readme_bytes = README_2016.encode('utf-8')

# Upload the README content
api.upload_file(
    path_or_fileobj=io.BytesIO(readme_bytes),
    path_in_repo="README.md",
    repo_id=dataset2016,
    repo_type="dataset",
)

print(f"README.md has been uploaded to {dataset2016}")

- empty or missing yaml metadata in repo card


README.md has been uploaded to igormorgado/ROCStories2016
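
The "empty or missing yaml metadata in repo card" message above suggests the Hub expects a YAML front-matter block at the top of `README.md`. One way to address it might be to prepend such a block before encoding the README. This is a minimal sketch: `with_front_matter` is a hypothetical helper, and the `pretty_name` and `language` keys are assumed from common dataset card conventions, so check the Hub documentation for the fields that actually apply.

```python
def with_front_matter(body: str, pretty_name: str, language: str = "en") -> str:
    """Prepend a small YAML front-matter block to a README body."""
    front_matter = (
        "---\n"
        f"pretty_name: {pretty_name}\n"
        "language:\n"
        f"- {language}\n"
        "---\n"
    )
    # Drop leading blank lines so the card body starts right after the block.
    return front_matter + body.lstrip("\n")

# Toy README body; in this notebook it would be README_2016 or README_2018.
card = with_front_matter("\n# ROCStories 2016 Dataset\n", "ROCStories 2016")
print(card)
```

The resulting string could then be encoded and uploaded the same way as `README_2016` in the cell above.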


In [25]:
readme_bytes = README_2018.encode('utf-8')

# Upload the README content
api.upload_file(
    path_or_fileobj=io.BytesIO(readme_bytes),
    path_in_repo="README.md",
    repo_id=dataset2018,
    repo_type="dataset",
)

print(f"README.md has been uploaded to {dataset2018}")

README.md has been uploaded to igormorgado/ROCStories2018
