## Important: run this notebook only ONCE, so that all our models are trained and evaluated on the same dataset splits.

Import the ParaDetox dataset, examine it, and create train, valid and test sets to run on the following models:

* BART-large pretrained
* BART-large-xsum pretrained
* BART-large-cnn pretrained
* BART-large with fine-tuning
* BART-large-xsum with fine-tuning
* BART-large-cnn with fine-tuning
* T5 models

In [1]:
# Install these packages if running from colab
!pip install tensorflow-datasets --quiet
!pip install pydot --quiet
!pip install transformers --quiet

# install huggingface datasets
!pip install datasets --quiet

! pip install rouge-score nltk --quiet
! pip install huggingface_hub --quiet


In [2]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
import tensorflow_datasets as tfds

import sklearn as sk
import os
import nltk
from nltk.data import find

import matplotlib.pyplot as plt

import re

#let's make longer output readable without scrolling
from pprint import pprint

# the toxic parallel dataset, with rouge metric
from datasets import load_dataset, load_from_disk, load_metric, DatasetDict

In [3]:
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# paths
dataset_path = 'drive/MyDrive/Colab Notebooks/w266_project_data'
csv_path = 'drive/MyDrive/Colab Notebooks/w266_project_predictions/'

## Load the ParaDetox dataset from Hugging Face

In [5]:
# import paradetox dataset from huggingface
# the toxic parallel dataset, with rouge metric
from datasets import load_dataset, load_metric, DatasetDict

dataset = load_dataset("SkolkovoInstitute/paradetox", split="train")
metric = load_metric("rouge")

Downloading and preparing dataset csv/SkolkovoInstitute--paradetox to /root/.cache/huggingface/datasets/SkolkovoInstitute___csv/SkolkovoInstitute--paradetox-2d7856e905be458c/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/SkolkovoInstitute___csv/SkolkovoInstitute--paradetox-2d7856e905be458c/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.

### Shuffle the ParaDetox dataset and split it into train, valid and test sets

In [6]:
# 90% train, 10% test + validation
train_test_dataset = dataset.train_test_split(test_size=0.1, shuffle=True)

# Split the 10% test set into half test, half valid
test_valid_split = train_test_dataset['test'].train_test_split(test_size=0.5)

# gather them into a single DatasetDict
dataset = DatasetDict({
    'train': train_test_dataset['train'],
    'test': test_valid_split['test'],
    'valid': test_valid_split['train']})

### Examine the dataset

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['en_toxic_comment', 'en_neutral_comment'],
        num_rows: 17789
    })
    test: Dataset({
        features: ['en_toxic_comment', 'en_neutral_comment'],
        num_rows: 989
    })
    valid: Dataset({
        features: ['en_toxic_comment', 'en_neutral_comment'],
        num_rows: 988
    })
})

The dataset contains two columns:

- en_toxic_comment = input: toxic comments
- en_neutral_comment = label: manually translated neutral comments

There are no other features in this dataset.

### Examine and test the ROUGE metric

In [8]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu

In [9]:
# test ROUGE metric with the same prediction (input) and reference (label)
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}

In [10]:
dataset['train'][0]

{'en_toxic_comment': "u 'd be surprised all the shit u think about when u jus sittin there",
 'en_neutral_comment': 'You would be supriesd all the things you think about when you are just sitting there.'}

In [11]:
# try ROUGE on one actual example from the detox dataset
try_preds = [dataset['train'][0]['en_toxic_comment']]
try_labels = [dataset['train'][0]['en_neutral_comment']]
metric.compute(predictions=try_preds, references=try_labels)

{'rouge1': AggregateScore(low=Score(precision=0.4666666666666667, recall=0.4375, fmeasure=0.45161290322580644), mid=Score(precision=0.4666666666666667, recall=0.4375, fmeasure=0.45161290322580644), high=Score(precision=0.4666666666666667, recall=0.4375, fmeasure=0.45161290322580644)),
 'rouge2': AggregateScore(low=Score(precision=0.21428571428571427, recall=0.2, fmeasure=0.20689655172413796), mid=Score(precision=0.21428571428571427, recall=0.2, fmeasure=0.20689655172413796), high=Score(precision=0.21428571428571427, recall=0.2, fmeasure=0.20689655172413796)),
 'rougeL': AggregateScore(low=Score(precision=0.4666666666666667, recall=0.4375, fmeasure=0.45161290322580644), mid=Score(precision=0.4666666666666667, recall=0.4375, fmeasure=0.45161290322580644), high=Score(precision=0.4666666666666667, recall=0.4375, fmeasure=0.45161290322580644)),
 'rougeLsum': AggregateScore(low=Score(precision=0.4666666666666667, recall=0.4375, fmeasure=0.45161290322580644), mid=Score(precision=0.46666666666
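
To see where the ROUGE-1 and ROUGE-2 numbers above come from, the scores can be recomputed by hand. The sketch below is a simplified reimplementation, not the library's own code. It assumes lowercasing, splitting on non-alphanumeric characters and no stemming, and the helper names `tokenize`, `ngrams` and `rouge_n` are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit (assumed tokenization)
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def ngrams(tokens, n):
    # Multiset of n-grams in the token sequence
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    pred = ngrams(tokenize(prediction), n)
    ref = ngrams(tokenize(reference), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both texts
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


toxic = "u 'd be surprised all the shit u think about when u jus sittin there"
neutral = ("You would be supriesd all the things you think about "
           "when you are just sitting there.")

p1, r1, f1 = rouge_n(toxic, neutral, n=1)  # 7 shared unigrams out of 15 and 16
p2, r2, f2 = rouge_n(toxic, neutral, n=2)  # 3 shared bigrams out of 14 and 15
print(p1, r1, f1)
print(p2, r2, f2)
```

Under these assumptions the toxic input shares 7 of its 15 unigrams and 3 of its 14 bigrams with the neutral reference, which gives the precision, recall and F-measure values reported for `rouge1` and `rouge2`.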

In [12]:
dataset.shape

{'train': (17789, 2), 'test': (989, 2), 'valid': (988, 2)}
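
The split sizes can be checked with a little arithmetic. The sketch below assumes that the test share is rounded up and the remainder goes to the other split, which is consistent with the counts shown above. The helper `split_sizes` is a hypothetical illustration, not a library function.

```python
import math


def split_sizes(n_rows, test_size):
    # Assumption: the test share is rounded up, the rest goes to the other split
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


total = 17789 + 989 + 988  # 19766 rows across all three splits

# First split: 90% train, 10% held out
n_train, n_holdout = split_sizes(total, 0.1)

# Second split: the held-out rows are halved into valid and test
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_test, n_valid)
```

The held-out set has an odd number of rows (1977), so the test set ends up one row larger than the valid set.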

### Save the dataset to our local storage

To load the dataset later:
* dataset = load_from_disk("drive/MyDrive/Colab Notebooks/w266_project_data")

In [13]:
# save the dataset to disk (dataset_path is defined in the paths cell above)
dataset.save_to_disk(dataset_path)

# to load the dataset later
# dataset = load_from_disk(dataset_path)
