# An In-depth Evaluation of Approaches to Text Classification (IDEATC)

## I. Data Preparation

_This notebook is used to prepare data for the project. This includes retrieving and preprocessing datasets – mostly from [Hugging Face](https://huggingface.co/datasets) – for sentiment analysis, news categorisation and topic classification._

### Libraries

In [1]:
# standard library
import os
import sys; sys.path.append(os.path.join(os.pardir, 'src'))
from pathlib import Path
from pprint import pprint
from collections import Counter

# data wrangling
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict
from sklearn.datasets import fetch_20newsgroups

# nlp
import en_core_web_sm

# local packages
import src

# other settings
nlp = en_core_web_sm.load(disable=['ner'])
SAVE_PATH_DATASET = Path(os.pardir, 'data', 'processed')
SAVE_PATH_PHRASES = Path(os.pardir, 'models', 'gensim')

## I. Yelp Polarity

**Source:** [Hugging Face Datasets](https://huggingface.co/datasets/yelp_polarity)
**Task:** Sentiment Analysis
**Training Size:** 560,000
**Testing Size:** 1,000 (undersampled from 38,000)
**Target:** Binary

In [2]:
dataset_name = 'yelp_polarity'
dataset = load_dataset(dataset_name)
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 560000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 38000
    })
})

In [3]:
dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['1', '2'], id=None)}

In [4]:
dataset = src.preparation.preprocess_dataset(
    dataset_dict=dataset,
    class_names=['Negative', 'Positive'],
    undersample_test=True,
    nlp=nlp,
    phrases_save_path=SAVE_PATH_PHRASES.joinpath(f'phrases_{dataset_name}.pkl'),
    verbose=True
)
dataset



{'label': ClassLabel(names=['Negative', 'Positive'], id=None),
 'text': Value(dtype='string', id=None)}


Phrase count: 1,260

Class count: 2
Label   	Train	Test	Support (Train)
Negative	50.00%	50.00%	280,000
Positive	50.00%	50.00%	280,000


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 560000
    })
    test: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 1000
    })
})
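
With `undersample_test=True`, the test split shrinks from 38,000 to 1,000 rows while staying class-balanced. The implementation of `src.preparation.preprocess_dataset` is not shown in this notebook, so the following is only a minimal sketch of one way to do balanced undersampling on plain Python records. The function name `undersample_balanced`, its parameters and the seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def undersample_balanced(examples, n_total, label_key="label", seed=42):
    """Draw a class-balanced random subset of `examples` with `n_total` rows."""
    # group the records by their label
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[label_key]].append(example)

    per_class, remainder = divmod(n_total, len(by_label))
    if remainder:
        raise ValueError("n_total must be divisible by the number of classes")

    rng = random.Random(seed)  # a fixed seed keeps the subset reproducible
    subset = []
    for label in sorted(by_label):
        if len(by_label[label]) < per_class:
            raise ValueError(f"class {label!r} has fewer than {per_class} rows")
        subset.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(subset)  # avoid returning the rows grouped by class
    return subset


# toy stand-in for the 38,000-row test split (hypothetical data)
test_rows = [{"text": f"review {i}", "label": i % 2} for i in range(38_000)]
small_test = undersample_balanced(test_rows, n_total=1_000)
```

Sampling the same number of rows per class is what keeps the 50/50 split reported in the label table above.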

In [5]:
pprint(dataset['train'][3])

{'label': 0,
 'text': "I'm writing this review to give you a heads up before you see this "
         'Doctor. The office staff and administration are very unprofessional. '
         'I left a message with multiple people regarding my bill, and no one '
         'ever called me back. I had to hound them to get an answer about my '
         'bill. \\n\\nSecond, and most important, make sure your insurance is '
         "going to cover Dr. Goldberg's visits and blood work. He recommended "
         'to me that I get a physical, and he knew I was a student because I '
         'told him. I got the physical done. Later, I found out my health '
         "insurance doesn't pay for preventative visits. I received an $800.00 "
         "bill for the blood work. I can't pay for my bill because I'm a "
         "student and don't have any cash flow at this current time. I can't "
         "believe the Doctor wouldn't give me a heads up to make sure my "
         "insurance would cover work that w

In [6]:
path = SAVE_PATH_DATASET.joinpath(f'{dataset_name}_processed')
dataset.save_to_disk(path)


## II. Yelp Review (Full)

**Source:** [Hugging Face Datasets](https://huggingface.co/datasets/yelp_review_full)
**Task:** Sentiment Analysis
**Training Size:** 650,000
**Testing Size:** 2,500 (undersampled from 50,000)
**Target:** Multiclass (5 classes)

In [7]:
dataset_name = 'yelp_review_full'
dataset = load_dataset(dataset_name)
dataset


DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [8]:
dataset['train'].features

{'label': ClassLabel(names=['1 star', '2 star', '3 stars', '4 stars', '5 stars'], id=None),
 'text': Value(dtype='string', id=None)}

In [9]:
dataset = src.preparation.preprocess_dataset(
    dataset_dict=dataset,
    class_names=['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive'],
    undersample_test=True,
    nlp=nlp,
    phrases_save_path=SAVE_PATH_PHRASES.joinpath(f'phrases_{dataset_name}.pkl'),
    verbose=True
)
dataset



{'label': ClassLabel(names=['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive'], id=None),
 'text': Value(dtype='string', id=None)}


Phrase count: 1,245

Class count: 5
Label        	Train	Test	Support (Train)
Very Positive	20.00%	20.00%	130,000
Negative     	20.00%	20.00%	130,000
Positive     	20.00%	20.00%	130,000
Very Negative	20.00%	20.00%	130,000
Neutral      	20.00%	20.00%	130,000


DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'text_clean'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text', 'text_clean'],
        num_rows: 2500
    })
})

In [10]:
pprint(dataset['train'][0])

{'label': 4,
 'text': 'dr. goldberg offers everything i look for in a general '
         "practitioner.  he's nice and easy to talk to without being "
         "patronizing; he's always on time in seeing his patients; he's "
         'affiliated with a top-notch hospital (nyu) which my parents have '
         'explained to me is very important in case something happens and you '
         'need surgery; and you can get referrals to see specialists without '
         "having to see him first.  really, what more do you need?  i'm "
         'sitting here trying to think of any complaints i have about him, but '
         "i'm really drawing a blank.",
 'text_clean': 'offer look general practitioner nice and easy talk patronize '
               'time see patient affiliate top_notch hospital parent explain '
               'important case happen and need surgery and get referral see '
               'specialist have see more need sit try think complaint have but '
               'draw blank'
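
Tokens such as `top_notch` in `text_clean` suggest that frequent adjacent words are merged into phrases. The notebook does not show how `src.preparation` does this, so the block below is a deliberately simplified sketch that uses a raw co-occurrence count. Libraries such as gensim score candidate pairs with a statistical measure instead. The names `learn_phrases` and `merge_phrases`, the `min_count` threshold and the toy corpus are all assumptions.

```python
from collections import Counter


def learn_phrases(sentences, min_count=2):
    """Return the set of adjacent token pairs seen at least `min_count` times."""
    pair_counts = Counter()
    for tokens in sentences:
        pair_counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, count in pair_counts.items() if count >= min_count}


def merge_phrases(tokens, phrases, delimiter="_"):
    """Greedily join adjacent tokens that form a learned phrase."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # skip both tokens of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# hypothetical, already lemmatised toy corpus
corpus = [
    ["top", "notch", "hospital"],
    ["top", "notch", "service"],
    ["nice", "hospital"],
]
phrases = learn_phrases(corpus)
print(merge_phrases(["affiliate", "top", "notch", "hospital"], phrases))
# prints ['affiliate', 'top_notch', 'hospital']
```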

In [11]:
path = SAVE_PATH_DATASET.joinpath(f'{dataset_name}_processed')
dataset.save_to_disk(path)


## III. IMDb Reviews

**Source:** [Hugging Face Datasets](https://huggingface.co/datasets/imdb)
**Task:** Sentiment Analysis
**Training Size:** 25,000
**Testing Size:** 1,000 (undersampled from 25,000)
**Target:** Binary

In [12]:
dataset_name = 'imdb'
dataset = load_dataset(dataset_name)
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [13]:
dataset.pop('unsupervised');

In [14]:
dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

In [15]:
dataset = src.preparation.preprocess_dataset(
    dataset_dict=dataset,
    class_names=['Negative', 'Positive'],
    undersample_test=True,
    nlp=nlp,
    phrases_save_path=SAVE_PATH_PHRASES.joinpath(f'phrases_{dataset_name}.pkl'),
    verbose=True
)
dataset



{'label': ClassLabel(names=['Negative', 'Positive'], id=None),
 'text': Value(dtype='string', id=None)}


Phrase count: 1,106

Class count: 2
Label   	Train	Test	Support (Train)
Negative	50.00%	50.00%	12,500
Positive	50.00%	50.00%	12,500


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 1000
    })
})

In [16]:
pprint(dataset['train'][0])

{'label': 0,
 'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the '
         'controversy that surrounded it when it was first released in 1967. I '
         'also heard that at first it was seized by U.S. customs if it ever '
         'tried to enter this country, therefore being a fan of films '
         'considered "controversial" I really had to see this for myself.<br '
         '/><br />The plot is centered around a young Swedish drama student '
         'named Lena who wants to learn everything she can about life. In '
         'particular she wants to focus her attentions to making some sort of '
         'documentary on what the average Swede thought about certain '
         'political issues such as the Vietnam War and race issues in the '
         'United States. In between asking politicians and ordinary denizens '
         'of Stockholm about their opinions on politics, she has sex with her '
         'drama teacher, classmates, and married men.<br

In [17]:
path = SAVE_PATH_DATASET.joinpath(f'{dataset_name}_processed')
dataset.save_to_disk(path)


## IV. Rotten Tomatoes Reviews

**Source:** [Hugging Face Datasets](https://huggingface.co/datasets/rotten_tomatoes)
**Task:** Sentiment Analysis
**Training Size:** 8,530
**Testing Size:** 1,066
**Target:** Binary

In [2]:
dataset_name = 'rotten_tomatoes'
dataset = load_dataset(dataset_name)
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [3]:
dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

In [4]:
dataset = src.preparation.preprocess_dataset(
    dataset_dict=dataset,
    class_names=['Negative', 'Positive'],
    undersample_test=False,
    nlp=nlp,
    phrases_save_path=SAVE_PATH_PHRASES.joinpath(f'phrases_{dataset_name}.pkl'),
    verbose=True
)
dataset

{'label': ClassLabel(names=['Negative', 'Positive'], id=None),
 'text': Value(dtype='string', id=None)}


Phrase count: 2,260

Class count: 2
Label   	Train	Test	Support (Train)
Positive	50.00%	50.00%	4,265
Negative	50.00%	50.00%	4,265


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 1066
    })
})

In [5]:
pprint(dataset['train'][0])

{'label': 1,
 'text': 'the rock is destined to be the 21st century\'s new " conan " and '
         "that he's going to make a splash even greater than arnold "
         'schwarzenegger , jean-claud van damme or steven segal .',
 'text_clean': 'rock destine st_century new and that go make splash great '
               'schwarzenegger jean claud damme or steven segal'}


In [6]:
src.preparation.utils.describe_labels(dataset)

Class count: 2
Label   	Train	Test	Support (Train)
Positive	50.00%	50.00%	4,265
Negative	50.00%	50.00%	4,265
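
A table like the one above can be reproduced from raw label lists with `collections.Counter`. The sketch below is not the code behind `src.preparation.utils.describe_labels`, which this notebook does not show. The helper names `label_shares` and `describe` are hypothetical, and the label lists are rebuilt from the counts reported for this dataset.

```python
from collections import Counter


def label_shares(labels):
    """Map each label to (share of rows, absolute count)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count / total, count) for label, count in counts.items()}


def describe(train_labels, test_labels):
    """Print a table sorted by descending training support."""
    train, test = label_shares(train_labels), label_shares(test_labels)
    width = max(len(str(label)) for label in train)
    print(f"Class count: {len(train)}")
    print(f"{'Label':<{width}}\tTrain\tTest\tSupport (Train)")
    for label, (share, count) in sorted(train.items(), key=lambda kv: -kv[1][1]):
        # a label missing from the test split is reported as 0%
        test_share = test.get(label, (0.0, 0))[0]
        print(f"{label:<{width}}\t{share:.2%}\t{test_share:.2%}\t{count:,}")


describe(
    train_labels=["Positive"] * 4265 + ["Negative"] * 4265,
    test_labels=["Positive"] * 533 + ["Negative"] * 533,
)
```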


In [7]:
path = SAVE_PATH_DATASET.joinpath(f'{dataset_name}_processed')
dataset.save_to_disk(path)


## V. Stanford Sentiment Treebank

**Source:** [Hugging Face Datasets](https://huggingface.co/datasets/SetFit/sst5)
**Task:** Sentiment Analysis
**Training Size:** 8,544
**Testing Size:** 2,210
**Target:** Multiclass (5 classes)

In [24]:
dataset_name = 'SetFit/sst5'
dataset = load_dataset(dataset_name)
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 8544
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2210
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 1101
    })
})

In [25]:
dataset = dataset.remove_columns(['label_text'])

In [26]:
dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': Value(dtype='int64', id=None)}

In [27]:
dataset_name = dataset_name.replace('/', '_').lower()
dataset = src.preparation.preprocess_dataset(
    dataset_dict=dataset,
    class_names=['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive'],
    undersample_test=False,
    nlp=nlp,
    phrases_save_path=SAVE_PATH_PHRASES.joinpath(f'phrases_{dataset_name}.pkl'),
    verbose=True
)
dataset

{'label': ClassLabel(names=['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive'], id=None),
 'text': Value(dtype='string', id=None)}


Phrase count: 1,983

Class count: 5
Label        	Train	Test	Support (Train)
Positive     	27.18%	23.08%	2,322
Negative     	25.96%	28.64%	2,218
Neutral      	19.01%	17.60%	1,624
Very Positive	15.07%	18.05%	1,288
Very Negative	12.78%	12.62%	1,092


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 8544
    })
    test: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 2210
    })
    validation: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 1101
    })
})

In [28]:
pprint(dataset['train'][0])

{'label': 4,
 'text': 'a stirring , funny and finally transporting re-imagining of beauty '
         'and the beast and 1930s horror films',
 'text_clean': 'stirring funny and transport re imagining beauty and beast and '
               'horror film'}


In [29]:
src.preparation.utils.describe_labels(dataset)

Class count: 5
Label        	Train	Test	Support (Train)
Positive     	27.18%	23.08%	2,322
Negative     	25.96%	28.64%	2,218
Neutral      	19.01%	17.60%	1,624
Very Positive	15.07%	18.05%	1,288
Very Negative	12.78%	12.62%	1,092


In [30]:
path = SAVE_PATH_DATASET.joinpath(f'{dataset_name}_processed')
dataset.save_to_disk(path)


## VI. Dynabench DynaSent (Round 2)

**Source:** [Hugging Face Datasets](https://huggingface.co/datasets/dynabench/dynasent)
**Task:** Sentiment Analysis
**Training Size:** 13,065
**Testing Size:** 720
**Target:** Multiclass (3 classes)

In [31]:
dataset_name = 'dynabench/dynasent'
dataset = load_dataset(dataset_name, 'dynabench.dynasent.r2.all')
dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'hit_ids', 'sentence', 'sentence_author', 'has_prompt', 'prompt_data', 'model_1_label', 'model_1_probs', 'text_id', 'label_distribution', 'gold_label', 'metadata'],
        num_rows: 13065
    })
    validation: Dataset({
        features: ['id', 'hit_ids', 'sentence', 'sentence_author', 'has_prompt', 'prompt_data', 'model_1_label', 'model_1_probs', 'text_id', 'label_distribution', 'gold_label', 'metadata'],
        num_rows: 720
    })
    test: Dataset({
        features: ['id', 'hit_ids', 'sentence', 'sentence_author', 'has_prompt', 'prompt_data', 'model_1_label', 'model_1_probs', 'text_id', 'label_distribution', 'gold_label', 'metadata'],
        num_rows: 720
    })
})

In [32]:
dataset['train'].features

{'id': Value(dtype='string', id=None),
 'hit_ids': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'sentence': Value(dtype='string', id=None),
 'sentence_author': Value(dtype='string', id=None),
 'has_prompt': Value(dtype='bool', id=None),
 'prompt_data': {'indices_into_review_text': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
  'review_rating': Value(dtype='int32', id=None),
  'prompt_sentence': Value(dtype='string', id=None),
  'review_id': Value(dtype='string', id=None)},
 'model_1_label': Value(dtype='string', id=None),
 'model_1_probs': {'negative': Value(dtype='float32', id=None),
  'positive': Value(dtype='float32', id=None),
  'neutral': Value(dtype='float32', id=None)},
 'text_id': Value(dtype='string', id=None),
 'label_distribution': {'positive': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
  'negative': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
  'neutral': Sequence(feature=

In [33]:
def recode_label(example: dict) -> dict:
    mapping = {
        'negative': 0,
        'neutral': 1,
        'positive': 2,
    }
    example['gold_label'] = mapping[example['gold_label']]
    return example

to_rename = {
    'sentence': 'text',
    'gold_label': 'label',
}

dataset = dataset.map(recode_label)
for split_name in dataset:
    dataset[split_name] = dataset[split_name].select_columns(list(to_rename)).rename_columns(to_rename)
dataset



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 13065
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 720
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 720
    })
})
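
To make the effect of the recoding and renaming step concrete, the same logic can be applied to a single plain dictionary. This is a sketch only: the record is made up, the helper name `recode_and_rename` is hypothetical, and the explicit check for unknown labels is an addition not present in the cell above.

```python
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}
TO_RENAME = {"sentence": "text", "gold_label": "label"}


def recode_and_rename(example):
    """Keep only the renamed columns and replace the string label with its id."""
    label = example["gold_label"]
    if label not in LABEL_TO_ID:
        # fail loudly instead of raising a bare KeyError
        raise ValueError(f"unexpected gold_label: {label!r}")
    recoded = {**example, "gold_label": LABEL_TO_ID[label]}
    return {new: recoded[old] for old, new in TO_RENAME.items()}


# made-up record with an extra column that gets dropped
record = {"id": "example-1", "sentence": "The food was fine.", "gold_label": "neutral"}
print(recode_and_rename(record))
# prints {'text': 'The food was fine.', 'label': 1}
```

The integer ids follow the order of `class_names` passed to `preprocess_dataset` in the next cell, so `0`, `1` and `2` line up with Negative, Neutral and Positive.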

In [34]:
dataset_name = dataset_name.replace('/', '_').lower()
dataset = src.preparation.preprocess_dataset(
    dataset_dict=dataset,
    class_names=['Negative', 'Neutral', 'Positive'],
    undersample_test=False,
    nlp=nlp,
    phrases_save_path=SAVE_PATH_PHRASES.joinpath(f'phrases_{dataset_name}.pkl'),
    verbose=True
)
dataset

{'label': ClassLabel(names=['Negative', 'Neutral', 'Positive'], id=None),
 'text': Value(dtype='string', id=None)}


Phrase count: 1,749

Class count: 3
Label   	Train	Test	Support (Train)
Positive	46.22%	33.33%	6,038
Negative	35.05%	33.33%	4,579
Neutral 	18.74%	33.33%	2,448


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 13065
    })
    validation: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 720
    })
    test: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 720
    })
})

In [35]:
pprint(dataset['train'][0])

{'label': 2,
 'text': 'We enjoyed our first and last meal in Toronto at Bombay Palace, and '
         "I can't think of a better way to book our journey.",
 'text_clean': 'enjoy first and last meal and think well way book journey'}


In [36]:
path = SAVE_PATH_DATASET.joinpath(f'{dataset_name}_processed')
dataset.save_to_disk(path)


## VII. AG News

**Source:** [Hugging Face Datasets](https://huggingface.co/datasets/ag_news)
**Task:** News Categorisation
**Training Size:** 120,000
**Testing Size:** 7,600
**Target:** Multiclass (4 classes)

In [8]:
dataset_name = 'ag_news'
dataset = load_dataset(dataset_name)
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [9]:
dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None)}

In [10]:
dataset_name = dataset_name.replace('/', '_').lower()
dataset = src.preparation.preprocess_dataset(
    dataset_dict=dataset,
    class_names=None,  # do not apply any transformations to existing class names
    undersample_test=False,
    nlp=nlp,
    phrases_save_path=SAVE_PATH_PHRASES.joinpath(f'phrases_{dataset_name}.pkl'),
    verbose=True
)
dataset

{'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None),
 'text': Value(dtype='string', id=None)}


Phrase count: 3,486

Class count: 4
Label   	Train	Test	Support (Train)
Business	25.00%	25.00%	30,000
Sci/Tech	25.00%	25.00%	30,000
Sports  	25.00%	25.00%	30,000
World   	25.00%	25.00%	30,000


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 7600
    })
})

In [11]:
pprint(dataset['train'][0])

{'label': 2,
 'text': 'Wall St. Bears Claw Back Into the Black (Reuters) Reuters - '
         "Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are "
         'seeing green again.',
 'text_clean': 'seller ultra cynic see green'}


In [12]:
src.preparation.utils.describe_labels(dataset)

Class count: 4
Label   	Train	Test	Support (Train)
Business	25.00%	25.00%	30,000
Sci/Tech	25.00%	25.00%	30,000
Sports  	25.00%	25.00%	30,000
World   	25.00%	25.00%	30,000


In [13]:
path = SAVE_PATH_DATASET.joinpath(f'{dataset_name}_processed')
dataset.save_to_disk(path)


## VIII. 20 Newsgroups

**Source:** [scikit-learn datasets](https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset)
**Task:** News Categorisation
**Training Size:** 11,314
**Testing Size:** 7,532
**Target:** Multiclass (20 classes)

In [2]:
dataset_name = '20_newsgroups'

In [3]:
mapping = {
    'comp.graphics': 'Computer Graphics',
    'comp.os.ms-windows.misc': 'Computer Operating System MS-Windows',
    'comp.sys.ibm.pc.hardware': 'Computer Systems IBM PC Hardware',
    'comp.sys.mac.hardware': 'Computer Systems Mac Hardware',
    'comp.windows.x': 'Computer Windows X',
    'rec.autos': 'Automobiles',
    'rec.motorcycles': 'Motorcycles',
    'rec.sport.baseball': 'Sports Baseball',
    'rec.sport.hockey': 'Sports Hockey',
    'sci.crypt': 'Cryptography',
    'sci.electronics': 'Electronics',
    'sci.med': 'Science Medicine',
    'sci.space': 'Science Space',
    'misc.forsale': 'Miscellaneous For Sale',
    'talk.politics.misc': 'Talk Politics Miscellaneous',
    'talk.politics.guns': 'Talk Politics Guns',
    'talk.politics.mideast': 'Talk Politics Middle East',
    'talk.religion.misc': 'Talk Religion Miscellaneous',
    'alt.atheism': 'Atheism',
    'soc.religion.christian': 'Religion Christian',
}

In [4]:
bunch = fetch_20newsgroups(subset='train')
examples_train = src.preparation.newsgroups.consturct_dataset(bunch, mapping)
pprint(examples_train[0])

{'label': 1,
 'text': "From: lerxst@wam.umd.edu (where's my thing)\n"
         'Subject: WHAT car is this!?\n'
         'Nntp-Posting-Host: rac3.wam.umd.edu\n'
         'Organization: University of Maryland, College Park\n'
         'Lines: 15\n'
         '\n'
         ' I was wondering if anyone out there could enlighten me on this car '
         'I saw\n'
         'the other day. It was a 2-door sports car, looked to be from the '
         'late 60s/\n'
         'early 70s. It was called a Bricklin. The doors were really small. In '
         'addition,\n'
         'the front bumper was separate from the rest of the body. This is \n'
         'all I know. If anyone can tellme a model name, engine specs, years\n'
         'of production, where this car is made, history, or whatever info '
         'you\n'
         'have on this funky looking car, please e-mail.\n'
         '\n'
         'Thanks,\n'
         '- IL\n'
         '   ---- brought to you by your neighborhood Lerxst ----\n'
 

In [5]:
bunch = fetch_20newsgroups(subset='test')
examples_test = src.preparation.newsgroups.consturct_dataset(bunch, mapping)
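
The helper in `src.preparation.newsgroups` is not shown here, but the first training example (a `rec.autos` post with label `1`) is consistent with labels being indexed by the alphabetically sorted readable names, which is also the order later passed as `class_names`. The sketch below illustrates that assumption on a three-group excerpt. The function names are hypothetical, and the inputs mimic the `data`, `target` and `target_names` attributes of the scikit-learn `Bunch`.

```python
def build_label_index(mapping):
    """Assign each readable class name its position in the sorted name list."""
    class_names = sorted(set(mapping.values()))
    return {name: index for index, name in enumerate(class_names)}


def to_examples(texts, targets, target_names, mapping):
    """Turn parallel text/target sequences into {'text', 'label'} records."""
    label_index = build_label_index(mapping)
    return [
        {"text": text, "label": label_index[mapping[target_names[target]]]}
        for text, target in zip(texts, targets)
    ]


# three-group excerpt of the notebook's mapping, for illustration only
excerpt = {
    "rec.autos": "Automobiles",
    "alt.atheism": "Atheism",
    "sci.crypt": "Cryptography",
}
names = ["alt.atheism", "rec.autos", "sci.crypt"]
examples = to_examples(["WHAT car is this!?"], [1], names, excerpt)
print(examples)
# prints [{'text': 'WHAT car is this!?', 'label': 1}]
```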

In [6]:
dataset = DatasetDict(
    train=Dataset.from_list(examples_train),
    test=Dataset.from_list(examples_test)
)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 11314
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7532
    })
})

In [7]:
dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': Value(dtype='int64', id=None)}

In [8]:
dataset = src.preparation.preprocess_dataset(
    dataset_dict=dataset,
    class_names=sorted(set(mapping.values())),
    undersample_test=False,
    nlp=nlp,
    phrases_save_path=SAVE_PATH_PHRASES.joinpath(f'phrases_{dataset_name}.pkl'),
    verbose=True
)
dataset

{'label': ClassLabel(names=['Atheism', 'Automobiles', 'Computer Graphics', 'Computer Operating System MS-Windows', 'Computer Systems IBM PC Hardware', 'Computer Systems Mac Hardware', 'Computer Windows X', 'Cryptography', 'Electronics', 'Miscellaneous For Sale', 'Motorcycles', 'Religion Christian', 'Science Medicine', 'Science Space', 'Sports Baseball', 'Sports Hockey', 'Talk Politics Guns', 'Talk Politics Middle East', 'Talk Politics Miscellaneous', 'Talk Religion Miscellaneous'], id=None),
 'text': Value(dtype='string', id=None)}


Phrase count: 1,487

Class count: 20
Label                               	Train	Test	Support (Train)
Sports Hockey                       	5.30%	5.30%	600
Religion Christian                  	5.29%	5.28%	599
Motorcycles                         	5.29%	5.28%	598
Sports Baseball                     	5.28%	5.27%	597
Cryptography                        	5.26%	5.26%	595
Automobiles                         	5.25%	5.26%	594
Science Medicine                    	5.25%	5.26%	594
Science Space                       	5.24%	5.23%	593
Computer Windows X                  	5.24%	5.24%	593
Computer Operating System MS-Windows	5.22%	5.23%	591
Electronics                         	5.22%	5.22%	591
Computer Systems IBM PC Hardware    	5.21%	5.20%	590
Miscellaneous For Sale              	5.17%	5.18%	585
Computer Graphics                   	5.16%	5.16%	584
Computer Systems Mac Hardware       	5.11%	5.11%	578
Talk Politics Middle East           	4.98%	4.99%	564
Talk Politics Guns                  	4.83%	4.83%	546
Atheism            

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 11314
    })
    test: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 7532
    })
})

In [9]:
path = SAVE_PATH_DATASET.joinpath(f'{dataset_name}_processed')
dataset.save_to_disk(path)

Saving the dataset (0/1 shards):   0%|          | 0/11314 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/7532 [00:00<?, ? examples/s]

## IX. Web of Science

**Source:** [Mendeley Data](https://data.mendeley.com/datasets/9rw3vkcfy4/6)
**Task:** Topic Classification
**Training Size:** 37,589
**Testing Size:** 9,396
**Target:** Multiclass (134 classes)

In [22]:
dataset_name = 'web_of_science'

In [23]:
%%bash

export filename="9rw3vkcfy4-6.zip"
curl -O https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/$filename
unzip -o $filename "WebOfScience.zip"
rm $filename

export filename="WebOfScience.zip"
unzip -o $filename -d ../data/raw/web_of_science
rm $filename

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 57.4M  100 57.4M    0     0   961k      0  0:01:01  0:01:01 --:--:--  733k


Archive:  9rw3vkcfy4-6.zip
  inflating: WebOfScience.zip        
Archive:  WebOfScience.zip
  inflating: ../data/raw/web_of_science/Dictionary.txt  
  inflating: ../data/raw/web_of_science/Meta-data/Data.xlsx  
  inflating: ../data/raw/web_of_science/WOS11967/X.txt  
  inflating: ../data/raw/web_of_science/WOS11967/Y.txt  
  inflating: ../data/raw/web_of_science/WOS11967/YL1.txt  
  inflating: ../data/raw/web_of_science/WOS11967/YL2.txt  
  inflating: ../data/raw/web_of_science/WOS46985/X.txt  
  inflating: ../data/raw/web_of_science/WOS46985/Y.txt  
  inflating: ../data/raw/web_of_science/WOS46985/YL1.txt  
  inflating: ../data/raw/web_of_science/WOS46985/YL2.txt  
  inflating: ../data/raw/web_of_science/WOS5736/X.txt  
  inflating: ../data/raw/web_of_science/WOS5736/Y.txt  
  inflating: ../data/raw/web_of_science/WOS5736/YL1.txt  
  inflating: ../data/raw/web_of_science/WOS5736/YL2.txt  
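
The shell cell above unpacks an archive that is itself stored inside the downloaded archive. Where `unzip` is not available, the same two-step extraction can be sketched with the standard library alone. The function name `extract_nested_zip` is hypothetical and not part of `src`; the download step is deliberately left out.

```python
import io
import zipfile
from pathlib import Path


def extract_nested_zip(outer_path, inner_name, target_dir):
    """Extract the archive `inner_name` stored inside `outer_path` into `target_dir`.

    Hypothetical helper, shown only as an alternative to the shell commands.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(outer_path) as outer:
        # read the inner archive into memory instead of writing it to disk first
        inner_bytes = outer.read(inner_name)
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as inner:
        inner.extractall(target_dir)
        # return the extracted member names so the caller can inspect them
        return inner.namelist()


# usage, assuming the outer archive has already been downloaded:
# extract_nested_zip('9rw3vkcfy4-6.zip', 'WebOfScience.zip', '../data/raw/web_of_science')
```

Reading the inner archive into memory avoids the intermediate `WebOfScience.zip` file, at the cost of holding it in RAM while it is unpacked.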


In [24]:
path = os.path.join(os.pardir, 'data', 'raw', 'web_of_science', 'Meta-data', 'Data.xlsx')
df_data = pd.read_excel(path)
print('Shape:', df_data.shape)
display(df_data.head())

Shape: (46985, 7)


| | Y1 | Y2 | Y | Domain | area | keywords | Abstract |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 12 | 12 | CS | Symbolic computation | (2+1)-dimensional non-linear optical waves; e... | (2 + 1)-dimensional non-linear optical waves t... |
| 1 | 5 | 2 | 74 | Medical | Alzheimer's Disease | Aging; Tau; Amyloid; PET; Alzheimer's disease... | (beta-amyloid (A beta) and tau pathology becom... |
| 2 | 4 | 7 | 68 | Civil | Green Building | LED lighting system; PV system; Distributed l... | (D)ecreasing of energy consumption and environ... |
| 3 | 1 | 10 | 26 | ECE | Electric motor | NdFeB magnets; Electric motor; Electric vehic... | (Hybrid) electric vehicles are assumed to play... |
| 4 | 5 | 43 | 115 | Medical | Parkinson's Disease | Parkinson's disease; dyskinesia; adenosine A(... | (L)-3,4-Dihydroxyphenylalanine ((L)-DOPA) rema... |


In [25]:
# `Y` has fewer unique ids than `area` has unique labels, so some ids are mapped to several area labels
df_data.nunique()

Y1              7
Y2             53
Y             134
Domain          7
area          143
keywords    46835
Abstract    46985
dtype: int64

In [26]:
df_labels = df_data.groupby('Y')\
    .agg({'area': [Counter, 'nunique']})\
    .sort_values(('area', 'nunique'), ascending=False)
print('Shape:', df_labels.shape)
display(df_labels.head(10))

Shape: (134, 2)


| Y | area (Counter) | area (nunique) |
|---|---|---|
| 71 | {' Smart Material ': 363, ' Transparent Concr... | 3 |
| 76 | {' Anxiety ': 262, ' Bamboo as a Building Mat... | 2 |
| 89 | {' Digestive Health ': 95, ' Outdoor Health ... | 2 |
| 65 | {' Water Pollution ': 446, ' Underwater Windm... | 2 |
| 25 | {' Electrical generator ': 240, ' Analog sign... | 2 |
| 62 | {' Geotextile ': 419, ' Highway Network Syste... | 2 |
| 126 | {' Cell biology ': 552, ' DNA/RNA sequencing ... | 2 |
| 26 | {' Electric motor ': 372, ' Single-phase elec... | 2 |
| 93 | {' Headache ': 341} | 1 |
| 90 | {' Emergency Contraception ': 291} | 1 |


In [27]:
# where several area labels share a single area id, keep the most common label (stripped of padding spaces)
mapping = df_labels[('area', 'Counter')].apply(lambda counter: counter.most_common(1)[0][0].strip()).to_dict()
mapping[71]

'Smart Material'
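
The majority-label rule used above can be illustrated without pandas. The following is a minimal sketch on made-up `(id, label)` pairs; the ids and label strings are placeholders rather than values from the dataset.

```python
from collections import Counter, defaultdict

# toy (area id, area label) pairs; all values are invented for illustration
rows = [
    (1, ' Alpha '),
    (1, ' Alpha '),
    (1, ' Gamma '),
    (2, ' Beta '),
]

# count how often each label occurs for each id
counts = defaultdict(Counter)
for area_id, label in rows:
    counts[area_id][label] += 1

# keep the most frequent label per id and strip the padding spaces
majority = {
    area_id: counter.most_common(1)[0][0].strip()
    for area_id, counter in counts.items()
}
print(majority)  # {1: 'Alpha', 2: 'Beta'}
```

When two labels are tied, `Counter.most_common` returns them in the order they were first encountered, so the outcome depends on row order in that case.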

In [28]:
# every Y id is a key of `mapping`, so a plain lookup overwrites the noisy labels
df_data['area'] = df_data['Y'].map(mapping)

In [29]:
df_train = df_data.groupby('Domain').sample(frac=.8, random_state=42)
df_test = df_data.drop(df_train.index, axis=0)
print('Shape train:', df_train.shape)
print('Shape test:', df_test.shape)
display(df_train.head())

Shape train: (37589, 7)
Shape test: (9396, 7)


| | Y1 | Y2 | Y | Domain | area | keywords | Abstract |
|---|---|---|---|---|---|---|---|
| 44776 | 0 | 5 | 5 | CS | Computer graphics | Glyph; Video visualization; Traffic surveilla... | Video visualization (VV) is considered to be a... |
| 16132 | 0 | 6 | 6 | CS | Image processing | Human target; Micro-Doppler; Time-frequency a... | Human target characteristic parameter extracti... |
| 11087 | 0 | 14 | 14 | CS | Computer programming | Dyslexia; Computer programming; Inclusion ... | Computer programmers with dyslexia can be foun... |
| 45262 | 0 | 4 | 4 | CS | Operating systems | Software - Software systems; Software - Syste... | We describe and evaluate a software-only imple... |
| 41099 | 0 | 2 | 2 | CS | network security | Probing; Malware; Darknet preprocessing; Big ... | This paper presents a new approach to infer wo... |


In [30]:
pd.concat(
    objs=[
        df_train['area'].value_counts(normalize=True).rename('train'),
        df_test['area'].value_counts(normalize=True).rename('test'),
    ],
    axis=1
).multiply(100).round(1).astype(str).add('%')

| | train | test |
|---|---|---|
| Polymerase chain reaction | 1.6% | 1.6% |
| Molecular biology | 1.6% | 1.7% |
| Northern blotting | 1.5% | 1.5% |
| Immunology | 1.4% | 1.2% |
| Analog signal processing | 1.4% | 1.4% |
| ... | ... | ... |
| Digestive Health | 0.2% | 0.2% |
| Kidney Health | 0.2% | 0.2% |
| Voltage law | 0.1% | 0.1% |
| Structured Storage | 0.1% | 0.1% |


In [31]:
to_rename = {
    'Abstract': 'text',
    'area': 'label'
}

examples_train = df_train.reindex(to_rename, axis=1).rename(to_rename, axis=1).to_dict(orient='records')
examples_test = df_test.reindex(to_rename, axis=1).rename(to_rename, axis=1).to_dict(orient='records')
pprint(examples_train[0])

{'label': 'Computer graphics',
 'text': 'Video visualization (VV) is considered to be an essential part of '
         'multimedia visual analytics. Many challenges have arisen from the '
         'enormous video content of cameras which can be solved with the help '
         'of data analytics and hence gaining importance. However, the rapid '
         'advancement of digital technologies has resulted in an explosion of '
         'video data, which stimulates the needs for creating computer '
         'graphics and visualization from videos. Particularly, in the '
         'paradigm of smart cities, video surveillance as a widely applied '
         'technology can generate huge amount of videos from 24/7 '
         'surveillance. In this paper, a state of the art algorithm has been '
         'proposed for 3D conversion from traffic video content to Google Map. '
         'Time-stamped glyph-based visualization is used effectively in '
         'outdoor surveillance videos and can be 

In [32]:
dataset = DatasetDict(
    train=Dataset.from_list(examples_train),
    test=Dataset.from_list(examples_test),
)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 37589
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 9396
    })
})

In [33]:
dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None)}

In [34]:
def recode_labels(example: dict, class_names: list[str]) -> dict:
    # replace the string label with its position in the sorted list of class names
    example['label'] = class_names.index(example['label'])
    return example


class_names = sorted(df_train['area'].unique())
dataset = dataset.map(function=recode_labels, fn_kwargs={'class_names': class_names})

Map:   0%|          | 0/37589 [00:00<?, ? examples/s]

Map:   0%|          | 0/9396 [00:00<?, ? examples/s]
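
`list.index` scans the list on every call, which is fine for 134 classes but grows with the number of labels. As a minimal sketch of an alternative, the lookup can be precomputed as a dictionary. The names `build_label_lookup` and `recode_label` are hypothetical and the class names below are a small illustrative subset.

```python
def build_label_lookup(class_names):
    """Map each class name to its position in the list."""
    return {name: idx for idx, name in enumerate(class_names)}


def recode_label(example, lookup):
    # return a new dict rather than mutating the input example
    return {**example, 'label': lookup[example['label']]}


class_names = sorted({'Computer graphics', 'Addiction', 'Image processing'})
lookup = build_label_lookup(class_names)
print(recode_label({'text': '...', 'label': 'Computer graphics'}, lookup))
# {'text': '...', 'label': 1}
```

In both variants the class names come from the training split only, so a label that appears only in the test split would raise an error (`ValueError` with `list.index`, `KeyError` with the dictionary) instead of being silently recoded.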

In [35]:
dataset = src.preparation.preprocess_dataset(
    dataset_dict=dataset,
    class_names=class_names,
    undersample_test=False,
    nlp=nlp,
    phrases_save_path=SAVE_PATH_PHRASES.joinpath(f'phrases_{dataset_name}.pkl'),
    verbose=True
)
dataset

{'label': ClassLabel(names=['Addiction', 'Algorithm design', 'Allergies', "Alzheimer's Disease", 'Ambient Intelligence', 'Analog signal processing', 'Ankylosing Spondylitis', 'Antisocial personality disorder', 'Anxiety', 'Asthma', 'Atopic Dermatitis', 'Atrial Fibrillation', 'Attention', 'Autism', 'Bioinformatics', 'Bipolar Disorder', 'Birth Control', 'Borderline personality disorder', 'Cancer', 'Cell biology', 'Child abuse', "Children's Health", 'Computer graphics', 'Computer programming', 'Computer vision', 'Construction Management', 'Control engineering', "Crohn's Disease", 'Cryptography', 'Data structures', 'Dementia', 'Depression', 'Diabetes', 'Digestive Health', 'Digital control', 'Distributed computing', 'Eating disorders', 'Electric motor', 'Electrical circuits', 'Electrical network', 'Electricity', 'Emergency Contraception', 'Enzymology', 'False memories', 'Fluid mechanics', 'Fungal Infection', 'Gender roles', 'Genetics', 'Geotextile', 'Green Building', 'HIV/AIDS', 'Headache', 

Map:   0%|          | 0/37589 [00:00<?, ? examples/s]

Map:   0%|          | 0/9396 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/37589 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/9396 [00:00<?, ? examples/s]

1175it [00:04, 235.34it/s]


Phrase count: 3,321


Map:   0%|          | 0/37589 [00:00<?, ? examples/s]

Map:   0%|          | 0/9396 [00:00<?, ? examples/s]

Class count: 134
Label                          	Train	Test	Support (Train)
Polymerase chain reaction      	1.60%	1.59%	601
Molecular biology              	1.56%	1.71%	585
Northern blotting              	1.48%	1.52%	556
Immunology                     	1.44%	1.17%	542
Analog signal processing       	1.37%	1.39%	516
Human Metabolism               	1.31%	1.39%	491
Enzymology                     	1.22%	1.27%	457
Cell biology                   	1.21%	1.17%	456
Genetics                       	1.18%	1.29%	445
Southern blotting              	1.11%	0.99%	417
Depression                     	1.05%	1.25%	393
Water Pollution                	0.99%	0.81%	371
Parallel computing             	0.97%	0.82%	366
Electricity                    	0.96%	0.93%	360
Operational amplifier          	0.95%	0.66%	357
Rainwater Harvesting           	0.94%	0.95%	352
Digital control                	0.94%	0.79%	352
Social cognition               	0.93%	0.52%	348
Ambient Intelligence           	0.92%	0.68%	346
Computer pro

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 37589
    })
    test: Dataset({
        features: ['text', 'label', 'text_clean'],
        num_rows: 9396
    })
})

In [36]:
pprint(dataset['train'][0])

{'label': 22,
 'text': 'Video visualization (VV) is considered to be an essential part of '
         'multimedia visual analytics. Many challenges have arisen from the '
         'enormous video content of cameras which can be solved with the help '
         'of data analytics and hence gaining importance. However, the rapid '
         'advancement of digital technologies has resulted in an explosion of '
         'video data, which stimulates the needs for creating computer '
         'graphics and visualization from videos. Particularly, in the '
         'paradigm of smart cities, video surveillance as a widely applied '
         'technology can generate huge amount of videos from 24/7 '
         'surveillance. In this paper, a state of the art algorithm has been '
         'proposed for 3D conversion from traffic video content to Google Map. '
         'Time-stamped glyph-based visualization is used effectively in '
         'outdoor surveillance videos and can be used for event-aw

In [37]:
path = SAVE_PATH_DATASET.joinpath(f'{dataset_name}_processed')
dataset.save_to_disk(path)

Saving the dataset (0/1 shards):   0%|          | 0/37589 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/9396 [00:00<?, ? examples/s]

## X. DBPedia 14

**Source:** [Hugging Face Datasets](https://huggingface.co/datasets/dbpedia_14)
**Task:** Topic Classification
**Training Size:** 560,000
**Testing Size:** 7,000
**Target:** Multiclass (14 classes)

In [65]:
dataset_name = 'dbpedia_14'
dataset = load_dataset(dataset_name)
dataset

Found cached dataset dbpedia_14 (/Users/mykolaskrynnyk/.cache/huggingface/datasets/dbpedia_14/dbpedia_14/2.0.0/01dab9e10d969eadcdbc918be5a09c9190a24caeae33b10eee8f367a1e3f1f0c)


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 560000
    })
    test: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 70000
    })
})

In [66]:
dataset = dataset.remove_columns(['title']).rename_columns({'content': 'text'})

In [67]:
dataset['train'].features

{'label': ClassLabel(names=['Company', 'EducationalInstitution', 'Artist', 'Athlete', 'OfficeHolder', 'MeanOfTransportation', 'Building', 'NaturalPlace', 'Village', 'Animal', 'Plant', 'Album', 'Film', 'WrittenWork'], id=None),
 'text': Value(dtype='string', id=None)}

In [68]:
dataset = src.preparation.preprocess_dataset(
    dataset_dict=dataset,
    class_names=[src.preparation.utils.split_camel_case(label) for label in dataset['train'].features['label'].names],
    undersample_test=True,
    nlp=nlp,
    phrases_save_path=SAVE_PATH_PHRASES.joinpath(f'phrases_{dataset_name}.pkl'),
    verbose=True
)
dataset

Loading cached split indices for dataset at /Users/mykolaskrynnyk/.cache/huggingface/datasets/dbpedia_14/dbpedia_14/2.0.0/01dab9e10d969eadcdbc918be5a09c9190a24caeae33b10eee8f367a1e3f1f0c/cache-0ca66c025b3b261f.arrow and /Users/mykolaskrynnyk/.cache/huggingface/datasets/dbpedia_14/dbpedia_14/2.0.0/01dab9e10d969eadcdbc918be5a09c9190a24caeae33b10eee8f367a1e3f1f0c/cache-bd49dd9ad0fc080b.arrow


{'label': ClassLabel(names=['Company', 'Educational Institution', 'Artist', 'Athlete', 'Office Holder', 'Mean Of Transportation', 'Building', 'Natural Place', 'Village', 'Animal', 'Plant', 'Album', 'Film', 'Written Work'], id=None),
 'text': Value(dtype='string', id=None)}
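
The implementation of `src.preparation.utils.split_camel_case` is not shown in this notebook. A hypothetical re-implementation that is consistent with the class names printed above (for example `EducationalInstitution` becoming `Educational Institution`) could look like the sketch below; the real helper may handle more cases.

```python
import re


def split_camel_case_sketch(label: str) -> str:
    """Hypothetical stand-in for `split_camel_case`, not the project's own code."""
    # insert a space at every position preceded by a lowercase letter or digit
    # and followed by an uppercase letter
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', label)


for name in ['Company', 'EducationalInstitution', 'MeanOfTransportation']:
    print(split_camel_case_sketch(name))
```

This pattern leaves runs of capitals such as acronyms untouched, which is one of the points where the real helper might behave differently.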


Map:   0%|          | 0/560000 [00:00<?, ? examples/s]

Map:   0%|          | 0/7000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/560000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/7000 [00:00<?, ? examples/s]

17500it [00:11, 1542.37it/s]


Phrase count: 4,522


Map:   0%|          | 0/560000 [00:00<?, ? examples/s]

Map:   0%|          | 0/7000 [00:00<?, ? examples/s]

Class count: 14
Label                  	Train	Test	Support (Train)
Company                	7.14%	7.14%	40,000
Educational Institution	7.14%	7.14%	40,000
Artist                 	7.14%	7.14%	40,000
Athlete                	7.14%	7.14%	40,000
Office Holder          	7.14%	7.14%	40,000
Mean Of Transportation 	7.14%	7.14%	40,000
Building               	7.14%	7.14%	40,000
Natural Place          	7.14%	7.14%	40,000
Village                	7.14%	7.14%	40,000
Animal                 	7.14%	7.14%	40,000
Plant                  	7.14%	7.14%	40,000
Album                  	7.14%	7.14%	40,000
Film                   	7.14%	7.14%	40,000
Written Work           	7.14%	7.14%	40,000


DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'text_clean'],
        num_rows: 560000
    })
    test: Dataset({
        features: ['label', 'text', 'text_clean'],
        num_rows: 7000
    })
})
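
With `undersample_test=True` the test split shrinks from 70,000 to 7,000 rows while every class keeps a 7.14% share. How `preprocess_dataset` achieves this is not shown here. One way to obtain a class-balanced subsample, given purely as an assumption-laden sketch with a hypothetical function name, is to draw the same number of indices per label:

```python
import random
from collections import defaultdict


def undersample_balanced(labels, n_per_class, seed=42):
    """Return sorted row indices with at most `n_per_class` rows per label.

    Hypothetical helper; the project's own undersampling may work differently.
    """
    # group row indices by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    keep = []
    for indices in by_label.values():
        # sample without replacement, capped at the size of the class
        keep.extend(rng.sample(indices, min(n_per_class, len(indices))))
    return sorted(keep)


labels = ['a'] * 10 + ['b'] * 3
print(undersample_balanced(labels, n_per_class=2))
```

With a Hugging Face `Dataset`, the returned indices could then be passed to `Dataset.select` to materialise the smaller split.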

In [69]:
pprint(dataset['train'][0])

{'label': 0,
 'text': ' Abbott of Farnham E D Abbott Limited was a British coachbuilding '
         'business based in Farnham Surrey trading under that name from 1929. '
         'A major part of their output was under sub-contract to motor vehicle '
         'manufacturers. Their business closed in 1972.',
 'text_clean': 'british coachbuilde business base trading name major part '
               'output sub contract motor_vehicle manufacturer business close'}


In [70]:
path = SAVE_PATH_DATASET.joinpath(f'{dataset_name}_processed')
dataset.save_to_disk(path)

Saving the dataset (0/1 shards):   0%|          | 0/560000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/7000 [00:00<?, ? examples/s]