Write a Python function that takes a dataset name as an argument, loads the dataset, and calculates and prints out the following information about the dataset:

    The description of the dataset (use load_dataset_builder)
    Relative sizes of the subsets of the dataset (e.g. 'train', 'validation', and 'test') in terms of examples (rows). For example: "train: 50%, validation: 25%, test: 25%"
    Distribution of labels in the 'train' subset of the dataset, using the names of the labels. For example: "positive: 53%, negative: 47%"

(You can assume that the function will only be called with the names of datasets representing text classification corpora.)

Apply this function to the following datasets: 'emotion', 'rotten_tomatoes', 'snli', 'sst2', 'emo'.


In [27]:
!pip install datasets



In [28]:
# Imports
import datasets
from datasets import load_dataset_builder
from datasets import load_dataset
datasets.disable_progress_bar()


# Actual func
def extractor(x):

  # The description of the dataset (use load_dataset_builder)
  ds_builder = load_dataset_builder(x, trust_remote_code=True)
  description = ds_builder.info.description


  # Relative sizes of the subsets of the dataset
  dataset = load_dataset(x, trust_remote_code=True)
  subsets = dict(dataset.num_rows)
  allrows = sum(subsets.values())
  st = ""
  for k,v in subsets.items():
    st += f'{k}: {round((v * 100) / allrows)}% \t'


  # Distribution of labels in the 'train' subset of the dataset, using the names of the labels
  only_train = dataset['train']  # reuse the dataset loaded above instead of loading it a second time
  label_list = list(only_train.features['label'].names)
  label_counts = [0]*len(label_list)

  for label_id in only_train['label']:
    # label_id is the integer index of the label name in label_list.
    # Skip negative ids: some datasets (e.g. snli) mark examples without a label as -1,
    # and index -1 would otherwise be counted toward the last label.
    if label_id >= 0:
      label_counts[label_id] += 1

  label_allcounts = sum(label_counts)
  LABELS = dict(zip(label_list, label_counts))  # label name -> number of 'train' examples

  sol = ""
  for k,v in LABELS.items():
    sol += f'{k}: {round((v * 100) / label_allcounts)}% \t'


  # PRINTS
  print("*"*20)
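  if not description:  # covers both an empty and a missing (None) description
    print(f'No description found for dataset {x}')
  else:
    print(description) # A: description of the dataset

  print(st) # B: relative sizes of the subsets
  print(sol) # C: distribution of labels in 'train'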
  print("*"*20)

# Function calls
DATASETS = ['emotion', 'rotten_tomatoes', 'snli', 'sst2', 'emo']

for x in DATASETS:
  extractor(x)



********************
Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.

train: 80% 	validation: 10% 	test: 10% 	
sadness: 29% 	joy: 34% 	love: 8% 	anger: 13% 	fear: 12% 	surprise: 4% 	
********************
********************
No description found for dataset rotten_tomatoes
train: 80% 	validation: 10% 	test: 10% 	
neg: 50% 	pos: 50% 	
********************
********************
No description found for dataset snli
test: 2% 	validation: 2% 	train: 96% 	
entailment: 33% 	neutral: 33% 	contradiction: 33% 	
********************
********************
No description found for dataset sst2
train: 96% 	validation: 1% 	test: 3% 	
negative: 44% 	positive: 56% 	
********************
********************
In this dataset, given a textual dialogue i.e. an utterance along with two previous turns of context, the goal was to infer the underlying emotion of the utterance by choo


What patterns can you notice in the relative sizes of the subsets? Can you tell why this might be?

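In every dataset whose sizes are shown above, the training subset holds at least 80% of the examples: emotion and rotten_tomatoes are split 80/10/10, while snli and sst2 put 96% of their examples in train.

Training needs far more data than evaluation because the model learns its parameters from the training subset, and more examples generally help it generalize. The validation and test subsets only have to be large enough to give a reliable estimate of performance, so they can be much smaller. The validation subset is used to check the model during training, and the test subset is kept for the final evaluation.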
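
The percentages printed above are rounded to whole numbers, so small subsets shrink to 1-3% (sst2) and the three snli label shares add up to only 99%. A minimal sketch of a formatting helper that keeps one decimal place could look like this. The name `format_shares`, its `ndigits` parameter, and all counts are hypothetical and only illustrate the arithmetic; they are not taken from the datasets above.

```python
from collections import Counter


def format_shares(counts, ndigits=1):
    """Return 'name: xx.x%' pairs for a mapping of name -> count."""
    total = sum(counts.values())
    if total == 0:
        return ""  # nothing to report for an empty mapping
    return ", ".join(
        f"{name}: {100 * n / total:.{ndigits}f}%" for name, n in counts.items()
    )


# Hypothetical split sizes (not real dataset numbers).
split_sizes = {"train": 960, "validation": 15, "test": 25}
print(format_shares(split_sizes))

# Hypothetical label ids; -1 stands for "no label" and is skipped.
label_names = ["negative", "positive"]
label_ids = [0, 1, 1, -1, 0, 1]
label_counts = Counter(label_names[i] for i in label_ids if i >= 0)
print(format_shares(label_counts))
```

Keeping one decimal place makes small validation and test subsets distinguishable from each other, at the cost of slightly longer output lines.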