# CS 195: Natural Language Processing
## More On Dataset Organization

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F2_1_MoreOnDatasets.ipynb)


## References


Hugging Face *Load a dataset from the Hub tutorial*: https://huggingface.co/docs/datasets/load_hub

Hugging Face *dataset features doc*: https://huggingface.co/docs/datasets/about_dataset_features

In [None]:
#install what you need for this notebook
import sys
!{sys.executable} -m pip install transformers datasets


## Picking up where we left off with datasets

**Previously:** We had some code that looked like this

In [None]:
from transformers import pipeline
from datasets import load_dataset
from sklearn.metrics import accuracy_score

# loading a dataset and a model
emotions_dataset = load_dataset("go_emotions", split="test")
classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")

number_to_try = 10

#get predictions for the first number_to_try samples
results = classifier(emotions_dataset["text"][0:number_to_try])

predicted_labels = []
actual_labels = []

for idx in range(number_to_try):

    print("\nItem from the dataset:",emotions_dataset[idx])
    print("Prediction from the model:",results[idx])


    predicted_label = results[idx]["label"]
    actual_label_numeric = emotions_dataset[idx]["labels"][0] #this dataset returns a list of numeric labels
    actual_label_name = emotions_dataset.features["labels"].feature.int2str( actual_label_numeric ) #look up the name for this numeric label

    predicted_labels.append(predicted_label)
    actual_labels.append( actual_label_name )

print("Accuracy:",accuracy_score(actual_labels,predicted_labels) )



Item from the dataset: {'text': 'I’m really sorry about your situation :( Although I love the names Sapphira, Cirilla, and Scarlett!', 'labels': [25], 'id': 'eecwqtt'}
Prediction from the model: {'label': 'remorse', 'score': 0.6783004999160767}

Item from the dataset: {'text': "It's wonderful because it's awful. At not with.", 'labels': [0], 'id': 'ed5f85d'}
Prediction from the model: {'label': 'admiration', 'score': 0.6606240272521973}

Item from the dataset: {'text': 'Kings fan here, good luck to you guys! Will be an interesting game to watch! ', 'labels': [13], 'id': 'een27c3'}
Prediction from the model: {'label': 'optimism', 'score': 0.5494066476821899}

Item from the dataset: {'text': "I didn't know that, thank you for teaching me something today!", 'labels': [15], 'id': 'eelgwd1'}
Prediction from the model: {'label': 'gratitude', 'score': 0.9829797744750977}

Item from the dataset: {'text': 'They got bored from haunting earth for thousands of years and ultimately moved on to the

## But how did we know how to do this?


Note that the labels stored in the dataset have values like `[25]` or `[13]`

but the model predicts `'remorse'` or `'optimism'`

You can learn about some of the things you can do with these class labels by looking at the documentation here: https://huggingface.co/docs/datasets/about_dataset_features

In [None]:
emotions_dataset.features

{'text': Value(dtype='string', id=None),
 'labels': Sequence(feature=ClassLabel(names=['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral'], id=None), length=-1, id=None),
 'id': Value(dtype='string', id=None)}

In [None]:
emotions_dataset.features["labels"]

Sequence(feature=ClassLabel(names=['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral'], id=None), length=-1, id=None)

In [None]:
emotions_dataset.features['labels'].feature

ClassLabel(names=['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral'], id=None)

In [None]:
emotions_dataset.features['labels'].feature.names

['admiration',
 'amusement',
 'anger',
 'annoyance',
 'approval',
 'caring',
 'confusion',
 'curiosity',
 'desire',
 'disappointment',
 'disapproval',
 'disgust',
 'embarrassment',
 'excitement',
 'fear',
 'gratitude',
 'grief',
 'joy',
 'love',
 'nervousness',
 'optimism',
 'pride',
 'realization',
 'relief',
 'remorse',
 'sadness',
 'surprise',
 'neutral']

In [None]:
emotions_dataset.features['labels'].feature.names[13]

'excitement'

You can find the documentation for each of these kinds of objects to figure out what they can do.

https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Value


https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Sequence


https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.ClassLabel

And you'll see that a `ClassLabel` has a method called `int2str`, so you can also find out which label name goes with an int by calling it:

In [None]:
emotions_dataset.features['labels'].feature.int2str(13)

'excitement'
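Conceptually, this lookup is just an index into the list of names. The following is a plain-Python sketch of that idea, not the library's implementation. `GO_EMOTIONS_NAMES`, `int_to_name`, and `name_to_int` are hypothetical names introduced for illustration, and the list is copied from the `features` output above.

```python
# Label names copied from emotions_dataset.features['labels'].feature.names above.
GO_EMOTIONS_NAMES = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]


def int_to_name(label_id, names=GO_EMOTIONS_NAMES):
    """Hypothetical helper: map a numeric label to its name by list position."""
    return names[label_id]


def name_to_int(label_name, names=GO_EMOTIONS_NAMES):
    """Hypothetical helper: the reverse lookup, from name to numeric label."""
    return names.index(label_name)


print(int_to_name(13))                  # same lookup as .names[13]
print(name_to_int("optimism"))          # position of 'optimism' in the list
print([int_to_name(i) for i in [25]])   # a go_emotions item stores a list of ints
```

Because a `go_emotions` item stores its labels as a list such as `[25]`, the last line converts every entry rather than a single int.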

## Problems with plugging in a different dataset and model

Because of differences in the way that datasets organize their data, you may not be able to use the exact same code.

Dataset: https://huggingface.co/datasets/papluca/language-identification

Model: https://huggingface.co/papluca/xlm-roberta-base-language-detection

In [None]:
from transformers import pipeline
from datasets import load_dataset
from sklearn.metrics import accuracy_score

# loading a dataset and a model
lang_dataset = load_dataset("papluca/language-identification")
classifier = pipeline("text-classification", model="papluca/xlm-roberta-base-language-detection")

number_to_try = 10

#get predictions for the first number_to_try samples
results = classifier(lang_dataset["test"]["text"][0:number_to_try])

predicted_labels = []
actual_labels = []

for idx in range(number_to_try):

    print("\nItem from the dataset:",lang_dataset["test"][idx])
    print("Prediction from the model:",results[idx])


    predicted_label = results[idx]["label"]
    actual_label_numeric = lang_dataset["test"][idx]["labels"][0] #this dataset returns a list of numeric labels
    actual_label_name = lang_dataset["test"].features["label"].int2str( actual_label_numeric ) #look up the name for this numeric label

    predicted_labels.append(predicted_label)
    actual_labels.append( actual_label_name )

print("Accuracy:",accuracy_score(actual_labels,predicted_labels) )


Item from the dataset: {'labels': 'nl', 'text': 'Een man zingt en speelt gitaar.'}
Prediction from the model: {'label': 'nl', 'score': 0.9956241250038147}


KeyError: ignored

Notice that the item from the dataset has a key `'labels'` with the value `'nl'`

and the prediction has a key `'label'` with the value `'nl'`

These already match, so we can compare them directly - we don't need to extract the numeric value and then look up which name it represents

In [None]:
from transformers import pipeline
from datasets import load_dataset
from sklearn.metrics import accuracy_score

# loading a dataset and a model
lang_dataset = load_dataset("papluca/language-identification")
classifier = pipeline("text-classification", model="papluca/xlm-roberta-base-language-detection")

number_to_try = 10

#get predictions for the first number_to_try samples
results = classifier(lang_dataset["test"]["text"][0:number_to_try])

predicted_labels = []
actual_labels = []

for idx in range(number_to_try):

    print("\nItem from the dataset:",lang_dataset["test"][idx])
    print("Prediction from the model:",results[idx])


    predicted_label = results[idx]["label"]
    actual_label = lang_dataset["test"][idx]["labels"] #this dataset returns labels as strings

    predicted_labels.append(predicted_label)
    actual_labels.append( actual_label )

print("Accuracy:",accuracy_score(actual_labels,predicted_labels) )
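One way to tell which situation you are in is to check the Python type of the label value in a single item before writing the loop. Here is a minimal sketch, using two items printed earlier in this notebook as plain dictionaries. `label_layout` is a hypothetical helper, not part of the `datasets` library.

```python
def label_layout(label_value):
    """Hypothetical helper: describe how a dataset item stores its label."""
    if isinstance(label_value, str):
        return "string"   # already a name, can be compared directly
    if isinstance(label_value, int):
        return "int"      # needs a lookup to get the name
    if isinstance(label_value, (list, tuple)):
        return "list"     # several labels per item, each needs a lookup
    return "unknown"


# Items copied from the outputs shown earlier in this notebook.
emotions_item = {
    "text": "I didn't know that, thank you for teaching me something today!",
    "labels": [15],
    "id": "eelgwd1",
}
lang_item = {"labels": "nl", "text": "Een man zingt en speelt gitaar."}

print(label_layout(emotions_item["labels"]))
print(label_layout(lang_item["labels"]))
```

In the notebook itself, the same check could be applied to `lang_dataset["test"][0]["labels"]`.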

## Another dataset

Here's another dataset that is also different: https://huggingface.co/datasets/tweet_eval

In [None]:
tweet_dataset = load_dataset("tweet_eval", "emoji")

In [None]:
tweet_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

In [None]:
tweet_test_dataset = tweet_dataset["test"]
tweet_test_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 50000
})

In [None]:
tweet_test_dataset[12]

{'text': 'Welcome to New York! @ Times Square, New York City', 'label': 5}

This one seems to store the labels as numbers.

Note that it isn't a list with a number in it like it was for `go_emotions`

It's also `'label'` instead of `'labels'`

In [None]:
emotions_dataset[0]

{'text': 'I’m really sorry about your situation :( Although I love the names Sapphira, Cirilla, and Scarlett!',
 'labels': [25],
 'id': 'eecwqtt'}
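Since the column is called `'label'` in one dataset and `'labels'` in another, it can help to look the name up instead of hard-coding it. Here is a small sketch, assuming the label column is always named one of those two things. `find_label_column` is a hypothetical helper, and the column lists are copied from the outputs in this notebook (with a real dataset they could come from its `column_names` attribute).

```python
def find_label_column(column_names, candidates=("label", "labels")):
    """Hypothetical helper: pick the column that holds the labels."""
    matches = [name for name in column_names if name in candidates]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one label column, found {matches}")
    return matches[0]


# Column names as they appear in the outputs in this notebook.
print(find_label_column(["text", "labels", "id"]))  # go_emotions
print(find_label_column(["text", "label"]))         # tweet_eval, emoji
```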

What do the numbers represent in the tweet dataset?

Let's look at the dataset's `features` attribute.

In [None]:
tweet_test_dataset.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜'], id=None)}

In [None]:
tweet_test_dataset.features['label']

ClassLabel(names=['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜'], id=None)

This one organizes it differently!

Compare to our `go_emotions` example

In [None]:
emotions_dataset.features["labels"]

Sequence(feature=ClassLabel(names=['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral'], id=None), length=-1, id=None)

Since the tweet data's `label` feature is already a `ClassLabel`, not a `Sequence` with `feature=ClassLabel`, we can call `int2str` on it directly to get the string from the int

In [None]:
tweet_test_dataset.features['label'].int2str(5)

'😊'
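The three layouts seen so far (a `Sequence` wrapping a `ClassLabel`, a bare `ClassLabel`, and a plain string) can be handled by one function that inspects the feature object. The sketch below relies only on what the outputs above show: a `ClassLabel` has `int2str`, and a `Sequence` exposes the wrapped feature as `.feature`. `label_to_names` and the `Fake...` stand-in classes are hypothetical and exist only so the example runs without downloading anything.

```python
def label_to_names(feature, value):
    """Hypothetical helper: return a list of label names for one item."""
    if hasattr(feature, "int2str"):
        # ClassLabel-like: value is a single int
        return [feature.int2str(value)]
    inner = getattr(feature, "feature", None)
    if inner is not None and hasattr(inner, "int2str"):
        # Sequence-like wrapper around a ClassLabel: value is a list of ints
        return [inner.int2str(v) for v in value]
    # Anything else (for example a string column): value is already the name
    return [value]


# Stand-ins that mimic only the attributes used above (not the real classes).
class FakeClassLabel:
    def __init__(self, names):
        self.names = names

    def int2str(self, label_id):
        return self.names[label_id]


class FakeSequence:
    def __init__(self, feature):
        self.feature = feature


class FakeValue:
    dtype = "string"


emoji = FakeClassLabel(["❤", "😍", "😂", "💕", "🔥", "😊"])  # first six tweet_eval emoji names
emotions = FakeSequence(FakeClassLabel(["admiration", "amusement", "anger"]))  # first three go_emotions names

print(label_to_names(emoji, 5))
print(label_to_names(emotions, [0, 2]))
print(label_to_names(FakeValue(), "nl"))
```

With the real datasets, the first argument would be `tweet_test_dataset.features['label']` or `emotions_dataset.features['labels']`. Always returning a list keeps the calling code the same for all three layouts.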

## Adapting our code to make it work

In [None]:
from transformers import pipeline
from datasets import load_dataset
from sklearn.metrics import accuracy_score

# loading a dataset and a model
tweet_dataset = load_dataset("tweet_eval", "emoji", split="test")
classifier = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-emoji")

number_to_try = 10

#get predictions for the first number_to_try samples
results = classifier(tweet_dataset["text"][0:number_to_try])

predicted_labels = []
actual_labels = []

for idx in range(number_to_try):

    print("\nItem from the dataset:",tweet_dataset[idx],tweet_dataset.features['label'].int2str(tweet_dataset[idx]['label']))
    print("Prediction from the model:",results[idx])


    predicted_label = results[idx]["label"]
    actual_label_numeric = tweet_dataset[idx]["label"] #this dataset returns a single numeric label, not a list
    actual_label_name = tweet_dataset.features["label"].int2str( actual_label_numeric ) #look up the name for this numeric label

    predicted_labels.append(predicted_label)
    actual_labels.append( actual_label_name )

print("Accuracy:",accuracy_score(actual_labels,predicted_labels) )


Item from the dataset: {'text': 'en Pelham Parkway', 'label': 2} 😂
Prediction from the model: {'label': '😍', 'score': 0.1873828023672104}

Item from the dataset: {'text': 'The calm before...... | w/ sofarsounds @user | : B. Hall.......#sofarsounds…', 'label': 10} 📷
Prediction from the model: {'label': '📷', 'score': 0.65936678647995}

Item from the dataset: {'text': 'Just witnessed the great solar eclipse @ Tampa, Florida', 'label': 6} 😎
Prediction from the model: {'label': '😎', 'score': 0.17400723695755005}

Item from the dataset: {'text': 'This little lady is 26 weeks pregnant today! Excited for baby Cam to come! @ Springfield,…', 'label': 1} 😍
Prediction from the model: {'label': '😍', 'score': 0.4535968601703644}

Item from the dataset: {'text': 'Great road trip views! @ Shartlesville, Pennsylvania', 'label': 16} 😁
Prediction from the model: {'label': '😍', 'score': 0.5469061136245728}

Item from the dataset: {'text': 'CHRISTMAS DEALS BUY ANY 3 SMALL POMADES 1.5 OR 1.7 OZ RECEIVE THE