# HF with Python - Understanding the Basics

In [1]:
!nvidia-smi

Tue Jul  9 13:38:33 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
from huggingface_hub import notebook_login

In [3]:
notebook_login()


## Loading a Dataset

You can load a dataset for evaluation or training from the Hugging Face Hub with the `datasets` library. Let's download the Rotten Tomatoes movie review dataset: https://huggingface.co/datasets/rotten_tomatoes

In [4]:
!pip install datasets



In [5]:
import datasets

In [6]:
datasets.__version__

'2.20.0'

In [7]:
from datasets import load_dataset

In [8]:
# The dataset can be browsed on the Hugging Face Hub:
# https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes/viewer
# cache_dir is optional; if omitted, it defaults to ~/.cache/huggingface/datasets
reviews = load_dataset('rotten_tomatoes', cache_dir='rotten_tomatoes_data')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/7.46k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/699k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/90.0k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/92.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8530 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1066 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1066 [00:00<?, ? examples/s]

In [11]:
# load_dataset returns a datasets.dataset_dict.DatasetDict:
# a dictionary that maps each split name to a Dataset, so
# reviews['train'], reviews['test'] and reviews['validation']
# are each of type Dataset
type(reviews)

datasets.dataset_dict.DatasetDict

In [12]:
reviews

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [10]:
# Export a Dataset split to pandas
# Then, we can grab the columns 'text' and 'label' from the pandas df
reviews['train'].to_pandas()

|  | text | label |
| --- | --- | --- |
| 0 | the rock is destined to be the 21st century's ... | 1 |
| 1 | the gorgeously elaborate continuation of " the... | 1 |
| 2 | effective but too-tepid biopic | 1 |
| 3 | if you sometimes like to go to the movies to h... | 1 |
| 4 | emerges as something rare , an issue movie tha... | 1 |
| ... | ... | ... |
| 8525 | any enjoyment will be hinge from a personal th... | 0 |
| 8526 | if legendary shlockmeister ed wood had ever ma... | 0 |
| 8527 | hardly a nuanced portrait of a young woman's b... | 0 |
| 8528 | interminably bleak , to say nothing of boring . | 0 |


In [None]:
reviews['test'].to_pandas()

|  | text | label |
| --- | --- | --- |
| 0 | lovingly photographed in the manner of a golde... | 1 |
| 1 | consistently clever and suspenseful . | 1 |
| 2 | it's like a " big chill " reunion of the baade... | 1 |
| 3 | the story gives ample opportunity for large-sc... | 1 |
| 4 | red dragon " never cuts corners . | 1 |
| ... | ... | ... |
| 1061 | a terrible movie that some people will neverth... | 0 |
| 1062 | there are many definitions of 'time waster' bu... | 0 |
| 1063 | as it stands , crocodile hunter has the hurrie... | 0 |
| 1064 | the thing looks like a made-for-home-video qui... | 0 |
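
Once a split is a pandas DataFrame, the `text` and `label` columns can be pulled out with ordinary pandas indexing. The sketch below uses a small invented frame with the same two column names, so it runs without the dataset; the rows are assumptions made up for illustration.

```python
import pandas as pd

# Toy frame with the same two columns as the exported split (rows invented).
df = pd.DataFrame({
    "text": ["a charming little film .", "dull and overlong .", "an instant classic ."],
    "label": [1, 0, 1],
})

texts = df["text"].tolist()            # column as a plain Python list
counts = df["label"].value_counts()    # number of rows per class
print(texts[0])
print(counts.to_dict())
```

Running `value_counts()` on the real `label` column is a quick way to check class balance before interpreting an accuracy figure.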


## Transformers Library

### Pipelines

The `pipeline` function is the basic entry point to the `transformers` library.
We can go to the Hugging Face Hub, pick a model for a given task (e.g., text classification) and pass it to `pipeline`. The pipeline then handles preprocessing, inference and postprocessing of the input text, and returns the output expected for the task.

In [17]:
import transformers

In [18]:
transformers.__version__

'4.41.2'

In [19]:
from transformers import pipeline

In [20]:
# If no model is passed, the default model for the task is used
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]
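
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so follows, reusing the model id and revision that the log reports; the helper name `build_classifier` is hypothetical and not part of `transformers`.

```python
# Values copied from the log message printed by pipeline('sentiment-analysis').
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"


def build_classifier(model_id=MODEL_ID, revision=REVISION):
    """Create a sentiment pipeline pinned to one checkpoint (hypothetical helper)."""
    # Imported here so that defining the helper does not load transformers;
    # the model files are downloaded on first call if they are not cached.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_id, revision=revision)
```

Calling `classifier = build_classifier()` should give the same classifier as the default call, but without depending on whatever default the installed `transformers` version ships.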

In [21]:
result = classifier("This movie was great!")

In [22]:
result

[{'label': 'POSITIVE', 'score': 0.9998677968978882}]

In [23]:
classifier("This film was the worst I have ever seen, horrible")

[{'label': 'NEGATIVE', 'score': 0.9997925162315369}]

In [24]:
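def label(review):
    """Return 1 if the classifier predicts POSITIVE for the review, else 0."""
    # The pipeline returns a list with one dict per input text; take the first
    predicted = classifier(review)[0]['label']
    return 1 if predicted == 'POSITIVE' else 0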

In [25]:
label("This movie was so bad, I would have walked out if I wasn't on a plane! lol")

0

In [26]:
label("Amazing movie!")

1

In [27]:
test_df = reviews['test'].to_pandas()

In [28]:
test_df.head(3)

|  | text | label |
| --- | --- | --- |
| 0 | lovingly photographed in the manner of a golde... | 1 |
| 1 | consistently clever and suspenseful . | 1 |
| 2 | it's like a " big chill " reunion of the baade... | 1 |


In [29]:
test_df['predicted_label'] = test_df['text'].apply(label)
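
The `apply` call above invokes the pipeline once per row. Pipelines also accept a list of texts and a `batch_size` argument, so all reviews can be classified in a single call. The sketch below shows that pattern; `predict_labels`, `to_binary` and `fake_classifier` are hypothetical names, and the stand-in classifier exists only so the example runs without downloading a model.

```python
def to_binary(outputs):
    """Map pipeline outputs (a list of {'label', 'score'} dicts) to 1/0."""
    return [1 if out["label"] == "POSITIVE" else 0 for out in outputs]


def predict_labels(texts, classifier, batch_size=32):
    """Classify all texts with one pipeline call instead of one call per row."""
    outputs = classifier(list(texts), batch_size=batch_size)
    return to_binary(outputs)


# Stand-in classifier (hypothetical) that mimics the pipeline's output format.
def fake_classifier(texts, batch_size=1):
    return [
        {"label": "POSITIVE" if "great" in t else "NEGATIVE", "score": 1.0}
        for t in texts
    ]


print(predict_labels(["great fun", "a total bore"], fake_classifier))  # [1, 0]
```

With the real pipeline this would read `test_df['predicted_label'] = predict_labels(test_df['text'], classifier)`. Whether batching is actually faster depends on the hardware and the input lengths, so it is worth timing both variants.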

In [30]:
test_df.head()

|  | text | label | predicted_label |
| --- | --- | --- | --- |
| 0 | lovingly photographed in the manner of a golde... | 1 | 1 |
| 1 | consistently clever and suspenseful . | 1 | 1 |
| 2 | it's like a " big chill " reunion of the baade... | 1 | 0 |
| 3 | the story gives ample opportunity for large-sc... | 1 | 1 |
| 4 | red dragon " never cuts corners . | 1 | 1 |


In [31]:
test_df['label']==test_df['predicted_label']

0        True
1        True
2       False
3        True
4        True
        ...  
1061     True
1062     True
1063     True
1064     True
1065     True
Length: 1066, dtype: bool

In [32]:
# Number of matches, since True is 1 and False is treated as 0
sum(test_df['label']==test_df['predicted_label'])

956

In [33]:
# Accuracy: number of matches divided by the number of test rows
sum(test_df['label']==test_df['predicted_label'])/len(test_df)

0.8968105065666041
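
Accuracy alone does not show which kind of mistake the model makes. As a rough sketch, the helpers below compute accuracy and count each (true, predicted) pair using only the standard library; the function names and the toy labels are assumptions for illustration, not part of the notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    pairs = list(zip(y_true, y_pred))
    return sum(t == p for t, p in pairs) / len(pairs)


def confusion_counts(y_true, y_pred):
    """Count (true, predicted) pairs; (1, 0) is a positive review predicted negative."""
    return Counter(zip(y_true, y_pred))


# Toy labels invented for illustration.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 0.6
print(confusion_counts(y_true, y_pred))
```

Both helpers accept any pair of equal-length iterables, so they can be called with `test_df['label']` and `test_df['predicted_label']` directly.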

The available pipelines are listed in the documentation: https://huggingface.co/docs/transformers/en/main_classes/pipelines