# HF with Python - Understanding the Basics

In [1]:
from huggingface_hub import notebook_login

In [2]:
# notebook_login()

## Loading a Dataset

The Hugging Face `datasets` library can load a dataset from the Hugging Face Hub for training or evaluation. Here we download the Rotten Tomatoes movie review dataset: https://huggingface.co/datasets/rotten_tomatoes

In [3]:
# !pip install datasets

In [4]:
import datasets

In [5]:
datasets.__version__

'2.19.0'

In [6]:
from datasets import load_dataset

In [7]:
reviews = load_dataset('rotten_tomatoes', cache_dir='rotten_tomatoes_data')



In [16]:
reviews

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [21]:
reviews['train'].to_pandas()

Unnamed: 0,text,label
0,the rock is destined to be the 21st century's ...,1
1,"the gorgeously elaborate continuation of "" the...",1
2,effective but too-tepid biopic,1
3,if you sometimes like to go to the movies to h...,1
4,"emerges as something rare , an issue movie tha...",1
...,...,...
8525,any enjoyment will be hinge from a personal th...,0
8526,if legendary shlockmeister ed wood had ever ma...,0
8527,hardly a nuanced portrait of a young woman's b...,0
8528,"interminably bleak , to say nothing of boring .",0


In [22]:
reviews['test'].to_pandas()

Unnamed: 0,text,label
0,lovingly photographed in the manner of a golde...,1
1,consistently clever and suspenseful .,1
2,"it's like a "" big chill "" reunion of the baade...",1
3,the story gives ample opportunity for large-sc...,1
4,"red dragon "" never cuts corners .",1
...,...,...
1061,a terrible movie that some people will neverth...,0
1062,there are many definitions of 'time waster' bu...,0
1063,"as it stands , crocodile hunter has the hurrie...",0
1064,the thing looks like a made-for-home-video qui...,0


## Transformers Library

### Pipelines

In [9]:
import transformers

In [10]:
transformers.__version__

'4.26.1'

In [11]:
from transformers import pipeline

In [12]:
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [13]:
result = classifier("This movie was great!")

In [29]:
result

[{'label': 'POSITIVE', 'score': 0.9998677968978882}]

In [30]:
classifier("This film was the worst I have ever seen, horrible")

[{'label': 'NEGATIVE', 'score': 0.9997925162315369}]

In [32]:
def label(review):
    # Map the pipeline's 'POSITIVE'/'NEGATIVE' string to the dataset's 1/0 labels
    predicted = classifier(review)[0]['label']
    return 1 if predicted == 'POSITIVE' else 0

In [33]:
label("This movie was so bad, I would have walked out if I wasn't on a plane! lol")

0

In [35]:
label("Amazing movie!")

1
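
The same mapping can be applied to a whole batch of results at once. Assuming the pipeline is called with a list of strings, it returns one result dict per input, each in the `{'label': ..., 'score': ...}` shape shown earlier. The following is a minimal sketch rather than part of the original notebook: `to_int_labels` and `LABEL_TO_INT` are hypothetical names, and the sample data is copied from the two classifier outputs above so that the snippet runs without downloading a model.

```python
# Lookup table from the pipeline's string labels to the dataset's 0/1 labels.
# LABEL_TO_INT and to_int_labels are illustrative names, not library APIs.
LABEL_TO_INT = {"POSITIVE": 1, "NEGATIVE": 0}


def to_int_labels(results):
    """Convert a list of {'label': ..., 'score': ...} dicts to 0/1 integers."""
    return [LABEL_TO_INT[r["label"]] for r in results]


# Sample data copied from the two classifier outputs shown earlier.
sample_results = [
    {"label": "POSITIVE", "score": 0.9998677968978882},
    {"label": "NEGATIVE", "score": 0.9997925162315369},
]

int_labels = to_int_labels(sample_results)
print(int_labels)  # [1, 0]
```

A dictionary lookup raises a `KeyError` on an unexpected label instead of silently returning 0, which may be preferable when switching to a model with a different label set.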

In [37]:
test_df = reviews['test'].to_pandas()

In [38]:
test_df.head(3)

Unnamed: 0,text,label
0,lovingly photographed in the manner of a golde...,1
1,consistently clever and suspenseful .,1
2,"it's like a "" big chill "" reunion of the baade...",1


In [40]:
test_df['predicted_label'] = test_df['text'].apply(label)

In [41]:
test_df.head()

Unnamed: 0,text,label,predicted_label
0,lovingly photographed in the manner of a golde...,1,1
1,consistently clever and suspenseful .,1,1
2,"it's like a "" big chill "" reunion of the baade...",1,0
3,the story gives ample opportunity for large-sc...,1,1
4,"red dragon "" never cuts corners .",1,1


In [42]:
test_df['label']==test_df['predicted_label']

0        True
1        True
2       False
3        True
4        True
        ...  
1061     True
1062     True
1063     True
1064     True
1065     True
Length: 1066, dtype: bool

In [44]:
# Number of correct predictions: summing booleans counts True as 1 and False as 0
sum(test_df['label']==test_df['predicted_label'])

956

In [46]:
# Accuracy: the fraction of test rows where the prediction matches the true label
sum(test_df['label']==test_df['predicted_label'])/len(test_df)

0.8968105065666041
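
The accuracy computed above is the number of matching labels divided by the number of rows. As a rough, library-free sketch of the same calculation (the `accuracy` helper is a hypothetical name, not something from the notebook), the snippet below uses the true and predicted labels of the first five test rows shown in `test_df.head()` and then re-derives the full-test-set figure from the reported counts.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where y_pred equals y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty list")
    # Each comparison is a bool; summing counts True as 1 and False as 0.
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# True and predicted labels of the first five test rows from test_df.head().
y_true = [1, 1, 1, 1, 1]
y_pred = [1, 1, 0, 1, 1]
head_accuracy = accuracy(y_true, y_pred)
print(head_accuracy)  # 0.8

# Full test set, from the counts reported above: 956 matches out of 1066 rows.
full_accuracy = 956 / 1066
print(round(full_accuracy, 4))  # 0.8968
```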

The available pipelines are listed in the Transformers documentation: https://huggingface.co/docs/transformers/en/main_classes/pipelines