# Hugging Face Basics
Basic usage of the Hugging Face `transformers` and `datasets` libraries.

In [9]:
from transformers import pipeline
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

import datasets

## Sentiment analysis on strings

Create a sentiment-analysis pipeline. Because no model is named, the library downloads a default pretrained model together with its tokenizer.

In [2]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint a

Use the classifier on a single example.

In [3]:
classifier("We are very happy to show you the 🤗 Transformers library.")

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]
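The `score` in this output is a probability for the predicted label. As a rough sketch of how such a score can be derived, the snippet below applies a softmax to a pair of raw logits and picks the larger entry. The logits are made up for illustration, and the label order (index 0 = NEGATIVE, index 1 = POSITIVE) is an assumption about the default checkpoint rather than something shown in this notebook.

```python
import math

# Assumed label order for the default SST-2 checkpoint (not shown in this notebook).
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Return the top label and its probability in the pipeline's output shape."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Hypothetical logits for a clearly positive sentence (not from a real model run).
example_logits = [-4.0, 4.4]
prediction = to_prediction(example_logits)
print(prediction["label"], round(prediction["score"], 4))
```

A large gap between the two logits pushes the score close to 1, which is why confident predictions show values such as 0.9998.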

Use the classifier on a list of examples.

In [4]:
results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
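The second prediction has a score close to 0.5, so the model is barely more confident in NEGATIVE than in POSITIVE. One possible way to handle this, sketched below, is to mark predictions under a cutoff as low-confidence. The `flag_uncertain` helper and the 0.75 threshold are illustrative choices, not part of the `transformers` API.

```python
def flag_uncertain(results, threshold=0.75):
    """Add a `confident` flag to each prediction based on its score.

    `threshold` is an arbitrary illustrative cutoff, not a library default.
    """
    return [{**result, "confident": result["score"] >= threshold} for result in results]


# Labels and rounded scores copied from the output above.
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]

flagged = flag_uncertain(results)
for item in flagged:
    note = "" if item["confident"] else " (low confidence)"
    print(f"label: {item['label']}, with score: {item['score']}{note}")
```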


Sentiment classification on a dataset.

TweetEval consists of seven heterogeneous Twitter tasks, all framed as multi-class tweet classification: irony, hate, offensive, stance, emoji, emotion, and sentiment. The tasks are unified into a single benchmark, with each dataset presented in the same format and with fixed training, validation, and test splits. The cell below loads the training split of the `emotion` task.

In [13]:
dataset = datasets.load_dataset("tweet_eval", name='emotion', split="train")

Downloading and preparing dataset tweet_eval/emotion (download: 472.47 KiB, generated: 511.52 KiB, post-processed: Unknown size, total: 984.00 KiB) to /Users/Lauren/.cache/huggingface/datasets/tweet_eval/emotion/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343...



Dataset tweet_eval downloaded and prepared to /Users/Lauren/.cache/huggingface/datasets/tweet_eval/emotion/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343. Subsequent calls will reuse this data.




In [16]:
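# Pull the raw tweet texts out of the dataset
texts = dataset["text"]
# Classify the first four tweets with the sentiment pipeline
classifier(texts[:4])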

[{'label': 'NEGATIVE', 'score': 0.9921145439147949},
 {'label': 'NEGATIVE', 'score': 0.9914141297340393},
 {'label': 'POSITIVE', 'score': 0.9987362027168274},
 {'label': 'NEGATIVE', 'score': 0.6745527982711792}]
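Only the first four tweets are classified here. To cover a whole dataset, one option is to feed the texts through in fixed-size chunks, as sketched below. The helper names are hypothetical, and `fake_classifier` is a stand-in that lets the sketch run without downloading a model; in the notebook you would pass the real `classifier` instead.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def classify_in_batches(texts, classify, batch_size=32):
    """Run `classify` on each chunk and collect the results in order."""
    results = []
    for batch in batched(texts, batch_size):
        results.extend(classify(batch))
    return results


def fake_classifier(batch):
    """Stand-in for the real pipeline: returns one dummy prediction per text."""
    return [{"label": "POSITIVE", "score": 1.0} for _ in batch]


texts = [f"tweet {i}" for i in range(10)]
predictions = classify_in_batches(texts, fake_classifier, batch_size=4)
print(len(predictions))  # prints 10
```

The pipeline call itself also accepts a `batch_size` argument, which may be simpler when no custom handling between chunks is needed.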

# Summarization Example
Using the BillSum dataset.

In [22]:
from datasets import load_dataset
from transformers import AutoTokenizer

## Get Data

In [18]:
billsum = load_dataset("billsum", split="ca_test")

Using custom data configuration default


Downloading and preparing dataset billsum/default (download: 64.14 MiB, generated: 259.80 MiB, post-processed: Unknown size, total: 323.94 MiB) to /Users/Lauren/.cache/huggingface/datasets/billsum/default/3.0.0/d1e95173aed3acb71327864be74ead49b578522e4c7206048b2f2e5351b57959...



Dataset billsum downloaded and prepared to /Users/Lauren/.cache/huggingface/datasets/billsum/default/3.0.0/d1e95173aed3acb71327864be74ead49b578522e4c7206048b2f2e5351b57959. Subsequent calls will reuse this data.


In [19]:
# Split into train (80%) and test (20%) subsets
billsum = billsum.train_test_split(test_size=0.2)

In [20]:
# Inspect the first training example
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 1938 of the Civil Code is amended to read:\n1938.\n(a) A commercial property owner or lessor shall state on every lease form or rental agreement executed on or after January 1, 2016, whether or not the subject premises have undergone inspection by a Certified Access Specialist (CASp).\n(b) If the subject premises have undergone inspection by a CASp and, to the best of the commercial property owner’s or lessor’s knowledge, there have been no modifications or alterations completed or commenced between the date of the inspection and the date of the lease or rental agreement which have impacted the subject premises’ compliance with construction-related accessibility standards, the commercial property owner or lessor shall provide, prior to execution of the lease or rental agreement, a copy of any report prepared by the CASp with an agreement from the prospective lessee or tenant that information i
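For intuition about what `train_test_split(test_size=0.2)` does, here is a minimal pure-Python sketch of a shuffled 80/20 split. It is not the `datasets` implementation (which may round sizes differently and shuffles with its own generator); the `simple_train_test_split` helper and the toy records are assumptions made for illustration.

```python
import random


def simple_train_test_split(records, test_size=0.2, seed=0):
    """Shuffle record indices and hold out a `test_size` fraction as the test set."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(len(records) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return {
        "train": [records[i] for i in train_idx],
        "test": [records[i] for i in test_idx],
    }


# Toy records with the same three fields as a BillSum example.
records = [
    {"text": f"bill {i}", "summary": f"summary {i}", "title": f"title {i}"}
    for i in range(10)
]
split = simple_train_test_split(records)
print(len(split["train"]), len(split["test"]))  # prints 8 2
```

Without a fixed seed the split changes on every run; `train_test_split` also takes a `seed` argument if reproducible splits are wanted.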

In [21]:
billsum["train"][0].keys()

dict_keys(['text', 'summary', 'title'])

## Preprocess

In [23]:
# Load the tokenizer that matches the t5-small checkpoint
tokenizer = AutoTokenizer.from_pretrained("t5-small")

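T5 checkpoints are usually prompted with a task prefix, so a common preprocessing step for summarization is to prepend `"summarize: "` to each input before tokenizing. The sketch below shows only that string-level step on a hypothetical batch with the same `text`, `summary`, and `title` fields as BillSum. The `build_model_inputs` name is an assumption, and the tokenizer call is left out so the snippet runs without any downloads.

```python
# Task prefix commonly used with T5 for summarization.
PREFIX = "summarize: "


def build_model_inputs(examples):
    """Pair prefixed bill texts (inputs) with their summaries (targets)."""
    inputs = [PREFIX + text for text in examples["text"]]
    targets = list(examples["summary"])
    return {"inputs": inputs, "targets": targets}


# Hypothetical column-oriented batch; the values are placeholders, not BillSum data.
batch = {
    "text": ["Example bill text one.", "Example bill text two."],
    "summary": ["Summary one.", "Summary two."],
    "title": ["Title one", "Title two"],
}

prepared = build_model_inputs(batch)
print(prepared["inputs"][0])  # prints summarize: Example bill text one.
```

In the notebook, the prefixed inputs and the targets would then be passed to `tokenizer`, typically with truncation enabled because bill texts are long.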
