### Hugging Face 

🤗

First, install the transformers library in the Colab notebook. 

In [2]:
%%capture
!pip install transformers 
!pip install xformers

#### Sentiment analysis
Next, we can use a simple pipeline to perform sentiment analysis. 

In [None]:
from transformers import pipeline

# initialize the classifier 
classifier = pipeline("sentiment-analysis") 

# apply the classifier to the following string
res = classifier("I've been waiting for a Hugging Face course my whole life.")

# print results from sentiment analysis
print(res) 

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9982948899269104}]
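
The pipeline returns a list with one dictionary per input string, each holding a `label` and a `score`. As a minimal sketch of how that result might be consumed, the snippet below copies the output printed above into a plain Python list, so it runs without downloading a model. The helper `confident_label`, its `threshold` of 0.9, and the `"UNCERTAIN"` fallback are hypothetical names chosen for illustration and are not part of the transformers API.

```python
# Output copied from the cell above: one dict per input string.
res = [{"label": "POSITIVE", "score": 0.9982948899269104}]


def confident_label(prediction, threshold=0.9):
    """Return the predicted label only if its score clears the threshold.

    Both the threshold and the "UNCERTAIN" fallback are illustrative choices.
    """
    if prediction["score"] >= threshold:
        return prediction["label"]
    return "UNCERTAIN"


for prediction in res:
    print(confident_label(prediction), round(prediction["score"], 4))
```

A cutoff like this is one simple way to avoid acting on low-confidence predictions; a suitable value depends on the application.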


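Notice that the transformers library, not Python itself, warned that no model was supplied and fell back to a default. To choose a specific model, pass the `model` argument to `pipeline`; the task is then inferred from the model. Note that `roberta-large-mnli` is a natural language inference model, so its labels (`CONTRADICTION`, `NEUTRAL`, `ENTAILMENT`) are not sentiment labels.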

In [None]:
pipe = pipeline(model = "roberta-large-mnli") 

# apply the model that was specified 
pipe("This restaurant is awesome") 

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'label': 'NEUTRAL', 'score': 0.7313134670257568}]

#### Text generation

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model = "distilgpt2") 

res = generator(
    "In this course, we will teach you how to work with data types like", 
    max_length = 50, 
    num_return_sequences = 1,
)
print(res)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to work with data types like a spreadsheet.\n\n\nIf you follow or follow the tutorial from this course, look for tutorials and videos on how you can get started with the same principles as ours.'}]


#### Zero-Shot classification

In [None]:
# initialize 
classifier_zero = pipeline("zero-shot-classification")

res = classifier_zero(
    "This is a course about Python list comprehension",
    candidate_labels = ["education", "politics", "business"]
)
print(res) 

res1 = classifier_zero(
    "the candidate was absent from the polling site", 
    candidate_labels = ["sports", "politics", "economics"]
)

print(res1) 

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about Python list comprehension', 'labels': ['education', 'business', 'politics'], 'scores': [0.9622024893760681, 0.026841476559638977, 0.010956035926938057]}
{'sequence': 'the candidate was absent from the polling site', 'labels': ['politics', 'economics', 'sports'], 'scores': [0.9342160820960999, 0.04413432255387306, 0.021649545058608055]}
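
The default zero-shot model is an NLI model, and as far as I understand the pipeline, each candidate label is turned into a hypothesis sentence, scored for entailment against the input, and the entailment scores are normalized across labels with a softmax. The sketch below reproduces only that last ranking step in plain Python. The logits are made-up numbers rather than real model output, and `rank_labels` is a hypothetical helper, not a transformers function.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(candidate_labels, entailment_logits, template="This example is {}."):
    """Rank labels by softmax-normalized entailment logits (illustrative only)."""
    hypotheses = [template.format(label) for label in candidate_labels]
    scores = softmax(entailment_logits)
    ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
    return {
        "hypotheses": hypotheses,
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }


# Hypothetical entailment logits, one per candidate label (not real model output).
hypothetical_logits = [3.1, -1.4, -0.5]
result = rank_labels(["education", "politics", "business"], hypothetical_logits)
print(result["labels"])
print(result["scores"])
```

Because of the softmax, the scores for one input sum to 1, which matches the two outputs printed above.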


#### Instantiating Models 

Models are easy to load from the Hugging Face Hub, either with pre-trained weights (for inference or transfer learning) or as an untrained architecture built from a configuration, which the user then trains from scratch.

Below, I load GPT-2 with its pre-trained weights. For BERT, I build the model from the `bert-base-cased` configuration only, so its weights are randomly initialized; this is useful for training on your own data when the pre-trained weights are not wanted.

In [4]:
from transformers import AutoModel, BertConfig, BertModel 

# GPT2
gpt_model = AutoModel.from_pretrained("gpt2") 
print(type(gpt_model)) 

# BERT: build the architecture from its configuration (randomly initialized weights)
bert_config = BertConfig.from_pretrained("bert-base-cased")
bert_model = BertModel(bert_config)

# adjust BERT: use 10 hidden layers instead of the default 12
bert_config = BertConfig.from_pretrained("bert-base-cased", num_hidden_layers = 10) 
bert_model = BertModel(bert_config)

# view the updated BERT configuration 
print(bert_config)


<class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>


BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 10,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.29.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}
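
To get a feel for what changing `num_hidden_layers` does, the sketch below estimates the number of trainable parameters from the configuration values printed above. It assumes the standard `BertModel` layout (embeddings, a stack of identical encoder layers, and a pooler); `bert_param_count` is a hypothetical helper written for this illustration, not a library function.

```python
def bert_param_count(num_hidden_layers, vocab_size=28996, hidden_size=768,
                     intermediate_size=3072, max_position_embeddings=512,
                     type_vocab_size=2):
    """Estimate BertModel parameters from config values (assumed standard layout)."""
    layer_norm = 2 * hidden_size  # weight and bias

    # Word, position, and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size
    embeddings += layer_norm

    # Self-attention: query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm

    # Feed-forward block: expand to intermediate_size, project back, then LayerNorm.
    feed_forward = (hidden_size * intermediate_size + intermediate_size
                    + intermediate_size * hidden_size + hidden_size
                    + layer_norm)

    # Pooler: one dense layer applied to the first token's hidden state.
    pooler = hidden_size * hidden_size + hidden_size

    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


print(bert_param_count(12))  # default depth
print(bert_param_count(10))  # depth used in the cell above
```

Under these assumptions, each removed encoder layer saves about 7.1 million parameters, while the embedding table is unaffected by the change.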



#### 🤗 Datasets Library

The 🤗 Datasets library provides an API for downloading datasets from the Hugging Face Hub for training and testing.

In [6]:
%%capture 
!pip install datasets 

In [7]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

Downloading and preparing dataset glue/mrpc to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
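
The `DatasetDict` above behaves much like a dictionary keyed by split name. As a small sketch, the snippet below copies the `num_rows` values from that output into a plain dictionary (`split_rows` is a name chosen for this example) and computes each split's share of the data, without needing the datasets library.

```python
# Row counts copied from the DatasetDict output above.
split_rows = {"train": 3668, "validation": 408, "test": 1725}

total_rows = sum(split_rows.values())

# Fraction of all rows that falls into each split.
split_share = {name: rows / total_rows for name, rows in split_rows.items()}

for name, rows in split_rows.items():
    print(f"{name:<10} {rows:>5} rows  {split_share[name]:6.1%}")
```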

#### Tokenization

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification 
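
Before text reaches a model, a tokenizer splits it into subword units from a fixed vocabulary. BERT-style tokenizers use WordPiece, which greedily matches the longest vocabulary entry from left to right and marks word-internal pieces with `##`. The following is a toy sketch of that matching step only: `toy_vocab` is an invented vocabulary, `wordpiece` is a hypothetical helper, and the real `AutoTokenizer` additionally handles normalization, special tokens, and conversion to ids.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Invented vocabulary for illustration; real vocabularies hold tens of thousands of entries.
toy_vocab = {"un", "##aff", "##able", "token", "##ization"}

print(wordpiece("unaffable", toy_vocab))
print(wordpiece("tokenization", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```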



