# HuggingFace

- https://techcrunch.com/2021/03/11/hugging-face-raises-40-million-for-its-natural-language-processing-library/

# Install & Import

In [None]:
!pip install transformers
!pip install --upgrade keras
!pip install --upgrade tensorflow
!pip install datasets
!pip install bertviz

- A Python library that provides Korean datasets in a similar form
  - `pip install Korpora`



# Introduction to HuggingFace

https://huggingface.co/transformers/index.html

**Contents**

1. Models & Tasks
2. Loading Pre-Trained Models
3. Fine-Tuning Models
4. Interpreting Your Model

# 1. Models & Tasks
 

### Models
(https://huggingface.co/transformers/model_summary.html)

   - **Autoregressive models:** Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the previous ones. They correspond to the *decoder* of the original transformer model, and a mask is used on top of the full sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A typical example of such models is GPT.
   - **Autoencoding models:** Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the full inputs without any mask. Those models usually build a bidirectional representation of the whole sentence. They can be fine-tuned and achieve great results on many tasks such as text generation, but their most natural application is sentence classification or token classification. A typical example of such models is BERT.
   - **Sequence-to-sequence models:** Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their most natural applications are translation, summarization and question answering. The original transformer model is an example of such a model (only for translation), T5 is an example that can be fine-tuned on other tasks.
   - **Multimodal models:** Multimodal models mix text inputs with other kinds (e.g. images) and are more specific to a given task.
   - **Retrieval-based models:** Some models use document retrieval during (pre)training and inference, for example for open-domain question answering.
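
The difference between the autoregressive and autoencoding families comes down to which positions each token may attend to. The following is a minimal sketch in plain Python, not the library's implementation; the function names `causal_mask` and `full_mask` are made up for illustration.

```python
def causal_mask(n):
    """Row i marks which positions token i may attend to (1 = visible)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def full_mask(n):
    """Encoder-style mask: every token sees the whole sequence."""
    return [[1] * n for _ in range(n)]


tokens = ["The", "cat", "sat"]
for name, mask in [("autoregressive", causal_mask(len(tokens))),
                   ("autoencoding", full_mask(len(tokens)))]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>4}: {row}")
```

In the causal mask, "cat" can see "The" and itself but not "sat", which is why such models suit left-to-right generation.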




### Tasks
(https://huggingface.co/transformers/task_summary.html)

   - **Sequence Classification:** classifying sequences according to a given number of classes. (ex. GLUE)
   - **Extractive Question Answering:** extracting an answer from a text given a question (ex. SQUAD)
   - **Language Modeling:** task of fitting a model to a corpus, which can be domain specific
       - **Masked Language Modeling:** task of masking tokens in a sequence with a masking token, and prompting the model to fill that mask with an appropriate token. (ex. BERT pre-training)
       - **Causal Language Modeling:** predicting the token following a sequence of tokens (ex. GPT-2)
   - **Text Generation:** create a coherent portion of text that is a continuation from the given context
   - **Named Entity Recognition (Token Classification):** classifying tokens according to a class, for example, identifying a token as a person, an organisation or a location
   - **Summarization:** task of summarizing a document or an article into a shorter text
   - **Translation:** task of translating a text from one language to another
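
To make the two language modeling objectives concrete, here is a small sketch of how training pairs could be built for each. It assumes a `[MASK]` placeholder string and hand-picked mask positions; real pre-training chooses positions at random and uses the tokenizer's own mask token.

```python
MASK = "[MASK]"  # placeholder; the real mask token depends on the tokenizer


def masked_lm_example(tokens, masked_positions):
    """Replace chosen positions with MASK; labels keep the original tokens there."""
    inputs = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
    labels = [t if i in masked_positions else None for i, t in enumerate(tokens)]
    return inputs, labels


def causal_lm_example(tokens):
    """Each position is trained to predict the token that follows it."""
    return tokens[:-1], tokens[1:]


tokens = ["the", "movie", "was", "fun"]
print(masked_lm_example(tokens, {1}))
print(causal_lm_example(tokens))
```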
   


## Pipeline
(https://huggingface.co/transformers/main_classes/pipelines.html)

The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including **Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering.** See the task summary for examples of use. 

- ConversationalPipeline, FeatureExtractionPipeline, FillMaskPipeline, QuestionAnsweringPipeline, SummarizationPipeline, TextClassificationPipeline, TextGenerationPipeline, TokenClassificationPipeline, TranslationPipeline, ZeroShotClassificationPipeline, Text2TextGenerationPipeline, TableQuestionAnsweringPipeline
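
Conceptually, a pipeline chains three stages: turn text into model inputs, run the model, and turn raw scores into a readable result. The toy class below is a hypothetical stand-in that mimics those stages with a word lexicon instead of a tokenizer and a neural network; it is a sketch of the idea, not how `transformers` implements it.

```python
import math


class ToySentimentPipeline:
    """Hypothetical stand-in that mimics the stages a pipeline hides."""

    labels = ["NEGATIVE", "POSITIVE"]
    lexicon = {"love": 2.0, "great": 1.5, "hate": -2.0, "boring": -1.5}

    def preprocess(self, text):
        # Stand-in for a tokenizer: lowercase and split on whitespace.
        return text.lower().split()

    def forward(self, tokens):
        # Stand-in for a model: the summed lexicon weight becomes two logits.
        score = sum(self.lexicon.get(t, 0.0) for t in tokens)
        return [-score, score]

    def postprocess(self, logits):
        # Softmax over the logits, then report the most probable label.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        probs = [e / sum(exps) for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        return {"label": self.labels[best], "score": probs[best]}

    def __call__(self, text):
        return [self.postprocess(self.forward(self.preprocess(text)))]


toy = ToySentimentPipeline()
print(toy("I love you")[0])
print(toy("I hate you")[0])
```

The returned list of `{'label': ..., 'score': ...}` dictionaries mirrors the shape of what the sentiment-analysis pipeline prints in the cells that follow.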

In [None]:
from transformers import pipeline

### Sequence Classification

In [None]:
# Sequence Classification
classifier = pipeline('sentiment-analysis')

In [None]:
print(classifier("I love you")[0])
print(classifier("I hate you")[0])

### Question Answering

In [None]:
# Question Answering
qa = pipeline("question-answering")

In [None]:
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the `run_squad.py`.
"""

In [None]:
print(qa(question="What is extractive question answering?", context=context))
print(qa(question="What is a good example of a question answering dataset?", context=context))
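
Extractive QA models typically score every token as a possible answer start and as a possible answer end, and the answer is the span with the best combined score. The sketch below shows that selection step with made-up logits; `best_span` and `max_answer_len` are illustrative names, and the pipeline's actual decoding handles more details (for example mapping tokens back to character offsets).

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]
    subject to s <= e < s + max_answer_len."""
    best = None
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


tokens = ["SQuAD", "is", "a", "question", "answering", "dataset"]
start_logits = [0.1, 0.0, 0.2, 2.5, 0.3, 0.1]  # made-up scores
end_logits = [0.0, 0.1, 0.0, 0.2, 0.4, 3.0]    # made-up scores
s, e, _ = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))
```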

### Text Generation

In [None]:
text_generator = pipeline("text-generation")

In [None]:
print(text_generator("When the Titanic crashed, I", max_length=50, do_sample=False))

Try Language Modeling, Token Classification and more on your own...

# 2. Loading Pre-Trained Models

The library deliberately limits the number of user-facing abstractions to learn. There are almost none: using any model requires just three standard classes:
1. configuration,
2. model, and
3. tokenizer.

- [Models](https://huggingface.co/models)

In [None]:
from transformers import AutoConfig, AutoTokenizer, AutoModel, AutoModelForSequenceClassification

### Configuration

The base class PretrainedConfig implements the common methods for loading/saving a configuration either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace’s AWS S3 repository).

`classmethod: .from_pretrained(pretrained_model_name_or_path, **kwargs)`

In [None]:
# Set Keyword Args 
config_args = {'hidden_dropout_prob':0.2, 
              'num_labels':2}

# Or.. load from path
# config_path = './model/checkpoint.ckpt.index'

In [None]:
model_name = 'xlm-roberta-base'
config = AutoConfig.from_pretrained(model_name, **config_args)

In [None]:
config

### Models

- The base classes PreTrainedModel, TFPreTrainedModel, and FlaxPreTrainedModel implement the common methods for loading/saving a model either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace’s AWS S3 repository).

- `AutoModel` is a generic model class that will be instantiated as one of the base model classes of the library. For a specific task, load the matching `AutoModelFor<Task>` class, e.g. `AutoModelForSequenceClassification`.

In [None]:
model_config = AutoModel.from_config(config)
model_pretrained = AutoModel.from_pretrained(model_name)

In [None]:
# Notice Dropout(p=0.2)
model_config

In [None]:
# Notice Dropout(p=0.1)
model_pretrained

In [None]:
config = AutoConfig.from_pretrained(model_name, **config_args)
model_seq = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)

In [None]:
# Notice that a classification head replaces the pooling layer of the base model
model_seq

### Tokenizer

A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full Python implementation and a “Fast” implementation based on the Rust library tokenizers. The “Fast” implementations allow:

1. a significant speed-up in particular when doing batched tokenization and

2. additional methods to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token). Early releases had no “Fast” implementation for the SentencePiece-based tokenizers (for T5, ALBERT, CamemBERT, XLMRoBERTa and XLNet models); recent releases of the library provide Fast versions for these as well.
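
The mapping between characters and tokens rests on storing, for each token, the character span it came from. The following is a toy sketch of that idea using a whitespace tokenizer; the helper names are illustrative and this is not the Rust `tokenizers` implementation.

```python
import re


def tokenize_with_offsets(text):
    """Toy whitespace tokenizer returning (token, (start, end)) pairs."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]


def char_to_token(offsets, char_index):
    """Index of the token covering char_index, or None if it falls on whitespace."""
    for i, (_, (start, end)) in enumerate(offsets):
        if start <= char_index < end:
            return i
    return None


text = "Hugging Face tokenizers"
offsets = tokenize_with_offsets(text)
print(offsets)
print(char_to_token(offsets, 9))  # character 9 is the 'a' in "Face"
```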

`classmethod: from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)`

In [None]:
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name, config=config)

In [None]:
tokenizer

In [None]:
# '그 영화는 재밌었다' means "That movie was fun"; '나는 별로였다' means "I did not like it much"
print(tokenizer.tokenize('그 영화는 재밌었다'))
print(tokenizer('그 영화는 재밌었다'))
print(tokenizer(['그 영화는 재밌었다','나는 별로였다']))

### Using Fine-Tuned models in Pipeline

https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment

#### Sentiment Classification (EN)

In [None]:
# Use task specific fine-tuned model in Pipeline
# fine-tuned model that predicts the sentiment of the review as a number of stars (between 1 and 5).
model_name = "nlptown/bert-base-multilingual-uncased-sentiment" 

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

pipe = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

In [None]:
print(pipe("Je t'adore")) # I love you
print(pipe("Je te deteste")) # I hate you

# 3. Fine-Tuning Models

https://huggingface.co/transformers/examples.html

## Fine Tuning Example: NSMC

Naver Sentiment Movie Corpus (NSMC)
https://huggingface.co/datasets/nsmc

In [None]:
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

model_path_or_name = 'xlm-roberta-base'
config = AutoConfig.from_pretrained(model_path_or_name, num_labels=2)
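# Pass the config so that num_labels=2 is actually applied to the classification head
model = AutoModelForSequenceClassification.from_pretrained(model_path_or_name, config=config)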
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name, config=config)

### Load Custom Dataset

The `datasets.Dataset` object behaves like a normal Python container. You can query its length, get rows and columns, and also access a lot of metadata on the dataset (description, citation, split sizes, etc.).

https://huggingface.co/docs/datasets/exploring.html

### 1. Using Datasets 

https://huggingface.co/datasets/nsmc

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset('nsmc')

In [None]:
dataset['train'][0]

In [None]:
dataset['test'][0]

In [None]:
import torch
class nsmc_data(torch.utils.data.Dataset):
    def __init__(self, dataset, tokenizer):
        super().__init__()
        self.label = torch.tensor(dataset['label']).long()
        features = tokenizer([str(x) for x in dataset['document']], padding=True, truncation=True)
        self.input = torch.tensor(features['input_ids'])
        self.mask = torch.tensor(features['attention_mask'])
    
    def __len__(self):
        return len(self.input)

    def __getitem__(self, index):
        return {'input_ids': self.input[index], 'attention_mask': self.mask[index], 'label':self.label[index]}
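
Calling the tokenizer with `padding=True, truncation=True` makes every row the same length so the ids can be stacked into one tensor, with the attention mask marking which positions are real tokens. Here is a plain-Python sketch of that bookkeeping; `pad_and_truncate`, the ids, and `pad_id=1` are made up for illustration, and real truncation also takes special tokens into account.

```python
def pad_and_truncate(batch, pad_id=0, max_length=None):
    """Pad every sequence to the longest one (capped at max_length) and
    build the matching attention mask (1 = real token, 0 = padding)."""
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


features = pad_and_truncate([[5, 6, 7, 8], [5, 9]], pad_id=1, max_length=3)
print(features["input_ids"])       # rows share one length, so they can be stacked
print(features["attention_mask"])
```

Because the padding here spans the whole split rather than each batch, a single very long review widens every row; padding per batch is a common alternative when memory is tight.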

In [None]:
train_data = nsmc_data(dataset['train'], tokenizer)

In [None]:
test_data = nsmc_data(dataset['test'], tokenizer)

### Load Trainer 

The **Trainer** and **TFTrainer** classes provide an API for feature-complete training in most standard use cases. They are used in most of the example scripts.

Before instantiating your Trainer/TFTrainer, create a TrainingArguments/TFTrainingArguments to access all the points of customization during training.

In [None]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
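
For reference, the four numbers returned by `compute_metrics` can be derived by hand from counts of true positives, false positives, and false negatives. This is a dependency-free sketch for the binary case; `binary_metrics` is an illustrative name, and returning 0.0 for an undefined ratio is a choice made here for simplicity.

```python
def binary_metrics(labels, preds, positive=1):
    """Accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1,
            "precision": precision, "recall": recall}


print(binary_metrics(labels=[1, 1, 0, 0, 1], preds=[1, 0, 0, 1, 1]))
```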

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1, 
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)
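
`warmup_steps=500` matters because of the learning-rate schedule: the rate ramps up linearly for the first 500 optimizer steps and then decays. The sketch below assumes a linear decay to zero after warmup, a base rate of 5e-5, and 4688 total steps; these are illustrative assumptions rather than values read from this run, so check the defaults of your installed version.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=4688):
    """Learning rate at a given optimizer step: linear warmup, then linear decay.

    base_lr and total_steps are illustrative assumptions, not values read
    from the notebook's run.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 2594, 4688):
    print(step, f"{linear_schedule_lr(step):.2e}")
```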

In [None]:
training_args

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=test_data
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
# Save the fine-tuned model; section 4 reloads it from './nsmc_model'
model.save_pretrained('./nsmc_model')
tokenizer.save_pretrained('./nsmc_tokenizer')

# 4. Interpreting Your Model

### Looking at Attention

In [None]:
from bertviz import head_view

In [None]:
model_path_or_name = './nsmc_model'
model = AutoModelForSequenceClassification.from_pretrained(model_path_or_name, output_attentions=True)

In [None]:
# "This movie is very entertaining."
sentence = "이 영화는 아주 재미있다."
inputs = tokenizer.encode_plus(sentence, return_tensors='pt', add_special_tokens=True)

In [None]:
input_ids = inputs['input_ids'].to(model.device)
attention = model(input_ids)[-1]

In [None]:
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)

In [None]:
input_id_list

In [None]:
tokens

In [None]:
head_view(attention, tokens)
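
The `attention` object passed to `head_view` holds one tensor per layer, each shaped (batch, heads, sequence, sequence), where every row is a probability distribution over the tokens being attended to. As a rough illustration of where such a row comes from, here is scaled dot-product attention for a single head in plain Python, using made-up two-dimensional vectors.

```python
import math


def attention_weights(queries, keys):
    """Scaled dot-product attention weights for one head.

    Returns a seq_len x seq_len matrix whose row i is a probability
    distribution over the tokens that token i attends to.
    """
    scale = math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Made-up 2-dimensional vectors for three tokens.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention_weights(vectors, vectors):
    print([round(w, 3) for w in row])
```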

### Error Analysis

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sn
import numpy as np
import pandas as pd  # needed for the DataFrame cells below

In [None]:
#Get Predictions as Array
preds = trainer.predict(test_data)

In [None]:
# Argmax over the logits gives the predicted class
predictions = np.argmax(preds[0], axis=1)
labels = preds[1]
print(predictions)
print(labels)

In [None]:
#Confusion Matrix

cm = confusion_matrix(labels,predictions)
# confusion_matrix orders classes by sorted label value: 0 (negative), then 1 (positive)
df_cm = pd.DataFrame(cm, index = ["Negative", "Positive"],
                  columns = ["Negative", "Positive"])
sn.heatmap(df_cm, annot=True, fmt=".0f").set(title="Confusion Matrix", xlabel="Predicted", ylabel="Observed",)
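
When reading the heatmap, it helps to remember how the matrix is laid out: rows follow the observed label, columns follow the predicted label, and classes appear in sorted order (0, then 1). The following dependency-free sketch reproduces that layout on a tiny made-up example; `confusion_counts` is an illustrative helper, not a scikit-learn function.

```python
def confusion_counts(labels, preds, classes=(0, 1)):
    """Rows follow the observed class, columns the predicted class,
    both in the order given by `classes`."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for y, p in zip(labels, preds):
        matrix[index[y]][index[p]] += 1
    return matrix


toy_labels = [1, 1, 0, 0, 1]
toy_preds = [1, 0, 0, 1, 1]
toy_cm = confusion_counts(toy_labels, toy_preds)
print(toy_cm)

# Positions of false negatives: observed 1, predicted 0.
false_negatives = [i for i, (y, p) in enumerate(zip(toy_labels, toy_preds))
                   if y == 1 and p == 0]
print(false_negatives)
```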

In [None]:
(labels == 1)&(predictions == 0)

In [None]:
# Reuse the test split loaded earlier so rows line up with the predictions
test_dataset = dataset['test'].to_pandas()

In [None]:
# False Negative
test_dataset['document'][(labels == 1)&(predictions == 0)].sample(10)

In [None]:
# False Positive
test_dataset['document'][(labels == 0)&(predictions == 1)].sample(10)