In [6]:
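try:
    from transformers import pipeline
except ImportError:
    # Install the missing packages, then retry the import
    !pip install transformers datasets
    from transformers import pipeline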

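Sentiment analysis is the process of classifying the sentiment of a text as positive, negative, or neutral. It has a wide range of applications across industries, from monitoring customer sentiment in product reviews to politics, for example gauging public interest in a particular topic during an election year. This article focuses on using Hugging Face for a variety of tasks, so we will not go deep into each topic; if you want to learn more about sentiment analysis, you can refer to this article: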
https://towardsdatascience.com/sentiment-analysis-intro-and-implementation-ddf648f79327

1. Import libraries
2. Specify the name of the pre-trained model to be used for this specific task
3. Specify the task (i.e. sentiment analysis)
4. Specify the sentence to be analyzed
5. Create an instance of pipeline as analyzer
6. Perform the sentiment analysis and save the results as output
7. Return the results

In [7]:
# Specify pre-trained model to use
model = 'distilbert-base-uncased-finetuned-sst-2-english'

# Specify task
task = 'sentiment-analysis'

# Text to be analyzed
input_text = 'Performing NLP tasks using HuggingFace pipeline is super easy!'

# Instantiate pipeline
analyzer = pipeline(task, model=model)

# Store the output of the analysis
output = analyzer(input_text)

# Return output
output


[{'label': 'POSITIVE', 'score': 0.8548853993415833}]

reference: https://medium.com/nlplanet/two-minutes-nlp-beginner-intro-to-hugging-face-main-classes-and-functions-fb6a1d5579c4

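### sentiment-analysis
Here we test a pipeline for the sentiment analysis task. To predict the sentiment of a sentence, simply pass the sentence to the pipeline. The output is a list of dictionaries, each holding a label (for this example, POSITIVE or NEGATIVE) and a score (the score of the predicted label).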

In [16]:
# sentiment-analysis
pipe = pipeline('sentiment-analysis')
text = input('Enter some words here :')


# Single input
out = pipe(text)
print(out)
print(f"{out[0]['label']} with score {out[0]['score']}")

# Multiple sentences in one call
results = pipe(["I'm so happy today!", "I hate U..."])
for out in results:
    print(f"{out['label']} with score {out['score']}")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


enter some words hereHello
[{'label': 'POSITIVE', 'score': 0.9995185136795044}]
POSITIVE with score 0.9995185136795044
POSITIVE with score 0.9998742341995239
NEGATIVE with score 0.9995265007019043
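
Because the pipeline returns a plain list of dictionaries, the results can be post-processed with ordinary Python. The following is a minimal sketch that formats each result and flags predictions below a confidence threshold. The helper name summarize and the min_score parameter are assumptions made for illustration and are not part of transformers; the sample values are copied from the output above.

```python
# Sample pipeline output, copied from the results printed above.
sample_output = [
    {"label": "POSITIVE", "score": 0.9998742341995239},
    {"label": "NEGATIVE", "score": 0.9995265007019043},
]


def summarize(results, min_score=0.9):
    """Turn pipeline results into short strings, flagging low-confidence ones.

    `summarize` and `min_score` are hypothetical, not part of transformers.
    """
    lines = []
    for res in results:
        flag = "" if res["score"] >= min_score else " (low confidence)"
        lines.append(f"{res['label']} {res['score']:.4f}{flag}")
    return lines


for line in summarize(sample_output):
    print(line)
```

In a notebook, the same function could be applied directly to the list returned by the pipeline call.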


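### Dataset
With the datasets library we can easily download some of the most common benchmarks used in NLP. Here we use the Stanford Sentiment Treebank (SST2), which consists of sentences from movie reviews and human annotations of their sentiment. It uses a two-way (positive/negative) class split with sentence-level labels only. SST2 is available in the datasets library as a subset of the GLUE dataset, and we load it with the load_dataset function.

The dataset is already split into training, validation, and test sets. We can call load_dataset with the split parameter to directly get the split we are interested in.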

In [20]:
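import datasets
import pandas as pd

# Load every split (train, validation, test) as a DatasetDict
dataset = datasets.load_dataset("glue", "sst2")
print(dataset)

# Load only the training split
dataset = datasets.load_dataset("glue", "sst2", split='train')
print(dataset)

# Alternatively, explore the data with pandas
df = pd.DataFrame(dataset)
df.head()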

Found cached dataset glue (C:/Users/cti110016/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})


Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 67349
})


|   | sentence | label | idx |
|---|---|---|---|
| 0 | hide new secretions from the parental units | 0 | 0 |
| 1 | contains no wit , only labored gags | 0 | 1 |
| 2 | that loves its characters and communicates som... | 1 | 2 |
| 3 | remains utterly satisfied to remain the same t... | 0 | 3 |
| 4 | on the worst revenge-of-the-nerds clichés the ... | 0 | 4 |


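### Pipeline on GPU

Now that we have loaded a sentiment analysis dataset, let's try a sentiment analysis model on it. To extract the list of sentences in the dataset, we can access its data attribute. Let's predict the sentiment of 500 sentences and measure how long it takes.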

In [21]:
classifier = pipeline("sentiment-analysis")
%time results = classifier(dataset.data["sentence"].to_pylist()[:500])
# CPU times: user 21.9 s, sys: 56.9 ms, total: 22 s
# Wall time: 21.8 s


CPU times: total: 1min 30s
Wall time: 43.6 s


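According to the timing recorded in the code comments, predicting the sentiment of 500 sentences takes 21.8 seconds, an average of about 23 sentences per second (the run shown above was slower, with a wall time of 43.6 s). Not bad, but we can do better with a GPU. To make the classifier use a GPU, we create it with pipeline and pass device=0. This asks the pipeline to run the model on the CUDA device with that ID: each ID starting from zero maps to a CUDA device, and the value -1 maps to the CPU.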

In [22]:
classifier = pipeline("sentiment-analysis", device=0)
%time results = classifier(dataset.data["sentence"].to_pylist()[:500])
# CPU times: user 4.07 s, sys: 49.6 ms, total: 4.12 s
# Wall time: 4.11 s


CPU times: total: 1min 26s
Wall time: 45.1 s
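
In the run recorded above, the wall time did not improve (45.1 s with device=0 versus 43.6 s without), which suggests the GPU was not actually used. Passing device=0 only helps when a CUDA device is visible to the framework. The following is a minimal sketch, assuming PyTorch is the backend, that picks the device index at run time. The helper pick_device is hypothetical and not part of transformers.

```python
def pick_device():
    """Return 0 (first CUDA device) if PyTorch can see a GPU, else -1 (CPU).

    Hypothetical helper for illustration; it falls back to the CPU when
    PyTorch is not installed at all.
    """
    try:
        import torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("Using", "CPU" if device == -1 else f"CUDA device {device}")
# The result can then be passed on: pipeline("sentiment-analysis", device=device)
```

Checking availability first keeps the same notebook usable on machines with and without a GPU.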


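### Metrics
What if we want to test the quality of a sentiment classifier on the SST2 dataset? Which metric should we use?

In Hugging Face, metrics and datasets are paired together in the datasets library. To retrieve the right metric, we call the load_metric function with the same arguments used for load_dataset. (Recent releases of datasets deprecate load_metric in favor of the separate evaluate library.)

We then call the compute function of the metric object, passing the predictions made by the model and the references taken directly from the dataset. For the SST2 dataset specifically, the metric is accuracy.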

In [23]:
metric = datasets.load_metric("glue", "sst2")

n_samples = 500

X = dataset.data["sentence"].to_pylist()[:n_samples]
y = dataset.data["label"].to_pylist()[:n_samples]

results = classifier(X)
predictions = [0 if res["label"] == "NEGATIVE" else 1 for res in results]

print(metric.compute(predictions=predictions, references=y))
# {'accuracy': 0.988}


{'accuracy': 0.988}
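
Accuracy is simply the fraction of predictions that match the reference labels, so an accuracy of 0.988 on 500 samples corresponds to 494 correct predictions. The following sketch shows the computation in plain Python. The accuracy function and the toy data are made up for illustration; this is not the implementation used by the GLUE metric.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy data (made up for illustration): pipeline-style results and gold labels.
toy_results = [{"label": "NEGATIVE"}, {"label": "POSITIVE"},
               {"label": "POSITIVE"}, {"label": "NEGATIVE"}]
toy_references = [0, 1, 0, 0]

# Same label-to-integer mapping as in the cell above: NEGATIVE -> 0, POSITIVE -> 1.
toy_predictions = [0 if res["label"] == "NEGATIVE" else 1 for res in toy_results]

print({"accuracy": accuracy(toy_predictions, toy_references)})
```

Three of the four toy predictions match their references, so the sketch reports an accuracy of 0.75.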


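### AutoClasses (requires torch)

Under the hood, pipelines are powered by the AutoModel and AutoTokenizer classes. An AutoClass (i.e. a generic class such as AutoModel or AutoTokenizer) is a shortcut that automatically retrieves the architecture of a pretrained model (or tokenizer) from its name or path. You only need to select the appropriate AutoModel for your task and its associated tokenizer with AutoTokenizer: in our example, since we are classifying text, the right AutoModel is AutoModelForSequenceClassification.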

In [1]:
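from transformers import AutoTokenizer, AutoModelForSequenceClassification

try:
    import torch  # needed by the PyTorch model classes and return_tensors="pt"
except ImportError:
    # Pick the build for your platform at https://pytorch.org/get-started/locally/
    !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117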

In [2]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


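We create a tokenizer object with AutoTokenizer and a model object with AutoModelForSequenceClassification. In both cases, all we need to do is pass the name of the model, and the library handles everything else.

Next, let's see how to tokenize sentences with the tokenizer. The tokenizer output is a dictionary made of input_ids (the ID of each token detected in the input sentence, taken from the tokenizer vocabulary), token_type_ids (used by models that need two texts for a prediction; we can ignore them for now), and attention_mask (which shows where padding occurred during tokenization).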

In [25]:
encoding = tokenizer(["Hello!", "How are you?"], padding=True,
                     truncation=True, max_length=512, return_tensors="pt")
print(encoding)
"""
{'input_ids': tensor([[  101, 29155,   106,   102,     0,     0],
        [  101, 12548, 10320, 10855,   136,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1]])}
"""

ImportError: Unable to convert output to PyTorch tensors format, PyTorch is not installed.
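
To make the relationship between input_ids and attention_mask concrete, the following sketch reproduces the padding step in plain Python. The helper pad_batch is hypothetical and only mimics what padding=True produces; the padding ID of 0 is an assumption read off the expected output above, and the real tokenizer handles all of this internally.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every ID sequence to the longest one and build the attention mask.

    Hypothetical helper that mimics what padding=True produces; the real
    tokenizer handles this internally.
    """
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs taken from the expected output in the cell above, without padding.
hello = [101, 29155, 106, 102]
how_are_you = [101, 12548, 10320, 10855, 136, 102]

padded = pad_batch([hello, how_are_you])
print(padded["input_ids"])
print(padded["attention_mask"])
```

The shorter sentence gains two padding positions, and the mask marks exactly those positions with 0 so the model can ignore them.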

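The tokenized sentences are then passed to the model, which outputs its predictions. This particular model outputs five scores (logits), one for each possible review rating from 1 to 5 stars; applying a softmax turns them into probabilities.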

In [3]:
outputs = model(**encoding)
print(outputs)
"""
SequenceClassifierOutput(loss=None, logits=tensor([[-0.2410, -0.9115, -0.3269, -0.0462,  1.2899],
        [-0.3575, -0.6521, -0.4409,  0.0471,  0.9552]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
"""

NameError: name 'encoding' is not defined
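
The logits in the expected output are raw scores, not probabilities. The following sketch, written in plain Python so it runs without PyTorch, applies a softmax and reads off the most likely star rating. The logit values are copied from the expected output above; the softmax helper is written here for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the expected output in the cell above.
batch_logits = [
    [-0.2410, -0.9115, -0.3269, -0.0462, 1.2899],
    [-0.3575, -0.6521, -0.4409, 0.0471, 0.9552],
]

for logits in batch_logits:
    probs = softmax(logits)
    stars = probs.index(max(probs)) + 1  # index 0 corresponds to 1 star
    print(f"{stars} stars, p = {max(probs):.2f}")
```

For both sentences the largest logit is the last one, so the predicted rating is 5 stars. With PyTorch tensors, the same step would typically be written as torch.softmax(outputs.logits, dim=-1).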

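### Save and load models locally
Finally, let's see how to save a model locally. This is done with the save_pretrained function of the tokenizer and the model. To load a previously saved model, use the from_pretrained function of the appropriate AutoModel class.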

In [4]:
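# Save the tokenizer and the model to a local directory
pt_save_directory = "./model"
tokenizer.save_pretrained(pt_save_directory)
model.save_pretrained(pt_save_directory)

# Reload both from that directory
tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
model = AutoModelForSequenceClassification.from_pretrained(pt_save_directory)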