# Sentiment Classification Using BERT

情緒分類使用BERT神經網路

    負面:0 正面:1

    負面:0 正面:1 中立:2

    負面:0 正面:1 中立:2 無情緒:3



## What is Sentiment Analysis?

    Sentiment Analysis is a process of extracting opinions that have different polarities.
    
    By polarities, we mean positive, negative or neutral.

    It is also known as opinion mining and polarity detection.

    With the help of sentiment analysis, you can find out the nature of opinion that is reflected in documents, websites, social media feed, etc.

    Sentiment Analysis is a type of classification where the data is classified into different classes.

    These classes can be binary in nature (positive or negative) or, they can have multiple classes (happy, sad, angry, etc.).


    Google's and Microsoft's prediction API are two of the best sentiment analysis engines in the market and they keep improving.


    Does it really work?
    Sentiment Analysis works. It's not 100% reliable and it errs but even if you get sentiments with a confidence level of 70 or 80% it’s a lot more than nothing.


## Google NLP Services

Take a look at the Google NLP services.

# Load model and tokenizer

In [100]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

In [101]:
!pip install opencc






[notice] A new release of pip is available: 24.2 -> 25.0.1





[notice] To update, run: python.exe -m pip install --upgrade pip


In [102]:
from opencc import OpenCC
s2t = OpenCC('s2t')  # convert from Simplified Chinese to Traditional Chinese
t2s = OpenCC('t2s')  # convert from  Traditional Chinese to Simplified Chinese

In [103]:
# You can download the best trained model from huggingface
# https://huggingface.co/clhuang

# (1) Load model from huggingface
model = AutoModelForSequenceClassification.from_pretrained("uer/roberta-base-finetuned-dianping-chinese")

In [104]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [105]:
# tokenizer
tokenizer = AutoTokenizer.from_pretrained("uer/roberta-base-finetuned-dianping-chinese") #from huggingface



In [106]:
len(tokenizer)

21128

In [107]:
# tokenize can encode text to input_ids and decode input_ids to text
tokenizer.get_vocab()

{'趸': 6642,
 'ﾌ': 8094,
 'b2': 9668,
 'water': 11060,
 '致': 5636,
 '够': 1916,
 '対': 2194,
 '葭': 5874,
 'view': 8876,
 '##韬': 20564,
 '##out': 9408,
 '橹': 3587,
 '##с': 13416,
 '##垮': 14866,
 '姚': 2001,
 '##灌': 17175,
 '##祷': 17933,
 '睑': 4713,
 '##ml': 8477,
 '407': 12458,
 '≈': 391,
 '##馳': 20739,
 '##躇': 19766,
 '##浔': 16907,
 '损': 2938,
 '##劑': 14269,
 '##毓': 16739,
 '##do': 8828,
 '##とは': 11896,
 '##礪': 17904,
 '##虐': 19047,
 '滩': 4013,
 '##text': 11816,
 '仍': 793,
 '##勘': 14299,
 '痣': 4582,
 '辙': 6788,
 '笈': 5007,
 '婊': 2040,
 '综': 5341,
 '##挫': 15976,
 'ᄒ': 303,
 '带': 2372,
 '噱': 1694,
 '蚯': 6021,
 'beyond': 12352,
 '##ugh': 12667,
 '495': 12895,
 '##棣': 16536,
 '喎': 1593,
 '鰲': 7815,
 'uv': 9473,
 'ikea': 10513,
 '##憑': 15788,
 '##€': 13517,
 '卵': 1317,
 '萊': 5844,
 '摯': 3041,
 '##塚': 14909,
 '##⑨': 13564,
 '璽': 4473,
 '冏': 1087,
 '舎': 5651,
 '##fr': 13245,
 '##埋': 14870,
 '埤': 1820,
 '##筷': 18096,
 '兽': 1077,
 '靚': 7475,
 'ｂ': 8052,
 '墮': 1876,
 '1939': 9459,
 '##崽': 15369,
 '#

In [108]:
text="我喜歡"
# prepare our text into tokenized sequence
inputs = tokenizer(text)
inputs

{'input_ids': [101, 2769, 1599, 3631, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

In [109]:
tokenizer.decode(inputs['input_ids'])

'[CLS] 我 喜 歡 [SEP]'

# Predict or generate result using pipeline

    可能的輸出結果如下:

    [{'label': 'LABEL_1', 'score': 0.9885562062263489}]
    [{'label': 'LABEL_0', 'score': 0.9052111506462097}]

    因此需要用到if去判端label的值，才能決定score是正面還是負面。
    若為:LABEL_1 就是正面的score
    若為:LABEL_0 就是負面的score

In [110]:
sentiment_classify = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

In [111]:
new_text = '速度很快，昨天下單，今天上午就到啦，看著挺不錯。'
sentiment_classify(new_text)

[{'label': 'positive (stars 4 and 5)', 'score': 0.7087053060531616}]

In [112]:
new_text = '速度很快，昨天下單，今天上午就到啦，看著挺不錯。'
new_text = t2s.convert(new_text)
sentiment_classify(new_text)

[{'label': 'positive (stars 4 and 5)', 'score': 0.6945704817771912}]

In [113]:
outputs = sentiment_classify(new_text)
outputs[0]['score']

0.6945704817771912

In [114]:
type(outputs[0]['score'])

float

In [115]:
round(outputs[0]['score'],2)

0.69

# Define prediction function using pipeline

In [116]:
def get_sentiment_proba(text):
    max_length = 300 # 最多字數 若超出模型訓練時的字數，以模型最大字數為依據
    #max_length = 512 # 最多字數 若超出模型訓練時的字數，以模型最大字數為依據
    outputs = sentiment_classify(text, padding=True, max_length=max_length, truncation=True)
    if outputs[0]['label']=='LABEL_1':
        # Get the positive score
        prob_positive = round(outputs[0]['score'],2)
        prob_negatitive = round(1 - prob_positive, 2)
    else:
        # Calculate the negative score
        prob_negatitive = round(outputs[0]['score'],2)
        prob_positive = round(1 - prob_negatitive, 2)

    response = {'Negative':prob_negatitive, 'Positive': prob_positive}
    return response

# Define prediction function using model or model.generate()

In [117]:
## Pediction
target_names=['Negative','Positive']
max_length = 200 # 最多字數 若超出模型訓練時的字數，以模型最大字數為依據
def get_sentiment_proba_from_model(text):
    new_text = t2s.convert(text)
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)

    response = {'Negative': round(float(probs[0, 0]), 2), 'Positive': round(float(probs[0, 1]), 2)}
    # executing argmax function to get the candidate label
    #return probs.argmax()
    return response

In [118]:
new_text = '速度很快，昨天下單，今天上午就到啦，看著挺不錯。'
get_sentiment_proba( new_text )

{'Negative': 0.71, 'Positive': 0.29}

In [119]:
new_text = '已經買了這種蘋果好多次了，寶寶喜歡上了這款蘋果，一直選擇這款'
get_sentiment_proba( new_text )

{'Negative': 0.94, 'Positive': 0.06}

In [120]:
new_text = '不喜歡這款產品'
get_sentiment_proba( new_text )

{'Negative': 1.0, 'Positive': 0.0}

In [121]:
new_text = '不喜歡這款產品'
get_sentiment_proba_from_model( new_text )

{'Negative': 1.0, 'Positive': 0.0}

# Step by step demonstration on model prediction and model.generate()

In [122]:
text="我喜歡"
# prepare our text into tokenized sequence
inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt")


In [123]:
inputs

{'input_ids': tensor([[ 101, 2769, 1599, 3631,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

In [124]:
inputs.input_ids

tensor([[ 101, 2769, 1599, 3631,  102]])

In [125]:
tokenizer.decode(inputs.input_ids[0])

'[CLS] 我 喜 歡 [SEP]'

In [126]:

# perform inference to our model
outputs = model(**inputs)


In [127]:
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-1.0823,  1.0425]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [128]:
outputs[0]

tensor([[-1.0823,  1.0425]], grad_fn=<AddmmBackward0>)

In [129]:
outputs[0].softmax(1)

tensor([[0.1067, 0.8933]], grad_fn=<SoftmaxBackward0>)

In [130]:
probs = outputs[0].softmax(1)
probs

tensor([[0.1067, 0.8933]], grad_fn=<SoftmaxBackward0>)

In [131]:
probs[0, 0]

tensor(0.1067, grad_fn=<SelectBackward0>)

In [132]:
round(float(probs[0, 0]), 2)

0.11

In [133]:
{'Negative': round(float(probs[0, 0]), 2), 'Positive': round(float(probs[0, 1]), 2)}

{'Negative': 0.11, 'Positive': 0.89}

# Calculate news article sentiment score

In [134]:
import pandas as pd

In [135]:
df = pd.read_csv('news.csv', sep='|')

In [136]:
df.head(1)

Unnamed: 0,item_id,date,category,title,content,sentiment,summary,top_key_freq,tokens,tokens_v2,entities,token_pos,link,photo_link,sentiment2,sentiment3
0,_20250327_1,2025-03-27,焦點,台股重挫308點 失守22000關卡,美國總統川普準備徵收汽車關稅，美股主要指數全數下跌，台積電重挫4.09%，台積電台北現股今日...,0.36,暫無,"[('台積電', 4), ('指數', 3), ('月線', 3), ('美國', 2), ...","['美國', '總統', '川普', '準備', '徵收', '汽車', '關稅', '，'...","['美國', '總統', '川普', '汽車', '關稅', '美股', '指數', '台積...","[NerToken(word='美國', ner='GPE', idx=(0, 2)), N...","[('美國', 'Nc'), ('總統', 'Na'), ('川普', 'Nb'), ('準...",https://tw.news.yahoo.com/https://tw.stock.yah...,https://s.yimg.com/ny/api/res/1.2/3qEveVGKp070...,0.65,0.03


In [137]:
%%time
sentiment_scores = []
for text in df.content:
    prob = get_sentiment_proba( text )['Positive']
    sentiment_scores.append(prob)

CPU times: total: 4min 58s
Wall time: 1min 54s


In [138]:
sentiment_scores

[0.03,
 0.05,
 0.17,
 0.43,
 0.39,
 0.07,
 0.18,
 0.31,
 0.03,
 0.38,
 0.11,
 0.01,
 0.0,
 0.01,
 0.24,
 0.46,
 0.19,
 0.32,
 0.08,
 0.13,
 0.02,
 0.01,
 0.27,
 0.06,
 0.21,
 0.08,
 0.5,
 0.01,
 0.01,
 0.07,
 0.05,
 0.15,
 0.03,
 0.11,
 0.03,
 0.11,
 0.29,
 0.0,
 0.34,
 0.0,
 0.43,
 0.24,
 0.0,
 0.09,
 0.13,
 0.01,
 0.28,
 0.02,
 0.27,
 0.05,
 0.05,
 0.0,
 0.13,
 0.27,
 0.2,
 0.05,
 0.27,
 0.13,
 0.02,
 0.02,
 0.32,
 0.27,
 0.02,
 0.42,
 0.08,
 0.28,
 0.06,
 0.02,
 0.02,
 0.02,
 0.2,
 0.39,
 0.01,
 0.4,
 0.03,
 0.04,
 0.27,
 0.06,
 0.07,
 0.06,
 0.21,
 0.08,
 0.08,
 0.3,
 0.4,
 0.02,
 0.28,
 0.02,
 0.03,
 0.2,
 0.18,
 0.01,
 0.02,
 0.02,
 0.07,
 0.26,
 0.2,
 0.23,
 0.25,
 0.26,
 0.32,
 0.24,
 0.09,
 0.07,
 0.03,
 0.0,
 0.0,
 0.35,
 0.0,
 0.03,
 0.02,
 0.02,
 0.02,
 0.0,
 0.0,
 0.0,
 0.37,
 0.02,
 0.01,
 0.0,
 0.1,
 0.01,
 0.09,
 0.27,
 0.35,
 0.01,
 0.23,
 0.39,
 0.07,
 0.05,
 0.43,
 0.14,
 0.14,
 0.09,
 0.23,
 0.02,
 0.22,
 0.23,
 0.5,
 0.07,
 0.11,
 0.39,
 0.02,
 0.02,
 0.43,
 0.4,
 

In [139]:
df['sentiment3']=sentiment_scores

In [140]:
df[['content','sentiment','sentiment2','sentiment3']].head()

Unnamed: 0,content,sentiment,sentiment2,sentiment3
0,美國總統川普準備徵收汽車關稅，美股主要指數全數下跌，台積電重挫4.09%，台積電台北現股今日...,0.36,0.65,0.03
1,去年7月間，國民黨立委賴士葆駕車行駛在台北市大安區時，不慎撞傷走在斑馬線上2位女性行人，沒想...,0.18,0.34,0.05
2,台灣祭將於4月3日至5日登場，有民眾在社群平台反映，活動期間發現墾丁地區一間旅館哄抬價格。屏...,0.92,0.92,0.17
3,海軍今天表示，中和軍艦凌晨在台中港外海與中國籍漁船擦碰，雙方人員均無損傷，艦艇受損部分不影響...,0.98,0.99,0.43
4,北一女中教師區桂芝日前接受大陸官媒採訪，批評總統賴清德將中國定調為「境外敵對勢力」，遭民眾檢...,0.97,0.98,0.39


In [141]:
df.to_csv('news.csv', sep='|',index=None)

In [143]:
df = pd.read_csv('news.csv', sep='|')

In [146]:
# 篩選指定欄位建立新的 DataFrame
selected_columns = [
    'item_id', 'title', 'category', 'content', 'link', 'date', 'photo_link',
    'tokens_v2', 'top_key_freq', 'summary', 'sentiment'
]

new_df = df[selected_columns]

# 顯示新資料框的前幾筆
new_df.head(1)


Unnamed: 0,item_id,title,category,content,link,date,photo_link,tokens_v2,top_key_freq,summary,sentiment
0,_20250327_1,台股重挫308點 失守22000關卡,焦點,美國總統川普準備徵收汽車關稅，美股主要指數全數下跌，台積電重挫4.09%，台積電台北現股今日...,https://tw.news.yahoo.com/https://tw.stock.yah...,2025-03-27,https://s.yimg.com/ny/api/res/1.2/3qEveVGKp070...,"['美國', '總統', '川普', '汽車', '關稅', '美股', '指數', '台積...","[('台積電', 4), ('指數', 3), ('月線', 3), ('美國', 2), ...",暫無,0.36


In [None]:
new_df.to_csv('news_for_django.csv', sep='|')

: 

# Put them all together for Django

In [142]:
from django.shortcuts import render
from django.views.decorators.csrf import csrf_exempt
from django.http import JsonResponse
from transformers import BertTokenizer, BertForSequenceClassification
from opencc import OpenCC
s2t = OpenCC('s2t')  # convert from Simplified Chinese to Traditional Chinese
t2s = OpenCC('t2s')  # convert from  Traditional Chinese to Simplified Chinese

# We don't use GPU
#import os
#os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

# Load model and tokenizer
model_path = "app_sentiment_bert/best-model"
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=2)
# reload our model/tokenizer. Optional, only usable when in Python files instead of notebooks
tokenizer = BertTokenizer.from_pretrained(model_path)

# home
def home(request):
    return render(request, "app_sentiment_bert/home.html")

# api get sentiment score
@csrf_exempt
def api_get_sentiment(request):

    new_text = request.POST.get('input_text')
    #new_text = request.POST['input_text']
    print(new_text)

    # See the content_type and body從前端送過來的資料格式
    print(request.content_type)
    print(request.body) # byte format

    sentiment_prob = get_sentiment_proba(new_text)

    return JsonResponse(sentiment_prob)

# Define prediction function using pipeline
def get_sentiment_proba(text):
    max_length = 300 # 最多字數 若超出模型訓練時的字數，以模型最大字數為依據
    #max_length = 512 # 最多字數 若超出模型訓練時的字數，以模型最大字數為依據
    outputs = sentiment_classify(text, padding=True, max_length=max_length, truncation=True)
    if outputs[0]['label']=='LABEL_1':
        # Get the positive score
        prob_positive = round(outputs[0]['score'],2)
        prob_negatitive = round(1 - prob_positive, 2)
    else:
        # Calculate the negative score
        prob_negatitive = round(outputs[0]['score'],2)
        prob_positive = round(1 - prob_negatitive, 2)

    response = {'Negative':prob_negatitive, 'Positive': prob_positive}
    return response



print("Loading app bert sentiment classification.")


OSError: app_sentiment_bert/best-model is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`