### Model Loading and Invocation

In [None]:
from transformers import AutoConfig, AutoModel, AutoTokenizer, BertModel

#### Models can be loaded online or offline

If you have unrestricted access to the Hugging Face Hub, simply pass the model ID:
```python
model = AutoModel.from_pretrained("hfl/rbt3", force_download=True)
```

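If you cannot reach the Hub, download the model to a local directory with modelscope first, then load it from there:
```python
from modelscope.hub.snapshot_download import snapshot_download

# snapshot_download returns the local directory the files were saved to
model_dir = snapshot_download(model_id="xxx", cache_dir="./models")
# Load from the local copy; force_download is not needed for a local path
model = AutoModel.from_pretrained(model_dir)
```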

#### Loading a model offline

In [None]:
model = AutoModel.from_pretrained("models/rbt3")  # point directly at the local download path

In [3]:
tokenizer = AutoTokenizer.from_pretrained("models/rbt3")

In [4]:
config = AutoConfig.from_pretrained("models/rbt3")
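
As a rough illustration of what offline loading relies on, the sketch below saves a tiny, randomly initialized BERT model to a temporary directory and reloads it with `local_files_only=True`, so no network access is involved. The config values and the temporary directory are assumptions chosen for the demo; they are not the `rbt3` settings.

```python
import tempfile

import torch
from transformers import AutoModel, BertConfig, BertModel

# Hypothetical tiny config, only so the demo runs quickly (not rbt3's real config)
tiny_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertModel(tiny_config)

with tempfile.TemporaryDirectory() as local_dir:
    # save_pretrained writes config.json plus the weight file into local_dir
    tiny_model.save_pretrained(local_dir)
    # local_files_only=True tells transformers not to contact the Hub
    reloaded = AutoModel.from_pretrained(local_dir, local_files_only=True)
    # The reloaded weights should match the ones that were saved
    weights_match = torch.equal(
        tiny_model.embeddings.word_embeddings.weight,
        reloaded.embeddings.word_embeddings.weight,
    )

print(type(reloaded).__name__, weights_match)
```

The same idea applies to `models/rbt3`: as long as the directory holds the config and weight files, `from_pretrained` can load it without going online.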

#### Calling the model

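A model can be called either without a task head or with one.

Without a task head: import `AutoModel` directly. The output contains `last_hidden_state` and the pooled output (`pooler_output`).

With a task head: import the matching `AutoModelFor...` class, such as `AutoModelForSequenceClassification`. It wraps the `AutoModel` output and returns the result for that task.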
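
To make the difference concrete, here is a minimal sketch that builds a tiny, randomly initialized BERT from a hypothetical config (no checkpoint is downloaded) and compares the two kinds of output. The config values, including `num_labels=10`, are arbitrary assumptions for the demo.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification, BertModel

# Hypothetical tiny config for illustration only
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=10,
)
# One fake sequence of 8 token ids (starting at 1 to avoid the padding id 0)
input_ids = torch.randint(1, config.vocab_size, (1, 8))

base = BertModel(config).eval()                       # no task head
head = BertForSequenceClassification(config).eval()   # with a classification head

with torch.inference_mode():
    base_out = base(input_ids=input_ids)
    head_out = head(input_ids=input_ids)

print(base_out.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
print(base_out.pooler_output.shape)      # (batch, hidden_size)
print(head_out.logits.shape)             # (batch, num_labels)
```

The headless model returns hidden representations, while the model with a head returns one logit per label.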

##### 1. Calling a model without a head

In [9]:
saying = "弱小的我也有大梦想！"  # "Even someone as small as me has big dreams!"
inputs = tokenizer(saying, return_tensors="pt")

In [13]:
output = model(**inputs)



In [15]:
output.keys()

odict_keys(['last_hidden_state', 'pooler_output', 'attentions'])

##### 2. Calling a model with a head

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("models/rbt3", num_labels=10)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at models/rbt3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
output = model(**inputs)

In [22]:
model.config.num_labels

10

In [26]:
output.logits.argmax(dim=-1).item()

9
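
The `argmax` call only picks the index of the largest logit. A small sketch with made-up logits (hypothetical values, not model output) shows how softmax turns logits into probabilities and how `argmax` selects the class. Because the classification head here is newly initialized, as the warning above says, the predicted index is essentially arbitrary until the model is fine-tuned.

```python
import torch

# Hypothetical logits for a batch of one example and three classes
logits = torch.tensor([[0.1, 2.0, -1.0]])

probs = torch.softmax(logits, dim=-1)   # normalize logits into probabilities
pred = logits.argmax(dim=-1).item()     # index of the largest logit
confidence = probs[0, pred].item()      # probability assigned to that class

print(pred, round(confidence, 3))
```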

#### Hands-on practice

##### Step 1: Imports

In [28]:
import pandas as pd

from transformers import AutoTokenizer, AutoModelForSequenceClassification

##### Step 2: Load the data

Take a quick look at the data first; this cell can be removed from the final code.

In [30]:
data = pd.read_csv("../../datas/ChnSentiCorp_htl_all.csv")
data

Unnamed: 0,label,review
0,1,"距离川沙公路较近,但是公交指示不对,如果是""蔡陆线""的话,会非常麻烦.建议用别的路线.房间较..."
1,1,商务大床房，房间很大，床有2M宽，整体感觉经济实惠不错!
2,1,早餐太差，无论去多少人，那边也不加食品的。酒店应该重视一下这个问题了。房间本身很好。
3,1,宾馆在小街道上，不大好找，但还好北京热心同胞很多~宾馆设施跟介绍的差不多，房间很小，确实挺小...
4,1,"CBD中心,周围没什么店铺,说5星有点勉强.不知道为什么卫生间没有电吹风"
...,...,...
7761,0,尼斯酒店的几大特点：噪音大、环境差、配置低、服务效率低。如：1、隔壁歌厅的声音闹至午夜3点许...
7762,0,盐城来了很多次，第一次住盐阜宾馆，我的确很失望整个墙壁黑咕隆咚的，好像被烟熏过一样家具非常的...
7763,0,看照片觉得还挺不错的，又是4星级的，但入住以后除了后悔没有别的，房间挺大但空空的，早餐是有但...
7764,0,我们去盐城的时候那里的最低气温只有4度，晚上冷得要死，居然还不开空调，投诉到酒店客房部，得到...


In [32]:
data_labels = data["label"].unique()
print(data_labels, len(data_labels))

[1 0] 2
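
Before training it can help to check for missing reviews and class balance. The sketch below uses a small hypothetical DataFrame with the same `label` and `review` columns in place of the real CSV file.

```python
import pandas as pd

# Hypothetical stand-in for the CSV; one review is missing on purpose
toy = pd.DataFrame({
    "label": [1, 1, 0, 0],
    "review": ["great room", "friendly staff", None, "too noisy"],
})

missing = int(toy["review"].isna().sum())   # number of rows without review text
clean = toy.dropna()                        # drop rows with missing values
counts = clean["label"].value_counts()      # samples per class after cleaning

print(missing, len(clean))
print(counts.to_dict())
```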


Now load the data properly

In [34]:
from torch.utils.data import Dataset

class MyData(Dataset):
    def __init__(self):
        super().__init__()
        data = pd.read_csv("../../datas/ChnSentiCorp_htl_all.csv")
        self.data = data.dropna()
    
    def __getitem__(self, index):
        selected_data = self.data.iloc[index]
        return selected_data["review"], selected_data["label"]
    
    def __len__(self):
        return len(self.data)
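
One variant worth considering is to pass the DataFrame into the dataset instead of hard-coding the path, which makes the class easy to try on toy data. The class name `ReviewDataset` and the sample rows below are hypothetical.

```python
import pandas as pd
from torch.utils.data import Dataset


class ReviewDataset(Dataset):
    """Map-style dataset that yields (review, label) pairs from a DataFrame."""

    def __init__(self, frame: pd.DataFrame):
        super().__init__()
        # Drop rows with missing values and renumber so positions are contiguous
        self.data = frame.dropna().reset_index(drop=True)

    def __getitem__(self, index):
        row = self.data.iloc[index]
        # Cast the label to a plain int so it collates cleanly into a tensor
        return row["review"], int(row["label"])

    def __len__(self):
        return len(self.data)


toy = pd.DataFrame({
    "label": [1, 0, 1],
    "review": ["clean room", None, "good breakfast"],
})
toy_dataset = ReviewDataset(toy)
print(len(toy_dataset), toy_dataset[1])
```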

##### Step 3: Preprocess the data

In [35]:
from torch.utils.data import random_split

dataset = MyData()
trainset, testset = random_split(dataset, lengths=[0.9, 0.1])
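
`random_split` draws a new random permutation on every run, so the train/test membership changes between runs. If a reproducible split is wanted, one option is to pass a seeded generator. The sketch below uses a plain list as a hypothetical stand-in for the dataset and integer lengths instead of fractions.

```python
import torch
from torch.utils.data import random_split

items = list(range(100))  # hypothetical stand-in for the dataset


def split(seed):
    generator = torch.Generator().manual_seed(seed)
    # Integer lengths are used here; fractions need a reasonably recent PyTorch
    return random_split(items, lengths=[90, 10], generator=generator)


train_a, test_a = split(seed=42)
train_b, test_b = split(seed=42)

print(len(train_a), len(test_a))
print(test_a.indices == test_b.indices)  # same seed, same split
```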

In [45]:
import torch

tokenizer = AutoTokenizer.from_pretrained("models/rbt3")

def collator_func(batch):  # the DataLoader passes samples to this function one batch at a time
    texts, labels = [], []
    for item in batch:
        texts.append(item[0])
        labels.append(item[1])
    tokenized_data = tokenizer(texts, max_length=128, padding="max_length", truncation=True, return_tensors="pt")
    tokenized_data["labels"] = torch.tensor(labels)
    return tokenized_data
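
The collate function pads every batch to 128 tokens. To illustrate what padding and the attention mask look like, this sketch pads two hypothetical token-id sequences to the longest one in the batch using plain PyTorch. It does not call the tokenizer, and the ids and `PAD_ID` are made up for the demo.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed padding id for this demo

# Two made-up token-id sequences of different lengths
sequences = [torch.tensor([101, 7, 8, 102]), torch.tensor([101, 9, 102])]
labels = [1, 0]

# Pad the shorter sequence on the right so both have the same length
input_ids = pad_sequence(sequences, batch_first=True, padding_value=PAD_ID)
attention_mask = (input_ids != PAD_ID).long()  # 1 for real tokens, 0 for padding

batch = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "labels": torch.tensor(labels),
}
print(batch["input_ids"].shape, batch["attention_mask"].tolist())
```

With the Hugging Face tokenizer, passing `padding=True` pads to the longest sequence in the batch rather than to a fixed `max_length`, which can save computation when most reviews are short.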

In [46]:
from torch.utils.data import DataLoader

trainloader = DataLoader(trainset, batch_size=32, shuffle=True, collate_fn=collator_func)

testloader = DataLoader(testset, batch_size=32, shuffle=False, collate_fn=collator_func)  # evaluation data does not need shuffling

##### Step 4: Create the model

In [47]:
model = AutoModelForSequenceClassification.from_pretrained("models/rbt3")
if torch.cuda.is_available():
    model = model.cuda()

from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=2e-5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at models/rbt3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


##### Step 5: Configure training

In [52]:
def train(epoch=3, log_step=20):
    global_step = 0
    for ep in range(epoch):
        model.train()

        for batch in trainloader:
            if torch.cuda.is_available():
                batch = {k: v.cuda() for k, v in batch.items()}
            optimizer.zero_grad()
            outputs = model(**batch)
            outputs.loss.backward()
            optimizer.step()

            if global_step % log_step == 0:
                print(f"ep: {ep + 1}, global_step: {global_step}, loss: {outputs.loss.item()}")
            global_step += 1

        acc = evaluate()
        print(f"ep: {ep + 1}, acc: {acc}")

def evaluate():
    model.eval()
    with torch.inference_mode():
        right_num = 0
        for batch in testloader:
            if torch.cuda.is_available():
                batch = {k: v.cuda() for k, v in batch.items()}
            outputs = model(**batch)
            pred = outputs.logits.argmax(dim=-1)
            right_num += (pred.long() == batch["labels"].long()).float().sum()
    return right_num / len(testset)

tensor(0.6572, device='cuda:0')
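
The loop in Step 5 follows the usual `zero_grad` / forward / `backward` / `step` pattern, with evaluation run under `torch.inference_mode()`. The sketch below reproduces that pattern on a toy linear classifier with synthetic data so it runs in seconds. The model, data, and learning rate are assumptions for illustration, not the settings used above.

```python
import torch
from torch import nn
from torch.optim import Adam

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy, linearly separable data standing in for tokenized batches
features = torch.randn(64, 2).to(device)
targets = (features.sum(dim=1) > 0).long()

toy_model = nn.Linear(2, 2).to(device)
toy_optimizer = Adam(toy_model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

losses = []
toy_model.train()
for step in range(50):
    toy_optimizer.zero_grad()                     # clear gradients from the previous step
    loss = loss_fn(toy_model(features), targets)  # forward pass and loss
    loss.backward()                               # compute gradients
    toy_optimizer.step()                          # update parameters
    losses.append(loss.item())

toy_model.eval()
with torch.inference_mode():
    preds = toy_model(features).argmax(dim=-1)
    # .item() converts the result to a plain Python float instead of a tensor
    accuracy = (preds == targets).float().mean().item()

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}, acc: {accuracy:.4f}")
```

Returning the accuracy as a Python float, as done here with `.item()`, keeps log lines free of `tensor(..., device='cuda:0')` wrappers.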

In [53]:
train()

ep: 1, loss: 0.6023589372634888
ep: 1, loss: 0.4994417428970337
ep: 1, loss: 0.35254302620887756
ep: 1, loss: 0.3994499444961548
ep: 1, loss: 0.2758346199989319
ep: 1, loss: 0.35559654235839844
ep: 1, loss: 0.41432616114616394
ep: 1, loss: 0.25966593623161316
ep: 1, loss: 0.2134041041135788
ep: 1, loss: 0.477927565574646
ep: 1, loss: 0.33135950565338135
ep: 1, acc: 0.8711339831352234
ep: 2, loss: 0.2784104645252228
ep: 2, loss: 0.31218916177749634
ep: 2, loss: 0.3016769587993622
ep: 2, loss: 0.3304864764213562
ep: 2, loss: 0.19822059571743011
ep: 2, loss: 0.16498878598213196
ep: 2, loss: 0.12936681509017944
ep: 2, loss: 0.22968313097953796
ep: 2, loss: 0.2906884551048279
ep: 2, loss: 0.17467628419399261
ep: 2, loss: 0.14967025816440582
ep: 2, acc: 0.875
ep: 3, loss: 0.27342769503593445
ep: 3, loss: 0.1483253687620163
ep: 3, loss: 0.433015376329422
ep: 3, loss: 0.20122946798801422
ep: 3, loss: 0.047586701810359955
ep: 3, loss: 0.159860759973526
ep: 3, loss: 0.2711983323097229
ep: 3, los

In [55]:
sentence = "我觉得这家酒店不错，饭很好吃！"  # "I think this hotel is nice, and the food is delicious!"

In [58]:
id2label = {0: "差评", 1: "好评"}  # 差评 = negative review, 好评 = positive review

with torch.inference_mode():
    inputs = tokenizer(sentence, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}
    outputs = model(**inputs).logits.argmax(dim=-1)
    print(id2label.get(outputs.item()))

好评
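
The label mapping can also be stored in the model config rather than in a separate dictionary, so that it travels with the saved model. The sketch below only builds a config object and uses hypothetical English label names; passing the same `id2label` and `label2id` arguments to `from_pretrained` is a common way to attach them to a real checkpoint.

```python
from transformers import BertConfig

# Hypothetical English label names for the two sentiment classes
id2label = {0: "negative", 1: "positive"}
label2id = {name: idx for idx, name in id2label.items()}

# The config keeps the mapping, and num_labels is derived from it
config = BertConfig(id2label=id2label, label2id=label2id)

predicted_id = 1  # stand-in for logits.argmax(dim=-1).item()
print(config.num_labels, config.id2label[predicted_id])
```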
