How to Import Files from Google Drive to Colab
See [here](https://saturncloud.io/blog/how-to-import-files-from-google-drive-to-colab/) for reference.

*   Step 1: Mount Your Google Drive
*   Step 2: Locate the File You Want to Import ([img](https://saturncloud.io/images/blog/files-to-colab.png))


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

Download the [clean_data_segment.pkl](https://github.com/miniricer/topic_model_example/blob/master/data_my/clean_data_segment.pkl) file from the earlier step and upload it to your own Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Change the path below to match the location of the file in your own Drive.

In [None]:
df = pd.read_pickle('/content/drive/MyDrive/clean_data_segment.pkl')
df.head(3)
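
If `read_pickle` raises a file-not-found error, the path is the usual suspect. Below is a minimal sketch for checking a path before loading it. `check_path` is a hypothetical helper written for this example, not part of Colab or pandas.

```python
from pathlib import Path

def check_path(path):
    """Return True if the file exists; otherwise print hints and return False."""
    p = Path(path)
    if p.is_file():
        return True
    parent = p.parent
    if parent.is_dir():
        # List what is actually in the folder to help spot a typo or wrong location
        print(f"{p.name} not found in {parent}. Files there:")
        for child in sorted(parent.iterdir()):
            print("  ", child.name)
    else:
        print(f"Folder {parent} does not exist; is Drive mounted?")
    return False
```

In this notebook it could be called as `check_path('/content/drive/MyDrive/clean_data_segment.pkl')` before `pd.read_pickle`.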

Convert `topics` to integers to use as labels.

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df['labels'] = labelencoder.fit_transform(df['topics'])
df.head(3)
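
To make the mapping concrete, here is a minimal pure-Python sketch of what the encoding does. Each distinct topic gets an integer, assigned in sorted order of the topic names, which is how `LabelEncoder` orders its `classes_`. The topic names are made up for illustration and are not taken from the dataset.

```python
# Hypothetical topic names, for illustration only
topics = ["sports", "politics", "sports", "finance"]

# Distinct classes in sorted order, then a class -> integer lookup
classes = sorted(set(topics))
to_id = {name: i for i, name in enumerate(classes)}

labels = [to_id[t] for t in topics]
print(classes)  # ['finance', 'politics', 'sports']
print(labels)   # [2, 1, 2, 0]
```

In the notebook itself, `labelencoder.classes_` holds the topic names in this order, and `labelencoder.inverse_transform` maps integers back to names.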

In [None]:
import seaborn as sns
ax = sns.countplot(x="topics",data=df)

In [None]:
# For BERT (Chinese, bert-base-chinese):
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-chinese')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [None]:
sample_txt = '豪 大雨 造成 南部 地區 重大 災情 除了 淹水 災民 收拾 家園 很 辛苦 也 要'
tokens = tokenizer.tokenize(sample_txt)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f' Sentence: {sample_txt}')
print(f'   Tokens: {tokens}')
print(f'Token IDs: {token_ids}')

BERT accepts at most 512 tokens per sequence, so first keep only the documents with 200 words or fewer.

In [None]:
df200 = df.loc[(df['num_word'] <= 200)]

In [None]:
df200.shape

In [None]:
tokenized = df200['contents'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
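
Note that `num_word` counts segmented words, while BERT's limit applies to tokens, and `bert-base-chinese` generally splits Chinese text into single characters. A word-count filter therefore does not strictly guarantee that every sequence stays within 512 tokens. Below is a minimal sketch of a length check on already-encoded sequences. The toy ID lists and the helper name `within_limit` are assumptions for illustration.

```python
MAX_TOKENS = 512  # BERT's maximum sequence length, including [CLS] and [SEP]

def within_limit(encoded, limit=MAX_TOKENS):
    """Return one boolean per sequence: True if it fits within the limit."""
    return [len(ids) <= limit for ids in encoded]

# Toy encoded sequences standing in for `tokenized` (IDs are arbitrary placeholders)
toy = [[101, 7, 8, 102], [101] + [9] * 600 + [102]]
print(within_limit(toy))  # [True, False]
```

In the notebook, `tokenized.apply(len).max()` would report the longest encoded sequence directly.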

In [None]:
# Find the length of the longest encoded sequence
max_len = max(len(i) for i in tokenized.values)

# Right-pad every sequence with 0 so that all rows have length max_len
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [None]:
padded.shape

In [None]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape
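
The padding and masking steps are easier to follow on a small example. The sketch below uses two toy sequences with arbitrary placeholder IDs. Like the notebook, it assumes 0 is the padding ID and that no real token has ID 0; I believe this holds for `bert-base-chinese`, and `tokenizer.pad_token_id` can confirm it.

```python
import numpy as np

# Toy token-ID sequences of different lengths (IDs are arbitrary placeholders)
seqs = [[101, 5, 6, 102], [101, 7, 102]]

max_len = max(len(s) for s in seqs)
# Right-pad every sequence with 0 up to max_len
padded = np.array([s + [0] * (max_len - len(s)) for s in seqs])
# 1 marks a real token, 0 marks padding
mask = np.where(padded != 0, 1, 0)

print(padded.tolist())  # [[101, 5, 6, 102], [101, 7, 102, 0]]
print(mask.tolist())    # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The mask tells the model to ignore the padded positions when computing attention.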

In [None]:
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [None]:
features = last_hidden_states[0][:,0,:].numpy()
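
The indexing above keeps only the vector at position 0 of each sequence, which is the `[CLS]` token that `tokenizer.encode(..., add_special_tokens=True)` prepends. A minimal sketch of the same slice on a small NumPy array follows. The sizes are toy values; the real output has a hidden size of 768 for BERT base.

```python
import numpy as np

# Stand-in for the model's last hidden state: (batch, seq_len, hidden)
batch, seq_len, hidden = 2, 4, 3
hidden_states = np.arange(batch * seq_len * hidden).reshape(batch, seq_len, hidden)

# Keep position 0 of every sequence, all hidden dimensions
cls_vectors = hidden_states[:, 0, :]
print(cls_vectors.shape)     # (2, 3)
print(cls_vectors.tolist())  # [[0, 1, 2], [12, 13, 14]]
```

The result is one fixed-size vector per document, which is what the logistic regression later uses as its features.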

In [None]:
labels = df200['labels']

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

In [None]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

In [None]:
lr_clf.score(test_features, test_labels)

In [None]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
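
The dummy classifier provides a baseline to compare the logistic regression score against. Assuming its strategy is to always predict the most frequent class (the default has varied across scikit-learn versions), the baseline accuracy is simply the share of the majority class. A minimal sketch with made-up labels:

```python
from collections import Counter

# Hypothetical labels, for illustration only
labels = [0, 0, 0, 1, 2, 0, 1, 0]

# The most frequent class and how often it occurs
majority_class, count = Counter(labels).most_common(1)[0]
baseline_accuracy = count / len(labels)

print(majority_class)     # 0
print(baseline_accuracy)  # 0.625
```

A trained model is only useful if its score is clearly above this number.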

[Reference](https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/)