<a href="https://colab.research.google.com/github/lov435/SOEmotions/blob/main/hugging_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install PyTorch and the Hugging Face Transformers library

In [1]:
!pip install transformers
!pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Import necessary packages

In [2]:
from sklearn.decomposition import PCA
from transformers import BertTokenizer, BertModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
import torch


### Read the emotions prediction spreadsheet

In [3]:
url='https://drive.google.com/file/d/1WyIDTtuaf2wFdDhXdV4QPE-uhgom5Cqq/view?usp=sharing'
file_id=url.split('/')[-2]
dwn_url='https://drive.google.com/uc?id=' + file_id
df = pd.read_csv(dwn_url)
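
The cell above takes the file ID from a fixed position in the share link (`split('/')[-2]`). As a small sketch of the same conversion, the hypothetical helper `drive_direct_url` below (not part of the original notebook) looks for the path segment that follows `d` instead, so it does not depend on what comes after the ID. It assumes the link has the `.../file/d/<id>/...` shape used above.

```python
def drive_direct_url(share_url: str) -> str:
    """Turn a '.../file/d/<id>/view?...' share link into a 'uc?id=<id>' link."""
    parts = share_url.split('/')
    # The file ID is the path segment right after 'd'.
    file_id = parts[parts.index('d') + 1]
    return 'https://drive.google.com/uc?id=' + file_id


url = 'https://drive.google.com/file/d/1WyIDTtuaf2wFdDhXdV4QPE-uhgom5Cqq/view?usp=sharing'
dwn_url = drive_direct_url(url)
print(dwn_url)  # https://drive.google.com/uc?id=1WyIDTtuaf2wFdDhXdV4QPE-uhgom5Cqq
```

The resulting `dwn_url` can be passed to `pd.read_csv` exactly as in the cell above.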

### Extract the comment text and Haoxiang's corresponding group label from the spreadsheet, and remove rows with empty comments

In [4]:
df = df[['CommentTextProc', 'Group']]
df = df.dropna(axis=0, subset=['CommentTextProc'])
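
To illustrate what this filtering step does, here is a toy sketch on made-up data (the labels `g1`..`g3` and the `Score` column are invented for the example). It also shows one thing worth checking on the real data: `dropna` removes missing values only, so a comment made up entirely of whitespace survives unless it is filtered separately.

```python
import pandas as pd

toy = pd.DataFrame({
    'CommentTextProc': ['thanks, this works', None, '   ', 'why does this fail?'],
    'Group': ['g1', 'g2', 'g1', 'g3'],   # made-up labels
    'Score': [3, 0, 1, 2],               # made-up extra column
})

# Same two steps as the notebook cell: keep two columns, drop missing comments.
kept = toy[['CommentTextProc', 'Group']].dropna(axis=0, subset=['CommentTextProc'])
print(len(kept))  # 3: the None row is gone, the whitespace-only row is not

# Optional extra step (not in the notebook): also drop whitespace-only comments.
cleaned = kept[kept['CommentTextProc'].str.strip() != '']
print(len(cleaned))  # 2
```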


### Split the data into training and test sets

In [5]:
X_train, X_test, y_train, y_test = train_test_split(df["CommentTextProc"], df["Group"],
                                   random_state=104, 
                                   test_size=0.40, 
                                   shuffle=True)
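
The split above is a plain shuffled 60/40 split. If the groups turn out to be imbalanced, one possible variation (an assumption, not something the notebook does) is to pass `stratify=` so that both partitions keep the same label proportions. A minimal sketch on synthetic labels:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for df["CommentTextProc"] and df["Group"].
texts = [f"comment {i}" for i in range(100)]
labels = [0] * 60 + [1] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.40,
    random_state=104,
    shuffle=True,
    stratify=labels,   # keep the 60/40 label ratio in both partitions
)

print(Counter(y_tr))
print(Counter(y_te))
```

Note that stratifying needs at least two examples of every label, so very rare groups would have to be merged or dropped first.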

### Initialize the BERT tokenizer and model

In [6]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Produce the features (BERT vectors) from text

In [7]:
# Tokenize the comments: pad to the longest sequence and truncate to BERT's maximum input length
encoded_x_train = tokenizer(list(X_train), padding=True, truncation=True, return_tensors='pt')
encoded_x_test = tokenizer(list(X_test), padding=True, truncation=True, return_tensors='pt')

# Move the encoded tensors to the device (GPU if available, otherwise CPU).
# The values are already tensors, so call .to(device) directly instead of
# wrapping them in torch.tensor(), which copies them and raises a UserWarning.
encoded_x_train = {k: v.to(device) for k, v in encoded_x_train.items()}
encoded_x_test = {k: v.to(device) for k, v in encoded_x_test.items()}

# Run BERT without tracking gradients; we only need the forward pass
with torch.no_grad():
  output_train = model(**encoded_x_train)
  output_test = model(**encoded_x_test)

# Use the hidden state of the [CLS] token (position 0) as the feature vector for classification
cls_output_train = output_train.last_hidden_state[:, 0, :]
cls_output_test = output_test.last_hidden_state[:, 0, :]

print("Shape is")
print(cls_output_train.shape)
print(cls_output_test.shape)

  


Shape is
torch.Size([2190, 768])
torch.Size([1460, 768])
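
The cell above pushes all training comments through BERT in a single forward pass, which can exhaust memory on larger datasets. A common workaround is to encode in fixed-size batches and stack the `[CLS]` vectors afterwards. The sketch below shows only that batching pattern: `embed_in_batches` and `fake_encode` are hypothetical names, and `fake_encode` is a stand-in that returns random arrays in place of real BERT hidden states.

```python
import numpy as np


def embed_in_batches(texts, encode_fn, batch_size=32):
    """Encode texts batch by batch and stack the per-batch [CLS] vectors."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        hidden = encode_fn(batch)        # shape: (len(batch), seq_len, hidden_dim)
        chunks.append(hidden[:, 0, :])   # keep only position 0, the [CLS] token
    return np.concatenate(chunks, axis=0)


def fake_encode(batch, seq_len=5, hidden_dim=768):
    """Stand-in encoder for illustration only: random 'hidden states'."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(batch), seq_len, hidden_dim)).astype(np.float32)


features = embed_in_batches([f"comment {i}" for i in range(70)], fake_encode, batch_size=32)
print(features.shape)  # (70, 768)
```

In the notebook, `encode_fn` would instead tokenize the batch, run `model` under `torch.no_grad()`, and return `last_hidden_state` moved to the CPU as a NumPy array.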


#### Perform PCA

In [21]:
# Use PCA to reduce the feature dimension from 768 to 10
pca = PCA(n_components=10, random_state=7)

# Temporarily concatenate the training and test features so PCA is fit on both
nump_features = np.concatenate((cls_output_train.detach().cpu().numpy(),
                                cls_output_test.detach().cpu().numpy()))
X1 = pca.fit_transform(nump_features)
print(X1.shape)
print(X1)

# Slice the PCA-transformed features back into training and test sets
X_pca_train = X1[:len(y_train)]
X_pca_test = X1[len(y_train):]
print(X_pca_train.shape)



(3650, 10)
[[ 3.4247296  -1.8072746   1.6299404  ...  0.81381917  0.40350807
  -0.68725854]
 [ 0.24995682  4.0505233  -0.768812   ...  0.9909862   0.355702
   0.4945887 ]
 [ 1.3525051   0.7163822  -0.01073296 ... -0.9621817  -0.84351486
   1.1976556 ]
 ...
 [ 1.5906237  -2.0225756  -0.97227955 ...  0.12410364  0.4289484
  -0.59154457]
 [-1.9809676   1.7551179   0.54981154 ...  2.9065666   0.3360001
   1.4243469 ]
 [-0.3502619   0.31541148 -0.04347146 ... -0.867883   -0.8761377
  -0.63801533]]
(2190, 10)
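
The cell above fits PCA on the training and test features together, so the projection is influenced by the test set. A frequently used alternative, shown here as a sketch and not as the notebook's method, is to fit PCA on the training features only and then apply the learned projection to the test features. Random arrays stand in for the BERT `[CLS]` vectors.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-ins for cls_output_train / cls_output_test converted to NumPy.
rng = np.random.default_rng(7)
train_features = rng.normal(size=(60, 768))
test_features = rng.normal(size=(40, 768))

pca = PCA(n_components=10, random_state=7)
X_pca_train = pca.fit_transform(train_features)  # learn the projection from training data only
X_pca_test = pca.transform(test_features)        # reuse that projection for the test data

print(X_pca_train.shape, X_pca_test.shape)  # (60, 10) (40, 10)
```

With this variation the test score would likely differ somewhat from the one reported in this notebook.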


### Perform a classification task

In [22]:
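# Train a random forest on the PCA-reduced training features
# (no random_state is set, so the score varies slightly between runs)
rf = RandomForestClassifier()
rf.fit(X_pca_train, y_train)
# Mean accuracy on the held-out test set
rf.score(X_pca_test, y_test)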


0.560958904109589
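
An accuracy of about 0.56 is hard to judge without knowing how the groups are distributed. One simple reference point is the accuracy of always predicting the most frequent training label. The sketch below computes that baseline on toy labels; `majority_baseline_accuracy` is a hypothetical helper, and the toy labels are invented.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return sum(1 for y in y_test if y == majority_label) / len(y_test)


toy_train = ['a', 'a', 'a', 'b', 'c']
toy_test = ['a', 'b', 'a', 'c']
baseline = majority_baseline_accuracy(toy_train, toy_test)
print(baseline)  # 0.5
```

Calling it with the notebook's `y_train` and `y_test` would give the number to compare against the random forest score; scikit-learn's `DummyClassifier(strategy='most_frequent')` offers the same baseline.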