<a href="https://colab.research.google.com/github/langdonholmes/lexical_analysis/blob/main/ICNALE_lexical_embedding_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Using BERT Embeddings as Features in a Regression**

We want to predict the lexical proficiency of the author of a text.

We feed BERT text and collect the embeddings on the [CLS] token. This is not ideal, but works well enough and is conceptually manageable.

In this showcase notebook, I am just using all of ICNALE and randomly selecting training/validation sets. In the publishable analysis, we may want to reserve more of the data for validation and/or choose specific texts to use in the validation set (e.g. L2 speakers, only smoking prompt, etc.)

The first step is to install some packages in the Colab runtime. We write a requirements.txt file with the %%writefile magic, then install from it with pip.

In [None]:
%%writefile requirements.txt
transformers
datasets

Writing requirements.txt


In [None]:
!pip install -q -r requirements.txt


In [None]:
# Files
import os
import pathlib

# Feedback
from tqdm import tqdm

# Digits and Strings
import numpy as np
import pandas as pd

# Learning
from sklearn import linear_model, neighbors

from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel

import torch
assert torch.cuda.is_available()
device = torch.device("cuda")

# Learning Utilities (linear_model is already imported above)
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error, r2_score
from datasets import Dataset, DatasetDict
from torch.utils.data import DataLoader


In [None]:
corpus_dir = '/content/drive/MyDrive/data/icnale_bert_lexical/ICNALE_W_single_folder_cleaned/'
meta_data = '/content/drive/MyDrive/data/icnale_bert_lexical/final_icnale_variable_smoking_lme.csv'
fname = '/content/drive/MyDrive/data/icnale_bert_lexical/smj.csv'

if not os.path.isfile(fname):
    texts = []
    for file in tqdm(pathlib.Path(corpus_dir).rglob("*.txt")):
        with open(file, 'r', encoding="utf-8") as f: 
            text = f.read()
            texts.append((file.name, text))
    df = pd.DataFrame.from_records(texts, columns=['Filename', 'text'])
    df = df.merge(pd.read_csv(meta_data), on='Filename')
    df.to_csv(fname)

filenames = ['/content/drive/MyDrive/data/icnale_bert_lexical/smj_train.csv',
             '/content/drive/MyDrive/data/icnale_bert_lexical/smj_valid.csv']

if not os.path.isfile(filenames[0]):
    df = pd.read_csv(fname, index_col=0)
    train, validandtest = train_test_split(
        df,
        test_size=.2,
        stratify=df['Country'],
        random_state=42)
    # Use a separate loop variable so the outer `fname` (smj.csv) is not overwritten
    for out_name, frame in zip(filenames, [train, validandtest]):
        if not os.path.isfile(out_name):
            frame.to_csv(out_name)
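
To make the stratified split above more concrete, here is a rough numpy sketch of what stratifying by country means: each country contributes about the same share of its documents to the held-out set. The function `stratified_split` and the toy country counts are my own illustration, not scikit-learn's implementation and not the real ICNALE distribution.

```python
import numpy as np

def stratified_split(groups, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each group gives ~test_size of its rows to the test set."""
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    test_idx = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)  # row positions belonging to this group
        rng.shuffle(members)                   # random choice within the group
        n_test = int(round(test_size * len(members)))
        test_idx.extend(members[:n_test])
    test_idx = np.sort(np.array(test_idx, dtype=int))
    train_idx = np.setdiff1d(np.arange(len(groups)), test_idx)
    return train_idx, test_idx

# Toy data: made-up country counts, only for illustration
countries = np.array(["CHN"] * 50 + ["PHL"] * 30 + ["HKG"] * 20)
train_idx, test_idx = stratified_split(countries)

# Each country keeps the same 80/20 proportion
for c in ["CHN", "PHL", "HKG"]:
    print(c, int((countries[test_idx] == c).sum()))
```

In the notebook itself, `train_test_split(..., stratify=df['Country'])` does this job for us.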

The wonderful folks at Huggingface expect that the target variable will be in a column called "label".

I need to use a minmax scaler when finetuning BERT (a different approach in a different notebook), so let's do that here as well. 

Here, I am giving native English speakers a VST score of 1.0 (the highest). This is a fair assumption and helps to increase the size of our training set.
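
As a small worked example of the two ideas above (min-max scaling, and giving native speakers the top score), here is a sketch on made-up numbers. I am assuming that native speakers show up as missing VST values (NaN); that convention and the helper name `min_max_scale` are assumptions for illustration, not something the class below does.

```python
import numpy as np

def min_max_scale(scores):
    """Rescale linearly so the smallest score maps to 0.0 and the largest to 1.0 (NaNs stay NaN)."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = np.nanmin(scores), np.nanmax(scores)
    return (scores - lo) / (hi - lo)

# Hypothetical VST scores; NaN stands for a native speaker without a VST score (assumption)
vst = np.array([14.0, 22.0, 46.0, np.nan, 39.0])
scaled = min_max_scale(vst)

# Native speakers get the top of the scaled range
labels = np.where(np.isnan(scaled), 1.0, scaled)
print(labels)  # 14 -> 0.0, 22 -> 0.25, 46 -> 1.0, native -> 1.0, 39 -> 0.78125
```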

We keep the plain text in the same dataframe. This data structure works well with the Huggingface Datasets library.


In [None]:
class _dataset:
  def __init__(self, fname, index_col=None, max_length=512,
               check_seq_lengths=False, simplify=True, use_min_max=False):
    self.fname = fname
    self.index_col = index_col 
    self.max_length = max_length
    self.check_seq_lengths=check_seq_lengths
    self.simplify=simplify
    self.use_min_max=use_min_max

  def min_max_scaler(self, s):
    return (s-s.min())/(s.max()-s.min())

  def tokenize(self, batch):
    return tokenizer(batch['text'],
                    padding="max_length",
                    truncation=True,
                    max_length=self.max_length,
                    return_overflowing_tokens=self.check_seq_lengths)

  def prep_frame(self):
    df = pd.read_csv(self.fname, index_col=self.index_col)
    df['label'] = df['VST']
    df = df[df['label'].notna()]

    if self.use_min_max:
      df['label'] = self.min_max_scaler(df['label'])
    
    if self.simplify:
      df = df[['label', 'text']]

    self.df = df

  def make(self):
    self.prep_frame()
      
    dataset = Dataset.from_pandas(self.df.reset_index(drop=True))
    dataset = dataset.map(self.tokenize, batched=True, batch_size=1)
    dataset = dataset.remove_columns('text')

    if self.check_seq_lengths:
      truncated = [tens for tens in dataset['overflowing_tokens']
                if len(tens) > 0]
      print(f'{len(truncated)} documents have been truncated in the dataset.')

    columns = ['input_ids', 'token_type_ids', 'attention_mask', 'label']
    columns = list(set(columns).intersection(dataset.column_names))
    dataset.set_format(type='torch',
                      columns=columns,
                      device='cuda')
    
    return dataset

# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
# fname = '/content/drive/MyDrive/data/icnale_bert_lexical/all_icnale_for_bert_training.csv'
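# prep_frame() stores the frame on the object and returns None, so keep the object and read .df
dt = _dataset(fname, simplify=False)  # simplify=False keeps every column for inspection
dt.prep_frame()
dt.df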

| Unnamed: 0 | text | Filename | id | prompt | VST | label |
|---|---|---|---|---|---|---|
| 0 | Smoking does not just have ill-effects on the... | W_PHL_SMK0_126_A2_0.txt | W_PHL_126 | SMK0 | 22.0 | 22.0 |
| 1 | I agree with the statement because smoking in... | W_PHL_SMK0_161_A2_0.txt | W_PHL_161 | SMK0 | 14.0 | 14.0 |
| 2 | I agree with the statement because having par... | W_PHL_PTJ0_161_A2_0.txt | W_PHL_161 | PTJ0 | 14.0 | 14.0 |
| 3 | In today's society, more and more college stu... | W_PHL_PTJ0_126_A2_0.txt | W_PHL_126 | PTJ0 | 22.0 | 22.0 |
| 4 | Smoking has been causing serious issues in th... | W_HKG_SMK0_075_B2_0.txt | W_HKG_075 | SMK0 | 46.0 | 46.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 5595 | I don't know why there is a kind of thing cal... | W_CHN_SMK0_089_B1_2.txt | W_CHN_089 | SMK0 | 46.0 | 46.0 |
| 5596 | Smoking is one of the most dangerous potentia... | W_CHN_SMK0_247_B1_2.txt | W_CHN_247 | SMK0 | 39.0 | 39.0 |
| 5597 | Nowadays, more and more college students are ... | W_CHN_PTJ0_290_B1_2.txt | W_CHN_290 | PTJ0 | 39.0 | 39.0 |
| 5598 | Smoking may be the most usually thing that ca... | W_CHN_SMK0_016_B1_2.txt | W_CHN_016 | SMK0 | 39.0 | 39.0 |


First, we need to convert the texts into BERT tokens (i.e., input_ids).

Processing is faster if all inputs are the same length, so shorter inputs are padded to the max_length. Padding is done by appending zeroes (the id of the padding token) to the input_ids. The attention_mask tells BERT to ignore these padded positions when processing.
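
Here is a minimal plain-Python sketch of padding and the matching attention mask. The helper `pad_to_max_length` is hypothetical and much simpler than the real tokenizer (which, for instance, keeps the final [SEP] token when it truncates); it only shows the shape of the result.

```python
def pad_to_max_length(input_ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token ids and build the matching attention mask."""
    ids = list(input_ids)[:max_length]   # naive truncation, unlike the real tokenizer
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

# 101 and 102 are [CLS] and [SEP] in bert-base-uncased; the ids in between are made up
example = pad_to_max_length([101, 2001, 2002, 2003, 102], max_length=8)
print(example["input_ids"])       # [101, 2001, 2002, 2003, 102, 0, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```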

Huggingface provides a Dataset class that helps manage model inputs.

The set_format method sends the important columns of our dataset to the GPU, where they will be processed blazingly fast (compared to a CPU, anyway).

DataLoader is a PyTorch class that helps batch our dataset. We set the batch_size to a multiple of 8 to improve efficiency. Later, we will loop over the loader at processing time.

In [None]:
loader = DataLoader(dataset, batch_size=16, shuffle=False)
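
With shuffle=False, the loader simply walks through the dataset in order, 16 documents at a time. A rough plain-Python equivalent (the helper `iter_batches` is my own illustration, not part of PyTorch) looks like this:

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, keeping the original order (no shuffling)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

documents = list(range(5200))  # stand-ins for 5,200 tokenized documents
batches = list(iter_batches(documents, batch_size=16))

n_batches = len(batches)
print(n_batches, math.ceil(len(documents) / 16))  # both are 325
```

Keeping the order fixed matters here: row i of the feature matrix we build later has to correspond to document i in the dataset.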

We will collect the [CLS] embedding for each document, which is a vector of length 768. First, we create an empty numpy array with 0 rows and 768 columns. Then we loop over the loader (16 texts at a time) and send each batch to BERT. From BERT's output we take the last hidden states and select the first token of every sequence with [:,0,:] indexing, as illustrated by our friend Jay Alammar (we have 16 rows per batch where he has 2,000 rows). This PyTorch tensor lives on the GPU, so we first move it back to RAM with .cpu() and then make it a normal numpy array with .numpy().


![](https://camo.githubusercontent.com/6c2185c7620a3fe52f1968752febb6467723f4485c257442d3b0ed03bb0da197/68747470733a2f2f6a616c616d6d61722e6769746875622e696f2f696d616765732f64697374696c424552542f626572742d6f75747075742d74656e736f722d73656c656374696f6e2e706e67)
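
To see the indexing in isolation, here is a sketch with random numpy arrays standing in for BERT's output (the same slice works on a PyTorch tensor). The shapes and the name `fake_outputs` are assumptions for the demo. It also collects the per-batch results in a list and concatenates once at the end, which is one alternative to stacking inside the loop.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768
seq_len = 8  # kept short for the demo; the notebook pads to 512

# Stand-ins for last_hidden_state, shaped (batch, seq_len, hidden_size)
fake_outputs = [rng.normal(size=(16, seq_len, hidden_size)),
                rng.normal(size=(4, seq_len, hidden_size))]

cls_chunks = []
for last_hidden_state in fake_outputs:
    # all rows, token position 0 ([CLS]), all hidden units
    cls_chunks.append(last_hidden_state[:, 0, :])

features = np.concatenate(cls_chunks, axis=0)
print(features.shape)  # (20, 768)
```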


In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
model = BertModel.from_pretrained("bert-base-uncased").to(device)
fname = '/content/drive/MyDrive/data/icnale_bert_lexical/all_icnale_for_bert_training.csv'
dataset = _dataset(fname).make()
loader = DataLoader(dataset, batch_size=16, shuffle=False)

features = np.empty([0,768])
for batch in tqdm(loader):
  with torch.no_grad():
    last_hidden_states = model(batch['input_ids'],
                               batch['attention_mask'])['last_hidden_state']
    features = np.vstack((features,
                          last_hidden_states[:,0,:].cpu().numpy()))
    
features.shape

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 325/325 [01:40<00:00,  3.25it/s]


(5200, 768)

In [None]:
_model = 'sentence-transformers/all-mpnet-base-v2'
model = AutoModel.from_pretrained(_model).to(device)
tokenizer = AutoTokenizer.from_pretrained(_model)

# _dataset() needs the path of the CSV to embed; fname must point at that file
dataset = _dataset(fname).make()
loader = DataLoader(dataset, batch_size=16, shuffle=False)

def mean_pooling(token_embeddings, attention_mask):
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

# model = BertModel.from_pretrained("/content/drive/MyDrive/data/icnale_bert_lexical/r2_23/").to(device)
features = np.empty([0,768])
for batch in tqdm(loader):
  with torch.no_grad():
    last_hidden_state = model(batch['input_ids'],
                               batch['attention_mask'])['last_hidden_state']
    sentence_embeddings = mean_pooling(last_hidden_state, batch['attention_mask'])
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
    features = np.vstack((features, sentence_embeddings.detach().cpu().numpy()))


  0%|          | 0/2599 [00:00<?, ?ba/s]

100%|██████████| 163/163 [00:53<00:00,  3.07it/s]
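
The mean pooling step in the cell above averages the token embeddings of each document while skipping the padded positions, then scales each vector to unit length. Here is the same arithmetic in numpy on a tiny made-up example (3 tokens, 2 hidden units), so the numbers can be checked by hand:

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token vectors per document, counting only positions where the mask is 1."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # padded positions contribute 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third token is padding
    [[5.0, 5.0], [7.0, 9.0], [0.0, 2.0]],      # no padding
])
attention_mask = np.array([[1, 1, 0],
                           [1, 1, 1]])

pooled = mean_pooling(token_embeddings, attention_mask)
print(pooled[0])  # [2. 3.] -- the padded [100, 100] vector is ignored

# L2-normalize each row, like torch.nn.functional.normalize(..., p=2, dim=1)
normalized = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```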


In [None]:
df = pd.read_csv('/content/drive/MyDrive/data/icnale_bert_lexical/smj_valid.csv', index_col=0)
df2 = pd.read_csv('/content/drive/MyDrive/data/icnale_bert_lexical/all_icnale_for_bert_training.csv')
df2 = df2[df2['VST'].notna()].reset_index()
valid_inds = df2['Filename'][df2['Filename'].isin(df['Filename'])].index
print(valid_inds)

Int64Index([   1,   10,   14,   18,   22,   25,   28,   72,   74,   91,
            ...
            5037, 5049, 5050, 5058, 5068, 5088, 5093, 5101, 5119, 5133],
           dtype='int64', length=520)


In [None]:
def validation_set(features, valid_inds=None, linear=linear_model.LinearRegression()):
  target = dataset['label'].cpu().numpy()
  if valid_inds is None:
    valid_inds = pd.read_csv('/content/drive/MyDrive/data/icnale_bert_lexical/smj_valid.csv',
                             index_col=0).index
  train_ind = np.setdiff1d(range(len(target)), valid_inds)
  X_train = features[train_ind]
  X_test = features[valid_inds]
  y_train = target[train_ind]
  y_test = target[valid_inds]
  linear.fit(X_train, y_train)
  return linear.score(X_test, y_test)

def do_pca(feats, n_components=100):
  from sklearn.decomposition import PCA
  pca = PCA(n_components=n_components)
  return pca.fit_transform(feats)

feats = do_pca(features)
validation_set(feats, valid_inds=valid_inds, linear=linear_model.LinearRegression())

0.26314275462036074

scikit-learn makes the train/test split dead simple. Each document has 768 features to predict one label (VST).

We fit a regression line to the training sample and score it on the test sample.

I didn't set a random seed, but you should see an R2 in the 0.35 to 0.40 range if we train on all documents. It is lower if we just use the smoking prompt.
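
For anyone who wants to see what "fit on train, score on test" means numerically, here is a numpy-only sketch on synthetic data. The score is R2 = 1 - SS_res / SS_tot, which is what a scikit-learn regressor's .score() reports. The helper names and the synthetic data are my own; nothing here uses the ICNALE features.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, w1, w2, ...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Synthetic data: 5 features, a known linear relationship, and a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 4.0 + 0.1 * rng.normal(size=200)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

coef = fit_linear(X_train, y_train)
r2 = r_squared(y_test, predict(coef, X_test))
print(f"{r2:.3f}")
```

The synthetic data is almost perfectly linear, so the score is close to 1. Real embeddings are much noisier.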

In [None]:
def run_tests(features, targets, linear=linear_model.LinearRegression()):
  kf = KFold(n_splits=20, shuffle=True)
  res = []
  for train_index, test_index in kf.split(features):
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = targets[train_index], targets[test_index]
    linear.fit(X_train, y_train)
    res.append(linear.score(X_test, y_test)) # R Squared
  print(f'{linear}: {np.mean(res):.2f}')
  return res
# Convert the labels to a numpy array so they can be indexed with the fold indices
run_tests(features, dataset['label'].cpu().numpy(), linear=linear_model.Ridge())

Ridge(): 0.49


[0.4765100409608489,
 0.4520179983717878,
 0.5831952410763299,
 0.5682444446011823,
 0.577116103828585,
 0.4624321593947277,
 0.4132917806939189,
 0.5086643063546257,
 0.4343154359953054,
 0.5434955025526471,
 0.46564597616660364,
 0.47302140852525887,
 0.6025802114622456,
 0.5657156482526953,
 0.49937987635004266,
 0.3927063944387371,
 0.502011798260345,
 0.39412528271448244,
 0.4372541101830847,
 0.5474006152075844]
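
The 20 scores above come from 20 different train/test partitions. As a rough picture of how shuffled k-fold splitting carves up the row indices, here is a numpy sketch. The function `k_fold_indices` is my own approximation, not scikit-learn's KFold implementation.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits=20, seed=None):
    """Yield (train_idx, test_idx) pairs; every sample lands in exactly one test fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)  # near-equal chunks
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

splits = list(k_fold_indices(5200, n_splits=20, seed=0))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 20 4940 260
```

With 5,200 documents, each model is trained on 4,940 documents and scored on the remaining 260.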

In [None]:
outname = '/content/drive/MyDrive/data/icnale_bert_lexical/smj_bert_embeddings.csv'

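# df is only the validation subset; df2 has one row per embedded document, in dataset order
# (this assumes features were computed from all_icnale_for_bert_training.csv)
outdf = df2[['Filename', 'VST']]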
outdf = pd.concat([outdf, pd.DataFrame(features)], axis='columns')
outdf.to_csv(outname, index=False)