### Extracting BERT embeddings from texts for readability assessment.

In [None]:
!pip install -U sentence-transformers


In [None]:
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
import torch
import pickle
import numpy

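Read the contents of a corpus. The corpus should be a .txt file with one document per line. The code below expects each line to hold three semicolon-separated fields and uses the first as the title and the third as the text.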

In [None]:
titles = []
contents = []

# Each line is expected to look like: title;<second field>;text
with open("commoncore.txt", "r") as file:
  for line in file:
    # Split on the first two ';' only, so any ';' inside the text is preserved
    parsed_text = line.split(';', 2)
    text = parsed_text[2].strip()
    print(text)
    titles.append(parsed_text[0].strip())
    contents.append(text)

The policeman on the beat moved up the avenue impressively. The impressiveness was habitual and not for show, for spectators were few. The time was barely o'clock at night, but chilly gusts of wind with a taste of rain in them had well nigh de-peopled the streets. Trying doors as he went, twirling his club with many intricate and artful movements, turning now and then to cast his watchful eye adown the pacific thoroughfare, the officer, with his stalwart form and slight swagger, made a fine picture of a guardian of the peace. The vicinity was one that kept early hours. Now and then you might see the lights of a cigar store or of an all-night lunch counter but the majority of the doors belonged to business places that had long since been closed. When about midway of a certain block the policeman suddenly slowed his walk. In the doorway of a darkened hardware store a man leaned, with an unlighted cigar in his mouth. As the policeman walked up to him the man spoke up quickly. It's all rig
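
The loop above assumes every line has at least three semicolon-separated fields; a shorter line would raise an `IndexError`. A minimal sketch of a more defensive parser is shown here. The helper name `parse_corpus_line` and the sample lines are hypothetical, and the meaning of the middle field is not stated in this notebook.

```python
def parse_corpus_line(line):
    """Split one 'title;<middle field>;text' line; return (title, text) or None."""
    parts = line.split(';', 2)  # at most three fields, so ';' inside the text survives
    if len(parts) < 3:
        return None  # malformed line: the caller can skip or log it
    return parts[0].strip(), parts[2].strip()


# Hypothetical sample lines, not taken from commoncore.txt
sample_lines = [
    "A Title;x;First sentence; second clause.\n",
    "broken line without enough fields\n",
]
parsed = [p for p in map(parse_corpus_line, sample_lines) if p is not None]
titles_demo = [title for title, _ in parsed]
contents_demo = [text for _, text in parsed]
```

Skipping malformed lines keeps `titles_demo` and `contents_demo` aligned, which matters later when each embedding row has to be matched back to its title.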

Mean pooling: average the token embeddings, using the attention mask so that padding tokens are excluded from the average.

In [None]:
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask
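
To see why the mask matters, the same arithmetic can be worked through on a toy batch. This is only an illustrative sketch in NumPy rather than torch, and all values below are made up.

```python
import numpy as np

# Toy batch: 2 sequences, 3 token positions, 2 embedding dimensions.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

# Broadcast the mask over the embedding dimension (the role of unsqueeze/expand in torch).
mask = attention_mask[:, :, None].astype(float)
summed = (token_embeddings * mask).sum(axis=1)   # sum over real tokens only
counts = np.clip(mask.sum(axis=1), 1e-9, None)   # guard against division by zero
masked_mean = summed / counts

naive_mean = token_embeddings.mean(axis=1)       # wrongly includes the padding position
```

For the first sequence, `masked_mean` averages only the two real tokens, while `naive_mean` is pulled toward the padding vector.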

Load BERT models from the Hugging Face model repository.

In [None]:
# FOR FILIPINO
# tokenizer = AutoTokenizer.from_pretrained("jcblaise/bert-tagalog-base-cased")
# model = AutoModel.from_pretrained("jcblaise/bert-tagalog-base-cased")

# FOR ENGLISH
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Preprocess the texts and compute embeddings. Texts longer than 512 tokens are truncated.

In [None]:
#Tokenize sentences
encoded_input = tokenizer(contents, padding=True, truncation=True, max_length=512, return_tensors='pt')

#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

#Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
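
The cell above passes all of `contents` through the model in a single call, which pads every text to the longest one and may not fit in memory for larger corpora. One possible sketch is to process the texts in batches and stack the results. The names `embed_in_batches`, `embed_fn` and `batch_size` are hypothetical; `embed_fn` stands in for the tokenize, forward and pool steps, and a dummy function is used here so the block runs on its own.

```python
import numpy as np


def embed_in_batches(texts, embed_fn, batch_size=8):
    """Apply embed_fn to consecutive slices of texts and stack the results.

    embed_fn is assumed to map a list of strings to an array of shape
    (len(list), dim).
    """
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(embed_fn(batch)))
    return np.concatenate(chunks, axis=0)


def dummy_embed(batch):
    # Stand-in for the real model: a 2-dimensional "embedding" per text.
    return np.array([[len(t), t.count(" ")] for t in batch], dtype=float)


demo = embed_in_batches(["a b", "abcd", "x y z"], dummy_embed, batch_size=2)
```

With the objects in this notebook, `embed_fn` could wrap the `tokenizer(...)` call, the forward pass under `torch.no_grad()`, and `mean_pooling`, returning the result of `.numpy()`. Row order is preserved, so embeddings still line up with `titles`.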

Convert the embeddings to a NumPy array and show the embedding at index 0.

In [None]:
sentence_embeddings_np = sentence_embeddings.numpy()
sentence_embeddings_np[0]

array([ 3.00258428e-01, -2.55088925e-01, -4.44831848e-02,  2.43792981e-01,
        5.13282061e-01,  3.49847414e-02, -2.19882876e-01,  3.26985121e-02,
        8.45087543e-02,  7.38475844e-02,  6.68110251e-02,  3.10143769e-01,
       -1.87994301e-01,  2.98425049e-01, -1.29232749e-01, -1.97815269e-01,
        4.51567806e-02, -2.10220993e-01, -1.81306884e-01,  2.79641468e-02,
        1.04797877e-01, -4.71706912e-02, -1.86185449e-01,  9.06723365e-02,
        2.26074576e-01, -7.64455721e-02, -7.51836225e-02,  8.26824248e-01,
        8.53071362e-03,  2.60233492e-01, -1.04036227e-01, -2.09600888e-02,
       -9.93568450e-02, -2.24021181e-01,  1.23696715e-01, -9.36410874e-02,
       -1.32401615e-01,  3.53948295e-01, -5.06256409e-02,  1.63166493e-01,
        2.24918634e-01, -3.01274844e-03, -4.57667373e-02,  4.43920732e-01,
       -7.43953660e-02, -3.71251047e-01, -1.24648698e-01,  3.35631847e-01,
       -3.77991736e-01,  9.70294699e-02,  2.96499252e-01,  3.64546701e-02,
        1.38769373e-02, -

Save the embeddings in CSV format, one row per text, so they can be appended to CSV files of linguistic features for readability assessment.

In [None]:
numpy.savetxt('commoncore_sbert_embeddings.csv', sentence_embeddings_np, delimiter=',')
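
As a rough sketch of how the saved embeddings might be combined with linguistic features, the block below writes a small matrix with `savetxt`, reads it back with `loadtxt`, and stacks it next to a feature matrix. The arrays, the temporary file name and the feature values are all hypothetical stand-ins, not data from this notebook.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins: 3 documents, 4-dimensional embeddings, 2 linguistic features.
embeddings = np.arange(12, dtype=float).reshape(3, 4)
linguistic_features = np.array([[0.1, 5.0], [0.2, 6.0], [0.3, 7.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings_demo.csv")
    np.savetxt(path, embeddings, delimiter=",")
    reloaded = np.loadtxt(path, delimiter=",")

# Both matrices must list the documents in the same order.
combined = np.hstack([linguistic_features, reloaded])
```

The CSV written by `savetxt` has no header or title column, so the row order is the only link between an embedding and its document.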