# Generating BERT embeddings

The pretrained BERT model can be downloaded via https://github.com/hanxiao/bert-as-service. This work uses the "BERT Base, Uncased" model. Downloading and extracting the archive yields a folder named "uncased_L-12_H-768_A-12", which has to be placed under the relative path "data/bert_model/".

Next, the BERT service needs to be started. Open a new command prompt, change into the directory containing this notebook, and run the following command:

bert-serving-start -num_worker=1 -model_dir=data/bert_model/uncased_L-12_H-768_A-12/ -max_seq_len=512

The number of workers can be adjusted as needed; this work uses a single worker. The maximum sequence length is set to 512, the maximum number of tokens supported by the original BERT model. Longer sequences are cut off on the right; no other truncation strategy is applied. The reason is that the proportion of reviews longer than 100 tokens has already been shown to be small, so the number of reviews longer than 512 tokens is assumed to be negligible.
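The assumption about review lengths can be checked roughly before encoding. The following is a minimal sketch, not part of the original pipeline: `share_longer_than` is a hypothetical helper, and it counts whitespace-separated words, which typically undercounts the WordPiece tokens BERT actually sees, so the result is only a lower bound.

```python
def share_longer_than(texts, limit=512):
    """Return the fraction of texts with more than `limit` whitespace tokens."""
    if not texts:
        return 0.0
    too_long = sum(1 for text in texts if len(str(text).split()) > limit)
    return too_long / len(texts)


# toy reviews; in practice, pass the 'text' column of a review file
reviews = ["short review", "word " * 600, "another short one"]
print(share_longer_than(reviews, limit=512))
```

If the reported share is not close to zero, a different truncation strategy may be worth considering.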

Once the service is up, the message "all set, ready to serve request!" appears in the command prompt. This notebook can then be executed to generate the sentence embeddings.

Make sure that the directories "data/bert_embeddings/train/" and "data/bert_embeddings/test/" exist so that the train and test embeddings can be saved there.
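The output directories can also be created from Python instead of by hand. A minimal sketch, where `ensure_dirs` is a hypothetical helper name that does not appear in the original notebook:

```python
import os


def ensure_dirs(base="data/bert_embeddings", splits=("train", "test")):
    """Create one output directory per split and return the paths."""
    paths = [os.path.join(base, split) for split in splits]
    for path in paths:
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
    return paths
```

Calling `ensure_dirs()` once before running the encoding cell is enough; repeated calls leave existing directories untouched.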

In [1]:
# imports
import os
import re
import glob
import nltk
import numpy as np
import pandas as pd
import pickle as pkl
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from bert_serving.client import BertClient

In the following, the data is cleaned and then encoded into sentence embeddings. The data is cleaned again here because creating the BERT embeddings is rather time-consuming: cleaning and encoding the reviews file by file makes it easier to feed them into BERT in portions and to split up the embedding process. The sentence embeddings are saved in the same file structure as the raw data files (two files per domain: train and test).

In [2]:
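# create the stopwords list, keeping the negation words
# (note: this rebinds the name `stopwords`, so re-running this cell
# requires re-running the imports first)
stopwords = stopwords.words('english')
stopwords_keep = ['no', 'not', 'nor']
stopwords = set(stopwords).difference(stopwords_keep)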

# set the lemmatizer
lemmatizer = WordNetLemmatizer()

In [3]:
# data cleaning function
def data_cleaning(df):
    
    # repair word fragments left behind by expanded contractions (e.g. "ca n't" -> "can not")
    df['text'] = df['text'].apply(lambda x: ' '.join(['can' if word == 'ca' else word for word in str(x).split()]))
    df['text'] = df['text'].apply(lambda x: ' '.join(['will' if word == 'wo' else word for word in str(x).split()]))
    df['text'] = df['text'].apply(lambda x: ' '.join(['shall' if word == 'sha' else word for word in str(x).split()]))
    df['text'] = df['text'].apply(lambda x: ' '.join(['not' if word == 'nt' or word == "n't" else word for word in str(x).split()]))
    
    # remove punctuation & special characters
    df['text'] = df['text'].apply(lambda x: ' '.join(re.split(r'\W+', x)))
    df['text'] = df['text'].apply(lambda x: ' '.join(word for word in x.split() if word.isalnum()))

    # remove common nouns (NN, NNS) and cardinal numbers (CD)
    df['text'] = df['text'].astype(str).apply(lambda x: nltk.tag.pos_tag(x.split()))
    df['text'] = df['text'].apply(lambda x: ' '.join([word for word, tag in x if tag != 'NN' and tag != 'NNS' and tag != 'CD']))
 
    # remove stopwords
    df['text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords]))
    
    # lemmatize
    df['text'] = df['text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

    return df
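To illustrate the first two cleaning steps in isolation, here is a minimal sketch that works on a single string and needs no NLTK resources. `REPAIRS` and `repair_and_strip` are hypothetical names introduced only for this example; the mapping mirrors the replacements in `data_cleaning`.

```python
import re

# maps fragments left over from split contractions to full words
REPAIRS = {"ca": "can", "wo": "will", "sha": "shall", "nt": "not", "n't": "not"}


def repair_and_strip(text):
    """Repair contraction fragments, then drop punctuation."""
    words = [REPAIRS.get(word, word) for word in str(text).split()]
    # split on runs of non-word characters and keep only alphanumeric pieces
    pieces = re.split(r"\W+", " ".join(words))
    return " ".join(piece for piece in pieces if piece.isalnum())


print(repair_and_strip("I ca n't recommend it , wo n't buy again !"))
# prints: I can not recommend it will not buy again
```

The POS filtering, stopword removal and lemmatization in `data_cleaning` then operate on text of this form.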

In [4]:
# set the data path and output path (run this cell once for train and once for test)
data_path = 'data/uncleaned_data/train/*'
output_path = 'data/bert_embeddings/train/'

# start bert client
bc = BertClient()

# open each file in the directory data_path
for file_name in glob.glob(data_path):

    # read and clean the data
    df = pd.read_csv(file_name, delimiter='\t', names=["label", "text"], encoding='latin-1')
    df = data_cleaning(df)

    # drop reviews that are empty after cleaning
    df['text'] = df['text'].replace('', np.nan)
    df = df.dropna(subset=['text'])

    # encode and prepare final df
    sentence_list = df['text'].tolist()
    embeddings = bc.encode(sentence_list)
    # assign row by row via a list; a new DataFrame would be aligned on the index,
    # which has gaps after dropna and would misplace the embeddings
    df['embeddings'] = list(embeddings)
    df = df.drop('text', axis=1)

    # save to file
    output_file = os.path.join(output_path, os.path.basename(file_name) + '.pkl')
    with open(output_file, 'wb') as f:
        pkl.dump(df, f)
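The saved files can later be read back and turned into a feature matrix. The following is a minimal sketch under the assumption that each cell of the `embeddings` column holds one 1-D vector; `load_embeddings` is a hypothetical helper, and the toy frame with 768-dimensional zero vectors only stands in for a real output file.

```python
import os
import pickle as pkl
import tempfile

import numpy as np
import pandas as pd


def load_embeddings(path):
    """Load a pickled frame and return (labels, 2-D embedding matrix)."""
    with open(path, "rb") as f:
        df = pkl.load(f)
    # stack the per-review vectors into shape (n_reviews, dim)
    features = np.vstack(df["embeddings"].tolist())
    return df["label"].to_numpy(), features


# round trip with a toy frame standing in for a real output file
toy = pd.DataFrame({"label": [1, 0]})
toy["embeddings"] = list(np.zeros((2, 768), dtype=np.float32))
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.pkl")
    with open(toy_path, "wb") as f:
        pkl.dump(toy, f)
    labels, features = load_embeddings(toy_path)
print(features.shape)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.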