## Getting Representations for Phrasal VERBS
In this notebook, I generate all the necessary embeddings for each element of the phrasal verb:

1. the whole phrasal verb
2. the verb only
3. all non-verb elements in the phrasal verb

This notebook ~28 minutes to run (currently).

In [9]:
import numpy as np
import pandas as pd
import pickle
from miniconsbatched import generate_representations

from minicons import cwe

In [10]:
# Import data
data = pd.read_csv('data/pvc_data.csv')

In [11]:
# Isolating verbs and non-verb elements
data['verb_string_only'] = data['verbs_fixed'].apply(lambda x: eval(x)[0])
data['non_verbs_only'] = data['verbs_fixed'].apply(lambda x: ' '.join(eval(x)[1:]))

## Getting Representations for Phrasal Verbs
In this section, I get the BERT, GPT2, and RoBERTa representations for all of the phrasal verbs.

In [4]:
# Generate BERT Reps
vectors_bert, pairs_bert = generate_representations(data['verb_string'], data['sent_string'], layer=[0,3,6,9,12])

########################################
Running bert-base-cased for layer [0, 3, 6, 9, 12] !
Run complete.



In [6]:
# Saving BERT Reps
with open('bert_vectors_full.np', 'wb') as f:
    np.save(f, vectors_bert)

In [7]:
# Generate GPT2 Reps
vectors_gpt2, pairs_gpt2 = generate_representations(data['verb_string'], data['sent_string'], model='gpt2', layer=[0,3,6,9,12])

########################################
Running gpt2 for layer [0, 3, 6, 9, 12] !


Using pad_token, but it is not set yet.


Run complete.



In [8]:
# Saving GPT2 Reps
with open('gpt_vectors_full.np', 'wb') as gpt:
    np.save(gpt, vectors_gpt2)

In [9]:
# Generate roberta Reps
vectors_roberta, pairs_roberta = generate_representations(data['verb_string'], data['sent_string'], model='roberta-base', layer=[0,3,6,9,12])

########################################
Running roberta-base for layer [0, 3, 6, 9, 12] !
Run complete.



In [10]:
# Saving RoBERTa Reps
with open('roberta_vectors_full.np', 'wb') as rob:
    np.save(rob, vectors_roberta)

## Getting Representations for Verbs Only
In this section, I get the BERT, GPT2, and RoBERTa representations for _only_ the verb in each phrasal verb:

In [11]:
# Generate BERT Reps
vectors_bert_verb, pairs_bert_verb = generate_representations(data['verb_string_only'], data['sent_string'], layer=[0,3,6,9,12])

########################################
Running bert-base-cased for layer [0, 3, 6, 9, 12] !
Run complete.



In [12]:
# Saving BERT Reps
with open('bert_vectors_verb.np', 'wb') as bert_verb:
    np.save(bert_verb, vectors_bert_verb)

In [13]:
# Generate GPT2 Reps
vectors_gpt2_verb, pairs_gpt2_verb = generate_representations(data['verb_string_only'], data['sent_string'], model='gpt2', layer=[0,3,6,9,12])

########################################
Running gpt2 for layer [0, 3, 6, 9, 12] !


Using pad_token, but it is not set yet.


Run complete.



In [14]:
# Saving GPT2 Reps
with open('gpt_vectors_verb.np', 'wb') as gpt_verb:
    np.save(gpt_verb, vectors_gpt2_verb)

In [15]:
# Generate RoBERTa Reps
vectors_roberta_verb, pairs_roberta_verb = generate_representations(data['verb_string_only'], data['sent_string'], model='roberta-base', layer=[0,3,6,9,12])

########################################
Running roberta-base for layer [0, 3, 6, 9, 12] !
Run complete.



In [16]:
# Saving RoBERTa Reps
with open('roberta_vectors_verb.np', 'wb') as rob_verb:
    np.save(rob_verb, vectors_roberta_verb)

## Getting Representations for Non-Verbs Only
In this section, I get the BERT, GPT2, and RoBERTa representations for everything _except_ the verb in each phrasal verb:

In [17]:
# Generate BERT Reps
vectors_bert_nonverb, pairs_bert_nonverb = generate_representations(data['non_verbs_only'], data['sent_string'], layer=[0,3,6,9,12])

########################################
Running bert-base-cased for layer [0, 3, 6, 9, 12] !
Run complete.



In [18]:
# Saving BERT Reps
with open('bert_vectors_nonverb.np', 'wb') as bert_nonverb:
    np.save(bert_nonverb, vectors_bert_nonverb)

In [19]:
# Generate GPT2 Reps
vectors_gpt2_nonverb, pairs_gpt2_nonverb = generate_representations(data['non_verbs_only'], data['sent_string'], model='gpt2', layer=[0,3,6,9,12])

########################################
Running gpt2 for layer [0, 3, 6, 9, 12] !


Using pad_token, but it is not set yet.


Run complete.



In [20]:
# Saving GPT2 Reps
with open('gpt_vectors_nonverb.np', 'wb') as gpt_nonverb:
    np.save(gpt_nonverb, vectors_gpt2_nonverb)

In [21]:
# Generate RoBERTa Reps
vectors_roberta_nonverb, pairs_roberta_nonverb = generate_representations(data['non_verbs_only'], data['sent_string'], model='roberta-base', layer=[0,3,6,9,12])

########################################
Running roberta-base for layer [0, 3, 6, 9, 12] !
Run complete.



In [22]:
# Saving RoBERTa Reps
with open('roberta_vectors_nonverb.np', 'wb') as rob_nonverb:
    np.save(rob_nonverb, vectors_roberta_nonverb)

The following code loads in some of the representations:

In [23]:
# RoBERTa Reps
with open('roberta_vectors_nonverb.np', 'rb') as test_rob:
    test_rob_loaded = np.load(test_rob)

# Master Function

In the following code blocks, I construct a master function that gets embeddings (full phrasal verb, verb only, nonverb only) for all 12 layers of BERT, GPT2, RoBERTa, XLNet, DistilGPT, and DistilBERT. 

On my laptop (Dell XPS 13 from 2020; no GPUs), the code takes this long to run:

In [12]:
models = ['bert-base-cased', 'roberta-base', 'gpt2', 'distilbert-base-cased', 'distilgpt2', 'xlnet-base-cased']
embedding_types = ['full_verb', 'verb_only', 'nonverb_only']
verb_strings = np.array((data['verb_string'], data['verb_string_only'], data['non_verbs_only']))
sent_strings = data['sent_string']
layers = [i for i in range(13)]

In [13]:
def build_and_save_representations(model, embedding_type, verb_strings, sent_strings, layer):
    vectors, pairs = generate_representations(verb_strings, sent_strings, model=model, layer=layer)

    with open(f'{model}_{embedding_type}_vectors.np', 'wb') as out:
        np.save(out, vectors)

In [19]:
def representation_pipeline(models=models, embedding_types=embedding_types, verb_strings=verb_strings, sent_strings=sent_strings, layers=layers):
    for model in models:
        print('LAYERS:', layers)
        input_layers = layers
        print('INPUT LAYERS BEFORE MODEL CHECK:', input_layers)
        if model == 'distilbert-base-cased' or 'distilgpt2':
            input_layers = input_layers[:7]
        print('INPUT LAYERS AFTER MODEL CHECK:', input_layers)
        print(f'################# WORKING ON {model} #################')
        for i in range(len(embedding_types)):
            print(f'### BUILDING {embedding_types[i]} ###')
            build_and_save_representations(model, embedding_types[i], verb_strings[i],
                                            sent_strings, input_layers)
        print(f'################# {model} COMPLETE #################')
        print()
        print()
        print()

In [20]:
# The whole shibang!
representation_pipeline(models=['distilgpt2', 'bert-base-cased'], embedding_types=['full_verb'])

LAYERS: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
################# WORKING ON distilgpt2 #################
### BUILDING full_verb ###
########################################
Running distilgpt2 for layer [0, 1, 2, 3, 4, 5, 6] !


Using pad_token, but it is not set yet.


Run complete.

################# distilgpt2 COMPLETE #################



LAYERS: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
################# WORKING ON bert-base-cased #################
### BUILDING full_verb ###
########################################
Running bert-base-cased for layer [0, 1, 2, 3, 4, 5, 6] !


KeyboardInterrupt: 

########################################
Running roberta-base for layer [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] !
Run complete.

