# Machine Learning and Conceptual Complexity - Computational Methods for Music, Media, and Minds REU

By Jens Kipper, Max Barlow, Sara Jo Jeiter-Johnson, and Tianyi Ma

## Introduction

Our research investigates whether word embeddings from language models carry information about conceptual complexity.

# Preparing Language Models

## Gensim

### Imports

In [2]:
import os
import numpy as np

# Get the interactive Tools for Matplotlib
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from sklearn.metrics.pairwise import cosine_similarity

### Introduction

"For looking at word vectors, I'll use Gensim. We also use it in hw1 for word vectors. Gensim isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient and scalable, and quite widely used.

Our homegrown Stanford offering is GloVe word vectors. Gensim doesn't give them first-class support, but allows you to convert a file of GloVe vectors into word2vec format. You can download the GloVe vectors from [the GloVe page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip).

(I use the 100d vectors below as a mix between speed and smallness vs. quality. If you try out the 50d vectors, they basically work for similarity but clearly aren't as good for analogy problems. If you load the 300d vectors, they're even better than the 100d vectors.)" -Tianyi
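
The conversion mentioned above is small in practice: as far as we understand it, the word2vec text format is the GloVe text format with one extra header line giving the vocabulary size and the vector dimension. The sketch below illustrates that idea on a tiny made-up file. The helper name `add_word2vec_header` and the toy vectors are assumptions for illustration, not part of Gensim or of the real GloVe data.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prefixed with a word2vec header line."""
    # Reads the whole file into memory, which is fine for a toy file.
    with open(glove_path, "r", encoding="utf-8") as src:
        lines = src.readlines()
    # Each GloVe line is "<word> <v1> <v2> ... <vd>", so the dimension is
    # the number of fields minus one.
    n_words = len(lines)
    n_dims = len(lines[0].rstrip().split(" ")) - 1
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{n_words} {n_dims}\n")
        dst.writelines(lines)
    return n_words, n_dims


# Tiny made-up vectors, for illustration only (not real GloVe values).
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_word2vec.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("man 0.1 0.2 0.3\n")
    f.write("husband 0.2 0.1 0.4\n")

shape = add_word2vec_header(toy_glove, toy_w2v)
print(shape)  # (2, 3)
```

The returned pair has the same meaning as the `(400000, 100)` that the conversion reports for the real 100d file. In Gensim 4.0 and later, `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)` should make the separate conversion step unnecessary.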

### Loading word2vec

100d

In [3]:
glove_file = datapath(os.path.abspath("glove.6B/glove.6B.100d.txt"))
word2vec_glove_file = get_tmpfile("glove.6B.100d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)

(400000, 100)

300d

In [4]:
glove_file_300d = datapath(os.path.abspath("glove.6B/glove.6B.300d.txt"))
word2vec_glove_file_300d = get_tmpfile("glove.6B.300d.word2vec.txt")
glove2word2vec(glove_file_300d, word2vec_glove_file_300d)

(400000, 300)

Creating model

In [5]:
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)
model_300d = KeyedVectors.load_word2vec_format(word2vec_glove_file_300d)

## BERT

### Introduction

We use a pre-trained BERT language model to create word vectors and compare them by cosine similarity.

### Imports

In [None]:
from transformers import BertTokenizer, BertModel
import pandas as pd
import numpy as np
import nltk
import torch
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

### Model

In [None]:
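# Note: this rebinds `model`, the name the Gensim section uses for the
# 100d GloVe vectors. Reload those vectors before calling the GloVe functions.
model = BertModel.from_pretrained('bert-large-uncased',
                                  output_hidden_states=True,
                                  )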
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')



Vocab

In [None]:
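# Read one word per line; the file is closed automatically
with open("commonWords.txt", "r") as f:
    vocab = f.read().split("\n")

# Deduplicate (np.unique also sorts) and drop the empty string left by blank lines
vocab = [str(w) for w in np.unique(np.array(vocab)) if w]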

Text prep and embeddings

In [None]:
def bert_text_preparation(text, tokenizer):
    """Prepare the input for BERT.

    Takes a string and performs pre-processing: adding special
    tokens, tokenizing, converting tokens to ids, and building
    segment ids. All tokens are mapped to segment id = 1.

    Args:
        text (str): Text to be converted
        tokenizer (obj): Tokenizer object to convert text
            into BERT-readable tokens and ids

    Returns:
        list: List of BERT-readable tokens
        obj: Torch tensor with token ids
        obj: Torch tensor with segment ids
    """
    marked_text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [1]*len(indexed_tokens)

    # Convert inputs to PyTorch tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    return tokenized_text, tokens_tensor, segments_tensors
    
def get_bert_embeddings(tokens_tensor, segments_tensors, model):
    """Get embeddings from an embedding model

    Args:
        tokens_tensor (obj): Torch tensor size [1, n_tokens]
            with token ids for each token in text
        segments_tensors (obj): Torch tensor size [1, n_tokens]
            with segment ids for each token in text
        model (obj): Embedding model to generate embeddings
            from token and segment ids

    Returns:
        list: List of list of floats of size
            [n_tokens, n_embedding_dimensions]
            containing embeddings for each token

    """

    # Gradient calculation is disabled for inference
    with torch.no_grad():
        # The second positional argument of BertModel.forward is attention_mask,
        # so the all-ones segment ids act as an attention mask here
        outputs = model(tokens_tensor, segments_tensors)
        # outputs[2] holds the hidden states of every layer.
        # The first entry is the input embedding state, so drop it.
        hidden_states = outputs[2][1:]

    # Getting embeddings from the final BERT layer
    token_embeddings = hidden_states[-1]
    # Removing the batch dimension: [1, n_tokens, dims] -> [n_tokens, dims]
    token_embeddings = torch.squeeze(token_embeddings, dim=0)
    # Converting torch tensors to lists
    list_token_embeddings = [token_embed.tolist() for token_embed in token_embeddings]

    return list_token_embeddings
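
The indexing in `get_bert_embeddings` is easier to follow with the shapes written out. The sketch below mimics the steps with random NumPy arrays standing in for torch tensors. The sizes are toy values chosen for illustration (bert-large-uncased itself has 24 layers and 1024 dimensions), and the name `all_states` is a hypothetical stand-in for `outputs[2]`.

```python
import numpy as np

# Toy sizes for illustration only.
n_layers, n_tokens, n_dims = 4, 3, 8
rng = np.random.default_rng(0)

# Stand-in for outputs[2]: one array for the input embeddings plus one per
# layer, each of shape [batch=1, n_tokens, n_dims].
all_states = tuple(
    rng.normal(size=(1, n_tokens, n_dims)) for _ in range(n_layers + 1)
)

hidden_states = all_states[1:]                      # drop the input embedding state
last_layer = hidden_states[-1]                      # shape [1, n_tokens, n_dims]
token_embeddings = np.squeeze(last_layer, axis=0)   # shape [n_tokens, n_dims]
list_token_embeddings = [row.tolist() for row in token_embeddings]

print(len(hidden_states), token_embeddings.shape)  # 4 (3, 8)
```

The result is one list of floats per token, including the `[CLS]` and `[SEP]` positions.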

Get embeddings for vocab

In [None]:
# Getting embeddings for the target word
target_word_embeddings = []

for text in vocab:
    tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(text, tokenizer)
    list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)

    # Drop [CLS] (first) and [SEP] (last); the rest are the word's WordPiece tokens.
    # A word split into several pieces is represented by the element-wise
    # mean of its pieces, so every word vector has the same length.
    piece_embeddings = np.array(list_token_embeddings[1:-1])
    word_embedding = piece_embeddings.mean(axis=0).tolist()

    target_word_embeddings.append(word_embedding)

# Word -> embedding lookup, the form in which `model_bert` is indexed later on
model_bert = dict(zip(vocab, target_word_embeddings))
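
BERT's tokenizer can split a single vocabulary word into several WordPiece tokens, so the pieces have to be combined into one vector per word. One common option is an element-wise mean over the pieces. The sketch below uses made-up 4-dimensional embeddings to show that idea, and also why plain Python lists need care here: `+` on lists concatenates instead of adding.

```python
import numpy as np

# Made-up 4-dimensional embeddings, for illustration only.
list_token_embeddings = [
    [0.0, 0.0, 0.0, 0.0],   # [CLS]
    [1.0, 2.0, 3.0, 4.0],   # first word piece
    [3.0, 2.0, 1.0, 0.0],   # second word piece
    [9.0, 9.0, 9.0, 9.0],   # [SEP]
]

# `+` on Python lists concatenates, so the result has 8 entries, not 4.
concatenated = list_token_embeddings[1] + list_token_embeddings[2]

# Element-wise mean over the word pieces only (special tokens excluded).
pieces = np.array(list_token_embeddings[1:-1])
word_embedding = pieces.mean(axis=0).tolist()

print(len(concatenated))  # 8
print(word_embedding)     # [2.0, 2.0, 2.0, 2.0]
```

Averaging keeps every word vector at the model's hidden size, regardless of how many pieces the word was split into.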

# Testing Methods

### Comparing Similarity

#### Gensim Functions

In [None]:
# Use the 100d model and cosine_similarity to calculate the similarity between a
# defining phrase and the word.
#
# You can use doc2vec for this, but there aren't any good pre-trained doc2vec models
# and it takes too long to train my own model, so I just averaged the two defining
# words. This should be a reasonable approximation of the phrase.
#
# The Phrases model for word2vec doesn't support extracting individual word vectors

def calculate_similarity(x1, x2, x3):
    phrase = ((model[x1] + model[x2])/2).reshape(1, -1)
    word = model[x3].reshape(1, -1)
    result = cosine_similarity(phrase, word)
    return result
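
To make the phrase-averaging idea concrete, here is a minimal sketch with hand-picked 3-dimensional vectors in place of GloVe embeddings. The dictionary `toy_vectors`, its values, and the helper `toy_similarity` are all hypothetical. They are chosen so that the average of "male" and "spouse" points in the same direction as "husband".

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for GloVe embeddings.
toy_vectors = {
    "male": np.array([1.0, 0.0, 1.0]),
    "spouse": np.array([0.0, 1.0, 1.0]),
    "husband": np.array([1.0, 1.0, 2.0]),
    "banana": np.array([1.0, -1.0, 0.0]),
}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def toy_similarity(x1, x2, x3, vectors=toy_vectors):
    """Similarity between the averaged phrase (x1, x2) and the word x3."""
    phrase = (vectors[x1] + vectors[x2]) / 2
    return cosine(phrase, vectors[x3])


print(round(toy_similarity("male", "spouse", "husband"), 3))  # 1.0
print(round(toy_similarity("male", "spouse", "banana"), 3))   # 0.0
```

With real embeddings the scores fall between these extremes, and a higher score suggests that the two defining words capture more of the target word.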

#### BERT functions

In [None]:
def calculate_similarity_bert(x1, x2, x3):
    phrase = (np.array(model_bert[x1]) + np.array(model_bert[x2])).reshape(1, -1)
    word = np.array(model_bert[x3]).reshape(1, -1)
    result = cosine_similarity(phrase, word)
    return result
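
`calculate_similarity_bert` sums the two phrase vectors, while the Gensim version averages them. For cosine similarity this makes no difference, because rescaling a vector does not change its direction. A quick check on random vectors (a sketch, with arbitrary sizes and seed):

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
a, b, w = rng.normal(size=(3, 16))  # three random 16-dimensional vectors

sum_score = cosine(a + b, w)          # phrase as a sum
mean_score = cosine((a + b) / 2, w)   # phrase as an average
print(np.isclose(sum_score, mean_score))  # True
```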

### Comparing Complexity

In [None]:
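# sum of the absolute values of the word vector's features

def sum_complexity(x1):
    models = ['BERT', 'GloVe_100d', 'GloVe_300d']
    data = []
    try:
        # abs() is applied for every model so the three sums are comparable
        data = [sum(abs(x) for x in model_bert[x1]),
                sum(abs(x) for x in model[x1]),
                sum(abs(x) for x in model_300d[x1])]
    except KeyError:
        print("Key Error")
    return dict(zip(models, data))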


# fraction of features that carry significant information
# threshold = how far away a feature is from 0 (0 <= threshold <= 1). default = 0.5

def above_zero_complexity(x1, threshold = 0.5):
    models = ['BERT', 'GloVe_100d', 'GloVe_300d']
    bert = 0
    glove_100d = 0
    glove_300d = 0
    try:
        # abs() measures distance from 0, so large negative features count too
        bert = len([x for x in model_bert[x1] if abs(x) >= threshold])/len(model_bert[x1])
        glove_100d = len([x for x in model[x1] if abs(x) >= threshold])/len(model[x1])
        glove_300d = len([x for x in model_300d[x1] if abs(x) >= threshold])/len(model_300d[x1])
    except KeyError:
        print("Key Error")
    data = [bert, glove_100d, glove_300d]
    return dict(zip(models, data))

def calculate_complexity(words):
    for word in words:
        print(word)
        print("Sum Complexity: \n", sum_complexity(word))
        print("Above Zero Complexity: (threshold = 0.5)\n", above_zero_complexity(word))
        print("Above Zero Complexity: (threshold = 0.3)\n", above_zero_complexity(word, 0.3))
        print()

def compare_complexity(x, y):
    c1 = above_zero_complexity(x)
    c2 = above_zero_complexity(y)
    print(x, c1)
    print(y, c2)
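
As a worked example of the threshold measure, the sketch below applies it to a single made-up vector. It takes "how far away a feature is from 0" literally and uses the absolute value. The function name `fraction_above_threshold` and the toy values are assumptions for illustration.

```python
import numpy as np


def fraction_above_threshold(vector, threshold=0.5):
    """Fraction of features whose absolute value is at least `threshold`."""
    vector = np.asarray(vector, dtype=float)
    return float(np.mean(np.abs(vector) >= threshold))


# Made-up 8-dimensional vector, for illustration only.
toy_vector = [0.9, -0.7, 0.1, 0.0, -0.2, 0.5, 0.3, -0.4]

print(fraction_above_threshold(toy_vector))       # 0.375
print(fraction_above_threshold(toy_vector, 0.3))  # 0.625
```

Three of the eight features reach 0.5 in magnitude and five reach 0.3, so lowering the threshold raises the score. Comparisons between words are therefore only meaningful at a fixed threshold.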

# Results

In [None]:
compare_complexity('man','husband')