# COGS 118B - Project Proposal

# Names

- Anna Sim
- Jing Yin (Trevor) Yip
- Kevin Fisher
- Ashesh Kaji

# Abstract 
This project aims to enhance the interpretability of relationships within word embeddings generated by large language models (LLMs), specifically focusing on Word2Vec embeddings. Understanding LLMs is crucial, especially as they are increasingly used in high-risk environments where comprehension of their predictions is vital. While prior studies have identified analogical relationships between words within these embeddings, these observations have been limited and hand-picked.

To address this, we propose a clustering approach on the relationships between word pairs within the dataset. This allows distinct groups of similar relationships to emerge, offering insights into related points within each cluster and potentially revealing unconsidered relationships. We calculate relationships between words by taking the vector difference and ensure comparability through hyperplane splitting.

The solution's performance will be evaluated using the silhouette score, which measures cluster density and separation. Additionally, we will manually inspect clusters to ensure known relationships (e.g., capital/country) are grouped logically. This project poses minimal privacy risks but necessitates careful consideration of existing biases within the Word2Vec dataset, which stem from its training data and could lead to biased analogical reasoning capabilities.

# Background

In recent years, there has been a drastic increase in the use of large language models (LLMs). LLMs have accelerated the development of tools like recommendation systems, sentiment analysis, and machine translation. The increased use of language models requires further analysis and interpretation work to ensure responsible use of these architectures and models. High-risk environments require not only highly accurate predictions but also an understanding of why the model makes certain predictions<a name="molnar"></a>[<sup>[4]</sup>](#molnarnote). As users rely more and more on machine learning systems to make decisions, there is an increasing risk of misuse and misinterpretation of the results and capabilities of these systems. Therefore, there is a demand for interpretability research in both academia and industry.

Interpretability of results of language models depends on understanding the relationships between semantic and syntactic content of words and their vector representations. Word2Vec is a popular algorithm developed by Tomas Mikolov et al. in 2013, which revolutionized the world of language models by proposing a new method of creating such continuous vector representations of words<a name="mikolov1"></a>[<sup>[2]</sup>](#mikolov1note). Preliminary analysis of Word2Vec embeddings showed that the vector representation captures relationships between the words. In the original paper, the authors showed how similar words are not only represented with vectors that have high cosine similarity, but also exhibit complex relationships like vector(“smallest”) = vector(“biggest”) - vector(“big”) + vector(“small”). They also show intricate semantic relationships, like country/city relationships and syntactic relationships from adjectives to adverbs<a name="mikolov1"></a>[<sup>[2]</sup>](#mikolov1note).
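
As a minimal sketch of this vector arithmetic, the following uses hand-made two-dimensional toy vectors (an assumption for illustration, not real Word2Vec embeddings) in which one axis encodes size and the other the superlative form. The helper names `cosine` and `analogy` are hypothetical, and the query words are excluded from the candidates, which is a common convention in analogy evaluation.

```python
import math

# Toy 2-D vectors invented for illustration; real Word2Vec embeddings
# are 300-dimensional and learned from text.
# Axis 0: size (big > 0, small < 0). Axis 1: superlative form.
toy_vectors = {
    "big": (1.0, 0.0),
    "biggest": (1.0, 1.0),
    "small": (-1.0, 0.0),
    "smallest": (-1.0, 1.0),
    "large": (0.9, 0.1),   # distractor
    "tiny": (-1.2, 0.1),   # distractor
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vector(b) - vector(a) + vector(c).

    The three query words themselves are excluded from the candidates.
    """
    target = tuple(
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# "big" is to "biggest" as "small" is to ...?
result = analogy("big", "biggest", "small", toy_vectors)
print(result)  # smallest
```

In this toy space the target vector coincides exactly with "smallest"; with real embeddings the match is only approximate, which is why the nearest neighbor under cosine similarity is used.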

In the follow-up paper, “Distributed Representations of Words and Phrases and their Compositionality,” Mikolov et al. introduce an extension to the model that learns distributed representations of phrases as well as individual words, which allows for more robust vector representations and captures idiomatic phrases<a name="mikolov2"></a>[<sup>[3]</sup>](#mikolov2note). The paper showed connections from cities to their respective newspaper outlets and countries to their airlines<a name="mikolov2"></a>[<sup>[3]</sup>](#mikolov2note).

However, research by Tal Linzen showed that there are potential limitations in relying on cosine similarity as evidence of analogical reasoning, and that the analogous relationships presented in the original papers break down when they are reversed<a name="linzen"></a>[<sup>[1]</sup>](#linzennote). The apparent analogical reasoning can be partially explained by the model picking the nearest neighbor of one of the query words rather than using the relative difference within the analogous pair<a name="linzen"></a>[<sup>[1]</sup>](#linzennote). Therefore, there is a need for further research looking into the relative neighborhoods and analogical reasoning capabilities of Word2Vec models. 
We will present an extensive analysis of the clustering of differences between pairs of Word2Vec embeddings, in order to examine the semantic and vector representation relationship of the words, gain insight into what drives the apparent vector-based analogical reasoning, and expand the potential areas of research on the interpretability of language models.

# Problem Statement

It is well known that embeddings generated by Word2Vec models respect relationships between words. For example, the differences between the vectors for countries and their respective capitals are very similar between all such pairs. In this sense, we get “relationship vectors”, which capture an entire relationship with one vector, and allow for reasoning by analogy. However, most observations of this have been hand-picked, and have considered only a couple of possible relationships. As such, we would like to tackle the issue of investigating all possible “relationship vectors”, so that we are not leaving out potentially overlooked members of these relationships or overlooked relationships entirely.

# Data

We will be using a set of [Word2Vec embeddings](https://code.google.com/archive/p/word2vec/), namely the [Google News negative sampling dataset](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) that was trained in the original paper. This is the dataset in which analogical vector behavior was first identified, so it is a natural candidate. In its raw form, each observation in this dataset is the 300-dimensional vector embedding of a word in the vocabulary, listed in frequency order. Each individual dimension doesn’t represent anything in particular, but together they can capture the relationships between large vocabularies of words. This particular dataset has a vocabulary of 3,000,000 words and phrases (observations). Since we are going to be looking at every pair of words, and the number of pairs grows quadratically with the size of the vocabulary, we will be restricting our data to roughly the 10,000 most frequent points.
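
To make the scale of the pairwise approach concrete, the arithmetic below is a rough sketch that counts the pairs for a 10,000-word vocabulary. It assumes unordered pairs (each pair counted once) and 4-byte `float32` storage, neither of which is fixed by the proposal.

```python
import math

vocab_size = 10_000   # most frequent words kept, as described above
dims = 300            # dimensionality of each embedding
bytes_per_float = 4   # assumption: float32 storage

# Unordered pairs: n choose 2 = n * (n - 1) / 2
num_pairs = math.comb(vocab_size, 2)

# Memory needed to hold one difference vector per pair all at once
total_bytes = num_pairs * dims * bytes_per_float

print(f"{num_pairs:,} pairs")                      # 49,995,000 pairs
print(f"{total_bytes / 1e9:.1f} GB of differences")  # 60.0 GB of differences
```

Even at 10,000 words, holding every difference vector in memory at once would take about 60 GB under these assumptions, so sampling pairs or processing them in batches may be necessary.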

# Proposed Solution

Our proposed solution to increase the interpretability of embedding relationships is to perform a clustering on the relationships between pairs of words within the dataset. This solution was chosen because it is difficult to examine all relationships manually, and clustering will allow distinct groups of similar relationships to emerge. By examining the points within a cluster that already contains pairs known to be related (as stated in previous literature), we can gain insight into the other points within the same cluster. Furthermore, the unlabelled nature of clusters might even draw our attention to relationships that were not previously considered. A relationship between two words will be calculated by taking the difference between their corresponding vectors. Because the two possible orderings of a pair yield difference vectors that are negatives of each other, we will ensure comparability by splitting all resulting vectors using a hyperplane and inverting (negating) the vectors on one side of it, so that both orderings map to the same vector. By inspecting the relationships in the optimal clustering, we hope to glean additional knowledge about how words are represented and their semantic meanings within the embedding vector space.
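
The difference-and-invert step can be sketched as follows. This is one possible reading, assuming a hyperplane through the origin defined by a normal vector; the helper names, the toy three-dimensional vectors, and the particular choice of `normal` are all hypothetical.

```python
def difference(u, v):
    """Relationship vector u - v for two equal-length embeddings."""
    return tuple(x - y for x, y in zip(u, v))


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def canonicalize(diff, normal):
    """Negate diff if it lies on the negative side of the hyperplane
    through the origin with the given normal vector.

    Vectors lying exactly on the hyperplane (dot product of zero) are
    left unchanged here; a real pipeline would need a tie-breaking rule.
    """
    if dot(diff, normal) < 0:
        return tuple(-x for x in diff)
    return diff


# Toy 3-D embeddings invented for illustration only.
france = (2.0, 1.0, 0.5)
paris = (1.0, 3.0, 0.0)
normal = (1.0, 0.0, 0.0)  # hypothetical choice of hyperplane

forward = canonicalize(difference(paris, france), normal)
backward = canonicalize(difference(france, paris), normal)

# Both orderings of the pair now map to the same relationship vector.
print(forward == backward)  # True
```

Without this step, each relationship would appear twice in the data as mirror images, which could split a single relationship across two opposite clusters.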

# Evaluation Metrics

An evaluation metric that can be used to quantify the performance of our solution is the silhouette coefficient. The silhouette coefficient of a sample is defined based on:
- a: the mean distance to all other points in the same cluster
- b: the mean distance to all points in the nearest neighboring cluster
- $s = \frac{b - a}{\max(a, b)}$

To compute the score for a whole clustering, we take the average score across all points. This score ranges from -1 to 1, where a higher score represents dense and well-separated clusters. Although the silhouette score can give a good approximation of the optimal number of clusters, it is not perfect, so we will also be inspecting the clusters and checking whether pairs that we already know have some kind of relationship (e.g. capital/country, man/woman, etc.) are getting put into reasonable clusters.
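
The definition above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions (Euclidean distance, at least two clusters, a score of 0 for singleton clusters), not the code the project will necessarily use; the function names and toy points are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_samples(points, labels):
    """Per-point silhouette coefficients s = (b - a) / max(a, b).

    Assumes at least two distinct cluster labels.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points."""
    samples = silhouette_samples(points, labels)
    return sum(samples) / len(samples)


# Two well-separated groups of toy 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # groups cut across it
print(good > 0 > bad)  # True
```

On these toy points the labeling that matches the geometry scores close to 1, while the labeling that cuts across the two groups scores below 0, which is the behavior the metric is meant to capture.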

# Ethics & Privacy

In general, this is a relatively low-risk project in terms of privacy, since the observations are embeddings of individual words and not data gathered about individuals. The main concern with respect to people is that there are phrases in the vocabulary, and some of those phrases are the names of people. However, looking at the phrases among the 10,000 most frequently used words, we see only extremely famous names, like presidents and other highly public figures. For such people, the possible impact of our analysis is comparatively small, and in any case they will not be the main focus of our investigation, since they make up such a small proportion of the vocabulary.

The other main ethics concern has to do with bias. There are known existing biases in the Word2Vec dataset we are working with; for example, the vector consisting of woman minus man plus doctor moves closer to nurse (although notably the closest vector is still doctor). We also see other biases, such as a noted US-centrism in terms of the most frequent vectors, as well as the vectors closest to concepts such as “country”. This comes from inherent biases in the training dataset (Google News), probably influenced by the combination of Google being a US-based company and the dataset being primarily in the English language. In some sense, investigating these biases and how they affect the analogical reasoning capabilities of Word2Vec is one of the goals of this project, but it is still important to keep them in mind beforehand, so that we do not misinterpret or overgeneralize results that are biased in these ways.

# Team Expectations 

* Communication among team members
* Try to show up to most meetings
* Be explicit when trying to deal with conflict; it's better to have it out than to cause tension
* Set expectations for each member so there is maximum efficiency and less confusion about work
* Set deadlines for each member so that we can keep each other accountable
* Be open to criticism and feedback

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/21  | 4 PM | Get dataset into the necessary format (pairs) | Any initial observations | 
| 2/28  | 4 PM | Finish EDA | Discuss results and whether we are seeing the expected patterns | 
| 3/6   | 4 PM | Explore clustering algorithms | Optimal clustering algorithm and/or parameters |
| 3/13  | 4 PM | Finalize clustering, explore results | Discuss results and assign final project parts |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="linzennote"></a>1.[^](#linzen): Linzen, Tal. “Issues in Evaluating Semantic Spaces Using Word Analogies.” arXiv.Org, 24 June 2016, arxiv.org/abs/1606.07736

<a name="mikolov1note"></a>2.[^](#mikolov1): Mikolov, Tomas, et al. “Efficient Estimation of Word Representations in Vector Space.” arXiv.Org, 7 Sept. 2013, arxiv.org/abs/1301.3781

<a name="mikolov2note"></a>3.[^](#mikolov2): Mikolov, Tomas, Ilya Sutskever, et al. “Distributed Representations of Words and Phrases and Their Compositionality.” arXiv.Org, 16 Oct. 2013, arxiv.org/abs/1310.4546

<a name="molnarnote"></a>4.[^](#molnar): Molnar, Christoph. “Interpretable Machine Learning.” 3.1 Importance of Interpretability, 21 Aug. 2023, christophm.github.io/interpretable-ml-book