# Semantic Subspace plotter for people who don't want to do any coding

This notebook contains all the code you should need to take a text document, train a word2vec model on it, and then search for a semantic subspace in that model, following the method outlined in the very cool paper 'Semantic projection recovers rich human knowledge of multiple object features from word embeddings': https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10349641/. You can change parameters at various points if you want to, but in theory anyone should be able to use it as is.

In [None]:
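#For cleaning the data
#The tokenisers need NLTK's tokeniser data; if you get a LookupError, run the nltk.download(...) command that the error message suggests
import nltk
import string
from nltk.tokenize import sent_tokenize, word_tokenize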

#For training the network
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.models import KeyedVectors

#For representing/reducing the data
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import adjustText
import seaborn as sns

#Because you always need numpy (and pandas, which the plotting function uses)
import numpy as np
import pandas as pd

The first bit of code takes the text from your document and does some basic cleaning and tokenisation to make it ready for training the model: it lowercases everything, splits the text into sentences, and splits each sentence into words.
All you need to do is put the path to your document into the open() function.

In [None]:
with open('your_document_here.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()
    
corpus_lower = corpus.lower()
data = [word_tokenize(t) for t in sent_tokenize(corpus_lower)]
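
If you want to see the shape `data` ends up in before training anything, here is a minimal sketch. It uses a rough regular-expression tokeniser as a stand-in for NLTK (the helper name `simple_tokenise` is made up for this example), so it runs without any downloads. One difference worth knowing: NLTK keeps punctuation marks as separate tokens, while this stand-in drops them.

```python
import re

def simple_tokenise(text):
    """Rough stand-in for sent_tokenize + word_tokenize (hypothetical helper, not NLTK)."""
    # Split into sentences wherever ., ! or ? is followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip().lower())
    # Keep runs of letters, digits and apostrophes as tokens
    return [re.findall(r"[a-z0-9']+", s) for s in sentences if s]

sample = "Cats chase mice. Dogs chase cats!"
sample_data = simple_tokenise(sample)
print(sample_data)
# [['cats', 'chase', 'mice'], ['dogs', 'chase', 'cats']]
```

The result is a list of sentences, each of which is a list of lowercase words. That is the format the training step expects.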

The next six lines of code build the model and then train it on your textual data. Feel free to change the hyperparameters if you feel like it. 

In [None]:
#Build the model
model = gensim.models.Word2Vec(window=10, min_count=2, workers=4, negative=10)
model.build_vocab(data, progress_per=6000)
model.epochs = 15
print(model.corpus_count)  #Number of sentences the model will be trained on

#Train the model
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
word_vectors = model.wv
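
Because the model is built with min_count=2, any word that appears only once in your document is left out of the vocabulary, and asking for its vector later raises a KeyError. The sketch below shows one way to check a list of words first. The helper name `split_by_vocab` is made up for this example, and a plain dictionary stands in for the trained vectors; as far as I know, gensim's word vectors support the same `in` test, so you should be able to pass `word_vectors` in place of the toy dictionary.

```python
def split_by_vocab(words, vocab):
    """Return (found, missing); `vocab` can be anything that supports the `in` operator."""
    found = [w for w in words if w in vocab]
    missing = [w for w in words if w not in vocab]
    return found, missing

# Toy stand-in for the trained vectors: a plain dict from word to vector
toy_vocab = {"cat": [0.1, 0.2], "dog": [0.3, 0.1]}

found, missing = split_by_vocab(["cat", "dog", "axolotl"], toy_vocab)
print(found)    # ['cat', 'dog']
print(missing)  # ['axolotl']
```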

The following two functions create the semantic subspace from your pair of antonyms and project a list of vectors onto it.
The code here is basically taken directly from the original paper, except that it has been translated from MATLAB into Python.
The original code is here: https://osf.io/5r2sz/?view_only= 

In [None]:
#Semantic subspace plotter. This can be improved by using a dictionary with multiple values rather than taking individual values
def get_semantic_subspace(vectors, value_a, value_b):
    """Computes the semantic subspace for a pair of antonyms"""
    sideA = vectors[value_a]
    sideB = vectors[value_b]

    # Compute the subspace vector (from SideB to SideA)
    theSubspace = sideA - sideB

    # Normalize the subspace vector
    norm = np.linalg.norm(theSubspace)
    if norm == 0:
        raise ValueError("The subspace vector has zero magnitude.")
    theSubspace = theSubspace / norm
    
    return theSubspace

#Project onto subspace
def project_onto_subspace(vectors, subspace):
    """
    Projects your vectors onto the semantic subspace.
    Parameters:
    - vectors (np.ndarray): A numpy array of shape (n, d) where n is the number of vectors
                            and d is the dimensionality of each vector.
    - subspace (np.ndarray): The normalized subspace vector.
    Returns:
    - projections (np.ndarray): A numpy array containing the projections of the vectors onto the subspace.
    """
    if len(vectors.shape) == 1:
        # Single vector case, reshape to make it consistent with matrix operations
        vectors = vectors.reshape(1, -1)

    # Compute the dot product of each vector with the subspace
    dot_products = np.dot(vectors, subspace)

    # Compute the projections
    projections = np.outer(dot_products, subspace)
    
    return projections
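
To see what the two functions above compute, here is a small worked example with made-up two-dimensional vectors (real word2vec vectors have a hundred or more dimensions, and these numbers are invented purely for illustration). The subspace is the unit-length arrow from one antonym to the other, and each word gets a score from its dot product with that arrow. project_onto_subspace returns that score multiplied by the arrow, so the score itself is the number that places a word along the scale.

```python
import numpy as np

# Made-up 2-d "embeddings", invented for this example only
toy_vectors = {
    "big":      np.array([2.0, 0.0]),
    "small":    np.array([-2.0, 0.0]),
    "elephant": np.array([3.0, 1.0]),
    "mouse":    np.array([-1.0, 2.0]),
}

# Axis pointing from "small" towards "big", scaled to unit length
axis = toy_vectors["big"] - toy_vectors["small"]
axis = axis / np.linalg.norm(axis)

# Scalar projection: one number per word, its position along the axis
scores = {w: float(np.dot(v, axis)) for w, v in toy_vectors.items()}

for word, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{word:>8}: {score:+.1f}")
#    small: -2.0
#    mouse: -1.0
#      big: +2.0
# elephant: +3.0
```

Words with higher scores sit closer to the first antonym ("big" here), and words with lower scores sit closer to the second.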

The next function uses Principal Component Analysis (PCA) and matplotlib to make your results presentable.
Honestly, there is probably a much better way to present this data, and what comes out is likely to have many of the usual problems you get with matplotlib.
The PCA doesn't really do anything, since your data has already been projected onto a one-dimensional space.

In [None]:
#Use PCA to reduce those vectors to a two dimensional array in order to make things clearer
def plot_2d_representation_of_words(
    word_list, 
    word_vectors,
    antonyms,
    flip_x_axis = False,
    flip_y_axis = False,
    label_x_axis = "meaningless (x)",
    label_y_axis = "meaningless (y)", 
    label_label = "Semantic Subspace"):
    #The two axis labels must be different, because they are used as column names
    
    pca = PCA(n_components = 2)
    
    #Stack the vectors for the requested words into one (n_words, n_dimensions) array
    word_matrix = np.array([word_vectors[word] for word in word_list])
    
    coordinates_2d = pd.DataFrame(
        pca.fit_transform(word_matrix), columns=[label_x_axis, label_y_axis])
    coordinates_2d[label_label] = list(word_list)
    if flip_x_axis:
        coordinates_2d[label_x_axis] = coordinates_2d[label_x_axis] * (-1)
    if flip_y_axis:
        coordinates_2d[label_y_axis] = coordinates_2d[label_y_axis] * (-1)
    plt.figure(figsize = (60,45))
    sns.scatterplot(
        data=coordinates_2d, x=label_x_axis, y=label_y_axis)
    x = coordinates_2d[label_x_axis]
    y = coordinates_2d[label_y_axis]
    
    #Draw a line between the two antonyms, but only if both have a vector available
    antonyms_key = [word_vectors[side] for side in antonyms if side in word_vectors]
    if len(antonyms_key) == 2:
        ends = pca.transform(np.array(antonyms_key))
        if flip_x_axis:
            ends[:, 0] = ends[:, 0] * (-1)
        if flip_y_axis:
            ends[:, 1] = ends[:, 1] * (-1)
        plt.plot(ends[:, 0], ends[:, 1], color = 'grey')
    
    #Label every point with its word and nudge the labels so they overlap less
    texts = [plt.text(x[i], y[i], word, family="serif", rotation=90) for i, word in enumerate(word_list)]
    adjustText.adjust_text(texts)
    
    #Return the coordinates so you can inspect them as a table as well
    return coordinates_2d
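
The remark that PCA doesn't really do anything here can be checked numerically. The sketch below uses random made-up vectors and NumPy's SVD rather than scikit-learn's PCA (for centred data the two should agree up to sign). It shows that the projected vectors only vary along one direction, and that the first principal component is just the centred projection score again.

```python
import numpy as np

rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(5, 4))        # 5 made-up word vectors, 4 dimensions
demo_axis = rng.normal(size=4)
demo_axis = demo_axis / np.linalg.norm(demo_axis)   # unit-length subspace vector

demo_scores = demo_vectors @ demo_axis              # scalar projection of each vector
demo_projections = np.outer(demo_scores, demo_axis) # same calculation as project_onto_subspace

# Centre the data (as PCA does) and look at its singular values
centred = demo_projections - demo_projections.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
print(np.round(singular_values, 6))   # only the first value is non-zero

# First principal-component coordinate vs. centred score (equal up to sign)
pc1 = centred @ vt[0]
same_up_to_sign = np.allclose(np.abs(pc1), np.abs(demo_scores - demo_scores.mean()))
print(same_up_to_sign)   # True
```

In practice this means the horizontal position of each word in the plot carries all the information, and the vertical position is noise at best.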

The final chunk of code puts all these functions to work.
First it gets the subspace and projects your words onto it, and then it plots the result as a two-dimensional image so you can read it.
This is where you plug in the antonyms you want to use.
All you have to do is put the words you are interested in into word_list, and replace 'antonym1' and 'antonym2' with your antonyms in both places where they appear.

In [None]:
#Stick the words in the dataset that you want to examine into the word list. 
#Every word (and both antonyms) must be in the model's vocabulary, or you will get a KeyError.
word_list = ['word', 'word1', 'word2', 'word3']  #...and as many more as you like

#Compute the semantic subspace, project the words onto it, and place the results in a dictionary
subspace = get_semantic_subspace(word_vectors, "antonym1", "antonym2")
vectors = np.array([word_vectors[word] for word in word_list])
output_projections = project_onto_subspace(vectors, subspace)
output_dict = dict(zip(word_list, output_projections))

#Plot the projected words in two dimensions
phil_map = plot_2d_representation_of_words(
    word_list = word_list, 
    word_vectors = output_dict, 
    antonyms = ['antonym1', 'antonym2'],
    flip_y_axis = False,
    flip_x_axis = False)    