# Phonetic similarity experiment

This notebook contains the code needed to compare the judged phonetic similarity scores reported in [Vitz and Winkler (1973)](https://www.researchgate.net/publication/232418589_Predicting_the_Judged_Similarity_of_Sound_of_English_words) with the cosine similarity between items in the vector embedding described in my paper. CSV files with the experimental data are included in this repository. Vitz and Winkler report an additional experiment that isn't included here (I haven't had time to transcribe the data yet!).

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

The `adjustText` module draws the nice arrows from the labels to the points in the scatterplots. Recommended! It is a separate package, so install it first (for example with `pip install adjustText`).

In [2]:
from adjustText import adjust_text
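
If `adjustText` is not installed, the import above fails. One possible workaround, which is an assumption of this note and not part of the original experiment, is to fall back to a stand-in that does nothing, so the plots still render (just without the label adjustment and arrows):

```python
# Optional fallback (hypothetical; not part of the original notebook).
try:
    from adjustText import adjust_text
except ImportError:
    def adjust_text(texts, **kwargs):
        """No-op stand-in: leave the labels where matplotlib placed them."""
        return None
```

With the stand-in, labels may overlap in dense regions of the scatterplot, so installing the real package is still the better option.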

The cosine similarity function, which returns 0.0 when either vector has zero length:

In [None]:
from numpy import dot
from numpy.linalg import norm

# cosine similarity
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0
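
As a quick sanity check, the function can be tried on a few hand-picked vectors. The sketch below repeats the definition so that it runs on its own; the test vectors are made up for illustration and do not come from the embedding.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosine(v1, v2):
    # same logic as the notebook's function, repeated so this runs standalone
    n1, n2 = norm(v1), norm(v2)
    if n1 > 0 and n2 > 0:
        return dot(v1, v2) / (n1 * n2)
    return 0.0

# vectors pointing the same way: similarity close to 1
same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
# perpendicular vectors: similarity 0
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# a zero-length vector hits the guard instead of dividing by zero
zero_vector = cosine(np.array([0.0, 0.0]), np.array([1.0, 1.0]))
print(same_direction, orthogonal, zero_vector)
```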

The "space" is defined as a dictionary whose keys are the words and whose values are the vectors. These are parsed from a pre-computed set of vectors that was built to include the nonce words used in Vitz and Winkler. Each line of the file holds a word, two spaces, and then the whitespace-separated vector components.

In [None]:
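space = dict()
# each line: WORD, two spaces, then the whitespace-separated vector components
with open("cmudict-0.7b-simvecs-vitz", encoding="latin1") as vec_file:
    for line in vec_file:
        line = line.strip()
        word, vec_raw = line.split("  ")
        word = word.lower()
        space[word] = np.array([float(x) for x in vec_raw.split()])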
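
To make the parsing step concrete, here is a minimal sketch that applies the same logic to two in-memory lines. The helper name `parse_line`, the `toy_space` dictionary, and the three-component vectors are invented for illustration; the real file has its own dimensionality and values.

```python
import numpy as np

def parse_line(line):
    # hypothetical helper: split "WORD  v1 v2 ..." into (word, vector)
    word, vec_raw = line.strip().split("  ")
    return word.lower(), np.array([float(x) for x in vec_raw.split()])

# made-up sample lines in the assumed "WORD<two spaces>numbers" layout
sample_lines = ["CHEESE  0.5 -1.25 2.0\n", "SIT  1.0 0.0 -0.5\n"]
toy_space = dict(parse_line(line) for line in sample_lines)
print(toy_space["cheese"])
```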

Just to make sure everything loaded correctly:

In [None]:
space["cheese"]

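The `runexperiment()` function takes a CSV file with data from the Vitz and Winkler experiment, along with a vector space (as loaded above) and a "standard word" (i.e., the word against which phonetic similarity is being tested). It adds an `embedding_cosine` column holding the cosine similarity between each word and the standard word, replaces `vw_predicted` with `1 - vw_predicted` so that higher values mean "more similar", and returns a Pandas dataframe.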

In [None]:
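def runexperiment(csv_filename, space, exp_word):
    df = pd.read_csv(csv_filename)
    df["embedding_cosine"] = [cosine(space[exp_word], space[x]) for x in df["word"]]
    # invert the predicted distance so that higher values mean more similar
    df["vw_predicted"] = [1-x for x in df["vw_predicted"]]
    # sort_values returns a new dataframe, so the result has to be kept
    df = df.sort_values(by="embedding_cosine")
    return df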
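
The same steps can be traced on a toy example. Everything in the sketch below is made up for illustration: the two-dimensional `toy_space`, the in-memory CSV, and its `obtained` and `vw_predicted` values are not data from Vitz and Winkler. Only the column names match the real CSV files.

```python
import io
import numpy as np
import pandas as pd
from numpy import dot
from numpy.linalg import norm

def cosine(v1, v2):
    # repeated here so the sketch runs on its own
    n1, n2 = norm(v1), norm(v2)
    return dot(v1, v2) / (n1 * n2) if n1 > 0 and n2 > 0 else 0.0

# invented vectors and invented experiment rows
toy_space = {"sit": np.array([1.0, 0.0]),
             "sat": np.array([1.0, 1.0]),
             "dog": np.array([0.0, 1.0])}
toy_csv = io.StringIO("word,obtained,vw_predicted\nsat,3.0,0.25\ndog,1.0,1.0\n")

df = pd.read_csv(toy_csv)
df["embedding_cosine"] = [cosine(toy_space["sit"], toy_space[w]) for w in df["word"]]
df["vw_predicted"] = 1 - df["vw_predicted"]   # distance -> similarity-style score
df = df.sort_values(by="embedding_cosine")    # assign: sort_values is not in-place
print(df)
```

After sorting, the least similar word comes first, and both predictor columns point in the same direction as `obtained`.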

The `getplot()` function takes a dataframe as returned by `runexperiment` and draws it as a scatterplot. Note: the axis labels and title default to the values for the first experiment, so pass `labelx`, `labely`, and `title` explicitly for the other plots. To save the plot (for example as a PDF), pass the desired filename as the `fname` parameter.

In [None]:
def getplot(df, x="embedding_cosine", y="obtained", labelx="Vector cosine similarity",
           labely="Obtained (Vitz and Winkler)",
           title="Standard word: sit",
           fname=None):
    plt.figure(figsize=(6, 6), dpi=75) # change DPI here for print-ready
    plt.scatter(df[x], df[y], s=4.0)
    texts = []
    # one text label per row, positioned at that row's data point
    for _, row in df.iterrows():
        texts.append(plt.text(row[x], row[y], row["word"]))
    adjust_text(texts, arrowprops=dict(arrowstyle="->", lw=0.5, alpha=0.5))
    plt.xlabel(labelx)
    plt.ylabel(labely)
    plt.title(title)
    if fname:
        plt.savefig(fname)
    plt.show()

## Experiment 1: sit

Results from the vector space:

In [None]:
df = runexperiment("./vitz-1973-experiment-sit.csv", space, "sit")

In [None]:
getplot(df, labelx="Cosine similarity", labely="Obtained (Vitz and Winkler)",
       title="Standard word: sit")

Correlation:

In [None]:
np.corrcoef(df["obtained"], df["embedding_cosine"])[0, 1]
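
`np.corrcoef` returns the full correlation matrix for its inputs, so the `[0, 1]` index picks out the Pearson correlation between the first and second argument. A small sketch with made-up numbers (not experimental data) shows the shape of the result:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_up = 2 * x + 1      # rises linearly with x
y_down = -x           # falls linearly with x

matrix = np.corrcoef(x, y_up)            # 2x2 matrix with ones on the diagonal
r_up = matrix[0, 1]                      # correlation between x and y_up
r_down = np.corrcoef(x, y_down)[0, 1]    # correlation between x and y_down
print(matrix.shape, r_up, r_down)
```

A value near 1 means the two columns rise together, and a value near -1 means one falls as the other rises.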

Results from Vitz and Winkler's "PPD" metric ("predicted phonemic distance"):

In [None]:
getplot(df, "vw_predicted", "obtained", labelx="Inverse PPD (Vitz and Winkler)",
        labely="Obtained (Vitz and Winkler)",
        title="Standard word: sit")

Correlation:

In [None]:
np.corrcoef(df["obtained"], df["vw_predicted"])[0, 1]

## Experiment 2: plant

Vector space results:

In [None]:
df = runexperiment("./vitz-1973-experiment-plant.csv", space, "plant")

In [None]:
getplot(df, labelx="Cosine similarity", labely="Obtained (Vitz and Winkler)",
       title="Standard word: plant")

Correlation:

In [None]:
np.corrcoef(df["obtained"], df["embedding_cosine"])[0, 1]

Vitz and Winkler PPD results:

In [None]:
df = runexperiment("./vitz-1973-experiment-plant.csv", space, "plant")

In [None]:
getplot(df, "vw_predicted", "obtained", labelx="Inverse PPD (Vitz and Winkler)",
        labely="Obtained (Vitz and Winkler)",
        title="Standard word: plant")

Correlation:

In [None]:
np.corrcoef(df["obtained"], df["vw_predicted"])[0, 1]

## Experiment 3: wonder

Vector space results:

In [None]:
df = runexperiment("./vitz-1973-experiment-wonder.csv", space, "wonder")
getplot(df, labelx="Cosine similarity", labely="Obtained (Vitz and Winkler)",
       title="Standard word: wonder")

Correlation:

In [None]:
np.corrcoef(df["obtained"], df["embedding_cosine"])[0, 1]

Vitz and Winkler PPD results:

In [None]:
getplot(df, "vw_predicted", "obtained", labelx="Inverse PPD (Vitz and Winkler)", labely="Obtained (Vitz and Winkler)",
       title="Standard word: wonder")

Correlation:

In [None]:
np.corrcoef(df["obtained"], df["vw_predicted"])[0, 1]