# GloVe Embeddings Demonstration

This notebook demonstrates the features of GloVe embeddings and explores how they could be used together with the COCO dataset to numerically analyze explanations. The embeddings can be downloaded [here](https://nlp.stanford.edu/projects/glove); I used the **6B** set (Wikipedia 2014 + Gigaword 5), in its 50-dimensional version.

## Imports that may be Necessary

```python
from collections import defaultdict
import numpy as np
import gensim
from gensim.models.keyedvectors import KeyedVectors
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
%matplotlib inline
```
*Note: You may need to install gensim first (e.g. `conda install gensim`)*

In [2]:
#Imports Cell
from collections import defaultdict
import numpy as np
import gensim
from gensim.models.keyedvectors import KeyedVectors
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
#Put the path to glove here
path = r"C:/Users/Vishn/Downloads/glove.6B.50d.txt.w2v"

#Now load the model into the variable "glove" (may take some time)
glove = KeyedVectors.load_word2vec_format(path, binary=False)

## How to Use GloVe
```python
glove["word"]  # Returns the GloVe embedding vector for the word

"word" in glove  # Checks whether the word is in the vocabulary (acts like a dictionary)

# Vector arithmetic: the result should land close to the vector for "wife"
query = glove["husband"] - glove["man"] + glove["woman"]

# To find the terms most similar to a vector:
glove.similar_by_vector(query)

# A more advanced way to run the same analogy query
glove.most_similar_cosmul(positive=['husband', 'woman'], negative=['man'])

# Since the embeddings are plain vectors, dot products give the cosine similarity,
# and the norm of a difference gives the Euclidean distance
```
```
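The last comment above can be made concrete with a minimal sketch. The helper names `cosine_similarity` and `euclidean_distance` and the 3-D toy vectors are assumptions for illustration only; in the notebook, `u` and `v` would be vectors such as `glove["man"]` and `glove["woman"]`.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v, built from dot products."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Length of the difference vector, also expressible as a dot product."""
    diff = u - v
    return float(np.sqrt(np.dot(diff, diff)))

# Toy 3-D stand-ins for embedding vectors (hypothetical values, not real GloVe data)
u = np.array([1.0, 0.0, 0.0])
v = np.array([1.0, 1.0, 0.0])
w = np.array([0.0, 0.0, 2.0])

print(cosine_similarity(u, v))   # about 0.707 (45 degrees apart)
print(cosine_similarity(u, w))   # 0.0 (orthogonal)
print(euclidean_distance(u, v))  # 1.0
```

Cosine similarity ignores vector length, while Euclidean distance does not. The rest of this notebook works with Euclidean distance.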


In [6]:
glove.similar_by_vector(glove["husband"] - glove["man"] + glove["woman"])

[('wife', 0.914989709854126),
 ('husband', 0.9109995365142822),
 ('mother', 0.8880481123924255),
 ('daughter', 0.8739314079284668),
 ('grandmother', 0.831427812576294),
 ('married', 0.8221434950828552),
 ('girlfriend', 0.8158676624298096),
 ('daughters', 0.8089475631713867),
 ('sister', 0.8080084323883057),
 ('widowed', 0.8063021898269653)]

# General Ideas for Symbolic Reasoning

Suppose we have a set of **labels** from the output of each subsystem (the sem. seg and the two captions). We then try to figure out how close the labels are to each other across the different systems.

We pick a distance threshold: if the labels from two or three subsystems are very close in embedding space and those from another are not, that one is generally the suspect. Here are some other general ideas:

- Each subsystem presents its own set of labels, and the labels within one set should be relatively close to each other. If **any of them seems abnormally far away from the others** (maybe we can scramble to see this), we say that it is not reasonable.
- We do the same thing across multiple subsystems, look at the distances (maybe the minimum distances), and try to figure out at a high level which one is not reasonable.
- We combine these local and high-level checks with symbolic checks to determine overall reasonableness.
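The first idea in the list (spotting a label that is abnormally far from the others in its own set) could be sketched roughly as follows. The function names `pairwise_distances` and `flag_outlier_labels`, the 2-D toy vectors, and the threshold value are all assumptions for illustration; a real threshold would have to be tuned on actual GloVe distances.

```python
import numpy as np

def pairwise_distances(vectors: np.ndarray) -> np.ndarray:
    """Euclidean distance between every pair of rows, as a square matrix."""
    diff = vectors[:, np.newaxis, :] - vectors[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def flag_outlier_labels(labels, vectors, threshold):
    """Return the labels whose nearest *other* label is farther than threshold."""
    dists = pairwise_distances(np.asarray(vectors, dtype=float))
    np.fill_diagonal(dists, np.inf)  # ignore each label's distance to itself
    nearest = dists.min(axis=1)
    return [label for label, d in zip(labels, nearest) if d > threshold]

# Toy 2-D stand-ins for embedding vectors (hypothetical values, not real GloVe data)
labels = ["stop", "stoplight", "banana"]
vectors = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]

print(flag_outlier_labels(labels, vectors, threshold=5.0))  # ['banana']
```

Using the nearest-neighbour distance is only one possible criterion. The mean distance to the other labels would be another, and this sketch does not claim either is the better choice.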

# Testing Demonstration

Let's take 3 different images from the COCO dataset where:

- Two of the images will be similar, and 1 image will be different

We will see whether the **semantic segmentation tags** plus some basic distance calculations can tell us which images should be closest to each other:

Links to the three images used:

- 
- 
- 

# Math for Calculating Distances Between Captions

Suppose the **sem. seg** labels identify **O** objects. Using **N**-dimensional vectors, these labels form an (**O**x**N**) array.

Now suppose the **caption labels** identify **K** objects. Still using **N**-dimensional vectors, these labels form a (**K**x**N**) array.
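As a rough sketch of how such arrays might be assembled, the hypothetical helper `embed_labels` below stacks the vectors of all labels found in a lookup and reports the ones that are missing. The `toy_lookup` dictionary is a stand-in with N = 3; the assumption is that the loaded `glove` object could be passed instead, since it supports both `in` and `[]`.

```python
import numpy as np

def embed_labels(labels, lookup):
    """Stack the vectors of known labels into a (num_found x N) array.

    Returns the array together with the list of labels that were not found.
    """
    found = [label for label in labels if label in lookup]
    missing = [label for label in labels if label not in lookup]
    matrix = np.array([lookup[label] for label in found])
    return matrix, missing

# Toy lookup with N = 3 (hypothetical values, not real GloVe data)
toy_lookup = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.9, 0.1, 0.0]),
    "cat": np.array([0.0, 1.0, 0.5]),
}

sem_seg, missing = embed_labels(["man", "cat", "unicorn"], toy_lookup)
print(sem_seg.shape)  # (2, 3): O = 2 labels found, N = 3 dimensions
print(missing)        # ['unicorn']
```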

## Finding Distances

To find the distances, we want to use **multiplication** to keep things simple. However, order matters if we blindly compare the two arrays position by position:

```python
# Pseudocode: position-by-position comparison of two label lists
# [[man], [woman], [cat]] vs [[man], [woman], [cat]] -> all 0, as each element is compared with itself

# However
# [[cat], [woman], [man]] vs [[man], [woman], [cat]] -> not all 0, as the order differs so different elements get paired
```

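Therefore we will specifically use **matrix multiplication**, which covers every pair of labels regardless of order. The steps to find the distances this way are:

- Sum the squares of each row in both arrays
- Add the two sets of sums (via broadcasting) and subtract twice the matrix product of the first array with the transpose of the second
- Take the square root to get the (**O**x**K**) array of distances
- Take the minimum along the shorter axis

The Python code to do so is as follows (uses `numpy`):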

```python
import numpy as np

O = 7
N = 2
K = 6

# Create two matrices with the given dimensions
x = np.arange(O * N).reshape((O, N))
y = np.arange(K * N).reshape((K, N))

# Squared distances are |x|^2 + |y|^2 - 2*x.y for every pair of rows
dists = np.sum(x**2, axis=1)[:, np.newaxis] + np.sum(y**2, axis=1)
dists -= 2 * np.matmul(x, y.T)

distances = np.sqrt(dists)

# Now we take the minimum of this to represent our distance
# np.argmin(distances.shape) selects whichever axis is shorter (either 0 or 1)
np.min(distances, axis=np.argmin(distances.shape))

# Reducing over the shorter axis keeps one value per object in the larger set,
# so every object's distance to the other set is accounted for.
# It shouldn't really matter much either way, though.

```
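The code above relies on expanding the squared Euclidean distance. Written out for row $x_i$ of the (O x N) array $X$ and row $y_j$ of the (K x N) array $Y$ (the symbols $X$, $Y$, and $D$ are notation introduced here for illustration):

$$
\lVert x_i - y_j \rVert^2 = \sum_{n=1}^{N} (x_{in} - y_{jn})^2 = \lVert x_i \rVert^2 + \lVert y_j \rVert^2 - 2\, x_i \cdot y_j
$$

The cross term $x_i \cdot y_j$ is entry $(i, j)$ of $X Y^{\top}$, so all O x K squared distances come out of a single matrix product:

$$
D^2_{ij} = \lVert x_i \rVert^2 + \lVert y_j \rVert^2 - 2\,(X Y^{\top})_{ij}, \qquad D_{ij} = \sqrt{D^2_{ij}}
$$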

In [69]:
arr_a = np.array([glove["man"],glove["woman"],glove["cat"]])
arr_b = np.array([glove["man"],glove["woman"],glove["cat"],glove["dog"]])

a = np.sum(arr_a**2,axis = 1)[:,np.newaxis]
b = np.sum(arr_b**2,axis = 1)

dists = a + b - 2*np.matmul(arr_a,arr_b.T)
dists[dists < 1e-6] = float(0.0)
dists = np.sqrt(dists)
type(np.min(dists,axis = np.argmin(dists.shape)))

numpy.ndarray

In [74]:
# Function based on all the computations above
def calcuate_distances(label_set_a:list, 
                       label_set_b:list) -> np.ndarray:
    """
    This function takes in two sets of glove embeddings vectors and returns the min distances between the two
    
    Parameters
    -------------    
    label_set_a : list 
            the first set of glove embedding vectors from one input source
    label_set_b : list
            the second set of glove embedding vectors from the second source
    
    Returns
    ---------
    numpy.ndarray
        The list of distances, where length = max(len(label_set_a),len(label_set_b))
    """
    
    #Turn both into numpy arrays
    arr_a = np.array(label_set_a)
    arr_b = np.array(label_set_b)
    
    #Square and transform as needed
    a = np.sum(arr_a**2,axis = 1)[:,np.newaxis]
    b = np.sum(arr_b**2,axis = 1)
    
    #Calculate the squared distances, then take the square root
    #Values below 1e-6 (including small negatives from round-off) are set to zero first
    dists = a + b - 2*np.matmul(arr_a,arr_b.T)
    dists[dists < 1e-6] = float(0.0)
    dists = np.sqrt(dists)
    
    #Take the minimum over the axis of the smaller set: one value per embedding in the larger set
    return np.min(dists,axis = np.argmin(dists.shape))
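One way to gain confidence in the matrix-multiplication shortcut is to compare it against a direct computation. The sketch below uses hypothetical helper names (`dists_matmul`, `dists_direct`) and random 50-dimensional toy data rather than real GloVe vectors. Note that it clips negative round-off with `np.maximum` instead of the `1e-6` cutoff used in the function above.

```python
import numpy as np

def dists_matmul(arr_a: np.ndarray, arr_b: np.ndarray) -> np.ndarray:
    """All pairwise distances via |a|^2 + |b|^2 - 2*a.b."""
    a = np.sum(arr_a ** 2, axis=1)[:, np.newaxis]
    b = np.sum(arr_b ** 2, axis=1)
    squared = a + b - 2 * np.matmul(arr_a, arr_b.T)
    # Round-off can push exact zeros slightly negative, so clip before the root
    return np.sqrt(np.maximum(squared, 0.0))

def dists_direct(arr_a: np.ndarray, arr_b: np.ndarray) -> np.ndarray:
    """All pairwise distances by explicitly forming every difference vector."""
    diff = arr_a[:, np.newaxis, :] - arr_b[np.newaxis, :, :]
    return np.linalg.norm(diff, axis=-1)

# Random stand-ins for two label sets (3 and 4 labels, 50 dimensions)
rng = np.random.default_rng(0)
arr_a = rng.normal(size=(3, 50))
arr_b = rng.normal(size=(4, 50))

fast = dists_matmul(arr_a, arr_b)
slow = dists_direct(arr_a, arr_b)

print(fast.shape)                # (3, 4)
print(np.allclose(fast, slow))   # True
```

The direct version builds a (3 x 4 x 50) intermediate array, which is why the matrix-product form tends to be preferable for larger label sets.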

In [141]:
# Function based on all the computations above
def calcuate_distance(label_set_a:list, 
                      label_set_b:list) -> np.floating:
    """
    This function takes in two sets of glove embedding vectors and returns a single value representing the distance between the two sets
    
    Parameters
    -------------    
    label_set_a : list 
            the first set of glove embedding vectors from one input source
    label_set_b : list
            the second set of glove embedding vectors from the second source
    
    Returns
    ---------
    float32
        A single value representing the distance between label_set_a and label_set_b
    """
    
    #Turn both into numpy arrays
    arr_a = np.array(label_set_a)
    arr_b = np.array(label_set_b)
    
    #Square and transform as needed
    a = np.sum(arr_a**2,axis = 1)[:,np.newaxis]
    b = np.sum(arr_b**2,axis = 1)
    
    #Calculate the squared distances, then take the square root
    #Values below 1e-6 (including small negatives from round-off) are set to zero first
    dists = a + b - 2*np.matmul(arr_a,arr_b.T)
    dists[dists < 1e-6] = float(0.0)
    dists = np.sqrt(dists)
    
    #Take the minimum over the axis of the smaller set, then sum those minimums into a single value
    return np.sum(np.min(dists,axis = np.argmin(dists.shape)))

## Now let's use our above function with labels of 2 similar images and 1 different

For this test, we will **use only the sem. seg labels to see how well we can tell distances with them** (the idea is that captions would work the same way):
- [Similar Image A](https://cocodataset.org/#explore?id=5253) and [Similar Image B](https://cocodataset.org/#explore?id=277614), selected on the COCO site by clicking *stop sign* (embedded as `stop`) and *traffic light* (embedded as `stoplight`)
- [Different Image A](https://cocodataset.org/#explore?id=360877), selected on the COCO site by clicking *apple* and *chair*

Let's see the 3 way comparison test for distances:

In [112]:
#Code to manually look through available words
#(glove.vocab is the gensim 3.x attribute; gensim 4.x uses glove.key_to_index instead)
# asd = list(glove.vocab.keys())
# asd.sort()
#print(asd[360000:370000])

In [133]:
#List of representations based on sem. seg labels
similar_image_a = [glove["stop"],glove["stoplight"]]
similar_image_b = [glove["stop"],glove["stoplight"],glove["train"],glove["clock"]]
different_image_a = [glove["person"],glove["bottle"],glove["banana"],glove["apple"],glove["chair"]]

In [143]:
print("Distances between similar_image_a, similar_image_b: ")
print(calcuate_distances(similar_image_a,similar_image_b))

print("\n")
print("Distances between similar_image_a, different_image_a: ")
print(calcuate_distances(similar_image_a,different_image_a))

print("\n")
print("Distances between similar_image_b, different_image_a: ")
print(calcuate_distances(similar_image_b,different_image_a))

print("\n")
print("Printing the sums of each:")
print(calcuate_distance(similar_image_a,similar_image_b))
print(calcuate_distance(similar_image_a,different_image_a))
print(calcuate_distance(similar_image_b,different_image_a))

Distances between similar_image_a, similar_image_b: 
[0.        0.        3.8940194 4.6098065]


Distances between similar_image_a, different_image_a: 
[4.951073  5.6056447 5.913838  6.0824575 5.8651605]


Distances between similar_image_b, different_image_a: 
[4.951073  5.6056447 5.913838  5.834715  4.8754997]


Printing the sums of each:
8.503826
28.418173
27.18077


#### Now we will try the same process as above, but with the label order scrambled (not nice and uniform across the different inputs)

In [146]:
#List of representations based on sem. seg labels but with different ordering
similar_image_a = [glove["stoplight"], glove["stop"]]
similar_image_b = [glove["train"], glove["stop"], glove["stoplight"], glove["clock"]]
different_image_a = [glove["bottle"],glove["banana"],glove["person"],glove["chair"], glove["apple"]]

In [147]:
print("Distances between similar_image_a, similar_image_b: ")
print(calcuate_distances(similar_image_a,similar_image_b))

print("\n")
print("Distances between similar_image_a, different_image_a: ")
print(calcuate_distances(similar_image_a,different_image_a))

print("\n")
print("Distances between similar_image_b, different_image_a: ")
print(calcuate_distances(similar_image_b,different_image_a))

print("\n")
print("Printing the sums of each:")
print(calcuate_distance(similar_image_a,similar_image_b))
print(calcuate_distance(similar_image_a,different_image_a))
print(calcuate_distance(similar_image_b,different_image_a))

Distances between similar_image_a, similar_image_b: 
[3.8940194 0.        0.        4.6098065]


Distances between similar_image_a, different_image_a: 
[5.6056447 5.913838  4.951073  5.8651605 6.0824575]


Distances between similar_image_b, different_image_a: 
[5.6056447 5.913838  4.951073  4.8754997 5.834715 ]


Printing the sums of each:
8.503826
28.418175
27.18077


## Observations

As the output above shows, the two *similar images* were close to each other (summed distance of about 8.5), **and had similar distances to the different image** (about 28.4 and 27.2), which can be really useful for **outlier analysis**.

Another nice property is that, *independent of the order of the label inputs*, **the overall distances were the same** up to floating-point rounding (28.418173 vs. 28.418175), and the distance vectors contain the same values, just in a different order.
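The order independence can also be checked mechanically. The following is a minimal sketch on random toy data, where the helper name `min_distances` and the data are assumptions for illustration. It shuffles the rows of both sets and confirms that the minimum distances are merely permuted, while their sum agrees up to rounding (a different summation order can change the last digit).

```python
import numpy as np

def min_distances(arr_a: np.ndarray, arr_b: np.ndarray) -> np.ndarray:
    """Minimum distance to the smaller set, for each row of the larger set."""
    diff = arr_a[:, np.newaxis, :] - arr_b[np.newaxis, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    return np.min(dists, axis=np.argmin(dists.shape))

# Random stand-ins for two label sets (2 and 5 labels, 50 dimensions)
rng = np.random.default_rng(1)
set_a = rng.normal(size=(2, 50))
set_b = rng.normal(size=(5, 50))

original = min_distances(set_a, set_b)

# Scramble both sets: reverse the smaller one, randomly permute the larger one
perm = rng.permutation(len(set_b))
shuffled = min_distances(set_a[::-1], set_b[perm])

print(np.allclose(original[perm], shuffled))        # True: same values, permuted
print(np.isclose(original.sum(), shuffled.sum()))   # True: same total up to rounding
```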

- We can also try this with higher-dimensional vectors (this used only the 50D ones), which might separate the distances even more cleanly (closer things closer, farther things farther)
- We can also try this process and calculate distances **within 1 single image label set to find outlier labels** (t