## Word manipulation based on vector representations
We use GloVe: Global Vectors for Word Representation<br/>
by Jeffrey Pennington, Richard Socher, and Christopher D. Manning

see: https://nlp.stanford.edu/projects/glove/

Download the file glove.6B.zip, unzip it, and upload glove.6B.100d.txt into your project assets


## Gensim
To manipulate the vectors, we are using the gensim library.<br/>
see: https://pypi.org/project/gensim/

In [None]:
!pip install --upgrade gensim

In [None]:
from pyspark.sql import SparkSession
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='9OBEPHS0jj5q0FdEFWpF-USWWwiqFtRkeH6njgVaar',
    ibm_auth_endpoint="https://iam.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

# Your data file was loaded into a botocore.response.StreamingBody object.
# Please read the documentation of ibm_boto3 and pandas to learn more about your possibilities to load the data.
# ibm_boto3 documentation: https://ibm.github.io/ibm-cos-sdk-python/
# pandas documentation: http://pandas.pydata.org/
spark = SparkSession.builder.getOrCreate()

In [None]:
client.download_file(Bucket='bscstesting-donotdelete-pr-paqxy5fmsmaykn', 
                     Key='glove.6B.100d.txt', Filename='glove.txt')

In [None]:
# There are 400000 words in this file. You can double-check with the following command:
# !wc -l glove.txt

In [None]:
# The gensim word2vec loader expects the first line of the file to hold
# the number of entries and the vector size.
# Add that header to the file: number of rows, then number of dimensions
!echo "400000 100" >header.txt
!cat header.txt glove.txt >glove2.txt
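The header above hard-codes `400000 100`, which only matches `glove.6B.100d.txt`. As a minimal sketch, the same step can be done in Python by counting rows and dimensions from the file itself. The helper name `add_word2vec_header` is hypothetical, and the code assumes every line has the form `word v1 v2 ... vN` separated by single spaces.

```python
from typing import Tuple


def add_word2vec_header(src: str, dst: str) -> Tuple[int, int]:
    """Copy src to dst, prefixed with a '<rows> <dims>' header line.

    Hypothetical helper: assumes each line of src is 'word v1 v2 ... vN'.
    """
    rows = 0
    dims = 0
    # First pass: count the rows and read the vector size from the first line.
    with open(src, encoding="utf-8") as f:
        for line in f:
            if rows == 0:
                dims = len(line.rstrip("\n").split(" ")) - 1
            rows += 1
    # Second pass: write the header followed by the unchanged content.
    with open(src, encoding="utf-8") as f_in, \
            open(dst, "w", encoding="utf-8") as f_out:
        f_out.write(f"{rows} {dims}\n")
        for line in f_in:
            f_out.write(line)
    return rows, dims


# Example call (not executed here because it needs the downloaded file):
# add_word2vec_header("glove.txt", "glove2.txt")
```

Reading the counts from the file means the same cell keeps working if a different GloVe file (for example a 50- or 300-dimensional one) is uploaded instead.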

In [None]:
# Load the gensim model using the glove2.txt file
# https://radimrehurek.com/gensim/models/keyedvectors.html
from gensim.models import KeyedVectors

gmodel=KeyedVectors.load_word2vec_format("./glove2.txt", binary=False)

In [None]:
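# Remove only the files created above instead of every .txt file in the directory
!rm header.txt glove.txt glove2.txt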

In [None]:
type(gmodel)

In [None]:
# gensim 4.x renamed index2word to index_to_key
print("Number of words: " + str(len(gmodel.index_to_key)) )
print("Vector size: " + str(gmodel.vector_size) )
print("First 5 components of 'computer': ")
print(gmodel["computer"][:5])

## Look at similarities
See how close two words are to each other

In [None]:
print("good good: " + str(gmodel.similarity('good', 'good')) )
print("good best: " + str(gmodel.similarity('good', 'best')) )
print("good bad : " + str(gmodel.similarity('good', 'bad')) )
print("good mouse: " + str(gmodel.similarity('good', 'mouse')) )
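The scores printed by `similarity` are cosine similarities between the two word vectors: 1.0 means the vectors point in the same direction, 0.0 means they are orthogonal. The sketch below shows the computation on small made-up vectors, not on GloVe data; the function name `cosine_similarity` is illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional vectors, for illustration only
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # in between
```

This is why `similarity('good', 'good')` is 1.0 (up to rounding): a vector always has cosine similarity 1 with itself.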

## Related words
Find the top 10 related words

In [None]:
ms=gmodel.most_similar(positive=['spain'])
for x in ms:
    print(x[0] + ", " + str(x[1]))

In [None]:
ms=gmodel.most_similar(positive=['canada'])
for x in ms:
    print(x[0] + ", " + str(x[1]))

In [None]:
ms=gmodel.most_similar(positive=['although'])
for x in ms:
    print(x[0] + ", " + str(x[1]))

In [None]:
ms=gmodel.most_similar(positive=['computer'])
for x in ms:
    print(x[0] + ", " + str(x[1]))

## Word algebra
We can do vector algebra to compose words from other words: add and subtract word vectors, then look for the words whose vectors are closest to the result.

For example: `"king" - "man" + "woman" ≈ "queen"`

In [None]:
# Look at the top choices for results "king" - "man" + "woman"
gmodel.most_similar(positive=['king', 'woman'], negative=['man'])[:5]
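To make the arithmetic behind the cell above concrete, here is a minimal sketch on made-up 2-dimensional vectors. It adds the `positive` vectors, subtracts the `negative` ones, and ranks the remaining words by cosine similarity to the result. The function `analogy` and the `toy` vectors are hypothetical, and the sketch is a simplification: as far as I know, gensim also normalizes the vectors to unit length before combining them, which this version skips.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def analogy(vectors: Dict[str, List[float]],
            positive: List[str],
            negative: List[str],
            topn: int = 3) -> List[Tuple[str, float]]:
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # The query words themselves are excluded from the candidates
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v))
              for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors: first axis ~ "royalty", second axis ~ "gender"
toy = {
    "king": [0.9, 0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "queen": [0.9, -0.9],
    "apple": [-0.5, 0.1],
}

result = analogy(toy, positive=["king", "woman"], negative=["man"])
print(result)  # "queen" ranks first for these toy vectors
```

With real embeddings the match is only approximate, which is why the notebook looks at the top few candidates rather than a single answer.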

## Analogy

In [None]:
from typing import List, Tuple

def closest_analogies(
    left2: str, left1: str, right2: str, model: KeyedVectors
) -> List[Tuple[str, float]]:
    # Solve "left2 is to left1 as right2 is to ?" with left1 - left2 + right2
    return model.most_similar(positive=[left1, right2], negative=[left2])[:5]

def print_analogy(left2: str, left1: str, right2: str, model: KeyedVectors) -> None:
    analogies = closest_analogies(left2, left1, right2, model)
    if len(analogies) == 0:
        print(left2 + "-" + left1 + " is like " + right2 + "-?")
    else:
        # Each entry is a (word, cosine similarity) pair; keep the best word
        (w, _similarity) = analogies[0]
        print(left2 + "-" + left1 + " is like " + right2 + "-" + w)

In [None]:
print_analogy('paris', 'france', 'rome', gmodel)

In [None]:
gmodel.most_similar(positive=['france', 'rome'], negative=['paris'])[:5]

In [None]:
print_analogy('man', 'king', 'woman', gmodel)
print_analogy('walk', 'walked' , 'go', gmodel)
print_analogy('happy', 'sad' , 'rich', gmodel)

## Least likely word in a list

In [None]:
gmodel.doesnt_match("breakfast cereal dinner lunch".split())

In [None]:
print("breakfast dinner: " + str(gmodel.similarity('breakfast', 'dinner')) )
print("breakfast lunch: " + str(gmodel.similarity('breakfast', 'lunch')) )
print("lunch dinner: " + str(gmodel.similarity('lunch', 'dinner')) )
print("breakfast cereal: " + str(gmodel.similarity('breakfast', 'cereal')) )
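One way to find the odd word out, and to my understanding roughly what `doesnt_match` does, is to normalize each vector to unit length, average them, and return the word whose vector has the smallest dot product with that average. The sketch below applies this to made-up 2-dimensional vectors; `odd_one_out` and the `toy_meals` values are hypothetical and not taken from GloVe.

```python
import math
from typing import Dict, List


def unit(v: List[float]) -> List[float]:
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors: Dict[str, List[float]], words: List[str]) -> str:
    """Return the word whose unit vector is least aligned with the mean."""
    normed = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(normed.values())))
    mean = [sum(v[i] for v in normed.values()) / len(normed)
            for i in range(dims)]
    return min(words,
               key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))


# Made-up vectors: three words cluster together, one points elsewhere
toy_meals = {
    "breakfast": [1.0, 0.1],
    "lunch": [0.9, 0.2],
    "dinner": [0.95, 0.15],
    "cereal": [0.2, 1.0],
}

odd = odd_one_out(toy_meals, "breakfast cereal dinner lunch".split())
print(odd)  # "cereal" for these toy vectors
```

The pairwise similarities printed in the cell above tell the same story from another angle: the three meal words are closer to each other than any of them is to `cereal`.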

## Show a graphical representation of the vectors
see: https://stackoverflow.com/questions/43776572/visualise-word2vec-generated-from-gensim

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# Grab only the first 100 words (gensim 4.x: index_to_key replaces vocab)
vocab = gmodel.index_to_key[:100]
X = gmodel[vocab]

# Project the 100-dimensional vectors down to 2 dimensions for plotting
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)

df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

ax.scatter(df['x'], df['y'])

for word, pos in df.iterrows():
    ax.annotate(word, pos)

plt.show()
