Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Character Space - The `ascii_letters` vector space

Inspired by the char2vec colab.

## The `character` vector space

We begin by defining the `character` vector space, which has the `ascii_letters` as its `basis`. Why ASCII? The choice is arbitrary, and having only 52 dimensions keeps things simple. We could have used any of the Unicode variations instead, but depending on the size of the character space we may want to pursue some optional optimizations, which are also discussed below.

There are 2 functions which compute: <br>
1) the `vector` representation of the word, i.e. the counts of the characters in the word <br>
2) the `support`: the set of unique characters

In [None]:
from collections import Counter
import math

def char2vec(word):
  # Counts the occurrences of each character in word.
  # We use a dictionary instead of a sparse matrix to describe the characters,
  # however the concept is identical.
  return Counter(word)

def support(v):
  # The support of a vector over a basis is the subset of basis elements with
  # non-zero components.
  return set(v)

# Note: We could have written this more simply: char2vec = Counter; support = set
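
The choice of `Counter` matters more than it may seem: looking up a character that never occurred returns 0 without storing anything, so lookups never change the support. The sketch below contrasts this with a `defaultdict(int)`, which is similar but not identical. The names `counter_vec`, `default_vec`, and `missing_counts` are illustrative only and are not used elsewhere in the notebook.

```python
from collections import Counter, defaultdict

counter_vec = Counter("pizza")

default_vec = defaultdict(int)
for ch in "pizza":
  default_vec[ch] += 1

# Both report 0 for a character that never occurred in the word.
missing_counts = (counter_vec["q"], default_vec["q"])

# Only the defaultdict stored 'q' as a side effect of that lookup,
# so taking set(...) of it would now report a wrong support.
print(missing_counts)        # (0, 0)
print(sorted(counter_vec))   # ['a', 'i', 'p', 'z']
print(sorted(default_vec))   # ['a', 'i', 'p', 'q', 'z']
```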

# Check Point
1. What is the dimension of the ascii character vector space?
1. What is the support of 'pizza'?

In [None]:
import string

vector_space = string.ascii_letters

print(f"1. The dimension of vector_space is {len(vector_space)}, since there is one \n   independent vector per character and that's all of them.\n")
print(f'2. The support of pizza is {support(char2vec("pizza"))}.')


1. The dimension of vector_space is 52, since there is one 
   independent vector per character and that's all of them.

2. The support of pizza is {'a', 'z', 'p', 'i'}.


# The `dot` product and the `norm`

Convince yourself that the `dot` function defined here computes the usual dot product of two vectors: the sum of the products of matching components.

With `numpy` arrays, that is `numpy.dot`.

In [None]:
import string

def dot(v, w):
  domain = string.ascii_letters
  # Note that this computation is equivalent to each of the following
  # optimizations, as long as the words contain only ASCII letters.
  # Exercise: Why?
  #   domain = support(v).union(w)
  #   domain = support(v).intersection(w)
  #
  # Summing over a domain like this bears a passing resemblance to
  # integration, doesn't it?
  return sum(v[ch] * w[ch] for ch in domain)

def norm(v):
  return math.sqrt(dot(v, v))
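
One way to check the claim about `numpy.dot` is to lay the counts out in basis order and compare the two results. This is a minimal sketch, not part of the original notebook: `to_dense` is a hypothetical helper, and `dot` is re-declared here only so the block runs on its own.

```python
from collections import Counter
import string

import numpy as np

def to_dense(word, basis=string.ascii_letters):
  # Hypothetical helper: one component per basis character, in basis order.
  counts = Counter(word)
  return np.array([counts[ch] for ch in basis])

def dot(v, w, domain=string.ascii_letters):
  # Same sum as the dot defined above, repeated so this block is self-contained.
  return sum(v[ch] * w[ch] for ch in domain)

sparse_result = dot(Counter("pizza"), Counter("pasta"))
dense_result = int(np.dot(to_dense("pizza"), to_dense("pasta")))

# Only 'p' (1 * 1) and 'a' (1 * 2) occur in both words.
print(sparse_result, dense_result)  # 3 3
```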

# An Example

In [None]:
# Our sample "words"
wordlist = [
  'TheQuickBrownFoxJumpsOverTheLazyDog',
  'TheQuickWhiteFoxJumpsOverTheLazyDog',
  'SupermanJumpsOverTheTallBuilding',
]

# For each of our words, print its vector and some information about it.
for word in wordlist:
  print(word)
  print("  vector: ", char2vec(word))
  print("  support: ", support(char2vec(word)))
  print("  norm:", norm(char2vec(word)), end="\n\n")

TheQuickBrownFoxJumpsOverTheLazyDog
  vector:  Counter({'e': 3, 'o': 3, 'T': 2, 'h': 2, 'u': 2, 'r': 2, 'Q': 1, 'i': 1, 'c': 1, 'k': 1, 'B': 1, 'w': 1, 'n': 1, 'F': 1, 'x': 1, 'J': 1, 'm': 1, 'p': 1, 's': 1, 'O': 1, 'v': 1, 'L': 1, 'a': 1, 'z': 1, 'y': 1, 'D': 1, 'g': 1})
  support:  {'D', 'a', 'v', 'g', 'r', 'J', 'h', 's', 'p', 'B', 'k', 'w', 'i', 'm', 'x', 'z', 'Q', 'F', 'y', 'o', 'n', 'c', 'O', 'L', 'u', 'T', 'e'}
  norm: 7.416198487095663

TheQuickWhiteFoxJumpsOverTheLazyDog
  vector:  Counter({'e': 4, 'h': 3, 'T': 2, 'u': 2, 'i': 2, 'o': 2, 'Q': 1, 'c': 1, 'k': 1, 'W': 1, 't': 1, 'F': 1, 'x': 1, 'J': 1, 'm': 1, 'p': 1, 's': 1, 'O': 1, 'v': 1, 'r': 1, 'L': 1, 'a': 1, 'z': 1, 'y': 1, 'D': 1, 'g': 1})
  support:  {'D', 'a', 'v', 'g', 'r', 'J', 'h', 's', 'p', 't', 'k', 'i', 'm', 'x', 'z', 'Q', 'W', 'F', 'y', 'o', 'c', 'O', 'L', 'u', 'T', 'e'}
  norm: 7.810249675906654

SupermanJumpsOverTheTallBuilding
  vector:  Counter({'u': 3, 'e': 3, 'l': 3, 'p': 2, 'r': 2, 'm': 2, 'a': 2, 'n': 2, 

## The angles between vectors
This function computes the `cosine_similarity` of two `char2vec` vectors, i.e. the cosine of the angle between them in character space.

In [None]:
def cosine_similarity(v, w):
  return dot(v, w) / norm(v) / norm(w)
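
To see the formula at work, here is a small example worked by hand for the words "pizza" and "pasta". It is only a sketch: the intermediate names (`dot_vw`, `norm_v`, `norm_w`) are introduced for illustration and the arithmetic is spelled out in the comments.

```python
import math
from collections import Counter

v, w = Counter("pizza"), Counter("pasta")

# pizza: p=1, i=1, z=2, a=1      pasta: p=1, a=2, s=1, t=1
dot_vw = sum(v[ch] * w[ch] for ch in v)              # 1*1 (p) + 1*2 (a) = 3
norm_v = math.sqrt(sum(c * c for c in v.values()))   # sqrt(1 + 1 + 4 + 1) = sqrt(7)
norm_w = math.sqrt(sum(c * c for c in w.values()))   # sqrt(1 + 4 + 1 + 1) = sqrt(7)

similarity = dot_vw / norm_v / norm_w
print(similarity)  # approximately 3/7, about 0.4286
```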

## How far are our examples from one another?

In [None]:
from itertools import combinations

# Find the cosine similarity between each pair of the 3 words defined above.
# Similar sentences will have higher scores, ranging from 0 to 1.
for x, y in combinations(wordlist, 2):
  print("Cosine Similarity between", x, "and", y, "=",
        cosine_similarity(char2vec(x), char2vec(y)), end="\n\n")

Cosine Similarity between TheQuickBrownFoxJumpsOverTheLazyDog and TheQuickWhiteFoxJumpsOverTheLazyDog = 0.9150179365143998

Cosine Similarity between TheQuickBrownFoxJumpsOverTheLazyDog and SupermanJumpsOverTheTallBuilding = 0.6910548590248231

Cosine Similarity between TheQuickWhiteFoxJumpsOverTheLazyDog and SupermanJumpsOverTheTallBuilding = 0.6721936196477039



---

# Totally Optional Exercises

1. If the cosine_similarity=0, what do you know about the words?
1. If the cosine_similarity=1, are the words the same?
1. Can the cosine_similarity be negative?
1. Is the cosine_similarity symmetric?
1. In the definition of `dot`, above, why are these three definitions of the variable `domain` equivalent?
  - `domain = string.ascii_letters`
  - `domain = support(v).union(w)`
  - `domain = support(v).intersection(w)`
1. Define a `distance` by taking the inverse cosine of `cosine_similarity`.
  - Show this now actually computes the angle.
  - Given 3 words, does this notion of `distance` satisfy the triangle inequality? Modify the code above to show these three words do.

##### **Advanced**
1. Find a sequence of words w_1, w_2, ..., for which the sequence norm(w_1), norm(w_2), ... is [unbounded](https://en.wikipedia.org/wiki/Bounded_function).
1. Find two sequences of words whose pair-wise cosine_similarity is arbitrarily close to 1, i.e. find word sequences a=a_1,a_2,... and b=b_1,b_2,... so that a_n and b_n make arbitrarily tiny angles.
  - This means that our vector space (with this distance) is not [topologically discrete](https://en.wikipedia.org/wiki/Discrete_space).
  - State a reasonable condition so that the vector space is discrete.

In [None]:
def similarity(x,y):
  return cosine_similarity(char2vec(x), char2vec(y))

print("similarity of", "able", "elba", ": ", similarity("able", "elba"))
print("similarity of", "gabe", "juno", ": ", similarity("gabe", "juno"))
print("similarity of", "piz...za", "piz...zza", ": ", similarity("pi" + "z"*10**3 + "a", "pi" + "z"*(10**3 +1) + "a"))

similarity of able elba :  1.0
similarity of gabe juno :  0.0
similarity of piz...za piz...zza :  0.999999999998503
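
For the `distance` exercise, one possible starting point is sketched below. It is not the only way to do it: `angle` and `triangle_ok` are names chosen here, `cosine_similarity` is re-declared so the block runs on its own, and the clamp guards against rounding that could push the cosine slightly past 1, where `math.acos` raises a `ValueError`.

```python
import math
from collections import Counter
from itertools import permutations

def cosine_similarity(v, w):
  # Counter returns 0 for missing characters, so summing over v's keys suffices.
  dot_vw = sum(v[ch] * w[ch] for ch in v)
  norm_v = math.sqrt(sum(c * c for c in v.values()))
  norm_w = math.sqrt(sum(c * c for c in w.values()))
  return dot_vw / norm_v / norm_w

def angle(x, y):
  cos = cosine_similarity(Counter(x), Counter(y))
  # Clamp to [-1, 1] before taking the inverse cosine.
  return math.acos(max(-1.0, min(1.0, cos)))

words = [
  'TheQuickBrownFoxJumpsOverTheLazyDog',
  'TheQuickWhiteFoxJumpsOverTheLazyDog',
  'SupermanJumpsOverTheTallBuilding',
]

# Check angle(a, c) <= angle(a, b) + angle(b, c) for every ordering of the
# three words, allowing a tiny tolerance for floating-point rounding.
triangle_ok = all(
  angle(a, c) <= angle(a, b) + angle(b, c) + 1e-12
  for a, b, c in permutations(words)
)
print(triangle_ok)
print(math.degrees(angle("gabe", "juno")))  # a right angle: no shared characters
```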


---

# Forget about sparsity! (a **dense** implementation)

The example above used a Python `Counter`, which behaves much like a `defaultdict(int)`, to implement sparse vectors. Sparsity is critically important if your domain is, e.g., the words of the English language. In our case, however, the domain is just 52 characters, so the sparsity is unnecessary. Here is a non-sparse, i.e. dense, implementation.

In [None]:
import string
import numpy as np

vector_space = string.ascii_letters

def char2index(ch):
  return vector_space.index(ch)
def index2char(index):
  return vector_space[index]

def char2vec_dense(word):
  out = np.zeros(len(vector_space))
  for ch in word:
    out[char2index(ch)] += 1
  return out

def support_dense(v):
  return "".join(index2char(i) for i in v.nonzero()[0])

# Now dot really is np.dot.
def dot_dense(v, w):
  return v.dot(w)

def norm_dense(v):
  # The norm is the square root of v dotted with itself.
  return dot_dense(v, v)**.5

In [None]:
# For each of our words, print its vector and some information about it.
for word in wordlist:
  print(word)
  print("  vector: ", char2vec_dense(word))
  print("  support: ", support_dense(char2vec_dense(word)))
  print("  norm:", norm_dense(char2vec_dense(word)), end="\n\n")

TheQuickBrownFoxJumpsOverTheLazyDog
  vector:  [1. 0. 1. 0. 3. 0. 1. 2. 1. 0. 1. 0. 1. 1. 3. 1. 0. 2. 1. 0. 2. 1. 1. 1.
 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 2. 0. 0.
 0. 0. 0. 0.]
  support:  aceghikmnoprsuvwxyzBDFJLOQT
  norm: 7.416198487095663

TheQuickWhiteFoxJumpsOverTheLazyDog
  vector:  [1. 0. 1. 0. 4. 0. 1. 3. 2. 0. 1. 0. 1. 0. 2. 1. 0. 1. 1. 1. 2. 1. 0. 1.
 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 2. 0. 0.
 1. 0. 0. 0.]
  support:  aceghikmoprstuvxyzDFJLOQTW
  norm: 7.810249675906654

SupermanJumpsOverTheTallBuilding
  vector:  [2. 0. 0. 1. 3. 0. 1. 1. 2. 0. 0. 3. 2. 2. 0. 2. 0. 2. 1. 0. 3. 1. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 2. 0. 0.
 0. 0. 0. 0.]
  support:  adeghilmnprsuvBJOST
  norm: 8.0



# More exercises
1. Hey, vector spaces are defined over a field. Which field is this vector space over?
1. Pick a Unicode encoding. What is the dimension of the corresponding character space?
1. Find an alternative basis for ascii character space. Can you think of a situation where it might be more useful than `string.ascii_letters`? (*)
1. Implement cosine_similarity_dense. It looks an awful lot like cosine_similarity, doesn't it?
1. Re-implement `cosine_similarity` using `sklearn.feature_extraction.text.CountVectorizer`.
1. What is an advantage of the sparse implementation?
1. What is an advantage of the dense implementation?


In [None]:
# This function looks an awful lot like cosine_similarity, doesn't it?
def cosine_similarity_dense(v, w):
  return dot_dense(v, w) / norm_dense(v) / norm_dense(w)
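
As a quick sanity check, the dense version should reproduce the sparse score computed earlier for the first two sample words. The sketch below re-declares compact versions of the dense helpers so that it runs on its own; `brown`, `white`, and `dense_score` are names introduced here.

```python
import string
import numpy as np

vector_space = string.ascii_letters

def char2vec_dense(word):
  out = np.zeros(len(vector_space))
  for ch in word:
    out[vector_space.index(ch)] += 1
  return out

def cosine_similarity_dense(v, w):
  # The norm of v is sqrt(v . v), so everything reduces to three dot products.
  return v.dot(w) / np.sqrt(v.dot(v)) / np.sqrt(w.dot(w))

brown = char2vec_dense('TheQuickBrownFoxJumpsOverTheLazyDog')
white = char2vec_dense('TheQuickWhiteFoxJumpsOverTheLazyDog')
dense_score = cosine_similarity_dense(brown, white)
print(dense_score)  # about 0.915, matching the sparse result up to rounding
```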