# Vectors and Similarity

Converting a data set into a set of vectors is called __vectorisation__.<br>
We will explore several methods of vectorisation.

Objectives: 
- understanding the basic concepts of vectors in linear algebra
- practising the implementation of vectorisation algorithms in Python

Tasks: 
1. Create a function that estimates the similarity between two vectors using the _cosine similarity_ measure
2. Test the function on a variety of numeric test data
3. Test the function on text data
4. Apply the function in a Q&A (question answering) application

## 1. Cosine Similarity
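See the figure below. Make sure you understand the meaning of a and b (the __vectors__) and of |a| and |b| (their __magnitudes__, that is, their lengths).
Recall what the Pythagorean theorem says about the sides of a right triangle: the magnitude of a vector is the square root of the sum of its squared components.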
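
To make the link between the Pythagorean theorem and a magnitude concrete, here is a minimal sketch. The helper name `vector_length` is hypothetical and is not part of the notebook's code; the vector `[3, 4]` is chosen only because it gives a round result.

```python
import math

def vector_length(v):
    """Return the Euclidean length (magnitude) of a vector given as a list of numbers."""
    # Square every component, add the squares, take the square root.
    return math.sqrt(sum(x * x for x in v))

a = [3, 4]
print(vector_length(a))  # 5.0, the hypotenuse of a 3-4-5 right triangle
```

The same rule works for any number of components, which is why it carries over to the longer vectors used later.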

![image.png](attachment:image.png)


We are interested in the angle between the two vectors: <br>
a __smaller angle__ means __closer vectors__, which means __greater similarity__.

$$\text{cosine similarity} = \cos(\text{angle}) = \frac{a \cdot b}{\|a\| \, \|b\|}$$

According to the formula, __cos(angle)__ is the quotient of two components: the __numerator__ (above the division line, __a.b__) and the __denominator__ (below the division line, __||a||*||b||__).<br>

- The first component requires the _dot product_ of the two vectors.
- The second one requires calculating their magnitudes.
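
As a worked example under assumed values (the vectors a = (1, 0) and b = (1, 1) are chosen here for illustration and do not come from the figure), the two components and their quotient are:

$$a \cdot b = 1 \cdot 1 + 0 \cdot 1 = 1$$

$$\|a\| \, \|b\| = \sqrt{1^2 + 0^2} \cdot \sqrt{1^2 + 1^2} = \sqrt{2}$$

$$\cos(\text{angle}) = \frac{1}{\sqrt{2}} \approx 0.707$$

which corresponds to an angle of 45°.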

In the function, we calculate each component separately and then divide the first by the second.

In [None]:
%%writefile cosimfunc.py
import numpy as np
import math

# calculate the coefficient of cosine similarity between two vectors
def cosim(vector1, vector2) -> float:

    # numerator as a dot product
    numerator = sum(i * j for (i, j) in zip(vector1, vector2))

    # denominator as a product of the two magnitudes
    # (magnitude() is defined below)
    mag1 = magnitude(vector1)
    mag2 = magnitude(vector2)
    denominator = mag1 * mag2

    # divide, unless the denominator is zero (at least one vector has zero length)
    if not denominator:
        sim = 0.0
    else:
        sim = float(numerator) / denominator
    print('Cosine similarity: ', sim)
    return sim

# calculate the magnitude (length) of one vector
def magnitude(v) -> float:
    # np.square() returns the element-wise square of the input
    # math.sqrt() returns the square root of a number
    mag = math.sqrt(sum(np.square(v)))
    return mag

Overwriting cosimfunc.py
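
For comparison, the same calculation can be sketched with NumPy's own dot product and norm instead of explicit loops. This is an alternative illustration, not the notebook's `cosim`; the name `cosim_np` is hypothetical, and returning `0.0` for a zero-length vector simply mirrors the convention used in the function above.

```python
import numpy as np

def cosim_np(vector1, vector2) -> float:
    """Cosine similarity via np.dot and np.linalg.norm (hypothetical helper)."""
    v1 = np.asarray(vector1, dtype=float)
    v2 = np.asarray(vector2, dtype=float)
    # Product of the two magnitudes (Euclidean norms).
    denominator = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denominator == 0:
        # Same convention as cosim: no division when a vector has zero length.
        return 0.0
    return float(np.dot(v1, v2) / denominator)

print(cosim_np([1, 0], [0, 1]))  # 0.0, the vectors are perpendicular
```

Unlike the list-based version, this one returns the value without printing it, so the caller decides how to display the result.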


In [4]:
pip install python-docx

Collecting python-docx
  Obtaining dependency information for python-docx from https://files.pythonhosted.org/packages/5f/d8/6948f7ac00edf74bfa52b3c5e3073df20284bec1db466d13e668fe991707/python_docx-1.1.0-py3-none-any.whl.metadata
  Downloading python_docx-1.1.0-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.1.0-py3-none-any.whl (239 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.0
Note: you may need to restart the kernel to use updated packages.


In [3]:
from docx import Document
doc = Document('./data/BarackHusseinObamaWiki.docx')
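# assumption: the original line is cut off here; joining the text of all paragraphs is one plausible completion
doctxt = '\n'.join(p.text for p in doc.paragraphs)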

## 2. Numeric Tests

In [4]:
import cosimfunc
from cosimfunc import cosim

In [5]:
import importlib 
importlib.reload(cosimfunc)

<module 'cosimfunc' from 'C:\\Users\\Alexander Michelsen\\Downloads\\cosimfunc.py'>

#### Test 1

In [None]:
# define two vectors
a = [1, 2, 3]
b = [1, 2, 3]

In [None]:
# calculate similarity
k = cosim(a, b)

#### Test 2

In [None]:
# define two vectors
a = [1, 2, 3]
b = [-1, -2, -3]

In [None]:
# calculate similarity
k = cosim(a, b)
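
The coefficients from Test 1 and Test 2 (approximately 1 and -1) are easier to interpret as angles. The sketch below converts a similarity value into degrees; the helper name `angle_degrees` is hypothetical and is not part of `cosimfunc.py`.

```python
import math

def angle_degrees(similarity: float) -> float:
    """Convert a cosine similarity into the angle between the vectors, in degrees."""
    # Clamp to [-1, 1] because rounding errors can push the value slightly outside,
    # which would make math.acos raise a ValueError.
    clamped = max(-1.0, min(1.0, similarity))
    return math.degrees(math.acos(clamped))

print(angle_degrees(1.0))   # 0.0   -> same direction
print(angle_degrees(-1.0))  # 180.0 -> opposite directions
```

In the notebook, passing the returned `k` to such a helper would show that identical vectors are 0° apart and opposite vectors 180° apart.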

Make more tests here!

In [None]:
# define two vectors
a = [1, 2, 3]
b = [45, -20, 33]

In [None]:
# calculate similarity
k = cosim(a, b)
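
Further tests could cover vectors that point the same way but differ in length, perpendicular vectors, and a zero vector. The sketch below is self-contained, so it redefines a compact pure-Python stand-in named `cosine` (a hypothetical name); in the notebook you would call the imported `cosim` instead. The test vectors are illustrative assumptions.

```python
import math

def cosine(v1, v2):
    """Compact stand-in for cosim: dot product over the product of magnitudes."""
    dot = sum(i * j for i, j in zip(v1, v2))
    norm = math.sqrt(sum(i * i for i in v1)) * math.sqrt(sum(j * j for j in v2))
    return dot / norm if norm else 0.0

cases = {
    "same direction, different length": ([1, 2, 3], [2, 4, 6]),  # expect about 1
    "perpendicular": ([1, 0], [0, 1]),                           # expect 0
    "zero vector": ([0, 0, 0], [1, 2, 3]),                       # expect 0.0 by convention
}
for name, (a, b) in cases.items():
    print(name, round(cosine(a, b), 6))
```

The first case shows that cosine similarity ignores vector length and depends only on direction.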