### Identifying similar items

* A fundamental problem in data mining is to search for "similar" items.
* E.g.: 
    * Finding duplicate Web pages 
    * Finding duplicate listings on an e-commerce platform (e.g., Craigslist or Amazon)
   
* How do we find the items most closely related to a given query $q$?
* We often need to do this for every instance in a dataset

### Naive Solution

* Compare items pairwise and compute a similarity score for each pair.

* This naive approach quickly becomes intractable.

* For $n$ items there are $\frac{n \times (n-1)}{2} \in O(n^2)$ unique comparisons
  * With 1 million items, there are 499,999,500,000 comparisons
    * That's a relatively small dataset by today's standard
  * May be computationally intractable or cost-prohibitive
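
As a quick check of the count above, here is a minimal sketch; the helper name `n_pairs` is only illustrative and not part of the original material.

```python
import math

def n_pairs(n):
    """Number of unique unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Growth is quadratic: 10x more items means roughly 100x more comparisons.
for n in (1_000, 1_000_000, 10_000_000):
    print(f"{n:>12,} items -> {n_pairs(n):>22,} comparisons")

# The closed form agrees with "n choose 2" from the standard library.
same_as_comb = n_pairs(1_000_000) == math.comb(1_000_000, 2)
```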


### Similarity Score Between Two Items

* Suppose we have three records (instances) from a dataset
  * Records contain `marital_status`, `number of children`, `has_investments?`, `owns_house?`, `car >= 2?`

```python
a = ["Married", "4", "yes", "yes", "yes"]
b = ["Married", "2", "yes", "yes", "yes"]
c = ["Single",  "0", "yes", "no", "no"]
```
* How similar are these records?
* We can see that `a` and `b` are more alike than `a` and `c`

### Similarity Between two Sets

```python
 a = ["Married", "4", "yes", "yes", "yes"]
 b = ["Married", "2", "yes", "yes", "yes"]
```

* One way to capture similarity is to count the features on which two records share the same value
* `a` and `b` share 4 values out of 5.
* `a` and `c` share 1 value out of 5.

* We can therefore say that `a` and `b` are more similar than `a` and `c` or `b` and `c`


### Jaccard Similarity

* We define the Jaccard similarity between two sets $S$ and $T$ as

$$
J(S,T) = \frac{|S \cap T |}{|S \cup T |}
$$

  * i.e., the ratio of the size of the intersection of $S$ and $T$ to the size of their union.

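* Treating each record as a set of (feature, value) pairs:
  * $J(a,b) = 4/6 \approx 0.67$ (4 shared pairs out of 6 distinct pairs)
  * $J(a,c) = 1/9 \approx 0.11$ (1 shared pair out of 9 distinct pairs)
* The fraction of matching positions used above ($4/5 = 0.8$ and $1/5 = 0.2$) is a simpler score that ranks the pairs the same way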
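
A minimal sketch of the formula, assuming each record is first turned into a set of (feature, value) pairs so that the same value in two different columns is not conflated. The feature names in `features` and the helper `as_set` are placeholders introduced for this illustration.

```python
def jaccard(s, t):
    """Jaccard similarity |S ∩ T| / |S ∪ T| of two collections treated as sets."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(s & t) / len(s | t)

# Placeholder column names, used only to tag each value with its column.
features = ["marital_status", "children", "has_investments", "owns_house", "two_plus_cars"]

a = ["Married", "4", "yes", "yes", "yes"]
b = ["Married", "2", "yes", "yes", "yes"]
c = ["Single",  "0", "yes", "no", "no"]

def as_set(record):
    """Encode a record as a set of (feature, value) pairs."""
    return set(zip(features, record))

print(round(jaccard(as_set(a), as_set(b)), 2))
print(round(jaccard(as_set(a), as_set(c)), 2))
```

Under this encoding the union of `a` and `b` holds 6 distinct pairs, so the score is 4/6 rather than the positional match rate of 4/5; either way `a` is closer to `b` than to `c`.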

### Hashing

* Many applications depend critically on quickly finding items in a list
  * We cannot always afford to search for the item (even binary search is $O(\log N)$, where $N$ is the number of items to search, and it requires sorted data)
  * E.g.: air traffic control or packet routing in critical web applications
* Hashing can yield a match in near-constant time, $O(1)$ on average
  * The cost to compute the hash does not depend on the number of stored items

* Idea: 
  1. Transform the searched-for item into a key
    * We can compute the key `k` of an element `e` using a function `key()`
    * `key(e) = k`
  2. Find the key's position in a table
    * Use the hash function to determine the bin `b` in the table where `k` belongs.
    * `hash(k) = b`
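
The two steps can be sketched with a toy table. This is only an illustration under simple assumptions: `N_BINS`, `key`, `bin_of`, `insert` and `contains` are hypothetical names, the key is just the string form of the element, and the bin is Python's built-in `hash()` reduced modulo the table size.

```python
N_BINS = 8  # hypothetical table size

def key(e):
    """Step 1: turn an element into a key (here simply its string form)."""
    return str(e)

def bin_of(k):
    """Step 2: map the key to a bin index in the table."""
    return hash(k) % N_BINS

table = [[] for _ in range(N_BINS)]

def insert(e):
    table[bin_of(key(e))].append(e)

def contains(e):
    # Only one bin is scanned, regardless of how many items are stored overall.
    return e in table[bin_of(key(e))]

for flight in ["AC101", "UA2203", "HA15"]:
    insert(flight)

print(contains("UA2203"), contains("DL999"))
```

Python's `dict`, used in the cells that follow, applies the same idea with a far more careful implementation.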


### Hashing -- Cont'd 
<img src="https://www.dropbox.com/s/tkk8tz0yy6k7v2q/hashing.png?dl=1" alt="Drawing" style="width: 400px;"/>


In [6]:
### Universally Unique Identifier (UUID) 
### are often used to identify objects in software
### Particularly in
import uuid
uuid.uuid4()

UUID('4461adf8-5ec5-4399-95cb-ef3ffcb6e400')

In [7]:
uuid_list = []
for val in range(5):
    uuid_list.append(str(uuid.uuid4()))
uuid_list

['1de6fc24-391e-4329-94a2-414d796009eb',
 '46abbd39-1126-49b5-8bf7-187eab038361',
 '4c7cfbcc-c795-4e68-91c6-14470881be4a',
 '7807fe37-18b5-44ac-97a5-8b8d999d5be8',
 '1897f749-0ce3-4f48-8cc7-a3f786369fd6']

In [8]:
# The following uses a list comprehension.
# List comprehensions are useful and terse to write 
# See https://realpython.com/list-comprehension-python/
uuid_list = [str(uuid.uuid4()) for i in range(5)] 
uuid_list


['7c7890c6-7fbc-4838-8389-6b101e297754',
 '67a5ccff-2bd6-4900-800a-242d46166e2c',
 'f85055ca-8818-4f47-95aa-57ccb6f6bf1a',
 '6056f9ca-15f9-4e94-8081-de0c2e231b29',
 'aa1463b2-89c8-4d13-a228-9b05943cd070']

In [9]:
import uuid
import time
uuid_list = [(str(uuid.uuid4()), time.time()) for i in range(5)] 
uuid_list


[('336d9ee1-f79f-4a64-a0f7-52b5b5abe120', 1630534303.67071),
 ('54e35ec3-6a55-422e-b7e3-5588282a7fa3', 1630534303.6707299),
 ('79174f88-8543-4a60-98f8-03a8747fc5cc', 1630534303.670767),
 ('98cb0a9e-950d-45ee-a3ed-794621fed650', 1630534303.6708071),
 ('d7e796f7-4697-48d3-b3f6-15b679a47f57', 1630534303.6708758)]

In [10]:
%%time
uuid_list = [(str(uuid.uuid4()), time.time()) for i in range(1_000_000)] 

CPU times: user 5.18 s, sys: 1.94 s, total: 7.12 s
Wall time: 7.33 s


In [11]:
uuid_list[999999][0]

'd825ca21-a842-47b5-8258-2aa002ebd9c6'

In [12]:
%%time
query = uuid_list[999999][0]
for elem in uuid_list:
    if elem[0] == query:
        print(elem)
        break

('d825ca21-a842-47b5-8258-2aa002ebd9c6', 1630534311.3391411)
CPU times: user 138 ms, sys: 9.37 ms, total: 148 ms
Wall time: 177 ms


In [13]:
%%time
[x for x in uuid_list if x[0] == query]

CPU times: user 106 ms, sys: 6.86 ms, total: 113 ms
Wall time: 139 ms


[('d825ca21-a842-47b5-8258-2aa002ebd9c6', 1630534311.3391411)]

In [14]:
import random
random.randint(0, 999_999)

213147

In [15]:
x = random.randint(0, 999_999)



In [16]:
queries = []
for i in range(100):
    rand_position = random.randint(0, 999_999)
    query = uuid_list[rand_position][0]
    queries.append(query)

queries[0:4]

['e499c58a-806b-4be3-8c54-430057d7ffeb',
 '3ed23203-23ab-4140-92fc-eaa5a7503b25',
 '7dfc4512-3ae0-466d-a5bf-062cb13d7e6c',
 'd8b11d00-2739-4b36-a40e-eac875b2d2cc']

In [17]:
%%time
for q in queries:
    temp = [x for x in uuid_list if x[0] == q]

CPU times: user 7.62 s, sys: 177 ms, total: 7.8 s
Wall time: 8.27 s


In [18]:
# We can easily build a dict from a list of tuples
dict([("A", 1), ("B", 2), ("C", 3)])

{'A': 1, 'B': 2, 'C': 3}

In [19]:
# We can easily build a dict from a list of tuples
some_dict = dict([("A", 1), ("B", 2), ("C", 3)])
some_dict["B"]

2

In [20]:
uuid_hash = dict(uuid_list)

In [21]:
q = queries[0]
print(f"The query is: {q}")
print(f"The time associated with q is {uuid_hash[q]}")

The query is: e499c58a-806b-4be3-8c54-430057d7ffeb
The time associated with q is 1630534304.944434


In [22]:
%%time

for q in queries:
    uuid_hash[q]

CPU times: user 84 µs, sys: 1e+03 ns, total: 85 µs
Wall time: 88.9 µs


### The Hash Function in Python

* Python uses `hash()` to hash an immutable object
  * It combines the two steps above: converting the input to a key and hashing the key to a bin
  * Objects that can be modified in place, such as lists, cannot be hashed
  * The hash value is used to determine the location (address) where a dict key will be stored
  
```python
hash(("an", "immutable", "tuple"))  # hash(["a", "list"]) would raise TypeError
```
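
A short sketch of both behaviours, using only the built-in `hash()`; the variable names are illustrative. Note that hash values of strings (and of tuples containing them) usually differ from one Python session to the next, so only comparisons within a single run are meaningful.

```python
record = ("Married", "4", "yes")
same_record = ("Married", "4", "yes")

# Equal immutable objects hash to the same value within one run.
hashes_agree = hash(record) == hash(same_record)
print(hashes_agree)

# A list is mutable, so hash() refuses it.
try:
    hash(["Married", "4", "yes"])
    list_is_hashable = True
except TypeError as err:
    list_is_hashable = False
    print(f"Lists are not hashable: {err}")
```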

### Hashing Similar Items

* The following dataset contains individuals' info
  * `first name`, `last name`, `age`, `salary`, and `number of years of experience`

```python
data_1 = ("John", "Doe", "32", "165,385", 3)
data_2 = ("Mat", "Doe", "32", "192,891", 3)
data_3 = ("Mark", "Smith", "34", "85,232", 2)
```

* We can hash each of these records using the `hash` function.
  * Note that they are declared as tuples instead of lists
  * Lists are mutable and, therefore, not hashable
  

In [23]:
data_1 = ("John", "Doe", "32", "165,385", 3)
print(f"The hash for data_1 is {hash(data_1)}")

data_2 = ("Mat", "Doe", "32", "192,891", 3)
print(f"The hash for data_2 is {hash(data_2)}")

data_3 = ("Mark", "Smith", "34", "85,232", 2)
print(f"The hash for data_3 is {hash(data_3)}")

data_4 = ("Mindy", "Smith", "65", "160,000", 42)
print(f"The hash for data_4 is {hash(data_4)}")


The hash for data_1 is 2035316016885371771
The hash for data_2 is -1028300933287350327
The hash for data_3 is 4030385555266032470
The hash for data_4 is -1416360971839870202


### Hashing and Proximity

* We've shown that hashing can be much faster than a linear search for finding identical items

* Unfortunately, a hash value does not convey any level of similarity (e.g., Jaccard similarity) 
  * `data_1` and `data_2` are closer to each other than to `data_3`, but their hash values are not
* Can we use hashing to convey some level of similarity?
  * We could find potentially similar items using hashing (fast) and then compare only that subset of items using Jaccard similarity
  * This means pairwise comparison on a much smaller subset of the data

  

  


### Hashing and Proximity - cont'd

* Naive approach: normalize values to convey similarity

1. Drop irrelevant columns such as first and last name
    * For instance, one could start by keeping only one of each group of highly correlated variables.
    * Use domain knowledge to remove features that can be derived from other features
2. Convert numerical values into bins
3. Compare the entries over the converted data
    * Compute the hash on some randomly picked subset of features.


### Hashing and Proximity -1

```python
# original data
data_1 = ("John", "Doe", "32", "165,385", 3)
data_2 = ("Mat", "Doe", "32", "192,891", 3)
data_3 = ("Mark", "Smith", "34", "85,232", 2)
data_4 = ("Mindy", "Smith", "65", "160,000", 42)

# Binned data

data_1 = ("millennial", "high", "low")
data_2 = ("millennial", "high", "low")
data_3 = ("millennial", "average", "low")
data_4 = ("baby_boomer", "high", "high")
```

* This approach assumes it's possible to bin the data into relevant categories
  * Here we can create reasonably meaningful categories that may be good enough to find matching profiles
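
One way the binning step could look in code. This is a sketch only: the function name `bin_record` and every threshold in it (age 45, salary 100,000, 10 years of experience) are assumptions chosen so that the four example records land in the categories shown above; they do not come from the original material.

```python
def bin_record(record):
    """Map (first, last, age, salary, years_experience) to three coarse categories."""
    _, _, age, salary, years = record          # names are dropped as irrelevant
    age = int(age)
    salary = int(salary.replace(",", ""))      # "165,385" -> 165385

    generation = "millennial" if age < 45 else "baby_boomer"   # hypothetical cut-off
    salary_bin = "high" if salary >= 100_000 else "average"    # hypothetical cut-off
    experience_bin = "high" if years >= 10 else "low"          # hypothetical cut-off
    return (generation, salary_bin, experience_bin)

raw = [
    ("John", "Doe", "32", "165,385", 3),
    ("Mat", "Doe", "32", "192,891", 3),
    ("Mark", "Smith", "34", "85,232", 2),
    ("Mindy", "Smith", "65", "160,000", 42),
]
binned = [bin_record(r) for r in raw]
print(binned)
```

After binning, the first two records become identical tuples and therefore hash to the same value.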


In [24]:
d = [("millennial", "high", "low"), ("millennial", "high", "low"), ("millennial", "average", "low"), ("baby_boomer", "high", "high")]
d

[('millennial', 'high', 'low'),
 ('millennial', 'high', 'low'),
 ('millennial', 'average', 'low'),
 ('baby_boomer', 'high', 'high')]

In [25]:
[hash(i) for i in d]

[-8106195907596266894,
 -8106195907596266894,
 5772069829379949509,
 3830553319135088666]

### Hashing and Proximity -2

* When working with large datasets (particularly wide ones), comparing the data across all the features may be CPU and disk intensive.

* For example, imagine an instance (a single entry) that has thousands of features
  * Example 1: a simple way to represent a document numerically is as counts of its word occurrences
  * Example 2: DNA testing companies can assay hundreds of thousands of genetic markers
  * Example 3: in a medical experiment, for each patient, one can record:
      - Age
      - Blood analysis:
        - Complete blood count
        - High Density Lipoprotein (HDL)
        - Low Density Lipoprotein (LDL)
        - White Blood Cell Count
        - Red Blood Cell Count
        - ...
      - Immune system status
        - Number of leukocytes
        - Cytokine levels in serum
        - ....
      - Genetic background
      - nb cigarettes in last month
      - nb of alcoholic drinks in last month
      - nb times used drugs in last month
      - nb surgical operations in last year
      - nb of medications taken
      - nb of hospital visits
      - ...
  
  
  
* It's difficult to compare data across all features 


### Hashing and Proximity

* Recall that the objective we defined above is to find items that are similar to a query $q$ 
* The proposed subset-bin approach is not a perfect replacement for pairwise comparisons, but it allows us to avoid unnecessary comparisons
  * We only focus on pairs of items whose hashes match 
* It's also well suited to very wide datasets
   * In a dataset with thousands of features, we need only focus on a small subset of the features

### Hashing Wide Datasets




* While this is computationally efficient, there is a risk that we might choose the wrong features
    
```python 
# feature:  1    2    3    4    5    6    7    8
data_1 = ["x", "x", "x", "x", "y", "x", "x", "y"]
data_2 = ["x", "x", "x", "x", "z", "x", "x", "z"]
data_3 = ["w", "w", "w", "w", "z", "w", "w", "z"]
```
 * E.g.: `data_1` and `data_2` agree on 6 of 8 features, but hashing only on the randomly picked features 5 and 8 groups `data_2` with `data_3` and misses its closest item, `data_1`.
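
The failure case can be reproduced directly. In this sketch the placeholder values are written as strings, and the helpers `subset_key` and `n_matching` are hypothetical names introduced for the illustration.

```python
data_1 = ["x", "x", "x", "x", "y", "x", "x", "y"]
data_2 = ["x", "x", "x", "x", "z", "x", "x", "z"]
data_3 = ["w", "w", "w", "w", "z", "w", "w", "z"]

def subset_key(record, features):
    """Tuple of the values at the given 1-based feature positions."""
    return tuple(record[i - 1] for i in features)

def n_matching(r, s):
    """Number of positions where two records hold the same value."""
    return sum(u == v for u, v in zip(r, s))

chosen = (5, 8)  # the unlucky random pick

# On features 5 and 8, data_2 looks identical to data_3 ...
print(subset_key(data_2, chosen) == subset_key(data_3, chosen))
# ... and different from data_1, even though data_1 is its closest record overall.
print(subset_key(data_1, chosen) == subset_key(data_2, chosen))
print(n_matching(data_1, data_2), n_matching(data_2, data_3))
```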

### Hashing Wide Datasets - cont'd

* Perhaps we can repeat the algorithm multiple times

1. Pick a random subset of features
2. Compute the hash on the selected subset of features
3. Repeat a certain number of times

* There are theoretical guarantees that can make this work.
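
The loop above could be sketched as follows. The function name `candidate_pairs` and its parameters are assumptions made for this example; tuples are used directly as dictionary keys, so Python hashes them for us when filling the buckets.

```python
import random
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, n_features, n_rounds, seed=0):
    """Return index pairs that agree on at least one random feature subset."""
    rng = random.Random(seed)
    width = len(records[0])
    candidates = set()
    for _ in range(n_rounds):
        # 1. Pick a random subset of features.
        subset = rng.sample(range(width), n_features)
        # 2. Bucket the records by the values on that subset.
        buckets = defaultdict(list)
        for idx, rec in enumerate(records):
            buckets[tuple(rec[i] for i in subset)].append(idx)
        # Records sharing a bucket become candidates for a full comparison.
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    # 3. The loop repeats the two steps n_rounds times.
    return candidates

records = [
    ["x", "x", "x", "x", "y", "x", "x", "y"],
    ["x", "x", "x", "x", "z", "x", "x", "z"],
    ["w", "w", "w", "w", "z", "w", "w", "z"],
]
pairs = candidate_pairs(records, n_features=2, n_rounds=50)
print(sorted(pairs))
```

Only the candidate pairs then need an exact similarity computation, which is the saving over the full pairwise comparison.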

### Iterative Hashing

* Given two instances $x_i$ and $x_j$ with $N$ features each, of which $K$ have matching values
* If we hash on $n$ randomly selected features, we can compute the probability that the two hash values match
  * We are selecting $n$ features and we want the probability that all $n$ of them are among the $K$ matching features


![](https://www.dropbox.com/s/c7bbp98fhcn0s0q/select_jar.png?dl=0)

$$
p = \frac{\binom{K}{n}}{\binom{N}{n}}
$$


* This is a specific case of the hypergeometric distribution (https://en.wikipedia.org/wiki/Hypergeometric_distribution)

### Iterative Hashing - Single iteration Probability

e.g.: 
    
* Suppose we have a dataset that has 100 features
* Given two instances $x_i$ and $x_j$ that match over 95 of the features 
* What is the probability that $x_i$ and $x_j$ match on a randomly selected subset of 6 features?

$$
p = \frac{\binom{K}{n}}{\binom{N}{n}} = \frac{\binom{95}{6}}{\binom{100}{6}}
$$

* In Python

```python
import math
math.comb(95, 6) / math.comb(100, 6)  # ≈ 0.7291
```
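
To see how the choice of subset size affects this probability, the formula can be wrapped in a small function. The name `match_probability` is hypothetical; the function only restates the ratio of binomial coefficients given above.

```python
import math

def match_probability(N, K, n):
    """P(all n sampled features are among the K matching ones out of N)."""
    return math.comb(K, n) / math.comb(N, n)

# Larger subsets make an accidental mismatch more likely for the same pair.
for n in (1, 2, 6, 10, 20):
    print(f"n = {n:>2}: p = {match_probability(100, 95, n):.3f}")
```

Hashing on more features makes matches more selective, at the price of missing more of the genuinely similar pairs in a single round.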

### Iterative Hashing - Probability for Multiple iterations 

* The probability of two relatively similar instances matching may not be acceptable
  * $p = 0.73$ means that a single round misses about 27% of the instances that match the query on 95 of 100 features
 
* Repeating the process multiple times increases our chances of selecting $n$ features that match between $x_i$ and $x_j$



* Specifically, given the probability $p$ of a match in a single round (e.g., $p = 0.73$), repeating the process $t$ times with independently chosen feature subsets means that the probability of at least one match is:

$$
1 - (1-p)^{t}
$$

* This follows from the binomial distribution: $(1-p)^{t}$ is the probability of zero matches in $t$ independent trials
https://en.wikipedia.org/wiki/Binomial_distribution

### Iterative Hashing - Probability for Multiple iterations 

* The probability of a perfect match in a single round in the example above is $p = 0.73$

* The probability of at least one match in 10 trials is $1 - (1-0.73)^{10}$

* In Python

```python
1 - (1 - 0.73)**10  # ≈ 0.999998
```
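
The same expression can be inverted to ask how many repetitions are needed to reach a target probability of at least one match. This is a sketch; `rounds_needed` is a hypothetical helper that solves $1 - (1-p)^{t} \ge \text{target}$ for the smallest integer $t$, assuming $0 < p < 1$ and $0 < \text{target} < 1$.

```python
import math

def rounds_needed(p, target):
    """Smallest t such that 1 - (1 - p)**t >= target, for 0 < p, target < 1."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

p = 0.73
for target in (0.95, 0.99, 0.999):
    t = rounds_needed(p, target)
    print(f"target {target}: {t} rounds, achieved {1 - (1 - p)**t:.5f}")
```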


### Simulating the probabilities.

* A relatively tractable way to estimate probabilities for simple events is through simulation.

E.g.: 
* Suppose we have a dataset that has 100 features 
* Given two instances $x_i$ and $x_j$ that match on 95 out of 100  features 
* What is the probability that $x_i$ and $x_j$ match on a randomly selected subset of 6 features?
  * Randomly select 6 features a large number of times and compute the fraction of times you obtained a perfect match

In [35]:
# initially x and y match
x = [1 for i in  range(100)]
y = [1 for i in  range(100)]
print(f"x is: {x}\n")
print(f"y is: {y}")

x is: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

y is: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [60]:
# Select 5 positions where x and y will not match
import random

random_positions = random.sample(range(100), 5)
random_positions

[89, 80, 44, 2, 38]

In [61]:
# change the values in y at the selected positions
for i in random_positions:
    y[i] = 0
    
print(x)    
print(y)    


[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [62]:
random_columns = random.sample(range(100), 6)
print(random_columns)
values_for_x = tuple(x[i] for i in random_columns)
values_for_y = tuple(y[i] for i in random_columns)
print(values_for_x)
print(values_for_y)
print(hash(values_for_x))
print(hash(values_for_y))



[27, 85, 16, 43, 81, 28]
(1, 1, 1, 1, 1, 1)
(1, 1, 1, 1, 1, 1)
-1276324167427414696
-1276324167427414696


In [63]:

perfect_matches = 0

for _ in range(100_000):
    comparison_indices = random.sample(range(100), 6)
    nb_matches = sum([x[i] == y[i] for i in comparison_indices])
    if nb_matches == 6:
        perfect_matches += 1 
        
print(perfect_matches/100_000)        


0.72852


In [31]:
# we can simulate an event that has a probability of p 
# by simulating a (biased) coin flip

random.choices([0, 1], weights=[0.27, 0.73])

[1]

In [32]:
outcomes = []
for i in range(10):
    outcomes.append(random.choices([0, 1], weights=[0.27, 0.73])[0])
print(outcomes)


[1, 1, 0, 1, 1, 0, 1, 1, 1, 1]


In [197]:

def has_a_success(prob):
    # Return True if at least one of 10 independent draws is a success
    outcomes = []
    for i in range(10):
        outcomes.append(random.choices([0, 1], weights=[1-prob, prob])[0])
    return sum(outcomes) > 0


In [198]:
has_a_success(0.73)

True

In [199]:
has_a_success(0.73)

True

In [200]:
nb_successes = 0

prob = 0.73

for i in range(100_000):
    if has_a_success(prob):
        nb_successes +=1 
        
print(nb_successes/100_000)        

1.0


### Variables and Dimensionality

* The approach above required "binning" the data to maximize matches.
* This approach works well when binning is possible or when the data is binary
  * For example, when working with presence-absence data.
* This solution is not practical when bin boundaries are not easy to derive



### Data in Higher-Dimensional Space


* Another approach for finding similar items requires thinking of the data in higher-dimensional space.

* The data features are simply dimensions in space.
  * The instances of a dataset with two features can be plotted in 2-D space. 
  * The instances of a dataset with three features can be plotted in 3-D space. 
  * The instances of a dataset with n features can be plotted in n-D space. 


### Data in 2D

<img src="https://www.dropbox.com/s/t0ca5t1qzkr635b/2d-data.png?dl=1" alt="Drawing" style="width: 400px;"/>


### Data in 3D

<img src="https://www.dropbox.com/s/4flfaojlfb2a101/3d-data.png?dl=1" alt="Drawing" style="width: 400px;"/>


### Random Projections: An intuition

* Instead of hashing the data, let's project it onto a line.

    * Given some randomly selected line, project each point perpendicularly onto that line

* Two points that are close in higher-dimensional space are also close on the line, since projection never increases distances; the converse does not hold, so points that land close together are only *candidates* for being similar

<img src="https://www.dropbox.com/s/pjmbski6b0v8s8y/point_3d.png?dl=1" alt="Drawing" style="width: 400px;"/>




### Random Projections: An intuition - Cont'd

<img src="https://www.dropbox.com/s/vkmv5um7w256w33/3d_point_line.png?dl=1" alt="Drawing" style="width: 400px;"/>


### Random Projections: An intuition - Cont'd

<img src="https://www.dropbox.com/s/irik02lzrkuykgp/projection_1.png?dl=1" alt="Drawing" style="width: 400px;"/>


### Random Projections: An intuition - Cont'd

<img src="https://www.dropbox.com/s/8y35229n7sb66ep/projection_2.png?dl=1" alt="Drawing" style="width: 400px;"/>


### Random Projections: An intuition - Cont'd

* Intuition: two points that are projected close to each other (say to the same line bin) are potentially similar and should be inspected further
  * This is somewhat a relaxed version of hashing since instances don't need to hash to the same value 
     * Projecting close to each other in the same region is sufficient
      

### Projection Example - Cont'd

* Example in 2-D
<img src="https://www.dropbox.com/s/9szk1c2p0bvvacp/projection_3.png?dl=1" alt="Drawing" style="width: 500px;"/>


### Random Projections: An intuition - Cont'd
<span style="color:lightgrey;">
    
### Random Projections: An intuition - Cont'd

* Intuition: two points that are projected close to each other (say to the same line bin) are potentially similar and should be inspected further
  * This is somewhat a relaxed version of hashing since instances don't need to hash to the same value 
    * Projecting close to each other in the same bin is sufficient
    * We also don't need to bin the feature values
</span>
    

* Since the lines are randomly selected, two close points may still fall into separate bins
* To compensate, we can repeat the process multiple times
    

### Projection Example 

<img src="https://www.dropbox.com/s/irik02lzrkuykgp/projection_1.png?dl=1" alt="Drawing" style="width: 500px;"/>



### Projection Example - Cont'd

<img src="https://www.dropbox.com/s/8y35229n7sb66ep/projection_2.png?dl=1" alt="Drawing" style="width: 500px;"/>


###  Projection as a Dot Product


* How do you project a point onto a new axis?
 
* The dot product (or inner product) of two vectors, divided by the magnitude of the second, gives the length of the projection of the first onto the line spanned by the second.

* It describes the projected vector in terms of the reference vector
  * After normalizing, we get a quantity that represents the magnitude of the projected vector in terms of the reference vector

$$
\mathrm{proj}_B A = \frac{A \cdot B}{|B|}
$$
where $A \cdot B$ is simply the dot product of $A$ and $B$. In two dimensions:


$A \cdot B = A_x \times B_x + A_y \times B_y$ 

and $|B|$ is the magnitude of $B$. Dividing by it once gives the length of the projection; dividing by it a second time expresses that length as a multiple of $|B|$.

$|B| = \sqrt{B_x^2 + B_y^2}$

In [138]:
import math

# compute the normalized projection of A onto B.
A = (3,4)
B = (5,2)

A_dot_B = A[0]*B[0] + A[1]*B[1]
amp_B = math.sqrt(B[0]**2 + B[1]**2)
print(f"The magnitude of B is: {amp_B}")
proj = A_dot_B / amp_B
print (f"The magnitude of the projection  {proj}")

print(f"The ratio of the projection in terms of B is {proj/amp_B}")

The magnitude of B is: 5.385164807134504
The magnitude of the projection  4.270992778072193
The ratio of the projection in terms of B is 0.7931034482758621


In [None]:
print (A_dot_B / amp_B)

In [149]:
### Projection as a Dot Product

A = (0.95, 8)
B = (5,7)
C = (6,5)
D = (4,2)

amp_D = math.sqrt(D[0]**2 + D[1]**2)
print(f"The magnitude of D is: {amp_D}\n")


A_dot_D = A[0]*D[0] + A[1]*D[1]
proj_A = A_dot_D / amp_D
print (f"The magnitude of the projection  {proj_A}")
print(f"The ratio of the projection in terms of D is {proj_A/amp_D}")
print(f"Projection occurs in bin {math.ceil(proj_A/amp_D)} \n")


B_dot_D = B[0]*D[0] + B[1]*D[1]
proj_B = B_dot_D / amp_D
print (f"The magnitude of the projection  {proj_B}")
print(f"The ratio of the projection in terms of D is {proj_B/amp_D}")
print(f"Projection occurs in bin {math.ceil(proj_B/amp_D)} \n")


C_dot_D = C[0]*D[0] + C[1]*D[1]
proj_C = C_dot_D / amp_D
print (f"The magnitude of the projection  {proj_C}")
print(f"The ratio of the projection in terms of D is {proj_C/amp_D}")
print(f"Projection occurs in bin {math.ceil(proj_C/amp_D)} \n")


The magnitude of D is: 4.47213595499958

The magnitude of the projection  4.427414595449584
The ratio of the projection in terms of D is 0.99
Projection occurs in bin 1 

The magnitude of the projection  7.602631123499284
The ratio of the projection in terms of D is 1.6999999999999997
Projection occurs in bin 2 

The magnitude of the projection  7.602631123499284
The ratio of the projection in terms of D is 1.6999999999999997
Projection occurs in bin 2 
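
The three repeated blocks in the cell above can be folded into one helper that works in any number of dimensions. This is a sketch; `projection_bin` is a hypothetical name, and it follows the same convention as the cell: the bin is the ceiling of the projection length expressed in multiples of the reference vector's magnitude, using both coordinates of each point.

```python
import math

def projection_bin(point, direction):
    """Bin index of `point` along `direction`, in units of |direction|."""
    dot = sum(p * d for p, d in zip(point, direction))
    norm = math.sqrt(sum(d * d for d in direction))
    scalar_projection = dot / norm              # length of the projection
    return math.ceil(scalar_projection / norm)  # in multiples of |direction|

D = (4, 2)
points = {"A": (0.95, 8), "B": (5, 7), "C": (6, 5)}
bins = {name: projection_bin(p, D) for name, p in points.items()}
print(bins)
```

Points `B` and `C` land in the same bin along `D`, so they would be flagged as candidates for a closer comparison, while `A` would not.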



### Formalism Behind Random Projections 


* The theory behind random projections (the Johnson-Lindenstrauss lemma) guarantees approximate preservation of distances 

* We can project onto a much lower number of dimensions without distorting the distance between any two points by more than a factor of $(1 \pm \varepsilon)$

  * The number of dimensions required depends on $\varepsilon$ and on the number of instances in the data, not on the original number of features

https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma

### Implementation Details

  
* Suppose the input is an $n \times d$ matrix $A$ ($n$ instances, $d$ features).

* Using an __appropriate__ $d \times k$ matrix $R$, we can define the projection of $A$ as:
$$
E = A R
$$

* Therefore, the matrix multiplication $A R$ projects each of our data points onto the $k$ random vectors stored in the columns of $R$.

* Normalise by each vector's magnitude and take the ceiling (or floor) of these values to get the bin each instance falls into
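
A small pure-Python sketch of these steps. Several choices here are assumptions rather than part of the original material: the random directions are drawn with Gaussian entries (one common choice for $R$), they are normalised to unit length up front, a hypothetical `bin_width` of 1.0 sets the bin size, and the floor is used to pick the bin. The function names are illustrative.

```python
import math
import random

def random_directions(d, k, rng):
    """Return k random d-dimensional unit vectors (the columns of R)."""
    directions = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        directions.append([c / norm for c in v])  # normalise once, up front
    return directions

def bin_signature(point, directions, bin_width=1.0):
    """Project `point` onto each direction and return the tuple of bin indices."""
    return tuple(
        math.floor(sum(p * c for p, c in zip(point, direction)) / bin_width)
        for direction in directions
    )

rng = random.Random(0)
R = random_directions(d=3, k=4, rng=rng)

A = [(1.0, 2.0, 3.0), (1.0, 2.0, 3.0), (-40.0, 7.0, 15.0)]
signatures = [bin_signature(row, R) for row in A]
print(signatures)
```

Each signature is a tuple, so it can be used as a dictionary key to group instances that share bins, exactly as with the hashed feature subsets earlier.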


In [None]:
### Example of Random projections in Action 
https://github.com/spotify/annoy