<a href="https://colab.research.google.com/github/nikibhatt/Unit3-Sprint3-MVP/blob/master/Explanation_of_%22embedding%22_model_for_Medicine_Cabinet_Team3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Explaining a Natural-Language-Processing Machine-Learning Method

# How does the Embedding Model work?

The search models used in the main app, and in the two demo apps below...

https://medicinal-cultivars.herokuapp.com/

https://cultivars-for-symptoms.herokuapp.com/

...employ a deceptively simple method. In this particular case, this way of using "embeddings" (a term explained below) works better than fancier, more elaborate models.

This success may be because we are searching through a fixed list of products. (As an example for people with some data-science experience: because the target list is always the same, the train, validate, test, and predict targets are always the same products, so 'overfitting' would not be an issue, though this model is probably not 'overfit' per se.)

For example, a standard KNN model was ~86% accurate (not too bad), but this "embedding-based model" is 100% accurate. How can this be? It's strangely simple, as you will see below.



# "Distance"

If I were to ask you to tell me what the distance is between two words, what would your first thought be? e.g. "anxiety" and "fruity"

My first thought would be that the question doesn't make much sense. There is no obvious context for what 'distance' means when comparing words.

If I were to ask you to tell me the distance between two numbers, what would your first thought be? e.g. "2" and "5"? 

My first thought would be that this makes more sense and the answer is likely to be something like 3 (since 2 + 3 = 5, or 5 - 2 = 3). Though it could be 2 kilometers vs. 5 miles, or 2 bonds and 5 stocks. The context is not completely clear, but at least it makes more sense.

Now, what if I were to ask you to tell me the distance between two coordinate points? Let's say, for points written as (x, y), we are asking about the coordinate points (3, 2) and (7, 8).

Now you have a context for exactly what "distance" means, and the two things you are comparing are locations! So now it is easy to find the distance: plot the two points on a graph and measure the length of the straight line between them. At last, we have a case (for asking about 'distance') that is completely clear and makes complete sense!

![alt text](https://www.wikihow.com/images/thumb/1/1f/Find-the-Distance-Between-Two-Points-Step-2.jpg/aid2897060-v4-728px-Find-the-Distance-Between-Two-Points-Step-2.jpg)
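To make the picture concrete, here is a minimal sketch of that measurement in Python for the two example points. The variable names are only illustrative, and the "distance" used is the ordinary straight-line (Euclidean) distance.

```python
import math

# The two example points from the text, written as (x, y).
point_a = (3, 2)
point_b = (7, 8)

# Straight-line (Euclidean) distance: the square root of the
# summed squared differences along each axis.
dx = point_b[0] - point_a[0]   # 7 - 3 = 4
dy = point_b[1] - point_a[1]   # 8 - 2 = 6
distance = math.hypot(dx, dy)  # sqrt(4**2 + 6**2) = sqrt(52)

print(round(distance, 3))      # about 7.211
```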

# Distance & Words

In our last example we saw that it makes a lot of sense to ask about the distance between two coordinate locations. Going back to our first example, our "word" example, we saw that it is unclear and confusing to ask about the distance between two words.

## But what if...

What if we could assign not just a number but a coordinate location to each word? If we did that, then it could be very simple and clear to ask about the distance (and to answer).

And that is exactly what "embeddings" let us do. The word "embedding" itself is sadly not very intuitively clear, but calculating values and distances of "embeddings" for words is exactly like the above example of getting the distance between coordinate locations. 

So simple it may even be shocking. 

# Let's See "Embeddings" in Action:

So an "embedding" is a coordinate-location-identity for a word. What does that look like? 

As you might imagine, there is no single way to assign a location-number to a word. You can use an existing method, or you can make up your own new method as you like.
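As one illustration of "making up your own method", here is a toy sketch that turns a group of words into a list of numbers by counting how often each word of a small vocabulary appears. The names `VOCABULARY` and `toy_embedding` are hypothetical and invented for this illustration; this is not how the pre-packaged service used in this notebook computes its embeddings, which are far richer.

```python
# A made-up, toy "embedding": one number per vocabulary word,
# counting how many times that word appears in the text.
# VOCABULARY and toy_embedding are hypothetical names for illustration only.
VOCABULARY = ["anxiety", "citrus", "fruity", "pain", "sleepy"]

def toy_embedding(text):
    """Turn a comma-separated string of words into a list of counts."""
    words = [w.strip().lower() for w in text.split(",")]
    return [words.count(term) for term in VOCABULARY]

print(toy_embedding("Anxiety, Fruity, Sleepy"))  # [1, 0, 1, 0, 1]
print(toy_embedding("Anxiety, Citrus, Sleepy"))  # [1, 1, 0, 0, 1]
```

Each list is a "location" with five coordinates instead of two, and the two word groups land in nearby (but not identical) places because they share two of their three words.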

Here in this next example we will use a pre-packaged way to calculate the "embedding" (the number-coordinate-location) for each word.

Then we will use one simple calculation to see how big or small the difference is between two locations, or between two "words"...hmm...

### Note:
In our number-example we were talking about the distance between just two points. But we could do the same for two 'groups of points' or 'groups of words' still with one calculation: e.g. "Cosine Distance" https://reference.wolfram.com/language/ref/CosineDistance.html

So we will be calculating the similarity or 'distance between' two groups of words (which should be starting to sound very much like the search-model in the app: comparing the user's set of words and the products' set of words). 
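For readers who want to see what that one calculation does, here is a minimal sketch of cosine distance written out by hand and checked against `scipy.spatial.distance.cosine`. The helper name `cosine_distance` and the small example vectors are made up for illustration; real embeddings have many more numbers per vector.

```python
import numpy as np
from scipy import spatial

def cosine_distance(u, v):
    """Return 1 minus the cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical example vectors (not real embeddings).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, just twice as long
c = [3.0, -1.0, 0.5]   # points in a different direction

print(cosine_distance(a, b))           # essentially 0: same direction
print(cosine_distance(a, c))           # clearly larger than 0
print(spatial.distance.cosine(a, c))   # SciPy's value for the same pair
```

Note that cosine distance compares the direction of two vectors and ignores their length, which is why `a` and `b` come out at (essentially) zero distance.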




# Let's look at a simple illustration:

You can make a copy of this notebook and explore and play with the data yourself, or just read along here.

You can also see https://www.basilica.ai/quickstart/python/


In [0]:
# installing an extra package
!pip install basilica

Collecting basilica
  Downloading https://files.pythonhosted.org/packages/68/19/6216f1c0ad6d0f738bd1061cb5c65097021b41f3891046fac87bc4c4e1ae/basilica-0.2.8.tar.gz
Building wheels for collected packages: basilica
  Building wheel for basilica (setup.py) ... done
  Created wheel for basilica: filename=basilica-0.2.8-cp36-none-any.whl size=4710 sha256=07df912a708184ed7ed3cedce17bc145e28d662309760db36554d180370de470
  Stored in directory: /root/.cache/pip/wheels/31/18/9f/46f6face8baf98e31b52bf91a0d76930ec76860f9e9211104d
Successfully built basilica
Installing collected packages: basilica
Successfully installed basilica-0.2.8


In [0]:
# importing packages and libraries
import basilica
from scipy import spatial

In [0]:
# making a list of 3 groups of words
groups_of_word_we_will_compare = [
    # word group 1                                  
    "Anxiety, Fruity, Sleepy",

    # word group 2
    "Pain, Disel, PTSD",
    # word group 3
    "Anxiety, Citrus, Sleepy",
]

# Here the pre-packaged embedding calculator assigns a list of number-coordinates to each word group;
# these lists of numbers are the "embeddings"
with basilica.Connection('36a370e3-becb-99f5-93a0-a92344e78eab') as c:
    embeddings = list(c.embed_sentences(groups_of_word_we_will_compare))

## What do embeddings (number-coordinates) actually look like??
## They look like this!

In [0]:
print(embeddings)

[[-0.109279, 0.055283, 0.148082, -0.0606532, -0.358577, -0.0248224, 0.150399, 0.176188, 0.0763853, -0.0937356, -0.0032826, 0.354016, -0.136879, 0.392298, 0.0465789, 0.239705, -0.0532615, 0.332357, 0.126538, -0.0702625, 0.158046, 0.256136, 0.0820996, -0.160131, 0.270934, -0.0544844, -0.487221, -0.0824355, 0.0716706, 0.0859297, -0.100986, 0.30651, 0.0292215, -0.146621, 0.580495, 0.181444, 0.362291, 0.0470983, 0.139075, -0.132852, 0.184951, 0.0103079, 0.188333, -0.00680691, -0.138658, -0.00687647, -2.54725, -0.0496548, -0.231092, -0.270531, 0.291893, -0.206562, 0.347545, -0.103569, 0.026424, 0.0185752, -0.1814, 0.553098, 0.470405, 0.233989, 0.0143653, -0.110829, -0.0718075, 0.114413, 0.0521732, -0.106131, -0.150378, 0.182184, 0.101157, 0.584072, -0.142362, -0.181064, 0.322339, 0.0613705, -0.14788, -0.0567258, 0.236887, 0.0277372, -0.128257, 0.147997, -0.310352, 0.26945, 0.632784, 0.04139, -0.0554253, 0.233316, -0.0383452, -0.0490961, 0.259438, 0.193747, 0.0594865, -0.221416, 0.0588757, 0.

# Distance between words is achieved!

So let's see which groups of words are "closer". Note that because we are measuring the 'distance between' them, a smaller number means that the two groups are closer together.

In [0]:
print('This is the distance between "Anxiety, Fruity, Sleepy" vs. "Pain, Disel, PTSD"')
print(spatial.distance.cosine(embeddings[0], embeddings[1]))

This is the distance between "Anxiety, Fruity, Sleepy" vs. "Pain, Disel, PTSD"
0.13202609478768856


In [0]:
print('This is the distance between "Anxiety, Fruity, Sleepy" vs. "Anxiety, Citrus, Sleepy"')
print(spatial.distance.cosine(embeddings[0], embeddings[2]))

This is the distance between "Anxiety, Fruity, Sleepy" vs. "Anxiety, Citrus, Sleepy"
0.011349934207927959


## Since distance (how far apart) and similarity (how close together) are roughly opposites,
## computing (1 - distance) gives a more intuitive result (it reads a bit like a similarity score, though it is not a true probability).


In [0]:
print('This is the similarity of "Anxiety, Fruity, Sleepy" vs. "Anxiety, Fruity, Sleepy"')
print(1 - spatial.distance.cosine(embeddings[0], embeddings[0]))
print('The "~similarity score" is 1, so they are predicted to be an identical match, which is correct.')


This is the similarity of "Anxiety, Fruity, Sleepy" vs. "Anxiety, Fruity, Sleepy"
1.0
The "~similarity score" is 1, so they are predicted to be an identical match, which is correct.


In [0]:
print('This is the similarity of "Anxiety, Fruity, Sleepy" vs. "Pain, Disel, PTSD"')
print(1 - spatial.distance.cosine(embeddings[0], embeddings[1]))
print('The "~similarity score" is about 0.87, which is lower than .99 or 1, and so less similar, which is correct.')

This is the similarity of "Anxiety, Fruity, Sleepy" vs. "Pain, Disel, PTSD"
0.8679739052123114
The "~similarity score" is about 0.87, which is lower than .99 or 1, and so less similar, which is correct.


In [0]:
print('This is the similarity of "Anxiety, Fruity, Sleepy" vs. "Anxiety, Citrus, Sleepy"')
print(1 - spatial.distance.cosine(embeddings[0], embeddings[2]))
print('The "~similarity score" is 0.989, which is higher and very similar, which is correct.')

This is the similarity of "Anxiety, Fruity, Sleepy" vs. "Anxiety, Citrus, Sleepy"
0.988650065792072
The "~similarity score" is 0.989, which is higher and very similar, which is correct.


# Complete-Matches are Identified...

## Here you can hopefully see how this model can "predict with 100% accuracy" a match between the user's input and the product description: when the two sets of words are identical, it calculates a distance of zero between them! So simple...yet so useful.
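To tie this back to search, here is a minimal sketch of how a lookup of this kind might pick the best product: compute the similarity between the user's embedding and every product's embedding, then keep the highest. Everything here is a stated assumption for illustration: the product names, the tiny hand-made 3-number "embeddings", and the helpers `cosine_similarity` and `best_match` are hypothetical, and the real apps may be organized differently.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical, hand-made "embeddings" standing in for real ones.
product_embeddings = {
    "product_a": [0.9, 0.1, 0.3],
    "product_b": [0.1, 0.8, 0.5],
    "product_c": [0.4, 0.4, 0.9],
}

def best_match(query_embedding, products):
    """Return the name of the product most similar to the query."""
    return max(
        products,
        key=lambda name: cosine_similarity(query_embedding, products[name]),
    )

# A query whose embedding is identical to product_b's embedding.
query = [0.1, 0.8, 0.5]
print(best_match(query, product_embeddings))  # product_b
```

Because the query's numbers are identical to one product's numbers, that product's similarity is (up to floating-point rounding) 1, which no other product in the list can beat.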