## Introduction to Vector Embeddings

We have learnt what a vector is in the lesson [Data Structures - Scalars, Vectors, Matrices](./notes/data-structs.ipynb) and seen them used to train a basic neural network in [Neural Networks Part 2 - Training (MicroGrad)](./notes/nn-training.ipynb).

In this lesson, we will learn about a special type of vector called an **embedding** and how it is used to represent words in a text.

### Why Do We Need Numerical Representations of Data?

In machine learning and data analysis, algorithms typically work with numerical data. To make predictions or find patterns, we need to represent our data in a way that algorithms can process mathematically.

Remember our basic linear regression model?

$$
y = mx + c
$$

Here, we need to find the values of $m$ and $c$ that best fit the data. We do this by minimizing the error between the predicted values and the actual values. Remember also that $x$ can be a scalar (a single value) or a vector (multiple values, or features).

For example, in a dataset of house prices, each house might be described by a vector of features such as the number of bedrooms, the number of bathrooms, the size of the house, the location, etc.
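
Written out for three features, the same model might look like the equation below. The subscripted symbols $m_1, m_2, m_3$ and $x_1, x_2, x_3$ are labels introduced here for illustration only: each $x_i$ is one feature (say size, bedrooms and bathrooms) and each $m_i$ is the weight for that feature.

$$
y = m_1 x_1 + m_2 x_2 + m_3 x_3 + c
$$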

In [5]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data for four different houses
# [Size (square feet), Number of bedrooms, Number of bathrooms]
X = np.array([
    [1500, 2, 1],   # house 1
    [2500, 3, 2],   # house 2   
    [3000, 4, 2],   # house 3
    [3500, 5, 3]    # house 4
])

# House prices (in dollars) for the same four houses above
y = np.array([200000, 300000, 350000, 400000])

model = LinearRegression()
model.fit(X, y)

# Predict the price of a 1750 square foot house with 3 bedrooms and 1 bathroom
new_house = np.array([[1750, 3, 1]])

In [6]:
predicted_price = model.predict(new_house)
print(f"The predicted price of the house is ${predicted_price[0]:,.2f}")

The predicted price of the house is $225,000.00


### Explanation:
We have a dataset of four houses with their features and prices. We want to predict the price of a new house with 1750 square feet, 3 bedrooms, and 1 bathroom. We use the *LinearRegression* model (from a Python library called *scikit-learn*, imported as *sklearn*) to find the best fit between the features and the price. We then use the *predict* method to predict the price of the new house.
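
To connect this back to $y = mx + c$, here is a minimal sketch of what a prediction amounts to: multiply each feature by its weight, add the results together, then add the intercept. The `weights` and `intercept` values below are hand-picked for illustration (an assumption, not values read from the fitted model); they happen to reproduce the four sample prices exactly. The values a fitted scikit-learn model actually learned are stored in `model.coef_` and `model.intercept_`.

```python
# Feature rows copied from the sample data: [size (sq ft), bedrooms, bathrooms]
houses = [
    [1500, 2, 1],
    [2500, 3, 2],
    [3000, 4, 2],
    [3500, 5, 3],
]
prices = [200000, 300000, 350000, 400000]

# Hand-picked for illustration (an assumption, not read from the fitted model):
# one weight per feature (the "m" values) and one intercept (the "c" value).
weights = [100.0, 0.0, 0.0]
intercept = 50000.0


def predict(features, weights, intercept):
    """Return the weighted sum of the features plus the intercept."""
    return sum(w * x for w, x in zip(weights, features)) + intercept


# Compare the prediction with the known price for each sample house
for features, price in zip(houses, prices):
    print(features, predict(features, weights, intercept), price)

# The new house from the example above
print(predict([1750, 3, 1], weights, intercept))  # prints 225000.0
```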

### Challenges with text data

Sometimes our data is not numerical like the house prices example above.  If our data is text, then we need to convert this text into a numerical representation.  This is where **embeddings** come in.

### What is an embedding?

An embedding is simply a clever way of turning words into numbers. A vector embedding is that same data represented as a vector.

The term embed means to place something firmly into a space.  So we are placing our text into a vector space - i.e. the space of numbers that can mathematically represent our text. 

Typically, if our input is a piece of text (e.g. a chapter from a book), the first step is to break this text into individual words. We call this process **tokenization**.
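
As a rough sketch of the idea, the snippet below splits a sentence into lowercase words using a regular expression. The `tokenize` function and the sample sentence are made up for illustration; tokenizers used in practice are usually more sophisticated than this.

```python
import re


def tokenize(text):
    """Lowercase the text and return its words, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


tokens = tokenize("The cat chased the mouse.")
print(tokens)  # prints ['the', 'cat', 'chased', 'the', 'mouse']
```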

Then, each word (or token) is converted into a numerical vector. This numerical vector is called the **vector embedding** of the word.

Let's use a simple example. Suppose we have three words: cat, dog and mouse. We can represent each of these as a three-dimensional vector.

For cat we make the first value in the vector 1.  

So $cat = [1, 0, 0]$.

For dog we make the second value in the vector 1. 

So $dog = [0, 1, 0]$.

For mouse we make the third value in the vector 1. 

So $mouse = [0, 0, 1]$.

This approach of setting a different position to 1 for each word is called a **one-hot encoding**. This is because only one of the values in the vector is 1 for any particular word and the rest are 0.
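
The three vectors above can be produced with a few lines of Python. This is a minimal sketch: the `one_hot` helper is a name chosen here for illustration, and it assumes the word is present in the vocabulary list.

```python
vocabulary = ["cat", "dog", "mouse"]


def one_hot(word, vocabulary):
    """Return a list of 0s with a single 1 at the word's position in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


for word in vocabulary:
    print(word, one_hot(word, vocabulary))
# prints:
# cat [1, 0, 0]
# dog [0, 1, 0]
# mouse [0, 0, 1]
```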

Of course, you can see the limitation here: there are only three words we can represent with these vectors. The minute we add another word, we have to increase the size of every vector by 1.
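
A small sketch makes this growth visible. The `one_hot_table` helper and the extra word "bird" are hypothetical, added only to show that every vector gets longer as soon as the vocabulary grows.

```python
def one_hot_table(vocabulary):
    """Map each word to a vector with a single 1 at the word's position."""
    size = len(vocabulary)
    return {
        word: [1 if i == position else 0 for i in range(size)]
        for position, word in enumerate(vocabulary)
    }


small = one_hot_table(["cat", "dog", "mouse"])
larger = one_hot_table(["cat", "dog", "mouse", "bird"])

print(len(small["cat"]), small["cat"])    # prints 3 [1, 0, 0]
print(len(larger["cat"]), larger["cat"])  # prints 4 [1, 0, 0, 0]
```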

Let's next look at a slightly more sophisticated approach.