## **Vectorization**

Turning raw data (text, images, time series, ...) into numerical vectors so that models can work with them.
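
As a minimal sketch of the idea, the cell below turns a short piece of text into a vector of word counts. The sentence and the counting scheme are assumptions chosen for illustration; real pipelines usually rely on richer representations such as the embeddings introduced next.

```python
import numpy as np

# Hypothetical toy sentence; any text would do
text = "the cat sat on the mat"
tokens = text.split()

# Vocabulary: one vector position per distinct word (sorted for a stable order)
vocab = sorted(set(tokens))

# Count how often each vocabulary word appears in the text
vector = np.array([tokens.count(word) for word in vocab])

print(vocab)
print(vector)
```

Each position of `vector` corresponds to one word in `vocab`, so the text is now something a model can do arithmetic on.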

#### **Embeddings**

A dense vector representation, learned by a neural network, in which the distances and directions between vectors encode meaning, context, and behavior.

In [2]:
## dot product
import numpy as np

## 4-dim embedding
apple = np.array([-0.4, -0.3, -0.1, 0.7])
pear = np.array([0.3, 0.1, 0.9, -0.3])

## dot product
apple.dot(pear)

np.float64(-0.44999999999999996)
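
The dot product above is the sum of the element-wise products of the two vectors. The sketch below recomputes it by hand with the same toy `apple` and `pear` vectors; the variable name `manual` is only illustrative.

```python
import numpy as np

apple = np.array([-0.4, -0.3, -0.1, 0.7])
pear = np.array([0.3, 0.1, 0.9, -0.3])

# Multiply matching positions, then add everything up
manual = np.sum(apple * pear)

# Same value as the built-in dot product (up to floating-point rounding)
print(np.isclose(manual, apple.dot(pear)))
```

A negative value means the two toy vectors point in broadly opposite directions.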

In [4]:
import spacy

## download the medium English model (en_core_web_md)
!python -m spacy download en_core_web_md


Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [5]:
## pre-trained embedding model
#### The neural network has already been trained (optimized) on data

## load the model instance
nlp = spacy.load('en_core_web_md')

In [9]:
## embeddings
word1 = nlp('apple').vector
word2 = nlp('pear').vector
word3 = nlp('king').vector
word4 = nlp('queen').vector
word5 = nlp('car').vector

In [10]:
## dot product apple and pear
word1.dot(word2)

np.float32(47.55341)

In [11]:
word1.dot(word3)

np.float32(9.994232)

## **Vector Norms**

Norms measure the magnitude of a vector (how big it is). They also measure the distance between two vectors.

Let's start with the distance between two vectors:

* actual vs predicted
* customer1 vs customer2 (how different their profiles are)
* two vector embeddings

In [12]:
## actual vs predicted
salary = y = np.array([80, 75, 95, 100])
p_salary = yhat = np.array([85, 81, 72, 95])

The $L_2$-Norm is defined as the square root of the sum of squared differences:


$$ \sqrt{\sum(y - \hat{y})^2} $$

In norm notation, this is written $||y - \hat{y}||_2$. It is also called the Euclidean distance between $y$ and $\hat{y}$. Distances are used in clustering, embeddings, neural networks, and regularization.

In [18]:
## default L2 norm
np.linalg.norm(y - yhat, ord = 2)

np.float64(24.79919353527449)
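
To connect the formula to the function call, here is a small sketch that rebuilds the $L_2$ distance step by step from its definition, using the same salary values. The names `diff` and `l2_manual` are just for illustration.

```python
import numpy as np

y = np.array([80, 75, 95, 100])
yhat = np.array([85, 81, 72, 95])

# Element-wise differences: [-5, -6, 23, 5]
diff = y - yhat

# Square, sum, square root: sqrt(25 + 36 + 529 + 25) = sqrt(615)
l2_manual = np.sqrt(np.sum(diff**2))

# Matches np.linalg.norm up to floating-point rounding
print(np.isclose(l2_manual, np.linalg.norm(diff, ord=2)))
```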

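The $L_1$-Norm is defined as the sum of the absolute differences:

$$ \sum|y - \hat{y}|$$

Because the $L_2$-Norm squares each difference, a few large differences can dominate it. The $L_1$-Norm avoids the squaring, but the absolute value is not differentiable at zero, which makes optimization more computationally complex.

The usual notation is $||y - \hat{y}||_1$.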
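
As a quick check of the definition, the sketch below computes the $L_1$ distance for the same salary vectors, first by hand and then with `np.linalg.norm(..., ord=1)`. The variable names are illustrative.

```python
import numpy as np

y = np.array([80, 75, 95, 100])
yhat = np.array([85, 81, 72, 95])

diff = y - yhat

# Sum of absolute differences: 5 + 6 + 23 + 5 = 39
l1_manual = np.sum(np.abs(diff))

# Same quantity through NumPy's norm function
l1_numpy = np.linalg.norm(diff, ord=1)

print(l1_manual, l1_numpy)
```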

In [20]:
np.linalg.norm(word1 - word2, ord = 2)

np.float32(0.0)

### **Magnitude**

What if we take the norm of a single vector? The result is its magnitude (its length).

In [21]:
## how big is a vector
np.linalg.norm(y)

np.float64(176.21010186706096)
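
One common use of the magnitude is to rescale a dot product so that vector length no longer affects it, which gives the cosine similarity. The sketch below applies it to the toy 4-dimensional `apple` and `pear` vectors; the helper `cosine_similarity` is a hypothetical name written here for illustration, not part of NumPy.

```python
import numpy as np

apple = np.array([-0.4, -0.3, -0.1, 0.7])
pear = np.array([0.3, 0.1, 0.9, -0.3])

def cosine_similarity(a, b):
    """Dot product divided by the product of the two L2 magnitudes."""
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

cos = cosine_similarity(apple, pear)
print(cos)
```

The result always lies between -1 and 1, so two vectors can be compared even when one is much longer than the other.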

#### **Regularization**

In ML, the slopes and y-intercepts are called weights and biases, respectively.

$$w = [w_1, w_2, ..., w_k]$$

and

$$ b = [b_0]$$

The idea is to push unimportant weights toward zero.

We tell the model to make $w$ as close to zero as possible without sacrificing performance, that is, to keep $||w||_2$ small.
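
One way to make this concrete is to add the squared norm of the weights to the error the model minimizes. The sketch below assumes an objective of the form SSE plus `alpha` times $||w||_2^2$; the function name `ridge_objective`, the toy data, and the choice to leave the bias unpenalized are all assumptions made for illustration.

```python
import numpy as np

def ridge_objective(w, b, X, y, alpha):
    """SSE plus an L2 penalty on the weights (the bias b is not penalized)."""
    residuals = y - (X @ w + b)
    sse = np.sum(residuals**2)
    penalty = alpha * np.linalg.norm(w, ord=2)**2
    return sse + penalty

# Hypothetical toy data: 3 observations, 2 features
X_toy = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y_toy = np.array([5.0, 4.0, 7.0])
w_toy = np.array([2.0, 1.0])
b_toy = 0.0

loss_plain = ridge_objective(w_toy, b_toy, X_toy, y_toy, alpha=0.0)  # SSE only
loss_reg = ridge_objective(w_toy, b_toy, X_toy, y_toy, alpha=1.0)    # SSE + ||w||_2^2
print(loss_plain, loss_reg)
```

With `alpha = 0` the objective is the plain SSE. A larger `alpha` makes large weights more expensive, so the minimizer trades a little fit for a smaller $||w||_2$.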

In [22]:
from sklearn.linear_model import LogisticRegression

In [24]:
## linear regression
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/martinwg/ISA630/refs/heads/master/data/housing_data.csv")
df.head()

| | Bedrooms | Area | City_Distance | Age | Price |
|---|---|---|---|---|---|
| 0 | 1 | 26.184098 | 1286.68 | 67 | 96004.804557 |
| 1 | 1 | 34.866901 | 1855.25 | 30 | 92473.72257 |
| 2 | 1 | 36.980709 | 692.09 | 24 | 98112.51994 |
| 3 | 1 | 17.445723 | 1399.49 | 66 | 92118.326874 |
| 4 | 1 | 52.587646 | 84.65 | 3 | 98976.653176 |


In [25]:
## create a random 'Bulbs' feature (pure noise, unrelated to Price)
np.random.seed(630)

df['Bulbs'] = np.random.randint(10, 25, df.shape[0])
df.head()

| | Bedrooms | Area | City_Distance | Age | Price | Bulbs |
|---|---|---|---|---|---|---|
| 0 | 1 | 26.184098 | 1286.68 | 67 | 96004.804557 | 18 |
| 1 | 1 | 34.866901 | 1855.25 | 30 | 92473.72257 | 14 |
| 2 | 1 | 36.980709 | 692.09 | 24 | 98112.51994 | 23 |
| 3 | 1 | 17.445723 | 1399.49 | 66 | 92118.326874 | 10 |
| 4 | 1 | 52.587646 | 84.65 | 3 | 98976.653176 | 16 |


In [26]:
## Feature matrix
X = df.drop('Price', axis = 1)
y = df.Price

In [28]:
## Regression (Descriptive)
import statsmodels.api as sm

XD = sm.add_constant(X)
reg = sm.OLS(y, XD).fit()
print(reg.summary())

                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.672
Model:                            OLS   Adj. R-squared:                  0.672
Method:                 Least Squares   F-statistic:                     1764.
Date:                Wed, 04 Feb 2026   Prob (F-statistic):               0.00
Time:                        16:08:03   Log-Likelihood:                -39354.
No. Observations:                4308   AIC:                         7.872e+04
Df Residuals:                    4302   BIC:                         7.876e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const          9.665e+04    178.340    541.956

In [29]:
## Predictive models (no p-values; models need to be regularized)
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

## fit
lr.fit(X,y)

## coefficients
lr.coef_

array([116.76362474,  25.18760016,  -2.88986832, -26.30424038,
         2.73249087])

In [30]:
## calculate SSE
#### sum of squared errors (l2)
yhat = lr.predict(X)
SSE = np.linalg.norm(y - yhat)**2
print(f'SSE for unregularized model is: {SSE}')

SSE for unregularized model is: 21698212635.04964


In [31]:
## OLS: b = (X'X)^-1 (X'y) is the solution you get when you minimize SSE
## L2 Regularized: min SSE + penalty (alpha * ||b||_2^2) - Ridge Regression
from sklearn.linear_model import Ridge

Ridge??
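
The two comment lines in the cell above can be sketched directly in NumPy. The example below assumes synthetic data with two useful features and one pure-noise feature, no intercept, and an arbitrary penalty strength `alpha`; all names (`X_toy`, `y_toy`, `b_ols`, `b_ridge`) are illustrative, and this closed form is a textbook version, not a claim about how scikit-learn solves the problem internally.

```python
import numpy as np

rng = np.random.default_rng(630)

# Hypothetical synthetic data: two useful features and one pure-noise feature
n = 200
X_toy = rng.normal(size=(n, 3))
y_toy = 3.0 * X_toy[:, 0] + 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=n)

alpha = 50.0  # penalty strength (arbitrary choice for illustration)

# OLS: solve (X'X) b = X'y
b_ols = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

# Ridge: solve (X'X + alpha * I) b = X'y
b_ridge = np.linalg.solve(X_toy.T @ X_toy + alpha * np.eye(3), X_toy.T @ y_toy)

# The ridge coefficients have a smaller L2 norm than the OLS ones
print(np.linalg.norm(b_ols), np.linalg.norm(b_ridge))
```

Adding `alpha * I` to `X'X` is the only change from OLS, and it is what shrinks the coefficient vector.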

In [33]:
## L1 Regularized: min SSE + penalty (||b||_1) - Lasso Regression
from sklearn.linear_model import Lasso

Lasso??

## **When L2 vs L1?**

L2 is faster and more efficient, but penalizing $||w||_2$ only makes the values of $w$ close to zero, not necessarily exactly zero.


e.g.,

$$ w = [100, 10, 2, 3, 0.5]$$

L1 is slower, but penalizing $||w||_1$ drives the weights of unimportant variables to exactly zero.

e.g.,

$$ w = [100, 10, 2, 3, 0]$$
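
A small experiment can illustrate the difference. The sketch below fits scikit-learn's `Ridge` and `Lasso` on synthetic data in which the third feature is pure noise; the data, the seed, and the `alpha` values are assumptions picked for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(630)

# Hypothetical synthetic data: the third feature has no effect on the target
n = 200
X_toy = rng.normal(size=(n, 3))
y_toy = 3.0 * X_toy[:, 0] + 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=n)

# L2 penalty: shrinks all weights, but rarely to exactly zero
ridge = Ridge(alpha=50.0).fit(X_toy, y_toy)

# L1 penalty: can set the weights of unimportant features to exactly zero
lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)

print("Ridge:", ridge.coef_)
print("Lasso:", lasso.coef_)
```

On data like this, one would expect Lasso to report a coefficient of exactly zero for the noise feature, while Ridge leaves it small but nonzero.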