Using linear algebra, we can divide a set of points on a given line with a linear function:
$f(x, w) = w_1*x_1+w_2$


We can then extend the same idea to a 2D plane:
$f(x,w) = w_1*x_1 +w_2*x_2+ w_3$

This idea extends naturally to an arbitrary number of dimensions N. We then have $f(x,\mathbf{W}) = \mathbf{W}x + b$, where $x$ is an N-dimensional vector, $b$ is a scalar, and $\mathbf{W}$ is a $1 \times N$ matrix.
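
As a minimal sketch of the N-dimensional case, the snippet below evaluates $f(x, W) = Wx + b$ by hand for N = 3. The values of `W`, `b` and `x` are made up for illustration, and using the sign of the score to decide on which side a point falls is an assumption about how the output is turned into a decision.

```python
import torch

# Illustrative values only (not from the text), with N = 3
W = torch.tensor([[1.0, -2.0, 0.5]])  # 1 x N weight matrix
b = torch.tensor(0.5)                 # scalar bias
x = torch.tensor([2.0, 1.0, 4.0])     # N-dimensional point

# f(x, W) = W x + b
score = W @ x + b                     # tensor of shape (1,)

# Assumed decision rule: the sign of the score says on which side
# of the hyperplane the point lies
side = 1 if score.item() > 0 else -1
print(score, side)
```

Here the score is 1*2 - 2*1 + 0.5*4 + 0.5 = 2.5, so the point lands on the positive side.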

In PyTorch, we can build a linear classifier with 5 inputs and 10 outputs in a single line of code. The following code initializes a trainable weight matrix (together with a bias vector):

In [4]:
pip install torch 



In [6]:
import torch
import torch.nn as nn

In [11]:
classifier = nn.Linear(5, 10)
# print(classifier)
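
To make the trainable matrix concrete, the sketch below inspects the parameters that `nn.Linear(5, 10)` creates. With 10 outputs, the weight is a $10 \times 5$ matrix and the bias is a vector of length 10. The input `x` is random dummy data, and the tolerance passed to `torch.allclose` is an arbitrary choice.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(5, 10)

# One row of weights per output, one column per input
print(classifier.weight.shape)  # torch.Size([10, 5])
print(classifier.bias.shape)    # torch.Size([10])

# The layer computes W x + b; check against a manual computation
x = torch.randn(5)              # dummy input with 5 features
y = classifier(x)
manual = classifier.weight @ x + classifier.bias
print(torch.allclose(y, manual, atol=1e-6))
```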

Training the classifier. We know that we need to find $W$ and $b$ in order to classify our points. But how do we do that? First of all, we need training data, i.e. data points $x$ whose targets $t$ are already known.

Before going through the training process, we must introduce two new concepts.
#### 1. Loss Function
   It measures how good or bad the classification of a data point is. More precisely:
   Given a dataset $\{(x_i,t_i)\}$ of $M$ points, where $x_i$ is an N-dimensional point in space and $t_i$ is an integer that defines the point's category, the loss is the distance between the prediction $f(x_i, W)$ and the target $t_i$, and $C_i(f(x_i,W),t_i)$ is the cost for a single example $x_i$. The overall loss over the entire training data is simply the average of all the individual losses. In practice, however, we rarely average the loss over all data points at once; it is usually computed over a small batch of examples at a time.

   There are multiple choices of loss function. Here we will look at the squared error:
   $C = \sum_i (f(x_i,W) - t_i)^2$

Let's see how we can write the same in `torch`:

In [18]:

# Define a simple linear model: input dimension=10, output dimension=3
# This means the model will map a 10-dimensional vector to 3 output values.
model = nn.Linear(10, 3)

# Define a Mean Squared Error (MSE) loss function.
# It measures the squared difference between predicted and target values.
loss = nn.MSELoss()

# Create a dummy input vector of size 10 (random values sampled from a normal distribution).
# This simulates one data point with 10 features.
input_vector = torch.randn(10)

# Define the target output vector.
# Here we assume a 3-class problem, and the target is class index 2 (represented as one-hot [0,0,1]).
target = torch.tensor([0, 0, 1], dtype=torch.float32)

# Forward pass: pass the input through the model to get predictions (size 3).
pred = model(input_vector)

# Compute the loss between predicted vector and target vector.
output = loss(pred, target)

# Print results
print("Prediction: ", pred)   # The raw output from the linear model (logits, not probabilities).
print("Output (Loss): ", output)  # The scalar MSE loss value.

Prediction:  tensor([ 0.4646, -1.0671, -0.2893], grad_fn=<ViewBackward0>)
Output (Loss):  tensor(1.0056, grad_fn=<MseLossBackward0>)
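
As a sanity check, the loss can be recomputed by hand. Note that `nn.MSELoss` averages the squared differences by default (`reduction="mean"`), whereas the formula above sums them; passing `reduction="sum"` gives the summed version. The prediction values below are copied from the printed output, so the result only matches it up to rounding.

```python
import torch
import torch.nn as nn

# Values copied from the printed prediction above (rounded to 4 decimals)
pred = torch.tensor([0.4646, -1.0671, -0.2893])
target = torch.tensor([0.0, 0.0, 1.0])

# Manual computation of the squared differences
squared_diff = (pred - target) ** 2
manual_mean = squared_diff.mean()   # average, the nn.MSELoss default
manual_sum = squared_diff.sum()     # plain sum, as in the formula

# The same quantities computed by PyTorch
mse_mean = nn.MSELoss()(pred, target)
mse_sum = nn.MSELoss(reduction="sum")(pred, target)

print(manual_mean, mse_mean)  # both approximately 1.0056
print(manual_sum, mse_sum)    # three times the mean, since there are 3 outputs
```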


#### 2. Optimization and training process 
Optimization is the process of finding the weight matrix $W$ that minimizes the loss function. In other words, it is the process of selecting the individual weights $w_i$ so that the classifier's prediction $y$ is as close as possible to the point's real label $t$.


Now, we can describe the training algorithm in its entirety:

Given a set of training examples $x_i$ with their labels $t_i$, we need to:

 - Initialize the classifier $f(x_i,\mathbf{W})$ with random weights $W$.
 - Feed a training example into the classifier and get the output $y_i$.
 - Compute the loss between the prediction $y_i$ and the label $t_i$.
 - Adjust the weights $W$ according to the loss $C_i$ .
 - Repeat for all training examples.
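
The steps above can be sketched as a short training loop. This is only one possible realization: the text does not say how the weights are adjusted, so stochastic gradient descent via `torch.optim.SGD` is an assumption here, as are the learning rate, the number of passes over the data, and the synthetic dataset (points labeled by a made-up linear rule).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the sketch reproducible

# Hypothetical training data: 100 points in 2D,
# with target 1.0 if x_1 + x_2 > 0 and 0.0 otherwise
X = torch.randn(100, 2)
T = (X.sum(dim=1, keepdim=True) > 0).float()

model = nn.Linear(2, 1)   # initialized with random weights
loss_fn = nn.MSELoss()
# Assumed optimizer and learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with torch.no_grad():
    initial_loss = loss_fn(model(X), T).item()

for epoch in range(20):                # assumed number of passes
    for x_i, t_i in zip(X, T):         # repeat for all training examples
        y_i = model(x_i)               # feed an example, get the output
        loss = loss_fn(y_i, t_i)       # loss between prediction and label
        optimizer.zero_grad()          # clear gradients from the last step
        loss.backward()                # compute gradients of the loss
        optimizer.step()               # adjust the weights

with torch.no_grad():
    final_loss = loss_fn(model(X), T).item()

print(initial_loss, final_loss)
```

Comparing the loss before and after training is a quick way to confirm that the weight updates are moving in the right direction.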
   
This is the core idea behind training deep learning models. In the end, we will have a trained classifier that can generalize to previously unseen examples.