# **Training Pipeline**

- An automated **"aseembly line"** that chains all your preprocessing steps(Scaling, Encoding) and your model into single Object.
- It guarantees you **never accidentally fit on your test data (prevents Data Leakage).**
- You can call pipe.fit() once instead of running 5 different steps manually.
- Always recommended in professional code to keep things clean, reproducible, and error-free.

In [None]:
import numpy as np
import pandas as pd
import torch

# Data Preprocessing Libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

In [None]:
# Using Breast Cancer Detection Dataset

df = pd.read_csv('https://raw.githubusercontent.com/gscdit/Breast-Cancer-Detection/refs/heads/master/data.csv')
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [5]:
df.shape

(569, 33)

In [7]:
df.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)

In [8]:
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### **Train Test Split**

- Dividing your single dataset into two separate piles: one for **studying ("Train")** and one for the **final exam ("Test").**
- train_test_split(X, y, test_size=0.2)
    , X = Features (Questions)
    , y = Target (Answers)
- To prevent Overfitting. **If the model sees the test questions during study time, it memorizes them instead of learning the logic.**
- The very first step. **Always split before you scale or fix missing values to avoid "Data Leakage."**

In [None]:
# Splitting the dataset into features and target variable
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0], test_size=0.2)

### **Scaling (Feature Scaling)**

- **Adjusting the range of data numbers** so they are all roughly the same size (e.g., converting "Salary: 100,000" and "Age: 30" to comparable values like 1.2 and 0.5).
- How: 
    , fit_transform(X_train): Calculate stats and scale the training data.
    , transform(X_test): Scale the test data using the training stats.
- Prevents the model from thinking big numbers (Salary) are more important than small numbers (Age).
- Helps the math (Gradient Descent) run faster.
- When features have different units (e.g., Kilograms vs. Meters) and you are using algorithms that calculate distance (KNN, SVM, Neural Networks).

In [None]:
# scalar object
scaler = StandardScaler()

# 
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [11]:
X_train

array([[-0.7782224 , -0.13733042, -0.77630652, ..., -0.42354413,
         0.28679742, -0.92885125],
       [-0.11505735, -0.73724257, -0.16050763, ..., -0.66161453,
         0.72762892, -0.38593406],
       [ 0.47473624, -0.30738426,  0.48305938, ...,  1.03978502,
         1.51375048,  1.25760282],
       ...,
       [ 0.25462189, -0.86950667,  0.23192124, ...,  0.37992431,
         0.10912389,  0.00498997],
       [-0.50166847, -1.60640665, -0.52761851, ..., -0.29356229,
        -0.10374911, -0.9359482 ],
       [-0.70485095, -1.500123  , -0.71178648, ..., -0.43778243,
        -0.19593821,  0.23800671]], shape=(455, 30))

In [12]:
y_train

515    B
374    B
11     M
419    B
87     M
      ..
153    B
514    M
227    B
170    B
98     B
Name: diagnosis, Length: 455, dtype: object

### **Label Encoding**

- Converting text labels (Categories) into integers (Numbers).
    Example: "Cat" → 0, "Dog" → 1, "Bird" → 2.
- **Computers can only do math on numbers**. They cannot multiply or subtract words like "Cat.
- Typically used for the Target variable (y) when it is text-based (e.g., predicting "Yes/No").

In [13]:
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)


In [14]:
y_train

array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1,

### Numpy Arrays To PyTorch Tensors

In [20]:
# Converting numpy arrays to torch tensors

X_train_tensor = torch.from_numpy(X_train)
X_test_tensor = torch.from_numpy(X_test)

y_train_tensor = torch.from_numpy(y_train)
y_test_tensor = torch.from_numpy(y_test)

In [21]:
X_train_tensor.shape
X_train_tensor

tensor([[-0.7782, -0.1373, -0.7763,  ..., -0.4235,  0.2868, -0.9289],
        [-0.1151, -0.7372, -0.1605,  ..., -0.6616,  0.7276, -0.3859],
        [ 0.4747, -0.3074,  0.4831,  ...,  1.0398,  1.5138,  1.2576],
        ...,
        [ 0.2546, -0.8695,  0.2319,  ...,  0.3799,  0.1091,  0.0050],
        [-0.5017, -1.6064, -0.5276,  ..., -0.2936, -0.1037, -0.9359],
        [-0.7049, -1.5001, -0.7118,  ..., -0.4378, -0.1959,  0.2380]],
       dtype=torch.float64)

In [18]:
y_train_tensor.shape

torch.Size([455])

### **Defining the Model**

In [None]:
class BreastCancerNeuralNetwork():

    # Initializing weights and bias
    def __init__(self, X):

        # create weight for every feature (column) in X
        # requires_grad => This tells PyTorch: "Watch these variables! If we make a mistake later, 
        # track back to these numbers so we can adjust them." This enables the magic of backpropagation.
        self.weights = torch.rand(X.shape[1], 1, requires_grad=True, dtype=torch.float64)

        # Create a single bias value, staring at 0
        self.bias = torch.zeros(1, requires_grad=True, dtype=torch.float64)

    def forward(self, X):
        z = torch.matmul(X, self.weights) + self.bias
        y_pred = torch.sigmoid(z)
        return y_pred
    
    def loss_function(self, y_pred, y_true):
        epsilon = 1e-7  # small constant to avoid log(0)
        y_pred = torch.clamp(y_pred, epsilon, 1 - epsilon)

        loss = -torch.mean(y_true * torch.log(y_pred) + (1 - y_true) * torch.log(1 - y_pred))
        return loss

### Important Parameters

In [25]:
learning_rate = 0.1
epochs = 25

### **Training Pipeline**

In [27]:
# Crteating model 
model = BreastCancerNeuralNetwork(X_train_tensor)

# define Loop
for epochs in range(epochs):

    # forward pass
    y_pred = model.forward(X_train_tensor)

    # loss calculation
    loss = model.loss_function(y_pred, y_train_tensor)

    # backward pass
    loss.backward()

    # parameter update
    with torch.no_grad():
        model.weights -= learning_rate * model.weights.grad
        model.bias -= learning_rate * model.bias.grad

        # zero the gradients after updating
        model.weights.grad.zero_()
        model.bias.grad.zero_()

        # print loss every each epoch
        print(f'Epoch {epochs+1}, Loss: {loss.item()}')


Epoch 1, Loss: 3.5289330566791692
Epoch 2, Loss: 3.400511212599101
Epoch 3, Loss: 3.268516266945045
Epoch 4, Loss: 3.1323469581678527
Epoch 5, Loss: 2.9927901579191687
Epoch 6, Loss: 2.851043137117016
Epoch 7, Loss: 2.7029609283714207
Epoch 8, Loss: 2.554983847980423
Epoch 9, Loss: 2.4026633470635077
Epoch 10, Loss: 2.2494166649965517
Epoch 11, Loss: 2.099653154900244
Epoch 12, Loss: 1.9534812345953967
Epoch 13, Loss: 1.813156238967321
Epoch 14, Loss: 1.6749859201370951
Epoch 15, Loss: 1.5459662770667604
Epoch 16, Loss: 1.424960643495217
Epoch 17, Loss: 1.3140122375897374
Epoch 18, Loss: 1.2175114671229146
Epoch 19, Loss: 1.1352769288625246
Epoch 20, Loss: 1.0667533138418026
Epoch 21, Loss: 1.010936168144072
Epoch 22, Loss: 0.9663389991551633
Epoch 23, Loss: 0.9311128862101402
Epoch 24, Loss: 0.9033031570088161


In [31]:
model.bias

tensor([-0.1404], dtype=torch.float64, requires_grad=True)

### Evaluation

In [30]:
# model evaluation on test data
with torch.no_grad():
    y_pred = model.forward(X_test_tensor)
    y_pred = (y_pred >= 0.9).float()
    accuracy = (y_pred.squeeze() == y_test_tensor).float().mean()
    print(f'Accuracy on test data: {accuracy.item()}')

Accuracy on test data: 0.6140350699424744
