# Unit Testing

## What it is: Unit testing means writing tests for small pieces of code, usually single functions or methods, to check that each one behaves as expected in isolation. In ML projects, unit tests validate individual components such as data-processing functions, model-training functions, and utility functions.

In [4]:
#Unit Test for Data Preprocessing
import numpy as np
import pandas as pd

def fill_missing_values(df):
    return df.fillna(df.median())

def test_fill_missing_values():
    data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]}
    df = pd.DataFrame(data)
    filled_df = fill_missing_values(df)
    
    assert filled_df.isnull().sum().sum() == 0, "There should be no missing values"
    assert filled_df.loc[2, 'A'] == 2, "The missing value in column A should be filled with median (2)"
    assert filled_df.loc[1, 'B'] == 6.5, "The missing value in column B should be filled with median (6.5)"

# Run the test
try:
    test_fill_missing_values()
    print("test_fill_missing_values: PASSED")
except AssertionError as e:
    print(f"test_fill_missing_values: FAILED - {e}")

test_fill_missing_values: PASSED


In [5]:
#Unit Test for Feature Engineering
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

def generate_polynomial_features(data, degree):
    poly = PolynomialFeatures(degree)
    return poly.fit_transform(data)

def test_generate_polynomial_features():
    data = np.array([[2, 3], [3, 4]])
    poly_data = generate_polynomial_features(data, degree=2)
    
    expected_shape = (2, 6)  # 2 samples and 6 features (including bias term)
    assert poly_data.shape == expected_shape, f"Expected shape {expected_shape}, but got {poly_data.shape}"
    assert np.allclose(poly_data[0], [1, 2, 3, 4, 6, 9]), "Generated polynomial features are incorrect"

# Run the test
try:
    test_generate_polynomial_features()
    print("test_generate_polynomial_features: PASSED")
except AssertionError as e:
    print(f"test_generate_polynomial_features: FAILED - {e}")

test_generate_polynomial_features: PASSED


In [6]:
#Unit Test for Model Training
from sklearn.linear_model import LinearRegression
import numpy as np

def train_linear_model(X, y):
    model = LinearRegression()
    model.fit(X, y)
    return model

def test_train_linear_model():
    X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
    y = np.array([2, 3, 4, 5])
    model = train_linear_model(X, y)
    
    assert not np.all(model.coef_ == 0), "Model coefficients should not all be zero after training"
    assert model.intercept_ != 0, "Model intercept should not be zero after training"

# Run the test
try:
    test_train_linear_model()
    print("test_train_linear_model: PASSED")
except AssertionError as e:
    print(f"test_train_linear_model: FAILED - {e}")

test_train_linear_model: PASSED


In [7]:
#Unit Test for Model Prediction
def make_predictions(model, X):
    return model.predict(X)

def test_make_predictions():
    X_train = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
    y_train = np.array([2, 3, 4, 5])
    model = train_linear_model(X_train, y_train)
    
    X_test = np.array([[5, 6]])
    prediction = make_predictions(model, X_test)
    
    assert np.isclose(prediction, [6]), f"Expected prediction [6], but got {prediction}"

# Run the test
try:
    test_make_predictions()
    print("test_make_predictions: PASSED")
except AssertionError as e:
    print(f"test_make_predictions: FAILED - {e}")

test_make_predictions: PASSED


In [8]:
#Unit Test for Evaluation Metrics
from sklearn.metrics import mean_absolute_error
import numpy as np

def calculate_mae(y_true, y_pred):
    return mean_absolute_error(y_true, y_pred)

def test_calculate_mae():
    y_true = np.array([2, 3, 4, 5])
    y_pred = np.array([2.1, 2.9, 4.2, 4.8])
    mae = calculate_mae(y_true, y_pred)
    
    assert np.isclose(mae, 0.15), f"Expected MAE 0.15, but got {mae}"

# Run the test
try:
    test_calculate_mae()
    print("test_calculate_mae: PASSED")
except AssertionError as e:
    print(f"test_calculate_mae: FAILED - {e}")


test_calculate_mae: PASSED


In [9]:
#Unit Test for Neural Network Initialization
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

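def test_nn_initialization():
    model = SimpleNN()
    for param in model.parameters():
        # NOTE: nn.Linear draws its initial values at random, so the sample mean is
        # only approximately zero; this exact comparison is expected to fail.
        assert param.data.mean().item() == 0, "Parameters should be initialized with mean 0"
        # NOTE: fc2.bias has a single element, for which std() is NaN.
        assert param.data.std().item() > 0, "Parameters should have a non-zero standard deviation"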

# Run the test
try:
    test_nn_initialization()
    print("test_nn_initialization: PASSED")
except AssertionError as e:
    print(f"test_nn_initialization: FAILED - {e}")

test_nn_initialization: FAILED - Parameters should be initialized with mean 0
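
The exact-zero mean assertion above fails, as the recorded output shows, because randomly initialized parameters only have a mean near zero. A more robust check tests properties that hold for every draw. The sketch below assumes PyTorch's default `nn.Linear` initialization, which draws weights and biases uniformly within ±1/sqrt(in_features); the function name `test_nn_initialization_bounds` is illustrative and not part of the original notebook.

```python
import math

import torch
import torch.nn as nn


class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


def test_nn_initialization_bounds():
    model = SimpleNN()
    for module in model.modules():
        if not isinstance(module, nn.Linear):
            continue
        # Assumed default: uniform init within +/- 1/sqrt(in_features).
        bound = 1.0 / math.sqrt(module.in_features)
        for param in (module.weight, module.bias):
            assert torch.isfinite(param).all(), "Parameters should be finite"
            assert param.abs().max().item() <= bound + 1e-6, "Parameter outside init bound"
        # Weights should vary rather than sit at one constant value.
        assert module.weight.std().item() > 0, "Weights should vary"


test_nn_initialization_bounds()
init_check_passed = True
```

If a project uses a custom initializer, the bound would need to change accordingly; the point is to assert a property that does not depend on the random seed.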


In [10]:
#Unit Test for Data Splitting
from sklearn.model_selection import train_test_split
import numpy as np

def split_data(X, y, test_size=0.2, random_state=42):
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

def test_split_data():
    X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
    y = np.array([1, 2, 3, 4, 5])
    X_train, X_test, y_train, y_test = split_data(X, y)

    assert len(X_train) == 4, "Training set should contain 80% of data"
    assert len(X_test) == 1, "Testing set should contain 20% of data"
    assert len(y_train) == 4, "y_train should contain 80% of labels"
    assert len(y_test) == 1, "y_test should contain 20% of labels"

# Run the test
try:
    test_split_data()
    print("test_split_data: PASSED")
except AssertionError as e:
    print(f"test_split_data: FAILED - {e}")

test_split_data: PASSED


In [11]:
#Unit Test for One-Hot Encoding
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

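def one_hot_encode(df, columns):
    # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
    encoder = OneHotEncoder(sparse_output=False)
    encoded_columns = encoder.fit_transform(df[columns])
    # Reuse df's index so pd.concat aligns rows even when the index is not 0..n-1
    encoded_df = pd.DataFrame(encoded_columns, columns=encoder.get_feature_names_out(columns), index=df.index)
    return pd.concat([df.drop(columns, axis=1), encoded_df], axis=1)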

def test_one_hot_encode():
    data = {'color': ['red', 'blue', 'green'], 'value': [1, 2, 3]}
    df = pd.DataFrame(data)
    encoded_df = one_hot_encode(df, ['color'])

    expected_columns = ['value', 'color_blue', 'color_green', 'color_red']
    assert all(col in encoded_df.columns for col in expected_columns), "Encoded columns are missing"
    assert encoded_df.shape[1] == 4, f"Expected 4 columns, but got {encoded_df.shape[1]}"

# Run the test
try:
    test_one_hot_encode()
    print("test_one_hot_encode: PASSED")
except AssertionError as e:
    print(f"test_one_hot_encode: FAILED - {e}")

test_one_hot_encode: PASSED




In [12]:
#Unit Test for Model Saving and Loading
import joblib
from sklearn.linear_model import LogisticRegression
import numpy as np

def save_model(model, filename):
    joblib.dump(model, filename)

def load_model(filename):
    return joblib.load(filename)

def test_save_load_model():
    model = LogisticRegression()
    X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
    y = np.array([0, 1, 0, 1])
    model.fit(X, y)

    save_model(model, 'test_model.pkl')
    loaded_model = load_model('test_model.pkl')

    assert np.allclose(model.coef_, loaded_model.coef_), "Loaded model coefficients should match original"
    assert np.allclose(model.intercept_, loaded_model.intercept_), "Loaded model intercept should match original"

# Run the test
try:
    test_save_load_model()
    print("test_save_load_model: PASSED")
except AssertionError as e:
    print(f"test_save_load_model: FAILED - {e}")

test_save_load_model: PASSED


In [13]:
#Unit Test for Confusion Matrix Calculation
from sklearn.metrics import confusion_matrix
import numpy as np

def calculate_confusion_matrix(y_true, y_pred):
    return confusion_matrix(y_true, y_pred)

def test_calculate_confusion_matrix():
    y_true = np.array([1, 0, 1, 0, 1, 0])
    y_pred = np.array([1, 0, 0, 0, 1, 1])
    cm = calculate_confusion_matrix(y_true, y_pred)

    expected_cm = np.array([[2, 1], [1, 2]])
    assert np.array_equal(cm, expected_cm), f"Expected {expected_cm}, but got {cm}"

# Run the test
try:
    test_calculate_confusion_matrix()
    print("test_calculate_confusion_matrix: PASSED")
except AssertionError as e:
    print(f"test_calculate_confusion_matrix: FAILED - {e}")


test_calculate_confusion_matrix: PASSED


In [14]:
#Unit Test for Data Normalization
from sklearn.preprocessing import MinMaxScaler
import numpy as np

def normalize_data(X):
    scaler = MinMaxScaler()
    return scaler.fit_transform(X)

def test_normalize_data():
    X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
    normalized_X = normalize_data(X)

    assert np.allclose(normalized_X.min(), 0), "Normalized data should have a minimum of 0"
    assert np.allclose(normalized_X.max(), 1), "Normalized data should have a maximum of 1"

# Run the test
try:
    test_normalize_data()
    print("test_normalize_data: PASSED")
except AssertionError as e:
    print(f"test_normalize_data: FAILED - {e}")

test_normalize_data: PASSED


In [15]:
#Unit Test for Feature Scaling
from sklearn.preprocessing import StandardScaler
import numpy as np

def standardize_data(X):
    scaler = StandardScaler()
    return scaler.fit_transform(X)

def test_standardize_data():
    X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
    standardized_X = standardize_data(X)

    assert np.isclose(standardized_X.mean(), 0), "Standardized data should have a mean of 0"
    assert np.isclose(standardized_X.std(), 1), "Standardized data should have a standard deviation of 1"

# Run the test
try:
    test_standardize_data()
    print("test_standardize_data: PASSED")
except AssertionError as e:
    print(f"test_standardize_data: FAILED - {e}")

test_standardize_data: PASSED
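
The try/except runner repeated after each test above only catches `AssertionError` and prints one line. A test runner collects all tests, also reports unexpected exceptions, and gives an overall status. Below is a minimal sketch with the standard-library `unittest` module. To keep it dependency-free it tests `fill_missing_with_median`, a hypothetical pure-Python stand-in for the pandas-based `fill_missing_values`.

```python
import math
import statistics
import unittest


def fill_missing_with_median(values):
    """Replace NaN entries with the median of the non-missing entries.

    Hypothetical pure-Python stand-in for the pandas-based fill_missing_values.
    """
    present = [v for v in values if not math.isnan(v)]
    median = statistics.median(present)
    return [median if math.isnan(v) else v for v in values]


class TestFillMissing(unittest.TestCase):
    def test_no_missing_values_remain(self):
        filled = fill_missing_with_median([1.0, 2.0, math.nan, 4.0])
        self.assertFalse(any(math.isnan(v) for v in filled))

    def test_median_is_used(self):
        # Median of [5, 8] is 6.5, so both gaps become 6.5.
        filled = fill_missing_with_median([5.0, math.nan, math.nan, 8.0])
        self.assertEqual(filled, [5.0, 6.5, 6.5, 8.0])


# Build and run the suite explicitly so this also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFillMissing)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

`result.wasSuccessful()` can then be checked programmatically, for example to fail a CI job.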


# Profiling

## What it is: Profiling in the context of ML/DL projects involves analyzing the performance of your code, identifying bottlenecks, and optimizing resource usage such as CPU, GPU, and memory. Profiling helps you understand which parts of your code are consuming the most time or memory and can help you make decisions on where to focus optimization efforts.

In [17]:
#Profiling with cProfile
#cProfile is a built-in Python module that provides a detailed report of function calls, including time spent in each function.

import cProfile
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def load_data():
    # Simulate loading data
    X = np.random.rand(10000, 20)
    y = np.random.randint(0, 2, 10000)
    return X, y

def train_model(X, y):
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)
    return model

def main():
    X, y = load_data()
    model = train_model(X, y)

# Profile the main function
cProfile.run('main()')

         201033 function calls (198325 primitive calls) in 44.744 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   44.727   44.727 1897314723.py:14(train_model)
        1    0.000    0.000   44.733   44.733 1897314723.py:19(main)
        1    0.000    0.000    0.006    0.006 1897314723.py:8(load_data)
     1153    0.001    0.000    0.008    0.000 <frozen abc>:117(__instancecheck__)
  618/616    0.001    0.000    0.003    0.000 <frozen abc>:121(__subclasscheck__)
      301    0.001    0.000    0.002    0.000 <frozen importlib._bootstrap>:405(parent)
        1    0.011    0.011   44.744   44.744 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 <string>:2(__eq__)
      301    0.000    0.000    0.000    0.000 _array_api.py:12(_check_array_api_dispatch)
      406    0.000    0.000    0.000    0.000 _array_api.py:170(_check_device_cpu)
     2026    0.006    0.000    0.007    0.000 _a
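
The report above is ordered by "standard name", which buries the expensive calls among hundreds of helpers. Sorting by cumulative time and limiting the number of rows is usually easier to read. The sketch below uses only `cProfile` and `pstats` from the standard library; `slow_step`, `fast_step`, and `pipeline` are illustrative stand-ins for the notebook's `load_data`, `train_model`, and `main`.

```python
import cProfile
import io
import pstats


def slow_step():
    # Deliberately heavier work so it dominates the profile.
    return sum(i * i for i in range(200_000))


def fast_step():
    return sum(range(1_000))


def pipeline():
    fast_step()
    slow_step()


profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Sort by cumulative time and print only the 5 most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The same `sort_stats("cumulative")` call can be applied to a profile of `main()` to surface `train_model` and the calls beneath it first.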

In [21]:
#Memory Profiling with memory_profiler
#Memory profiling helps you understand how much memory is being used by different parts of your code, which is especially important in DL projects where large models and datasets are involved.

from memory_profiler import profile
import numpy as np
import pandas as pd

@profile
def data_preprocessing():
    # Simulate a large dataset
    df = pd.DataFrame(np.random.rand(1000000, 10), columns=[f'col_{i}' for i in range(10)])
    
    # Preprocessing: Filling missing values
    df.fillna(df.mean(), inplace=True)
    
    # More preprocessing steps...
    df['col_sum'] = df.sum(axis=1)
    
    return df

# Run the function to profile
data_preprocessing()


ERROR: Could not find file C:\Users\Abhishek_Jaiswal\AppData\Local\Temp\ipykernel_32472\4197188316.py


Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,col_sum
0,0.239438,0.527816,0.473308,0.769237,0.072831,0.581874,0.326494,0.891716,0.205090,0.124340,4.212145
1,0.246050,0.054724,0.506426,0.020357,0.089198,0.137109,0.393631,0.900328,0.108786,0.569204,3.025811
2,0.843365,0.038258,0.627765,0.445383,0.559283,0.912020,0.836728,0.756270,0.656609,0.977788,6.653469
3,0.301246,0.194494,0.909902,0.966589,0.584935,0.689754,0.260273,0.135848,0.384984,0.471336,4.899361
4,0.674663,0.206842,0.002493,0.160282,0.134196,0.177756,0.776780,0.560751,0.784129,0.259443,3.737334
...,...,...,...,...,...,...,...,...,...,...,...
999995,0.554016,0.063069,0.464241,0.482893,0.759817,0.194917,0.938976,0.485525,0.653090,0.497936,5.094481
999996,0.044191,0.576465,0.643956,0.644571,0.993955,0.250678,0.098439,0.858626,0.049913,0.784549,4.945345
999997,0.711333,0.835933,0.319238,0.858734,0.815850,0.603493,0.591752,0.498782,0.959497,0.772368,6.966978
999998,0.676477,0.677218,0.977506,0.723864,0.322024,0.987257,0.857471,0.901892,0.593534,0.919762,7.637005
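
In the run above, `@profile` reported that it could not find the source file, apparently because the function was defined in a notebook cell rather than in a file on disk. One alternative that does not need the source file is the standard-library `tracemalloc` module, which reports current and peak memory allocated through Python's allocator. This is a swapped-in technique, not `memory_profiler` itself, and `build_table` is a hypothetical stand-in for the preprocessing step.

```python
import tracemalloc


def build_table(n_rows, n_cols):
    # Hypothetical stand-in for a preprocessing step that allocates a lot.
    table = [[float(r * c) for c in range(n_cols)] for r in range(n_rows)]
    row_sums = [sum(row) for row in table]
    return row_sums


tracemalloc.start()
row_sums = build_table(20_000, 10)
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Current: {current_bytes / 1024**2:.2f} MiB")
print(f"Peak:    {peak_bytes / 1024**2:.2f} MiB")
```

The peak is noticeably higher than the current value here because the intermediate `table` is freed when the function returns, which is the kind of transient allocation a line-by-line memory report is meant to expose.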


In [None]:
#GPU Profiling with torch.cuda
#In deep learning, GPU profiling is crucial for understanding GPU utilization, memory consumption, and the time spent on different operations. PyTorch provides some basic tools for this.

import torch
import torch.nn as nn
import torch.optim as optim
import time

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(1024, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

def train_model():
    model = SimpleModel().cuda()
    optimizer = optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    
    # Simulate input data
    inputs = torch.randn(64, 1024).cuda()
    labels = torch.randint(0, 10, (64,)).cuda()
    
    # Training loop
    # CUDA kernels run asynchronously, so synchronize before reading the clock
    torch.cuda.synchronize()
    start_time = time.time()
    for _ in range(100):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    
    # Wait for all queued GPU work to finish before stopping the timer
    torch.cuda.synchronize()
    end_time = time.time()
    print(f"Training time: {end_time - start_time:.2f} seconds")

# Run the function
train_model()

# GPU memory profiling
print(f"Max GPU memory allocated: {torch.cuda.max_memory_allocated() / 1024**2:.2f} MiB")
print(f"Max GPU memory reserved: {torch.cuda.max_memory_reserved() / 1024**2:.2f} MiB")

In [23]:
#Time Profiling with timeit
#timeit is a simple way to profile the execution time of small code snippets.

import timeit

def example_function():
    sum(range(1000000))

# Time profiling
execution_time = timeit.timeit('example_function()', globals=globals(), number=100)
print(f"Average execution time over 100 runs: {execution_time / 100:.5f} seconds")

Average execution time over 100 runs: 0.11175 seconds
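
A single average can be inflated by whatever else the machine was doing during the run. `timeit.repeat` takes several independent measurements, and the minimum is commonly used as the least-disturbed estimate. A minimal sketch, with a smaller workload than the cell above so it finishes quickly:

```python
import timeit


def example_function():
    return sum(range(100_000))


# 5 independent measurements, each timing 20 calls.
timings = timeit.repeat(example_function, repeat=5, number=20)
per_call = [t / 20 for t in timings]

print(f"Best per-call time:  {min(per_call):.6f} s")
print(f"Worst per-call time: {max(per_call):.6f} s")
```

A large gap between the best and worst values suggests the measurement was noisy and should be repeated.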


In [26]:
#Profiling with line_profiler
#line_profiler provides more granular control by allowing you to profile individual functions line by line.

from line_profiler import LineProfiler
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(1024, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

def profile_model():
    model = SimpleNN()
    inputs = torch.randn(64, 1024)
    outputs = model(inputs)

# Set up the profiler
lp = LineProfiler()
lp.add_function(SimpleNN.forward)
lp_wrapper = lp(profile_model)

# Run the profiler
lp_wrapper()
lp.print_stats()

Timer unit: 1e-07 s

Total time: 0.0509071 s

Could not find file C:\Users\Abhishek_Jaiswal\AppData\Local\Temp\ipykernel_32472\173877596.py
Are you sure you are running this program from the same directory
that you ran the profiler from?
Continuing without the function's contents.

Line #      Hits         Time  Per Hit   % Time  Line Contents
    15                                           
    16         1     492885.0 492885.0     96.8  
    17         1       8818.0   8818.0      1.7  
    18         1       7368.0   7368.0      1.4  

Total time: 0.0870728 s

Could not find file C:\Users\Abhishek_Jaiswal\AppData\Local\Temp\ipykernel_32472\173877596.py
Are you sure you are running this program from the same directory
that you ran the profiler from?
Continuing without the function's contents.

Line #      Hits         Time  Per Hit   % Time  Line Contents
    20                                           
    21         1     238933.0 238933.0     27.4  
    22         1     118900.

## Profiling is an essential part of optimizing ML/DL projects, ensuring that code runs efficiently and resources are used effectively. Tools like cProfile, memory_profiler, torch.cuda, timeit, and line_profiler provide different perspectives on performance, allowing you to identify and address bottlenecks in your code.

# Code Tuning

## What it is: Code tuning involves optimizing your code to improve performance, such as reducing execution time, memory usage, or improving model accuracy. In ML projects, this might include optimizing data pipelines, using more efficient algorithms, or tuning hyperparameters.

# For ML

In [37]:
#Before Tuning: Basic ML Model Training
#This example uses the load_breast_cancer dataset from sklearn.datasets. The code loads the dataset, splits it into training and test sets, trains a RandomForestClassifier, and evaluates the model.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9708


In [38]:
#After Tuning: Optimized ML Model Training

#The same pipeline with three changes. On this split the tuned model scores 0.9649, slightly below the 0.9708 baseline above: a 10-candidate random search with 3-fold cross-validation does not guarantee a better test score.

#Data handling: load_breast_cancer already returns NumPy arrays, so the np.array calls below are no-ops here; the conversion only matters when the data starts out as Python lists.
#Hyperparameter Tuning with RandomizedSearchCV: Sample candidate configurations with RandomizedSearchCV instead of relying on the default hyperparameters.
#Parallel Processing with n_jobs: Utilize multiple CPU cores by setting n_jobs=-1 in the classifier and cross-validation.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Ensure NumPy arrays (already the case for this dataset; relevant when inputs are Python lists)
X_train, X_test = np.array(X_train), np.array(X_test)
y_train, y_test = np.array(y_train), np.array(y_test)

# Hyperparameter tuning using RandomizedSearchCV
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

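# RandomForest with parallel processing (n_jobs=-1)
# Note: n_jobs=-1 on both the forest and the search can oversubscribe CPU cores; parallelizing one level is often enough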
model = RandomForestClassifier(random_state=42, n_jobs=-1)
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=3, random_state=42, n_jobs=-1)

# Train with RandomizedSearchCV
random_search.fit(X_train, y_train)

# Best model prediction and evaluation
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Optimized Accuracy: {accuracy:.4f}")
print(f"Best Parameters: {random_search.best_params_}")

Optimized Accuracy: 0.9649
Best Parameters: {'n_estimators': 50, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': 10, 'bootstrap': False}


# For DL

In [29]:
#Optimizing Data Loading and Preprocessing

#Before Tuning:
#Loading and preprocessing data can become a bottleneck, especially with large datasets. The code below loads data sequentially, which might be slow.

import pandas as pd

def load_and_preprocess_data(file_path):
    df = pd.read_csv(file_path)
    df['new_feature'] = df['feature1'] * df['feature2']
    df['normalized_feature'] = (df['feature1'] - df['feature1'].mean()) / df['feature1'].std()
    return df

#df = load_and_preprocess_data('large_dataset.csv')

In [31]:
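#After Tuning:
#Read the file in chunks and keep the per-row work vectorized. Reading in chunks bounds the memory used while parsing, although the concatenated result still has to fit in memory. Statistics such as the mean and standard deviation must come from the whole column, not from each chunk.
import pandas as pd

def load_and_preprocess_data(file_path, chunksize=100000):
    chunks = []
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        # Row-wise feature: safe to compute per chunk
        chunk['new_feature'] = chunk['feature1'] * chunk['feature2']
        chunks.append(chunk)
    df = pd.concat(chunks, axis=0)
    # Normalize with global statistics; per-chunk mean/std would not match the unchunked version
    df['normalized_feature'] = (df['feature1'] - df['feature1'].mean()) / df['feature1'].std()
    return df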

#df = load_and_preprocess_data('large_dataset.csv')
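
Normalizing with statistics computed per chunk gives different values than normalizing with the statistics of the whole column. When the full column cannot be held in memory, the global mean and standard deviation can be accumulated across chunks in a first pass and applied in a second. The sketch below is pure Python under that assumption; `iter_chunks` is a hypothetical stand-in for `pd.read_csv(..., chunksize=...)`.

```python
import math


def iter_chunks(values, chunksize):
    """Yield consecutive slices, standing in for pd.read_csv(..., chunksize=...)."""
    for start in range(0, len(values), chunksize):
        yield values[start:start + chunksize]


def global_mean_std(chunks):
    """First pass: accumulate count, sum and sum of squares across chunks."""
    n, total, total_sq = 0, 0.0, 0.0
    for chunk in chunks:
        n += len(chunk)
        total += sum(chunk)
        total_sq += sum(x * x for x in chunk)
    mean = total / n
    # Sample standard deviation (ddof=1), matching the pandas default.
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(variance)


feature1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
mean, std = global_mean_std(iter_chunks(feature1, chunksize=3))

# Second pass: normalize each chunk with the global statistics.
normalized = []
for chunk in iter_chunks(feature1, chunksize=3):
    normalized.extend((x - mean) / std for x in chunk)
```

The sum-of-squares formula is simple but can lose precision for very large values; an incremental method such as Welford's algorithm is a more robust choice in that case.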


In [33]:
#Optimizing Model Training with Mixed Precision

#Before Tuning:
#Training deep learning models with single precision (32-bit floating-point) is common, but can be resource-intensive.
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1024, 512).cuda()
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# `dataloader` is assumed to be a torch.utils.data.DataLoader defined elsewhere
for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()


In [None]:
#After Tuning:
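#Mixed precision training runs eligible operations in 16-bit floating point, which can speed up training and reduce memory usage on GPUs with hardware support, usually with little or no loss in accuracy. GradScaler scales the loss so that small gradients do not underflow in 16-bit.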
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(1024, 512).cuda()
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()

for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()


In [34]:
#Optimizing Model Architecture
#Before Tuning:
#Using a complex model architecture with many layers and parameters can lead to slow training and inference times.
import torch  # needed for torch.relu in forward()
import torch.nn as nn

class ComplexModel(nn.Module):
    def __init__(self):
        super(ComplexModel, self).__init__()
        self.fc1 = nn.Linear(1024, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 64)
        self.fc5 = nn.Linear(64, 32)
        self.fc6 = nn.Linear(32, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = torch.relu(self.fc4(x))
        x = torch.relu(self.fc5(x))
        return self.fc6(x)

In [None]:
#After Tuning:
#Simplify the model architecture by reducing the number of layers or parameters, or using techniques like parameter sharing or pruning.

import torch  # needed for torch.relu in forward()
import torch.nn as nn

class OptimizedModel(nn.Module):
    def __init__(self):
        super(OptimizedModel, self).__init__()
        self.fc1 = nn.Linear(1024, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
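
To put a number on the simplification, the parameter count of a stack of fully connected layers can be computed directly from the layer widths: each `nn.Linear(in, out)` has `in * out` weights plus `out` biases. The sketch below uses the widths from `ComplexModel` and `OptimizedModel`; the helper names are illustrative.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def count_params(layer_sizes):
    """Total parameters for a stack of linear layers given its layer widths."""
    return sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


complex_sizes = [1024, 512, 256, 128, 64, 32, 10]    # ComplexModel
optimized_sizes = [1024, 256, 128, 10]               # OptimizedModel

complex_total = count_params(complex_sizes)
optimized_total = count_params(optimized_sizes)

print(f"ComplexModel:   {complex_total:,} parameters")
print(f"OptimizedModel: {optimized_total:,} parameters")
print(f"Reduction:      {1 - optimized_total / complex_total:.1%}")
```

This gives 699,690 parameters for the six-layer model against 296,586 for the three-layer one. Most of the saving comes from narrowing the first layer, since it is by far the largest. Whether the smaller model keeps enough accuracy still has to be checked on validation data.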


In [None]:
#Optimizing Data Parallelism
#Before Tuning:
#Training on a single GPU can be slow, especially for large models and datasets.

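import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1024, 512).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# `dataloader` is assumed to be defined elsewhere
for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()  # clear gradients left over from the previous step
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()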


In [None]:
#After Tuning:
#Use data parallelism to distribute the training process across multiple GPUs.

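import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DataParallel

model = nn.Linear(1024, 512)
# DataParallel splits each batch across the visible GPUs; the PyTorch docs recommend
# DistributedDataParallel when multi-GPU performance matters
model = DataParallel(model)
model = model.cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# `dataloader` is assumed to be defined elsewhere
for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()  # clear gradients left over from the previous step
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()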


In [None]:
#Optimizing Hyperparameters with Grid Search or Random Search
#Before Tuning:
#Manually tuning hyperparameters is inefficient and can lead to suboptimal results.

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1024, 512)
optimizer = optim.Adam(model.parameters(), lr=0.001)


In [None]:
#After Tuning:
#Use grid search or random search to systematically explore the hyperparameter space and find the optimal settings.

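import numpy as np
import torch.nn as nn
from sklearn.model_selection import GridSearchCV
from skorch import NeuralNetClassifier

# Assumption: `model` is a torch module whose input size matches X_train.
# nn.Linear emits raw scores, so CrossEntropyLoss is used as the criterion.
net = NeuralNetClassifier(model, criterion=nn.CrossEntropyLoss)
params = {
    'lr': [0.001, 0.01, 0.1],
    'max_epochs': [10, 20, 30],
}

gs = GridSearchCV(net, params, refit=False, cv=3, scoring='accuracy')
# skorch expects float32 features and int64 labels
gs.fit(X_train.astype(np.float32), y_train.astype(np.int64))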

print(f"Best hyperparameters: {gs.best_params_}")
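
The cost of a grid search grows multiplicatively with the number of values per hyperparameter, which is the main reason to prefer random search on larger spaces. The sketch below enumerates the grid defined above by hand with `itertools.product` to show how many model fits it implies; it does not train anything.

```python
from itertools import product

params = {
    "lr": [0.001, 0.01, 0.1],
    "max_epochs": [10, 20, 30],
}
cv_folds = 3

# Every combination of the listed values is one candidate configuration.
names = list(params)
candidates = [dict(zip(names, values)) for values in product(*params.values())]

total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

Adding a third hyperparameter with four values would raise this from 27 to 108 fits, so it is worth estimating the budget before launching the search.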


In [None]:
#Optimizing I/O Operations

#Before Tuning:
#I/O operations, such as loading and saving models or checkpoints, can become a bottleneck if not optimized.

import torch

# Saving the model
torch.save(model.state_dict(), 'model.pth')

# Loading the model
model.load_state_dict(torch.load('model.pth'))


In [None]:
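#After Tuning:
#Save only what is needed and pick a format that suits how the model will be loaded. Note that _use_new_zipfile_serialization=False only selects the older, non-zip file format; it does not compress anything, so it is omitted here.
#TorchScript stores the model structure together with the weights, so the file can be loaded without the Python class definition.

import torch

# Saving only the weights (state_dict)
torch.save(model.state_dict(), 'model.pth')

# Loading the weights; map_location lets a checkpoint saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load('model.pth', map_location='cpu'))

# Alternatively, export with TorchScript for a self-contained file
scripted_model = torch.jit.script(model)
torch.jit.save(scripted_model, 'scripted_model.pt')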


In [36]:
#Optimizing Batch Size
#Before Tuning:
#Using a default or arbitrarily chosen batch size might not be optimal for training performance.

for inputs, labels in dataloader:
    optimizer.zero_grad()  # reset gradients accumulated by the previous step
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()


NameError: name 'dataloader' is not defined

In [None]:
#After Tuning:
#Tune the batch size to maximize GPU utilization without causing memory overflow.

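from torch.utils.data import DataLoader

# `dataset`, `model`, `criterion` and `optimizer` are assumed to be defined elsewhere
batch_size = 64  # Increase or decrease based on profiling results
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

for inputs, labels in dataloader:
    optimizer.zero_grad()  # reset gradients accumulated by the previous step
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()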


In [None]:
#Optimizing with Lazy Loading
#Before Tuning:
#Loading all data at once can be memory-intensive and slow.

# load_all_data and preprocess_data are placeholders for project-specific functions
data = load_all_data()
preprocessed_data = preprocess_data(data)


In [None]:
#After Tuning:
#Use lazy loading to load data in chunks or only when needed.

import pandas as pd

# preprocess_data is a placeholder for a project-specific function
def load_data_in_chunks(file_path, chunksize=10000):
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        preprocessed_chunk = preprocess_data(chunk)
        yield preprocessed_chunk

for chunk in load_data_in_chunks('large_dataset.csv'):
    # Process each chunk
    pass
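
The same lazy pattern can be tried without pandas or a real file. The sketch below reads a CSV in fixed-size chunks with the standard-library `csv` module and keeps only a running aggregate, so memory use is bounded by the chunk size. The in-memory `sample` data and the helper name `read_rows_in_chunks` are illustrative assumptions.

```python
import csv
import io
from itertools import islice


def read_rows_in_chunks(file_obj, chunksize):
    """Yield lists of at most `chunksize` parsed rows; nothing is read until iterated."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


# In-memory stand-in for a large CSV file (hypothetical column names).
sample = io.StringIO("feature1,feature2\n1,2\n3,4\n5,6\n7,8\n9,10\n")

chunk_sizes = []
running_total = 0.0
for chunk in read_rows_in_chunks(sample, chunksize=2):
    chunk_sizes.append(len(chunk))
    running_total += sum(float(row["feature1"]) * float(row["feature2"]) for row in chunk)
```

Because the generator only advances when the loop asks for the next chunk, processing can stop early without having read the rest of the file.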


# Integration testing

## Integration testing in machine learning (ML) projects involves testing the interaction between different components of the system to ensure they work together as expected. This could include testing data preprocessing, model training, and evaluation steps together as a pipeline. Here’s an example using a complete ML pipeline.

In [41]:
#Scenario: Integration Testing for an ML Pipeline

#Let's consider an ML pipeline that includes the following steps:
#Data Preprocessing: Scaling and transforming the input features.
#Model Training: Training a classifier.
#Model Evaluation: Evaluating the trained model on a test set.

#We'll write an integration test to ensure that these components work together properly.


In [42]:
import numpy as np
import unittest
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Define the ML pipeline
def create_ml_pipeline():
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  # Data Preprocessing
        ('classifier', RandomForestClassifier(random_state=42))  # Model Training
    ])
    return pipeline

class TestMLPipelineIntegration(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Load dataset
        iris = load_iris()
        cls.X_train, cls.X_test, cls.y_train, cls.y_test = train_test_split(
            iris.data, iris.target, test_size=0.3, random_state=42
        )
        # Create pipeline
        cls.pipeline = create_ml_pipeline()

    def test_pipeline_training(self):
        # Train the pipeline
        self.pipeline.fit(self.X_train, self.y_train)
        self.assertTrue(hasattr(self.pipeline.named_steps['classifier'], 'classes_'), 
                        "Model training failed: 'classes_' attribute not found.")

    def test_pipeline_prediction(self):
        # Train and predict
        self.pipeline.fit(self.X_train, self.y_train)
        y_pred = self.pipeline.predict(self.X_test)
        accuracy = accuracy_score(self.y_test, y_pred)
        
        # Assert the accuracy is above a threshold
        self.assertGreaterEqual(accuracy, 0.9, "Model accuracy is below 0.9")

    def test_pipeline_inference(self):
        # Train the pipeline
        self.pipeline.fit(self.X_train, self.y_train)
        
        # Test inference on a new sample
        new_sample = [[5.1, 3.5, 1.4, 0.2]]  # A sample from the Iris dataset
        prediction = self.pipeline.predict(new_sample)
        
        # Check the raw prediction type before converting it; converting first
        # would make the type check pass trivially
        self.assertIsInstance(prediction[0], (int, np.integer), "Prediction is not an integer class label.")
        
        # Assert the prediction is a valid class label
        self.assertIn(int(prediction[0]), [0, 1, 2], "Prediction is not a valid class label.")

if __name__ == '__main__':
    unittest.main(argv=[''], exit=False)

...
----------------------------------------------------------------------
Ran 3 tests in 1.275s

OK


A concise overview of unit testing, integration testing, profiling, and code tuning in the context of machine learning, intended for interview preparation:

### **1. Unit Testing in ML/DL:**
- **Definition:** Unit testing involves testing individual components or functions of a codebase to ensure they work correctly in isolation.
- **Purpose:** To validate the correctness of each component (e.g., a specific data preprocessing function, a model training method).
- **Example:** Testing a function that scales data or a model's predict method to ensure it returns outputs of the expected shape and type.
- **Tools:** Python’s `unittest`, `pytest`.

### **2. Integration Testing in ML/DL:**
- **Definition:** Integration testing checks the interaction between different components of the system to ensure they work together as intended.
- **Purpose:** To validate that the end-to-end ML pipeline (e.g., data preprocessing, model training, and evaluation) functions correctly when integrated.
- **Example:** Testing an entire ML pipeline where data is scaled, a model is trained, and predictions are made to ensure the workflow is seamless.
- **Tools:** Python’s `unittest`, `pytest`, with an emphasis on end-to-end tests.

### **3. Profiling in ML/DL:**
- **Definition:** Profiling is the process of measuring the performance of your code, identifying bottlenecks, and understanding resource usage (CPU, GPU, memory).
- **Purpose:** To optimize the performance of ML models and preprocessing steps by identifying slow or resource-intensive parts of the code.
- **Example:** Using a profiler like `cProfile` to measure the execution time of different parts of a training loop or feature engineering process.
- **Tools:** `cProfile`, `line_profiler`, `memory_profiler`, `timeit`, and PyTorch's `torch.cuda` memory counters.

### **4. Code Tuning in ML/DL:**
- **Definition:** Code tuning refers to optimizing code for better performance, often by improving computational efficiency or reducing memory usage.
- **Purpose:** To enhance the speed of ML model training/inference and reduce resource consumption on both CPUs and GPUs.
- **Example:** Optimizing the hyperparameters of a model, using efficient data structures (like NumPy arrays), and parallelizing computations.
- **Techniques:** Vectorization, parallel processing (e.g., setting `n_jobs=-1`), efficient data handling.

This summary provides a foundational understanding of key testing and optimization concepts relevant to machine learning, crucial for interview discussions.