Scikit-Learn Overview

Scikit-Learn is a powerful Python library for machine learning that provides tools for:

Supervised Learning (Regression & Classification)

Unsupervised Learning (Clustering, Dimensionality Reduction)

Model Selection & Hyperparameter Tuning

Data Preprocessing & Feature Engineering

1. Data Preprocessing & Feature Engineering

Scikit-Learn provides several utilities for preparing data before feeding it into ML models.

In [None]:
#Handling Missing Values
from sklearn.impute import SimpleImputer
import numpy as np

data = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])
imputer = SimpleImputer(strategy="mean")  # Options: mean, median, most_frequent, constant
data_imputed = imputer.fit_transform(data)
print(data_imputed)

[[1.  2.  7.5]
 [4.  5.  6. ]
 [7.  8.  9. ]]
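
The imputer stores the column statistics it learned during `fit`, so the same fill values can be reused on data that arrives later. The sketch below illustrates that idea; the `new_rows` array is a hypothetical example and not part of the original notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as the example above
train = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])
# Hypothetical new row with gaps, seen only after fitting
new_rows = np.array([[np.nan, 3, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)  # learns one mean per column, ignoring NaN entries

# Per-column means learned from the training data
print(imputer.statistics_)

# Gaps in the new row are filled with the training means, not its own values
new_imputed = imputer.transform(new_rows)
print(new_imputed)
```

Fitting once and calling `transform` afterwards keeps the fill values tied to the training data, which is usually what is wanted when a separate test set is involved.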


In [None]:
#Standardization and Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()  # Standardizes each feature (mean=0, variance=1)
standardized_data = scaler.fit_transform(data_imputed)

min_max_scaler = MinMaxScaler()  # Rescales each feature to the range [0, 1]
scaled_data = min_max_scaler.fit_transform(data_imputed)
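
The cell above prints nothing, so a quick check of what each scaler does can help. This is a minimal sketch that assumes the imputed values shown earlier; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Values taken from the imputed array printed earlier
data = np.array([[1.0, 2.0, 7.5], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])

# StandardScaler works column by column: subtract the mean, divide by the std
standardized = StandardScaler().fit_transform(data)
col_means = standardized.mean(axis=0)
col_stds = standardized.std(axis=0)
print("means:", col_means, "stds:", col_stds)

# MinMaxScaler maps each column's minimum to 0 and its maximum to 1
rescaled = MinMaxScaler().fit_transform(data)
col_min = rescaled.min(axis=0)
col_max = rescaled.max(axis=0)
print("min:", col_min, "max:", col_max)
```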

In [5]:
#Encoding categorical data
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
categorical_data = np.array([["red"], ["blue"], ["green"]])
encoded_data = encoder.fit_transform(categorical_data).toarray()
print(encoded_data)

[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]
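
The column order in the output follows the encoder's sorted category list, which is why "red" maps to the last column. The sketch below inspects that list and shows one way to handle a value that was not seen during fitting; the "purple" input is a hypothetical example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["blue"], ["green"]])

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(colors)

# One array of categories per input column, sorted alphabetically
categories = encoder.categories_[0]
print(categories)

# Hypothetical value that was not present when the encoder was fitted
unseen = encoder.transform(np.array([["purple"]])).toarray()
print(unseen)
```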


2. Model Training & Evaluation

Scikit-Learn provides several ML models for regression, classification, and clustering.

In [6]:
#Splitting data
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
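
When the classes are imbalanced, a plain random split can leave the test set with very few examples of the rarer class. A minimal sketch of a stratified split follows; the 80/20 label mix is an assumption made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 5))
# Hypothetical imbalanced labels: 80 zeros and 20 ones
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print("ones in train:", y_train.sum(), "ones in test:", y_test.sum())
```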

In [7]:
#Training a Classification Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.6
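
Because the labels here are random and unrelated to the features, an accuracy close to 0.5 is expected. Accuracy alone also hides which class is being misclassified, so a confusion matrix is a useful companion. The sketch below uses small hand-written label arrays, which are hypothetical and chosen so the counts are easy to verify.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
acc = accuracy_score(y_true, y_hat)

print(cm)
print("Accuracy:", acc)
```

The diagonal of the matrix counts correct predictions, so accuracy equals the diagonal sum divided by the total number of samples.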


In [8]:
#Training a Regression Model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Reuses the split created above (X_train, X_test, y_train, y_test),
# so the 0/1 labels are treated here as a continuous target.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
print("MSE: ", mean_squared_error(y_test, y_pred))

MSE:  0.29252412589894317
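
The same steps can be sketched on a target that is genuinely continuous and actually depends on the features. Everything below is an assumption for illustration: the coefficients in `true_coef`, the noise level, and the `*_reg` variable names, which are kept separate from the classification split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_reg = rng.random((100, 3))
true_coef = np.array([2.0, -1.0, 0.5])  # hypothetical coefficients
# Linear signal plus a small amount of Gaussian noise
y_reg = X_reg @ true_coef + rng.normal(scale=0.01, size=100)

# Split the regression data on its own, separate from the classifier's split
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

reg = LinearRegression().fit(Xr_train, yr_train)
mse = mean_squared_error(yr_test, reg.predict(Xr_test))

print("Learned coefficients:", reg.coef_)
print("MSE:", mse)
```

With a real linear relationship in the data, the learned coefficients should land close to the ones used to generate the target.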


3. Model Selection & Hyperparameter Tuning

In [9]:
#Grid search for Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)


Best Parameters: {'max_depth': 10, 'n_estimators': 50}
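
Beyond `best_params_`, a fitted grid search also exposes the best cross-validated score, the refitted model, and the full table of results. The sketch below uses a deliberately small grid and dataset so that it runs quickly; the data and grid values are assumptions, not the ones from the cell above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((60, 5))
y = (X[:, 0] > 0.5).astype(int)  # hypothetical label that depends on one feature

# Small grid: 2 x 2 = 4 candidate settings
param_grid = {"n_estimators": [10, 20], "max_depth": [None, 3]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

# Mean cross-validated accuracy of the best candidate
print("Best score:", search.best_score_)

# One entry per candidate, in the order they were evaluated
mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(search.cv_results_["params"], mean_scores):
    print(params, round(float(score), 3))

# With the default refit=True, best_estimator_ is retrained on all of X, y
best_model = search.best_estimator_
predictions = best_model.predict(X[:5])
```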


In [13]:
#Cross Validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

# Sample dataset
X = np.random.rand(100, 5)  # Features
y = np.random.rand(100)  # Continuous target (regression problem)

# The target is continuous, so use a regressor rather than a classifier
regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Perform 5-fold cross-validation; the scorer returns MSE negated so that higher is better
scores = cross_val_score(regressor, X, y, cv=5, scoring='neg_mean_squared_error')

print("Cross-validation Scores (negative MSE):", scores)
print("Mean negative MSE:", scores.mean())

Cross-validation Scores (negative MSE): [-0.06717358 -0.12617092 -0.12283851 -0.13094387 -0.09280789]
Mean negative MSE: -0.10798695673196115
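
The `neg_mean_squared_error` scorer returns negated values because scikit-learn treats every score as "higher is better". Flipping the sign recovers the usual MSE, and a square root gives RMSE in the units of the target. A minimal sketch, using its own random data and a smaller forest than the cell above to keep it quick:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.random(100)

model = RandomForestRegressor(n_estimators=20, random_state=42)
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

mse = -neg_mse       # flip the sign to get ordinary MSE per fold
rmse = np.sqrt(mse)  # same units as the target

print("MSE per fold:", mse)
print("Mean RMSE:", rmse.mean())
```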


4. Model Persistence (Saving & Loading Models)

In [14]:
import joblib

# Save Model
joblib.dump(clf, "random_forest_model.pkl")

# Load Model
loaded_model = joblib.load("random_forest_model.pkl")
print(loaded_model.predict(X_test))

[1 1 1 0 1 1 0 0 0 1 1 0 0 1 1 0 1 0 1 0]
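
A simple way to confirm that a saved model survived the round trip is to compare predictions before and after loading. The sketch below writes to a temporary directory so it leaves no files behind; the data, model size, and file name are assumptions for illustration. Note that `joblib.load` can execute arbitrary code, so it should only be used on files from a trusted source.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((50, 5))
y = rng.integers(0, 2, size=50)

model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Save and reload inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions match:", same_predictions)
```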
