# Scikit-learn Estimator API

# The 'fit' Method: Learning from Data

The **fit** method is where the learning happens. When you call fit on an estimator, you provide it with your training data. The estimator then analyzes this data to learn patterns, relationships, and parameters that will be used for making predictions or transformations later on.

**Syntax**: `estimator.fit(X_train, y_train)`

Here:

- **estimator**: An instance of a Scikit-learn model (e.g., LinearRegression(), LogisticRegression(), KMeans()).
- **X_train**: The training feature data, typically a NumPy array or Pandas DataFrame.
- **y_train**: The training target data (labels or values), also typically a NumPy array or Pandas Series. For unsupervised learning algorithms, y_train is not required.
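
As a minimal sketch of the call described above, the following fits one supervised and one unsupervised estimator on a tiny made-up dataset. The array values and variable names are illustrative assumptions, not part of the original material.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Toy data (illustrative values): 6 samples, 1 feature, with y = 2x + 1
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_train = 2 * X_train.ravel() + 1

# Supervised estimator: fit takes both the features and the targets
reg = LinearRegression()
reg.fit(X_train, y_train)
print("Learned slope:", reg.coef_, "intercept:", reg.intercept_)

# Unsupervised estimator: fit takes only the features
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X_train)
print("Cluster centers:", km.cluster_centers_.ravel())
```

Attributes that end in an underscore, such as `coef_` or `cluster_centers_`, hold what the estimator learned during `fit`.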

# The 'predict' Method: Making Predictions

Once an estimator has been fitted to the training data, it can be used to make predictions on new, unseen data. This is achieved using the predict method.

**Syntax**: `predictions = estimator.predict(X_test)`

- **X_test**: The feature data for which you want to make predictions. This should have the same structure and features as the data used for fitting, but it contains new, unseen samples.
- **predictions**: The output of the method, which will be an array of predicted values or class labels, corresponding to each sample in X_test.
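
The fit-then-predict sequence can be sketched as follows. The numbers are made up for illustration and follow an exact linear rule so the result is easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative training data following y = 2x + 1
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

# fit returns the estimator itself, so the calls can be chained
model = LinearRegression().fit(X_train, y_train)

# New samples with the same number of features as the training data
X_test = np.array([[5.0], [10.0]])
predictions = model.predict(X_test)

print(predictions.shape)  # one prediction per row of X_test
print(predictions)
```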

# The 'transform' Method: Data Preprocessing and Feature Engineering

While fit and predict are central to model building, many Scikit-learn objects, particularly those in the sklearn.preprocessing module, use the transform method. These objects are often called transformers.

Transformers often have a fit_transform method, which combines fitting the transformer to the data and then transforming the data in one step. Alternatively, you can call fit and then transform separately.                                                                                     

#### Using fit_transform
transformer = StandardScaler()                                                                                                                      
X_scaled = transformer.fit_transform(X_train)

#### Using fit and transform separately
transformer = StandardScaler()                                                                                                                      
transformer.fit(X_train)                                                                                                                           
X_scaled = transformer.transform(X_train)
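
To check that the two approaches above give the same result, here is a small sketch on randomly generated data. The data and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: 20 samples, 3 features
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(20, 3))

# One step: fit and transform together
one_step = StandardScaler().fit_transform(X_train)

# Two steps: fit first, then transform
scaler = StandardScaler()
scaler.fit(X_train)
two_step = scaler.transform(X_train)

print("Same result:", np.allclose(one_step, two_step))
```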

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [3]:
data = {'feature1': np.random.rand(100) * 100,
        'feature2': np.random.rand(100) * 1000,
        'target': np.random.rand(100) * 50}
df = pd.DataFrame(data)

X = df[['feature1', 'feature2']]
y = df['target']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

# Instantiate and fit_transform on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Learns mean and std from X_train, then scales X_train with them

# Transform test data using the SAME scaler fitted on training data
X_test_scaled = scaler.transform(X_test)

print("Original X_train shape:", X_train.shape)
print("Scaled X_train shape:", X_train_scaled.shape)
print("Original X_test shape:", X_test.shape)
print("Scaled X_test shape:", X_test_scaled.shape)
print("Sample of scaled X_train: ", X_train_scaled[:5])
print("Sample of scaled X_test: ", X_test_scaled[:5])

Original X_train shape: (80, 2)
Scaled X_train shape: (80, 2)
Original X_test shape: (20, 2)
Scaled X_test shape: (20, 2)
Sample of scaled X_train:  [[ 0.75759118 -0.10651748]
 [ 0.12090698 -1.27390205]
 [-1.23829021  0.06255202]
 [ 1.48969527  0.00477849]
 [ 1.67021941  1.16041632]]
Sample of scaled X_test:  [[-0.84839566 -0.15582781]
 [ 0.4496746   0.29350611]
 [-1.50850036 -0.62152603]
 [ 1.28451798  0.63299618]
 [-0.44825264 -0.34558095]]
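
What the scaler actually learns can be inspected directly. The sketch below, on separately generated illustrative data, shows that `transform` applies the training mean and standard deviation (stored in `mean_` and `scale_`) to the test set, which is why the scaled test data is generally not centered exactly at zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data standing in for a train/test split
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 100, size=(80, 2))
X_test = rng.uniform(0, 100, size=(20, 2))

scaler = StandardScaler().fit(X_train)

# Per-feature statistics learned from the training data only
print("Learned means:", scaler.mean_)
print("Learned standard deviations:", scaler.scale_)

# transform applies (x - mean) / std using those training statistics
manual = (X_test - scaler.mean_) / scaler.scale_
X_test_scaled = scaler.transform(X_test)
X_train_scaled = scaler.transform(X_train)

print("Manual formula matches transform:", np.allclose(manual, X_test_scaled))
print("Scaled train means:", X_train_scaled.mean(axis=0))  # approximately 0
print("Scaled test means:", X_test_scaled.mean(axis=0))    # usually close to, but not exactly, 0
```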


# Structuring Your Data: Feature Matrices (X) and Target Vectors (y)

The feature matrix, X, is a 2-dimensional array-like structure (typically a NumPy array or a Pandas DataFrame) where each row represents a single observation or sample, and each column represents a specific feature or attribute of that observation. These features are the independent variables that your model will use to learn and make predictions.

##### Why is it important?

Machine learning models learn by identifying patterns and relationships between features and the target variable. The structure of X ensures that the model receives consistent information for each observation. Each feature column should contain numerical data (or be appropriately encoded if it's categorical) and should be relevant to the problem you are trying to solve.

In [12]:
data = {
    'CustomerID': [f'CUST{i:04d}' for i in range(1, 11)],
    'Tenure': np.random.randint(1, 72, 10),
    'MonthlyCharges': np.random.uniform(20, 120, 10).round(2),
    'TotalCharges': np.random.uniform(20, 8000, 10).round(2),
    'ContractType': np.random.choice(['Month-to-month', 'One year', 'Two year'], 10),
    'InternetService': np.random.choice(['DSL', 'Fiber optic', 'No'], 10),
    'Churn': np.random.randint(0, 2, 10) # 0 for No, 1 for Yes
}
df = pd.DataFrame(data)

# Define features (X)
feature_columns = ['Tenure', 'MonthlyCharges', 'TotalCharges', 'ContractType', 'InternetService']
X = df[feature_columns]      # It selects only the columns listed in feature_columns from the DataFrame df

print("Feature Matrix (X):")
print(X.head())
print("Shape of X:", X.shape)

Feature Matrix (X):
   Tenure  MonthlyCharges  TotalCharges    ContractType InternetService
0       9          104.36       6045.15        Two year             DSL
1       4           75.65       4970.96        Two year     Fiber optic
2      69          102.31       3188.17  Month-to-month     Fiber optic
3      36          116.65       1031.48        One year             DSL
4      12           54.96       5960.68        Two year              No
Shape of X: (10, 5)
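
The feature matrix above still contains text columns, which most Scikit-learn estimators cannot consume directly. One possible way to encode them is one-hot encoding with `pd.get_dummies`, sketched here on a three-row excerpt. The choice of `get_dummies` is an assumption for illustration; `OneHotEncoder` from `sklearn.preprocessing` is an alternative.

```python
import pandas as pd

# Small excerpt with one numeric and two categorical columns
X = pd.DataFrame({
    'Tenure': [9, 4, 69],
    'ContractType': ['Two year', 'Two year', 'Month-to-month'],
    'InternetService': ['DSL', 'Fiber optic', 'Fiber optic'],
})

# Replace each categorical column with one 0/1 column per category
X_encoded = pd.get_dummies(
    X, columns=['ContractType', 'InternetService'], dtype=int
)

print(X_encoded.columns.tolist())
print("Shape of encoded X:", X_encoded.shape)
```

Each category becomes its own column, so the number of columns grows with the number of distinct values.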


The target vector, y, is a 1-dimensional array-like structure (typically a NumPy array or a Pandas Series) that contains the values or labels you are trying to predict. Each element in y corresponds to the target value for the observation at the same index in the feature matrix X.

##### Why is it important?

The target vector is the 'answer' that your model aims to learn. During training (the fit process), the model uses X to predict values and compares these predictions to the actual values in y to adjust its internal parameters. For supervised learning tasks, y is essential.

In [13]:
target_column = 'Churn'
y = df[target_column]

print("Target Vector (y):")
print(y.head())
print("Shape of y:", y.shape)

Target Vector (y):
0    1
1    1
2    0
3    0
4    0
Name: Churn, dtype: int32
Shape of y: (10,)


### The Relationship Between X and y
It is crucial that the number of rows in X (number of samples) exactly matches the number of elements in y (number of target values). If they do not match, Scikit-learn raises a `ValueError` when you call fit. This ensures that each feature set has a corresponding correct answer for the model to learn from.
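
A quick way to see this check in action is to pass deliberately mismatched arrays to `fit`. The arrays below are made up for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10, dtype=float).reshape(5, 2)  # 5 samples, 2 features
y = np.array([1.0, 2.0, 3.0, 4.0])            # only 4 target values

try:
    LinearRegression().fit(X, y)
    error_message = None
except ValueError as exc:
    # Scikit-learn validates the inputs before any learning happens
    error_message = str(exc)
    print("ValueError:", error_message)
```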

# Splitting Data into Training and Testing Sets with train_test_split

The training set is used to 'teach' the model. The model learns patterns, relationships, and parameters from this data. The testing set, on the other hand, is held back and used only after the model has been trained. It serves as a proxy for real-world, unseen data, allowing us to get an unbiased estimate of the model's performance.

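Splitting the data this way matters for three reasons:

- **Detecting Overfitting**: This is the primary reason. A model that performs perfectly on training data but poorly on test data is overfitted, and only held-out data can reveal that gap.
- **Unbiased Evaluation**: The test set provides an honest assessment of how the model is likely to perform in production.
- **Model Selection**: When comparing different models or different hyperparameter settings for the same model, performance on held-out data helps you choose the best one. Ideally, make that comparison on a separate validation set or with cross-validation, so the test set stays untouched for the final estimate.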

**Syntax**: `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)`

- **X**: Feature matrix (input data), a Pandas DataFrame or NumPy array
- **y**: Target vector (output/label), a Pandas Series or NumPy array
- **test_size**: Fraction or number of samples for the test set (e.g., 0.2 = 20%). Remaining data is used for training
- **random_state**: Fixes randomness for reproducible splits (e.g., 42)
- **shuffle (default: True)**: Shuffles data before splitting to ensure randomness
- **stratify (optional)**: Maintains the same class distribution in train and test sets (use stratify=y for classification)

In [16]:
data = {
    'FeatureA': np.random.rand(100) * 10,
    'FeatureB': np.random.rand(100) * 5,
    'FeatureC': np.random.randint(0, 2, 100), # Example of a categorical-like feature
    'Target': np.random.rand(100) * 100 # A continuous target for regression
}
df = pd.DataFrame(data)

feature_column = ['FeatureA', 'FeatureB', 'FeatureC']
X = df[feature_column]
y = df['Target']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

print("--- After Splitting ---")
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

--- After Splitting ---
Shape of X_train: (80, 3)
Shape of X_test: (20, 3)
Shape of y_train: (80,)
Shape of y_test: (20,)
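
The effect of the `stratify` parameter described earlier is easiest to see on imbalanced labels. The sketch below uses a made-up label vector with 20% positives and checks the class proportion on each side of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced labels: 80 zeros and 20 ones
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both sets keep the original 20% positive rate
print("Positive rate in train:", y_train.mean())
print("Positive rate in test:", y_test.mean())
```

Without `stratify`, the positive rate in a small test set can drift noticeably from the overall rate purely by chance.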


## Standard Scikit-learn Workflow

1. **Import Libraries**  - Import required Python libraries such as NumPy, Pandas, and Scikit-learn.

2. **Load Data**  - Load the dataset into a Pandas DataFrame.

3. **Data Preprocessing & Feature Engineering**  
   - Handle missing values (imputation)  
   - Encode categorical variables (e.g., One-Hot Encoding)  
   - Scale numerical features (Standardization / Normalization), fitting the scaler on the training split only and reusing it on the test split  
   - Create new features if required  

4. **Define Features and Target**  - Separate the dataset into feature matrix **X** and target vector **y**.

5. **Split the Data**  -  Use `train_test_split` to divide data into training and testing sets.

6. **Model Selection**  -  Choose and instantiate a suitable machine learning model.

7. **Train the Model**  -  Fit the model using the training data.

8. **Make Predictions**  -  Predict outputs for the test data.

9. **Evaluate the Model**  -  Evaluate model performance using appropriate metrics.

10. **Iterate and Improve**  - Tune hyperparameters, try different models, or improve feature engineering.
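
The steps above can be sketched end to end on synthetic data. The dataset, column names, and model choice are assumptions made for illustration. The scaler is fitted after the split, on the training data only, to match the earlier scaling example.

```python
# 1. Import libraries
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 2. Load data (a synthetic stand-in for a real dataset)
X_arr, y_arr = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
feature_names = ['f1', 'f2', 'f3', 'f4']  # hypothetical column names
df = pd.DataFrame(X_arr, columns=feature_names)
df['label'] = y_arr

# 4. Define features and target
X = df[feature_names]
y = df['label']

# 5. Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 3. Preprocess: learn scaling statistics from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 6-7. Choose a model and train it
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# 8. Predict on the test data
y_pred = model.predict(X_test_scaled)

# 9. Evaluate
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.2f}")
```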


# Tools for Your ML Toolkit

### 1. The linear_model Module: Algorithms for Linear Relationships                                                                      
The linear_model module contains a variety of algorithms that assume a linear relationship between the input features and the target variable. These are often some of the simplest yet most powerful models, serving as excellent baselines.                                                              

##### Key Algorithms and Use Cases

- **LinearRegression**  
  Used for regression problems with a continuous target variable.  
  Fits the best straight line (or hyperplane) to the data.

- **Ridge Regression**  
  Linear regression with **L2 regularization**.  
  Reduces overfitting by shrinking coefficients, useful when features are highly correlated.

- **Lasso Regression**  
  Linear regression with **L1 regularization**.  
  Can reduce some coefficients exactly to zero, performing **feature selection**.

- **ElasticNet**  
  Combines **L1 (Lasso)** and **L2 (Ridge)** regularization.  
  Useful when multiple features are correlated and feature selection is needed.

- **LogisticRegression**  
  Used for **classification**, most commonly binary, though multiclass problems are also supported.  
  Models the probability of class membership using a logistic (sigmoid) function.

- **SGDClassifier / SGDRegressor**  
  Linear models trained using **Stochastic Gradient Descent (SGD)**.  
  Efficient for **large-scale datasets** and online learning.


In [18]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import make_regression, make_classification

X_reg, y_reg = make_regression(n_samples=100,n_features=1,noise=10,random_state=42)
reg_model = LinearRegression()
reg_model.fit(X_reg, y_reg)
print(f"Linear Regression Coefficients: {reg_model.coef_}")

Linear Regression Coefficients: [44.43716999]


In [20]:
X_clf, y_clf = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=42)
clf_model = LogisticRegression()
clf_model.fit(X_clf, y_clf)
print(f"Logistic Regression Coefficients: {clf_model.coef_}")
print(f"Logistic Regression Intercept: {clf_model.intercept_}")

Logistic Regression Coefficients: [[ 3.23216767 -0.84594518]]
Logistic Regression Intercept: [0.05126339]
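
To illustrate the difference between L2 and L1 regularization listed above, the sketch below fits `Ridge` and `Lasso` on synthetic data in which only 3 of 10 features are informative. The dataset parameters and `alpha=1.0` are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 10 features, only 3 of which influence the target
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks coefficients toward zero; Lasso can set some exactly to zero
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Exact zeros (Ridge):", n_zero_ridge)
print("Exact zeros (Lasso):", n_zero_lasso)
```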


### 2. The metrics Module: Evaluating Model Performance

The metrics module provides a comprehensive set of functions for evaluating the performance of machine learning models. These metrics help you understand how well your model is performing and identify areas for improvement.                                                                     

#### Key Evaluation Metrics and Use Cases
The choice of evaluation metrics depends on whether the problem is **regression** or **classification**.

##### Regression Metrics

- **Mean Squared Error (MSE)**  
  Measures the average squared difference between predicted and actual values.  
  Penalizes larger errors more heavily.

- **Mean Absolute Error (MAE)**  
  Measures the average absolute difference between predicted and actual values.  
  Less sensitive to outliers compared to MSE.

- **R² Score (r2_score)**  
  Indicates the proportion of variance in the target variable explained by the model.  
  A value of `1` represents a perfect fit.
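
The three definitions above can be checked by hand against the Scikit-learn functions. The sketch uses four made-up values so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Illustrative values: errors are 1, 0, -1, -3
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

# MSE: mean of squared errors
mse_manual = np.mean((y_true - y_pred) ** 2)

# MAE: mean of absolute errors
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual, mean_squared_error(y_true, y_pred))   # 2.75
print("MAE:", mae_manual, mean_absolute_error(y_true, y_pred))  # 1.25
print("R2:", r2_manual, r2_score(y_true, y_pred))               # 0.45
```

The single error of 3 contributes 9 of the 11 squared-error units, which shows how MSE weights large errors more heavily than MAE does.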

##### Classification Metrics

- **Accuracy Score**  
  Proportion of correctly classified samples.  
  Best suited for balanced datasets.

- **Precision Score**  
  Ratio of true positives to all predicted positives:  
  `TP / (TP + FP)`  
  Measures correctness of positive predictions.

- **Recall Score**  
  Ratio of true positives to all actual positives:  
  `TP / (TP + FN)`  
  Measures the model’s ability to identify positive cases.

- **F1 Score**  
  Harmonic mean of precision and recall.  
  Useful when class distribution is imbalanced.

- **Confusion Matrix**  
  Displays counts of true positives, true negatives, false positives, and false negatives.
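
The formulas above can be reproduced from the confusion matrix counts. The sketch below uses ten made-up labels and compares the hand-computed values with the Scikit-learn functions.

```python
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score
)

# Illustrative labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_manual = tp / (tp + fp)  # 3 / 5
recall_manual = tp / (tp + fn)     # 3 / 4
f1_manual = (
    2 * precision_manual * recall_manual / (precision_manual + recall_manual)
)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Precision:", precision_manual, precision_score(y_true, y_pred))
print("Recall:", recall_manual, recall_score(y_true, y_pred))
print("F1:", f1_manual, f1_score(y_true, y_pred))
```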

In [21]:
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import make_regression, make_classification

# Regression Evaluation
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
reg_model = LinearRegression()
reg_model.fit(X_reg_train, y_reg_train)
y_reg_pred = reg_model.predict(X_reg_test)

mse = mean_squared_error(y_reg_test, y_reg_pred)
r2 = r2_score(y_reg_test, y_reg_pred)
print("--- Regression Metrics ---")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

# Classification Evaluation
X_clf, y_clf = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=42)
X_clf_train, X_clf_test, y_clf_train, y_clf_test = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)
clf_model = LogisticRegression(random_state=42)
clf_model.fit(X_clf_train, y_clf_train)
y_clf_pred = clf_model.predict(X_clf_test)

accuracy = accuracy_score(y_clf_test, y_clf_pred)
conf_matrix = confusion_matrix(y_clf_test, y_clf_pred)
class_report = classification_report(y_clf_test, y_clf_pred)

print("--- Classification Metrics ---")
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:", conf_matrix)
print("Classification Report:", class_report)

--- Regression Metrics ---
Mean Squared Error: 104.20
R-squared: 0.94
--- Classification Metrics ---
Accuracy: 0.95
Confusion Matrix: [[10  1]
 [ 0  9]]
Classification Report:               precision    recall  f1-score   support

           0       1.00      0.91      0.95        11
           1       0.90      1.00      0.95         9

    accuracy                           0.95        20
   macro avg       0.95      0.95      0.95        20
weighted avg       0.96      0.95      0.95        20



### 3. The model_selection Module: Managing Model Development

The model_selection module provides tools for selecting the best models and tuning their hyperparameters. It's essential for robust model development and evaluation.                                                                                                                                      

#### Key Functions and Use Cases

- **train_test_split**  
  Splits the dataset into training and testing sets to evaluate model performance on unseen data.

- **KFold / StratifiedKFold**  
  Used for **k-fold cross-validation**, where data is divided into multiple folds and the model is trained and tested multiple times.  
  - `KFold`: Suitable for regression or balanced datasets  
  - `StratifiedKFold`: Preserves class distribution, ideal for classification problems

- **GridSearchCV**  
  Performs **exhaustive hyperparameter tuning** by evaluating all specified parameter combinations using cross-validation.

- **RandomizedSearchCV**  
  Performs **randomized hyperparameter tuning** by sampling a fixed number of parameter combinations.  
  More efficient than GridSearchCV for large search spaces.

- **cross_val_score**  
  Computes cross-validation scores for a model in a simple and efficient way.


In [22]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate data
X_cv, y_cv = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=42)

# Instantiate a model
model_cv = LogisticRegression(random_state=42)

# Define K-Fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42) # 5 folds

# Compute cross-validation scores
# This will train and evaluate the model 5 times
cv_scores = cross_val_score(model_cv, X_cv, y_cv, cv=kf, scoring='accuracy')

print("--- Cross-Validation Scores ---")
print(f"Individual fold scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.2f}")
print(f"Standard deviation of CV scores: {cv_scores.std():.2f}")

--- Cross-Validation Scores ---
Individual fold scores: [0.95 1.   1.   1.   0.95]
Mean CV accuracy: 0.98
Standard deviation of CV scores: 0.02
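
Hyperparameter tuning with `GridSearchCV`, listed above, follows the same fit pattern as any other estimator. The sketch below searches over the regularization strength `C` of a logistic regression using stratified 5-fold cross-validation. The grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Same kind of synthetic data as in the cross-validation example
X, y = make_classification(
    n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=42
)

# Candidate values to try (illustrative, not tuned recommendations)
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(LogisticRegression(), param_grid, cv=cv, scoring='accuracy')
search.fit(X, y)  # trains 4 candidates x 5 folds, then refits the best on all data

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.2f}")
```

After fitting, `search.predict` uses the refitted best model, so the search object can be used like a regular estimator.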
