# Function
            
Logistic Regression

Question 1: What is Logistic Regression, and how does it differ from Linear
Regression?

Ans Logistic Regression and Linear Regression are both statistical models, but they’re used for different types of prediction problems and work in different ways.

1. Logistic Regression

Purpose: Used for classification problems (predicting categories, e.g., Yes/No, 0/1, Spam/Not Spam).

Output: Predicts probabilities between 0 and 1, then classifies them into categories using a threshold (e.g., 0.5).

Mathematics:

Instead of predicting the value directly, it passes the linear output through the sigmoid (logistic) function, which is the inverse of the logit, to obtain a probability:

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \dots + \beta_nx_n)}}

2. Linear Regression

Purpose: Used for regression problems (predicting continuous numerical values, e.g., salary, temperature, house price).

Output: Predicts a continuous value that can be any real number (positive or negative).

Mathematics:

Directly predicts using a linear equation:

y = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n
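To make the contrast concrete, here is a minimal sketch that fits both models on a tiny one-feature dataset. The data and the two query points are made up for illustration. It shows that the linear model can return values below 0 or above 1, while the logistic model always returns a probability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up for illustration): one feature, binary label
X = np.arange(8).reshape(-1, 1)          # 0, 1, ..., 7
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Query points far outside the training range
X_new = np.array([[-5], [12]])

# Linear regression treats the label as a number and extrapolates freely
linear_preds = LinearRegression().fit(X, y).predict(X_new)

# Logistic regression returns a probability for class 1
logistic_probs = LogisticRegression().fit(X, y).predict_proba(X_new)[:, 1]

print("Linear regression output:  ", linear_preds)
print("Logistic regression P(y=1):", logistic_probs)
```

The linear predictions fall outside the range [0, 1], which is why Linear Regression is a poor fit for predicting class membership.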

Question 2: Explain the role of the Sigmoid function in Logistic Regression.

Ans In Logistic Regression, the sigmoid function plays a crucial role in transforming the output of a linear model into a probability score between 0 and 1. This transformation is essential for binary classification problems, where the goal is to predict the probability of an instance belonging to a particular class.
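The transformation can be sketched in a few lines of NumPy. The function name `sigmoid` and the example scores are chosen here for illustration; the scores stand in for the linear output \beta_0 + \beta_1x_1 + \dots + \beta_nx_n.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear scores chosen for illustration
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(scores)

# Apply the usual 0.5 threshold to turn probabilities into class labels
labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(labels)
```

A score of 0 maps to a probability of exactly 0.5, large negative scores approach 0, and large positive scores approach 1.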

Question 3: What is Regularization in Logistic Regression and why is it needed?

Ans Regularization in Logistic Regression is a technique used to prevent overfitting by adding a penalty term to the cost function, which discourages the model from assigning too much importance (large coefficients) to any single feature.

Why Regularization is Needed

In Logistic Regression, if some features have very large coefficients:

The model fits the training data too closely (overfitting).

It performs poorly on new/unseen data (low generalization).


Regularization controls the complexity of the model so it focuses on truly important features.
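The shrinking effect on the coefficients can be seen in a small experiment. This is a sketch under assumptions: it uses the Iris dataset, standardizes the features first, and picks C=0.01 and C=100 as arbitrary examples of strong and weak regularization (in scikit-learn, C is the inverse of the regularization strength, and the default penalty is L2).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # put features on a comparable scale

# In scikit-learn, C is the inverse of the regularization strength
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)  # strong penalty
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)   # weak penalty

# Overall size of the coefficient matrix under each setting
strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)

print(f"Coefficient norm with C=0.01: {strong_norm:.3f}")
print(f"Coefficient norm with C=100:  {weak_norm:.3f}")
```

The strongly regularized model ends up with much smaller coefficients, which is the mechanism that limits overfitting.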

Question 4: What are some common evaluation metrics for classification models, and
why are they important?

Ans Evaluation metrics are used to measure how well a classification model performs. They are important because they help us:

1. Quantify accuracy — to see how close predictions are to the actual outcomes.


2. Choose the right model — especially when comparing multiple models.


3. Handle class imbalance — some metrics work better when data is skewed.
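Commonly used metrics include accuracy, precision, recall, and F1-score, all of which can be derived from the confusion matrix. The sketch below computes them with scikit-learn on a small set of hand-made labels (the labels are made up for illustration, not taken from a real model).

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made labels for illustration: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} Recall={recall:.2f} F1={f1:.2f}")
```

Here the counts are TN=5, FP=1, FN=1, TP=3, giving an accuracy of 0.80 and precision, recall, and F1 of 0.75 each.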


Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load dataset from sklearn
iris = load_iris()

# 2. Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# 3. Features (X) and Target (y)
X = df.drop('target', axis=1)
y = df['target']

# 4. Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5. Create Logistic Regression model
model = LogisticRegression(max_iter=200)

# 6. Train the model
model.fit(X_train, y_train)

# 7. Predict on test set
y_pred = model.predict(X_test)

# 8. Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

Model Accuracy: 1.00
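The program above builds the DataFrame directly from the sklearn dataset. If an actual CSV file is required, one possible variant is to write the dataset to disk first and read it back with `pd.read_csv`. This is a sketch; the temporary directory and the file name `iris.csv` are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the sklearn dataset to a CSV file (hypothetical file name)
csv_path = os.path.join(tempfile.mkdtemp(), "iris.csv")
load_iris(as_frame=True).frame.to_csv(csv_path, index=False)

# Load the CSV file into a Pandas DataFrame
df = pd.read_csv(csv_path)

# Features (X) and target (y)
X = df.drop("target", axis=1)
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Model Accuracy: {accuracy:.2f}")
```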



Question 6: Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load dataset
iris = load_iris()

# 2. Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# 3. Features and target
X = df.drop('target', axis=1)
y = df['target']

# 4. Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5. Create Logistic Regression model with L2 regularization
model = LogisticRegression(
    penalty='l2',   # L2 regularization
    C=1.0,          # Regularization strength (lower value = stronger regularization)
    solver='lbfgs', # Solver that supports L2
    max_iter=200
)

# 6. Train the model
model.fit(X_train, y_train)

# 7. Predictions
y_pred = model.predict(X_test)

# 8. Accuracy
accuracy = accuracy_score(y_test, y_pred)

# 9. Print coefficients and accuracy
print("Model Coefficients:")
print(model.coef_)
print("\nModel Intercept:")
print(model.intercept_)
print(f"\nModel Accuracy: {accuracy:.2f}")

Model Coefficients:
[[-0.39345607  0.96251768 -2.37512436 -0.99874594]
 [ 0.50843279 -0.25482714 -0.21301129 -0.77574766]
 [-0.11497673 -0.70769055  2.58813565  1.7744936 ]]

Model Intercept:
[  9.00884295   1.86902164 -10.87786459]

Model Accuracy: 1.00



Question 7: Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1. Load dataset from sklearn
iris = load_iris()

# 2. Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# 3. Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# 4. Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5. Create Logistic Regression model with One-vs-Rest strategy
model = LogisticRegression(
    multi_class='ovr',
    solver='lbfgs',
    max_iter=200
)

# 6. Train the model
model.fit(X_train, y_train)

# 7. Make predictions
y_pred = model.predict(X_test)

# 8. Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))



Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
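Note that the `multi_class` argument has, to my knowledge, been deprecated in recent scikit-learn releases (1.5 and later), so the program above may emit a warning or fail on newer versions. A possible alternative is to wrap the model in `OneVsRestClassifier`, which makes the One-vs-Rest strategy explicit. The sketch below assumes the same Iris data and split as above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Wrap a binary Logistic Regression in an explicit One-vs-Rest meta-estimator
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=200))
ovr_model.fit(X_train, y_train)

# One binary classifier is trained per class
print("Number of binary classifiers:", len(ovr_model.estimators_))

y_pred = ovr_model.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```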




Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# 1. Load dataset
iris = load_iris()

# 2. Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# 3. Features and target
X = df.drop('target', axis=1)
y = df['target']

# 4. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5. Create Logistic Regression model
log_reg = LogisticRegression(max_iter=500)

# 6. Parameter grid for GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Inverse regularization strength (smaller = stronger)
    'penalty': ['l1', 'l2'],       # L1 = Lasso, L2 = Ridge
    'solver': ['liblinear']        # Supports both L1 and L2
}

# 7. Apply GridSearchCV
grid_search = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    cv=5,             # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)

# 8. Fit GridSearchCV
grid_search.fit(X_train, y_train)

# 9. Print best parameters and validation accuracy
print("Best Parameters:", grid_search.best_params_)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Best Cross-Validation Accuracy: 0.9583



Question 9: Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 1. Load dataset
iris = load_iris()

# 2. Convert to DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# 3. Features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# 4. Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -----------------------------
# Logistic Regression without scaling
# -----------------------------
model_no_scaling = LogisticRegression(max_iter=200)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# -----------------------------
# Logistic Regression with StandardScaler
# -----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_with_scaling = LogisticRegression(max_iter=200)
model_with_scaling.fit(X_train_scaled, y_train)
y_pred_with_scaling = model_with_scaling.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred_with_scaling)

# -----------------------------
# Results
# -----------------------------
print(f"Accuracy without scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with scaling:    {accuracy_with_scaling:.4f}")

Accuracy without scaling: 1.0000
Accuracy with scaling:    1.0000
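Because a single 30-sample test set gives the same score for both models, the comparison is not very informative on its own. One way to get a steadier estimate is 5-fold cross-validation, with the scaler placed inside a pipeline so it is refit on the training part of each fold. This is a sketch; the choice of 5 folds is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

unscaled = LogisticRegression(max_iter=200)
# The pipeline refits the scaler on the training part of every fold
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

scores_unscaled = cross_val_score(unscaled, X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(scaled, X, y, cv=5, scoring="accuracy")

print(f"Mean CV accuracy without scaling: {scores_unscaled.mean():.4f}")
print(f"Mean CV accuracy with scaling:    {scores_scaled.mean():.4f}")
```

On Iris the features already share similar ranges, so a large gap is not expected; scaling tends to matter more when features are on very different scales.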


Question 10: Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.

Ans Let's walk through this step by step as if we were actually in that e-commerce setting, where only 5% of customers respond to the campaign.
That is heavily imbalanced, so training a default Logistic Regression and reporting its accuracy is not enough; every stage of the workflow has to account for the imbalance.

1. Understand the Problem and Dataset

Goal: Predict the probability that a customer will respond to a marketing campaign.

Nature: Binary classification (respond vs not respond) with class imbalance (5% positive class).

Business impact: False positives cost marketing spend; false negatives lose potential conversions.

2. Data Handling

Data cleaning:

Handle missing values.

Remove duplicates.

Treat outliers (especially in features like purchase amount or activity counts).

Feature engineering:

Create features like recency, frequency, monetary value (RFM analysis).

Include channel engagement data (email clicks, site visits, ad interactions).

Encode categorical variables (One-Hot Encoding or Target Encoding).

Train-test split:

Stratified split so the 5% positive ratio is preserved in both sets.
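The stratified split can be sketched as follows. Since the real campaign data is not available here, the example uses `make_classification` to generate a synthetic stand-in with roughly 5% positives; the sample size and feature count are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data (assumption): roughly 5% positives
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# stratify=y keeps the class ratio (almost) identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"Positive rate in train: {train_rate:.3f}")
print(f"Positive rate in test:  {test_rate:.3f}")
```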

3. Feature Scaling

Logistic Regression is sensitive to feature scale when regularization is applied.

Apply StandardScaler (mean = 0, std = 1) to numeric variables.

Fit scaler only on training data, then transform both train & test sets.

4. Balancing Classes

Two main options:

1. Class weights

In scikit-learn: LogisticRegression(class_weight='balanced')

Reweights the loss function so that misclassifying the minority class costs more.

2. Resampling

Oversampling (e.g., SMOTE) to generate synthetic positive samples.

Undersampling majority class to reduce imbalance.

Sometimes combining both works best.
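To see what the class-weight option does, the weights that `'balanced'` assigns can be computed explicitly. The sketch below uses toy labels with the same 95/5 split as the scenario (the labels are made up for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption): 95 non-responders and 5 responders
y = np.array([0] * 95 + [1] * 5)

# 'balanced' uses n_samples / (n_classes * count_of_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(f"Weight for class 0 (no response): {weights[0]:.3f}")
print(f"Weight for class 1 (response):    {weights[1]:.3f}")

# The same weighting is applied internally when this option is set
model = LogisticRegression(class_weight="balanced", max_iter=1000)
```

With these counts, each responder carries 19 times the weight of a non-responder (10.0 versus about 0.526), so errors on the minority class dominate the loss much more than they would by default.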

5. Hyperparameter Tuning

Key hyperparameters for Logistic Regression:

C (inverse regularization strength) — tune across a log scale ([0.01, 0.1, 1, 10, 100]).

penalty — test L1 and L2 regularization.

solver — choose based on penalty (liblinear for L1/L2; lbfgs for L2).

Use GridSearchCV or RandomizedSearchCV with stratified cross-validation.

Include class_weight as a parameter to tune as well.
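The tuning step could look roughly like this. It is a sketch under assumptions: synthetic data from `make_classification` stands in for the campaign data, the step names `scaler` and `clf` are arbitrary, and `average_precision` is used as the scoring metric because it summarizes the precision-recall curve. In practice the search would be fit on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (assumption): roughly 5% positives
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Scaling lives inside the pipeline so it is refit on each training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10, 100],
    "clf__penalty": ["l1", "l2"],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="average_precision",  # summarizes the precision-recall curve
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated average precision: {search.best_score_:.4f}")
```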

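6. Evaluation Strategy

Do not rely on accuracy; it is misleading under imbalance. A model that predicts "no response" for every customer would score 95% accuracy while finding no responders at all.

Use: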

Precision, Recall, and F1-score (especially recall if catching positives is important).

ROC-AUC to measure ranking ability.

Precision-Recall AUC — more informative for imbalanced datasets.

Choose threshold carefully:

Default 0.5 may not be optimal — adjust threshold to maximize business metric (e.g., maximize precision at 20% recall).
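One way to choose the threshold is to scan the precision-recall curve on a validation set and pick the most precise threshold that still meets a recall target. The sketch below assumes synthetic data in place of the campaign data and a hypothetical recall target of 20%.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data (assumption): roughly 5% positives
X, y = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
val_probs = model.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; drop the last one
precision, recall, thresholds = precision_recall_curve(y_val, val_probs)
precision, recall = precision[:-1], recall[:-1]

target_recall = 0.20  # hypothetical business requirement
eligible = np.where(recall >= target_recall)[0]

# Among thresholds that meet the recall target, take the most precise one
best = eligible[np.argmax(precision[eligible])]
chosen_threshold = thresholds[best]

y_pred = (val_probs >= chosen_threshold).astype(int)
achieved_recall = recall_score(y_val, y_pred)
print(f"Chosen threshold: {chosen_threshold:.3f}")
print(f"Precision: {precision[best]:.3f}, recall: {achieved_recall:.3f}")
```

The threshold is selected on a validation set rather than the final test set, so the test set still gives an unbiased estimate of performance afterwards.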

7. Deployment Considerations

Ensure feature scaling & preprocessing are part of a pipeline so production data is transformed the same way.
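A pipeline of this kind might be sketched as follows. The customer table, its column names (`recency_days`, `order_count`, `channel`, `responded`), and its values are all hypothetical and made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical customer table; column names and values are made up
train = pd.DataFrame({
    "recency_days": [5, 40, 12, 90, 3, 60, 25, 120],
    "order_count": [12, 2, 7, 1, 15, 3, 5, 1],
    "channel": ["email", "ads", "email", "ads", "app", "email", "app", "ads"],
    "responded": [1, 0, 1, 0, 1, 0, 0, 0],
})

# Scale numeric columns and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["recency_days", "order_count"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipeline.fit(train.drop(columns="responded"), train["responded"])

# New customers go through exactly the same fitted transformations
new_customers = pd.DataFrame({
    "recency_days": [7, 200],
    "order_count": [9, 1],
    "channel": ["email", "sms"],  # "sms" was never seen during training
})
response_probs = pipeline.predict_proba(new_customers)[:, 1]
print(response_probs)
```

Setting `handle_unknown="ignore"` lets the encoder accept categories that did not appear in training, which is a common situation in production data.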

Monitor:

Data drift (customer behavior may change over time).

Model performance over campaigns — retrain periodically.