

**Question 1: Logistic Regression vs Linear Regression**
>Logistic Regression is a supervised learning algorithm used mainly for classification problems, especially binary outcomes like spam/not spam or disease/no disease. It models the relationship between independent variables and a categorical dependent variable by estimating probabilities. Unlike Linear Regression, which predicts continuous numeric outputs by fitting a straight line, Logistic Regression predicts the probability of belonging to a class. To achieve this, it applies the logistic (sigmoid) function to the linear combination of input features, ensuring outputs always lie between 0 and 1. Linear Regression can give values outside this range, making it unsuitable for classification. Logistic Regression then compares the predicted probability against a decision threshold (commonly 0.5) to assign the final class label.
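
As a minimal sketch of the difference described above, the snippet below computes an unbounded linear score, squashes it into a probability, and applies a 0.5 threshold. The weights, bias, and input are hypothetical values chosen only for illustration, not fitted parameters.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(x, weights, bias):
    """Raw linear combination w.x + b, the quantity Linear Regression outputs."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict_label(x, weights, bias, threshold=0.5):
    """Return class 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(linear_score(x, weights, bias)) >= threshold else 0

# Hypothetical parameters and input (assumptions for illustration only)
weights, bias = [2.0, -1.0], 0.5
x = [3.0, 1.0]

score = linear_score(x, weights, bias)   # unbounded: can fall outside [0, 1]
prob = sigmoid(score)                    # always strictly between 0 and 1
label = predict_label(x, weights, bias)

print("linear score:", score)
print("probability :", prob)
print("class label :", label)
```

The threshold is a parameter rather than a constant so that it can be moved away from 0.5 when false positives and false negatives carry different costs.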



**Question 2: Role of Sigmoid Function in Logistic Regression**
>The Sigmoid function plays a crucial role in Logistic Regression by converting the linear equation’s output into a probability value between 0 and 1. Its S-shaped curve allows the model to represent uncertainty and smooth transitions between classes. When the input to the sigmoid is large and positive, the output approaches 1; when it is large and negative, the output approaches 0; at an input of exactly 0, the output is 0.5. This makes it ideal for mapping inputs to binary class probabilities. The function ensures that Logistic Regression outputs remain interpretable as probabilities rather than raw scores. Additionally, because the sigmoid is differentiable, it supports gradient-based optimization such as gradient descent, which helps in finding the best model parameters. Without it, Logistic Regression would lose its classification capability and behave like Linear Regression.
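
The saturation and differentiability properties mentioned above can be checked with a short sketch. The two-branch form of `sigmoid` is one common way to avoid overflow for very negative inputs, and `sigmoid_grad` is a helper name introduced here for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)              # avoids overflow of exp(-z) when z is very negative
    return ez / (1.0 + ez)

def sigmoid_grad(z):
    """Derivative of the sigmoid, which equals s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The output saturates towards 0 and 1, and the gradient peaks at z = 0
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  gradient={sigmoid_grad(z):.5f}")
```

The gradient is largest at z = 0 and shrinks towards zero in the saturated tails, which is why confidently classified points contribute little to each gradient descent update.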



**Question 3: What is Regularization in Logistic Regression and why is it needed?**
>Regularization in Logistic Regression is a technique used to prevent the model from overfitting the training data. Overfitting happens when the model learns noise or irrelevant patterns, which reduces its performance on unseen data. Regularization works by adding a penalty term to the cost function, discouraging the model from assigning large-magnitude weights to features. Two common types are L1 (Lasso) and L2 (Ridge) regularization. L1 penalizes the sum of absolute weights and helps in feature selection by shrinking some coefficients exactly to zero, while L2 penalizes the sum of squared weights, reducing the magnitude of coefficients but keeping them all. In scikit-learn the penalty strength is set through C, which is the inverse of regularization strength, so a smaller C means a stronger penalty. By controlling complexity, regularization improves generalization and helps the model perform better on new datasets.
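
To make the penalty term concrete, here is a rough sketch of a penalized cost function. The toy data, the weight vectors, and the strength `lam` are all assumptions made for illustration. Note that `lam` multiplies the penalty directly, whereas scikit-learn's `C` acts inversely.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, bias, X, y):
    """Average binary cross-entropy over the dataset."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias)
        total += -(yi * math.log(p) + (1 - yi) * math.log(1.0 - p))
    return total / len(y)

def penalized_loss(weights, bias, X, y, lam=0.1, penalty="l2"):
    """Log loss plus an L1 or L2 penalty on the weights (the bias is not penalized)."""
    if penalty == "l1":
        reg = sum(abs(w) for w in weights)
    elif penalty == "l2":
        reg = sum(w * w for w in weights)
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return log_loss(weights, bias, X, y) + lam * reg

# Toy, linearly separable data (hypothetical values)
X = [[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [-0.2, -1.2]]
y = [1, 1, 0, 0]

small_w = [1.0, 0.5]
large_w = [10.0, 5.0]   # same direction, ten times the magnitude

for name, w in (("small", small_w), ("large", large_w)):
    print(name,
          "plain:", round(log_loss(w, 0.0, X, y), 4),
          "L2:", round(penalized_loss(w, 0.0, X, y, penalty="l2"), 4),
          "L1:", round(penalized_loss(w, 0.0, X, y, penalty="l1"), 4))
```

On separable data the unpenalized loss keeps falling as the weights grow, so it favors `large_w`. Once the penalty is added, `small_w` has the lower cost, which is the overfitting control described above.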



**Question 4: What are some common evaluation metrics for classification models, and why are they important?**
>Common evaluation metrics for classification models include accuracy, precision, recall, F1-score, and ROC-AUC. Accuracy measures the overall proportion of correct predictions, but it may be misleading with imbalanced data. Precision tells how many predicted positives are actually correct, while recall shows how many actual positives are correctly identified. The F1-score is the harmonic mean of precision and recall, balancing both aspects. ROC-AUC evaluates how well the model separates classes across different thresholds. These metrics are important because they provide deeper insights into model performance beyond simple accuracy, helping to choose the right model for real-world applications where class distribution may vary.
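
The sketch below computes accuracy, precision, recall, and F1 from raw predictions on a made-up imbalanced sample (5 positives out of 100), to show how accuracy alone can mislead. The label lists and the `classification_metrics` helper are illustrative assumptions. ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # guard against no predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # guard against no actual positives
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up imbalanced sample: 5 positives followed by 95 negatives
y_true = [1] * 5 + [0] * 95

# A trivial "model" that always predicts the majority class
always_negative = [0] * 100

# A model that finds 4 of the 5 positives at the cost of 6 false positives
some_model = [1] * 4 + [0] * 1 + [1] * 6 + [0] * 89

baseline = classification_metrics(y_true, always_negative)
result = classification_metrics(y_true, some_model)
print("always negative:", baseline)
print("some model     :", result)
```

The always-negative baseline has the higher accuracy yet zero recall, while the second model gives up a little accuracy to recover most of the positives.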







In [1]:
# Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
# splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
# (Use Dataset from sklearn package)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# Load dataset from sklearn into a DataFrame (used in place of a CSV file, as the question allows)
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Train/Test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=10000)  # increased iterations for convergence
model.fit(X_train, y_train)

# Predictions and Accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy of Logistic Regression model:", accuracy)


Accuracy of Logistic Regression model: 0.956140350877193


In [2]:
# Question 6: Write a Python program to train a Logistic Regression model using L2
# regularization (Ridge) and print the model coefficients and accuracy.
# (Use Dataset from sklearn package)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# Load dataset from sklearn
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression with L2 Regularization (default penalty='l2')
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=10000)
model.fit(X_train, y_train)

# Predictions and Accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print coefficients and accuracy
print("Model Coefficients:\n", model.coef_)
print("\nModel Intercept:\n", model.intercept_)
print("\nAccuracy of Logistic Regression with L2 regularization:", accuracy)



Model Coefficients:
 [[ 1.0274368   0.22145051 -0.36213488  0.0254667  -0.15623532 -0.23771256
  -0.53255786 -0.28369224 -0.22668189 -0.03649446 -0.09710208  1.3705667
  -0.18140942 -0.08719575 -0.02245523  0.04736092 -0.04294784 -0.03240188
  -0.03473732  0.01160522  0.11165329 -0.50887722 -0.01555395 -0.016857
  -0.30773117 -0.77270908 -1.42859535 -0.51092923 -0.74689363 -0.10094404]]

Model Intercept:
 [28.64871395]

Accuracy of Logistic Regression with L2 regularization: 0.956140350877193


In [3]:
# Question 7: Write a Python program to train a Logistic Regression model for multiclass
# classification using multi_class='ovr' and print the classification report.
# (Use Dataset from sklearn package)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

# Load dataset (Iris dataset - multiclass)
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

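# Logistic Regression for multiclass (One-vs-Rest)
# Note: multi_class is deprecated since scikit-learn 1.5; on newer versions,
# wrap the estimator in sklearn.multiclass.OneVsRestClassifier instead.
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=10000)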
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Print classification report
print("Classification Report for Logistic Regression (OvR):\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report for Logistic Regression (OvR):

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30





In [4]:
# Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
# hyperparameters for Logistic Regression and print the best parameters and validation
# accuracy.
# (Use Dataset from sklearn package)

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load dataset (Breast Cancer dataset)
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression model
log_reg = LogisticRegression(max_iter=10000, solver='liblinear')  # solver compatible with l1 and l2

# Hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],        # Inverse of regularization strength (smaller C = stronger penalty)
    'penalty': ['l1', 'l2']              # Regularization type
}

# GridSearchCV
grid = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Best parameters and accuracy
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)

# Evaluate on test set
y_pred = grid.best_estimator_.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy with Best Parameters:", test_accuracy)


Best Parameters: {'C': 100, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9670329670329672
Test Accuracy with Best Parameters: 0.9824561403508771


In [5]:
# Question 9: Write a Python program to standardize the features before training Logistic
# Regression and compare the model's accuracy with and without scaling.
# (Use Dataset from sklearn package)


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression without scaling
model_no_scaling = LogisticRegression(max_iter=10000, solver='lbfgs')
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with scaling
model_scaling = LogisticRegression(max_iter=10000, solver='lbfgs')
model_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = model_scaling.predict(X_test_scaled)
accuracy_scaling = accuracy_score(y_test, y_pred_scaling)

# Print results
print("Accuracy without Scaling:", accuracy_no_scaling)
print("Accuracy with Scaling:", accuracy_scaling)


Accuracy without Scaling: 0.956140350877193
Accuracy with Scaling: 0.9736842105263158


**Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**

>To build a Logistic Regression model for predicting customer response on an imbalanced dataset (5% response rate), start with data preprocessing: handle missing values, encode categorical variables, and standardize numerical features with StandardScaler so that differing feature magnitudes do not bias the regularized coefficients. Fit these preprocessing steps on the training split only, then apply them to the test split. Because the dataset is highly imbalanced, use a stratified train/test split and address the imbalance with SMOTE (Synthetic Minority Oversampling Technique), undersampling, or class weighting (class_weight='balanced' in Logistic Regression). Any resampling should be applied to the training data only, so the test set keeps the real 5% response rate. Feature selection can then remove irrelevant predictors, improving model efficiency and interpretability. For hyperparameter tuning, use GridSearchCV or RandomizedSearchCV to optimize C (the inverse of regularization strength), the penalty type (L1/L2), and the solver. For evaluation, do not rely on accuracy alone, which is misleading here because predicting "no response" for every customer already scores 95%; also use precision, recall, F1-score, ROC-AUC, and Precision-Recall curves to measure business effectiveness. Finally, validate the chosen model on unseen test data and monitor it continuously, since in business cases recall (catching most responders) may be more valuable than accuracy.
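
As a rough illustration of the class-weighting option, the sketch below reproduces the heuristic scikit-learn documents for class_weight='balanced', namely n_samples / (n_classes * count of each class), on a made-up label vector with a 5% response rate. The helper name `balanced_class_weights` and the sample sizes are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Made-up campaign labels: 50 responders (1) and 950 non-responders (0)
y = [1] * 50 + [0] * 950

weights = balanced_class_weights(y)
print("class weights:", weights)

# With these weights, both classes contribute the same total weight to the loss
total_pos = weights[1] * 50
total_neg = weights[0] * 950
print("total weight per class:", total_pos, total_neg)
```

Each responder ends up counting roughly 19 times as much as a non-responder in the loss, which pushes the model towards higher recall on the minority class without creating synthetic rows.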