Question 1: What is Logistic Regression, and how does it differ from Linear Regression?


Answer:
Logistic Regression is a statistical model for classification problems, most commonly binary ones where the output is 0 or 1, such as predicting whether an email is spam.
It estimates the probability of the positive class by passing a linear combination of the features through the Sigmoid function.

- Linear Regression predicts continuous values (e.g., house price).

- Logistic Regression predicts categorical outcomes (e.g., yes/no).

- Linear Regression fits a straight line (more generally, a hyperplane) to the target, while Logistic Regression passes that linear score through an S-shaped curve (sigmoid) so the output lies between 0 and 1.
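The difference in output range can be seen on a small example. The sketch below uses made-up one-dimensional data and is only illustrative: it fits both models to the same 0/1 labels and queries them far outside the training range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy 1-D data (values are made up): the label flips from 0 to 1 as x grows
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Query points well outside the training range
X_new = np.array([[-5.0], [20.0]])

linear_out = linear.predict(X_new)                  # unbounded real numbers
logistic_out = logistic.predict_proba(X_new)[:, 1]  # probabilities in [0, 1]

print("Linear regression output:", linear_out)
print("Logistic regression P(y=1):", logistic_out)
```

The linear model returns values below 0 and above 1 for these inputs, which cannot be read as probabilities, while the logistic model stays inside the unit interval.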

Question 2: Explain the role of the Sigmoid function in Logistic Regression.

Answer:
The Sigmoid function, σ(z) = 1 / (1 + e^(-z)), converts the linear output z of the model into a value between 0 and 1.
This lets the prediction be read as the probability of the positive class.

With the default threshold of 0.5, the model predicts 1 (positive class) if the probability is greater than 0.5; otherwise it predicts 0 (negative class).
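As a minimal sketch of this mapping, the snippet below implements the sigmoid with NumPy and applies the 0.5 decision rule. The scores in `z` are hypothetical values chosen for illustration, not output from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores z = w.x + b for five samples
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(z)

# Decision rule: predict class 1 when the probability exceeds 0.5
predictions = (probabilities > 0.5).astype(int)

print(np.round(probabilities, 3))
print(predictions)
```

A score of 0 maps to exactly 0.5, so under a strict "greater than" rule it falls on the negative side.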

Question 3: What is Regularization in Logistic Regression and why is it needed?

Answer:
Regularization is a technique used to reduce overfitting by adding a penalty term to the loss function.
It discourages large weights, so the model does not rely too heavily on any single feature.

Types:

- L1 Regularization (Lasso): Adds the sum of the absolute values of the weights as the penalty. It can drive some weights to exactly zero, which acts as feature selection.

- L2 Regularization (Ridge): Adds the sum of the squared weights as the penalty. It keeps all features but shrinks their weights toward zero.

Regularization improves the generalization and stability of the model.
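The contrast between the two penalties can be checked by counting zero coefficients. The sketch below is one possible setup, assuming the breast cancer dataset, standardized features, and an arbitrarily chosen C=0.1; in scikit-learn, C is the inverse of the regularization strength.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# C is the inverse of regularization strength: smaller C = stronger penalty
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
l2_model = LogisticRegression(penalty='l2', solver='liblinear', C=0.1)
l1_model.fit(X_scaled, y)
l2_model.fit(X_scaled, y)

# Count coefficients that were driven to exactly zero
l1_zeros = int(np.sum(l1_model.coef_ == 0))
l2_zeros = int(np.sum(l2_model.coef_ == 0))
print("Zero coefficients with L1:", l1_zeros)
print("Zero coefficients with L2:", l2_zeros)
```

The L1 model is expected to zero out a sizable share of the 30 features, while the L2 model keeps all of them with smaller weights.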

Question 4: What are some common evaluation metrics for classification models, and why are they important?

Answer:
Evaluation metrics help measure how well a classification model performs. Common metrics include:

- Accuracy: Ratio of correctly predicted observations to total observations. Simple, but can be misleading if the data is imbalanced.

- Precision: Ratio of correctly predicted positive observations to total predicted positives. Useful when false positives are costly.

- Recall (Sensitivity): Ratio of correctly predicted positives to all actual positives. Useful when missing positive cases is costly.

- F1-Score: Harmonic mean of precision and recall. Balances both precision and recall.

- Confusion Matrix: A table showing counts of true positives, true negatives, false positives, and false negatives. Helps visualize model performance.

- ROC-AUC Score: Measures how well the model separates the classes. A higher value (closer to 1) means better performance.

These metrics are important because they provide different perspectives on model performance and help you choose the best model for your specific problem.
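The metrics above can be computed with sklearn.metrics. The following sketch uses ten made-up labels and scores (3 true positives, 1 false negative, 1 false positive, 5 true negatives) so that each value can be verified by hand.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

# Made-up ground truth, hard predictions, and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

accuracy = accuracy_score(y_true, y_pred)    # (3 + 5) / 10
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1)
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)        # rows: actual, columns: predicted
auc = roc_auc_score(y_true, y_score)         # uses scores, not hard labels

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
print("Confusion matrix:\n", cm)
print("ROC-AUC:", auc)
```

Note that scikit-learn lays out the binary confusion matrix as [[TN, FP], [FN, TP]], and that ROC-AUC is computed from the predicted scores rather than the thresholded labels.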

Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)

Answer:

                                   

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.956140350877193


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
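The program above builds the DataFrame directly from the sklearn dataset. If an actual CSV file is wanted, one option is to write the dataset out and read it back with pd.read_csv, as in the sketch below. The temporary directory and the file name breast_cancer.csv are arbitrary choices for this illustration, and the larger max_iter is an attempt to reduce the convergence warning, not a guarantee that it disappears.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the sklearn dataset to a temporary CSV (file name is arbitrary)
frame = load_breast_cancer(as_frame=True).frame
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
frame.to_csv(csv_path, index=False)

# Load the CSV file into a Pandas DataFrame
df = pd.read_csv(csv_path)
X = df.drop(columns="target")
y = df["target"]

# Split, train, and evaluate as before
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy:", accuracy)
```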


Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy. (Use Dataset from sklearn package)

Answer:

In [2]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model with L2 Regularization (default)
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print coefficients and accuracy
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)
print("Accuracy:", accuracy)



Model Coefficients: [[ 2.09981182  0.13248576 -0.10346836 -0.00255646 -0.17024348 -0.37984365
  -0.69120719 -0.4081069  -0.23506963 -0.02356426 -0.0854046   1.12246945
  -0.32575716 -0.06519356 -0.02371113  0.05960156  0.00452206 -0.04277587
  -0.04148042  0.01425051  0.96630267 -0.37712622 -0.05858253 -0.02395975
  -0.31765956 -1.00443507 -1.57134711 -0.69351401 -0.84095566 -0.09308282]]
Model Intercept: [2.13128402]
Accuracy: 0.956140350877193


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package)

Answer:

In [3]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report

# Load multiclass dataset (Iris dataset)
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model using One-vs-Rest (OvR)
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
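In recent scikit-learn releases (1.5 and later, as far as I am aware) the multi_class argument is deprecated, so the code above may emit a warning or fail on newer versions. A sketch of the same One-vs-Rest setup using the OneVsRestClassifier wrapper, which does not rely on that argument, is shown below; results may differ slightly from the report above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# One binary logistic regression per class, wrapped explicitly
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
ovr_accuracy = accuracy_score(y_test, y_pred)

print("Number of binary classifiers:", len(ovr_model.estimators_))
print(classification_report(y_test, y_pred, target_names=data.target_names))
```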





Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy. (Use Dataset from sklearn package)

Answer:

In [4]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define model
model = LogisticRegression(max_iter=1000)

# Define hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # 'liblinear' supports both l1 and l2
}

# Apply GridSearchCV
grid = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Get best model and evaluate
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid.best_params_)
print("Validation Accuracy:", grid.best_score_)
print("Test Accuracy:", accuracy)


Best Parameters: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
Validation Accuracy: 0.9626373626373628
Test Accuracy: 0.956140350877193


Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling. (Use Dataset from sklearn package)

Answer:

In [5]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression without scaling
model1 = LogisticRegression(max_iter=1000)
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)
acc_without_scaling = accuracy_score(y_test, y_pred1)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with scaling
model2 = LogisticRegression(max_iter=1000)
model2.fit(X_train_scaled, y_train)
y_pred2 = model2.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred2)

# Compare results
print("Accuracy without Scaling:", acc_without_scaling)
print("Accuracy with Scaling:", acc_with_scaling)


Accuracy without Scaling: 0.956140350877193
Accuracy with Scaling: 0.9736842105263158


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
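When scaling is combined with cross-validation, a common pattern is to put the scaler and the model in one pipeline so that the scaler is re-fit on the training part of each fold. The sketch below shows this with make_pipeline and cross_val_score; the choice of 5 folds is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is re-fit on the training part of every fold,
# so no statistics leak from the held-out part
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(scaled_model, X, y, cv=5, scoring='accuracy')

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Because the features are standardized inside the pipeline, the solver typically converges well within max_iter=1000 on this dataset.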


Question 10: Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.

Answer:

1. Data Understanding and Cleaning

- Load the data and explore it using summary statistics (df.describe(), df.info()).

- Handle missing values:

    - Impute numerical features using the mean or median.

    - Impute categorical features using the mode, or create a separate category such as “Unknown”.

- Remove duplicates and irrelevant features (like user IDs or timestamps not adding value).
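The cleaning steps above might look like the following in pandas. The table, its column names (customer_id, age, region, responded), and its values are all invented for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table; column names and values are invented
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34.0, np.nan, np.nan, 51.0, 29.0],
    "region": ["north", "south", "south", None, "north"],
    "responded": [0, 1, 1, 0, 0],
})

df = df.drop_duplicates()                         # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # numeric: median imputation
df["region"] = df["region"].fillna("Unknown")     # categorical: explicit category
df = df.drop(columns="customer_id")               # identifier carries no signal

print(df)
```

In a real project the median would be computed on the training split only and then reused for the test split.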



2. Feature Engineering

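- Convert categorical variables using One-Hot Encoding for nominal features, or an ordinal (label) encoding for features with a natural order. Label Encoding a nominal feature imposes an arbitrary ordering that a linear model will treat as meaningful.

- Create new meaningful features, such as customer purchase frequency, total spending, and time since last purchase.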

- Drop highly correlated or redundant features to avoid multicollinearity.
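A small sketch of these feature engineering steps is given below. The columns channel, n_orders, and total_spend are hypothetical, and avg_order_value is just one example of a derived feature.

```python
import pandas as pd

# Hypothetical order history; names and values are invented
customers = pd.DataFrame({
    "channel": ["email", "sms", "email", "app"],
    "n_orders": [4, 1, 10, 2],
    "total_spend": [200.0, 30.0, 900.0, 50.0],
})

# Derived feature: average order value
customers["avg_order_value"] = customers["total_spend"] / customers["n_orders"]

# One-hot encode the nominal column; drop_first removes one redundant
# dummy column, which avoids perfect collinearity among the dummies
encoded = pd.get_dummies(customers, columns=["channel"], drop_first=True)

print(encoded.columns.tolist())
```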

3. Feature Scaling

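- Use StandardScaler (standardization) or MinMaxScaler (min-max normalization) to put numerical features on a comparable scale. Fit the scaler on the training data only and reuse it to transform the test data.

- Scaling helps the Logistic Regression solver converge faster and makes the regularization penalty treat all features evenly.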

4. Handling Class Imbalance (Only 5% respond)

- Since the dataset is highly imbalanced, accuracy alone is misleading: a model that always predicts “no response” would already reach 95% accuracy.
Use one or a combination of these techniques:

- Resampling techniques:

    - Oversample the minority class using SMOTE (Synthetic Minority Oversampling Technique). Apply it to the training data only, never to the test set.

    - Undersample the majority class to balance proportions.

- Class weight adjustment:

    - Set class_weight='balanced' in Logistic Regression to weight each class inversely to its frequency.

    - This penalizes misclassification of the minority class more heavily.
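To see what class_weight='balanced' does numerically, the weights can be computed directly with scikit-learn's compute_class_weight helper. The sketch below assumes 1000 hypothetical customers with a 5% response rate.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 1000 hypothetical customers, 5% responders
y = np.array([0] * 950 + [1] * 50)

# 'balanced' uses n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)

print(dict(zip([0, 1], weights)))
```

Here the responder class gets weight 1000 / (2 * 50) = 10 and the non-responder class gets 1000 / (2 * 950), about 0.53, so each misclassified responder counts roughly 19 times as much in the loss.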

5. Model Training

- Train a Logistic Regression model:

    model = LogisticRegression(class_weight='balanced', solver='liblinear', max_iter=1000)


- Use cross-validation (e.g., StratifiedKFold) to ensure fair representation of both classes in each fold.

6. Hyperparameter Tuning

- Use GridSearchCV to tune parameters like:

    - C (inverse of regularization strength; smaller values mean stronger regularization)

    - penalty (L1 or L2)

    - solver (liblinear, saga)

- This helps find the best balance between bias and variance.
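Steps 3 to 6 can be combined into a single search. The sketch below is one possible arrangement, not a definitive recipe: it uses synthetic data from make_classification as a stand-in for the campaign dataset (about 5% positives), and the step names "scale" and "clf", the grid values, and the roc_auc scoring choice are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: about 5% positives
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling lives inside the pipeline so it is re-fit within each CV fold
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight='balanced',
                               solver='liblinear', max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step>__<parameter>
param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='roc_auc')
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated ROC-AUC:", search.best_score_)
```

Scoring on ROC-AUC rather than accuracy keeps the search from rewarding a model that simply predicts the majority class.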

7. Model Evaluation

- Because the data is imbalanced, focus on metrics beyond accuracy:

    - Precision: What fraction of predicted responders actually respond.

    - Recall: What fraction of actual responders the model identifies.

    - F1-Score: Balances precision and recall.

    - ROC-AUC Score: Measures ability to distinguish between responders and non-responders.

    - Confusion Matrix: Gives insight into true positives and false negatives.

- For business cases like marketing response prediction, Recall and AUC are usually more informative than Accuracy, on the assumption that missing a potential responder costs more than contacting a non-responder.
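An evaluation along these lines might look like the sketch below. It again uses synthetic data as a stand-in for the campaign dataset, and the alternative threshold of 0.3 is an arbitrary value chosen to show how lowering the threshold affects recall.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for campaign data: about 5% positives
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

# Probabilities of the positive class, then a hard decision at 0.5
proba = model.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.5).astype(int)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["no response", "response"]))
auc = roc_auc_score(y_test, proba)
print("ROC-AUC:", auc)

# Lowering the threshold can only keep or raise recall, at the cost of precision
y_pred_low = (proba >= 0.3).astype(int)
recall_default = recall_score(y_test, y_pred)
recall_low = recall_score(y_test, y_pred_low)
print("Recall at 0.5:", recall_default, "| Recall at 0.3:", recall_low)
```

The threshold is a business decision: it should reflect the relative cost of contacting a non-responder versus missing a responder.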

8. Model Interpretation and Business Application

- Examine model coefficients to understand which features most influence campaign response. Coefficient magnitudes are comparable across features only when the features were scaled.

- Share insights with the marketing team to target high-probability customers.

- Continuously retrain and monitor the model as customer behavior changes.
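As a rough sketch of coefficient inspection, the snippet below ranks features by absolute coefficient and converts them to odds ratios with exp(coefficient). It uses the breast cancer dataset from the earlier answers as a stand-in, since no campaign data is available here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Standardize so that coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(data.data)
model = LogisticRegression(max_iter=1000).fit(X_scaled, data.target)

coefficients = pd.Series(model.coef_[0], index=data.feature_names)

# exp(coefficient) = multiplicative change in the odds of class 1
# for a one standard deviation increase in the feature
odds_ratios = np.exp(coefficients)

# Five features with the largest absolute coefficients
top = coefficients.abs().sort_values(ascending=False).head(5)
print(odds_ratios[top.index])
```

An odds ratio above 1 means the feature raises the odds of the positive class, and a value below 1 means it lowers them.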
