Question 1: What is Logistic Regression, and how does it differ from Linear Regression?

Answer:
Logistic Regression is a regression-style model used when the output variable is categorical (for example yes/no, 0/1, true/false). Instead of predicting continuous values as linear regression does, it predicts the probability of a class. That probability is then mapped to 0 or 1 using a threshold (commonly 0.5).

Linear regression fits a straight line and is used for continuous targets such as salary or house price. Logistic regression, on the other hand, passes the linear combination of features through a sigmoid curve and is used for classification problems.
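To make the contrast concrete, here is a minimal sketch on made-up data (the "hours studied" feature and pass/fail labels are hypothetical, not from the assignment). It assumes scikit-learn is available and simply compares what each model returns for the same inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: hours studied -> passed the exam (0 or 1)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

# Inputs inside and well outside the training range
X_new = np.array([[0.0], [6.0], [12.0]])

# Linear regression returns unbounded numbers
lin_pred = lin.predict(X_new)

# Logistic regression returns a probability for class 1, always within [0, 1]
log_prob = log.predict_proba(X_new)[:, 1]

# Apply the usual 0.5 threshold to turn probabilities into class labels
log_label = (log_prob >= 0.5).astype(int)

print("Linear regression output:", lin_pred)
print("Logistic probability    :", log_prob)
print("Thresholded class label :", log_label)
```

On data like this the linear model's output can fall below 0 or above 1 for extreme inputs, which is why it is a poor fit for probabilities.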

Question 2: Explain the role of the Sigmoid function in Logistic Regression.

Answer:
The sigmoid function takes any real number as input and squashes it into the range 0 to 1. This suits logistic regression because we want to predict the probability of a class.
Formula: sigmoid(z) = 1 / (1 + e^(-z)), where z is the linear combination of the input features.

If the output is close to 0 → the model predicts class 0.
If the output is close to 1 → the model predicts class 1.
In short, the sigmoid converts the linear combination of features into a probability.
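The formula can be sketched in a few lines of plain Python. The weights, bias, and feature values below are hypothetical numbers chosen only to illustrate the calculation; the sign-based branch is one common way to avoid overflow for large negative inputs, not the only one.

```python
import math

def sigmoid(z: float) -> float:
    """Return 1 / (1 + e^(-z)), a value between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical model parameters and one input sample
weights = [0.8, -0.4]
bias = -0.5
features = [2.0, 1.0]

# Linear combination: z = w1*x1 + w2*x2 + b
z = sum(w * x for w, x in zip(weights, features)) + bias

# Probability of class 1, then the class label at a 0.5 threshold
p = sigmoid(z)
label = 1 if p >= 0.5 else 0

print(f"z = {z:.2f}, probability = {p:.4f}, predicted class = {label}")
```

Note that sigmoid(0) is exactly 0.5, so thresholding the probability at 0.5 is the same as checking whether z is positive.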

Question 3: What is Regularization in Logistic Regression and why is it needed?

Answer:
Regularization is a technique for reducing overfitting. In logistic regression, the coefficients (weights) can sometimes become very large, which makes the model too complex and causes it to generalize poorly.

There are mainly two types:

L1 Regularization (Lasso): It can shrink some coefficients to 0.

L2 Regularization (Ridge): It penalizes large weights but does not make them exactly zero.

We need regularization so that the model does not memorize the training data and then perform badly on the test data.
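The difference between the two penalties can be seen by counting coefficients that are exactly zero. This is a rough sketch: the breast cancer dataset, the value C=0.1, and the liblinear solver are choices made here for illustration, and the exact counts depend on them.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale features so the penalty treats all coefficients comparably
X_scaled = StandardScaler().fit_transform(X)

# liblinear supports both penalties; a smaller C means stronger regularization
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X_scaled, y)
l2_model = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X_scaled, y)

# Count coefficients that were driven exactly to zero
l1_zeros = int(np.sum(l1_model.coef_ == 0))
l2_zeros = int(np.sum(l2_model.coef_ == 0))

print("Zero coefficients with L1:", l1_zeros, "of", l1_model.coef_.size)
print("Zero coefficients with L2:", l2_zeros, "of", l2_model.coef_.size)
```

Because L1 removes features by zeroing their weights, it can double as a simple form of feature selection.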

Question 4: What are some common evaluation metrics for classification models, and why are they important?

Answer:
Some metrics are:

Accuracy: Percent of correctly predicted samples.

Precision: Of the samples predicted positive, how many are actually positive.

Recall (Sensitivity): Of the actual positives, how many the model predicted correctly.

F1-score: Harmonic mean of precision and recall.

ROC-AUC: Measures how well the model separates the classes across all classification thresholds.

These metrics are important because accuracy alone is not enough, especially when the data is imbalanced. Example: if 95% of samples are negative and only 5% are positive, a model can reach 95% accuracy by always predicting negative, yet it is useless because it never finds a positive.
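The 95%/5% example can be worked through by hand. The sketch below builds 100 made-up labels and an always-negative "model", then computes the metrics from their definitions; treating an undefined precision (no predicted positives) as 0 is a convention assumed here.

```python
# 100 hypothetical samples: 95 negative (0) and 5 positive (1)
y_true = [0] * 95 + [1] * 5

# A useless baseline that always predicts the negative class
y_pred = [0] * 100

# Confusion-matrix counts
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Metrics from their definitions (0.0 is used when a ratio is undefined)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print("Accuracy :", accuracy)   # 0.95
print("Precision:", precision)  # 0.0
print("Recall   :", recall)     # 0.0
print("F1-score :", f1)         # 0.0
```

Accuracy looks strong while recall and F1 are zero, which is exactly the failure that the other metrics expose. In practice the same numbers are available from functions in sklearn.metrics.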

In [1]:
# Question 5: Python Program (Basic Logistic Regression)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

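# Train logistic regression on the unscaled features
# (the solver can hit max_iter here; see the warning in the output)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the test set and report accuracy
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))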

Accuracy: 0.956140350877193


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [2]:
# Question 6: Logistic Regression with L2 Regularization
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# L2 is scikit-learn's default penalty; it is set explicitly here for clarity
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

print("Coefficients:", model.coef_)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))


Coefficients: [[ 2.09981182  0.13248576 -0.10346836 -0.00255646 -0.17024348 -0.37984365
  -0.69120719 -0.4081069  -0.23506963 -0.02356426 -0.0854046   1.12246945
  -0.32575716 -0.06519356 -0.02371113  0.05960156  0.00452206 -0.04277587
  -0.04148042  0.01425051  0.96630267 -0.37712622 -0.05858253 -0.02395975
  -0.31765956 -1.00443507 -1.57134711 -0.69351401 -0.84095566 -0.09308282]]
Accuracy: 0.956140350877193


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [3]:
# Question 7: Logistic Regression Multi-Class (OvR)
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

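# One-vs-rest: one binary classifier is trained per class.
# Note: the multi_class argument is deprecated in recent scikit-learn releases;
# wrapping the model in sklearn.multiclass.OneVsRestClassifier is the suggested replacement.
model = LogisticRegression(multi_class='ovr', max_iter=1000)
model.fit(X_train, y_train)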

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.89      0.94         9
           2       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30





In [4]:
# Question 8: GridSearchCV for Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

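# Search over C (the inverse of regularization strength) and the penalty type;
# liblinear is used because it supports both 'l1' and 'l2'
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

# 5-fold cross-validation on the training split only
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)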

print("Best Params:", grid.best_params_)
print("Best Score:", grid.best_score_)


Best Params: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
Best Score: 0.9626373626373628


In [5]:
# Question 9: Standardization and Comparison
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling
model1 = LogisticRegression(max_iter=1000)
model1.fit(X_train, y_train)
print("Accuracy without scaling:", accuracy_score(y_test, model1.predict(X_test)))

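# With scaling: fit the scaler on the training data only, then apply the
# same transformation to the test data so no test information leaks in
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)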

model2 = LogisticRegression(max_iter=1000)
model2.fit(X_train_scaled, y_train)
print("Accuracy with scaling:", accuracy_score(y_test, model2.predict(X_test_scaled)))

Accuracy without scaling: 0.956140350877193
Accuracy with scaling: 0.9736842105263158


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Question 10: Real-world Business Case (Imbalanced Data)

Answer:
If I were working at an e-commerce company with a 5% response rate, I would treat the class imbalance carefully.

The steps I would follow:

Data Handling: Handle missing values and encode categorical variables.

Feature Scaling: Since regularized logistic regression is sensitive to feature scale, I would use StandardScaler.

Class Balancing: Because only 5% of customers are positive, I would use oversampling (SMOTE), undersampling, or class_weight='balanced' in logistic regression.

Hyperparameter Tuning: Use GridSearchCV to find the best penalty and C.

Evaluation Metrics: Accuracy is not enough, so I would check Precision, Recall, F1, and ROC-AUC. Recall matters most here because we don’t want to miss customers who may respond.

Finally, I would deploy the model and monitor it regularly, since customer behaviour changes over time.
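The steps above can be sketched as one pipeline. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification at roughly 5% positives as a stand-in for real campaign data), class_weight='balanced' is used instead of SMOTE to avoid an extra dependency, and recall is chosen as the tuning metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer data: about 5% of samples are responders
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Stratify so both splits keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", solver="liblinear", max_iter=1000)),
])

# Tune penalty type and C, selecting the combination with the best recall
param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
grid.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set
y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]
test_recall = recall_score(y_test, y_pred)
test_auc = roc_auc_score(y_test, y_prob)

print("Best params:", grid.best_params_)
print(classification_report(y_test, y_pred))
print("Test recall:", test_recall)
print("Test ROC-AUC:", test_auc)
```

Optimizing for recall usually costs some precision, so the final threshold and metric should be agreed with the business based on the cost of contacting a non-responder versus missing a responder.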