# **Logistic Regression | Assignment**

Question 1: What is Logistic Regression, and how does it differ from Linear Regression?

Answer:

Logistic Regression is a statistical and machine learning technique used for binary classification problems, where the output variable is categorical (e.g., 0 or 1, yes or no, true or false). It predicts the probability of an instance belonging to a particular class using the logistic (sigmoid) function.

Differences from Linear Regression


| Feature               | Logistic Regression                  | Linear Regression                      |
| --------------------- | ------------------------------------ | -------------------------------------- |
| Output                | Probability between 0 and 1          | Continuous value                       |
| Function used         | Sigmoid (logistic) function          | Linear function                        |
| Use case              | Classification                       | Regression                             |
| Output Interpretation | Class membership (e.g., spam or not) | Numeric prediction (e.g., house price) |
| Loss Function         | Log loss (cross-entropy)             | Mean Squared Error (MSE)               |
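The contrast in the table can be seen on a tiny example. The following is a minimal sketch, assuming a made-up dataset (hours studied versus pass/fail) that is not part of the assignment: a linear model returns unbounded values, while the logistic model returns probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: hours studied vs. fail (0) / pass (1)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Query points far outside and inside the training range
X_new = np.array([[-2.0], [2.25], [8.0]])
linear_out = linear.predict(X_new)                  # unbounded real values
logistic_out = logistic.predict_proba(X_new)[:, 1]  # probabilities of class 1

print("Linear outputs:  ", linear_out)
print("Logistic outputs:", logistic_out)
```

The linear predictions fall below 0 and above 1 at the extreme inputs, so they cannot be read as probabilities, whereas the logistic outputs always stay within [0, 1].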


Question 2: Explain the role of the Sigmoid function in Logistic Regression.

Answer:

The sigmoid function is central to logistic regression. It maps any real-valued number to a value between 0 and 1, which is interpreted as a probability.

Formula:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where:

$z = w \cdot x + b$ (linear combination of weights and features)

Role:

Converts linear output to probability.

Supports classification through a threshold (commonly 0.5): probabilities at or above the threshold are assigned to class 1, the rest to class 0.
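As an illustration of both roles, here is a small NumPy sketch. The weights, bias, and feature values are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued input(s) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and feature vector, for illustration only
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b                       # 1.6 - 0.4 + 0.1 = 1.3
probability = sigmoid(z)                   # roughly 0.786
predicted_class = int(probability >= 0.5)  # threshold at 0.5

print(f"z = {z:.2f}, probability = {probability:.3f}, class = {predicted_class}")
```

Note that `sigmoid(0)` is exactly 0.5, so thresholding the probability at 0.5 is equivalent to checking the sign of `z`.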




Question 3: What is Regularization in Logistic Regression and why is it needed?

Answer:

Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. It discourages the model from fitting the noise in the training data by shrinking large coefficients.

Types in Logistic Regression:

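L1 Regularization (Lasso): Adds the sum of the absolute values of the coefficients to the loss, which can drive some coefficients exactly to zero.

L2 Regularization (Ridge): Adds the sum of the squared coefficients to the loss, which shrinks all coefficients toward zero without eliminating them.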


Why Needed:

Improves model generalization.

Helps when the data has many features relative to the number of samples.

Reduces model complexity.
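The difference between the two penalties can be sketched as follows. This is an illustrative example on synthetic data from `make_classification` (an assumption, not the assignment's data), with a deliberately strong penalty (`C=0.05`) so the effect is visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 4 of which carry signal
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)

# Same penalty strength, different penalty type (liblinear supports both)
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

l1_zeros = int(np.sum(l1_model.coef_ == 0))
l2_zeros = int(np.sum(l2_model.coef_ == 0))

print("Coefficients set exactly to zero with L1:", l1_zeros)
print("Coefficients set exactly to zero with L2:", l2_zeros)
```

L1 typically zeroes out uninformative features, acting as a form of feature selection, while L2 keeps every feature with a smaller weight.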

Question 4: What are some common evaluation metrics for classification models, and why are they important?

Answer:

Key metrics include:

| Metric                   | Definition                            | Importance                                          |
| ------------------------ | ------------------------------------- | --------------------------------------------------- |
| **Accuracy**             | (TP + TN) / Total                     | Reliable only when classes are roughly balanced     |
| **Precision**            | TP / (TP + FP)                        | Matters when false positives are costly             |
| **Recall (Sensitivity)** | TP / (TP + FN)                        | Matters when false negatives are costly             |
| **F1-Score**             | Harmonic mean of precision and recall | Balances precision and recall in a single number    |
| **ROC-AUC**              | Area under ROC curve                  | Summarizes performance across all thresholds        |
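The definitions in the table can be checked on a small hand-made example. The labels and scores below are invented for illustration; they give TP = 3, FN = 1, FP = 1, TN = 5.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical labels, hard predictions, and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

accuracy = accuracy_score(y_true, y_pred)    # (3 + 5) / 10 = 0.8
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean of 0.75 and 0.75 = 0.75
auc = roc_auc_score(y_true, y_score)         # 23 of 24 positive/negative pairs ranked correctly

print(accuracy, precision, recall, f1, round(auc, 3))
```

ROC-AUC is computed from the scores rather than the hard predictions, which is why it reflects performance across thresholds instead of at a single cut-off.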


Question 5: Python program to load CSV, split data, train logistic model, and print accuracy

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load dataset (scikit-learn's built-in breast cancer data stands in for a CSV file)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)


Accuracy: 0.956140350877193
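Since the question asks for a CSV file, a variant that actually goes through `pd.read_csv` might look like the sketch below. To stay self-contained it first writes the same dataset to a temporary CSV; the file name `breast_cancer.csv` and the `target` column name are assumptions of this example, and a real CSV would need its own path and label column.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the built-in dataset to a temporary CSV so the example needs no external file
frame = load_breast_cancer(as_frame=True).frame  # features plus a "target" column
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")  # hypothetical file name
frame.to_csv(csv_path, index=False)

# Load the CSV and separate the features from the label column
df = pd.read_csv(csv_path)
X_csv = df.drop(columns=["target"])
y_csv = df["target"]

# Split, train, and evaluate as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(X_csv, y_csv, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=10000)
clf.fit(X_tr, y_tr)

csv_accuracy = accuracy_score(y_te, clf.predict(X_te))
print("Accuracy:", csv_accuracy)
```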


Question 6: Train Logistic Regression with L2 (Ridge) Regularization

In [2]:
from sklearn.linear_model import LogisticRegression

model_l2 = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=10000)
model_l2.fit(X_train, y_train)

print("Coefficients:", model_l2.coef_)
print("Accuracy:", model_l2.score(X_test, y_test))


Coefficients: [[ 1.0274368   0.22145051 -0.36213488  0.0254667  -0.15623532 -0.23771256
  -0.53255786 -0.28369224 -0.22668189 -0.03649446 -0.09710208  1.3705667
  -0.18140942 -0.08719575 -0.02245523  0.04736092 -0.04294784 -0.03240188
  -0.03473732  0.01160522  0.11165329 -0.50887722 -0.01555395 -0.016857
  -0.30773117 -0.77270908 -1.42859535 -0.51092923 -0.74689363 -0.10094404]]
Accuracy: 0.956140350877193
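In scikit-learn the strength of the L2 penalty is controlled by `C`, the inverse of the regularization strength. The sketch below illustrates this on standardized breast cancer features; the three `C` values are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # standardize so the penalty treats features evenly

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(penalty="l2", C=C, solver="lbfgs", max_iter=10000)
    model.fit(X, y)
    # Euclidean norm of the weight vector: smaller means stronger shrinkage
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} coefficient norm = {coef_norms[C]:.3f}")
```

Smaller `C` means a stronger penalty and therefore a smaller coefficient norm; larger `C` lets the coefficients grow to fit the training data more closely.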


Question 7: Multiclass Classification with multi_class='ovr'

In [3]:
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

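# Train one-vs-rest logistic regression
# Note: multi_class is deprecated in recent scikit-learn releases;
# OneVsRestClassifier(LogisticRegression()) is the suggested replacement there.
model_multi = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)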
model_multi.fit(X_train, y_train)

# Predict and report
y_pred = model_multi.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.89      0.94         9
           2       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
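The same one-vs-rest scheme can also be expressed with the `OneVsRestClassifier` wrapper, which does not rely on the `multi_class` argument. This is a sketch of an alternative formulation, not the cell that produced the report above, so its scores may differ slightly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# One binary logistic regression is fitted per class (class k vs. all others)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_tr, y_tr)

n_binary_models = len(ovr.estimators_)
ovr_accuracy = ovr.score(X_te, y_te)

print("Binary models trained:", n_binary_models)
print("Test accuracy:", ovr_accuracy)
```

At prediction time the class whose binary model reports the highest score is chosen.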





Question 8: Apply GridSearchCV to tune C and penalty

In [4]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # liblinear supports both l1 and l2
}

# X_train and y_train are the Iris training split from Question 7
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Validation Accuracy:", grid.best_score_)


Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Validation Accuracy: 0.9583333333333334
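The score printed above is the mean cross-validated accuracy on the training folds. A possible follow-up, sketched here on a fresh Iris split so it runs on its own, is to inspect every parameter combination and then score the refitted best model on the held-out test set.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"C": [0.1, 1, 10], "penalty": ["l1", "l2"], "solver": ["liblinear"]}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
search.fit(X_tr, y_tr)

# One row per parameter combination, best-ranked first
columns = ["param_C", "param_penalty", "mean_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results)

# GridSearchCV refits the best combination on all training data by default,
# so score() evaluates that refitted model on unseen data
test_accuracy = search.score(X_te, y_te)
print("Test accuracy of best model:", test_accuracy)
```

Reporting the test score separately guards against reading the cross-validation score, which was used for model selection, as an unbiased estimate.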


Question 9: Standardize features and compare accuracy

In [5]:
from sklearn.preprocessing import StandardScaler

# Without scaling (X_train and X_test are still the Iris split from Question 7)
model_raw = LogisticRegression(max_iter=10000)
model_raw.fit(X_train, y_train)
acc_raw = model_raw.score(X_test, y_test)

# With Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=10000)
model_scaled.fit(X_train_scaled, y_train)
acc_scaled = model_scaled.score(X_test_scaled, y_test)

print("Accuracy without Scaling:", acc_raw)
print("Accuracy with Scaling:", acc_scaled)


Accuracy without Scaling: 1.0
Accuracy with Scaling: 1.0
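Both accuracies are identical here, plausibly because the Iris features are all measured in centimetres on similar ranges. The sketch below repeats the comparison on the breast cancer dataset, whose features span very different scales, and also records how many solver iterations each fit needs. The use of `make_pipeline` is a choice of this example rather than part of the original cell.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Raw features
raw = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)

# Scaler and model in one pipeline: the scaler is fitted on training data only
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000)).fit(X_tr, y_tr)

acc_raw = raw.score(X_te, y_te)
acc_scaled = scaled.score(X_te, y_te)
iters_raw = int(raw.n_iter_[0])
iters_scaled = int(scaled[-1].n_iter_[0])  # last pipeline step is the classifier

print(f"Raw:    accuracy={acc_raw:.3f}, solver iterations={iters_raw}")
print(f"Scaled: accuracy={acc_scaled:.3f}, solver iterations={iters_scaled}")
```

Even when accuracy changes little, standardized inputs usually let the solver converge in far fewer iterations and make the regularization penalty act evenly across features.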


Question 10: Predicting marketing campaign response (imbalanced data)

Answer:

To handle an imbalanced dataset (e.g., 5% response rate), here’s the step-by-step approach:


1. Data Handling:

Analyze missing values and outliers, and perform imputation/cleaning.

Encode categorical features (using one-hot or ordinal/label encoding).

2. Feature Scaling:

Use StandardScaler or MinMaxScaler to put input features on a comparable scale.

3. Balancing Classes:

Use SMOTE (Synthetic Minority Over-sampling Technique) to oversample the minority class, applied to the training split only.

Or use class_weight='balanced' in logistic regression.

4. Model Training:

Use logistic regression with regularization.

Tune C and penalty using GridSearchCV.

5. Evaluation:

Use metrics like F1-score, the Precision-Recall curve, and ROC-AUC.

Accuracy is not reliable here: with a 5% response rate, a model that always predicts "no response" already reaches 95% accuracy.

6. Business Integration:

Interpret model outputs as probabilities.

Use threshold tuning to balance expected profit against contact cost.

Consider feedback loops and update the model periodically.
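The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the data is synthetic (`make_classification` with roughly 5% positives), class weighting is used in place of SMOTE to avoid an extra dependency, and `GAIN_PER_RESPONSE` and `COST_PER_CONTACT` are hypothetical business figures.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: about 5% responders (class 1)
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5, weights=[0.95, 0.05], random_state=0
)
# Stratify so both splits keep the same response rate
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling plus class-weighted, L2-regularized logistic regression
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Threshold-free metrics that remain informative under imbalance
roc_auc = roc_auc_score(y_te, proba)
pr_auc = average_precision_score(y_te, proba)
print(f"ROC-AUC: {roc_auc:.3f}  PR-AUC (average precision): {pr_auc:.3f}")

# Threshold tuning against hypothetical campaign economics
GAIN_PER_RESPONSE = 50.0  # assumed revenue per responder reached
COST_PER_CONTACT = 2.0    # assumed cost per customer contacted
thresholds = np.linspace(0.05, 0.95, 19)
profits = []
for t in thresholds:
    contacted = proba >= t
    responders = np.sum(contacted & (y_te == 1))
    profits.append(GAIN_PER_RESPONSE * responders - COST_PER_CONTACT * np.sum(contacted))

best_threshold = float(thresholds[int(np.argmax(profits))])
print("Most profitable threshold under these assumptions:", best_threshold)
```

In practice the threshold would be selected on a validation split rather than the test set, and the gain and cost figures would come from the business.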
