
### **Question 1: What is Logistic Regression, and how does it differ from Linear Regression?**

**Answer:**
Logistic Regression is a statistical and machine learning technique used for **classification problems**, where the dependent variable is categorical (e.g., 0/1, Yes/No, Spam/Not Spam). Instead of predicting continuous outcomes like Linear Regression, Logistic Regression predicts the **probability** of a class label.

The main differences are:

1. **Output Nature**

   * Linear Regression predicts continuous values (−∞ to +∞).
   * Logistic Regression predicts probabilities (0 to 1).

2. **Function Used**

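   * Linear Regression uses the linear equation $\hat{y} = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n$ directly as its prediction.
   * Logistic Regression passes the same linear combination through the **sigmoid function**, which maps it to a value between 0 and 1.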

3. **Application**

   * Linear Regression → Continuous outcomes (e.g., predicting house prices).
   * Logistic Regression → Categorical outcomes (e.g., predicting if an email is spam).
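
The difference in output range can be seen on a toy dataset. The sketch below is illustrative only: the data (hours studied vs. pass/fail) and all variable names are made up, and both models are fitted to the same binary labels purely to compare what they return.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up for illustration): hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

# Query points, two of them far outside the training range
X_new = np.array([[-10.0], [4.5], [20.0]])

lin_out = lin.predict(X_new)              # unbounded real values
log_out = log.predict_proba(X_new)[:, 1]  # probabilities of class 1

print("Linear output:  ", lin_out)
print("Logistic output:", log_out)
```

The linear model's predictions fall below 0 and above 1 for the extreme inputs, while the logistic model's outputs always stay inside the 0 to 1 range.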





---

### **Question 2: Explain the role of the Sigmoid function in Logistic Regression.**

**Answer:**
The **sigmoid function** plays a crucial role in Logistic Regression by converting the output of a linear equation into a probability value between **0 and 1**.

Mathematically, the sigmoid is defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

where $z = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n$.

**Role in Logistic Regression:**

1. **Probability Mapping** – It transforms any real number (−∞ to +∞) into a probability score between 0 and 1.
2. **Classification Decision** – A threshold (commonly 0.5) is applied:

   * If probability ≥ 0.5 → Class 1
   * If probability < 0.5 → Class 0
3. **Non-linear Transformation** – The sigmoid is non-linear in $z$, so the predicted probabilities follow an **S-shaped curve**. The decision boundary itself stays linear in feature space, because a probability of 0.5 corresponds exactly to $z = 0$.

**Example:**
If the sigmoid output is **0.82**, the model predicts there is an **82% chance** the sample belongs to the positive class.
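
A minimal sketch of this computation in NumPy follows. The coefficients `beta_0` and `beta` and the sample `x` are hypothetical values chosen only for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one sample (illustrative values only)
beta_0 = -1.0
beta = np.array([0.8, -0.5])
x = np.array([3.0, 0.2])

z = beta_0 + beta @ x   # linear score
p = sigmoid(z)          # probability of the positive class
label = int(p >= 0.5)   # apply the common 0.5 threshold

print(f"z = {z:.2f}, p = {p:.3f}, predicted class = {label}")
```

Because `sigmoid(0)` equals 0.5, thresholding the probability at 0.5 is the same as checking whether the linear score `z` is non-negative.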

---





### **Question 3: What is Regularization in Logistic Regression and why is it needed?**

**Answer:**
**Regularization** is a technique used in Logistic Regression to **prevent overfitting** by adding a penalty term to the loss function. It controls the magnitude of model coefficients so that the model doesn’t become too complex and fit noise from the data.

**Types of Regularization:**

1. **L1 Regularization (Lasso):**

   * Adds the absolute values of coefficients as a penalty.
   * Some coefficients shrink to zero → helps in **feature selection**.

2. **L2 Regularization (Ridge):**

   * Adds the square of coefficients as a penalty.
   * Shrinks coefficients smoothly → reduces **variance** and improves stability.

**Why is it needed?**

* Logistic Regression can overfit when:

  * There are too many features.
  * Features are highly correlated (multicollinearity).
* Regularization ensures the model:

  * Generalizes better to unseen data.
  * Avoids extremely large coefficient values.
  * Improves performance in real-world scenarios.

**In practice:**
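Scikit-learn’s `LogisticRegression` uses **L2 regularization by default** (`penalty='l2'`, `C=1.0`). Note that `C` is the *inverse* of the regularization strength, so a smaller `C` means a stronger penalty.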
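
The different effect of the two penalties on the coefficients can be checked directly. The sketch below is an illustration under assumptions: it uses the breast cancer dataset, the `liblinear` solver (which supports both penalties), and an arbitrary, fairly strong setting of `C=0.1`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Put all features on a common scale so the penalty treats them equally.
# (No test set is evaluated here, so scaling the full data is harmless.)
X = StandardScaler().fit_transform(X)

# Same C for both models; only the penalty type differs
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))

print("Coefficients set to zero by L1:", n_zero_l1)
print("Coefficients set to zero by L2:", n_zero_l2)
```

L1 typically drives a number of coefficients to exactly zero, whereas L2 only shrinks them toward zero.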

---




### **Question 4: What are some common evaluation metrics for classification models, and why are they important?**

**Answer:**
In classification tasks, accuracy alone may not be sufficient to judge model performance, especially in **imbalanced datasets**. Therefore, multiple evaluation metrics are used:

1. **Accuracy**

   * Formula: $\frac{TP + TN}{TP + TN + FP + FN}$
   * Shows overall correctness of predictions.
   * Limitation: Misleading in imbalanced datasets.

2. **Precision**

   * Formula: $\frac{TP}{TP + FP}$
   * Measures how many predicted positives are actually positive.
   * Important when **false positives are costly** (e.g., spam filter).

3. **Recall (Sensitivity / True Positive Rate)**

   * Formula: $\frac{TP}{TP + FN}$
   * Measures how many actual positives were correctly identified.
   * Important when **false negatives are costly** (e.g., disease detection).

4. **F1-score**

   * Formula: $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
   * Harmonic mean of Precision and Recall.
   * Useful when a balance between Precision and Recall is required.

5. **ROC-AUC (Receiver Operating Characteristic – Area Under Curve)**

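   * Plots the True Positive Rate against the False Positive Rate across all classification thresholds and summarizes the curve by the area under it.
   * Ranges from 0 to 1: 0.5 corresponds to random guessing, and a higher AUC means the model separates the classes better.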

**Importance:**

* Provides a **holistic view** of performance.
* Helps select the **best model** based on business needs (e.g., prioritize Recall in healthcare screening, Precision in spam filtering).
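
As a worked example of the formulas above, the sketch below computes the metrics by hand from confusion-matrix counts and compares them with scikit-learn's built-in functions. The labels in `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Manual :", accuracy, precision, recall, f1)
print("sklearn:", accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```

Here TP = 2, FN = 2, FP = 1, and TN = 5, so accuracy is 0.7 while recall is only 0.5: the model misses half of the actual positives even though most predictions are correct.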




### **Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits it into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)**


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_breast_cancer()

# Build a DataFrame of features plus the target column
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Separate features and target
X = df.drop(columns='target')
y = df['target']

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Logistic Regression model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.956140350877193


### **Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy. (Use Dataset from sklearn package)**

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Logistic Regression with L2 regularization (default)
model = LogisticRegression(penalty='l2', C=1.0, max_iter=10000)
model.fit(X_train, y_train)

# Coefficients and accuracy
print("Model Coefficients:\n", model.coef_)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))


Model Coefficients:
 [[ 0.97796466  0.22675499 -0.36921764  0.02644054 -0.15485375 -0.22665079
  -0.5186091  -0.27936438 -0.22284174 -0.03509306 -0.09377994  1.39092772
  -0.17022173 -0.08877402 -0.02215899  0.05164999 -0.03656395 -0.03142397
  -0.03290299  0.01227996  0.09595287 -0.51563694 -0.01698607 -0.01657517
  -0.30594188 -0.74668265 -1.39907242 -0.50342187 -0.73505594 -0.09765041]]
Accuracy: 0.956140350877193


### **Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package)**

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report

# Load dataset (Iris has 3 classes)
iris = load_iris()
X, y = iris.data, iris.target

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

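# Train Logistic Regression with One-vs-Rest strategy.
# Note: the multi_class parameter is deprecated as of scikit-learn 1.5;
# newer code can wrap the model in sklearn.multiclass.OneVsRestClassifier instead.
model = LogisticRegression(multi_class='ovr', max_iter=10000)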
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print("Classification Report:\n", classification_report(y_test, y_pred))


Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.89      0.94         9
           2       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30





### **Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy. (Use Dataset from sklearn package)**

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define parameter grid for tuning
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # solver that supports both l1 and l2
}

# Apply GridSearchCV
grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=5)
grid.fit(X_train, y_train)

# Print results
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)


Best Parameters: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
Best Cross-Validation Accuracy: 0.964835164835165


### **Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling. (Use Dataset from sklearn package)**

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic Regression without scaling
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
acc_without_scaling = accuracy_score(y_test, model.predict(X_test))

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

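# Logistic Regression with scaling (fresh model, so the two fits stay independent)
model_scaled = LogisticRegression(max_iter=10000)
model_scaled.fit(X_train_scaled, y_train)
acc_with_scaling = accuracy_score(y_test, model_scaled.predict(X_test_scaled))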

print("Accuracy without scaling:", acc_without_scaling)
print("Accuracy with scaling:", acc_with_scaling)


Accuracy without scaling: 0.956140350877193
Accuracy with scaling: 0.9736842105263158




### **Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**

**Answer:**

To build a Logistic Regression model for this scenario, the steps would be:

#### **1. Data Handling**

* **Clean data**: Handle missing values, remove duplicates, and encode categorical variables (e.g., one-hot encoding).
* **Feature engineering**: Create meaningful features such as purchase frequency, average spend, browsing behavior, etc.
* **Remove noise**: Drop irrelevant variables that do not contribute to prediction.

#### **2. Feature Scaling**

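* Standardize features using **StandardScaler** (or rescale them with **MinMaxScaler**) so that features on large numeric scales do not dominate the regularization penalty or slow down solver convergence. Fit the scaler on the training data only and then apply it to the test data, to avoid leaking test information into training.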

#### **3. Balancing Classes**

Since only 5% of customers respond (positive class), the dataset is **highly imbalanced**. Options include:

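* **Oversampling minority class** (e.g., SMOTE – Synthetic Minority Oversampling Technique), applied to the training split only so that the test set keeps the real 5% response rate.
* **Undersampling majority class** to balance the dataset, at the cost of discarding some majority-class data.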
* **Class weights** in Logistic Regression (`class_weight='balanced'`) to penalize misclassification of minority class.
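
The effect of class weighting can be sketched on synthetic data. Everything below is an assumption for illustration: `make_classification` stands in for the real customer data (with roughly 5% positives), and the parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data: roughly 5% responders
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.95, 0.05],
                           random_state=42)

# stratify=y keeps the 5% response rate in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))

print("Recall without class weights:", recall_plain)
print("Recall with class_weight='balanced':", recall_weighted)
```

Weighting usually raises recall on the minority class at the cost of more false positives, so precision should be checked alongside it.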

#### **4. Hyperparameter Tuning**

* Use **GridSearchCV** or **RandomizedSearchCV** to optimize parameters:

  * `C` (regularization strength).
  * `penalty` (L1 vs L2).
  * Solver type.
* Apply **cross-validation** for reliable results.
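
One way to combine scaling and tuning is to wrap both steps in a `Pipeline`, so the scaler is re-fitted inside each cross-validation fold. The sketch below is a minimal illustration on synthetic data; the grid values, the step names `scale` and `clf`, and the choice of `roc_auc` as the scoring metric are assumptions, not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: roughly 5% responders
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.95, 0.05],
                           random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", class_weight="balanced")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

# Stratified folds keep the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated ROC-AUC:", search.best_score_)
```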

#### **5. Evaluation Metrics**

* Accuracy is misleading in imbalanced data. Instead, focus on:

  * **Precision** → To ensure targeted customers are actually likely responders.
  * **Recall** → To capture as many actual responders as possible.
  * **F1-score** → Balance between Precision and Recall.
  * **ROC-AUC** → Overall ability to distinguish responders vs non-responders.
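
The sketch below illustrates why accuracy misleads here: on synthetic data with roughly 5% positives (an assumed stand-in for the real dataset), a baseline that never predicts a responder already scores a high accuracy, so the other metrics are needed to judge the model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data: roughly 5% responders
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.95, 0.05],
                           random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)               # hard labels at the 0.5 threshold
proba = model.predict_proba(X_test)[:, 1]  # scores needed for ROC-AUC

# Baseline that predicts "no response" for every customer
baseline_accuracy = accuracy_score(y_test, np.zeros_like(y_test))

report = {
    "precision": precision_score(y_test, pred, zero_division=0),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}

print("Accuracy of the all-negative baseline:", baseline_accuracy)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Note that ROC-AUC is computed from the predicted probabilities, while Precision, Recall, and F1 depend on the chosen threshold.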

#### **6. Business Deployment**

* Use the model to score new customers and rank them by probability of response.
* Deploy in marketing pipeline for targeted campaigns.
* Continuously monitor performance and retrain with new data to keep the model up-to-date.
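
Scoring and ranking can be sketched as follows. This is a hypothetical illustration: the synthetic data stands in for real customers, the held-out split `X_new` plays the role of not-yet-contacted customers, and targeting the top 10% is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: historical customers to train on, new customers to score
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.95, 0.05],
                           random_state=1)
X_hist, X_new, y_hist, y_new = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_hist, y_hist)

# Score new customers and rank them from most to least likely to respond
scores = model.predict_proba(X_new)[:, 1]
ranking = np.argsort(-scores)

# Target the top 10% of customers by predicted probability
top_n = len(scores) // 10
top_customers = ranking[:top_n]

overall_rate = y_new.mean()
top_rate = y_new[top_customers].mean()

print("Overall response rate:", overall_rate)
print("Response rate in the top 10%:", top_rate)
```

Comparing the response rate among the targeted customers with the overall rate gives a simple lift figure that can also be tracked over time when monitoring the deployed model.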

**In summary:**

* Handle imbalance carefully.
* Scale features before training.
* Use hyperparameter tuning for optimization.
* Evaluate using Precision, Recall, F1-score, and ROC-AUC instead of just accuracy.

---
