# Question 1: What is Logistic Regression, and how does it differ from Linear Regression?

# Logistic Regression is a statistical method used for binary classification problems
# (i.e., when the target variable has two possible outcomes such as Yes/No, 0/1, True/False).
# It predicts the probability that a given input belongs to a particular class using the
# logistic (sigmoid) function, which maps any real number to a value between 0 and 1.

# On the other hand, Linear Regression is used for predicting continuous numerical values.
# It models the relationship between independent variables (features) and a continuous
# dependent variable by fitting a straight line (y = mx + c in the single-feature case).

# Key Differences:
# 1. Output:
#    - Linear Regression → Continuous values
#    - Logistic Regression → Probabilities (0 to 1), then classified into categories
#
# 2. Function Used:
#    - Linear Regression → Linear function
#    - Logistic Regression → Sigmoid (logistic) function
#
# 3. Application:
#    - Linear Regression → Predicting house prices, sales, etc.
#    - Logistic Regression → Predicting spam vs non-spam, pass vs fail, etc.
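
# The difference in outputs can be illustrated with a small sketch. The toy data
# below is made up for illustration (one feature, a binary label); it is not from
# any real dataset. Linear Regression returns unbounded values, while Logistic
# Regression returns probabilities that stay between 0 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (an assumption for illustration): one feature, binary label.
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

# Query points far outside and inside the training range.
X_new = np.array([[-10], [4.5], [20]])
lin_pred = lin.predict(X_new)                # unbounded real values
log_proba = log.predict_proba(X_new)[:, 1]   # probability of class 1

print("Linear predictions  :", lin_pred)
print("Logistic P(class=1) :", log_proba)
```

# The linear predictions fall below 0 and above 1 at the extreme inputs, which is
# why they cannot be read as probabilities.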


# Question 2: Explain the role of the Sigmoid function in Logistic Regression

# Answer:
# The Sigmoid function plays a crucial role in Logistic Regression because it transforms
# the linear output of the model into a probability between 0 and 1.
#
# Mathematically, the Sigmoid function is defined as:
#     σ(z) = 1 / (1 + e^(-z))
#
# Where:
# - z is the linear combination of inputs (z = w1*x1 + w2*x2 + ... + b)
# - The output of σ(z) will always lie between 0 and 1.
#
# Role in Logistic Regression:
# 1. Converts the linear score z into a probability → useful for classification tasks.
# 2. Helps decide class labels:
#    - If σ(z) ≥ 0.5 → Class 1
#    - If σ(z) < 0.5 → Class 0
# 3. Is smooth and differentiable everywhere → the loss can be minimized with
#    gradient-based methods such as Gradient Descent.
#
# Example:
# Input (z) = 0 → σ(0) = 0.5
# Input (z) = 2 → σ(2) ≈ 0.88
# Input (z) = -2 → σ(-2) ≈ 0.12
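
# The example values above can be checked with a minimal NumPy sketch. The helper
# name `sigmoid` is chosen here for illustration; the 0.5 threshold is the one
# described in the answer.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)); accepts scalars or arrays."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

z_values = np.array([-2.0, 0.0, 2.0])
probs = sigmoid(z_values)
labels = (probs >= 0.5).astype(int)   # class 1 if probability >= 0.5, else class 0

for z, p, label in zip(z_values, probs, labels):
    print(f"z = {z:+.1f} -> sigma(z) = {p:.2f} -> class {label}")
```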


# Question 3: What is Regularization in Logistic Regression and why is it needed?

# Regularization in Logistic Regression is a method to reduce overfitting by
# adding a penalty to the model’s cost function. It controls the size of the
# regression coefficients so that the model does not become too complex.
#
# Types of Regularization:
# 1. L1 Regularization (Lasso): Uses the sum of absolute values of coefficients.
#    It can shrink some coefficients to zero, thus performing feature selection.
#
# 2. L2 Regularization (Ridge): Uses the sum of squared values of coefficients.
#    It reduces coefficient size but does not make them exactly zero.
#
# Why Regularization is Needed:
# - Prevents overfitting and improves generalization.
# - Ensures stable and reliable predictions on new/unseen data.
# - Helps handle multicollinearity among features (L2 in particular keeps the
#   coefficients of correlated features from growing very large).
# - Makes the model simpler and more robust.
#
# Regularization is essential in Logistic Regression to balance
# accuracy and simplicity of the model, ensuring better performance in practice.
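
# The feature-selection effect of L1 versus L2 can be sketched as follows. The
# synthetic dataset (20 features, only 3 informative) and the value C=0.05 are
# assumptions picked for illustration, not values from the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 20 features, only 3 informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# The liblinear solver supports both penalties; a small C means a strong penalty.
l1_model = LogisticRegression(penalty='l1', C=0.05, solver='liblinear').fit(X, y)
l2_model = LogisticRegression(penalty='l2', C=0.05, solver='liblinear').fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("Coefficients set exactly to zero with L1:", n_zero_l1)
print("Coefficients set exactly to zero with L2:", n_zero_l2)
```

# L1 is expected to zero out many of the uninformative features, while L2 only
# shrinks them.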



# Question 4: What are some common evaluation metrics for classification models, and why are they important?

# In classification models (like Logistic Regression), evaluation metrics
# are used to measure how well the model predicts categories. They are
# important because accuracy alone may not give a complete picture,
# especially with imbalanced datasets.

# Common Evaluation Metrics:
# 1. Accuracy:
#    - Measures the percentage of correct predictions.
#    - Important for balanced datasets but may mislead in imbalanced cases.

# 2. Precision:
#    - Out of all predicted positives, how many are actually positive.
#    - Important when false positives are costly (e.g., spam detection).

# 3. Recall (Sensitivity or True Positive Rate):
#    - Out of all actual positives, how many did the model correctly predict.
#    - Important when false negatives are costly (e.g., disease detection).

# 4. F1-Score:
#    - Harmonic mean of Precision and Recall.
#    - Useful when there is an imbalance between classes.

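# 5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve):
#    - Measures the ability of the model to distinguish between classes across
#      all classification thresholds.
#    - Ranges from 0 to 1: 0.5 corresponds to random guessing, 1.0 to perfect
#      separation, so a higher AUC means better ranking of positives over negatives.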

# Why They Are Important:
# - Help compare models and choose the best one for the problem.
# - Ensure that the model is not just accurate but also reliable in
#   handling both positive and negative classes effectively.
# - Guide decision-making in real-world applications.
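
# A small worked example may make the definitions concrete. The labels and scores
# below are made up for illustration: 10 samples with 3 true positives, 1 false
# negative, 1 false positive, and 5 true negatives.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Made-up labels and scores; y_pred is y_score thresholded at 0.5.
y_true  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

acc = accuracy_score(y_true, y_pred)     # (TP + TN) / total = 8 / 10
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)    = 3 / 4
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)    = 3 / 4
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)     # uses scores, not thresholded labels

print(f"Accuracy={acc:.2f} Precision={prec:.2f} Recall={rec:.2f} "
      f"F1={f1:.2f} ROC-AUC={auc:.3f}")
```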


In [2]:
# Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
# splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
# (Use Dataset from sklearn package)


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Step 1: Load dataset from sklearn
iris = load_iris()

# Step 2: Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

print("First 5 rows of dataset:\n", df.head())

# Step 3: Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Step 4: Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 5: Train Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Step 6: Make predictions
y_pred = model.predict(X_test)

# Step 7: Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Logistic Regression model:", accuracy)


First 5 rows of dataset:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
Accuracy of Logistic Regression model: 1.0
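
# The question mentions loading a CSV file, while the program above builds the
# DataFrame directly from sklearn. One possible way to include the CSV step is to
# write the dataset to a temporary CSV and read it back with pd.read_csv. The file
# name "iris.csv" is a hypothetical choice for this sketch.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris

# as_frame=True returns the data as a pandas DataFrame that includes a 'target' column.
iris_df = load_iris(as_frame=True).frame

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "iris.csv")   # hypothetical file name
    iris_df.to_csv(csv_path, index=False)          # save without the index column
    df_from_csv = pd.read_csv(csv_path)            # load the CSV into a DataFrame

print(df_from_csv.shape)
print(df_from_csv.columns.tolist())
```

# From here, df_from_csv can replace df in Steps 3 to 7 above.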

In [3]:
# Question 6: Write a Python program to train a Logistic Regression model using L2
# regularization (Ridge) and print the model coefficients and accuracy.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Step 1: Load dataset
iris = load_iris()

# Step 2: Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Step 3: Split features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Step 4: Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 5: Train Logistic Regression model with L2 Regularization
# (L2 is scikit-learn's default penalty; it is passed explicitly here for clarity)
model = LogisticRegression(penalty='l2', max_iter=200)
model.fit(X_train, y_train)

# Step 6: Print Coefficients
print("Model Coefficients:\n", model.coef_)
print("Intercept:\n", model.intercept_)

# Step 7: Evaluate Accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with L2 Regularization:", accuracy)


Model Coefficients:
 [[-0.40538546  0.86892246 -2.2778749  -0.95680114]
 [ 0.46642685 -0.37487888 -0.18745257 -0.72127133]
 [-0.06104139 -0.49404358  2.46532746  1.67807247]]
Intercept:
 [  8.86383271   2.20981479 -11.0736475 ]
Accuracy with L2 Regularization: 1.0
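
# To see how the strength of the L2 penalty affects the coefficients, the sketch
# below refits the model for a few values of C (the inverse of the regularization
# strength). The chosen C values and max_iter=5000 are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

coef_norms = {}
for C in [0.01, 1, 100]:
    # Smaller C = stronger L2 penalty = smaller coefficients.
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C:>6}: L2 norm of coefficients = {coef_norms[C]:.3f}")
```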


In [4]:
# Question 7: Write a Python program to train a Logistic Regression model for multiclass
# classification using multi_class='ovr' and print the classification report.
# (Use Dataset from sklearn package)

# Answer (10 Marks):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

# Step 1: Load dataset
iris = load_iris()

# Step 2: Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Step 3: Split features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Step 4: Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

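# Step 5: Train Logistic Regression model with One-vs-Rest (OvR)
# Note: the multi_class parameter is deprecated since scikit-learn 1.5, so newer
# versions emit a FutureWarning here and later releases may drop the parameter.
model = LogisticRegression(multi_class='ovr', max_iter=200)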
model.fit(X_train, y_train)

# Step 6: Predictions
y_pred = model.predict(X_test)

# Step 7: Print Classification Report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=iris.target_names))


Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      0.85      0.92        13
   virginica       0.87      1.00      0.93        13

    accuracy                           0.96        45
   macro avg       0.96      0.95      0.95        45
weighted avg       0.96      0.96      0.96        45
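
# Since multi_class is deprecated in recent scikit-learn releases, a possible
# alternative is to wrap the model in OneVsRestClassifier, which trains one binary
# classifier per class. This is a sketch of that alternative, not the method the
# question asks for, and its report is not guaranteed to match the one above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# One binary Logistic Regression is fitted per class (setosa, versicolor, virginica).
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=200))
ovr_model.fit(X_train, y_train)
y_pred = ovr_model.predict(X_test)

print("Number of binary classifiers:", len(ovr_model.estimators_))
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```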





In [5]:
# Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
# hyperparameters for Logistic Regression and print the best parameters and validation accuracy.
# (Use Dataset from sklearn package)

# Answer (10 Marks):

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Step 1: Load dataset
iris = load_iris()

# Step 2: Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Step 3: Split features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Step 4: Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

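# Step 5: Define Logistic Regression model
# (liblinear is used because it supports both the 'l1' and 'l2' penalties;
#  for multiclass targets it fits one-vs-rest classifiers)
model = LogisticRegression(max_iter=500, solver='liblinear')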

# Step 6: Define Hyperparameter Grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],       # Inverse of regularization strength (smaller C = stronger penalty)
    'penalty': ['l1', 'l2']             # L1 = Lasso, L2 = Ridge
}

# Step 7: Apply GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Step 8: Print Best Parameters and Accuracy
print("Best Parameters:", grid.best_params_)
print("Best Validation Accuracy:", grid.best_score_)



Best Parameters: {'C': 10, 'penalty': 'l2'}
Best Validation Accuracy: 0.9523809523809523
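
# best_score_ is the mean cross-validated accuracy on the training folds, so it is
# worth checking the tuned model on the held-out test set as well. The sketch below
# also places a StandardScaler inside a Pipeline so that scaling is refitted within
# each fold. The pipeline step names ("scaler", "clf"), the stratified split, and
# tuning only C with the default solver are assumptions of this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Scaling lives inside the pipeline, so each CV fold is scaled using only its own training part.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# GridSearchCV refits the best pipeline on all training data; score() uses that model.
test_accuracy = search.score(X_test, y_test)
print("Best Parameters:", search.best_params_)
print("Mean CV Accuracy:", search.best_score_)
print("Held-out Test Accuracy:", test_accuracy)
```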


In [6]:
# Question 9: Write a Python program to standardize the features before training Logistic
# Regression and compare the model's accuracy with and without scaling.
# (Use Dataset from sklearn package)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Step 1: Load dataset
iris = load_iris()

# Step 2: Convert to Pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Step 3: Split features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Step 4: Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ---- Model without scaling ----
model1 = LogisticRegression(max_iter=200)
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)
acc_without_scaling = accuracy_score(y_test, y_pred1)

# ---- Standardize features ----
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ---- Model with scaling ----
model2 = LogisticRegression(max_iter=200)
model2.fit(X_train_scaled, y_train)
y_pred2 = model2.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred2)

# ---- Print Results ----
print("Accuracy without Scaling:", acc_without_scaling)
print("Accuracy with Scaling   :", acc_with_scaling)


Accuracy without Scaling: 1.0
Accuracy with Scaling   : 1.0
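
# Both accuracies are identical here, plausibly because all four iris features are
# measured in centimetres and already have similar ranges. The sketch below only
# verifies what StandardScaler does: after fitting on the training set, each
# training column has mean close to 0 and standard deviation close to 1, while the
# test set is transformed with the training statistics.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training mean/std

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print("Training column means after scaling:", np.round(train_means, 6))
print("Training column stds after scaling :", np.round(train_stds, 6))
print("Test column means after scaling    :", np.round(X_test_scaled.mean(axis=0), 3))
```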


# Question 10: Imagine you are working at an e-commerce company that wants to
# predict which customers will respond to a marketing campaign. Given an imbalanced
# dataset (only 5% of customers respond), describe the approach you’d take to build a
# Logistic Regression model — including data handling, feature scaling, balancing
# classes, hyperparameter tuning, and evaluating the model for this real-world business
# use case.



# 1. Data Handling:
#    - Load the dataset and handle missing values (impute mean/median or drop if necessary).
#    - Perform feature engineering (e.g., customer purchase history, demographics, browsing behavior).
#    - Encode categorical features using OneHotEncoder (or OrdinalEncoder for ordered
#      categories); LabelEncoder is intended for the target variable, not for features.

# 2. Feature Scaling:
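#    - Apply StandardScaler or MinMaxScaler so that all features have comparable scales;
#      fit the scaler on the training data only and reuse it on the test data.
#    - Important for Logistic Regression because the regularization penalty treats all
#      coefficients alike and gradient-based solvers converge faster on scaled features.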

# 3. Handling Class Imbalance (only 5% positive responses):
#    - Use class_weight='balanced' in Logistic Regression to penalize misclassification of minority class.
#    - OR use oversampling methods like SMOTE or undersampling of the majority class,
#      applied to the training data only so the test set keeps the real class ratio.
#    - Ensure stratified train-test split to maintain class proportions.

# 4. Hyperparameter Tuning:
#    - Use GridSearchCV/RandomizedSearchCV to tune:
#        * C (inverse of regularization strength; smaller C means a stronger penalty)
#        * penalty ('l1', 'l2')
#        * solver ('liblinear', 'saga')
#    - Perform cross-validation for reliable performance estimation.

# 5. Model Evaluation:
#    - Accuracy is not suitable for imbalanced data: a model that predicts "no response"
#      for every customer already reaches 95% accuracy while finding no responders.
#    - Use metrics like:
#        * Precision (how many predicted positives are correct).
#        * Recall (how many actual positives are identified).
#        * F1-Score (balance between precision and recall).
#        * ROC-AUC (how well model distinguishes between classes).
#        * Precision-Recall AUC (more informative for highly imbalanced data).

# 6. Final Business Application:
#    - Select the best model with high recall (to capture most responding customers)
#      while maintaining decent precision (to avoid targeting too many uninterested users).
#    - Deploy model to predict campaign response probability.
#    - Tune the probability threshold (e.g., 0.3 instead of 0.5): lowering it raises recall
#      at the cost of precision, so choose it based on campaign cost and business needs.

# Logistic Regression with proper scaling, class balancing, hyperparameter tuning, and
# evaluation using precision-recall metrics can help the e-commerce company
# effectively target potential customers and maximize campaign success.
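
# The approach above can be sketched end to end as follows. No real customer data
# is available here, so the dataset is synthetic (make_classification with roughly
# 5% positives), and the feature counts, step names, and the 0.3 threshold are
# assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: about 5% of samples are positive.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)

# Stratified split keeps the 95/5 class ratio in both train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Scaling plus class-weighted Logistic Regression in one pipeline.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipe.fit(X_train, y_train)
proba = pipe.predict_proba(X_test)[:, 1]   # predicted probability of responding

roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)   # summary of the precision-recall curve
recall_05 = recall_score(y_test, (proba >= 0.5).astype(int))
recall_03 = recall_score(y_test, (proba >= 0.3).astype(int))

print(f"ROC-AUC: {roc_auc:.3f}  PR-AUC: {pr_auc:.3f}")
print(f"Recall at threshold 0.5: {recall_05:.3f}")
print(f"Recall at threshold 0.3: {recall_03:.3f}")
```

# Lowering the threshold can only keep or increase recall; whether the extra
# false positives are acceptable depends on the cost of contacting a customer.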
