1. What is Logistic Regression, and how does it differ from Linear Regression?
   - Logistic Regression is a statistical method for classification, most commonly binary classification, where the goal is to predict the probability that an outcome belongs to one of two categories (e.g., yes/no, spam/not spam). It models the relationship between a dependent variable and one or more independent variables using a logistic function, which transforms the linear combination of inputs into a value between 0 and 1 representing the probability of the positive class. The key difference lies in the output and the fitting criterion: Linear Regression predicts continuous numerical values by fitting a straight line to the data (minimizing the sum of squared errors), whereas Logistic Regression is designed for categorical outcomes, uses a sigmoid curve to output probabilities, and is fitted by maximum likelihood estimation (equivalently, by minimizing the log loss).

2. Explain the role of the Sigmoid function in Logistic Regression.
   - The Sigmoid function in Logistic Regression turns the model’s raw output into a probability. When a Logistic Regression model makes a prediction, it first calculates a score by adding up the input features multiplied by their weights, plus a bias term. This score can be any real number, positive or negative. The Sigmoid function then converts it into a value between 0 and 1, which represents the probability that the input belongs to the positive class. For example, if the result is close to 1, the model is confident the input belongs to the “yes” or “positive” class; if it is close to 0, the model predicts “no” or “negative.” The Sigmoid function also helps the model learn during training because it is smooth and differentiable, which makes gradient-based optimization straightforward.

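
The steps described above (weighted sum, then sigmoid, then a 0.5 cut-off) can be sketched in a few lines of NumPy. The weights, bias, and input values here are made up purely for illustration; they do not come from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one input row (illustrative values only)
weights = np.array([0.8, -0.5])
bias = 0.1
x = np.array([2.0, 1.0])

z = np.dot(weights, x) + bias   # linear score: can be any real number
p = sigmoid(z)                  # probability of the positive class

print(f"linear score z = {z:.2f}, probability = {p:.3f}")
print("predicted class:", int(p >= 0.5))
```

A score of exactly 0 maps to a probability of 0.5, which is why 0.5 is the usual default decision threshold.
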
3. What is Regularization in Logistic Regression and why is it needed?
   - Regularization in Logistic Regression is a technique used to prevent the model from overfitting the training data. Overfitting happens when the model learns not only the main patterns in the data but also the noise or random fluctuations, which makes it perform poorly on new, unseen data. Regularization controls this by adding a penalty term to the model’s cost function, which discourages the model from assigning very large weights to any particular feature. In effect, it keeps the model simpler and more general. There are two common types of regularization: L1 (Lasso) and L2 (Ridge). L1 regularization can drive some feature weights to exactly zero, which helps in feature selection, while L2 regularization spreads the penalty across all weights, keeping them small but nonzero. Regularization is important because it improves the model’s ability to generalize to new data, reduces overfitting, and leads to more stable and reliable predictions.
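
As a rough illustration of the L1 versus L2 behavior described above, the sketch below fits both penalties on the breast cancer dataset and counts how many coefficients end up exactly zero. The choice of `C=0.1` and the `liblinear` solver are assumptions made for the demonstration, not tuned values.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Standardize so the penalty treats all features on a comparable scale
X_scaled = StandardScaler().fit_transform(X)

# C is the inverse of the regularization strength: smaller C = stronger penalty
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_scaled, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X_scaled, y)

l1_zeros = int(np.sum(l1_model.coef_ == 0))
l2_zeros = int(np.sum(l2_model.coef_ == 0))

print(f"L1: {l1_zeros} of {l1_model.coef_.size} coefficients are exactly zero")
print(f"L2: {l2_zeros} of {l2_model.coef_.size} coefficients are exactly zero")
```

Lowering `C` further should zero out more L1 coefficients, while the L2 coefficients shrink toward zero without reaching it.
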

4. What are some common evaluation metrics for classification models, and why are they important?
   - Common evaluation metrics for classification models include accuracy, precision, recall, F1-score, and the ROC-AUC score. These metrics are important because they measure how well a model performs, especially when dealing with imbalanced classes. Accuracy measures the overall percentage of correct predictions made by the model, but it can be misleading when one class is much larger than the other. Precision shows how many of the predicted positive cases are actually positive, which is useful when the cost of false positives is high. Recall measures how many of the actual positive cases the model correctly identified, which is important when missing positive cases is costly, such as in medical diagnoses. The F1-score is the harmonic mean of precision and recall, providing a single number that balances the two when both are important. Lastly, the ROC-AUC score measures how well the model can distinguish between classes across all threshold levels, showing its overall ability to rank positive examples above negative ones. Together, these metrics give a more complete picture of a model’s strengths and weaknesses, helping data scientists choose the best model for a specific problem.
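
To make the definitions above concrete, here is a small sketch that computes each metric with `sklearn.metrics` on ten made-up samples (three positives, seven negatives). The labels and scores are invented for illustration only.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up true labels and predicted probabilities for ten samples
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.4, 0.2, 0.6, 0.4, 0.7, 0.9]

# Turn probabilities into class labels with the usual 0.5 threshold
y_pred = [int(s >= 0.5) for s in y_score]

acc = accuracy_score(y_true, y_pred)     # correct predictions / all predictions
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)     # uses the scores, not the thresholded labels

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} "
      f"f1={f1:.3f} roc_auc={auc:.3f}")
```

Here the thresholded predictions give 2 true positives, 1 false positive, and 1 false negative, so accuracy is 0.8 while precision and recall are both 2/3. Accuracy looks better than the positive-class metrics because most samples are negatives.
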


In [9]:
# 5. Write a Python program that loads a CSV file into a Pandas DataFrame,
# splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
import warnings

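# Suppress warnings to keep the output short (this also hides any
# ConvergenceWarning raised if the solver stops before converging)
warnings.filterwarnings("ignore")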

# Load a sample dataset from sklearn
cancer = load_breast_cancer()

# Convert the data into a DataFrame
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Split the data into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy as a whole-number percentage
print("\nLogistic Regression Model Accuracy: {}%".format(round(accuracy * 100)))



Logistic Regression Model Accuracy: 96%


In [8]:
# 6. Write a Python program to train a Logistic Regression model using L2
# regularization (Ridge) and print the model coefficients and accuracy.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
import warnings

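# Suppress warnings to keep the output short (this also hides any
# ConvergenceWarning raised if the solver stops before converging)
warnings.filterwarnings("ignore")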

# Load a sample dataset from sklearn
cancer = load_breast_cancer()

# Convert the data into a DataFrame
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Split dataset into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train Logistic Regression model with L2 regularization
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print model coefficients and accuracy
print("\nModel Coefficients:")
print(model.coef_)

print("\nModel Intercept:")
print(model.intercept_)

print("\nLogistic Regression Model Accuracy with L2 Regularization (Ridge): {:.2f}%".format(accuracy * 100))



Model Coefficients:
[[ 2.09981182  0.13248576 -0.10346836 -0.00255646 -0.17024348 -0.37984365
  -0.69120719 -0.4081069  -0.23506963 -0.02356426 -0.0854046   1.12246945
  -0.32575716 -0.06519356 -0.02371113  0.05960156  0.00452206 -0.04277587
  -0.04148042  0.01425051  0.96630267 -0.37712622 -0.05858253 -0.02395975
  -0.31765956 -1.00443507 -1.57134711 -0.69351401 -0.84095566 -0.09308282]]

Model Intercept:
[2.13128402]

Logistic Regression Model Accuracy with L2 Regularization (Ridge): 95.61%


In [13]:
# 7. Write a Python program to train a Logistic Regression model for multiclass
# classification using multi_class='ovr' and print the classification report.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report

# Load a sample dataset from sklearn
cancer = load_breast_cancer()

# Convert the data into a DataFrame
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Split the dataset into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

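# Create and train Logistic Regression model with multi_class='ovr'
# Note: the breast cancer dataset has only two classes, so this is effectively
# a binary fit; 'ovr' only changes the behavior with three or more classes.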
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)


report = classification_report(y_test, y_pred)
print("Classification Report for Logistic Regression:\n",report)



Classification Report for Logistic Regression:
               precision    recall  f1-score   support

           0       0.97      0.91      0.94        43
           1       0.95      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
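
Because the report above covers only two classes, a genuinely multiclass one-vs-rest fit may be easier to see on a three-class dataset. The sketch below is one possible variant, not the cell that produced the output above: it swaps in `load_iris` and wraps the model in `OneVsRestClassifier`, which is an alternative to the `multi_class='ovr'` argument (that argument is deprecated in recent scikit-learn releases). The stratified split and `max_iter=1000` are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Iris has three classes, so one-vs-rest trains three binary classifiers
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Wrap a binary Logistic Regression in an explicit one-vs-rest scheme
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
report = classification_report(y_test, y_pred, target_names=iris.target_names)

print("Number of binary classifiers:", len(ovr_model.estimators_))
print(report)
```

The report then has one row per class (setosa, versicolor, virginica), which is what the multiclass question is asking for.
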



In [16]:
# 8. Write a Python program to apply GridSearchCV to tune C and penalty
# hyperparameters for Logistic Regression and print the best parameters and validation
# accuracy.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score


# Load a sample dataset from sklearn
cancer = load_breast_cancer()

# Convert the data into a DataFrame
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Split the dataset into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Logistic Regression model
logreg = LogisticRegression(max_iter=200, solver='liblinear')


# Define the hyperparameter grid to search
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

# Apply GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters and validation accuracy
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Hyperparameters:")
print(best_params)
print("\nBest Cross-Validation Accuracy: {:.2f}%".format(best_score * 100))

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("\nTest Set Accuracy: {:.2f}%".format(test_accuracy * 100))




Best Hyperparameters:
{'C': 100, 'penalty': 'l1'}

Best Cross-Validation Accuracy: 96.70%

Test Set Accuracy: 98.25%


In [15]:
# 9. Write a Python program to standardize the features before training Logistic
# Regression and compare the model's accuracy with and without scaling.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the dataset
cancer = load_breast_cancer()

# Convert the data into a DataFrame
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Split dataset into features and target
X = df.drop('target', axis=1)
y = df['target']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression without feature scaling
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred)

# Standardize features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression with scaled features
model_scaled = LogisticRegression(max_iter=200)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred_scaled)

# Print and compare accuracies
print("Accuracy without scaling: {:.2f}%".format(accuracy_no_scaling * 100))
print("Accuracy with scaling: {:.2f}%".format(accuracy_with_scaling * 100))


Accuracy without scaling: 95.61%
Accuracy with scaling: 97.37%
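
A single train/test split can make the gap between the two accuracies look larger or smaller than it really is. As a complementary sketch (not the code that produced the numbers above), the same comparison can be run with 5-fold cross-validation. Putting the scaler inside a pipeline means it is re-fitted on each training fold only. The `max_iter=5000` value is an assumption chosen to give the unscaled model room to converge.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Model on raw features versus the same model preceded by standardization
unscaled_model = LogisticRegression(max_iter=5000)
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# 5-fold cross-validated accuracy; the pipeline fits the scaler per fold
scores_unscaled = cross_val_score(unscaled_model, X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(scaled_model, X, y, cv=5, scoring="accuracy")

print("Mean CV accuracy without scaling: {:.2f}%".format(scores_unscaled.mean() * 100))
print("Mean CV accuracy with scaling: {:.2f}%".format(scores_scaled.mean() * 100))
```
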


10. Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.
    - To build a Logistic Regression model for predicting customer responses when only 5% of customers respond, I would follow a structured approach. First, I would preprocess the data: handle missing values, encode categorical variables, and analyze feature distributions. I would then split the data with stratification so that the training and test sets both keep the 5% response rate. Because regularized Logistic Regression is sensitive to feature scales, I would standardize the numerical features, fitting the scaler on the training data only to avoid leakage. To address the class imbalance, I would use class weighting in Logistic Regression (e.g., class_weight='balanced') or a resampling method such as SMOTE (Synthetic Minority Over-sampling Technique), applied only to the training data and inside each cross-validation fold, so that the validation and test sets keep the real class distribution. Next, I would tune hyperparameters with GridSearchCV or RandomizedSearchCV, searching over the regularization strength C and the type of regularization (l1 or l2) with stratified cross-validation. For evaluation, accuracy alone would be misleading, since a model that predicts "no response" for every customer already scores 95%, so I would rely on precision, recall, F1-score, and ROC-AUC, focusing on recall or F1 for the minority class to capture as many responders as possible. Finally, I would validate the model on a hold-out test set and monitor business-relevant KPIs, such as the response rate among the customers the model selects, to ensure the predictions are actionable and aligned with marketing objectives.
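
The approach described above could be sketched roughly as follows. No real campaign data is available here, so `make_classification` generates a synthetic stand-in with about 5% positives. Class weighting is used in place of SMOTE, since SMOTE lives in the separate imbalanced-learn package. The parameter grid, the `average_precision` scoring choice, and all sizes are assumptions for the demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             roc_auc_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% responders (class 1)
X, y = make_classification(n_samples=4000, n_features=20, n_informative=6,
                           weights=[0.95, 0.05], random_state=42)

# Stratified split keeps the response rate similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling inside the pipeline is re-fitted on each training fold (no leakage);
# class_weight='balanced' up-weights the rare responder class
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", solver="liblinear",
                               max_iter=1000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Select hyperparameters by average precision rather than accuracy
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="average_precision")
search.fit(X_train, y_train)

# Evaluate once on the untouched test set with imbalance-aware metrics
proba = search.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)

print("Best parameters:", search.best_params_)
print("Test ROC-AUC: {:.3f}".format(roc_auc))
print("Test average precision: {:.3f}".format(pr_auc))
print(classification_report(y_test, search.predict(X_test), digits=3))
```

In practice, the default 0.5 decision threshold could also be adjusted to match the campaign budget, for example by contacting only the top-scoring fraction of customers.
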
