### 1. How do the assumptions of linear regression influence model accuracy, and how would you check for their validity in a dataset?

The assumptions of linear regression underpin the model's accuracy and the validity of its results.
Violating them can reduce predictive power, mislead interpretation, or produce inefficient estimates.

1. Linearity of the Relationship
Assumption: The relationship between the independent variables and the dependent variable is linear.
Influence on Accuracy: Linear regression assumes that any change in the predictor variables has a consistent, proportional change in the response variable. If the actual relationship is non-linear, the model may underfit, failing to capture the true relationship, leading to poor predictions and biased estimates.
• Checking Validity:
Scatter Plots: Plot each predictor against the response variable. Patterns that are curved or show complex interactions may indicate non-linearity.
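
A numeric check can complement the scatter plots. The sketch below is illustrative only: it uses synthetic data (an assumption, not any dataset from these notes), fits a straight line to a relationship that is actually quadratic, and shows that the residuals remain strongly correlated with the squared predictor. A curved residual plot would show the same thing visually.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): the true relationship is quadratic, not linear.
x = rng.uniform(-3, 3, size=500)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=0.5, size=x.size)

# Fit a straight line by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# If a linear form were adequate, the residuals would be unrelated to any
# function of x. A strong correlation with x**2 signals missed curvature.
corr_with_square = np.corrcoef(residuals, x**2)[0, 1]
print(f"corr(residuals, x^2) = {corr_with_square:.2f}")
```

A correlation near zero would be consistent with linearity; a large one suggests adding a transformed or polynomial term.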

2. Independence of Errors
Assumption: The residuals (errors) should be independent of each other.
Influence on Accuracy: When errors are correlated, as in time series data, the model can underestimate the standard errors of its coefficients, leading to over-optimistic significance tests and confidence intervals. This is particularly problematic in sequential or clustered data, where errors may follow a pattern.
• Checking Validity:
Durbin-Watson Test: Primarily for time series data, this test detects autocorrelation in residuals. A value close to 2 indicates no autocorrelation.
Residuals Plot Over Time: For time-dependent data, plot residuals over time to observe any patterns that suggest dependency.
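
The Durbin-Watson statistic is simple enough to compute by hand. The following is a minimal sketch on simulated residuals (the AR(1) coefficient of 0.8 is an arbitrary assumption); statsmodels also ships a `durbin_watson` function in `statsmodels.stats.stattools` that can be used instead.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(0)
n = 2000

# Independent residuals (white noise): the statistic should be near 2.
white_noise = rng.normal(size=n)

# Positively autocorrelated residuals: AR(1) with coefficient 0.8 (assumption).
shocks = rng.normal(size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

dw_white = durbin_watson_stat(white_noise)
dw_ar1 = durbin_watson_stat(ar1)
print(f"white noise: {dw_white:.2f}, AR(1): {dw_ar1:.2f}")
```

Values well below 2 point to positive autocorrelation, and values well above 2 point to negative autocorrelation.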

3. Normality of Errors
Assumption: Residuals should be normally distributed, especially important in small samples for valid hypothesis tests and confidence intervals.
Influence on Accuracy: Non-normally distributed errors affect the accuracy of p-values and confidence intervals, potentially leading to incorrect inferences about predictors.
• Checking Validity:
Q-Q Plot: A quantile-quantile plot of residuals shows if they align closely with a normal distribution.
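
The Q-Q comparison can also be summarized numerically. This sketch assumes SciPy is available and uses two made-up residual vectors; `scipy.stats.probplot` returns the Q-Q points together with the correlation between sample and theoretical quantiles, which is close to 1 when the residuals look normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical residual vectors: one normal, one right-skewed.
normal_resid = rng.normal(size=500)
skewed_resid = rng.exponential(size=500) - 1.0

# probplot returns ((theoretical, ordered), (slope, intercept, r));
# r is the correlation of the Q-Q points with a straight line.
(_, _), (_, _, r_normal) = stats.probplot(normal_resid, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_resid, dist="norm")

print(f"Q-Q correlation, normal residuals: {r_normal:.3f}")
print(f"Q-Q correlation, skewed residuals: {r_skewed:.3f}")
```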

Practical Implications of Assumption Violations:
Biased Estimates: Violations of linearity bias the coefficient estimates, and violations of independence bias their standard errors.
Reduced Predictive Accuracy: Heteroscedasticity and multicollinearity make coefficient estimates less stable, which can make models less generalizable.
Unreliable Statistical Inference: Non-normality of errors or heteroscedasticity can lead to unreliable p-values, confidence intervals, and hypothesis tests.
Ensuring these assumptions hold is essential to producing reliable, interpretable linear regression models.

### 2. If you had a dataset prone to overfitting, which regularization technique would you apply in linear regression: Lasso or Ridge?

Overfitting in linear regression:
Both Lasso and Ridge regularization reduce model complexity and improve generalizability, although each works differently and suits different scenarios.

1. Ridge Regression (L2 Regularization)
How it Works: Ridge regression penalizes the sum of squared coefficients by adding a term $\alpha \sum_j \beta_j^2$ to the cost function. This term discourages large coefficients, effectively "shrinking" them towards zero but generally not making them exactly zero.
Best For: Ridge is ideal when you have many predictors, especially if they are not sparse (meaning most features contribute somewhat to the response). It works well in cases where the predictors have small but meaningful contributions and where multicollinearity (high correlation between predictors) might be an issue.
Impact on Overfitting: By shrinking coefficients, Ridge reduces the model’s sensitivity to noise, which helps with overfitting, but it retains all predictors to some degree.
Example Use Case: Ridge regression is often preferred for datasets where all predictors are believed to be relevant but the model needs smoothing.

2. Lasso Regression (L1 Regularization)
How it Works: Lasso adds an $\alpha \sum_j |\beta_j|$ penalty to the cost function, encouraging some coefficients to become exactly zero, thus performing feature selection.
Best For: Lasso is particularly effective if you believe that only a subset of predictors are truly relevant, as it drives irrelevant feature coefficients to zero. This makes it well-suited for high-dimensional datasets where some features are likely irrelevant.
Impact on Overfitting: By simplifying the model and reducing the number of features, Lasso mitigates overfitting by limiting complexity and focusing only on the most significant predictors.
Example Use Case: Lasso is often the better choice for sparse datasets or when you aim to reduce the number of predictors for interpretability.

• Choosing Between Lasso and Ridge
If all predictors are likely relevant but the model overfits, Ridge may be more suitable since it smooths coefficient sizes without removing features.
If only a few predictors are likely relevant, Lasso is preferable as it can eliminate irrelevant features, resulting in a simpler and more interpretable model.
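
The difference in behavior is easy to see on a toy problem. The sketch below is not a recommendation of specific settings: the data are synthetic, only the first three of ten features matter by construction, and the `alpha` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data (assumption): 10 features, only the first 3 affect y.
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks every coefficient but keeps them non-zero;
# Lasso typically sets the coefficients of irrelevant features exactly to zero.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("ridge zeros:", n_zero_ridge, "| lasso zeros:", n_zero_lasso)
print("lasso coefficients:", lasso.coef_.round(2))
```

In practice `alpha` would be chosen by cross-validation (for example with `RidgeCV` or `LassoCV`) rather than fixed by hand.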

### 3. Describe a scenario where logistic regression would be more apt than K-Nearest Neighbors for a classification task. How does the sigmoid function influence predictions in this context?

Logistic regression would be more useful than K-Nearest Neighbors (KNN) for a classification task in scenarios where interpretability, computational efficiency, and linearity of the decision boundary are priorities.

• Large Datasets: Logistic regression is computationally efficient and scales well with large datasets. KNN, on the other hand, becomes slower as the dataset grows, as it has to calculate the distance between the test point and every other point in the training set.

• High Dimensionality: Logistic regression is often preferable when there are many features, especially if the data is sparse (e.g., text classification tasks). KNN can struggle with high-dimensional spaces due to the "curse of dimensionality," where the distance metric becomes less meaningful as dimensions increase.

• Linearly Separable Data: Logistic regression assumes a linear decision boundary, which can be a strength when the data is linearly separable or close to it. In cases like determining if an email is spam or not, logistic regression can be highly effective if the features indicate a strong linear relationship.

• Need for Interpretability: Logistic regression provides interpretable results by estimating the probability of class membership. This probability output can be useful for decision-making, and the model’s coefficients show the relationship between features and the outcome, unlike KNN, which is less interpretable.

• Regularization: Logistic regression can be regularized using L1 (Lasso) or L2 (Ridge) penalties to handle multicollinearity or reduce overfitting, which is beneficial for high-dimensional data. KNN lacks this capability, and overfitting is controlled only by adjusting k, the number of neighbors.

• Role of the Sigmoid Function in Logistic Regression
In logistic regression, the sigmoid function transforms the linear combination of input features into a probability value between 0 and 1. Here’s how it works:

Log-Odds Transformation: Logistic regression models the log-odds of the probability of an event (e.g., defaulting on a loan) as a linear function of the features:
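$$\text{log-odds}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

Sigmoid Transformation: To convert the log-odds into a probability, the sigmoid function is applied:

$$p(y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}$$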

This maps the output to the range (0,1), allowing it to be interpreted as the probability of the positive class.


Threshold-Based Decision: By setting a threshold (e.g., 0.5), predictions are classified into binary outcomes. If the output probability is greater than 0.5, the prediction is classified as the positive class; otherwise, it’s classified as the negative class.
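
The three steps (linear log-odds, sigmoid, threshold) can be traced on a few numbers. The coefficients and observations below are hypothetical values chosen for illustration, not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients (assumption): an intercept and two weights.
beta_0 = -1.0
beta = np.array([0.8, -0.5])

# Three example observations with two features each.
X = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 0.0]])

log_odds = beta_0 + X @ beta                      # linear part
probabilities = sigmoid(log_odds)                 # sigmoid transformation
predictions = (probabilities > 0.5).astype(int)   # threshold at 0.5

print("log-odds:     ", log_odds)
print("probabilities:", probabilities.round(3))
print("predictions:  ", predictions)
```

Only the first observation has positive log-odds (0.9), so it is the only one whose probability exceeds 0.5 and is assigned to the positive class.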

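The sigmoid is useful in the following ways:
• Probability Prediction: the output can be read directly as the probability of the positive class.
• Decision Boundary: a threshold of 0.5 on the probability corresponds to log-odds of 0, which defines a linear boundary in feature space.
• Interpretation of Coefficients: each coefficient βj is the change in log-odds for a one-unit increase in xj, holding the other features fixed.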


### 4. Compare the evaluation metrics for regression and classification. Why might you choose Mean Absolute Error over R² for a regression problem?

In regression and classification tasks, evaluation metrics assess model performance based on the nature of the predictions.

1. Evaluation Metrics for Regression
   
• Mean Absolute Error (MAE): MAE calculates the average of absolute differences between actual and predicted values. It provides a direct measure of error in the same units as the data, making it straightforward and easy to interpret.

• Mean Squared Error (MSE): MSE calculates the average of the squared differences between predicted and actual values. Squaring penalizes larger errors more than smaller ones, which is useful when large errors are particularly undesirable.

• Root Mean Squared Error (RMSE): RMSE is the square root of MSE. It is expressed in the same units as the data while still weighting large errors more heavily than MAE does.

• R-squared (R²): R² represents the proportion of variance explained by the model, indicating goodness of fit. On the training data of a least-squares model with an intercept it ranges from 0 to 1, where values closer to 1 imply a better fit; on held-out data it can be negative when the model predicts worse than the mean of the target.

R² is a relative measure, and its value depends on the variance in the data.

When to Prefer MAE over R²:
Interpretability: MAE directly reflects the average error, making it more interpretable in terms of how much error to expect on average.

R² never decreases when predictors are added, so it can overstate fit in models with many predictors. MAE is suitable when you need a simple, absolute measure of prediction error regardless of dataset complexity.
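
To make the definitions concrete, the sketch below computes the four regression metrics directly from their formulas on a tiny hand-made example (the numbers are assumptions chosen for easy arithmetic), once with small errors and once with a single large miss.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "mae": np.mean(np.abs(errors)),
        "mse": mse,
        "rmse": np.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,
    }

y_true = [3.0, 5.0, 7.0, 9.0]
good = regression_metrics(y_true, [2.0, 5.0, 8.0, 9.0])
# Same predictions, except for one large miss on the last point.
with_outlier = regression_metrics(y_true, [2.0, 5.0, 8.0, 19.0])

print(good)
print(with_outlier)
```

With the single large miss, MAE rises from 0.5 to 3.0 and stays readable in the target's units, while MSE jumps from 0.5 to 25.5 and R² falls from 0.9 to a negative value.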

2. Evaluation Metrics for Classification
   
• Accuracy: Accuracy is the percentage of correctly classified instances out of the total. It is straightforward but may not be informative if classes are imbalanced.
• Precision, Recall, and F1 Score:
Precision measures the accuracy of positive predictions (True Positives / (True Positives + False Positives)).
Recall (or Sensitivity) measures the proportion of actual positives that were correctly predicted (True Positives / (True Positives + False Negatives)).
F1 Score is the harmonic mean of precision and recall, useful in cases of class imbalance.
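
As a worked example, the counts behind these formulas can be tallied by hand. The labels below are made up for illustration (three positives, five negatives).

```python
import numpy as np

# Hypothetical ground truth and predictions for a binary task.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy here is 0.75 even though one in three positives is missed, which is why precision and recall are reported alongside it when classes are imbalanced.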


Choosing Between Metrics
For regression, MAE is often chosen over R² when:
Simplicity and Interpretability are Key: MAE’s straightforward error measure is beneficial when stakeholders or users need a clear, direct interpretation of error.
Outliers Are Present: MAE is less affected by outliers than squared-error measures such as MSE, RMSE, and R², making it preferable if the dataset contains anomalies that could skew results.


### 5. When would you prefer to use a Support Vector Machine over a Decision Tree for classification, considering the nature of the data and computational efficiency?


1. High-Dimensional Data

• SVM Advantage: SVMs are generally well-suited for high-dimensional datasets, such as text data or image recognition, where there are many features relative to the number of samples. SVM can effectively find the hyperplane that maximizes the margin between classes, even in high-dimensional spaces.

• Decision Tree Limitation: Decision Trees can struggle in high-dimensional data without a large sample size because they may overfit to noise or require extensive depth to capture the decision boundaries.

2. Linearly Separable Data or Clear Decision Boundaries

• SVM Advantage: If the data is linearly separable or nearly so, SVM can find the optimal separating hyperplane, leading to better generalization. Even for non-linear separability, using kernel functions (such as the radial basis function, or RBF) enables SVMs to model complex boundaries.

• Decision Tree Limitation: Decision Trees often create axis-aligned splits, which may not capture complex or non-linear decision boundaries as effectively as SVM with a kernel.

3. Robustness Against Overfitting

• SVM Advantage: SVM is generally less prone to overfitting, especially with a proper regularization parameter and kernel choice. SVM aims to maximize the margin around the decision boundary, which leads to better generalization, especially on smaller datasets.

• Decision Tree Limitation: Decision Trees are more prone to overfitting, especially without pruning, as they tend to capture the exact details of the training data, including noise.

4. Small to Medium Dataset Size
   
• SVM Advantage: SVMs can be computationally expensive for very large datasets, particularly in non-linear cases with kernels. For small to medium-sized datasets, however, SVMs are feasible and efficient, and they tend to produce robust models.

• Decision Tree Advantage on Large Data: Decision Trees are generally faster to train and more scalable to large datasets. They can be less computationally intensive and are well-suited for massive datasets when combined with ensemble techniques (e.g., Random Forests), which further improve their robustness.
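
One way to test these trade-offs on a given problem is to cross-validate both models side by side. The sketch below uses scikit-learn's bundled breast cancer dataset (569 samples, 30 features) purely as a stand-in small, moderately high-dimensional dataset; the dataset choice and hyperparameters are assumptions, and results will differ on other data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# SVMs are scale-sensitive, so standardization is part of the pipeline.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
# An unpruned tree; scaling is unnecessary for tree splits.
tree_clf = DecisionTreeClassifier(random_state=0)

svm_acc = cross_val_score(svm_clf, X, y, cv=5).mean()
tree_acc = cross_val_score(tree_clf, X, y, cv=5).mean()

print(f"SVM (RBF) mean CV accuracy:     {svm_acc:.3f}")
print(f"Decision tree mean CV accuracy: {tree_acc:.3f}")
```

Timing the two `cross_val_score` calls on progressively larger samples is a simple way to see where kernel SVM training cost starts to dominate.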

### 6. Explain the key differences between bagging and boosting. In which scenario would Adaboost be more beneficial than Random Forest?

Bagging and Boosting are two ensemble techniques that combine multiple models to improve overall performance.

1. Bagging (Bootstrap Aggregating)
 Bagging builds multiple independent models by training each on a random subset of the dataset (with replacement). It averages the predictions from these models, reducing variance and enhancing stability.
Key Model - Random Forest: A popular bagging method is the Random Forest, which uses multiple decision trees trained on random samples with random feature subsets, resulting in a robust, high-performing ensemble. Since each tree is independent, Random Forest is less sensitive to noise and less prone to overfitting.
When Bagging is Effective:

Bagging is beneficial when the base model is prone to high variance, such as a single decision tree. By averaging many models, it reduces overfitting, making it ideal for high-variance, complex datasets.
Random Forest works well in cases where interpretability is less important than predictive power, and the data has sufficient size and features.

2. Boosting
Boosting, unlike bagging, builds models sequentially. Each model learns from the errors of the previous one, giving more weight to misclassified points. This iterative process creates a strong learner by correcting errors step-by-step, which also makes boosting sensitive to noisy labels and outliers.
Key Model - AdaBoost: AdaBoost (Adaptive Boosting) is a popular boosting technique. It starts with a weak model (often a decision stump) and iteratively adjusts the weights of misclassified points, refining the model with each step. AdaBoost’s focus on hard-to-classify points can lead to high accuracy on complex datasets.
When Boosting is Effective:

Boosting is beneficial in cases where the data has subtle patterns that require a focus on more difficult-to-predict examples. It tends to outperform bagging when the primary goal is accuracy, and the data isn’t excessively noisy.

AdaBoost is particularly useful for binary classification tasks or datasets with clear but complex decision boundaries, where each iteration can correct previous mistakes and improve precision.

Smaller, Less Noisy Datasets: AdaBoost can achieve high accuracy on smaller datasets by refining mistakes across rounds. However, it’s sensitive to noise, so it’s less effective on noisy data or cases with many outliers, where Random Forest’s robustness would be more beneficial.
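
The noise sensitivity described above can be probed empirically. The following sketch compares the two ensembles on synthetic data with and without label noise; the dataset sizes, noise level, and `n_estimators` are arbitrary assumptions, and no particular winner is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_ensembles(label_noise):
    """Mean 5-fold CV accuracy of Random Forest and AdaBoost on synthetic
    data where a fraction `label_noise` of labels is assigned at random."""
    X, y = make_classification(
        n_samples=600, n_features=20, n_informative=5,
        flip_y=label_noise, random_state=0,
    )
    models = {
        # Bagging of decorrelated trees.
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
        # Sequential boosting of decision stumps (the default base learner).
        "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
    }
    return {name: cross_val_score(model, X, y, cv=5).mean()
            for name, model in models.items()}

clean_scores = compare_ensembles(label_noise=0.0)
noisy_scores = compare_ensembles(label_noise=0.2)
print("clean labels:", clean_scores)
print("noisy labels:", noisy_scores)
```

Comparing how much each model's score drops between the two runs gives a rough sense of its robustness to label noise on data of this kind.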



### 7. Using the `California housing` dataset from sklearn, demonstrate how you'd implement multiple linear regression. Check for assumptions of linear regression and apply necessary regularization techniques if required. How would you interpret the performance using regression evaluation metrics?

In [38]:
#Load and Prepare the Data

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import pandas as pd
import numpy as np

# Load the dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

#Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)





In [40]:
#Implement Multiple Linear Regression
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)


In [42]:
#Multicollinearity: Features should ideally not be highly correlated. Use variance inflation factor (VIF) to check for multicollinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
print(vif_data)

      Feature         VIF
0      MedInc   11.511140
1    HouseAge    7.195917
2    AveRooms   45.993601
3   AveBedrms   43.590314
4  Population    2.935745
5    AveOccup    1.095243
6    Latitude  559.874071
7   Longitude  633.711654
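
A caveat on the values above: statsmodels' `variance_inflation_factor` does not add an intercept column itself, so when it is called on raw features the results can be inflated for columns with large means relative to their spread (Latitude and Longitude are likely examples). The sketch below is one hedged alternative that computes VIF from its definition with an intercept, using scikit-learn; it is demonstrated on synthetic columns (an assumption) so that it runs without downloading the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j
    on all other columns (LinearRegression fits an intercept by default)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
# Synthetic features (assumption): a and b are nearly collinear, c is independent.
a = rng.normal(loc=100.0, size=1000)
b = a + rng.normal(scale=0.1, size=1000)
c = rng.normal(loc=50.0, size=1000)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], vifs.round(1))))
```

In the notebook, the same function could be applied as `vif_with_intercept(X.values)`; a common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.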


In [44]:
#Apply Regularization (if needed)
#If high multicollinearity is found, or if the model overfits, you can use Ridge or Lasso regularization to improve model performance.

# Using Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)

# Using Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)



In [48]:
#Evaluate Model Performance

# Predictions
y_pred = model.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# MAE: Gives the average magnitude of errors in predictions, providing a straightforward error measure.
# MSE: Penalizes larger errors more heavily, useful when larger errors are more problematic.
# R-squared (R²): Indicates the proportion of variance explained by the model; a value close to 1 suggests a good fit, but it is sensitive to overfitting in complex models.
# In the output below, R² ≈ 0.58 means the model explains about 58% of the variance in the test targets,
# and MAE ≈ 0.53 is an average error of roughly $53,000, since the target is in units of $100,000.
# Ridge or Lasso regularization can be applied if multicollinearity or overfitting is detected.



Mean Absolute Error: 0.5332001304956554
Mean Squared Error: 0.5558915986952444
R-squared: 0.5757877060324508
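
The Ridge and Lasso models fitted earlier are never scored, and they are trained on unscaled features even though both penalties depend on feature scale. The sketch below shows one way to evaluate all three models on equal footing. The helper name `evaluate_regressors` is hypothetical, and the demo uses synthetic stand-in data (an assumption) so it runs offline; in the notebook, the California housing splits would be passed in instead.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate_regressors(X_train, X_test, y_train, y_test):
    """Fit OLS, Ridge and Lasso on standardized features and
    return test-set MAE, MSE and R² for each model."""
    models = {
        "linear": make_pipeline(StandardScaler(), LinearRegression()),
        "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
        "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    }
    results = {}
    for name, model in models.items():
        y_pred = model.fit(X_train, y_train).predict(X_test)
        results[name] = {
            "mae": mean_absolute_error(y_test, y_pred),
            "mse": mean_squared_error(y_test, y_pred),
            "r2": r2_score(y_test, y_pred),
        }
    return results

# Stand-in data (assumption); replace with the housing X_train, X_test, y_train, y_test.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
splits = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
demo_results = evaluate_regressors(*splits)

for name, metrics in demo_results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

If the regularized models score about the same as plain linear regression, overfitting is probably not the main limitation, and the alpha values used here are starting points rather than tuned choices.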


### 8. With the `digits` dataset available in sklearn, how would you approach the classification task using both K-Nearest Neighbors and Support Vector Machine algorithms? Compare their performance using classification evaluation metrics and discuss the importance of the sigmoid function in the context of logistic regression.

In [62]:
# Load the Digits Dataset and Prepare the Data
# Description : The digits dataset contains 1,797 images of handwritten digits (0–9), 
# each represented as an 8x8 grid of pixel intensities, along with labels indicating the digit.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
digits = load_digits()
X, y = digits.data, digits.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features: both KNN (distance-based) and SVM (RBF kernel) are sensitive to feature scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)



### KNN 

K-Nearest Neighbors (classification): KNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function). It has been used as a non-parametric technique in statistical estimation and pattern recognition since the early 1970s.

Algorithm: A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its K nearest neighbors measured by a distance function. If K = 1, then the case is simply assigned to the class of its nearest neighbor.

In [65]:
#  Implement K-Nearest Neighbors (KNN)
# Initialize and train the KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict and evaluate
y_pred_knn = knn.predict(X_test)
print("KNN Classification Report:\n", classification_report(y_test, y_pred_knn))
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))



KNN Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        33
           1       1.00      1.00      1.00        28
           2       0.97      1.00      0.99        33
           3       0.97      0.97      0.97        34
           4       0.98      1.00      0.99        46
           5       0.96      0.96      0.96        47
           6       0.97      1.00      0.99        35
           7       1.00      0.94      0.97        34
           8       0.97      1.00      0.98        30
           9       0.95      0.90      0.92        40

    accuracy                           0.97       360
   macro avg       0.98      0.98      0.98       360
weighted avg       0.98      0.97      0.97       360

KNN Accuracy: 0.975
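
The value `n_neighbors=5` above is the library default rather than a tuned choice. A minimal sketch of selecting k by cross-validation on the training split follows; the candidate grid is an assumption, and the scaler is placed inside the pipeline so that it is refit within each fold.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
# make_pipeline names each step after its lowercased class name.
candidate_k = [1, 3, 5, 7, 9, 11]
param_grid = {"kneighborsclassifier__n_neighbors": candidate_k}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = search.score(X_test, y_test)
print(f"best k: {best_k}, CV accuracy: {search.best_score_:.3f}, "
      f"test accuracy: {test_accuracy:.3f}")
```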


### Support Vector Machine (SVM)
SVM aims to find the optimal hyperplane that separates classes with the maximum margin. We’ll use the radial basis function (RBF) kernel for better performance on the non-linearly separable digits data.


In [68]:

# Initialize and train the SVM model
svm = SVC(kernel='rbf', gamma='scale', C=1.0)
svm.fit(X_train, y_train)

# Predict and evaluate
y_pred_svm = svm.predict(X_test)
print("SVM Classification Report:\n", classification_report(y_test, y_pred_svm))
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))


SVM Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        33
           1       1.00      1.00      1.00        28
           2       1.00      1.00      1.00        33
           3       1.00      0.97      0.99        34
           4       0.96      1.00      0.98        46
           5       0.96      0.98      0.97        47
           6       0.97      1.00      0.99        35
           7       1.00      0.94      0.97        34
           8       0.97      0.97      0.97        30
           9       0.97      0.95      0.96        40

    accuracy                           0.98       360
   macro avg       0.98      0.98      0.98       360
weighted avg       0.98      0.98      0.98       360

SVM Accuracy: 0.9805555555555555


### Compare Performance with Classification Metrics

<u>Accuracy</u>: The percentage of correctly classified samples.

<u>Precision</u>: The accuracy of positive predictions.

<u>Recall</u>: The ratio of correctly predicted positives to all actual positives.

<u>F1-score</u>: The harmonic mean of precision and recall, useful for imbalanced datasets.

KNN is simpler and can perform well when classes are easily separated by neighborhood voting, though it may struggle with high-dimensional data or large datasets due to its high memory and computation cost.
SVM, especially with a kernel like RBF, can handle complex, high-dimensional data better, often leading to improved accuracy and precision on structured image data.

For the digits dataset, SVM performed slightly better than KNN in the run above (accuracy 0.981 versus 0.975, a difference of two test samples out of 360), consistent with its ability to form complex decision boundaries in high-dimensional spaces.
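
Because the two test accuracies differ by only a handful of samples, a single train/test split is weak evidence for ranking the models. The sketch below repeats the comparison with 5-fold stratified cross-validation; the fold count and seed are assumptions, and the hyperparameters mirror the ones used earlier.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

knn_clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))

# Use the same folds for both models so the scores are directly comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
knn_scores = cross_val_score(knn_clf, X, y, cv=cv)
svm_scores = cross_val_score(svm_clf, X, y, cv=cv)

print(f"KNN: {knn_scores.mean():.3f} +/- {knn_scores.std():.3f}")
print(f"SVM: {svm_scores.mean():.3f} +/- {svm_scores.std():.3f}")
```

If the gap between the means is small relative to the fold-to-fold spread, the two models are effectively tied on this dataset.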

The Role of the Sigmoid Function in Logistic Regression
The sigmoid function, defined as $\sigma(z) = \frac{1}{1 + e^{-z}}$, is essential in logistic regression as it maps any real-valued input to a value between 0 and 1, which represents the probability of belonging to a specific class. In multi-class classification (like digits), logistic regression can extend to one-vs-rest or softmax approaches to handle multiple classes, and the sigmoid’s probability interpretation supports probabilistic decision-making for each class.

KNN may be easier to implement but computationally expensive on large datasets, while SVM typically excels on high-dimensional data like images.
The sigmoid function is crucial in logistic regression for converting scores to probabilities, making it fundamental to probabilistic classification decisions in binary or multi-class settings.
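
The link between the sigmoid and its multi-class counterpart, softmax, can be checked numerically. In the sketch below the class scores are hypothetical values; it shows that softmax outputs sum to 1 and that, with two classes, softmax reduces to the sigmoid of the score difference.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Convert a vector of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical scores for three classes (assumption).
scores = np.array([2.0, 1.0, -1.0])
probs = softmax(scores)
print("softmax probabilities:", probs.round(3), "sum =", probs.sum())

# Two-class case: softmax([z, 0])[0] equals sigmoid(z).
z = 1.5
two_class = softmax(np.array([z, 0.0]))
print("softmax:", two_class[0].round(4), "| sigmoid:", sigmoid(z).round(4))
```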
