 ## Q1. Explain the difference between linear regression and logistic regression models. Provide an example of a scenario where logistic regression would be more appropriate.

Linear Regression: 
- Linear regression is used when the dependent variable (the variable you want to predict) is continuous. It predicts a numeric value, which makes it suitable for regression problems such as predicting house prices, sales revenue, temperature, a person's age, or future stock prices.
- Linear regression models the relationship between the independent variables and the continuous dependent variable with a linear equation. With a single predictor the equation is Y = a + bX, where Y is the predicted value, X is the independent variable, a is the intercept, and b is the slope.

Logistic Regression: 
- Logistic regression is used when the dependent variable is binary or categorical, typically representing two classes (0 or 1, Yes or No, True or False). It predicts the probability that an instance belongs to a particular class, making it suitable for classification problems. For example, predicting whether an email is spam or not, whether a customer will buy a product (yes/no), or whether a patient has a disease (yes/no).
- Logistic regression uses the logistic function (also known as the sigmoid function) to model the relationship between the independent variables and the binary dependent variable. The logistic function maps any real-valued number into a value between 0 and 1, representing the probability of belonging to one of the two classes.
- Logistic regression is more appropriate when you want to solve classification problems, where the outcome is binary or categorical. For instance, consider a scenario where you want to predict whether a student will pass or fail an exam based on features like study time, previous exam scores, and attendance. In this case, logistic regression can provide the probability of passing (class 1) or failing (class 0) for each student.
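
To make the contrast concrete, the sketch below uses the same linear score a + bX in two ways: returned directly (linear regression) and passed through the sigmoid (logistic regression). The coefficients and the study-hours values are made up for illustration and are not fitted to any data.

```python
import math

def linear_predict(x, a, b):
    """Linear regression output: an unbounded numeric value."""
    return a + b * x

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_predict_proba(x, a, b):
    """Logistic regression output: probability of class 1."""
    return sigmoid(linear_predict(x, a, b))

# Hypothetical coefficients (assumed for illustration only).
a, b = -4.0, 1.0
study_hours = [1, 4, 8]

scores = [linear_predict(h, a, b) for h in study_hours]
probs = [logistic_predict_proba(h, a, b) for h in study_hours]
# Turn probabilities into pass/fail labels with a 0.5 threshold.
labels = ["pass" if p >= 0.5 else "fail" for p in probs]
```

The linear scores can be any real number, while the probabilities always stay strictly between 0 and 1, which is what makes them usable for classification.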

## Q2. What is the cost function used in logistic regression, and how is it optimized?

The cost function used in logistic regression is often referred to as the "logistic loss" or "cross-entropy loss" function. It measures the error between the predicted probabilities of the logistic regression model and the actual binary outcomes in the dataset. The goal during training is to minimize this cost function. Here's the formula for the logistic loss function:

For a single training example with true label y and predicted probability p:

Cost(y, p) = -y*log(p) - (1 - y)*log(1 - p)

If y=1, the cost becomes −log(p), which punishes the model more for predicting a low probability when the true label is 1.

If y=0, the cost becomes −log(1−p), which punishes the model more for predicting a high probability when the true label is 0.
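
The behaviour described above can be checked numerically. This is a minimal sketch of the per-example cost; the clipping constant `eps` and the helper names are assumptions added here to keep `log` away from zero, not part of the formula itself.

```python
import math

def log_loss_single(y, p, eps=1e-15):
    """Cross-entropy cost for one example with true label y (0 or 1)
    and predicted probability p of class 1."""
    # Clip p away from 0 and 1 so log() never receives 0.
    p = min(max(p, eps), 1 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def mean_log_loss(ys, ps):
    """Average cost over a dataset."""
    return sum(log_loss_single(y, p) for y, p in zip(ys, ps)) / len(ys)

# True label is 1 in both cases; only the predicted probability differs.
confident_right = log_loss_single(1, 0.9)   # about 0.105
confident_wrong = log_loss_single(1, 0.1)   # about 2.303
```

A confident wrong prediction costs roughly twenty times more than a confident correct one, which is the penalty structure the two cases above describe.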

To optimize the logistic regression model, you typically use an optimization algorithm to find the model parameters (coefficients) that minimize the overall cost across the entire training dataset. The most commonly used optimization algorithm for logistic regression is gradient descent. 

Gradient descent adjusts the model parameters in the direction opposite to the gradient of the cost. This is done iteratively, with a learning rate alpha that determines the step size of each update.

repeat until convergence: {

    wj = wj - alpha * dJ(w,b)/dwj    (for every feature weight wj)
    b  = b  - alpha * dJ(w,b)/db

}

Each iteration updates the feature coefficients w1, w2, w3, ..., wj and the intercept b together. The intercept is a learned parameter, not a fixed constant, so it is optimized along with the weights.
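
The update rule can be sketched in NumPy as follows. This is a minimal batch gradient descent under assumed settings (learning rate 0.1, 5000 iterations, a tiny made-up dataset), not a production solver. It updates the intercept b alongside the weights, which is the usual formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, alpha=0.1, n_iters=5000):
    """Batch gradient descent on the mean cross-entropy cost J(w, b)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)            # predicted probabilities
        error = p - y                     # derivative of the cost w.r.t. the linear score
        grad_w = X.T @ error / n_samples  # dJ/dw
        grad_b = error.mean()             # dJ/db
        w -= alpha * grad_w               # w_j := w_j - alpha * dJ/dw_j
        b -= alpha * grad_b               # b   := b   - alpha * dJ/db
    return w, b

# Tiny made-up dataset: one feature, label is 1 for the larger values.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = fit_logistic_gd(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
```

The fixed iteration count stands in for a real convergence check, such as stopping when the cost changes by less than a small tolerance.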

## Q3. Explain the concept of regularization in logistic regression and how it helps prevent overfitting.

Regularization is a technique used in logistic regression to prevent overfitting, which occurs when a model learns to fit the training data too closely, capturing noise and making it perform poorly on unseen data. Regularization introduces a penalty term into the logistic regression cost function, discouraging the model from assigning excessively large weights to features. It encourages the model to have smaller and more balanced feature coefficients, which can improve generalization to new, unseen data.

There are two common types of regularization used in logistic regression:

L1 Regularization (Lasso Regularization):

In L1 regularization, the cost function is augmented by adding the absolute values of the feature coefficients (weights) as a penalty term.

New cost function:

Cost(y,p) = -y*log(p) - (1-y)*log(1-p) + λ * sum(|wi|)

Here, wi represents the weight of an individual feature, and λ controls the strength of regularization (the regularization parameter).
L1 regularization tends to produce sparse models because it drives many feature coefficients to exactly zero. This is useful for feature selection, as it effectively removes less relevant features from the model.


L2 Regularization (Ridge Regularization):

In L2 regularization, the cost function is augmented by adding the squared values of the feature coefficients as a penalty term.

New cost function:

Cost(y,p) = -y*log(p) - (1-y)*log(1-p) + λ * sum(wi^2)

As in L1 regularization, λ controls the strength of regularization.
L2 regularization encourages all feature coefficients to be small but does not force them to become exactly zero. It tends to produce models with small, non-zero weights for all features.

Elastic Net Regularization:

Cost(y,p) = -y*log(p) - (1-y)*log(1-p) + λ1 * sum(|wi|) + λ2 * sum(wi^2)

Elastic net regularization adds both the L1 and the L2 penalty terms, so it can perform feature selection and prevent overfitting at the same time.
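
The three penalties can be compared side by side. The sketch below adds them to the mean cross-entropy cost. The weights, labels, probabilities, and λ values are made up, and leaving the intercept out of the penalty is a common convention assumed here.

```python
import math

def log_loss(ys, ps):
    """Mean cross-entropy cost over the dataset (unregularized)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, ps)) / len(ys)

def l1_penalty(weights, lam):
    """lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lam * sum of squared weights."""
    return lam * sum(w ** 2 for w in weights)

def regularized_cost(ys, ps, weights, lam1=0.0, lam2=0.0):
    """lam1 > 0 only: L1 (lasso); lam2 > 0 only: L2 (ridge);
    both > 0: elastic net. The intercept is not penalized."""
    return log_loss(ys, ps) + l1_penalty(weights, lam1) + l2_penalty(weights, lam2)

# Made-up values for illustration.
weights = [3.0, -0.5, 0.0]
ys, ps = [1, 0], [0.8, 0.3]

base = regularized_cost(ys, ps, weights)
lasso = regularized_cost(ys, ps, weights, lam1=0.1)
ridge = regularized_cost(ys, ps, weights, lam2=0.1)
enet = regularized_cost(ys, ps, weights, lam1=0.1, lam2=0.1)
```

With these numbers the L1 penalty adds 0.35 and the L2 penalty adds 0.925. The squared term punishes the single large weight (3.0) much more heavily, which is why L2 shrinks large coefficients strongly.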

## Q4. What is the ROC curve, and how is it used to evaluate the performance of the logistic regression model?

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

- True Positive Rate
- False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

TPR = TP / (TP + FN)


False Positive Rate (FPR) is defined as follows:

FPR = FP / (FP + TN)

AUC stands for Area Under the Curve, here the area under the ROC curve. It summarizes the performance of a binary classification model across all thresholds in a single number. Because both TPR and FPR range from 0 to 1, the AUC always lies between 0 and 1. A value of 1 means the model ranks every positive above every negative, 0.5 corresponds to random guessing, and a greater value denotes better model performance. A curve that bends toward the top-left corner (high TPR at low FPR) encloses a larger area, so maximizing the AUC favours models that reach a high TPR while keeping the FPR low. The AUC equals the probability that the model assigns a randomly chosen positive instance a higher predicted probability than a randomly chosen negative instance.
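
The definitions above can be sketched in plain Python. The code sweeps the threshold over the predicted scores to build the ROC points, then computes the AUC two ways: as a trapezoidal area and as the fraction of correctly ranked positive/negative pairs. The labels and scores are made up, and the helper names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold
    over every distinct score, from strictest to most lenient."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc = auc_trapezoid(points)
auc_rank = pairwise_auc(y_true, scores)
```

Both computations give 8/9 here, because 8 of the 9 positive/negative pairs are ranked correctly.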

## Q5. What are some common techniques for feature selection in logistic regression? How do these techniques help improve the model's performance?

Feature selection is an important step in building a logistic regression model. It involves choosing a subset of relevant features from the available set of predictors to improve model performance and reduce the risk of overfitting. 

Feature selection techniques:
- Manual selection based on domain knowledge
- L1 regularization (drives the coefficients of weak features to zero)
- Variance threshold (drops features that are constant or nearly constant)
- Feature importance from tree-based models

Feature selection can improve model performance in several ways: the model takes less time to train and predict, it needs less memory, it is less prone to overfitting, and it may generalize better to unseen data.
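
As one example from the list, a variance threshold can be sketched in a few lines of NumPy. The feature matrix is made up and `variance_threshold_select` is a hypothetical helper. scikit-learn provides a `VarianceThreshold` transformer for the same purpose.

```python
import numpy as np

def variance_threshold_select(X, threshold=0.0):
    """Return indices of columns whose variance exceeds `threshold`."""
    variances = X.var(axis=0)
    return np.flatnonzero(variances > threshold)

# Made-up feature matrix: column 1 is constant and carries no information.
X = np.array([
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.1],
    [3.0, 5.0, 0.4],
    [4.0, 5.0, 0.3],
])

kept = variance_threshold_select(X, threshold=0.0)
X_reduced = X[:, kept]   # keeps columns 0 and 2
```

Variance depends on the scale of each feature, so features are usually put on a comparable scale before a non-zero threshold is applied.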


## Q6. How can you handle imbalanced datasets in logistic regression? What are some strategies for dealing with class imbalance?

Handling imbalanced datasets in logistic regression is essential because logistic regression models tend to be biased towards the majority class when there is a significant class imbalance. The majority class typically has more data, which can lead to the model having a higher accuracy for the majority class but poor performance on the minority class.
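- Resampling techniques: undersampling the majority class or oversampling the minority class.
- SMOTE (Synthetic Minority Over-sampling Technique): an oversampling method that creates new minority samples by interpolating between existing minority samples and their nearest neighbors.

The example below applies SMOTE to a small imbalanced dataset.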

In [1]:
from imblearn.over_sampling import SMOTE 
import pandas as pd
import numpy as np

In [16]:
np.random.seed()
X1 = np.random.randint(1,11,10)
X2 = np.random.randint(1,11,10)
X1,X2

(array([6, 3, 3, 2, 1, 2, 5, 1, 2, 8]),
 array([ 3,  1,  7,  7,  2,  3, 10,  7,  5,  8]))

In [38]:
y = [1,1,1,0,0,0,0,0,0,0]

In [47]:
df = pd.DataFrame({"X1":X1,"X2":X2,"y":y})
df

   X1  X2  y
0   6   3  1
1   3   1  1
2   3   7  1
3   2   7  0
4   1   2  0
5   2   3  0
6   5  10  0
7   1   7  0
8   2   5  0
9   8   8  0


In [48]:
df["y"].unique()

array([1, 0], dtype=int64)

In [49]:
df[["y"]].value_counts()

y
0    7
1    3
dtype: int64

In [65]:
# k_neighbors must be smaller than the number of minority samples (3 here); the default is 5
smote = SMOTE(k_neighbors=2)

In [56]:
X = df[["X1","X2"]]

In [57]:
X

   X1  X2
0   6   3
1   3   1
2   3   7
3   2   7
4   1   2
5   2   3
6   5  10
7   1   7
8   2   5
9   8   8


In [59]:
y = df["y"]
y

0    1
1    1
2    1
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: y, dtype: int64

In [66]:
X_new, y_new = smote.fit_resample(X,y)

In [67]:
X_new.shape

(14, 2)

In [68]:
y_new.shape

(14,)

In [69]:
X_new

   X1  X2
0   6   3
1   3   1
2   3   7
3   2   7
4   1   2
5   2   3
6   5  10
7   1   7
8   2   5
9   8   8


In [70]:
y_new

0     1
1     1
2     1
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    1
11    1
12    1
13    1
Name: y, dtype: int64

In [71]:
y_new.unique()

array([1, 0], dtype=int64)

In [73]:
y_new.value_counts()

1    7
0    7
Name: y, dtype: int64
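
The synthetic rows that SMOTE adds come from interpolating between a minority sample and one of its nearest minority neighbors. The sketch below shows only that interpolation step, using the three minority rows (y == 1) from the data above. It is a simplified illustration, not the imblearn implementation, and `interpolate_sample` is a hypothetical helper.

```python
import numpy as np

def interpolate_sample(x, neighbor, rng):
    """Return a random point on the segment between x and neighbor."""
    gap = rng.random()                  # uniform in [0, 1)
    return x + gap * (neighbor - x)

rng = np.random.default_rng(0)

# Minority-class rows (y == 1) from the example DataFrame.
minority = np.array([[6.0, 3.0], [3.0, 1.0], [3.0, 7.0]])

x = minority[0]
# Nearest other minority point by Euclidean distance.
dists = np.linalg.norm(minority[1:] - x, axis=1)
neighbor = minority[1:][np.argmin(dists)]

synthetic = interpolate_sample(x, neighbor, rng)
```

Because each new point lies between two existing minority points, the synthetic samples stay inside the region the minority class already occupies instead of being exact duplicates.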

## Q7. Can you discuss some common issues and challenges that may arise when implementing logistic regression, and how they can be addressed? For example, what can be done if there is multicollinearity among the independent variables?

Some common issues are:

1. Multicollinearity: Multicollinearity occurs when two or more independent variables in the model are highly correlated with each other. This makes it difficult to interpret the individual contribution of each variable to the target variable.
   Address it by inspecting the correlation matrix or the variance inflation factor (VIF), then dropping redundant features or applying dimensionality reduction. For example, if two features are highly correlated, keep only one of them.

2. Imbalanced Datasets: Class imbalance, where one class significantly outnumbers the other, can lead to biased predictions, as logistic regression may favor the majority class.
   Address it with resampling or interpolation-based oversampling. For example, if class 0 has 100 samples and class 1 has 900, either generate more samples of class 0 or take a subset of class 1 so that both classes have the same number of samples.

3. Outliers: Outliers in the data can distort the logistic regression model's coefficients and predictions.
   Detect them with visualization, statistical tests, or outlier detection algorithms. Outliers can be dropped, or, if they are important, handled with techniques such as anomaly detection.

4. Missing Data: Missing data can impact model training and prediction.
   Address it by filling missing values with the mean, median, or mode, or by predicting them with a machine learning model.

5. Model Evaluation: Selecting the appropriate evaluation metric is crucial for logistic regression models.
   Choose evaluation metrics based on the problem and the class distribution, such as accuracy, precision, recall, F1-score, AUC-ROC, or AUC-PR.
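
For the multicollinearity item, the variance inflation factor can be computed directly. The sketch below regresses each feature on the others with NumPy least squares and reports VIF = 1 / (1 - R^2). The data is synthetic and the `vif` helper is hypothetical. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns (with an intercept)."""
    n_samples, n_features = X.shape
    out = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic features for illustration.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)                   # independent feature
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
```

Here the first two features get very large VIF values because each is almost fully explained by the other, while the independent third feature stays close to 1. Dropping either x1 or x2 would remove the redundancy. The sketch does not guard against perfectly collinear columns, where R^2 equals 1 and the division fails.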