# Logistic Regression

<img src= "https://i.pinimg.com/736x/53/44/a1/5344a1f885466272b8d0be85df922006.jpg">

* Logistic Regression is one of the most popular Machine Learning algorithms in the Supervised Learning family. It predicts a categorical dependent variable from a given set of independent variables.

* Logistic Regression is similar to Linear Regression; the difference lies in how they are used. Linear Regression solves regression problems, whereas Logistic Regression solves classification problems.

* In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function whose output is bounded by two limiting values, 0 and 1.

* Logistic Regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete data.

* Logistic Regression can classify observations using different types of data, and its fitted coefficients help identify which variables contribute most to the classification.

## Logistic Function (Sigmoid Function):

* The sigmoid function is the mathematical function used to map predicted values to probabilities.
* It maps any real value to a value in the range between 0 and 1.
* The output of Logistic Regression must lie between 0 and 1 and cannot go beyond these limits, so it forms an 'S'-shaped curve. This S-shaped curve is called the sigmoid function or logistic function.
* Logistic Regression uses a threshold value to turn the probability into a class of 0 or 1: values above the threshold are assigned to 1, and values below it are assigned to 0.

## Sigmoid function formula

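$$y_{pred} = \frac{1}{1 + e^{-x_{cap}}}$$

Here $x_{cap} = m x + b$ is the linear combination of the input $x$ with slope $m$ and constant $b$, matching the `X_cap` values computed in the code further down.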
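
The sigmoid formula and the threshold rule described above can be sketched in a few lines of NumPy. The helper names `sigmoid` and `to_class`, the sample inputs, and the default threshold of 0.5 are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Map any real value (or array of values) into the interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def to_class(prob, threshold=0.5):
    """Turn probabilities into 0/1 labels (0.5 is an assumed default threshold)."""
    return (np.asarray(prob) >= threshold).astype(int)

# Illustrative inputs, not taken from the dataset in this notebook
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
labels = to_class(probs)

print("probabilities:", probs)
print("labels:", labels)
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of exactly 0 maps to 0.5, which is why 0.5 is a common choice of threshold.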

## Logistic Regression requirements

* The dependent variable must be categorical in nature
* The independent variables should not exhibit multicollinearity
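
One simple way to screen for the second requirement is to look at pairwise correlations between the independent variables. The sketch below uses a hypothetical three-feature table (`x1`, `x2`, `x3`) and an assumed cut-off of 0.9; pairwise correlation only catches collinearity between two variables at a time, so treat it as a first check and not a complete test.

```python
import pandas as pd

# Hypothetical features: x2 is almost an exact multiple of x1, x3 is unrelated.
features = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.1],
    "x3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

corr = features.corr().abs()   # absolute pairwise Pearson correlations
threshold = 0.9                # assumed cut-off for "too correlated"

# Collect each pair of distinct columns whose correlation exceeds the cut-off
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]

print(corr.round(3))
print("highly correlated pairs:", high_pairs)
```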

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = {
    "X" : [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9],
    "Y" : [0, 0, 0, 0, 1, 1, 1, 1]
}

In [3]:

df = pd.DataFrame(data=data)
df

Unnamed: 0,X,Y
0,0.1,0
1,0.2,0
2,0.3,0
3,0.4,0
4,0.6,1
5,0.7,1
6,0.8,1
7,0.9,1


In [4]:
# Row-wise product of X and Y (the column is summed later when fitting)
df['SumX_Y'] = df['X'] * df['Y']
df['SumX_Y']


0    0.0
1    0.0
2    0.0
3    0.0
4    0.6
5    0.7
6    0.8
7    0.9
Name: SumX_Y, dtype: float64

In [5]:
# Finding square of X
df['Sqr_X'] = df['X'] ** 2
df['Sqr_X']

0    0.01
1    0.04
2    0.09
3    0.16
4    0.36
5    0.49
6    0.64
7    0.81
Name: Sqr_X, dtype: float64

In [6]:
df

Unnamed: 0,X,Y,SumX_Y,Sqr_X
0,0.1,0,0.0,0.01
1,0.2,0,0.0,0.04
2,0.3,0,0.0,0.09
3,0.4,0,0.0,0.16
4,0.6,1,0.6,0.36
5,0.7,1,0.7,0.49
6,0.8,1,0.8,0.64
7,0.9,1,0.9,0.81
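
The two helper columns feed the closed-form least-squares slope $m$ and constant $b$ that the class further down computes. As a worked check on this dataset, with $n = 8$, $\sum x = 4.0$, $\sum y = 4$, $\sum xy = 3.0$ and $\sum x^2 = 2.6$ (results rounded to three decimals):

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{8(3.0) - (4.0)(4)}{8(2.6) - (4.0)^2} = \frac{8}{4.8} \approx 1.667$$

$$b = \frac{\sum y - m\sum x}{n} = \frac{4 - 1.667(4.0)}{8} \approx -0.333$$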


## Confusion Matrix :-
* A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.
* The confusion matrix shows the ways in which your classification model is confused when it makes predictions.

* True Positive (TP) = event values that are correctly predicted as events
* False Positive (FP) = no-event values that are incorrectly predicted as events
* True Negative (TN) = no-event values that are correctly predicted as no-events
* False Negative (FN) = event values that are incorrectly predicted as no-events
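
The four counts can be tallied directly from paired true and predicted labels. This is a minimal sketch in which class 1 is taken to be the "event"; the function name `confusion_counts` and the two label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN), treating 1 as the event and 0 as the no-event."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Illustrative labels, not the output of the model in this notebook
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("TP, FP, TN, FN =", tp, fp, tn, fn)
```

Every observation falls into exactly one of the four cells, so the counts always add up to the number of observations.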


## Evaluating Logistic Regression using Accuracy, Precision, Recall and F1 Score

<img src = "https://www.researchgate.net/publication/346062755/figure/fig5/AS:960496597483542@1606011642491/Confusion-matrix-and-performance-evaluation-metrics.png">


## Accuracy :-

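* Accuracy measures how often the model's predictions are correct.
* It is the percentage of all predicted values that match the true values.
* As a rough rule of thumb, accuracy above 70% is often regarded as good model performance, but what counts as good depends on the problem and on how balanced the classes are.
* Accuracy in the 70%-90% range is a realistic target for many problems. On a heavily imbalanced dataset, however, a model can reach high accuracy simply by always predicting the majority class, so accuracy should be read together with precision and recall.
* The accuracy formula is shown below.

$$Accuracy = \left(\frac{\text{Total no. of correctly predicted values}}{\text{Total no. of values}}\right) \times 100$$

OR, as a fraction in terms of the confusion matrix counts:

$$Accuracy = \frac{(TP + TN)}{(TP + TN + FP + FN)}$$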
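
The second accuracy formula can be checked with a short function. The counts below are hypothetical; the second call illustrates an assumed imbalanced case (95 no-events, 5 events) in which a model that always predicts 0 still scores well on accuracy.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical balanced case: 6 of 8 predictions are correct
balanced = accuracy(tp=3, tn=3, fp=1, fn=1)

# Hypothetical imbalanced case: the model predicts 0 for everything,
# so it finds none of the 5 events yet is right on all 95 no-events
always_negative = accuracy(tp=0, tn=95, fp=0, fn=5)

print("balanced:", balanced)
print("always predicting 0:", always_negative)
```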

## Precision :-

* Precision is the fraction of relevant instances among the retrieved instances, i.e. the share of predicted positives that are truly positive.
* Its denominator counts the predicted positive values only.
* Precision can be seen as a measure of quality.
* The precision value lies between 0 and 1.
* Higher precision (nearer to 1) means that an algorithm returns more relevant results than irrelevant ones.
* Precision is more important than recall when you want fewer False Positives and can accept more False Negatives in exchange, i.e. when a False Positive is very costly and a False Negative is not.

$$Precision = \frac{\text{True Positives}}{\text{Selected Elements}}$$

OR

$$Precision = \frac{TP}{(TP + FP)}$$
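
A minimal sketch of the precision formula follows. The counts are hypothetical, and returning 0.0 when the model predicts no positives at all is an assumed convention, since the formula is undefined in that case.

```python
def precision(tp, fp):
    """TP / (TP + FP): share of predicted positives that are truly positive."""
    if tp + fp == 0:
        return 0.0  # assumed convention when nothing was predicted positive
    return tp / (tp + fp)

# Hypothetical counts: 4 predicted positives, 3 of them correct
p = precision(tp=3, fp=1)
print("precision:", p)

# No positive predictions at all: the assumed fallback applies
print("precision with no positive predictions:", precision(tp=0, fp=0))
```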

## Recall :-

* Recall is the fraction of relevant instances that were retrieved, i.e. the share of actual positives that the model found.
* Its denominator counts the original (actual) positive values only.
* Recall is more important where overlooked cases (False Negatives) are more costly than false alarms (False Positives). The focus in these problems is finding the positive cases.
* Recall should be near 1 (high) for a good classifier.

$$Recall = \frac{\text{True Positives}}{\text{Relevant Elements}}$$

OR

$$Recall = \frac{TP}{(TP + FN)}$$
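
The recall formula can be sketched the same way. The counts are hypothetical, and the 0.0 fallback for a dataset with no actual positives is an assumed convention.

```python
def recall(tp, fn):
    """TP / (TP + FN): share of actual positives that the model found."""
    if tp + fn == 0:
        return 0.0  # assumed convention when there are no actual positives
    return tp / (tp + fn)

# Hypothetical counts: 4 actual positives, 3 of them found
r_high = recall(tp=3, fn=1)

# A cautious model that finds only 1 of the 4 actual positives
r_low = recall(tp=1, fn=3)

print("recall (3 of 4 found):", r_high)
print("recall (1 of 4 found):", r_low)
```

The second case is the kind of model that can still have perfect precision (no False Positives) while missing most of the positive cases.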

## F1 Score :-

* F1 Score is the harmonic mean of precision and recall.
* In the simplest terms, higher F1 scores are generally better.
* F1 scores can range from 0 to 1, with 1 representing a model that perfectly classifies each observation into the correct class and 0 representing a model that is unable to classify any observation into the correct class.
* The formula for the F1 score is shown below, where P is the precision value and R is the recall value.


$$F1 = 2 \cdot \frac{P \cdot R}{P + R}$$
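
Because F1 is a harmonic mean, it is pulled toward the smaller of the two values. The sketch below compares it with the plain average for hypothetical values of P and R; the 0.0 fallback when both are zero is an assumed convention.

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    if p + r == 0:
        return 0.0  # assumed convention when both precision and recall are 0
    return 2 * (p * r) / (p + r)

# Hypothetical model with perfect precision but poor recall
p, r = 1.0, 0.25
f1 = f1_score(p, r)
arithmetic_mean = (p + r) / 2

print("F1:", f1)
print("arithmetic mean:", arithmetic_mean)
```

The F1 score here is well below the arithmetic mean, so a model cannot earn a high F1 by doing well on only one of the two measures.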

In [7]:

class logistic_Regression:
    def __init__(self, df):
        # Sums needed by the closed-form least-squares slope and constant
        self.sum_xy = df['SumX_Y'].sum()
        self.sum_x = df['X'].sum()
        self.sum_y = df['Y'].sum()
        self.sqr_x = df['Sqr_X'].sum()
        self.sumX_h_2 = self.sum_x ** 2
        self.n = len(df)

    def M(self):   # To find m value (slope)
        numerator = self.n * self.sum_xy - self.sum_x * self.sum_y
        denominator = self.n * self.sqr_x - self.sumX_h_2
        self.m = numerator / denominator
        return self.m

    def B(self):   # To find b value (constant); uses the slope from M()
        numerator_b = self.sum_y - self.M() * self.sum_x
        denominator_b = self.n
        self.b = numerator_b / denominator_b
        return self.b

In [8]:
# object declaration
m_obj = logistic_Regression(df)

# object.methodname: the sums are stored on the object, so no arguments are needed
m = m_obj.M()
b = m_obj.B()

# printing the m and b values
print("m value is: ", m)
print("b value is: ", b)

In [9]:
# Calculating Accuracy, Precision, Recall and F1 Score

class logistic_Regression:
    def __init__(self, df):
        self.df = df
        self.sum_xy = df['SumX_Y'].sum()
        self.sum_x = df['X'].sum()
        self.sum_y = df['Y'].sum()
        self.sqr_x = df['Sqr_X'].sum()
        self.sumX_h_2 = self.sum_x ** 2
        self.n = len(df)

    def M(self):   # To find m value (slope)
        numerator = self.n * self.sum_xy - self.sum_x * self.sum_y
        denominator = self.n * self.sqr_x - self.sumX_h_2
        self.m = numerator / denominator
        return self.m

    def B(self):   # To find b value (constant); uses the slope from M()
        self.b = (self.sum_y - self.M() * self.sum_x) / self.n
        return self.b

    def x_cap(self):   # X_cap = m*X + b for every X in the data
        m, b = self.M(), self.B()
        return [(m * X + b) for X in self.df['X']]

    @staticmethod
    def sigmoid(X_cap):   # To find sigmoid of X_cap for finding ypred
        return [(1 / (1 + np.exp(-xcap))) for xcap in X_cap]

    @staticmethod
    def final(ypred):   # Replacing values in ypred with 0 and 1 (threshold 0.5)
        return [1 if val >= 0.5 else 0 for val in ypred]

    def conf_matrix(self, ypred, ytrue):   # Confusion matrix
        # Initializing True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN) values to 0
        TP = FP = TN = FN = 0
        for i in range(len(ypred)):
            if ytrue[i] == ypred[i] == 1:
                TP += 1
            if ypred[i] == 1 and ytrue[i] != ypred[i]:
                FP += 1
            if ytrue[i] == ypred[i] == 0:
                TN += 1
            if ypred[i] == 0 and ytrue[i] != ypred[i]:
                FN += 1
        return TP, FP, TN, FN

    def evaluate_1(self, TP, FP, TN, FN):   # Metrics with class 1 treated as the positive class
        accuracy_1 = (TP + TN) / (TP + FP + FN + TN)
        precision_1 = TP / (TP + FP)
        recall_1 = TP / (TP + FN)
        f1_score_1 = 2 * ((precision_1 * recall_1) / (precision_1 + recall_1))
        return accuracy_1, precision_1, recall_1, f1_score_1

    def evaluate_0(self, TP, FP, TN, FN):   # Metrics with class 0 treated as the positive class
        accuracy_0 = (TP + TN) / (TP + FP + FN + TN)
        precision_0 = TN / (FN + TN)
        recall_0 = TN / (FP + TN)
        f1_score_0 = 2 * ((precision_0 * recall_0) / (precision_0 + recall_0))
        return accuracy_0, precision_0, recall_0, f1_score_0

In [11]:
# object declaration
m_obj = logistic_Regression(df)

# object.methodname: the sums are stored on the object, so no arguments are needed
print("m value is: ", m_obj.M())
print("b value is: ", m_obj.B())

X_cap = m_obj.x_cap()
print("X_cap value is: ", X_cap)

ypred = m_obj.sigmoid(X_cap)                             # Finding ypred
print("Sigmoid values:", ypred)

ypred = m_obj.final(ypred)                               # y-prediction values
print("y predicted values: ", ypred)

ytrue = df.Y.values                                      # y-original values
print("y true values: ", ytrue)

TP, FP, TN, FN = m_obj.conf_matrix(ypred, ytrue)
print("Confusion matrix (TP, FP, TN, FN): ", (TP, FP, TN, FN))
print("Metrics for class 1: ", m_obj.evaluate_1(TP, FP, TN, FN))
print("Metrics for class 0: ", m_obj.evaluate_0(TP, FP, TN, FN))
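
The class above obtains m and b from the least-squares line formulas and then applies the sigmoid. For comparison, a more conventional way to fit logistic regression is to minimise the log loss by gradient descent. The sketch below is not the method used in this notebook; the learning rate, step count and variable names are assumptions chosen for this small dataset.

```python
import numpy as np

# Same X and Y values as the DataFrame in this notebook
X = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 0.0, 0.0          # slope and constant, started at zero
learning_rate = 0.5      # assumed step size
n_steps = 5000           # assumed number of iterations

for _ in range(n_steps):
    p = sigmoid(w * X + b)            # current predicted probabilities
    grad_w = np.mean((p - y) * X)     # gradient of the mean log loss w.r.t. w
    grad_b = np.mean(p - y)           # gradient of the mean log loss w.r.t. b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

pred = (sigmoid(w * X + b) >= 0.5).astype(int)
print("w:", w, "b:", b)
print("predicted classes:", pred)
```

Because the two classes in this dataset are perfectly separated, w keeps growing the longer the loop runs; stopping after a fixed number of steps, as assumed here, is what keeps it finite.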