# Python: Logistic Regression on Pima Indians Diabetes Database

<h2>[1]. Introduction</h2>

1. What Is Logistic Regression?

   Logistic regression is one of the most popular supervised machine learning algorithms for binary classification. It can be extended to cases where the dependent variable has more than two categories.
   
   Problems with binary outcomes, such as Yes/No, 0/1, or True/False, are called binary classification problems.
   
   In logistic regression, we estimate the unknown probability of the outcome from a linear combination of the independent variables.

   
   
2. What are the applications of Logistic Regression?

    Logistic regression can be applied wherever the outcome is binary. Some applications of logistic regression are:
   1. Predicting whether a candidate will win an election.
   2. Detecting whether an email is spam.
   3. Predicting whether a student will pass or fail based on hours of study or other relevant information.
   4. Predicting the approval of a loan based on credit score.
   5. Predicting the failure of a firm.
   

3. What terminology is used in logistic regression?

   1. Probability: the measure of the likelihood that an event will occur, quantified as a number between 0 and 1.
   2. Odds: the ratio of the probability that an event occurs to the probability that it does not occur: odds = P(occurring) / P(not occurring), where P = probability.
   3. Odds ratio: the ratio of two odds. For a variable in logistic regression, it represents how the odds change with a 1-unit increase in that variable while all other variables are held constant.
   4. Logit: In logistic regression, we need a function that links the linear combination of independent variables, which can take any real value, to a probability between 0 and 1. That link function is the logit, the natural log of the odds: logit(p) = ln(odds) = ln(p / (1 - p)).
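
The terms above can be checked numerically. The following is a minimal sketch; the helper names `odds`, `logit`, and `sigmoid`, and the coefficient values `w` and `b`, are illustrative assumptions and not part of the notebook.

```python
import numpy as np

def odds(p):
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log-odds: maps a probability in (0, 1) to the whole real line."""
    return np.log(odds(p))

def sigmoid(z):
    """Inverse of the logit: maps any real value back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

print(odds(0.8))    # about 4: the event is four times as likely to occur as not
print(logit(0.8))   # ln(4), about 1.386
print(logit(0.5))   # 0.0: even odds

# Odds ratio for a 1-unit increase in x, with hypothetical coefficients w and b
w, b = 0.7, -1.0
p_before = sigmoid(w * 2.0 + b)
p_after = sigmoid(w * 3.0 + b)
odds_ratio = odds(p_after) / odds(p_before)
print(odds_ratio, np.exp(w))  # the two values agree: the odds ratio equals e^w
```

Under these assumptions, the odds ratio depends only on the coefficient of the variable, not on the starting value of x.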


 
4. Why Apply Logistic Regression?

   Linear regression does not give a good fit for problems whose outcome has only two values (shown in the figure). Because a straight line cannot follow data clustered at 0 and 1, its predictions are less accurate.
![linear_vs_logistic_regression](1_linear_vs_logistic_regression.jpg)

    Linear Regression vs. Logistic Regression:

    1. In linear regression, the outcome is continuous and can take any of an infinite number of possible values, while in logistic regression the outcome has only a limited number of possible values, generally 0 or 1.

    2. Linear regression assumes a linear relationship between the dependent and independent variables, whereas logistic regression assumes a linear relationship between the independent variables and the log-odds of the outcome, not the outcome itself.

    3. In both linear and logistic regression, strongly correlated independent variables (multicollinearity) make the coefficient estimates unstable and hard to interpret, so such variables are best avoided.
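
To illustrate why a straight line is a poor fit for a two-valued outcome, here is a small sketch on made-up data (hours of study versus pass/fail). The data and the logistic coefficients are assumptions chosen only for illustration.

```python
import numpy as np

# Made-up data: hours studied and whether the student passed (1) or failed (0)
hours = np.arange(10, dtype=float)
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares straight line through the 0/1 outcomes
slope, intercept = np.polyfit(hours, passed, deg=1)

# Predict for values outside the observed range
new_hours = np.array([-5.0, 20.0])
linear_pred = slope * new_hours + intercept
print(linear_pred)  # one value below 0 and one above 1: not valid probabilities

# A logistic curve (hypothetical coefficients) always stays between 0 and 1
logistic_pred = 1.0 / (1.0 + np.exp(-(1.5 * new_hours - 6.75)))
print(logistic_pred)
```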

5. What is the mathematics involved in logistic regression?
   The logistic regression curve bends because it is computed with the sigmoid function (also known as the logistic function, which gives logistic regression its name), given below:
   
   f(z) = 1/(1+e^(-z))

   Here z is the linear combination of the independent variables (weights times features, plus a bias). The function has an S-shaped curve, and its value always lies between 0 and 1, which is why it is used to solve classification problems with two possible outcomes.
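
A minimal sketch of the sigmoid function in numpy follows; the function name `sigmoid` and the sample inputs are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
values = sigmoid(z)
print(values)  # increases from near 0 to near 1, with 0.5 at z = 0
```

Large negative inputs give values close to 0, large positive inputs give values close to 1, and z = 0 gives exactly 0.5.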
   
   
6. How to implement the logistic regression?

   Logistic regression uses the sigmoid function to make predictions for categorical outcomes.

   It sets a cut-off value, usually 0.5. When the predicted output of the logistic curve exceeds the cut-off, the data point is assigned to the positive category; otherwise it is assigned to the negative category.

   For example,

   In the diabetes prediction model, if the output exceeds the cut-off, the prediction is Yes for diabetes; otherwise it is No.
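
The cut-off rule can be sketched in a few lines. The probabilities below are made-up values used only to show the behaviour around 0.5.

```python
import numpy as np

# Made-up predicted probabilities from a logistic curve
probabilities = np.array([0.10, 0.49, 0.50, 0.51, 0.93])
cutoff = 0.5

# 1 (e.g. diabetic) only when the probability strictly exceeds the cut-off
labels = np.where(probabilities > cutoff, 1, 0)
print(labels)  # [0 0 0 1 1]
```

With a strict comparison, a probability of exactly 0.5 falls into the negative class.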

7. How to measure the performance of logistic regression?

   To measure the performance of a model that solves a classification problem, the confusion matrix is used.

   Key terms:

       – TN stands for True Negatives (the predicted negative value matches the actual negative value)
       – FP stands for False Positives (the actual value was negative, but the model predicted a positive value)
       – FN stands for False Negatives (the actual value was positive, but the model predicted a negative value)
       – TP stands for True Positives (the predicted positive value matches the actual positive value)

   For a good model, the numbers of False Positives and False Negatives should both be low.
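
The four counts can be computed directly with numpy. This is a minimal sketch; the function name `confusion_counts` and the label arrays are made up for illustration.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels coded as 0 and 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return tn, fp, fn, tp

# Made-up actual and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp)  # 3 1 1 3

# Accuracy is the share of correct predictions among all predictions
accuracy = (tp + tn) / (tn + fp + fn + tp)
print(accuracy)  # 0.75
```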

 
8. What are the key features of logistic regression?

   1. It is used for predicting a categorical dependent variable from a given set of independent variables.

   2. It predicts the outcome of a categorical variable, which is discrete in nature: Yes or No, 0 or 1, True or False, etc. Instead of giving the exact value 0 or 1, it outputs the probability, between 0 and 1, that a data point belongs to the positive class.

   3. It is similar to linear regression. The main difference is that linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.

9. Types of Logistic Regression

   There are three types:

   1. Binomial: Binomial logistic regression deals with problems whose target variable has only two possible values, 0 or 1,
      which can signify Yes/No, True/False, Dead/Alive, and other two-category values.
   2. Multinomial: Multinomial logistic regression deals with problems whose target variable can take three or more values that are unordered in nature. Those values have no quantitative significance.
      For example: Type 1 House, Type 2 House, Type 3 House, etc.

   3. Ordinal: Ordinal logistic regression, like multinomial logistic regression, deals with problems whose target variable takes three or more values. The main difference is that, unlike multinomial, those values are ordered, so their ranking carries meaning.
      For example: evaluation of skill as Low, Average, Expert.

<h2>[2]. Workflow for the logistic regression model</h2>

1. Set the learning rate and the number of iterations (this is done in the __init__() function); initialize the weight and bias values, to zero in this implementation (this is done in the fit() function).
2. Build the logistic regression function (sigmoid function).
3. Update the parameters using the gradient descent algorithm in each iteration. After the given number of iterations, the final weight and bias are taken as the model; with a suitable learning rate they should have a low cost.
4. Finally, use the predict() function to determine the class of each data point and evaluate the model.
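
The workflow can be sketched on a tiny made-up dataset to see the cost fall as the parameters are updated. The notebook does not name its cost function; the sketch below assumes the usual binary cross-entropy (log loss), and the data, the learning rate of 0.1, and the helper names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy; clipping avoids log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Made-up data: one feature, six data points
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Step 1: hyperparameters and zero-initialized parameters
learning_rate = 0.1
no_of_iterations = 200
w = np.zeros(X.shape[1])
b = 0.0

costs = []
for _ in range(no_of_iterations):
    y_hat = sigmoid(X @ w + b)           # step 2: sigmoid of the linear combination
    costs.append(log_loss(y, y_hat))     # cost before this update
    dw = X.T @ (y_hat - y) / len(y)      # step 3: gradients
    db = np.mean(y_hat - y)
    w = w - learning_rate * dw           # gradient descent update
    b = b - learning_rate * db

print(costs[0], costs[-1])  # starts at ln(2), about 0.693, and ends lower
```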

<h3>Note:
    
    1. The learning rate and the number of iterations are called hyperparameters of the model because they are set outside the model. They are not learned from the dataset!
    
    2. Weight and bias are called parameters: their values are initialized (to zero in this implementation) and then updated during the training phase of the model. They are learned from the dataset!
</h3>

<h3>[2.1]. First of all, we import the dependencies</h3>

We will use only the numpy library for the implementation of logistic regression.

In [1]:
# imports numpy library
import numpy as np # to create arrays

<h3>[2.2]. Write code to build logistic regression model</h3>

Create a class for the logistic regression algorithm. In this class we need 4 functions:
1. __init__(): function to initialize the hyperparameters of the class.
2. fit(): function to fit the dataset to our model; X and y of the training dataset will be the parameters of the fit() function.
3. update_weights(): function where we will use gradient descent to update the weights and bias towards their optimal values.
4. predict(): function that predicts the output on the test data after the model is trained.

In [2]:
# creates a logistic regression class so you do not need to re-implement logistic regression every time you use it!
# The self parameter is a reference to the current instance of the class, and is used to access variables that belong to the class.
# self is always the first argument to any instance method of a class!
class LogisticRegression():
    
    
    # declares learning rate and number of iterations (hyperparameters)
    def __init__(self, learning_rate, no_of_iterations): #initializes the parameters of the LogisticRegression class
        
        self.learning_rate = learning_rate
        self.no_of_iterations = no_of_iterations
     
    
    # trains the model with the dataset; X: feature matrix, and Y: target column
    def fit(self, X, Y): #fits the training dataset to the LogisticRegression model
        
        # here, the first thing is to determine the number of data points and the number of input features in your dataset
        # number of data points in the dataset (number of rows) --> m
        # number of input features in the dataset (number of columns) --> n
        self.m, self.n = X.shape
        
        # m is needed to find the derivatives and n is needed to find the size of the weight vector
        # initializes weight and bias values
        self.w = np.zeros(self.n) #numpy array with the weight of each feature set to 0
        self.b = 0
        
        self.X = X
        self.Y = Y
        
        # implementing gradient descent for optimization
        # runs the gradient descent update once per iteration to update the weight and bias values
        for i in range(self.no_of_iterations): #use self.no_of_iterations: a bare no_of_iterations is not defined inside fit()
            self.update_weights()
    
    # updates the weight and bias in each iteration
    def update_weights(self): #updates the weight and bias to move towards the optimal model
        
        # Y_hat value(sigmoid function)
        # Y_hat = 1 / (1 + np.exp(-z))
        Y_hat = 1 / (1 + np.exp(-(self.X.dot(self.w) + self.b))) #dot represents matrix multiplication
        
        # derivatives
        dw = (1/self.m)*np.dot((self.X.T), (Y_hat - self.Y)) #T is the transpose of X, taken so that the dimensions match for matrix multiplication
        db = (1/self.m)*np.sum(Y_hat - self.Y)
        
        # update the weight and bias using gradient descent
        self.w = self.w - self.learning_rate * dw
        self.b = self.b - self.learning_rate * db
        
    
    # applies the sigmoid equation and the decision boundary
    def predict(self, X): #predicts the y value on the test data after the model is trained
        # if Y_pred > 0.5 => y = 1
        # if Y_pred <= 0.5 => y = 0
        Y_pred = 1 / (1 + np.exp(-(X.dot(self.w) + self.b))) #here self.X is not required as we are predicting on the X passed in
        
        # converts the predicted decimal value to binary value
        Y_pred = np.where(Y_pred > 0.5 , 1, 0) # if Y_pred > 0.5 => 1 else 0
        
        return Y_pred
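
The class above does not name the cost function it minimizes. Its update rules match the gradient of the binary cross-entropy cost, so, assuming that cost, the mathematics behind `update_weights()` can be written as follows, where α denotes the learning rate and m the number of data points:

```latex
\hat{y}^{(i)} = \sigma\!\left(w^\top x^{(i)} + b\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

J(w, b) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \ln \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \ln\!\left(1 - \hat{y}^{(i)}\right) \right]

\frac{\partial J}{\partial w} = \frac{1}{m} X^\top \left(\hat{Y} - Y\right),
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)

w \leftarrow w - \alpha \, \frac{\partial J}{\partial w},
\qquad
b \leftarrow b - \alpha \, \frac{\partial J}{\partial b}
```

The two partial derivatives correspond to `dw` and `db` in the code, and the last line corresponds to the updates of `self.w` and `self.b`.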

<h3>[2.3]. Train the model</h3>

In [3]:
# imports dependencies to train the model
import pandas as pd #to create a DataFrame (a tabular data structure)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [4]:
# data collection and analysis
# PIMA diabetes data
# loading a diabetes dataset to a pandas DataFrame
diabetes_dataset = pd.read_csv('data/diabetes.csv')

In [5]:
# prints the first 5 rows of the dataset
diabetes_dataset.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [6]:
# number of rows and columns in the dataset
diabetes_dataset.shape

(768, 9)

In [7]:
# gets the statistical measures of the data
diabetes_dataset.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [8]:
# tells how many target 0's and 1's
diabetes_dataset['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

0: Non-diabetic = 500
1: Diabetic = 268

In [9]:
# groups the data by outcome and computes the mean of each feature per group
diabetes_dataset.groupby('Outcome').mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


In [10]:
# separating the data and labels to get X and Y for training the model
features = diabetes_dataset.drop(columns = 'Outcome')
target = diabetes_dataset['Outcome']

In [11]:
print(features)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  
0                       0.627   50  
1                       0.351   31  


In [12]:
print(target)

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


Data Standardisation

In [13]:
# each feature has a different range of values, so standardizing is important to bring them to a similar range
scaler = StandardScaler()

In [14]:
scaler.fit(features)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [15]:
standardized_data = scaler.transform(features)
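
What the standardization does can be sketched with numpy alone: each column has its mean subtracted and is divided by its standard deviation. The sample below uses the Glucose and BMI values of the first four rows shown by `head()`; the variable names are illustrative.

```python
import numpy as np

# Glucose and BMI of the first four rows of the dataset
X = np.array([[148.0, 33.6],
              [85.0, 26.6],
              [183.0, 23.3],
              [89.0, 28.1]])

mean = X.mean(axis=0)   # per-column mean
std = X.std(axis=0)     # per-column population standard deviation (ddof=0), as StandardScaler uses

X_standardized = (X - mean) / std
print(X_standardized)

# After standardization every column has mean 0 and standard deviation 1
print(X_standardized.mean(axis=0), X_standardized.std(axis=0))
```

The values differ from the notebook output because the mean and standard deviation here come from four rows only, not from all 768.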

In [16]:
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [17]:
features = standardized_data
target = diabetes_dataset['Outcome']

In [18]:
print(features)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [19]:
print(target)

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


Train test split

In [20]:
# test_size = 0.2: 20% of the data goes to the test set and 80% to the training set
# random_state = 2: anyone who uses the same value gets the same split; used to make the split reproducible
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state = 2)

In [21]:
print(features.shape, x_train.shape, x_test.shape)

(768, 8) (614, 8) (154, 8)
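
The idea behind the split can be sketched with numpy: shuffle the row indices with a fixed seed and cut off the first 20% as test data. This is only an illustration; `simple_train_test_split` is a hypothetical helper, and it does not select the same rows as `train_test_split` with `random_state = 2`.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=2):
    """Shuffle the rows with a fixed seed and hold out a share as test data."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))            # shuffled row indices
    n_test = int(np.ceil(test_size * len(X)))    # size of the test set, rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Placeholder arrays with the same shape as the diabetes features and target
X_demo = np.arange(768 * 8).reshape(768, 8)
y_demo = np.arange(768) % 2

x_tr, x_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(x_tr.shape, x_te.shape)  # (614, 8) (154, 8)
```

Calling the function again with the same seed returns the same rows, which is the purpose of fixing the random state.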


Training the model

In [22]:
# creates an instance (object) of the LogisticRegression class
# remember to pass all the parameters declared in the __init__() function of the LogisticRegression class
classifier = LogisticRegression(learning_rate = 0.01, no_of_iterations = 1000)

In [23]:
# now you can call the methods of the LogisticRegression class on the created instance
classifier.fit(x_train, y_train) #calls the fit() method defined in the LogisticRegression class

<h3>[2.4]. Model Evaluation</h3>

Accuracy score

In [24]:
# accuracy score on the training data
x_train_prediction = classifier.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)
print('Accuracy score on the training data: ', training_data_accuracy )

Accuracy score on the training data:  0.7768729641693811


In [25]:
# accuracy score on the test data
x_test_prediction = classifier.predict(x_test)
test_data_accuracy = accuracy_score(y_test, x_test_prediction)
print('Accuracy score on the test data: ', test_data_accuracy )

Accuracy score on the test data:  0.7662337662337663
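
The accuracy score is simply the fraction of predictions that match the true labels. A minimal numpy sketch follows; the function name `accuracy` and the labels are made up for illustration.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))  # 0.8

# The scores reported above are consistent with 477 of 614 training rows
# and 118 of 154 test rows being classified correctly
print(477 / 614, 118 / 154)
```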


<h3>[2.5]. Making a predictive system</h3>

In [26]:
input_data = (5,166,72,19,175,25.8,0.587,51)

# changing the input_data to numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

# standardize the input data
std_data = scaler.transform(input_data_reshaped)
print(std_data)

prediction = classifier.predict(std_data)
print(prediction)

if prediction[0] == 0:
    print('The person is not diabetic')
else:
    print('The person is diabetic')


[[ 0.3429808   1.41167241  0.14964075 -0.09637905  0.82661621 -0.78595734
   0.34768723  1.51108316]]
[1]
The person is diabetic


<h2>[3]. References:</h2>

[1]. https://dzone.com/articles/machinex-simplifying-logistic-regression

[2]. https://www.analyticsvidhya.com/blog/2021/04/beginners-guide-to-logistic-regression-using-python/

[3]. https://www.youtube.com/c/Siddhardhan/playlists

[4]. https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

In [27]:
#Mathematical constant e
print(np.exp(1)) #e (2.718281828459045)
print(np.exp(5)) #e^5 (148.4131591025766)

2.718281828459045
148.4131591025766
