### Logistic Regression

Logistic regression is a statistical method used for binary classification, meaning it predicts the probability of an outcome that can have only one of two possible values (e.g., 0 or 1, Yes or No, Success or Failure). Unlike linear regression, which outputs continuous values, logistic regression is specifically designed for situations where the target variable is categorical.

#### Key Concepts

1. Odds and Probability
   - Probability: The likelihood of an event occurring, ranging from 0 to 1.
   - Odds: The ratio of the probability of an event occurring to the probability of it not occurring, p / (1 - p).
2. Log-Odds and Logistic Function
   - Logistic regression uses the logistic function, also called the sigmoid function, to map a linear combination of the independent variables to a probability between 0 and 1 for the dependent variable.
   - The log-odds of the event, also known as the logit, is modeled as a linear function of the independent variables.
3. Evaluation Metrics
   - Confusion Matrix: Summarizes the performance of the classifier by providing counts of true positives, true negatives, false positives, and false negatives.
   - Accuracy: The ratio of correctly predicted outcomes to total outcomes.
   - Precision, Recall, and F1-Score: Additional metrics used especially when dealing with imbalanced datasets.
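
The relationship between probability, odds, and log-odds described above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names (`odds`, `logit`, `sigmoid`) and the example probability 0.8 are chosen for illustration and do not come from the dataset.

```python
import math

def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log-odds: the natural log of the odds."""
    return math.log(odds(p))

def sigmoid(z):
    """Logistic function: maps any real log-odds value back to a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.8  # illustrative probability, not taken from the data
print(odds(p))            # about 4: the event is four times as likely to occur as not
print(logit(p))           # about 1.386
print(sigmoid(logit(p)))  # about 0.8: the sigmoid undoes the logit
```

A log-odds of 0 corresponds to a probability of 0.5, which is why a fitted model's linear predictor changes sign exactly where the predicted class flips under the usual 0.5 threshold.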



#### Understanding F1-score

The F1-score is a performance metric for classification models that is especially useful when classes are imbalanced. It is the harmonic mean of precision and recall, two metrics that capture different kinds of classification error.

With TP, FP, and FN denoting the counts of true positives, false positives, and false negatives:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
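
Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which a plain average would hide. The sketch below illustrates this with made-up values; the function name `f1_score_from_rates` and the numbers 0.9 and 0.1 are hypothetical and are not results from this notebook.

```python
def f1_score_from_rates(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier: very precise, but it misses most positives.
precision, recall = 0.9, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f1_score_from_rates(precision, recall)

print(arithmetic_mean)  # 0.5
print(round(f1, 2))     # 0.18
```

The arithmetic mean suggests a middling model, while the F1-score reflects that nine out of ten positives are being missed.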

In [2]:
#import libraries

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [3]:
#load dataset
data = pd.read_csv('customer_data.csv')

data.head()

| Unnamed: 0 | Age | Income | Credit Score | Purchased |
|---|---|---|---|---|
| 0 | 62 | 13959 | 452 | 1 |
| 1 | 65 | 5957 | 341 | 0 |
| 2 | 18 | 7469 | 574 | 1 |
| 3 | 21 | 11752 | 738 | 0 |
| 4 | 21 | 12797 | 507 | 1 |


In [4]:
#populate x and y variables
X = data[['Age', 'Income', 'Credit Score']]
y = data['Purchased']

#split the data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

In [6]:
#Logistic regression using statsmodels

x_train_1 = sm.add_constant(x_train)

model = sm.Logit(y_train, x_train_1)
result = model.fit()

result.summary()

Optimization terminated successfully.
         Current function value: 0.663766
         Iterations 4


| Dep. Variable: | Purchased | No. Observations: | 60 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 56 |
| Method: | MLE | Df Model: | 3 |
| Date: | Tue, 29 Oct 2024 | Pseudo R-squ.: | 0.02271 |
| Time: | 13:55:09 | Log-Likelihood: | -39.826 |
| converged: | True | LL-Null: | -40.752 |
| Covariance Type: | nonrobust | LLR p-value: | 0.6038 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.6324 | 1.321 | 0.479 | 0.632 | -1.957 | 3.222 |
| Age | 0.0088 | 0.017 | 0.506 | 0.613 | -0.025 | 0.043 |
| Income | -5.163e-05 | 8.85e-05 | -0.583 | 0.560 | -0.000 | 0.000 |
| Credit Score | -0.0016 | 0.002 | -0.914 | 0.360 | -0.005 | 0.002 |
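
To connect the coefficient table to the log-odds formulation, the fitted coefficients can be plugged into the linear predictor by hand. The sketch below uses the rounded coefficients printed above and, purely as an illustrative input, the first row shown by `data.head()` (Age 62, Income 13959, Credit Score 452). Because the coefficients are rounded, the result only approximates what the fitted model itself would return.

```python
import math

# Rounded coefficients copied from the summary table above.
coef = {"const": 0.6324, "Age": 0.0088, "Income": -5.163e-05, "Credit Score": -0.0016}

# First row of data.head(), used only as an illustrative input.
row = {"Age": 62, "Income": 13959, "Credit Score": 452}

# Linear predictor: the log-odds that Purchased == 1.
log_odds = coef["const"] + sum(coef[name] * value for name, value in row.items())

# The sigmoid converts the log-odds into a probability.
probability = 1 / (1 + math.exp(-log_odds))

print(round(log_odds, 3))     # about -0.266
print(round(probability, 3))  # about 0.434
```

The negative log-odds puts the estimated probability just below 0.5, so under the usual 0.5 threshold this row would be classified as 0.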


### Interpreting the result summary

1. Pseudo R-squared: Similar to R-squared in linear regression, pseudo R-squared gives an idea of model fit but does not have a direct interpretation as variance explained. Higher values indicate a better fit, although this value tends to be lower than R-squared values in linear models. Here it is 0.02271, so the predictors add very little over the intercept-only model.
2. LLR p-value: The p-value for the likelihood ratio test. A low p-value (e.g., < 0.05) indicates that the predictors collectively improve the model fit significantly compared to the null model. Here it is 0.6038, well above 0.05, so there is no evidence that Age, Income, and Credit Score together improve the fit.
3. LL-Null: The log-likelihood of the model with no predictors (intercept-only model). It is the baseline (-40.752 here) against which the fitted model's log-likelihood (-39.826) is compared.
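
These three quantities are linked, and the reported values can be approximately reproduced from the two log-likelihoods in the summary. The sketch below assumes the pseudo R-squared is McFadden's (1 - LL / LL-Null) and uses a closed-form chi-square tail probability that holds only for 3 degrees of freedom, chosen here to avoid extra dependencies. The inputs are the rounded figures from the summary, so the results match only to about three decimal places.

```python
import math

ll_model = -39.826  # Log-Likelihood from the summary
ll_null = -40.752   # LL-Null from the summary
df_model = 3        # Df Model from the summary (number of predictors)

# McFadden's pseudo R-squared: 1 - LL / LL-Null.
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood ratio statistic: twice the log-likelihood gain over the null model.
llr = 2 * (ll_model - ll_null)

# Chi-square upper-tail probability; this closed form is valid only for df = 3.
# For a general df, scipy.stats.chi2.sf(llr, df_model) gives the same quantity.
p_value = math.erfc(math.sqrt(llr / 2)) + math.sqrt(2 * llr / math.pi) * math.exp(-llr / 2)

print(round(pseudo_r2, 4))  # about 0.0227
print(round(llr, 3))        # 1.852
print(round(p_value, 3))    # about 0.604
```

A log-likelihood gain of less than one unit for three added predictors is what drives both the small pseudo R-squared and the large p-value.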

In [7]:
#using sklearn

regression = LogisticRegression()

regression.fit(x_train, y_train)

In [11]:
y_prediction = regression.predict(x_test)

In [12]:
y_prediction

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1])

In [13]:
y_test

26    0
86    1
2     1
55    0
75    1
93    1
16    0
73    0
54    1
95    0
53    1
92    0
78    0
13    1
7     1
30    1
22    0
24    0
33    1
8     1
43    1
62    1
3     0
71    0
45    1
48    0
6     1
99    1
82    0
76    0
60    1
80    0
90    0
68    1
51    1
27    0
18    1
56    1
63    1
74    0
Name: Purchased, dtype: int64

In [16]:
accuracy = accuracy_score(y_test, y_prediction)
accuracy

0.5

In [17]:
#confusion matrix
cf = confusion_matrix(y_test, y_prediction)
cf

array([[15,  3],
       [17,  5]])
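
The precision, recall, and F1-score defined earlier can be read directly off this confusion matrix. The sketch below hard-codes the four counts from the output above and treats class 1 (Purchased) as the positive class; scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], with rows for true labels and columns for predictions.

```python
# Counts copied from the confusion matrix output above: [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = [[15, 3], [17, 5]]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted purchases, how many were real
recall = tp / (tp + fn)      # of the real purchases, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)          # 0.5
print(precision)         # 0.625
print(round(recall, 3))  # 0.227
print(round(f1, 3))      # 0.333
```

The model predicts class 0 for 32 of the 40 test rows, so it finds only 5 of the 22 actual purchases. The already imported `classification_report(y_test, y_prediction)` reports the same metrics for each class.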