# Lecture 12a, Logistic Regression

## Available libraries and differences   
- [Scikit learn function and parameters](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)  

- [Scikit learn user guide](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)   

- [Statsmodels](https://www.statsmodels.org/devel/discretemod.html)   

- [Pingouin docs](https://pingouin-stats.org/build/html/generated/pingouin.logistic_regression.html)   

- [Pingouin source code](https://pingouin-stats.org/build/html/_modules/pingouin/regression.html#logistic_regression). Check the docstring for the differences from scikit-learn:

  > **Notes**
  > Caution: This function is a wrapper around the `sklearn.linear_model.LogisticRegression` class. However, Pingouin internally disables the L2 regularization and changes the default solver to 'newton-cg' to obtain results that are similar to R and statsmodels.

- Other: TensorFlow, PyTorch.

[Different use of statsmodels and scikit-learn: stats or predictions](https://stats.stackexchange.com/a/48578)    
[Stats VS Machine Learning: complementarity and differences](https://stats.stackexchange.com/questions/6/the-two-cultures-statistics-vs-machine-learning)

## Three datasets: Wikipedia, program effectiveness, iris flowers
The Wikipedia example is typed in by hand; the other two are loaded from Python libraries.

[Program_Effectiveness Data Source and Info](https://www.statsmodels.org/devel/datasets/generated/spector.html)   

[Program_Effectiveness Original Source and Info](https://pages.stern.nyu.edu/~wgreene/Text/econometricanalysis.htm)  

**Remember**:   
Logistic Regression is a "supervised learning" model.
We know the labels. We have a "y" target variable. 


Goal of the algorithm: classify instances (observations). Typical settings:
* binary
* multi-class
* one-vs-rest (a multi-class strategy that fits one binary classifier per class)

Classes can be:  
* unordered (nominal): no class ranks above another.
* ordered (ordinal): some classes have "more" of what defines a class. Useful in socio-economic studies: Post Office code.   

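Important notes:   
Sigmoid function property: non-constant "returns".   
Slope: the slope of the sigmoid function is not constant. It depends on the value of the input, so the same change in a feature shifts the predicted probability by a different amount at different starting points.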
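
The non-constant slope can be checked numerically. The following is a minimal sketch, not part of the original notebook: the helper names `sigmoid` and `sigmoid_slope` are illustrative, and the slope uses the standard identity that the derivative of the sigmoid equals p(1 - p).

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_slope(z):
    """Derivative of the sigmoid, which equals p * (1 - p)."""
    p = sigmoid(z)
    return p * (1.0 - p)

# The slope peaks at z = 0 and flattens out in both tails.
z = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
for zi, p, s in zip(z, sigmoid(z), sigmoid_slope(z)):
    print(f"z={zi:+.1f}  p={p:.3f}  slope={s:.3f}")
```

The same one-unit change in the input therefore moves the probability most near p = 0.5 and very little when p is already close to 0 or 1.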

How much does a factor affect the result?
What will the result be?
What is the probability of the result?

# A. Import the necessary modules.   
You may add libraries here as you go on with your work, but all imports should be in the first code cell.

In [2]:
# data management libraries
import pandas as pd
import numpy as np

# scikit learn
from sklearn.linear_model import LogisticRegression
# metrics libraries
from sklearn import metrics
from sklearn.datasets import load_iris

# import pingouin stats library
import pingouin as pg

# import statsmodels stats library
import statsmodels.api as sm

# https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_pipeline_display.html?highlight=grid%20search#displaying-a-grid-search-over-a-pipeline-with-a-classifier

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# B. Inspect the data and understand its features.   
See which are the variables, what is their type, what are the values that the variables take.  
At this step, you should think about possible relations that you ought to examine.   
Which do you think might be more important?

## The wikipedia example

In [3]:
# First, let's create the dataframe
Hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
Pass = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
df = pd.DataFrame({'HoursStudy': Hours, 'PassExam': Pass})

In [4]:
df.sample(4)

| | HoursStudy | PassExam |
|---|---|---|
| 4 | 1.5 | 0 |
| 17 | 4.75 | 1 |
| 1 | 0.75 | 0 |
| 19 | 5.5 | 1 |


In [5]:
# X is the independent variable
X = df['HoursStudy']

In [6]:
X.sample(2)

2     1.0
18    5.0
Name: HoursStudy, dtype: float64

In [7]:
# y is the dependent variable
y = df['PassExam']

## Using logistic regression as a linear regression (example using Pingouin)

In [8]:
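# penalty="l2" is forwarded to scikit-learn and re-enables the L2 regularization
# that Pingouin switches off by default, so these coefficients are shrunk.
lr = pg.logistic_regression(X, y, penalty="l2").round(3)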
lr

| | names | coef | se | z | pval | CI[2.5%] | CI[97.5%] |
|---|---|---|---|---|---|---|---|
| 0 | Intercept | -3.139 | 1.438 | -2.183 | 0.029 | -5.958 | -0.32 |
| 1 | HoursStudy | 1.148 | 0.498 | 2.308 | 0.021 | 0.173 | 2.124 |
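
To read these coefficients as answers to "what is the probability of the result?", they can be plugged into the sigmoid by hand. This is a small sketch using the rounded values from the table; `prob_pass`, `odds_ratio` and `boundary` are illustrative names, not Pingouin outputs.

```python
import math

# Rounded coefficients copied from the Pingouin table above.
intercept = -3.139
coef_hours = 1.148

def prob_pass(hours):
    """Predicted probability of passing after `hours` of study."""
    z = intercept + coef_hours * hours
    return 1.0 / (1.0 + math.exp(-z))

odds_ratio = math.exp(coef_hours)    # factor by which the odds grow per extra hour
boundary = -intercept / coef_hours   # hours at which P(pass) is exactly 0.5

for h in (1, 2, 3, 4, 5):
    print(f"{h} h of study -> P(pass) = {prob_pass(h):.3f}")
print(f"odds ratio per extra hour: {odds_ratio:.2f}")
print(f"50% point: {boundary:.2f} hours")
```

The coefficient is constant on the log-odds scale, but the change in probability per extra hour is not, which is the non-constant slope of the sigmoid at work.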


In [9]:
pg.logistic_regression(X.to_numpy(), y.to_numpy(), coef_only=True, penalty="l2")

array([-3.13904399,  1.14843327])
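
The calls above pass `penalty="l2"`, so the fit is regularized. To see what the penalty does, here is a from-scratch sketch (NumPy only, Newton's method) that fits the same data with and without an L2 term. The function `fit_logistic_newton` and its `l2` argument are hypothetical helpers written for this note; treating `l2=1.0` as the counterpart of scikit-learn's default `C=1.0` with an unpenalized intercept is an assumption of the sketch, not something the notebook verifies.

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1], dtype=float)

def fit_logistic_newton(x, y, l2=0.0, n_iter=50):
    """Fit intercept + slope by Newton's method; l2 penalizes the slope only."""
    design = np.column_stack([np.ones_like(x), x])   # columns: [1, x]
    penalty = np.diag([0.0, l2])                     # intercept left unpenalized
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        grad = design.T @ (p - y) + penalty @ beta
        hess = design.T @ (design * (p * (1.0 - p))[:, None]) + penalty
        beta = beta - np.linalg.solve(hess, grad)
    return beta

beta_plain = fit_logistic_newton(hours, passed)          # no regularization
beta_l2 = fit_logistic_newton(hours, passed, l2=1.0)     # with L2 penalty
print("unpenalized [intercept, slope]:", beta_plain.round(4))
print("L2-penalized [intercept, slope]:", beta_l2.round(4))
```

The penalized slope is pulled toward zero relative to the unpenalized one, which is why results from a regularized scikit-learn fit and from statsmodels or R can disagree on the same data.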

## The program effectiveness data

In [10]:
spector_data = sm.datasets.spector.load()

In [11]:
spector_data

<class 'statsmodels.datasets.utils.Dataset'>

In [12]:
spector_data.endog

0     0.0
1     0.0
2     0.0
3     0.0
4     1.0
5     0.0
6     0.0
7     0.0
8     0.0
9     1.0
10    0.0
11    0.0
12    0.0
13    1.0
14    0.0
15    0.0
16    0.0
17    0.0
18    0.0
19    1.0
20    0.0
21    1.0
22    0.0
23    0.0
24    1.0
25    1.0
26    1.0
27    0.0
28    1.0
29    1.0
30    0.0
31    1.0
Name: GRADE, dtype: float64

In [13]:
spector_data.exog = sm.add_constant(spector_data.exog, prepend=False)

In [14]:
spector_data.exog

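| | GPA | TUCE | PSI | const |
|---|---|---|---|---|
| 0 | 2.66 | 20.0 | 0.0 | 1.0 |
| 1 | 2.89 | 22.0 | 0.0 | 1.0 |
| 2 | 3.28 | 24.0 | 0.0 | 1.0 |
| 3 | 2.92 | 12.0 | 0.0 | 1.0 |
| 4 | 4.0 | 21.0 | 0.0 | 1.0 |
| 5 | 2.86 | 17.0 | 0.0 | 1.0 |
| 6 | 2.76 | 17.0 | 0.0 | 1.0 |
| 7 | 2.87 | 21.0 | 0.0 | 1.0 |
| 8 | 3.03 | 25.0 | 0.0 | 1.0 |
| 9 | 3.92 | 29.0 | 0.0 | 1.0 |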


In [15]:
spector_data

<class 'statsmodels.datasets.utils.Dataset'>

In [16]:
type(spector_data)

statsmodels.datasets.utils.Dataset

In [17]:
print(spector_data.exog.head())

    GPA  TUCE  PSI  const
0  2.66  20.0  0.0    1.0
1  2.89  22.0  0.0    1.0
2  3.28  24.0  0.0    1.0
3  2.92  12.0  0.0    1.0
4  4.00  21.0  0.0    1.0


In [18]:
print(spector_data.endog.head())

0    0.0
1    0.0
2    0.0
3    0.0
4    1.0
Name: GRADE, dtype: float64


In [19]:
X = pd.DataFrame(spector_data.exog)

In [20]:
y = pd.DataFrame(spector_data.endog)
y.sample(10)

| | GRADE |
|---|---|
| 25 | 1.0 |
| 9 | 1.0 |
| 5 | 0.0 |
| 28 | 1.0 |
| 2 | 0.0 |
| 16 | 0.0 |
| 24 | 1.0 |
| 17 | 0.0 |
| 22 | 0.0 |
| 0 | 0.0 |


## Using logistic regression as a linear regression (example using Statsmodels)

In [21]:
logit_mod = sm.Logit(spector_data.endog, spector_data.exog)

In [22]:
logit_mod = sm.Logit(y, X)

In [23]:
logit_res = logit_mod.fit(disp=0)
print("Parameters:\n",logit_res.params)

Parameters:
 GPA       2.826113
TUCE      0.095158
PSI       2.378688
const   -13.021347
dtype: float64


In [24]:
logit_res.summary()

| Field | Value | Field | Value |
|---|---|---|---|
| Dep. Variable: | GRADE | No. Observations: | 32 |
| Model: | Logit | Df Residuals: | 28 |
| Method: | MLE | Df Model: | 3 |
| Date: | Tue, 20 May 2025 | Pseudo R-squ.: | 0.374 |
| Time: | 16:29:24 | Log-Likelihood: | -12.89 |
| converged: | True | LL-Null: | -20.592 |
| Covariance Type: | nonrobust | LLR p-value: | 0.001502 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| GPA | 2.8261 | 1.263 | 2.238 | 0.025 | 0.351 | 5.301 |
| TUCE | 0.0952 | 0.142 | 0.672 | 0.501 | -0.182 | 0.373 |
| PSI | 2.3787 | 1.065 | 2.234 | 0.025 | 0.292 | 4.465 |
| const | -13.0213 | 4.931 | -2.641 | 0.008 | -22.687 | -3.356 |


In [25]:
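# Contrast matrix for a joint test: each row of A selects one coefficient.
# Dropping the first row excludes GPA (not the constant, which is the last
# column here because of prepend=False), so this tests TUCE = PSI = const = 0.
A = np.identity(len(logit_res.params))
A = A[1:,:]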

In [26]:
logit_res.f_test(A)

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=2.8850001268853, p=0.05331525256957256, df_denom=28, df_num=3>

In [27]:
margeff = logit_res.get_margeff()
print(margeff.summary())

        Logit Marginal Effects       
Dep. Variable:                  GRADE
Method:                          dydx
At:                           overall
                dy/dx    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
GPA            0.3626      0.109      3.313      0.001       0.148       0.577
TUCE           0.0122      0.018      0.686      0.493      -0.023       0.047
PSI            0.3052      0.092      3.304      0.001       0.124       0.486
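
The marginal effects and the coefficients are tied together by the sigmoid slope: for a logit model the effect of a variable at one observation is its coefficient times p(1 - p), so the average marginal effect is the coefficient times the sample average of p(1 - p). The sketch below checks this on the rounded numbers printed above; it assumes the defaults shown in the output (`dydx`, `overall`) and that every variable, including the binary PSI, is treated as continuous.

```python
# Rounded values copied from the coefficient and marginal-effects outputs above.
coef = {"GPA": 2.8261, "TUCE": 0.0952, "PSI": 2.3787}
avg_marginal_effect = {"GPA": 0.3626, "TUCE": 0.0122, "PSI": 0.3052}

# If AME_j = coef_j * mean(p * (1 - p)), the ratio is the same for every variable.
ratios = {name: avg_marginal_effect[name] / coef[name] for name in coef}
for name, ratio in ratios.items():
    print(f"{name}: dy/dx / coef = {ratio:.4f}")
```

All three ratios agree up to rounding, so the common factor is an estimate of the average sigmoid slope over the 32 observations.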


In [28]:
print(logit_res.summary())

                           Logit Regression Results                           
Dep. Variable:                  GRADE   No. Observations:                   32
Model:                          Logit   Df Residuals:                       28
Method:                           MLE   Df Model:                            3
Date:                Tue, 20 May 2025   Pseudo R-squ.:                  0.3740
Time:                        16:29:24   Log-Likelihood:                -12.890
converged:                       True   LL-Null:                       -20.592
Covariance Type:            nonrobust   LLR p-value:                  0.001502
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
GPA            2.8261      1.263      2.238      0.025       0.351       5.301
TUCE           0.0952      0.142      0.672      0.501      -0.182       0.373
PSI            2.3787      1.065      2.234      0.0

In [29]:
print(logit_res.summary2())

                         Results: Logit
Model:              Logit            Method:           MLE      
Dependent Variable: GRADE            Pseudo R-squared: 0.374    
Date:               2025-05-20 16:29 AIC:              33.7793  
No. Observations:   32               BIC:              39.6422  
Df Model:           3                Log-Likelihood:   -12.890  
Df Residuals:       28               LL-Null:          -20.592  
Converged:          1.0000           LLR p-value:      0.0015019
No. Iterations:     7.0000           Scale:            1.0000   
-----------------------------------------------------------------
            Coef.    Std.Err.     z     P>|z|    [0.025    0.975]
-----------------------------------------------------------------
GPA          2.8261    1.2629   2.2377  0.0252    0.3508   5.3014
TUCE         0.0952    0.1416   0.6722  0.5014   -0.1823   0.3726
PSI          2.3787    1.0646   2.2344  0.0255    0.2922   4.4652
const      -13.0213    4.9313  -2.6405  0.00
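
Several of the fit statistics in the two summaries can be recomputed from the two log-likelihoods alone. A minimal sketch using the rounded values printed above (the variable names are illustrative; the formulas are McFadden's pseudo R-squared and the usual AIC and BIC definitions):

```python
import math

llf = -12.890       # Log-Likelihood of the fitted model
llnull = -20.592    # LL-Null: log-likelihood of the intercept-only model
n_obs = 32          # No. Observations
n_params = 4        # GPA, TUCE, PSI and the constant

pseudo_r2 = 1.0 - llf / llnull                    # McFadden's pseudo R-squared
llr = 2.0 * (llf - llnull)                        # likelihood-ratio statistic
aic = 2 * n_params - 2 * llf
bic = n_params * math.log(n_obs) - 2 * llf

print(f"pseudo R-squared: {pseudo_r2:.3f}")
print(f"LLR statistic:    {llr:.3f}")
print(f"AIC:              {aic:.2f}")
print(f"BIC:              {bic:.2f}")
```

Small differences from the printed AIC and BIC come from using log-likelihoods rounded to three decimals.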

## Iris from scikit-learn

In [30]:
dataset = load_iris()

In [31]:
dataset.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [32]:
dataset.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [33]:
dataset.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [34]:
X, y = load_iris(return_X_y=True)

### Always hold out data for testing in real life.
### In these lecture notes I did not split the datasets, only to keep the focus on other issues.   
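
For reference, a hold-out split can be as simple as shuffling the row indices once and setting a fraction aside. The sketch below uses only NumPy and stand-in arrays; `holdout_split` is a hypothetical helper written for this note. In practice scikit-learn's `train_test_split` is the usual tool and also supports stratified splits.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the row indices once and set aside a fraction for testing."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(X))
    n_test = round(test_fraction * len(X))
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Stand-in arrays with the same shape as iris (150 rows, 4 features, 3 classes).
X_demo = np.arange(600, dtype=float).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

X_train, X_test, y_train, y_test = holdout_split(X_demo, y_demo)
print(X_train.shape, X_test.shape)
```

The model would then be fitted on the training part only and scored on the test part, so the reported accuracy reflects data the model has not seen.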

In [35]:
LogisticRegression?

In [36]:
clf = LogisticRegression(max_iter=400).fit(X, y)

In [37]:
clf.predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [38]:
np.set_printoptions(suppress=True)

In [39]:
clf.predict_proba(X)

array([[0.98155631, 0.01844368, 0.00000001],
       [0.97132739, 0.02867258, 0.00000003],
       [0.98526968, 0.01473031, 0.00000001],
       [0.97607553, 0.02392443, 0.00000004],
       [0.98521363, 0.01478635, 0.00000001],
       [0.97014304, 0.02985689, 0.00000007],
       [0.98676691, 0.01323307, 0.00000002],
       [0.97612741, 0.02387256, 0.00000003],
       [0.97964714, 0.02035283, 0.00000003],
       [0.96876852, 0.03123145, 0.00000003],
       [0.97617918, 0.0238208 , 0.00000002],
       [0.9752059 , 0.02479405, 0.00000004],
       [0.97423594, 0.02576404, 0.00000002],
       [0.99187865, 0.00812135, 0.        ],
       [0.9879641 , 0.0120359 , 0.        ],
       [0.98659131, 0.01340868, 0.00000001],
       [0.98792278, 0.01207721, 0.00000001],
       [0.98130199, 0.01869799, 0.00000002],
       [0.95598656, 0.04401337, 0.00000007],
       [0.98394996, 0.01605002, 0.00000002],
       [0.94610128, 0.05389863, 0.00000009],
       [0.98153374, 0.01846623, 0.00000003],
       [0.

In [40]:
clf.score(X, y)

0.9733333333333334

In [41]:
clf.n_features_in_?

In [42]:
clf.score?

In [43]:
# link to liblinear library, go to datasets, phishing dataset
# LogisticRegression?

clf.classes_

array([0, 1, 2])

In [44]:
clf.coef_

array([[-0.42456599,  0.96664261, -2.51554625, -1.08216927],
       [ 0.53541119, -0.32073935, -0.20740629, -0.94263206],
       [-0.1108452 , -0.64590325,  2.72295254,  2.02480133]])

In [45]:
clf.intercept_

array([  9.85494228,   2.23117432, -12.0861166 ])
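
The fitted `coef_` and `intercept_` are enough to reproduce the model's scores and probabilities by hand. The sketch below does this for the first iris sample, using the rounded values printed above and assuming the multinomial (softmax) formulation, which is what the printed numbers are consistent with.

```python
import numpy as np

# Values copied from clf.coef_ and clf.intercept_ above (one row per class).
coef = np.array([[-0.42456599,  0.96664261, -2.51554625, -1.08216927],
                 [ 0.53541119, -0.32073935, -0.20740629, -0.94263206],
                 [-0.1108452 , -0.64590325,  2.72295254,  2.02480133]])
intercept = np.array([9.85494228, 2.23117432, -12.0861166])

x0 = np.array([5.1, 3.5, 1.4, 0.2])   # first row of the iris data

scores = coef @ x0 + intercept              # one linear score per class
exp_scores = np.exp(scores - scores.max())  # subtract the max for numerical stability
proba = exp_scores / exp_scores.sum()       # softmax turns scores into probabilities

print("scores:", scores.round(4))
print("probabilities:", proba.round(4))
print("predicted class:", int(np.argmax(scores)))
```

The scores correspond to one row of `decision_function`, the probabilities to one row of `predict_proba`, and the predicted class is simply the one with the highest score.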

In [46]:
len(clf.predict_proba(X))

150

In [49]:
y_pred_default = clf.predict(X)
y_pred_default

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [50]:
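# Note: this compares the model's predictions with themselves, so it is 1.0 by
# construction; it says nothing about accuracy against the true labels y.
clf.score(X, y_pred_default)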

1.0

In [51]:
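# Threshold the probability of class 1 only: the result is a binary
# "class 1 or not" label (classes 0 and 2 both map to 0), not a 3-class prediction.
y_pred_tuned = (clf.predict_proba(X)[:,1] >= 0.3).astype(int)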
y_pred_tuned

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [52]:
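# Share of samples where clf.predict(X) equals the binary labels above.
# This is not an accuracy: a prediction of class 2 can never match a 0/1 label.
clf.score(X, y_pred_tuned)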

0.6533333333333333
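
A threshold on one class's probability answers a binary question ("is this sample class 1 or not?"), so it should be scored against a binary version of the true labels. A minimal sketch with hypothetical probabilities and labels (six made-up samples, not taken from the iris fit; `accuracy_at` is an illustrative helper):

```python
import numpy as np

# Hypothetical values for six samples, for illustration only.
proba_class1 = np.array([0.02, 0.87, 0.44, 0.60, 0.05, 0.35])
y_true = np.array([0, 1, 1, 1, 2, 2])

# Binary ground truth for the one-vs-rest question "is it class 1?".
is_class1 = (y_true == 1)

def accuracy_at(threshold):
    """Accuracy of the rule 'predict class 1 when P(class 1) >= threshold'."""
    predicted = proba_class1 >= threshold
    return float(np.mean(predicted == is_class1))

for t in (0.5, 0.4, 0.3):
    print(f"threshold {t:.1f}: accuracy {accuracy_at(t):.3f}")
```

Lowering the threshold catches more true class-1 samples but eventually starts flagging other classes as well, which is the trade-off a tuned threshold has to balance.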

In [53]:
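# Predict confidence scores for samples: one column per class.
# In this multi-class fit the class with the highest score is predicted
# (the "> 0 means predicted" rule applies only to binary problems).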
clf.decision_function(X)

array([[  7.33470626,   3.36028842, -10.69499468],
       [  6.93629815,   3.41357586, -10.34987401],
       [  7.4660945 ,   3.26308638, -10.72918088],
       [  6.90877759,   3.20013794, -10.10891552],
       [  7.47382712,   3.27467337, -10.74850048],
       [  6.62289578,   3.14186774,  -9.76476351],
       [  7.34210807,   3.03039356, -10.37250162],
       [  7.02894397,   3.31808061, -10.34702458],
       [  7.05191689,   3.1779442 , -10.22986109],
       [  6.88962472,   3.4550245 , -10.34464922],
       [  7.14911036,   3.43602328, -10.58513364],
       [  6.86230254,   3.19025774, -10.05256029],
       [  7.08697168,   3.45429795, -10.54126963],
       [  8.05391855,   3.24881424, -11.30273279],
       [  8.02394062,   3.61618783, -11.64012845],
       [  7.48195653,   3.18360267, -10.66555921],
       [  7.62911428,   3.22483025, -10.85394453],
       [  7.22648933,   3.26602521, -10.49251455],
       [  6.50707865,   3.42882823,  -9.93590688],
       [  7.26492749,   3.14906

In [54]:
# clf?

In [55]:
clf.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 400,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [56]:
clf.n_iter_

array([109], dtype=int32)

In [57]:
# clf.predict_log_proba(X)

In [58]:
clf.predict_proba?

In [60]:
clf.predict_proba(X)[:,1]

array([0.01844368, 0.02867258, 0.01473031, 0.02392443, 0.01478635,
       0.02985689, 0.01323307, 0.02387256, 0.02035283, 0.03123145,
       0.0238208 , 0.02479405, 0.02576404, 0.00812135, 0.0120359 ,
       0.01340868, 0.01207721, 0.01869799, 0.04401337, 0.01605002,
       0.05389863, 0.01846623, 0.00404427, 0.0482164 , 0.04835589,
       0.04902783, 0.03070504, 0.02539188, 0.02298451, 0.02901308,
       0.03606051, 0.03560671, 0.01172788, 0.01107866, 0.0316564 ,
       0.01558361, 0.02143125, 0.01326719, 0.01429391, 0.02621482,
       0.01355602, 0.03833119, 0.01108516, 0.027842  , 0.03997691,
       0.02647387, 0.01985967, 0.01682052, 0.02168768, 0.02160605,
       0.87407819, 0.85981231, 0.72525793, 0.93964232, 0.81527699,
       0.86006552, 0.71662283, 0.8489463 , 0.89669226, 0.91183771,
       0.93760263, 0.89884805, 0.97656642, 0.77926874, 0.91513279,
       0.92641289, 0.77461592, 0.96516591, 0.80099624, 0.95941064,
       0.44016173, 0.95680135, 0.59626649, 0.85988459, 0.94298

In [61]:
y_pred_tuned = (clf.predict_proba(X)[:,1] >= 0.4).astype(int)
y_pred_tuned

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [62]:
y_pred_tuned = (clf.predict_proba(X)[:,1] >= 0.4).astype(bool)
y_pred_tuned

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False, False, False,
        True, False,

In [63]:
preds = clf.predict_proba(X)

In [64]:
preds

array([[0.98155631, 0.01844368, 0.00000001],
       [0.97132739, 0.02867258, 0.00000003],
       [0.98526968, 0.01473031, 0.00000001],
       [0.97607553, 0.02392443, 0.00000004],
       [0.98521363, 0.01478635, 0.00000001],
       [0.97014304, 0.02985689, 0.00000007],
       [0.98676691, 0.01323307, 0.00000002],
       [0.97612741, 0.02387256, 0.00000003],
       [0.97964714, 0.02035283, 0.00000003],
       [0.96876852, 0.03123145, 0.00000003],
       [0.97617918, 0.0238208 , 0.00000002],
       [0.9752059 , 0.02479405, 0.00000004],
       [0.97423594, 0.02576404, 0.00000002],
       [0.99187865, 0.00812135, 0.        ],
       [0.9879641 , 0.0120359 , 0.        ],
       [0.98659131, 0.01340868, 0.00000001],
       [0.98792278, 0.01207721, 0.00000001],
       [0.98130199, 0.01869799, 0.00000002],
       [0.95598656, 0.04401337, 0.00000007],
       [0.98394996, 0.01605002, 0.00000002],
       [0.94610128, 0.05389863, 0.00000009],
       [0.98153374, 0.01846623, 0.00000003],
       [0.

[GridSearchCV documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearch#sklearn.model_selection.GridSearchCV)

In [77]:

pipeline_instance = Pipeline([('classifier', LogisticRegression())])

# Create param grid: 2 penalties x 20 values of C = 40 combinations of parameters
param_grid = [
    {'classifier': [LogisticRegression(max_iter=500)],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(-4, 4, 20),
     'classifier__solver': ['liblinear']},
]

# Create grid search object (note: this rebinds the name clf used above)
clf = GridSearchCV(pipeline_instance, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1)

# Fit on data; fit() returns the fitted search object itself
grid_clf = clf.fit(X, y)

Fitting 5 folds for each of 40 candidates, totalling 200 fits
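
The "40 candidates, 200 fits" in the log follows directly from the grid: every combination of the listed values is tried once per fold. A small sketch that enumerates the same combinations with `itertools.product` (the variable names are illustrative; this mimics the counting, not GridSearchCV itself):

```python
from itertools import product

import numpy as np

penalties = ["l1", "l2"]
C_values = np.logspace(-4, 4, 20)   # 20 values from 1e-4 to 1e4, evenly spaced in log10
solvers = ["liblinear"]
n_folds = 5

# Every (penalty, C, solver) combination is one candidate model.
candidates = list(product(penalties, C_values, solvers))
print("candidates:", len(candidates))
print("total fits:", len(candidates) * n_folds)
print("smallest and largest C:", C_values[0], C_values[-1])
```

Adding one more option to any list multiplies the number of fits, so grids grow quickly as parameters are added.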


In [74]:
grid_clf.score(X, y)

0.98

[Pipeline example: chaining a PCA and a logistic regression](https://scikit-learn.org/stable/auto_examples/compose/plot_digits_pipe.html#sphx-glr-auto-examples-compose-plot-digits-pipe-py)

In [76]:
# The values 9.737937, -2.357894, -38.465492 are the decision scores for a single sample across three classes (e.g., class 0, class 1, and class 2).
# The class with the highest score is the predicted class for that sample.
grid_clf.decision_function(X)

array([[  9.73793724,  -2.35789446, -38.46549239],
       [  7.80928462,  -0.93392793, -35.51128145],
       [  8.84085257,  -1.57248307, -36.47427166],
       [  7.31275309,  -1.01546741, -34.38619193],
       [  9.98318112,  -2.61236528, -38.61948357],
       [  9.28189433,  -3.69000404, -37.12945639],
       [  8.63250221,  -2.24919731, -35.35366651],
       [  8.75091795,  -1.9304923 , -37.0324205 ],
       [  6.96195511,  -0.5495036 , -33.48507168],
       [  7.87190143,  -0.80471698, -36.40791733],
       [ 10.18908282,  -2.84551719, -39.67864907],
       [  8.00914255,  -1.75756096, -35.7533398 ],
       [  7.96704254,  -0.63654182, -36.28486113],
       [  9.08854512,  -0.91708773, -36.6897025 ],
       [ 13.25048828,  -4.14938259, -44.28990116],
       [ 12.39305495,  -5.39485649, -41.70571944],
       [ 11.44621513,  -4.20845804, -39.74948776],
       [  9.47983171,  -2.63362165, -37.37987628],
       [  9.49545119,  -3.20312387, -38.68515418],
       [  9.97552596,  -3.33239