# Logistic regression applied to Iris

**Reading and partitioning the dataset:**

In [6]:
import numpy as np; from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris(); X = iris.data.astype(np.float16); y = iris.target.astype(np.uint)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=23)
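
As a quick check on the split, the sketch below prints the partition sizes and the per-class counts of the test set. It reloads the data without the dtype casts so that it runs on its own. With 30 test samples, each misclassified sample moves the test error by 1/30, about 3.3%, which is the granularity of the error rates reported in this section.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Same split parameters as the cell above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=23)

print("train/test sizes:", X_train.shape[0], X_test.shape[0])
print("class counts in test:", np.bincount(y_test, minlength=3))
print(f"error granularity: {1 / len(y_test):.1%}")
```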

**LogisticRegression:** $\;$ implementation of logistic regression in sklearn

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
clf = LogisticRegression(random_state=23).fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
err_test = 1 - accuracy_score(y_test, y_test_pred)
print(f"Test error: {err_test:.1%}")

Test error: 0.0%


**Warnings:** $\;$ sklearn warns liberally; here we silence its convergence warnings, which are raised when a solver stops without having converged (typically on reaching `max_iter`)

In [8]:
import warnings; from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
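
A global filter hides every convergence warning for the rest of the session. As a possible alternative, the sketch below records warnings only around a single fit, so it is still possible to tell whether that fit converged. The helper `fit_and_report` is a hypothetical name introduced for illustration, and the example fits on the full dataset to stay self-contained.

```python
import warnings
from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

def fit_and_report(max_iter):
    """Fit a model; return it and whether a ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning inside this block
        clf = LogisticRegression(max_iter=max_iter).fit(X, y)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return clf, warned

# A very small iteration budget should not be enough to converge
_, warned = fit_and_report(max_iter=2)
print("ConvergenceWarning with max_iter=2:", warned)
```

The filters changed inside `warnings.catch_warnings()` are restored on exit, so the rest of the notebook is unaffected.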

**Solvers:** $\;$ the `solver` parameter of LogisticRegression selects the optimization algorithm (`lbfgs` by default)

In [10]:
for solver in ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']:
    clf = LogisticRegression(random_state=23, solver=solver, max_iter=10000).fit(X_train, y_train)
    err_test = 1 - accuracy_score(y_test, clf.predict(X_test))
    print(f"Test error after training with the solver {solver!s}: {err_test:.1%}")

Test error after training with the solver lbfgs: 0.0%
Test error after training with the solver liblinear: 3.3%
Test error after training with the solver newton-cg: 0.0%
Test error after training with the solver newton-cholesky: 3.3%
Test error after training with the solver sag: 0.0%
Test error after training with the solver saga: 0.0%


**Tolerance:** $\;$ the `tol` parameter sets the tolerance of the stopping criterion (1e-4 by default); a larger value ends training earlier

In [12]:
for tol in (1e-4, 1e-2, 1, 1e2, 1e4):
    clf = LogisticRegression(tol=tol, random_state=23, max_iter=10000).fit(X_train, y_train)
    err_test = 1 - accuracy_score(y_test, clf.predict(X_test))
    print(f"Test error with tolerance {tol}: {err_test:.1%}")

Test error with tolerance 0.0001: 0.0%
Test error with tolerance 0.01: 0.0%
Test error with tolerance 1: 0.0%
Test error with tolerance 100.0: 60.0%
Test error with tolerance 10000.0: 60.0%
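
To see why a large tolerance degrades the model, it can help to look at how many iterations the solver actually ran. The sketch below reads the fitted attribute `n_iter_` for a few tolerances. It uses the default solver and the full dataset, which are choices made here for brevity; the iteration count should not increase as `tol` grows.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

iters = {}
for tol in (1e-4, 1e-2, 1, 1e2):
    clf = LogisticRegression(tol=tol, max_iter=10000).fit(X, y)
    # n_iter_ is an array; take the largest entry as the iteration count
    iters[tol] = int(clf.n_iter_.max())
    print(f"tol={tol:g}: {iters[tol]} iterations")
```

A very loose tolerance can stop the optimizer almost immediately, leaving the coefficients close to their starting values.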


**Regularization:** $\;$ parameter `C` (positive, $1.0$ by default) is the inverse of the regularization strength, so increasing it weakens regularization
* **Risk of underfitting:** $\;$ with a value close to zero (maximum regularization)
* **Risk of overfitting:** $\;$ with a very high positive value (minimum regularization)

In [13]:
for C in (1e-2, 1e-1, 1, 1e1, 1e2):
    clf = LogisticRegression(C=C, random_state=23, max_iter=10000).fit(X_train, y_train)
    err_test = 1 - accuracy_score(y_test, clf.predict(X_test))
    print(f"Test error with C {C:g}: {err_test:.1%}")

Test error with C 0.01: 6.7%
Test error with C 0.1: 3.3%
Test error with C 1: 0.0%
Test error with C 10: 3.3%
Test error with C 100: 3.3%
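
One way to observe the effect of `C` directly is through the size of the learned coefficients. The sketch below assumes the default L2 penalty and fits on the full dataset for brevity; it prints the Euclidean norm of `coef_` for increasing `C`. Under an L2 penalty, weaker regularization is expected to give a larger norm.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

norms = []
for C in (1e-2, 1, 1e2):
    clf = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    # Norm of the full coefficient matrix (classes x features)
    norms.append(float(np.linalg.norm(clf.coef_)))
    print(f"C={C:g}: ||coef||_2 = {norms[-1]:.2f}")
```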


**Early stopping:** $\;$ capping `max_iter` ends training after a few iterations, which saves computation and acts as a form of regularization against overfitting

In [14]:
for max_iter in (10, 20, 50, 100):
    clf = LogisticRegression(random_state=23, max_iter=max_iter).fit(X_train, y_train)
    err_test = 1 - accuracy_score(y_test, clf.predict(X_test))
    print(f"Test error with max_iter {max_iter}: {err_test:.1%}")

Test error with max_iter 10: 0.0%
Test error with max_iter 20: 3.3%
Test error with max_iter 50: 0.0%
Test error with max_iter 100: 0.0%
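
The sketch below complements the loop above by also reporting the number of iterations actually used (`n_iter_`) and the training error, so the effect of the iteration cap can be compared on both partitions. It reloads the data without the dtype casts and silences convergence warnings locally; both are choices made for this illustration.

```python
import warnings
from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=23)

used = {}
for max_iter in (10, 20, 50, 100):
    with warnings.catch_warnings():
        # Hitting the cap is intended here, so hide the warning locally
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        clf = LogisticRegression(max_iter=max_iter).fit(X_train, y_train)
    used[max_iter] = int(clf.n_iter_.max())
    err_train = 1 - clf.score(X_train, y_train)
    err_test = 1 - clf.score(X_test, y_test)
    print(f"max_iter={max_iter}: used {used[max_iter]} iterations, "
          f"train error {err_train:.1%}, test error {err_test:.1%}")
```

With a test set this small, a difference of one misclassified sample between runs is within noise, so these numbers are better read as a trend than as a ranking.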
