# Mock-up exam

In this mock-up exam, we will use one of the classification tasks available in OpenML, namely the [*diabetes*](https://www.openml.org/search?type=data&id=37) task (data_id=37). The task consists of classifying patients as having tested negative (class label = 0) or positive (class label = 1) for diabetes, based on the following eight features:
<ol>
<li>Number of times pregnant</li>
<li>Plasma glucose concentration at 2 hours in an oral glucose tolerance test</li>
<li>Diastolic blood pressure (mm Hg)</li>
<li>Triceps skin fold thickness (mm)</li>
<li>2-Hour serum insulin (mu U/ml)</li>
<li>Body mass index (weight in kg/(height in m)^2)</li>
<li>Diabetes pedigree function</li>
<li>Age (years)</li>
</ol>

Below you can find a baseline result achieved with default parameters.

In [6]:
import warnings

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")  # suppress warnings (e.g. solver convergence warnings)

data_id = 37     # OpenML id of the diabetes dataset
test_size = 0.1  # 90%-10% train-test partition
X, y = fetch_openml(data_id=data_id, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, shuffle=True, random_state=23)
# Baseline: logistic regression with default parameters
clf = LogisticRegression(random_state=23).fit(X_train, y_train)
print(f'Test error: {(1 - accuracy_score(y_test, clf.predict(X_test)))*100:5.0f}%')

Test error:    19%


### Exercise 
Using the logistic regression classifier, assess how the parameters studied in the lab sessions affect the classification error rate on a 90%-10% train-test partition. Use *random_state=23* to define this partition. In the process, try to improve on the baseline test-set result given above. Based on the results you obtain, can you claim that this task is linearly separable? Why?

### Solution

**Experiments:** $\;$ exploring initial parameter values for solver, tolerance, C and maximum number of iterations. 

In [7]:
print('   solver     tol       C max_iter   etr   ete')
print('--------- ------- ------- -------- ----- -----')
for solver in ['lbfgs', 'newton-cg', 'sag', 'saga']:
    for tol in (1e-4, 1e-2, 1, 1e2, 1e4):
        for C in (1e-2, 1e-1, 1, 1e1, 1e2):
            for max_iter in (50, 100, 200, 500):
                clf = LogisticRegression(solver=solver, tol=tol, C=C, random_state=23, max_iter=max_iter).fit(X_train, y_train)
                etr = 1 - accuracy_score(y_train, clf.predict(X_train))
                ete = 1 - accuracy_score(y_test, clf.predict(X_test))
                print(f'{solver:>9} {tol:.1e} {C:.1e} {max_iter:8d} {etr:5.0%} {ete:5.0%}')

   solver     tol       C max_iter   etr   ete
--------- ------- ------- -------- ----- -----
    lbfgs 1.0e-04 1.0e-02       50   31%   30%
    lbfgs 1.0e-04 1.0e-02      100   24%   23%
    lbfgs 1.0e-04 1.0e-02      200   23%   18%
    lbfgs 1.0e-04 1.0e-02      500   23%   18%
    lbfgs 1.0e-04 1.0e-01       50   29%   25%
    lbfgs 1.0e-04 1.0e-01      100   23%   19%
    lbfgs 1.0e-04 1.0e-01      200   23%   19%
    lbfgs 1.0e-04 1.0e-01      500   23%   19%
    lbfgs 1.0e-04 1.0e+00       50   30%   23%
    lbfgs 1.0e-04 1.0e+00      100   22%   19%
    lbfgs 1.0e-04 1.0e+00      200   22%   18%
    lbfgs 1.0e-04 1.0e+00      500   22%   18%
    lbfgs 1.0e-04 1.0e+01       50   30%   32%
    lbfgs 1.0e-04 1.0e+01      100   23%   18%
    lbfgs 1.0e-04 1.0e+01      200   22%   18%
    lbfgs 1.0e-04 1.0e+01      500   22%   18%
    lbfgs 1.0e-04 1.0e+02       50   30%   35%
    lbfgs 1.0e-04 1.0e+02      100   22%   19%
    lbfgs 1.0e-04 1.0e+02      200   22%   18%
    lbfgs 1.0

Additional experiments exploring parameter values at the boundaries of the grid:

In [8]:
print('   solver     tol       C max_iter   etr   ete')
print('--------- ------- ------- -------- ----- -----')
for solver in ['lbfgs']:
    for tol in (1e-4, 1e-2, 1):
        for C in (1e-4, 1e-2, 1e-1, 1, 1e1, 1e2, 1e4):
            for max_iter in (50, 100, 200, 500, 1000):
                clf = LogisticRegression(solver=solver, tol=tol, C=C, random_state=23, max_iter=max_iter).fit(X_train, y_train)
                etr = 1 - accuracy_score(y_train, clf.predict(X_train))
                ete = 1 - accuracy_score(y_test, clf.predict(X_test))
                print(f'{solver:>9} {tol:.1e} {C:.1e} {max_iter:8d} {etr:5.0%} {ete:5.0%}')

   solver     tol       C max_iter   etr   ete
--------- ------- ------- -------- ----- -----
    lbfgs 1.0e-04 1.0e-04       50   24%   21%
    lbfgs 1.0e-04 1.0e-04      100   24%   22%
    lbfgs 1.0e-04 1.0e-04      200   24%   22%
    lbfgs 1.0e-04 1.0e-04      500   24%   22%
    lbfgs 1.0e-04 1.0e-04     1000   24%   22%
    lbfgs 1.0e-04 1.0e-02       50   31%   30%
    lbfgs 1.0e-04 1.0e-02      100   24%   23%
    lbfgs 1.0e-04 1.0e-02      200   23%   18%
    lbfgs 1.0e-04 1.0e-02      500   23%   18%
    lbfgs 1.0e-04 1.0e-02     1000   23%   18%
    lbfgs 1.0e-04 1.0e-01       50   29%   25%
    lbfgs 1.0e-04 1.0e-01      100   23%   19%
    lbfgs 1.0e-04 1.0e-01      200   23%   19%
    lbfgs 1.0e-04 1.0e-01      500   23%   19%
    lbfgs 1.0e-04 1.0e-01     1000   23%   19%
    lbfgs 1.0e-04 1.0e+00       50   30%   23%
    lbfgs 1.0e-04 1.0e+00      100   22%   19%
    lbfgs 1.0e-04 1.0e+00      200   22%   18%
    lbfgs 1.0e-04 1.0e+00      500   22%   18%
    lbfgs 1.0

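The best test error rate, 18%, improves on the proposed baseline of 19%. It was achieved by several combinations of tolerance and C, provided that enough iterations are run. However, a training error rate of 0% was never achieved. It was not even close: the lowest training error in the results shown above is 22%. If the training set were linearly separable, a linear classifier such as logistic regression with weak regularization (large C) and enough iterations should be able to fit it with 0% training error, so I suspect that this task is not linearly separable.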
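
The reasoning behind this conclusion can be illustrated on a toy problem. The sketch below is not part of the experiments above and does not use scikit-learn: it assumes a hand-written, unregularized logistic regression trained by batch gradient descent (the helpers `sigmoid`, `fit_logistic` and `training_error` are hypothetical names introduced only for this illustration), applied to two four-point datasets. With logical AND labels the points are linearly separable and the training error should reach 0%; with logical XOR labels no linear decision boundary classifies all four points correctly, so the training error stays above 0% however long we train.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=1.0, n_iter=5000):
    """Fit unregularized logistic regression with batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Gradient of the mean log-loss: (predicted probability - label) * input
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def training_error(X, y, w, b):
    """Fraction of samples misclassified by the sign of the linear score."""
    wrong = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        wrong += int((1 if score > 0 else 0) != yi)
    return wrong / len(X)


X_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_and = [0, 0, 0, 1]  # linearly separable labels (logical AND)
y_xor = [0, 1, 1, 0]  # labels that are not linearly separable (logical XOR)

err_and = training_error(X_toy, y_and, *fit_logistic(X_toy, y_and))
err_xor = training_error(X_toy, y_xor, *fit_logistic(X_toy, y_xor))
print(f'AND training error: {err_and:.0%}')
print(f'XOR training error: {err_xor:.0%}')
```

On the diabetes task, the analogous check is the one already carried out in the experiments: weak regularization (large C) with many iterations, looking at `etr` rather than `ete`. Note that a nonzero training error is evidence against linear separability only when the optimizer has been given enough iterations to converge.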