## Building Logistic Regression models

The simple model built in the proof of concept performed acceptably. However, it is worth trying to improve its ability
to generalize by scaling the data and reducing its dimensionality.

### Data scaling

Scaling the data might improve the model's performance and allow the solver to converge faster.

This step usually matters when features vary widely in magnitude, unit and range. That is not the case here, as every
feature represents the same measure, so scaling could turn out to be ineffective or even harmful.

The available options are a MinMax scaler, which rescales the values into a given range (0 to 1 by default), and a
Standard scaler, which standardizes each feature to zero mean and unit variance.
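The difference between the two scalers can be seen on a small example. This is a minimal sketch on a made-up toy matrix, not on the project's data; the array `X` and its values are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix (made up for illustration): 4 samples, 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# MinMaxScaler maps each column onto the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# StandardScaler centers each column on 0 and rescales it to unit variance
X_standard = StandardScaler().fit_transform(X)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(X_standard.mean(axis=0), X_standard.std(axis=0))
```

When a scaler is placed inside a `Pipeline`, it is fitted only on the training folds during cross-validation, which avoids leaking information from the validation data.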

### Dimensionality reduction

Another important step could be reducing the number of features on which the model is fitted. This could help avoid
overfitting, since the dataset has a very large number of features while Logistic Regression is a fairly simple model.

Dimensionality reduction can be achieved with a variety of techniques, PCA being the most common. An alternative is to
train a Random Forest Regressor on the training data, cross-validate it and use its feature importances to select the
features passed to the Logistic Regression.
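Both approaches can be sketched on synthetic data. The dataset below comes from `make_classification` and is only a stand-in (an assumption) for the real features; the number of trees and the threshold are illustrative values, not tuned ones.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumption): 200 samples, 30 features, 5 of them informative
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0
)

# PCA: a float n_components keeps as many components as needed
# to explain at least that share of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)

# Random forest: keep the features whose importance reaches the threshold
forest = RandomForestRegressor(n_estimators=50, random_state=0)
selector = SelectFromModel(forest, threshold=0.01)
X_selected = selector.fit_transform(X, y)

print(X.shape, X_pca.shape, X_selected.shape)
```

PCA builds new features as linear combinations of the original ones, whereas `SelectFromModel` keeps a subset of the original columns, so the second option leaves the retained features directly interpretable.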

### Hyperparameters tuning

The pipeline to apply and the hyperparameters for each step can be chosen with a grid search over the 9 possible
scaler/selector combinations. Logistic Regression itself has a variety of hyperparameters to tune.

The search can be done with GridSearchCV, which cross-validates a model for every combination of hyperparameters.

```python
from numpy import logspace

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

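# RANDOM_STATE and N_JOBS are assumed to be defined earlier in the notebook

scalers = [None, MinMaxScaler(), StandardScaler()]
selectors = [None, PCA(), RandomForestRegressor()]

for scaler in scalers:
    for selector in selectors:
        # Start from an empty pipeline for every scaler/selector combination
        steps = []

        # Logistic regression hyperparameters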
        log_reg_param_grid = [
                {
                    "classifier__solver": ["liblinear"],
                    "classifier__penalty": ["l1", "l2"],
                    "classifier__C": logspace(-1, 3, 5, endpoint=True),
                    "classifier__random_state": [RANDOM_STATE],
                    "classifier__max_iter": [1000],
                    "classifier__n_jobs": [N_JOBS]
                },
                {
                    "classifier__solver": ["lbfgs"],
                    "classifier__penalty": [None, "l2"],  # use "none" on scikit-learn < 1.2
                    "classifier__C": logspace(-1, 3, 5, endpoint=True),
                    "classifier__random_state": [RANDOM_STATE],
                    "classifier__max_iter": [1000],
                    "classifier__n_jobs": [N_JOBS]
                }
            ]
        
        if scaler is not None:
            steps.append(("scaler", scaler))

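        if isinstance(selector, PCA):

            # PCA hyperparameters
            for param_grid in log_reg_param_grid:
                param_grid["selector__n_components"] = [0.95]

            # PCA is a transformer, so it is added to the pipeline directly
            steps.append(("selector", selector))

        elif isinstance(selector, RandomForestRegressor):

            # Random Forest hyperparameters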
            for param_grid in log_reg_param_grid:
                param_grid["selector__estimator__n_estimators"] = [10, 100, 1000]
                param_grid["selector__estimator__random_state"] = [RANDOM_STATE]
                param_grid["selector__estimator__n_jobs"] = [N_JOBS]
                param_grid["selector__threshold"] = [0.01, 0.025, 0.05]

            steps.append(("selector", SelectFromModel(selector)))

        steps.append(("classifier", LogisticRegression()))
        pipeline = Pipeline(steps=steps)
        classifier = GridSearchCV(pipeline, param_grid=log_reg_param_grid)

```
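The size of this search can be checked with scikit-learn's `ParameterGrid`. The sketch below rebuilds the grids from the snippet above with placeholder values for `RANDOM_STATE` and `N_JOBS` (assumptions, as the actual constants are defined elsewhere) and assumes GridSearchCV's default 5-fold cross-validation.

```python
from numpy import logspace
from sklearn.model_selection import ParameterGrid

# Placeholder values (assumptions): the notebook defines its own constants
RANDOM_STATE = 42
N_JOBS = -1
CV_FOLDS = 5  # GridSearchCV default since scikit-learn 0.22


def classifier_grids():
    """Return the two logistic regression grids used in the search."""
    common = {
        "classifier__C": logspace(-1, 3, 5, endpoint=True),
        "classifier__random_state": [RANDOM_STATE],
        "classifier__max_iter": [1000],
        "classifier__n_jobs": [N_JOBS],
    }
    return [
        {"classifier__solver": ["liblinear"],
         "classifier__penalty": ["l1", "l2"], **common},
        {"classifier__solver": ["lbfgs"],
         "classifier__penalty": [None, "l2"], **common},
    ]


# Extra hyperparameters contributed by each selector option
selector_grids = {
    "none": {},
    "pca": {"selector__n_components": [0.95]},
    "random_forest": {
        "selector__estimator__n_estimators": [10, 100, 1000],
        "selector__estimator__random_state": [RANDOM_STATE],
        "selector__estimator__n_jobs": [N_JOBS],
        "selector__threshold": [0.01, 0.025, 0.05],
    },
}

# None, MinMaxScaler and StandardScaler add no hyperparameters of their own
n_scalers = 3

candidates_per_scaler = sum(
    len(ParameterGrid([{**grid, **extra} for grid in classifier_grids()]))
    for extra in selector_grids.values()
)
n_candidates = n_scalers * candidates_per_scaler
n_fits = n_candidates * CV_FOLDS

print(f"{n_candidates} candidates, {n_fits} fits")
# 660 candidates, 3300 fits
```

Under these assumptions the nine pipelines cover 660 hyperparameter settings, which amounts to 3,300 model fits once each setting is cross-validated on 5 folds.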

This configuration yields more than three thousand combinations to explore. The whole logic, including model fitting
and dumping, has been gathered in a single function.

```python
from src.modeling import model_builder

model_builder.build_log_reg_models()
```

Models are stored in the [models](/models) directory as .joblib files, which can be used to generate predictions on the
test set and estimate how well each model actually generalizes.
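The dump, load and evaluate round trip could look roughly like the sketch below. It is not the code of `build_log_reg_models`: the data is synthetic, the file name `log_reg_example.joblib` is hypothetical, and a temporary directory stands in for the models directory.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's data (assumption)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

with tempfile.TemporaryDirectory() as models_dir:
    # Hypothetical file name inside a throwaway directory
    model_path = Path(models_dir) / "log_reg_example.joblib"

    # Fit a model on the training split and dump it to disk
    fitted = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    joblib.dump(fitted, model_path)

    # Load the stored model and score it on data it has never seen
    model = joblib.load(model_path)
    test_accuracy = accuracy_score(y_test, model.predict(X_test))

print(f"Test accuracy: {test_accuracy:.3f}")
```

A fitted `GridSearchCV` object can be dumped in the same way; calling `predict` on it uses the best estimator found during the search, provided `refit` was left at its default.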