# Logistic Regression (aka logit, MaxEnt) classifier - Experiment

This is a component that trains a Logistic Regression (aka logit, MaxEnt) classifier model using [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). 
<br>
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

This notebook shows:
- how to use the [SDK](https://platiagro.github.io/sdk/) to load datasets, save models and other artifacts.
- how to declare parameters and use them to build reusable components.

## Declare parameters and model hyperparameters
Components may declare (and use) these default parameters:
- dataset
- target

Use these parameters to load/save datasets, models, metrics, and figures with the help of [PlatIAgro SDK](https://platiagro.github.io/sdk/). <br />
You may also declare custom parameters to set when running an experiment.

Select the hyperparameters and their respective values to be used when training the model:
- solver
- penalty
- C
- fit_intercept
- class_weight
- max_iter
- multi_class

These are just a few of the parameters offered by the model class; you may also use any other existing parameter. <br />
Check the [model parameters](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) for more information.
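
Not every combination of `solver` and `penalty` is valid. As a rough sketch, the helper below encodes the compatibility table from the Scikit-learn documentation for five commonly used solvers, so an invalid pair can be caught before training. The name `check_solver_penalty` is hypothetical and is not part of Scikit-learn or the SDK. The string `"none"` stands for no regularization in this sketch, which is an assumption, since its spelling depends on the Scikit-learn version.

```python
# Penalties supported by each solver, following the Scikit-learn
# LogisticRegression documentation. "none" means "no regularization" here.
SUPPORTED_PENALTIES = {
    "liblinear": {"l1", "l2"},
    "lbfgs": {"l2", "none"},
    "newton-cg": {"l2", "none"},
    "sag": {"l2", "none"},
    "saga": {"l1", "l2", "elasticnet", "none"},
}


def check_solver_penalty(solver, penalty):
    """Return True if the (solver, penalty) pair is a supported combination."""
    # Normalize None / "None" / "none" to a single spelling.
    normalized = "none" if penalty is None else str(penalty).lower()
    return normalized in SUPPORTED_PENALTIES.get(solver, set())


print(check_solver_penalty("liblinear", "l2"))
print(check_solver_penalty("lbfgs", "l1"))
print(check_solver_penalty("saga", "elasticnet"))
```

With the defaults used in this notebook (`solver = "liblinear"`, `penalty = "l2"`) the check passes.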

In [None]:
# parameters
dataset = "iris" #@param {type:"string"}
target = "Species" #@param {type:"feature", label:"Target attribute", description: "Your model will be trained to predict the target values."}

# features selected to build the model
model_features = ["SepalLengthCm","SepalWidthCm","PetalLengthCm"] #@param {type:"feature",multiple:true,label:"Features selected for the model",description:"Your model will be built using only the selected features. If none is selected, all features will be used."}

# features to apply Ordinal Encoder
ordinal_features = "" #@param {type:"feature",multiple:true,label:"Features for ordinal encoding", description: "Your model will use ordinal encoding for the selected features. The remaining categorical features will be encoded using One-Hot-Encoding."}

# hyperparameters
penalty = "l2" #@param ["l1", "l2", "elasticnet", "None"] {type:"string", label:"Penalty", description:"Norm used in the penalization"}
C = 1.0 #@param {type:"number", label:"Inverse regularization", description:"Inverse of the regularization strength (lambda); smaller values specify stronger regularization"}
fit_intercept = True #@param {type:"boolean", label:"Intercept", description:"Specifies whether a constant (bias or intercept) should be added to the decision function"}
class_weight = None #@param ["balanced"] {type:"string", label:"Class weight", description:"Specifies sample weights when fitting classifiers as a function of the target class"}
solver = "liblinear" #@param ["newton-cg", "lbfgs", "liblinear", "sag", "saga"] {type:"string", label:"Solver", description:"Algorithm to use in the optimization problem"}
max_iter = 100 #@param {type: "integer", label:"Iterations", description:"Maximum number of iterations taken for the solvers to converge"}
multi_class = "auto" #@param ["auto", "ovr", "multinomial"] {type:"string", label:"Multiclass", description:"Classification with more than two classes, where each sample can be labeled with only one class"}

# predict method
method = "predict_proba" #@param ["predict_proba", "predict"] {type:"string", label:"Prediction method", description:"With 'predict_proba', the prediction is the estimated probability of each class; with 'predict', it is the class each sample belongs to"}

## Load dataset

Import and put the whole dataset in a pandas.DataFrame.

In [None]:
from platiagro import load_dataset

df = load_dataset(name=dataset)

## Load metadata about the dataset
For example, below we get the feature type (e.g. categorical, numerical, or datetime) of each column in the dataset.

In [None]:
import numpy as np
from platiagro import stat_dataset

metadata = stat_dataset(name=dataset)
featuretypes = metadata["featuretypes"]

columns = df.columns.to_numpy()
featuretypes = np.array(featuretypes)
target_index = np.argwhere(columns == target)
columns = np.delete(columns, target_index)
featuretypes = np.delete(featuretypes, target_index)

## Keep selected features to perform the model

Select only the features that should be used in the model.

In [None]:
if len(model_features) >= 1:
    columns_index = np.where(np.isin(columns, model_features))[0]
    columns_index.sort()
    columns = columns[columns_index]
    featuretypes = featuretypes[columns_index]

# keep the features selected in the model_features parameter
df_model = df[columns]
X = df_model.to_numpy()
y = df[target].to_numpy()

## Encode target labels

The target labels are converted to ordinal integers with values between 0 and n_classes-1.

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

## Split dataset into train/test splits

Training Dataset: the sample of data used to fit the model.

Test Dataset: the sample of data used to provide an unbiased evaluation of a model fit on the training dataset.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

## Features configuration

In [None]:
from platiagro.featuretypes import NUMERICAL

# Selects the indexes of numerical and non-numerical features
numerical_indexes = np.where(featuretypes == NUMERICAL)[0]
non_numerical_indexes = np.where(~(featuretypes == NUMERICAL))[0]

# Selects non-numerical features to apply ordinal encoder or one-hot encoder
ordinal_features = np.asarray(ordinal_features)
non_numerical_indexes_ordinal = np.where(~(featuretypes == NUMERICAL) & np.isin(columns,ordinal_features))[0]
non_numerical_indexes_one_hot = np.where(~(featuretypes == NUMERICAL) & ~(np.isin(columns,ordinal_features)))[0]

# After the handle_missing_values step,
# numerical features are grouped at the beginning of the array
numerical_indexes_after_handle_missing_values = \
    np.arange(len(numerical_indexes))
non_numerical_indexes_after_handle_missing_values = \
    np.arange(len(numerical_indexes), len(featuretypes))
one_hot_indexes_after_handle_missing_values = \
    non_numerical_indexes_after_handle_missing_values[
        np.where(np.isin(non_numerical_indexes, non_numerical_indexes_one_hot))[0]]
ordinal_indexes_after_handle_missing_values = \
    non_numerical_indexes_after_handle_missing_values[
        np.where(np.isin(non_numerical_indexes, non_numerical_indexes_ordinal))[0]]
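
The index bookkeeping in this cell is easier to follow on a toy example. The sketch below uses plain Python lists and made-up columns to show where each original column ends up after the `handle_missing_values` step, which outputs the numerical columns first and the non-numerical columns after them. The string labels `"Numerical"` and `"Categorical"` and the column names are placeholders chosen for illustration, not the SDK constants.

```python
# Hypothetical feature types for five columns, in their original order.
featuretypes = ["Categorical", "Numerical", "Categorical", "Numerical", "Categorical"]
columns = ["color", "width", "size", "height", "shape"]
ordinal_features = ["size"]  # categorical columns to encode ordinally

numerical = [i for i, t in enumerate(featuretypes) if t == "Numerical"]
non_numerical = [i for i, t in enumerate(featuretypes) if t != "Numerical"]

# After imputation the layout is: numerical columns first, then non-numerical.
new_position = {old: new for new, old in enumerate(numerical + non_numerical)}

# Positions (in the new layout) of the columns each encoder should receive.
ordinal_after = [new_position[i] for i in non_numerical
                 if columns[i] in ordinal_features]
one_hot_after = [new_position[i] for i in non_numerical
                 if columns[i] not in ordinal_features]

print("numerical:", numerical, "non-numerical:", non_numerical)
print("ordinal indexes after imputation:", ordinal_after)
print("one-hot indexes after imputation:", one_hot_after)
```

Here the two numerical columns move to positions 0 and 1, so the categorical column `size` (originally at index 2) is found at position 3 by the ordinal encoder.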

## Fit a model using sklearn.linear_model.LogisticRegression

In [None]:
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.one_hot import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('handle_missing_values',
     ColumnTransformer(
        [('imputer_mean', SimpleImputer(strategy='mean'), numerical_indexes),
         ('imputer_mode', SimpleImputer(strategy='most_frequent'), non_numerical_indexes)],
         remainder='drop')),
    ('handle_categorical_features',
     ColumnTransformer(
         [('feature_encoder_ordinal', OrdinalEncoder(), ordinal_indexes_after_handle_missing_values),
          ('feature_encoder_onehot', OneHotEncoder(), one_hot_indexes_after_handle_missing_values)],
         remainder='passthrough')),
    ('estimator', LogisticRegression(solver=solver,
                                     penalty=penalty,
                                     C=C,
                                     fit_intercept=fit_intercept,
                                     class_weight=class_weight,
                                     max_iter=max_iter,
                                     multi_class=multi_class))
])

pipeline.fit(X_train, y_train)   

## Measure the performance
The [**Confusion Matrix**](https://en.wikipedia.org/wiki/Confusion_matrix) is a performance measurement for machine learning classification.<br>
It is extremely useful for measuring [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification), [Recall, Precision, and F-measure](https://en.wikipedia.org/wiki/Precision_and_recall).
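
As a small illustration of how these metrics follow from the matrix, the sketch below derives per-class precision, recall and F1-score from a made-up 3x3 confusion matrix. Rows are true classes and columns are predicted classes, which is the layout Scikit-learn uses. The numbers are invented for the example and `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(matrix):
    """Compute (precision, recall, f1) per class from a square confusion matrix.

    matrix[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Invented confusion matrix for three classes.
matrix = [
    [5, 0, 0],
    [0, 4, 1],
    [0, 2, 3],
]

# Accuracy is the sum of the diagonal divided by the total number of samples.
accuracy = sum(matrix[k][k] for k in range(3)) / sum(map(sum, matrix))
for k, (p, r, f1) in enumerate(per_class_metrics(matrix)):
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```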

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
# uses the model to make predictions on the Test Dataset
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)

# computes confusion matrix
labels = np.unique(y)
data = confusion_matrix(y_test, y_pred, labels=labels)

# computes precision, recall, f1-score, support (for multiclass classification problem) and accuracy
if len(labels)>2:  #multiclass classification
    p, r, f1, s = precision_recall_fscore_support(y_test, y_pred,
                                                  labels=labels,
                                                  average=None)
    
    commom_metrics = pd.DataFrame(data=zip(p, r, f1, s),columns=['Precision','Recall','F1-Score','Support']) 
    
    average_options = ('micro', 'macro', 'weighted')
    for average in average_options:
        if average.startswith('micro'):
            line_heading = 'accuracy'
        else:
            line_heading = average + ' avg'

        # compute averages with specified averaging method
        avg_p, avg_r, avg_f1, _ = precision_recall_fscore_support(
            y_test, y_pred, labels=labels,
            average=average)
        avg = pd.Series({'Precision':avg_p,  'Recall':avg_r,  'F1-Score':avg_f1,  'Support':np.sum(s)},name=line_heading)
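        # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
        commom_metrics = pd.concat([commom_metrics, avg.to_frame().T])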
else: #binary classification
    p, r, f1, _ = precision_recall_fscore_support(y_test, y_pred,
                                                  average='binary')
    accuracy=accuracy_score(y_test, y_pred)
    commom_metrics = pd.DataFrame(data={'Precision':p,'Recall':r,'F1-Score':f1,'Accuracy':accuracy},index=[1])

# puts matrix in pandas.DataFrame for better format
labels = label_encoder.inverse_transform(labels)
confusion_matrix = pd.DataFrame(data, columns=labels, index=labels)

# add correct index labels to commom_metrics DataFrame (for multiclass classification)
if len(labels)>2:
    as_list = commom_metrics.index.tolist()
    as_list[0:len(labels)] = labels
    commom_metrics.index = as_list


## Save metrics

Record the metrics used to evaluate the model.<br>
It's a good way to document the experiments, and it also helps to avoid running the same experiment twice.

In [None]:
from platiagro import save_metrics

save_metrics(confusion_matrix=confusion_matrix,commom_metrics=commom_metrics)

## Save figure

Compute and plot the Receiver Operating Characteristic (ROC) curve to evaluate the model performance. It illustrates the performance of a binary classifier as its discrimination threshold is varied. For multiclass classification, the one-vs-rest strategy is used: the ROC curve and AUC of each class are computed against the rest.
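
To make the one-vs-rest idea concrete, here is a minimal sketch in plain Python. It computes the AUC of each class as the fraction of (positive, negative) pairs in which the sample of that class gets the higher score, with ties counting as one half. This quantity equals the area under the ROC curve. The labels and probabilities are invented, and `binary_auc` and `auc_one_vs_rest` are hypothetical helpers; the notebook itself relies on `roc_curve` and `auc` from Scikit-learn.

```python
def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


def auc_one_vs_rest(y_true, y_prob, n_classes):
    """Return one AUC per class: that class is positive, all others negative."""
    aucs = []
    for k in range(n_classes):
        binarized = [1 if y == k else 0 for y in y_true]
        class_scores = [row[k] for row in y_prob]
        aucs.append(binary_auc(binarized, class_scores))
    return aucs


# Invented labels and predicted probabilities for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_prob = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.1, 0.4, 0.5],
    [0.2, 0.2, 0.6],
]

print(auc_one_vs_rest(y_true, y_prob, n_classes=3))
```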


In [None]:
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

y_prob = pipeline.predict_proba(X_test)

def plot_roc_curve(y_test,y_prob,labels):
    n_classes = len(labels)
    
    if n_classes == 2:
        # Compute ROC curve 
        fpr, tpr, _ = roc_curve(y_test, y_prob[:, 1])
        roc_auc = auc(fpr, tpr)  
        
        # Plot ROC Curve
        plt.figure()
        lw = 2
        plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        plt.xlim([-0.01, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('ROC Curve')
        plt.legend(loc="lower right")
        plt.show()
        
    else:  
        # Binarize the output
        lb = preprocessing.LabelBinarizer()
        y_test_bin = lb.fit_transform(y_test)

        # Compute ROC curve for each class
        fpr = dict()
        tpr = dict()
        roc_auc = dict()  

        for i in range(n_classes):
            fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_prob[:, i])
            roc_auc[i] = auc(fpr[i], tpr[i])
        
        color=cm.rainbow(np.linspace(0,1,n_classes+1))
        plt.figure()
        lw = 2
        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        plt.xlim([-0.01, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        
        for i,c in zip(range(n_classes),color):                   
            plt.plot(fpr[i], tpr[i], color=c,
             lw=lw, label='ROC curve - Class %s (area = %0.2f)' % (labels[i] ,roc_auc[i]))
            plt.title('ROC Curve One-vs-Rest')
            plt.legend(loc="lower right")

        plt.show()

from platiagro import save_figure
from platiagro import list_figures

plot_roc_curve(y_test,y_prob,labels)

save_figure(figure=plt.gcf())

## Save dataset

Adds the model result to the dataset and stores the transformed dataset in an object storage.
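
For `predict_proba`, one column is added per class, and its name is built from the class label. A quick sketch of that naming step follows, using the Iris class names as assumed example labels. The helper `result_column_name` is hypothetical; it replaces every character other than letters, digits, newlines and dots with an underscore.

```python
import re


def result_column_name(method, class_label):
    """Build a column name such as 'Logistic_predict_proba_<class>'."""
    raw = "Logistic_" + method + "_" + str(class_label)
    # Replace anything that is not a letter, digit, newline, or dot with "_".
    return re.sub(r"[^a-zA-Z0-9\n.]", "_", raw)


for label in ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]:
    print(result_column_name("predict_proba", label))
```

Sanitizing the label this way keeps the new column names free of spaces and punctuation such as hyphens.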


In [None]:
from platiagro import save_dataset
from re import sub

pipeline.fit(X, y)

new_columns = list()
if method == "predict_proba":
    y_pred = pipeline.predict_proba(X)
    for i, class_j in enumerate(labels):
        # raw string avoids the invalid escape sequence warning for '\.'
        new_columns.append(sub(r'[^a-zA-Z0-9\n\.]', '_', 'Logistic_' + method + '_' + str(class_j)))
        df[new_columns[i]] = y_pred[:, i]
else:
    y_pred = pipeline.predict(X)
    y_pred = label_encoder.inverse_transform(y_pred)
    new_columns.append('Logistic_'+ method )
    df[new_columns[0]] = y_pred

save_dataset(name=dataset, df=df)

## Save model

Stores the model artifacts in an object storage.<br>
It will make the model available for future deployments.

In [None]:
from platiagro import save_model

save_model(columns=columns,
           label_encoder=label_encoder,
           pipeline=pipeline,
           method=method,
           new_columns=new_columns)