# Logistic regression

As you already know, some regression models are used to classify rather than to predict continuous numerical values. Logistic regression is one of them, and in scikit-learn it is implemented in the `LogisticRegression` class.

Logistic regression is a supervised learning model commonly used for classification problems.

Given a set of features, logistic regression estimates the probability that an instance belongs to a particular class. This probability is then turned into a class label using a decision threshold; for a binary problem, `predict` effectively uses a threshold of 0.5.
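
To make the idea of "probability plus threshold" concrete, here is a minimal sketch written with plain NumPy. It assumes the usual formulation, in which a linear combination of the features is passed through the logistic (sigmoid) function. The values of `weights`, `bias`, and `X_demo` are hypothetical and chosen only for illustration; they do not come from a trained model.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical parameters for a model with two features (illustration only)
weights = np.array([1.5, -2.0])
bias = 0.25

# Three made-up instances, one per row
X_demo = np.array([[0.0, 0.0], [2.0, 0.5], [-1.0, 1.0]])

scores = X_demo @ weights + bias      # linear combination of the features
probs = sigmoid(scores)               # estimated P(class = 1 | x)
labels = (probs >= 0.5).astype(int)   # apply a decision threshold of 0.5

print(probs)
print(labels)
```

Raising or lowering the threshold changes which instances are labeled as positive without changing the estimated probabilities themselves.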

In [None]:
# Create a dataset
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=1000, random_state=42, noise=0.40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


Then, an object of the `LogisticRegression` class is instantiated:

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()


The model is then fitted to the training data using the `fit` method.

In [None]:
lr.fit(X_train, y_train)


Once trained, the model can be used to make predictions on the test data using the `predict` method.

In [None]:
y_pred = lr.predict(X_test)
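
As a possible next step, you may want to check how good those predictions are. The sketch below repeats the steps above so that it runs on its own, and then measures accuracy in two equivalent ways: with `accuracy_score` and with the classifier's own `score` method. Accuracy is just one reasonable choice of metric here, not the only one.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same dataset and split as in the cells above
X, y = make_moons(n_samples=1000, random_state=42, noise=0.40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Fraction of test samples whose predicted label matches the true label
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.3f}")

# For classifiers, score() computes the same mean accuracy
print(f"lr.score:      {lr.score(X_test, y_test):.3f}")
```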


There really isn't much more to it than that.

## `predict_proba`

In classification problems, most scikit-learn classifiers have a method called `predict_proba` that you can use to obtain an estimate of how likely an instance is to belong to each class.

For example, you can call the `predict_proba` method of our model on the test data:

In [None]:
probabilities = lr.predict_proba(X_test)
probabilities


In this case, since we are dealing with a binary classification problem, `probabilities` is a matrix with two columns: the first column holds the probability that the sample belongs to the negative class and the second the probability that it belongs to the positive class. The columns follow the order of `lr.classes_`, and each row sums to 1.
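
The following sketch checks the layout of that matrix and shows one way to turn the probabilities into labels with a threshold of your own. It rebuilds the model so that it runs on its own, and the threshold of 0.7 is a hypothetical value picked only for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same dataset and split as in the cells above
X, y = make_moons(n_samples=1000, random_state=42, noise=0.40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LogisticRegression().fit(X_train, y_train)
probabilities = lr.predict_proba(X_test)

print(lr.classes_)                      # class label behind each column
print(probabilities.shape)              # (n_test_samples, n_classes)
print(probabilities.sum(axis=1)[:5])    # each row sums to 1

# Probability of the positive class (label 1) is the second column
p_positive = probabilities[:, 1]

# Hypothetical stricter rule: predict 1 only when P(class 1) >= 0.7
strict_pred = (p_positive >= 0.7).astype(int)
default_pred = lr.predict(X_test)

print("Positives with predict():     ", default_pred.sum())
print("Positives with threshold 0.7: ", strict_pred.sum())
```

A stricter threshold labels fewer samples as positive, which typically trades recall for precision.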

Predicting probabilities instead of obtaining a hard classification is useful in some cases, for example when you want to rank instances by confidence or choose your own decision threshold. To learn more, I invite you to check the resources in this book.

Remember also that this method is not exclusive to logistic regression: most scikit-learn classifiers provide it, although a few (such as `LinearSVC`) do not.

## Arguments

The `LogisticRegression` class in scikit-learn has a large number of parameters that let you customize the model to the specific needs of your problem. Here are some common ones that I recommend you experiment with:

 - `penalty`: specifies the regularization norm to use in the model. Common options are `"l1"`, `"l2"`, and `"elasticnet"`. The default value is `"l2"`. Keep in mind that not every solver supports every penalty. In general, my recommendation is to try not to use `"l1"` with logistic regression.
 - `tol`: specifies the tolerance used to detect convergence of the optimization algorithm. Since this is an iterative algorithm, a tolerance value lets training stop once the values no longer change enough between iterations.
 - `max_iter`: continuing on the topic of iterations, it is also possible to set a maximum number of them.
 - `C`: a positive value that controls how strongly regularization is applied. `C` has the peculiarity of being the inverse of the regularization strength: the smaller this value, the stronger the regularization.
 - `class_weight`: this argument is useful when the classes in your data are imbalanced. With `class_weight="balanced"`, each class is weighted inversely to its frequency in the training data.
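
To see the inverse effect of `C` in practice, the sketch below fits the same model with three values of `C` and prints the norm of the learned coefficients. The specific values of `C`, `max_iter`, and `tol` are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same dataset and split as at the beginning of this section
X, y = make_moons(n_samples=1000, random_state=42, noise=0.40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

coef_norms = {}
for C in [0.001, 1.0, 1000.0]:
    # max_iter caps the number of solver iterations; tol is the stopping tolerance
    model = LogisticRegression(C=C, max_iter=1000, tol=1e-4)
    model.fit(X_train, y_train)

    # A smaller norm means the coefficients were shrunk more strongly
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(
        f"C={C:<8} coef norm={coef_norms[C]:.3f} "
        f"iterations={model.n_iter_[0]} "
        f"test accuracy={model.score(X_test, y_test):.3f}"
    )
```

The smallest `C` should produce the smallest coefficients, because that is where regularization is strongest.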

## Example of using `class_weight`

In [None]:
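# classification_report_comparison comes from a local utils module, not from scikit-learn
from utils import classification_report_comparison
from sklearn.linear_model import LogisticRegression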

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Create an imbalanced classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5,
                           weights=[0.9], random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Fit Logistic Regression with class_weight='balanced'
balanced_lr = LogisticRegression(class_weight='balanced')
balanced_lr.fit(X_train, y_train)

# Fit a second model with the default (uniform) class weights for comparison
vanilla_lr = LogisticRegression()
vanilla_lr.fit(X_train, y_train)


# Make predictions on the testing set
balanced_y_pred = balanced_lr.predict(X_test)
vanilla_y_pred = vanilla_lr.predict(X_test)

classification_report_comparison(y_test, {"Balanced": balanced_y_pred, "No balance": vanilla_y_pred})
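
If you do not have the `utils` helper at hand, a rough alternative is sketched below using only scikit-learn. It also shows what `"balanced"` means numerically: each class gets the weight `n_samples / (n_classes * count_of_that_class)`, which `compute_class_weight` returns directly. Comparing recall on the minority class is just one reasonable way to look at the effect and does not reproduce the helper's output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Same imbalanced dataset and split as in the cell above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Weights that class_weight="balanced" assigns to each class
classes = np.unique(y_train)
balanced_weights = compute_class_weight("balanced", classes=classes, y=y_train)

# The same weights computed by hand
manual_weights = len(y_train) / (len(classes) * np.bincount(y_train))
print(dict(zip(classes.tolist(), balanced_weights.tolist())))

# Compare recall on the minority class (label 1) for both settings
for name, cw in [("No balance", None), ("Balanced", "balanced")]:
    model = LogisticRegression(class_weight=cw).fit(X_train, y_train)
    minority_recall = recall_score(y_test, model.predict(X_test), pos_label=1)
    print(f"{name:<10} minority-class recall = {minority_recall:.3f}")
```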


```{hint}
As an exercise, why don't you try playing a bit more with the parameters? Use the `classification_report_comparison` function to compare the results.
```