
**Logistic Regression**

Logistic regression and its extensions, like multinomial logistic
regression, allow us to predict the probability that an observation is of a certain class
using a straightforward and well-understood approach.

1. Training a Binary Classifier

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
# Load data with only two classes
iris = datasets.load_iris()
features = iris.data[:100,:]
target = iris.target[:100]

# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
# Create logistic regression object
logistic_regression = LogisticRegression(random_state=0)
# Train model
model = logistic_regression.fit(features_standardized, target)

# Despite its name, logistic regression is a widely used binary classifier
# (i.e., the target vector can only take two values).
# A linear model is passed through the logistic (sigmoid) function, so the
# output can be read as a probability between 0 and 1.


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [11]:
# Create new observation
new_observation = [[.5, .5, .5, .5]]
# Predict class
model.predict(new_observation)

# View predicted probabilities
model.predict_proba(new_observation)

array([[0.17738424, 0.82261576]])
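
The note above that the linear model sits inside the logistic (sigmoid) function can be checked by hand. The following is a minimal sketch, assuming the same two-class Iris setup as in this section; the variable names `linear_output`, `sigmoid_probability`, and `library_probability` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Rebuild the two-class model from this section
iris = datasets.load_iris()
features = iris.data[:100, :]
target = iris.target[:100]
features_standardized = StandardScaler().fit_transform(features)
model = LogisticRegression(random_state=0).fit(features_standardized, target)

new_observation = np.array([[.5, .5, .5, .5]])

# Linear part: intercept plus weighted sum of the features
linear_output = new_observation @ model.coef_.T + model.intercept_

# Logistic (sigmoid) function applied to the linear part
sigmoid_probability = 1 / (1 + np.exp(-linear_output))

# Probability of class 1 as reported by scikit-learn
library_probability = model.predict_proba(new_observation)[:, 1]

print(sigmoid_probability.ravel(), library_probability)
```

Both printed values should agree, which shows that `predict_proba` for a binary model is the sigmoid of the fitted linear function.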

2. Training a Multiclass Classifier

In [12]:
# If there are more than two classes, we need to train a multiclass classifier
# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

# Create one-vs-rest logistic regression object
logistic_regression = LogisticRegression(random_state=0, multi_class="ovr")
# Train model
model = logistic_regression.fit(features_standardized, target)

# On its own, logistic regression is only a binary classifier, meaning it cannot handle
# target vectors with more than two classes. However, two extensions to logistic
# regression do just that. First, in one-vs-rest logistic regression (OVR) a separate model is
# trained for each class to predict whether an observation is that class or not (thus making it
# a binary classification problem). Second, multinomial logistic regression replaces the
# logistic function with a softmax function, so a single model outputs a probability for every class.
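
The one-vs-rest idea can also be made explicit with scikit-learn's `OneVsRestClassifier` wrapper, which is useful because recent scikit-learn releases deprecate the `multi_class` argument. This is a sketch of an alternative to the cell above, not a replacement prescribed by the notebook; the names `ovr` and `ovr_model` are illustrative.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Load and standardize the full three-class Iris data
iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# Wrap a binary logistic regression so one model is trained per class
ovr = OneVsRestClassifier(LogisticRegression(random_state=0))
ovr_model = ovr.fit(features_standardized, target)

# One fitted binary classifier per class
print(len(ovr_model.estimators_))

# Class probabilities for a new observation (normalized to sum to 1)
new_observation = [[.5, .5, .5, .5]]
probabilities = ovr_model.predict_proba(new_observation)
print(probabilities)
```

Each entry of `estimators_` is an ordinary binary `LogisticRegression` that separates one class from the other two.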

3. Reducing Variance Through Regularization

In [20]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

# Create cross-validated logistic regression object
logistic_regression = LogisticRegressionCV(
    penalty='l2', Cs=10, random_state=0, n_jobs=-1)
# Train model
model = logistic_regression.fit(features_standardized, target)

# Regularization is a method of penalizing complex models to reduce their variance.
# Specifically, a penalty term is added to the loss function we are trying to minimize,
# typically the L1 or the L2 penalty.
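
For reference, the two penalty terms mentioned above are commonly written as follows. The symbols are assumptions introduced here and do not appear in the notebook: \(\hat\beta_j\) is the coefficient of the \(j\)-th of \(p\) features and \(\alpha\) is the regularization strength.

```latex
\text{L1 penalty: } \alpha \sum_{j=1}^{p} \left| \hat\beta_j \right|
\qquad
\text{L2 penalty: } \alpha \sum_{j=1}^{p} \hat\beta_j^{\,2}
```

In scikit-learn's logistic regression the strength is expressed through `C`, which plays the role of \(1/\alpha\): a smaller `C` means a heavier penalty.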

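# In scikit-learn, C is the inverse of the regularization strength: smaller values of C
# mean stronger regularization. To reduce variance while using logistic regression, we can
# treat C as a hyperparameter and tune it to find the value that creates the best model.
# The LogisticRegressionCV class tunes C efficiently with cross-validation; Cs=10 above
# generates 10 candidate values on a logarithmic scale between 1e-4 and 1e4.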
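
To see what tuning C does in practice, the sketch below compares coefficient sizes under strong and weak regularization and then reads back the value that `LogisticRegressionCV` selected. The specific values `C=0.01` and `C=100.0`, the raised `max_iter`, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# Small C = strong regularization, large C = weak regularization
strong = LogisticRegression(C=0.01, max_iter=1000, random_state=0)
weak = LogisticRegression(C=100.0, max_iter=1000, random_state=0)
strong_norm = np.linalg.norm(strong.fit(features_standardized, target).coef_)
weak_norm = np.linalg.norm(weak.fit(features_standardized, target).coef_)
print(strong_norm, weak_norm)

# LogisticRegressionCV stores the selected value(s) of C in the C_ attribute
cv_model = LogisticRegressionCV(Cs=10, max_iter=1000, random_state=0)
cv_model.fit(features_standardized, target)
print(cv_model.C_)
```

The strongly regularized model should have noticeably smaller coefficients, which is the mechanism by which regularization reduces variance.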

# Unfortunately, LogisticRegressionCV takes a single penalty argument, so comparing
# different penalty terms generally means using a more general model selection
# tool such as GridSearchCV.
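
One way to search over penalty terms as well as C is a plain grid search. This is a sketch under stated assumptions rather than the notebook's method: the `saga` solver is chosen because it supports both L1 and L2 penalties, and the grid of C values and `max_iter=5000` are arbitrary illustrative choices.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# saga supports both the L1 and the L2 penalty
logistic_regression = LogisticRegression(
    solver="saga", max_iter=5000, random_state=0)

# Candidate penalty types and regularization strengths (illustrative grid)
hyperparameters = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validated search over every combination
grid_search = GridSearchCV(logistic_regression, hyperparameters, cv=5)
best_model = grid_search.fit(features_standardized, target)

print(best_model.best_params_)
```

This is slower than `LogisticRegressionCV` because every combination is refit from scratch, but it places no restriction on which hyperparameters are searched.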


4. Training a Classifier on Very Large Data

In [14]:
# Use the stochastic average gradient (SAG) solver, which can train much faster
# on very large data but is sensitive to feature scaling

# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
# Create logistic regression object
logistic_regression = LogisticRegression(random_state=0, solver="sag")
# Train model
model = logistic_regression.fit(features_standardized, target)

# scikit-learn's default solver (the technique used to train the model) works well in most
# cases, and the library raises an error when a solver does not support the options we
# ask for (e.g., a particular penalty).

# Stochastic average gradient (SAG) descent can train a model much faster than other
# solvers when our data is very large. However, it is also very sensitive to feature
# scaling, so standardizing our features is particularly important. We select this
# solver by setting solver='sag'.
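
Because SAG depends so heavily on feature scaling, it can help to bundle the scaler and the model so that scaling is never skipped. The sketch below uses `make_pipeline` for that purpose; the pipeline itself, the raised `max_iter`, and the name `sag_pipeline` are additions for illustration and not part of the notebook.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features = iris.data
target = iris.target

# Standardization and the SAG-based model travel together,
# so new data is always scaled the same way before prediction
sag_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0, solver="sag", max_iter=1000),
)
sag_pipeline.fit(features, target)

# Training accuracy on the raw (unscaled) features passed through the pipeline
training_accuracy = sag_pipeline.score(features, target)
print(training_accuracy)
```

Iris is far too small to show any speed benefit; the point is only the pattern of keeping the scaler attached to the solver.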



5. Handling Imbalanced Classes

In [19]:
# If we have highly imbalanced classes and have not addressed this during preprocessing,
# we can use the class_weight parameter to weight the classes so that each one
# contributes equally to training.
# Specifically, class_weight="balanced" automatically weights classes
# inversely proportional to their frequency.


# Load libraries
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Make class highly imbalanced by removing first 40 observations
features = features[40:,:]
target = target[40:]
# Create target vector indicating if class 0, otherwise 1
target = np.where((target == 0), 0, 1)
# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
# Create logistic regression object with balanced class weights
logistic_regression = LogisticRegression(random_state=0, class_weight="balanced")
# Train model
model = logistic_regression.fit(features_standardized, target)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
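
The weighting "inversely proportional to class frequency" described above can be reproduced by hand. The sketch below applies the formula n_samples / (n_classes * class_count) to the imbalanced target from this section and compares it with scikit-learn's `compute_class_weight` helper; the names `manual_weights` and `library_weights` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.utils.class_weight import compute_class_weight

# Rebuild the imbalanced binary target: 10 observations of class 0, 100 of class 1
target = datasets.load_iris().target[40:]
target = np.where(target == 0, 0, 1)

classes = np.unique(target)

# Balanced weight for each class: n_samples / (n_classes * class_count)
manual_weights = len(target) / (len(classes) * np.bincount(target))

# The same weights computed by scikit-learn
library_weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=target)

# class 0 -> 5.5, class 1 -> 0.55
print(manual_weights, library_weights)
```

The rare class receives a weight ten times larger than the common one, so errors on it cost correspondingly more during training.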