# Logistic Regression with scikit-learn

Logistic regression is another technique borrowed by machine learning from the field of statistics.

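It is the go-to method for **binary classification problems** (problems with two class values), and scikit-learn's implementation also handles problems with more than two classes.

I'll be using the [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) to perform Logistic Regression. It has three classes, so this is a multiclass problem.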

See scikit-learn's [Logistic Regression documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for the full list of parameters.


<br>

#### Classification videos:
- Data Professor: [Machine Learning in Python: Building a Classification Model](https://www.youtube.com/watch?v=XmSlFPDjKdc&ab_channel=DataProfessor)
- Codebasics: [Machine Learning Tutorial Python - 8: Logistic Regression (Binary Classification)](https://www.youtube.com/watch?v=zM4VZR0px8E&ab_channel=codebasics)

Learn more about Logistic Regression on the [Machine Learning Mastery](https://machinelearningmastery.com/logistic-regression-for-machine-learning/) website.
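As a rough illustration of the idea behind the binary case, the model computes a linear score from the features and passes it through the sigmoid function to get a probability. The sketch below uses made-up weights, bias, and sample values (they are not fitted to any data) purely to show the mechanics.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a two-feature binary problem (not fitted)
weights = np.array([0.8, -0.4])
bias = 0.1

x = np.array([2.0, 1.0])            # one illustrative sample
score = np.dot(weights, x) + bias   # linear part: w . x + b
prob = sigmoid(score)               # estimated probability of the positive class
print(prob)
```

A score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual decision threshold.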

In [1]:
# Scikit-learn libraries
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

## 1. Load dataset

In [2]:
# Load iris dataset
iris = datasets.load_iris()

## 2. Input vs Output features

In [3]:
# Feature names
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [4]:
# Target names
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


In [5]:
# First 10 rows of iris data (this is our X variable)
iris.data[:10, :]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [6]:
# Iris target values (this is our y variable)
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

### 2.1 Assigning input and output variables

In [7]:
X = iris.data
y = iris.target

### 2.2 Data dimensions

In [8]:
print('X shape:', X.shape)
print('y shape:', y.shape)

X shape: (150, 4)
y shape: (150,)
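Besides the shapes, it can be useful to check how many samples fall into each class. A minimal sketch using `np.unique` (NumPy is an extra import that the notebook above does not use):

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Count how many samples belong to each class label
classes, counts = np.unique(iris.target, return_counts=True)
for label, count in zip(classes, counts):
    print(iris.target_names[label], count)
```

The target array shown earlier has the same number of samples for every class, so the dataset is balanced.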


## 3. Build Classification Model using Logistic Regression

In [9]:
# Instantiate the logistic regression model, allowing up to 1000 solver iterations
lr = LogisticRegression(max_iter=1000)

In [10]:
# Fit model
lr.fit(X, y)

LogisticRegression(max_iter=1000)
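`max_iter` is only an upper bound on the number of solver iterations. One way to see how many were actually needed is the fitted model's `n_iter_` attribute. This is a small sketch that refits the same model on the full dataset; the exact count may vary between scikit-learn versions.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)

lr = LogisticRegression(max_iter=1000)
lr.fit(X, y)

# Number of iterations the solver actually ran (an array, at most max_iter)
print(lr.n_iter_)
```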

## 4. Coefficients and Intercept

In [11]:
print(lr.coef_)

[[-0.42434519  0.96692807 -2.51720846 -1.07938946]
 [ 0.53499003 -0.32132698 -0.20620328 -0.94424639]
 [-0.11064484 -0.64560109  2.72341174  2.02363584]]


In [12]:
print(lr.intercept_)

[  9.85512128   2.23277161 -12.08789289]
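Each row of `coef_` and each entry of `intercept_` belongs to one class. The sketch below shows how they combine into one linear score per class and checks the result against `decision_function`. The softmax step assumes a recent scikit-learn version, where the default solver fits a multinomial model for multiclass targets; under that assumption the normalised scores should agree with `predict_proba`.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)
lr = LogisticRegression(max_iter=1000).fit(X, y)

# One linear score per class: X (150, 4) times coef_.T (4, 3) plus intercept_ (3,)
scores = X @ lr.coef_.T + lr.intercept_
print(np.allclose(scores, lr.decision_function(X)))

# Softmax turns the scores into probabilities that sum to 1 per sample
# (subtracting the row maximum first keeps the exponentials numerically stable)
shifted = scores - scores.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
print(np.allclose(probs, lr.predict_proba(X)))
```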


## 5. Make prediction

In [13]:
print(X[0])

[5.1 3.5 1.4 0.2]


In [14]:
print(lr.predict(X[[0]]))

[0]


In [15]:
print(lr.predict_proba([X[0]]))

[[9.81588489e-01 1.84114969e-02 1.45146963e-08]]
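The predicted class is simply the one with the highest probability. A short sketch, refitting the same model so the block runs on its own, that recovers `predict` from `predict_proba` through the `classes_` attribute:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)
lr = LogisticRegression(max_iter=1000).fit(X, y)

proba = lr.predict_proba(X[:5])            # shape (5, 3), one column per class
best_column = np.argmax(proba, axis=1)     # index of the most probable class
pred_from_proba = lr.classes_[best_column] # map column index back to class label

print(pred_from_proba)
print(lr.predict(X[:5]))
```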


### 5.1. Rebuild model to see the names of the labels in the predictions

In [16]:
lr.fit(iris.data, iris.target_names[iris.target])

LogisticRegression(max_iter=1000)

In [17]:
print(lr.predict(X[[0]]))
# Sample 77 is labelled versicolor in iris.target, so the second prediction is a misclassification
print(lr.predict(X[[77]]))

['setosa']
['virginica']
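Refitting on string labels is one option. An alternative sketch is to keep the model trained on the integer targets and translate its predictions afterwards by indexing into `iris.target_names`:

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X, y = iris.data, iris.target

lr = LogisticRegression(max_iter=1000).fit(X, y)

# Integer predictions index directly into the array of class names
pred = lr.predict(X[[0, 77]])
names = iris.target_names[pred]
print(pred, names)
```

This keeps `y` numeric, which is convenient when the same targets are reused for the train/test split later on.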


## 6. Data split (80/20 ratio)

In [18]:
# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [19]:
# 80% goes into the training set
X_train.shape, y_train.shape

((120, 4), (120,))

In [20]:
# 20% goes into the test set
X_test.shape, y_test.shape

((30, 4), (30,))
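A plain random split does not guarantee that each class is equally represented in the test set. If that matters, `train_test_split` accepts a `stratify` argument. This sketch is a variation on the split above, not what the rest of the notebook uses, and the `_s` suffixes are just illustrative names:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print(np.bincount(y_train_s))  # samples per class in the training set
print(np.bincount(y_test_s))   # samples per class in the test set
```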

## 7. Rebuilding the Logistic Regression model

In [21]:
lr.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

### 7.1. Perform prediction on single sample from the data set

In [22]:
print(lr.predict([X[0]]))

[0]


In [23]:
print(lr.predict_proba([X[0]]))

[[9.79095901e-01 2.09040102e-02 8.91877882e-08]]


### 7.2. Perform prediction on the test set
#### _Predicted class labels_

In [24]:
print(lr.predict(X_test))

[0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 2 0 2 1 0 0 1 2]


#### _Actual class labels_

In [25]:
print(y_test)

[0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2]
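Comparing the two arrays by eye gets tedious quickly. A small sketch that locates the disagreements programmatically, rebuilding the same split and model so it runs on its own:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Positions in the test set where the prediction differs from the true label
wrong = np.flatnonzero(y_pred != y_test)
print('Misclassified positions:', wrong)
print('Predicted:', y_pred[wrong])
print('Actual:   ', y_test[wrong])
```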


## 8. Model Performance

In [26]:
print(lr.score(X_test, y_test))

0.9666666666666667
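For a classifier, `score` reports accuracy, the fraction of correct predictions. The sketch below computes the same number with `accuracy_score` and adds a confusion matrix to show which classes get mixed up. It rebuilds the split and model so it is self-contained.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Accuracy: same value that lr.score(X_test, y_test) returns
acc = accuracy_score(y_test, y_pred)
print('Accuracy:', acc)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)
```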


## 9. Comparing with Random Forest Classifier

In [27]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate model
rf = RandomForestClassifier()

# Fit model
rf.fit(X_train, y_train)

# Make prediction
rf.predict(X_test)

# Model Performance
print(rf.score(X_test, y_test))

0.9666666666666667
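A single 30-sample test set is a small basis for comparing two models. One possible way to get a steadier comparison is 5-fold cross-validation with `cross_val_score`. In this sketch, `cv=5` and `random_state=0` are my own choices for illustration, not settings used elsewhere in the notebook.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Accuracy on each of 5 folds for both models
lr_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
rf_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

print('Logistic Regression:', lr_scores.mean())
print('Random Forest:      ', rf_scores.mean())
```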


# In Summary

In [29]:
# Scikit-learn libraries
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load iris dataset
iris = datasets.load_iris()

# Assigning input and output variables
X = iris.data
y = iris.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Logistic Regression model
# 1. Define the logistic regression model
# 2. Fit model
# 3. Apply model to make prediction (on test set)
lr = LogisticRegression(max_iter=1000) # step 1
lr.fit(X_train, y_train) # step 2
lr.predict(X_test) # step 3

# Print results
print('Coefficients:', lr.coef_)
print('Intercept:', lr.intercept_)
print('Score', lr.score(X_test, y_test))

Coefficients: [[-0.4345712   0.8243118  -2.35072311 -0.96749421]
 [ 0.61906814 -0.42736808 -0.20574344 -0.83176201]
 [-0.18449694 -0.39694372  2.55646655  1.79925622]]
Intercept: [  9.50176119   1.63227362 -11.13403481]
Score 0.9666666666666667
