In [None]:
%matplotlib inline

# Meet the Data

We will use the Iris dataset, which contains 4 features/measurements (*petal length, petal width, sepal length, sepal width*) for **50 samples from each of 3 species of Iris flower (*Iris setosa, Iris versicolor, Iris virginica*), i.e. 50×3 = 150 samples in total**. It is one of the best-known datasets in machine learning. For more details on the dataset, visit <a href="http://archive.ics.uci.edu/ml/datasets/Iris">UCI Machine Learning Repository - IRIS dataset</a>.

For reference, here are pictures of the three flower species:
<img src="images/iris-machinelearning.png" width =50%>

Here is a picture showing the 4 measurements made from each flower:
<img src="images/iris_flower_petal_sepal.jpeg" width =20%>

We can load it into our program from the `datasets` module of `scikit-learn`. To load the data, we use the `load_iris()` function.

In [None]:
# Load iris dataset #
from sklearn.datasets import load_iris
iris_dataset = load_iris()

The `load_iris()` function returns a `Bunch` object, which behaves like a `dict` of key-value pairs.

In [None]:
# Inside iris dataset #
print("Keys of iris_dataset:", iris_dataset.keys())

*Details of the `iris_dataset` keys & values*

- `DESCR` - short description of the dataset
- `feature_names` - the names of the 4 measurements made from each flower
- `target_names` - the names of the 3 flower species that we are going to work with
- `data` - numeric measurements of sepal length, sepal width, petal length and petal width as a NumPy array (all in cm)
- `target` - the species that each flower belongs to
    - `0` means Setosa
    - `1` means Versicolor
    - `2` means Virginica
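
As a quick check of the keys listed above, the sketch below maps the integer codes in `target` to the names in `target_names` and counts the samples per species. The variable names `species`, `names`, and `counts` are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Each entry of `target` is an index into `target_names`
species = iris_dataset['target_names'][iris_dataset['target']]

# Count how many flowers belong to each species
names, counts = np.unique(species, return_counts=True)
for name, count in zip(names, counts):
    print(name, count)

# Rows are flowers, columns are the 4 measurements
print("Data shape:", iris_dataset['data'].shape)
```

Indexing `target_names` with the whole `target` array is a compact way to turn class codes into readable labels.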

In [None]:
# Target flower species names #
print("Target names:", iris_dataset['target_names'])

In [None]:
# Feature measurements made from each flower #
print("Feature names:", iris_dataset['feature_names'])

In [None]:
# Shape of data containing numeric measurements (aka features) of 150 flowers #
print("Shape of data:", iris_dataset['data'].shape)

In [None]:
# Feature data from the first 5 rows (flower entries) #
print("First five rows of data:\n", iris_dataset['data'][:5])

In [None]:
# target is a 1D array that records which species each flower belongs to #
print("Shape of target:", iris_dataset['target'].shape)

In [None]:
# Let's see the species/ target class of all the flowers #
print("Target:\n", iris_dataset['target'])

In [None]:
# Prints the flowers measurements data (features) and corresponding target (class label)
i = 0
print ("Feature:", iris_dataset['data'][i])
print ("Label  :",iris_dataset['target'][i])

# Logistic Regression

## 1 input feature and 2 classes

In [None]:
# Let's load only the first feature (sepal length) out of 4 features => Univariate #
# Let's use only the first two classes => Binary Classification #
data = iris_dataset['data'][:100][:, :1]
target = iris_dataset['target'][:100]

In the previous post, we learned that for <b>regression</b>, the prediction formula for <b>a linear model with one input feature (variable)</b> is

- `ŷ = w[1]*x[1] + b = w*x + b`

<p>Linear models can also be used for <b>classification</b>. For <span title="Classification with 2 types of categories (0 or 1) | (+ve or -ve)" style="text-decoration:none;color:black;border: 1px; border-style:dotted; border-color:gray;">binary classification</span>, the prediction formula for <b>a linear model with one input feature (variable)</b> is</p>

- `ŷ = w[1]*x[1] + b > 0` or `w*x + b > 0`

The formula looks very similar to the one for linear regression, except that we threshold the predicted value, i.e. the weighted sum of features (`w*x + b`), at zero.

<p style="color:blue; font-size:110%;  border-left: 5px solid blue; padding-left: 10px;">If the predicted output is <b>smaller than 0</b>, we predict the input as <b>-ve class</b></p>
<p style="color:purple; font-size:110%;  border-left: 5px solid blue; padding-left: 10px;">If the predicted output is <b>greater than 0</b>, we predict the input as <b>+ve class</b></p>

Here,
- `x` denotes the input features (`x[1]` is the input feature of a single data point),
- `ŷ` denotes the classification made by the model (`>0 = +ve class` or `<0 = -ve class`),
- `w` is the slope (*weight or coefficient*) along each feature axis (`w[1]` is the slope along axis `x[1]`), and
- `b` is the intercept (offset) along the `y-axis`.

The set of points where `w*x + b = 0` is called the **decision boundary** in logistic regression. Inputs on one side of it are assigned to one class, and inputs on the other side to the other class.

<ul style="color:purple; font-size:110%;  border-left: 5px solid blue; padding-left: 20px; list-style: none; line-height: 190%;"> For a binary linear classifier, the two classes are separated by a <font size="+2">Decision Boundary</font> which could be a line, a plane, or a hyperplane.</ul>
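
A minimal sketch of this thresholding rule in plain Python follows. The weight `w = 2.0` and intercept `b = -11.0` are made-up values for illustration, not coefficients learned from the Iris data, and `predict_class` is a hypothetical helper.

```python
# Hypothetical parameters, chosen only for illustration
w = 2.0    # slope (weight) for the single input feature
b = -11.0  # intercept (offset)

def predict_class(x):
    """Return 1 (+ve class) if w*x + b > 0, else 0 (-ve class)."""
    return 1 if w * x + b > 0 else 0

# With one feature, the decision boundary is the single point where w*x + b = 0
boundary = -b / w

for x in [4.8, 5.4, 5.6, 6.3]:
    print(x, "->", predict_class(x))
print("Decision boundary at x =", boundary)
```

With one input feature the boundary is a point on the feature axis; with two features it becomes a line, and with more features a plane or hyperplane.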

In [None]:
# Let's plot the feature and their corresponding class #
import matplotlib.pyplot as plt
import numpy as np

import matplotlib as mpl
inline_rc = dict(mpl.rcParams)

plt.plot(data[target==0], np.zeros(np.sum(target==0)), 'x')
plt.plot(data[target==1], np.ones(np.sum(target==1)), 'x')

plt.legend(['class 0: setosa', 'class 1: versicolor'])
plt.grid()
plt.xlabel('Feature: sepal length (cm)')
plt.ylabel('Target class or label')

plt.title("One input feature")

plt.show()

In [None]:
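# Let's split the data into train (75%) and test (25%) #
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(data, target, random_state=0)

print('Train set:', X_train.shape, Y_train.shape)
print('Test set :', X_test.shape, Y_test.shape)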

Reference: <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">sklearn.linear_model.LogisticRegression</a>

The most commonly used linear classification algorithm is logistic regression. Keep in mind that, despite its name, logistic regression is a classification algorithm and not a regression algorithm.
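
To make this concrete, the sketch below fits a logistic regression on one feature and two classes and checks that the model produces a probability, which is then thresholded into a class label. It assumes the first 100 Iris rows (setosa and versicolor) and sepal length only; `model`, `z`, and `prob` are illustrative names. The probability is the sigmoid of the weighted sum, so it exceeds 0.5 exactly when `w*x + b > 0`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris['data'][:100, :1]   # sepal length only, first two classes
y = iris['target'][:100]

model = LogisticRegression().fit(X, y)

# Weighted sum w*x + b for every sample
z = X @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid squashes the weighted sum into a probability for class 1
prob = 1.0 / (1.0 + np.exp(-z))

# Compare the hand-computed values with what the model reports
print("Probabilities match:", np.allclose(prob, model.predict_proba(X)[:, 1]))
print("Labels match       :", np.array_equal((z > 0).astype(int), model.predict(X)))
```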



In [None]:
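# Let's train the LR model #
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear').fit(X_train, Y_train)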



In [None]:
# Predict from the learned model #
print("Predicted class labels for the test set:", lr.predict(X_test))

In [None]:
# For the specified input data in variable 'i', let's see the actual class label vs predicted class label
i = 0
print ("For input X =", X_train[i],
       "\nPredicted probability =", lr.predict_proba(X_train[i:i+1])[0,1],
       "\nPredicted class label =", lr.predict(X_train[i:i+1])[0],
       "\nActual class label =", Y_train[i])

In [None]:
### Compute decision boundary and plot with the input data ###
x = np.arange(4,8,.005).reshape(-1,1)
preds = lr.predict(x)  # predicted class labels (0 or 1), not probabilities

plt.figure(figsize=(8,4))

plt.plot(data[target==0], np.zeros(np.sum(target==0)), 'x')
plt.plot(data[target==1], np.ones(np.sum(target==1)), 'x')

# Shade the range of x values assigned to each class
plt.axvspan(np.min(x[preds==0]), np.max(x[preds==0]), alpha=0.2)
plt.axvspan(np.min(x[preds==1]), np.max(x[preds==1]), alpha=0.2, color='orange')

plt.legend(['class 0: actual data', 'class 1: actual data', 'class 0', 'class 1'])
plt.grid()
plt.xlabel('Feature: sepal length (cm)')
plt.ylabel('Target class or label')
plt.xlim(4,8)
plt.title('Decision boundary for class 0 and class 1')

plt.show()

In [None]:
# train and test accuracy of the model #
print("Train set accuracy: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set accuracy : {:.2f}".format(lr.score(X_test, Y_test)))

## All 4 input features and 2 classes

In [None]:
# Let's load all the four features => Multivariate #
# Let's use only the first two classes => Binary Classification #
data = iris_dataset['data'][:100][:,:]
target = iris_dataset['target'][:100]

# Split the data into train and test #
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(data, 
                                                    target, 
                                                    random_state=0)
print ('Train set:', X_train.shape, Y_train.shape)
print ('Test set :', X_test.shape, Y_test.shape)

# Train the LR model #
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear').fit(X_train,Y_train)


# Train and Test Accuracy #
print("Train set accuracy: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set accuracy : {:.2f}".format(lr.score(X_test, Y_test)))

## All 4 input features and all 3 classes

In [None]:
# Let's load all the four features => Multivariate #
# Let's use all the 3 classes=> Multiclass Classification #
data = iris_dataset['data']
target = iris_dataset['target']

# Split the data into train and test #
X_train, X_test, Y_train, Y_test = train_test_split(data, 
                                                    target, 
                                                    random_state=0)
print ('Train set:', X_train.shape, Y_train.shape)
print ('Test set :', X_test.shape, Y_test.shape)

In [None]:
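### Scatter matrix plot for all the 4 features ###
import pandas as pd

# create pandas dataframe from data in X_train
# label the 4 columns using the iris_dataset feature names
iris_df = pd.DataFrame(X_train, columns=iris_dataset['feature_names'])

# create a scatter matrix from dataframe
# color the scatter using Y_train target label/class data
pd.plotting.scatter_matrix(iris_df, c=Y_train, figsize=(10, 10), marker='o')
plt.show()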




In [None]:
# Alternatively, we can use seaborn library to make the scatter plot #
import seaborn as sns
sns.set(style="ticks")

df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")
plt.show()

In [None]:
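# Train the LR model for multiclass (all 3 classes) #
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=1000).fit(X_train, Y_train)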

In [None]:
# For the specified input data in variable 'i', let's see the actual class label vs predicted class label #
i = 1
print ("For input X =", X_train[i],
       "\nPredicted probability =", lr.predict_proba(X_train[i:i+1]),
       "\nPredicted class label =", lr.predict(X_train[i:i+1]),
       "\nActual class label =", Y_train[i])

In [None]:
# Train and Test Accuracy #
print("Train set accuracy: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set accuracy : {:.2f}".format(lr.score(X_test, Y_test)))

### Cross Validation

Reference : <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV">sklearn.linear_model.LogisticRegressionCV</a>

`LogisticRegressionCV` tunes the model's inverse regularization strength `C` internally using cross-validation and keeps the setting that gives the best cross-validated score.
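
As a rough sketch of what the tuning involves, the example below fits `LogisticRegressionCV` on the Iris training split and prints the candidate values it searched and the one it kept. The name `model` and the setting `max_iter=1000` (used only to give the solver room to converge) are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris['data'], iris['target'], random_state=0)

# cv=3: each candidate C is scored with 3-fold cross-validation on the training set
model = LogisticRegressionCV(cv=3, max_iter=1000).fit(X_train, Y_train)

print("Candidate C values:", model.Cs_)
print("Selected C        :", np.atleast_1d(model.C_))
print("Test set accuracy : {:.2f}".format(model.score(X_test, Y_test)))
```

The test set plays no part in choosing `C`; it is only used at the end to estimate how well the tuned model generalizes.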

In [None]:
# Train a LR model with auto tuning based on CV #
from sklearn.linear_model import LogisticRegressionCV
lr = LogisticRegressionCV(verbose=0, cv=3).fit(X_train, Y_train)

In [None]:
# Coefficients learned by the model, train and test accuracy #
print ("Coefficients:", lr.coef_, "\nIntercept   :",lr.intercept_)

print("Train set accuracy: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set accuracy : {:.2f}".format(lr.score(X_test, Y_test)))

### Standardization

Reference : <a href="https://scikit-learn.org/stable/modules/preprocessing.html">sklearn.preprocessing.StandardScaler</a>

Standardization of datasets is a common requirement for many machine learning estimators (models) implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data, i.e. Gaussian with zero mean and unit variance.
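
The transformation itself is a z-score: subtract the feature mean and divide by the feature standard deviation. The sketch below computes it by hand with NumPy and compares the result with `StandardScaler`; the names `X_train_manual` and `X_test_manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test = train_test_split(iris['data'], random_state=0)

# Statistics are estimated on the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# z-score: subtract the mean, divide by the standard deviation
X_train_manual = (X_train - mean) / std
X_test_manual = (X_test - mean) / std   # test data reuses the training statistics

scaler = StandardScaler().fit(X_train)
print("Train matches:", np.allclose(X_train_manual, scaler.transform(X_train)))
print("Test matches :", np.allclose(X_test_manual, scaler.transform(X_test)))
```

Because the test set is scaled with the training mean and standard deviation, its own mean and standard deviation will be close to, but not exactly, 0 and 1.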

In [None]:
# Standardise the Train and Test data #
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train) # Scaled data has zero mean and unit variance #
X_test_scaled = scaler.transform(X_test)

In [None]:
# Train and Test scaled info #
print ("Train :", X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print ("Test  :", X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))

In [None]:
# Train the LR model on scaled data #
lr = LogisticRegressionCV(verbose=0, cv=3).fit(X_train_scaled, Y_train)

In [None]:
# Coefficients learned by the model, Train and Test Accuracy #
print ("Coefficients:", lr.coef_, "\nIntercept   :",lr.intercept_)

print("Train set accuracy: {:.2f}".format(lr.score(X_train_scaled, Y_train)))
print("Test set accuracy : {:.2f}".format(lr.score(X_test_scaled, Y_test)))