# Scikit-Learn Python Library


***

This notebook is an overview of the [scikit-learn Python library](https://scikit-learn.org/stable/), a free machine learning library for the [Python](https://www.python.org/) programming language.
It supports supervised and unsupervised learning and provides tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

<br>

## An introduction to machine learning with scikit-learn
https://scikit-learn.org/stable/tutorial/basic/tutorial.html

In [1]:
import sklearn.datasets as datasets

In [2]:
iris = datasets.load_iris()

In [3]:
digits = datasets.load_digits()

In [4]:
print(digits.data)

[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]


In [5]:
digits.target

array([0, 1, 2, ..., 8, 9, 8])

In [6]:
digits.images[0]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])

In [7]:
from sklearn import svm

In [8]:
clf = svm.SVC(gamma=0.001, C=100.)

In [9]:
clf.fit(digits.data[:-1], digits.target[:-1])

SVC(C=100.0, gamma=0.001)

In [10]:
clf.predict(digits.data[-1:])

array([8])
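
The cells above train on every image except the last and then predict that single image. As a rough sketch of a broader check, the same classifier can be scored on a held-out portion of the digits data. The 25% test size and `random_state=0` are assumptions chosen for illustration, not values from the tutorial.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out 25% of the images; random_state is fixed only for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Same hyperparameters as the classifier used above.
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)

# Fraction of held-out images classified correctly.
accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```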

<br>

## Different Types of Scikit-Learn Algorithms

***

The two main categories of machine learning algorithms are supervised learning and unsupervised learning.

### Supervised Learning
Supervised learning is where an algorithm is trained on labeled data: example input-output pairs in which each input comes with its known output.
From these examples the algorithm learns a mapping that it can then use to predict the output for new, unseen inputs.

<img src="https://cdn.datafloq.com/cms/2018/01/23/supervised-learning.png" alt="Supervised-Learning" width="600">

Click [here](https://datafloq.com/read/machine-learning-explained-understanding-learning/4478) for more information about the image.
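
A minimal sketch of the supervised workflow, using the iris data bundled with scikit-learn: the model is fitted on labeled input-output pairs and then asked to label a new input. The choice of `KNeighborsClassifier` with `n_neighbors=3` and the measurements in `new_flower` are assumptions made up for illustration.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target  # inputs and their known labels

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)  # learn from the labeled input-output pairs

# Predict the label of a new, unlabeled measurement (values are made up).
new_flower = [[5.0, 3.4, 1.5, 0.2]]
predicted = model.predict(new_flower)
print(iris.target_names[predicted[0]])
```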

### Unsupervised Learning
Unsupervised learning uses algorithms to analyze raw input data, without any pre-assigned labels. It works by discovering patterns and differences in the data set.

<img src="https://www.gong-jj.com/images/ml-unsup/unsup_header.png" alt="Unsupervised-Learning" width="600">

Click [here](https://www.gong-jj.com/ul/) for more information about the image.
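
As a small sketch of the unsupervised case, the snippet below clusters the iris measurements without ever looking at their labels. `KMeans` with three clusters is an assumption chosen for illustration; other clustering algorithms would serve equally well.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

# Only the raw measurements are used; the labels are deliberately ignored.
X = datasets.load_iris().data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Each row is assigned to one of the three discovered groups.
print(cluster_ids[:10])
print(kmeans.cluster_centers_.shape)
```

The cluster numbers are arbitrary identifiers; they are not guaranteed to line up with any pre-existing class labels.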

<br>

## Supervised Learning Algorithms

***

## Classification: Analysis on The Wine Quality Data Set 
![Wine Quality Data Set](https://archive.ics.uci.edu/ml/assets/MLimages/Large186.jpg)

[Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) available at the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
***

## Setup

In [11]:
# Numerical arrays.
import numpy as np

# Data frames.
import pandas as pd

# Plotting.
import matplotlib.pyplot as plt

# Logistic regression.
import sklearn.linear_model as lm

# K nearest neighbours.
import sklearn.neighbors as nei

# Helper functions.
import sklearn.model_selection as mod

# Fancier, statistical plots.
import seaborn as sns

In [12]:
# Standard plot size.
plt.rcParams['figure.figsize'] = (15, 10)

# Standard colour scheme.
plt.style.use('ggplot')

### The Red Wine Quality Dataset

In [13]:
# Load the Wine Quality data set from a URL.
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')

In [14]:
df.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5
6,7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,5
7,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7
8,7.8,0.58,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,7
9,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,5


In [15]:
# Have a look at the data.
df.shape

(1599, 12)

In [16]:
# Summary statistics.
df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [17]:
# Data Information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [18]:
# Feature Relationships
df.corr()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
fixed acidity,1.0,-0.256131,0.671703,0.114777,0.093705,-0.153794,-0.113181,0.668047,-0.682978,0.183006,-0.061668,0.124052
volatile acidity,-0.256131,1.0,-0.552496,0.001918,0.061298,-0.010504,0.07647,0.022026,0.234937,-0.260987,-0.202288,-0.390558
citric acid,0.671703,-0.552496,1.0,0.143577,0.203823,-0.060978,0.035533,0.364947,-0.541904,0.31277,0.109903,0.226373
residual sugar,0.114777,0.001918,0.143577,1.0,0.05561,0.187049,0.203028,0.355283,-0.085652,0.005527,0.042075,0.013732
chlorides,0.093705,0.061298,0.203823,0.05561,1.0,0.005562,0.0474,0.200632,-0.265026,0.37126,-0.221141,-0.128907
free sulfur dioxide,-0.153794,-0.010504,-0.060978,0.187049,0.005562,1.0,0.667666,-0.021946,0.070377,0.051658,-0.069408,-0.050656
total sulfur dioxide,-0.113181,0.07647,0.035533,0.203028,0.0474,0.667666,1.0,0.071269,-0.066495,0.042947,-0.205654,-0.1851
density,0.668047,0.022026,0.364947,0.355283,0.200632,-0.021946,0.071269,1.0,-0.341699,0.148506,-0.49618,-0.174919
pH,-0.682978,0.234937,-0.541904,-0.085652,-0.265026,0.070377,-0.066495,-0.341699,1.0,-0.196648,0.205633,-0.057731
sulphates,0.183006,-0.260987,0.31277,0.005527,0.37126,0.051658,0.042947,0.148506,-0.196648,1.0,0.093595,0.251397


In [19]:
# Assign Specific Labels to Wine Quality
bins = (0, 4, 7, 10)
labels = ["bad", "medium", "good"]
df['quality'] = pd.cut(x = df['quality'], bins = bins, labels = labels)

In [20]:
df['quality'].value_counts()

medium    1518
bad         63
good        18
Name: quality, dtype: int64
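
To make the binning explicit, here is a small sketch with made-up scores. By default `pd.cut` uses right-inclusive intervals, so the bins `(0, 4, 7, 10)` mean (0, 4], (4, 7] and (7, 10].

```python
import pandas as pd

# Made-up quality scores, chosen to sit on and around the bin edges.
scores = pd.Series([3, 4, 5, 7, 8])
bins = (0, 4, 7, 10)
labels = ["bad", "medium", "good"]

# Intervals are right-inclusive by default: (0, 4], (4, 7], (7, 10].
binned = pd.cut(x=scores, bins=bins, labels=labels)
print(binned.tolist())
```

A score of 4 therefore counts as "bad" and a score of 7 as "medium", which is why only wines scored 8 or above end up in the "good" class.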

### Visualise Data Set

In [None]:
# Scatter plots and kdes.
sns.pairplot(df, hue='quality', palette = "Set2");

### Two Dimensions Data Representation

In [None]:
# New figure.
fig, ax = plt.subplots()

# Scatter plot.
ax.plot(df['alcohol'], df['pH'], '.')

# Set axis labels.
ax.set_xlabel('Alcohol');
ax.set_ylabel('pH');

In [None]:
# Seaborn is great for creating complex plots with one command.
sns.lmplot(x='alcohol', y='pH', hue='quality', data=df, fit_reg=False, height=10, aspect=1.5, palette="Set2")

In [None]:
# Count Plot 
sns.countplot(x='quality',data=df, palette="Set2")

In [None]:
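# Heatmap visualisation of the feature correlations.
plt.figure(figsize=(12,12))
# 'quality' is categorical after binning, so restrict to numeric columns.
sns.heatmap(data=df.corr(numeric_only=True),annot=True,cmap="flare")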
plt.show()

In [None]:
# Segregate the data.
bad = df[df['quality'] == 'bad']
medium = df[df['quality'] == 'medium']
good = df[df['quality'] == 'good']

# New plot.
fig, ax = plt.subplots()

# Scatter plots.
ax.scatter(bad['alcohol'], bad['pH'], label='Bad')
ax.scatter(medium['alcohol'], medium['pH'], label='Medium')
ax.scatter(good['alcohol'], good['pH'], label='Good')


# Show the legend.
ax.set_xlabel('Alcohol')
ax.set_ylabel('pH')
ax.legend();

In [None]:
# How the segregation works.
df['quality'] == 'bad'

In [None]:
df[df['quality'] == 'bad'].head()

### Using groupby()

In [None]:
# New plot.
fig, ax = plt.subplots()

# Using pandas groupby().
for quality, data in df.groupby('quality'):
    ax.scatter(data['alcohol'], data['pH'], label=quality)

# Show the legend.
ax.set_xlabel('Alcohol')
ax.set_ylabel('pH')
ax.legend();

In [None]:
# Group by typically takes a categorical variable.
x = df.groupby('quality')
x

In [None]:
# Mean of each column per quality group (similar to a pivot table).
x.mean()

In [None]:
# Looping through groupby().
for i, j in x:
    print()
    print(f"i is: '{i}'")
    print(f"j is:\n{j[:3]}")
    print()

## Test and Train Split

In [None]:
# Split the data frame in two.
train, test = mod.train_test_split(df)

In [None]:
# Show some training data.
train.head()

In [None]:
# The indices of the train array.
train.index

In [None]:
# Show some testing data.
test.head()

In [None]:
test.index.size
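
Because the quality classes are heavily imbalanced, a plain random split can leave very few "bad" or "good" wines in the test set. The sketch below shows one possible alternative on a made-up data frame: passing `stratify` keeps the class proportions the same in both parts, and `random_state` makes the split repeatable. The toy data, the 25% test size and the seed are all assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced data standing in for the wine data frame (values are made up).
toy = pd.DataFrame({
    "alcohol": [9.0 + 0.1 * i for i in range(100)],
    "quality": ["medium"] * 80 + ["bad"] * 12 + ["good"] * 8,
})

# Stratify on the label so each part keeps the 80/12/8 class proportions.
train, test = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy["quality"]
)

print(len(train), len(test))
print(test["quality"].value_counts())
```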

## Two Dimensions: Test Train Split

In [None]:
# Segregate the data.
bad = df[df['quality'] == 'bad']
medium = df[df['quality'] == 'medium']
good = df[df['quality'] == 'good']

# New plot.
fig, ax = plt.subplots()

# Scatter plots.
ax.scatter(bad['alcohol'], bad['pH'], marker='o' , label='Bad')
ax.scatter(medium['alcohol'], medium['pH'], marker='o', label='Medium')
ax.scatter(good['alcohol'], good['pH'], marker='o', label='Good')

# Scatter plot for testing data.
ax.scatter(test['alcohol'], test['pH'], marker='x', label='Test data')

# Show the legend.
ax.set_xlabel('Alcohol')
ax.set_ylabel('pH')
ax.legend();

## Two Dimensions: Inputs and outputs

In [None]:
# Give the inputs and outputs convenient names.
inputs, outputs = train[['alcohol', 'pH']], train['quality']

In [None]:
# Peek at the inputs.
inputs.head()

In [None]:
# Peek at the outputs.
outputs.head()

## Two Dimensions: Logistic regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
***

In [None]:
# Create a new classifier.
lre = lm.LogisticRegression(random_state=0)

# Train the classifier on our data.
lre.fit(inputs, outputs)

In [None]:
# Ask the classifier to classify the test data.
predictions = lre.predict(test[['alcohol', 'pH']])

In [None]:
# Eyeball the misclassifications.
predictions == test['quality']

In [None]:
# What proportion were correct?
lre.score(test[['alcohol', 'pH']], test['quality'])
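
For a classifier, `score` reports accuracy, which hides where the mistakes fall. The sketch below uses made-up labels in place of `test['quality']` and `predictions` to show that accuracy is the mean of the element-wise comparison, and how a confusion matrix breaks the errors down by class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels, standing in for test['quality'] and the predictions.
actual = np.array(["bad", "medium", "medium", "good", "medium", "bad"])
predicted = np.array(["medium", "medium", "medium", "medium", "medium", "bad"])

# Accuracy is the fraction of predictions that match the actual labels.
manual_accuracy = (predicted == actual).mean()
print(manual_accuracy, accuracy_score(actual, predicted))

# Rows are actual classes, columns are predicted classes, in this order.
labels = ["bad", "medium", "good"]
cm = confusion_matrix(actual, predicted, labels=labels)
print(cm)
```

With imbalanced classes, a model that predicts "medium" almost everywhere can still score highly, so the per-class breakdown is worth checking.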

## Two Dimensions: Misclassified

In [None]:
# Add a column to the test data frame with the predictions.
# assign() returns a new frame, avoiding pandas' SettingWithCopyWarning.
test = test.assign(predicted=predictions)
test.head()

In [None]:
# Show the misclassified data.
misclass = test[test['predicted'] != test['quality']]
misclass

In [None]:
# Eyeball the descriptive statistics for each quality class.
train.groupby('quality').mean()

In [None]:
# New plot.
fig, ax = plt.subplots()

# Plot all the data, grouped by quality.
for quality, data in df.groupby('quality'):
    ax.scatter(data['alcohol'], data['pH'], label=quality)
    
# Plot misclassified.
ax.scatter(misclass['alcohol'], misclass['pH'], s=200, facecolor='none', edgecolor='r', label='Misclassified')

# Show the legend.
ax.set_xlabel('Alcohol')
ax.set_ylabel('pH')
ax.legend();

## Separating Bad Quality

From [Wikipedia](https://en.wikipedia.org/wiki/Logistic_regression):    $L = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2}$
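
The log-odds formula above can be checked numerically. The sketch below fits a two-input binary classifier on synthetic data (generated with `make_classification`, an assumption for illustration), computes the log-odds by hand from the fitted coefficients, and converts them to probabilities with the logistic function. It assumes the natural logarithm, that is, base $b = e$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-input, two-class data (not the wine data).
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
model = LogisticRegression(random_state=0).fit(X, y)

# Fitted intercept and coefficients.
beta_0 = model.intercept_[0]
beta_1, beta_2 = model.coef_[0]

# Log-odds: beta_0 + beta_1 * x_1 + beta_2 * x_2.
log_odds = beta_0 + beta_1 * X[:, 0] + beta_2 * X[:, 1]

# Invert the log-odds to recover p, the probability of the positive class.
p = 1.0 / (1.0 + np.exp(-log_odds))

print(np.allclose(p, model.predict_proba(X)[:, 1]))
```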


***

In [None]:
# Give the inputs and outputs convenient names.
inputs = train[['alcohol', 'pH']]

# Set 'medium' and 'good' to 'other'.
outputs = train['quality'].apply(lambda x: x if x == 'bad' else 'other')

# Eyeball outputs
outputs.unique()

In [None]:
# Create a new classifier.
lre = lm.LogisticRegression(random_state=0)

# Train the classifier on our data.
lre.fit(inputs, outputs)

In [None]:
actual = test['quality'].apply(lambda x: x if x == 'bad' else 'other')

# What proportion were correct?
lre.score(test[['alcohol', 'pH']], actual)

## Using All Possible Inputs
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
***

In [None]:
# Load the Wine Quality data set from a URL.
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')

In [None]:
# Split the data frame in two.
train, test = mod.train_test_split(df)

In [None]:
# Use all eleven possible inputs.
inputs, outputs = train[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']], train['quality']
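
The long list of column names is repeated in several cells. One possible way to avoid retyping it, sketched here on a made-up three-column frame, is to derive the feature names from the data frame itself by dropping the target column. The name `feature_names` is hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the wine data frame.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.35],
    "quality": [5, 5, 6],
})

# Every column except the target is treated as an input.
feature_names = toy.columns.drop("quality").tolist()
inputs, outputs = toy[feature_names], toy["quality"]

print(feature_names)
```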

In [None]:
# Create a new classifier.
lre = lm.LogisticRegression(random_state=0)

# Train the classifier on our data.
lre.fit(inputs, outputs)

In [None]:
# Ask the classifier to classify the test data.
predictions = lre.predict(test[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']])
predictions

In [None]:
# Eyeball the misclassifications.
(predictions == test['quality']).value_counts()

In [None]:
# What proportion were correct?
lre.score(test[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']], test['quality'])
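
With inputs on very different scales, logistic regression may hit its default iteration limit before converging. A common remedy, sketched here under assumptions, is to standardise the inputs in a pipeline. The example uses the `load_wine` data bundled with scikit-learn as a stand-in (it is a different data set from the UCI Wine Quality file), and `max_iter=1000` is an arbitrary choice.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled wine data, used only as an offline stand-in.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_lre = make_pipeline(
    StandardScaler(), LogisticRegression(random_state=0, max_iter=1000)
)
scaled_lre.fit(X_train, y_train)

scaled_score = scaled_lre.score(X_test, y_test)
print(f"Accuracy with scaling: {scaled_score:.3f}")
```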

## $k$ Nearest Neighbours Classifier
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
***

In [None]:
# Load the Wine Quality data set from a URL.
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')

In [None]:
# Split the data frame in two.
train, test = mod.train_test_split(df)

In [None]:
# Use all eleven possible inputs.
inputs, outputs = train[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']], train['quality']

In [None]:
# Classifier.
knn = nei.KNeighborsClassifier()

In [None]:
# Fit.
knn.fit(inputs, outputs)

In [None]:
# Test.
knn.score(test[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']],  test['quality'])

In [None]:
# Predict with the k nearest neighbours classifier.
predictions = knn.predict(test[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']])
(predictions == test['quality']).value_counts()

In [None]:
# The score is just the accuracy in this case.
(predictions == test['quality']).value_counts(normalize=True)
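
Nearest neighbours classification relies on distances, so it is sensitive both to feature scale and to the number of neighbours. The sketch below compares a few values of `n_neighbors` on standardised inputs. It uses the bundled `load_wine` data as a stand-in for the Wine Quality file, and the candidate values 1, 5 and 15 are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled wine data, used only as an offline stand-in.
X, y = load_wine(return_X_y=True)

mean_scores = {}
for k in (1, 5, 15):
    # Standardise first so no single feature dominates the distance.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()
    print(f"k={k}: mean accuracy {mean_scores[k]:.3f}")
```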

## Cross validation
https://scikit-learn.org/stable/modules/cross_validation.html
***

In [None]:
knn = nei.KNeighborsClassifier()
scores = mod.cross_val_score(knn, df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']], df['quality'])
scores

In [None]:
print(f"Mean: {scores.mean()} \t Standard Deviation: {scores.std()}")

In [None]:
lre = lm.LogisticRegression(random_state=0)
scores = mod.cross_val_score(lre, df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']], df['quality'])
scores

In [None]:
print(f"Mean: {scores.mean()} \t Standard Deviation: {scores.std()}")
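
When `cross_val_score` is given a classifier and no `cv` argument, it uses five stratified folds without shuffling. As a sketch of how to control this, the folds can be passed explicitly. The example uses the bundled iris data, and the shuffle seed is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Five stratified folds, shuffled with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)

print(scores)
print(f"Mean: {scores.mean():.3f} \t Standard Deviation: {scores.std():.3f}")
```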

## Basic Example of a Scikit-Learn Algorithm

***

## References
1. DataQuest "Scikit-learn Tutorial: Machine Learning in Python" https://www.dataquest.io/blog/sci-kit-learn-tutorial/

***

## End