<img src="./Images/scikit-learn-cover.png" width="500" height="500">

# Scikit-learn

***

## Overview of scikit-learn
Scikit-learn is a free, open-source machine learning library for the Python programming language. It is built on NumPy, SciPy and Matplotlib. It provides many supervised and unsupervised learning algorithms and is focused on modeling data.


## Why it's used
Scikit-learn is one of the most widely used and comprehensive machine learning libraries in Python. It provides efficient tools for classification, regression, clustering, model selection and preprocessing.


## Contents of notebook
...


### Algorithms
For this module I have been tasked with demonstrating at least three scikit-learn algorithms. The algorithms and reference examples used are:
- [K-means Algorithm](https://scikit-learn.org/stable/modules/clustering.html#k-means)
- [Decision Tree Regression](https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html#sphx-glr-auto-examples-tree-plot-tree-regression-py)
- [Examples](https://scikit-learn.org/stable/auto_examples/index.html#cluster-examples)


# K-means
***
The K-means algorithm is an unsupervised machine learning algorithm for clustering. A cluster refers to a collection of data points grouped together because of certain similarities. The algorithm assumes that the number of clusters is defined in advance. It divides a set of ***N*** samples ***X*** into ***K*** disjoint clusters ***C***, each described by the mean ***μ*** of the samples in the cluster. These means are commonly called the cluster "centroids". Centroids are initialized by shuffling the dataset and randomly selecting ***K*** data points. The algorithm then assigns data points to clusters such that the sum of the squared distances between the data points and their centroids is at a minimum.

The algorithm is very popular and is used in a wide variety of applications such as market segmentation, data clustering, image segmentation and image compression. It is an easy algorithm to understand and implement, especially with help from the [scikit-learn library](https://scikit-learn.org/stable/modules/clustering.html#k-means).

### How it works
The K-means algorithm works as follows:

- **1.** Define the number of clusters *K*
- **2.** Initialize the centroids by randomly selecting *K* data points, one for each cluster
- **3.** Compute the squared distance between every data point and every centroid
     - **3.1** Assign each data point to the closest centroid
     - **3.2** Create new centroids by taking the mean value of all the data points assigned to each previous centroid
- **4.** Repeat **step 3** until the centroids no longer change significantly

The K-means algorithm follows the **Expectation-Maximization** approach: the Expectation step assigns each data point to its closest centroid, and the Maximization step recomputes the centroid of each cluster.
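
The steps above can be sketched in plain NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters (`n_iter`, `tol`, `seed`) and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal Lloyd-style k-means; illustrative only, not scikit-learn's code."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3.1 (Expectation): squared distance from every point to every
        # centroid, then assign each point to its closest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3.2 (Maximization): move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids have effectively stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment and the sum of squared distances to the closest centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists[np.arange(len(X)), labels].sum()
    return centroids, labels, inertia


# Two well-separated blobs of 50 points each (synthetic data for illustration).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])
centroids, labels, inertia = kmeans_sketch(X, k=2)
print(centroids)
```

With blobs this far apart, each blob should end up in its own cluster and the two centroids should sit near (0, 0) and (5, 5).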






## K-Means example: Color Quantization
Below I will demonstrate an example of using the K-means algorithm to perform a pixel-wise Vector Quantization of an image of a flower. This reduces the number of colours required to display the image from 96,615 unique colours to 64, while preserving the overall quality of the image. Every pixel is a 3-dimensional vector with Red, Green and Blue components. The image itself is 427 pixels by 640 pixels, so the total number of vectors is 273,280. The algorithm is run on these colour vectors with 64 clusters specified. The result shows the image reduced to only 64 colours: some information is lost, but the overall appearance of the photo is preserved.

For comparison, a quantized image using a random selection of colours is also shown.
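
The pixel and colour counts quoted above can be checked with a few lines of NumPy. The sketch below uses a tiny synthetic image rather than the flower photo. The helper name `count_unique_colours` and the array `tiny` are assumptions made for illustration.

```python
import numpy as np


def count_unique_colours(image):
    """Count distinct colour vectors in a (height, width, channels) array."""
    pixels = image.reshape(-1, image.shape[-1])  # one row per pixel
    return len(np.unique(pixels, axis=0))


# Tiny synthetic 2x3 RGB image (hypothetical data): 6 pixels, 3 distinct colours.
tiny = np.array([
    [[255, 0, 0], [255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [0, 255, 0], [255, 0, 0]],
], dtype=np.uint8)

n_pixels = tiny.shape[0] * tiny.shape[1]
n_unique = count_unique_colours(tiny)
print(n_pixels, n_unique)
```

Applied to the loaded image, the same function would report how many distinct colours the original photo contains before quantization.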



In [2]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.datasets import load_sample_image
from sklearn.utils import shuffle
from time import time


In [None]:
# Defined number of clusters
n_colors = 64

In [3]:
# Load the flower photo
flower_img = load_sample_image('flower.jpg')

In [5]:
flower_img.dtype 

dtype('uint8')

In [None]:
# Convert to floats instead of the default 8-bit integer coding. Dividing by
# 255 is important so that plt.imshow works well on float data (values need to
# be in the range [0, 1])
flower_img = np.array(flower_img, dtype=np.float64) / 255

In [None]:
# Reshape the image into a 2D numpy array: one row per pixel, one column per colour channel.
w, h, d = original_shape = tuple(flower_img.shape)
assert d == 3
image_array = np.reshape(flower_img, (w * h, d))

In [None]:
# Fit K-means on a random sub-sample of 1,000 pixels rather than the full image, to keep the fit fast
print("Fitting model on a small sub-sample of the data")
t0 = time()
image_array_sample = shuffle(image_array, random_state=0, n_samples=1_000)
kmeans = KMeans(n_clusters=n_colors, random_state=0).fit(image_array_sample)
print(f"done in {time() - t0:0.3f}s.")

In [None]:
# Get labels for all points
# The predict method assigns every pixel to its closest centroid; the centroids themselves are not refitted
print("Predicting color indices on the full image (k-means)")
t0 = time()
labels = kmeans.predict(image_array)
print(f"done in {time() - t0:0.3f}s.")

In [None]:
# For comparison, build a codebook from n_colors randomly chosen pixels
codebook_random = shuffle(image_array, random_state=0, n_samples=n_colors)
print("Predicting color indices on the full image (random)")
t0 = time()
labels_random = pairwise_distances_argmin(codebook_random, image_array, axis=0)
print(f"done in {time() - t0:0.3f}s.")

In [None]:
def recreate_image(codebook, labels, w, h):
    """Recreate the (compressed) image from the code book & labels"""
    return codebook[labels].reshape(w, h, -1)

In [None]:
# Display all results, alongside original image
plt.figure(1)
plt.clf()
plt.axis("off")
plt.title("Original image (96,615 colors)")
plt.imshow(flower_img)

plt.figure(2)
plt.clf()
plt.axis("off")
plt.title(f"Quantized image ({n_colors} colors, K-Means)")
plt.imshow(recreate_image(kmeans.cluster_centers_, labels, w, h))

plt.figure(3)
plt.clf()
plt.axis("off")
plt.title(f"Quantized image ({n_colors} colors, Random)")
plt.imshow(recreate_image(codebook_random, labels_random, w, h))
plt.show()
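
The visual comparison can be complemented with a number. One simple option, sketched below, is the mean squared error between the original pixels and their quantized colours. The helper names `quantization_mse` and `nearest`, and the toy grey-level data, are assumptions for illustration and not part of the scikit-learn example.

```python
import numpy as np


def quantization_mse(pixels, codebook, labels):
    """Mean squared error between original pixels and their codebook colours."""
    reconstructed = codebook[labels]
    return float(np.mean((pixels - reconstructed) ** 2))


def nearest(pixels, codebook):
    """Index of the closest codebook entry for each pixel."""
    d = ((pixels[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


# Hypothetical toy data: four grey-level "pixels" and two candidate codebooks.
pixels = np.array([[0.0], [0.1], [0.9], [1.0]])
good_codebook = np.array([[0.05], [0.95]])  # one entry near each group
poor_codebook = np.array([[0.0], [0.1]])    # both entries in the dark group

mse_good = quantization_mse(pixels, good_codebook, nearest(pixels, good_codebook))
mse_poor = quantization_mse(pixels, poor_codebook, nearest(pixels, poor_codebook))
print(mse_good, mse_poor)
```

In this notebook the same function could be called as `quantization_mse(image_array, kmeans.cluster_centers_, labels)` and `quantization_mse(image_array, codebook_random, labels_random)` to compare the two codebooks.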

# Decision Tree Regression
***

[Breast Cancer dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Standard plot size.
plt.rcParams['figure.figsize'] = (15, 10)

In [None]:
# Standard colour scheme.
# Note: on Matplotlib 3.6 and later this style is named 'seaborn-v0_8'.
plt.style.use('seaborn')

In [None]:
# Reading in data
df = pd.read_csv("breast-cancer-wisconsin.csv")

In [None]:
# Displaying data
df

###  Attribute Domain
***
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:

	a) radius (mean of distances from center to points on the perimeter)
	b) texture (standard deviation of gray-scale values)
	c) perimeter
	d) area
	e) smoothness (local variation in radius lengths)
	f) compactness (perimeter^2 / area - 1.0)
	g) concavity (severity of concave portions of the contour)
	h) concave points (number of concave portions of the contour)
	i) symmetry 
	j) fractal dimension ("coastline approximation" - 1)

Note: for each of these ten features, the mean, standard error and worst value are recorded, giving 30 feature columns in total.

In [None]:
# Checking for Missing Values
df.isnull().sum()

In [None]:
# Drop every column that contains missing values (this removes the empty column)
df = df.dropna(axis=1)

In [None]:
# Count the number of rows and columns in the data set
df.shape

In [None]:
# Checking the data type of each column
df.dtypes

In [None]:
# Replace the Diagnosis values with 1 = Malignant and 0 = Benign
# (assigning the result back avoids relying on an in-place change to a column view)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

In [None]:
# New dataset with diagnosis values replaced and missing values dropped
df

In [None]:
df.describe()

# Data Visualisation
***

In [None]:
# Distribution of benign (0) and malignant (1) breast tumours
sns.countplot(x='diagnosis', data=df)

In [None]:
# Pairwise plots of the diagnosis and the ten mean-value features, coloured by diagnosis
sns.pairplot(df.iloc[:,1:12], hue='diagnosis')

In [None]:
# Using a heatmap to show the pairwise correlations between the diagnosis and the ten mean-value features
f, ax = plt.subplots(figsize = (25, 10))
sns.heatmap(df.iloc[:,1:12].corr(), annot = True, fmt= '.0%', linewidths = 2)

## Train-Test Split
***

In [None]:
# Splitting the data
# X holds the feature columns (attributes 3-32 in the list above) used to predict the diagnosis
X = df.iloc[:,2:32].values
# y holds the diagnosis: 1 = malignant, 0 = benign
y = df.iloc[:,1].values

In [None]:
# Splitting the data into Train and Test
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size =0.2, random_state = 0)

##  Decision Tree Regression

In [None]:
# Importing our model
from sklearn.tree import DecisionTreeRegressor
# Testing regression with different depths
regr_1 = DecisionTreeRegressor(random_state = 0, max_depth = 1)
regr_2 = DecisionTreeRegressor(random_state = 0, max_depth = 2)
regr_3 = DecisionTreeRegressor(random_state = 0, max_depth = 5)

In [None]:
# Fit each regression model to the training data only, so the test set stays unseen
regr_1.fit(X_train, Y_train)
regr_2.fit(X_train, Y_train)
regr_3.fit(X_train, Y_train)

In [None]:
# Predict on the held-out test data
pred_1 = regr_1.predict(X_test)
pred_2 = regr_2.predict(X_test)
pred_3 = regr_3.predict(X_test)

In [None]:
# Plotting prediction and original data
plt.subplots(figsize = (25, 5))

plt.plot(Y_test, label = 'data', linewidth = 5, color = 'black')
plt.plot(pred_1, label = 'prediction1', linewidth = 1.5, color = 'red')
plt.plot(pred_2, label = 'prediction2', linewidth = 1.5, color = 'yellowgreen')
plt.plot(pred_3, label = 'prediction3', linewidth = .5, color = 'cyan')
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

In [None]:
# Showing the R^2 score of each model (score() on a regressor returns R^2, not classification accuracy)
print('Decision Tree Regression R^2 score with depth of 1:', regr_1.score(X_test, Y_test))
print('Decision Tree Regression R^2 score with depth of 2:', regr_2.score(X_test, Y_test))
print('Decision Tree Regression R^2 score with depth of 5:', regr_3.score(X_test, Y_test))
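
The `score` method of a scikit-learn regressor returns the coefficient of determination (R^2), which differs from classification accuracy. The sketch below shows how the two could be computed by hand. The helper names `r2_score_sketch` and `thresholded_accuracy`, the 0.5 threshold and the toy numbers are assumptions for illustration only.

```python
import numpy as np


def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


def thresholded_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of samples whose prediction, cut at `threshold`, matches the 0/1 label."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_pred) >= threshold).astype(int)
    return float(np.mean(y_hat == y_true))


# Hypothetical labels and regression outputs, for illustration only.
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.2, 0.6, 0.4])

r2 = r2_score_sketch(y_true, y_pred)
acc = thresholded_accuracy(y_true, y_pred)
print(r2, acc)
```

In this notebook, `thresholded_accuracy(Y_test, pred_1)` would give a figure comparable to classification accuracy, under the assumption that a prediction of 0.5 or above counts as malignant.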

***
[car data set](https://archive.ics.uci.edu/ml/datasets/Car+Evaluation)