
# Machine Learning Methods Overview


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Supervised Learning

>Supervised learning algorithms are a class of machine learning algorithms that
learn from previously labeled data, so that they can predict the labels (a class or a numeric value) of similar but unlabeled data. Let's use an example to understand this concept better.

### Preprocessing Real Estate Data

>Let's first read in a csv file using the `pandas.read_csv` function.

- `read_csv` : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [None]:
df = pd.read_csv("https://ist691.s3.amazonaws.com/real-estate.csv")
df.head()

> Dropping irrelevant columns: the `No` column is simply the row number, which is unique for each row, and `X1 transaction date` is not needed.

- `drop` : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html


In [None]:
# drop takes the column label(s) to remove as its first argument, plus other optional arguments
# axis = 1 means dropping columns
# inplace = True means modifying the existing dataframe instead of returning a copy
df.drop('No', axis = 1, inplace = True)
df.drop('X1 transaction date', axis = 1, inplace = True)
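
> As a minimal sketch, the two `drop` calls above can also be written as a single call that returns a new dataframe. The toy frame below is hypothetical; it only mimics the column layout of the real-estate file.

```python
import pandas as pd

# hypothetical toy frame that mimics the column layout of the real-estate file
toy = pd.DataFrame({
    "No": [1, 2, 3],
    "X1 transaction date": [2012.917, 2012.917, 2013.583],
    "X2 house age": [32.0, 19.5, 13.3],
    "Y house price of unit area": [37.9, 42.2, 47.3],
})

# drop both columns in one call; without inplace = True a new frame is returned
# and the original frame is left untouched
trimmed = toy.drop(columns=["No", "X1 transaction date"])

print(trimmed.columns.tolist())
print(toy.shape, trimmed.shape)
```

> Returning a new frame keeps the raw data available, which can be convenient when a cell is re-run.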

In [None]:
df.head()

> Using the `train_test_split` function in sklearn's `model_selection` module to divide our data into two sets, one for training and the other for testing.

- `train_test_split` : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

- `iloc` : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

In [None]:
# for splitting the data into train and test
from sklearn.model_selection import train_test_split
# iloc selects rows and columns by integer position
# the first indexer is for the rows and the second is for the columns
# X is every column except the last (the features); Y is the last column (the target)
X = df.iloc[:,0:-1]
Y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.05, random_state = 0)
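
> The sketch below shows what the `iloc` slicing and a 5% test split produce on a hypothetical 100-row frame. The column names and random values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical frame: 100 rows, three feature columns and the target in the last column
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "target"])

features = toy.iloc[:, 0:-1]   # all rows, every column except the last
target = toy.iloc[:, -1]       # all rows, last column only

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.05, random_state=0
)

print(f_train.shape, f_test.shape)
```

> With only 5% of the rows held out, the test metric rests on very few examples, so it can change noticeably with a different `random_state`.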

### Linear Regression

>For linear regression, we will use the real estate data to predict house prices. The raw data contains 6 features: transaction date, age of the house, distance to the nearest MRT station, number of convenience stores, latitude, and longitude. Because the transaction date was dropped during preprocessing, the model is trained on the remaining 5.

- `LinearRegression` : https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [None]:
# for the linear regression model
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
# the model.fit function trains the model with the training set passed
reg.fit(X_train, y_train)

>Predicting house prices on the test data with the trained model, then printing the coefficients of the linear regression model.

In [None]:
# model.predict returns the predicted output for the data passed as argument
lr_pred = reg.predict(X_test)
# pair each feature name with its fitted coefficient (a dict keeps the column order)
print(dict(zip(reg.feature_names_in_, reg.coef_)))

>In linear regression, the goal is to minimize a cost function. A popular cost function is the **Mean Squared Error (MSE)**: for each example we take the square of the difference between the expected value and the predicted value, and the average of these squared differences over all the input examples is the cost.

- `mean_squared_error` - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html

In [None]:
# importing the mean_squared_error function
from sklearn.metrics import mean_squared_error
# the documented argument order is (y_true, y_pred)
mean_squared_error(y_test, lr_pred)
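
> To make the definition concrete, here is a small sketch that computes the MSE by hand with NumPy and compares it with `mean_squared_error`. The four values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# made-up expected and predicted values
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# square each difference, then average over all examples
manual_mse = np.mean((y_true - y_hat) ** 2)
library_mse = mean_squared_error(y_true, y_hat)

print(manual_mse, library_mse)
```

> The squared differences are 0.25, 0, 4 and 1, so both values come out to 5.25 / 4 = 1.3125.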

In [None]:
fig, ax = plt.subplots()
ax.plot([0,1],[0,1], transform = ax.transAxes)

plt.scatter(lr_pred, y_test)
plt.xlabel('Predicted')
plt.ylabel('Observed')
plt.title('Predicting House Prices with Linear Regression')

plt.show()

### Preprocessing the Iris Dataset

This is perhaps the best known data set to be found in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Predicted attribute: class of iris plant.

Attribute Information:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class: Iris Setosa, Iris Versicolour, Iris Virginica

- `plot_iris_dataset` : https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

- `StandardScaler` : https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [None]:
# loading the iris dataset from sklearn dataset class
from sklearn import datasets
iris = datasets.load_iris()

In [None]:
# using train test split to divide the iris dataset into training and testing group
X_train, X_test, y_train, y_test = train_test_split(iris.data,
                                                    iris.target,
                                                    test_size = .3,
                                                    random_state = 0)

print('Training set has {} samples and testing set has {} samples.'\
      .format(X_train.shape[0], X_test.shape[0]))

In [None]:
# using sklearn StandardScaler to standardize the features of the training and test sets
# it rescales each feature to zero mean and unit variance, using statistics learned from the training set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)

# using StandardScaler transform method to transform data
# more details can be found in the documentation from the link above
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
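
> As a sketch of what `StandardScaler` does, the block below reproduces its output by subtracting the training mean and dividing by the training standard deviation. The random arrays are stand-ins, not the iris data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# stand-in data: 50 training rows and 10 test rows with 3 features
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

scaler = StandardScaler().fit(train)   # learns the mean and std of each column
train_std = scaler.transform(train)
test_std = scaler.transform(test)      # reuses the training statistics

# the same transformation written out by hand (population std, ddof = 0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_std, manual))
print(train_std.mean(axis=0).round(6), train_std.std(axis=0).round(6))
```

> The test set is transformed with the training statistics, so its columns will generally not have exactly zero mean or unit variance.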

### Logistic Regression

- `LogisticRegression` : https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state = 0)\
  .fit(X_train_std, y_train)

In [None]:
# test set predictions using LogisticRegression model
clf.predict(X_test_std)

In [None]:
# actual values
y_test

In [None]:
# compare predictions to actual values for accuracy of predictions
clf.score(X_test_std, y_test)
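
> For a classifier, `score` returns the mean accuracy. The self-contained sketch below repeats the iris steps with its own variable names and checks that `score` equals the fraction of matching predictions; `predict_proba` is shown as well to expose the class probabilities behind each prediction.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)

model = LogisticRegression(random_state=0).fit(X_tr_std, y_tr)
pred = model.predict(X_te_std)

# accuracy is the fraction of test samples whose prediction matches the label
manual_accuracy = np.mean(pred == y_te)

# one row per sample, one column per class; each row sums to 1
probs = model.predict_proba(X_te_std)

print(manual_accuracy, model.score(X_te_std, y_te))
print(probs[:3].round(3))
```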

### Support Vector Machine

- `SVM` : https://scikit-learn.org/stable/modules/classes.html?highlight=svm#module-sklearn.svm

In [None]:
from sklearn.svm import SVC

# using a support vector classifier (SVC) with a radial basis function kernel
svm = SVC(kernel = 'rbf', random_state = 0, gamma = .10, C = 1.0)
svm.fit(X_train_std, y_train)

print('The accuracy of the svm classifier on training data is {:.2f} out of 1'\
      .format(svm.score(X_train_std, y_train)))
print('The accuracy of the svm classifier on test data is {:.2f} out of 1'\
      .format(svm.score(X_test_std, y_test)))
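
> The radial basis function kernel measures similarity as `exp(-gamma * ||a - b||^2)`. The sketch below evaluates it by hand for two made-up points with `gamma = 0.10` and compares the result with sklearn's `rbf_kernel` helper.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

gamma = 0.10

# two made-up points with four features each
a = np.array([[0.0, 0.0, 0.0, 0.0]])
b = np.array([[1.0, 2.0, 0.0, 1.0]])

# squared Euclidean distance between the points: 1 + 4 + 0 + 1 = 6
squared_distance = np.sum((a - b) ** 2)

manual_similarity = np.exp(-gamma * squared_distance)
library_similarity = rbf_kernel(a, b, gamma=gamma)[0, 0]

print(manual_similarity, library_similarity)
```

> A larger `gamma` makes the similarity fall off faster with distance, so each training point influences a smaller neighborhood.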

### Decision Tree

- `DecisionTreeClassifier` : https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decisiontreeclassifier#sklearn.tree.DecisionTreeClassifier

- `plot_tree` : https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html?highlight=plot_tree#sklearn.tree.plot_tree

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

In [None]:
dt = DecisionTreeClassifier().fit(iris.data, iris.target)

In [None]:
plt.figure(figsize = (16, 8))
# plot the decision tree, showing the split threshold and the Gini impurity at each node
plot_tree(dt, filled = True)
# display the tree plot figure
plt.show()
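
> Each node of the plotted tree reports a Gini impurity. As a minimal sketch, the helper below (`gini_impurity` is a name introduced here, not part of sklearn) computes it as one minus the sum of squared class proportions and compares the root value with the one stored in a fitted tree.

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of samples in each class."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# the root of the iris tree holds 50 samples of each of the 3 classes
root_gini = gini_impurity([50, 50, 50])

# a node that holds a single class is pure
pure_gini = gini_impurity([50, 0, 0])

iris = datasets.load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# tree_.impurity stores the impurity of every node; index 0 is the root
print(root_gini, tree.tree_.impurity[0], pure_gini)
```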

### Preprocessing Cancer Data

In [None]:
# using the pandas library to read a csv file
df = pd.read_csv("https://ist691.s3.amazonaws.com/cancer.csv")
df.head()

In [None]:
df.drop(['id'], axis = 1, inplace = True)
df.drop(['Unnamed: 32'], axis = 1, inplace = True)

In [None]:
df['diagnosis'] = [1 if i == 'M' else 0 for i in df.diagnosis]
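
> The same encoding can be sketched with `Series.map`. This assumes the column only contains the labels `'M'` and `'B'`; the `'B'` label and the toy frame are assumptions made for illustration.

```python
import pandas as pd

# toy column; 'B' as the second label is an assumption
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# map each label to its numeric code
toy["diagnosis_encoded"] = toy["diagnosis"].map({"M": 1, "B": 0})

print(toy)
```

> Unlike the list comprehension, which sends every value other than `'M'` to 0, `map` yields `NaN` for labels missing from the dictionary, which makes unexpected values easier to spot.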

In [None]:
df.head(3)

In [None]:
x = df.drop(['diagnosis'], axis = 1)
y = df.diagnosis.values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = .3, random_state = 0)
print('Training set has {} samples and testing set has {} samples.'\
      .format(X_train.shape[0], X_test.shape[0]))

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

### Gaussian Naive Bayes

 - `GaussianNB` : https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train_std, y_train)

In [None]:
print('Naive Bayes score: ',nb.score(X_test_std, y_test))
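
> A single accuracy number hides which class the errors fall in. The sketch below builds a confusion matrix for a Gaussian Naive Bayes model. To stay self-contained it uses sklearn's bundled `load_breast_cancer` data as a stand-in for the csv file; note that this copy encodes malignant as 0 and benign as 1, the reverse of the encoding used above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# stand-in data, not the csv file used in this notebook
data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)

scaler = StandardScaler().fit(X_tr)
X_te_std = scaler.transform(X_te)
model = GaussianNB().fit(scaler.transform(X_tr), y_tr)
pred = model.predict(X_te_std)

# rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_te, pred)

# correct predictions sit on the diagonal
accuracy_from_cm = cm.trace() / cm.sum()

print(cm)
print(accuracy_from_cm, model.score(X_te_std, y_te))
```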

# Unsupervised Learning

## K-means Clustering

> The K-means clustering algorithm is an example of an iterative algorithm that tries to partition the dataset into K non-overlapping groups, or clusters.

 - `KMeans` : https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
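
> The iteration can be sketched in a few lines of NumPy. This is a simplified version of the classic assign-then-update loop under stated assumptions (random initial centers, a fixed iteration cap), not sklearn's implementation; `lloyd_kmeans` and the two synthetic blobs are names and data introduced for illustration.

```python
import numpy as np

def assign(points, centers):
    """Return the index of the nearest center for every point."""
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centers)
        # move each center to the mean of its points; keep it if the cluster is empty
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # final assignment so that labels match the returned centers
    return centers, assign(points, centers)

# two well-separated synthetic blobs of 30 points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(30, 2))
points = np.vstack([blob_a, blob_b])

centers, labels = lloyd_kmeans(points, k=2)
print(centers.round(2))
```

> Every point ends up in exactly one cluster, which is what makes the groups non-overlapping.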


In [None]:
from sklearn.cluster import KMeans
# for splitting the data into train and test
from sklearn.model_selection import train_test_split

In [None]:
# load the iris dataset from sklearn dataset class
from sklearn import datasets
iris = datasets.load_iris()

In [None]:
# using train test split to divide the iris dataset into training and testing group
X_train, X_test, y_train, y_test = train_test_split(iris.data,
                                                    iris.target,
                                                    test_size = .3,
                                                    random_state = 0)
print('Training set has {} samples and testing set has {} samples.'\
      .format(X_train.shape[0], X_test.shape[0]))

In [None]:
# specify the number of clusters
k = 3

# initialize a KMeans object and fit it with train data
# n_init controls how many times the algorithm runs with different initial centroids; the best run is kept
kmeans = KMeans(n_clusters = k, random_state = 0, n_init = 25).fit(X_train)

In [None]:
# make "predictions" on test data using the trained kmeans model
y_pred = kmeans.predict(X_test)

In [None]:
# get the centers of the 3 clusters for each feature in iris
pd.DataFrame(data = kmeans.cluster_centers_,
             columns = iris.feature_names)

In [None]:
# scatter the test points assigned to cluster 0, using the first two features
# note: cluster ids are arbitrary and need not match the species encoding
plt.scatter(X_test[y_pred == 0, 0], X_test[y_pred == 0, 1], s = 100,
            c = 'purple', label = 'Cluster 0')

# scatter the test points assigned to cluster 1
plt.scatter(X_test[y_pred == 1, 0], X_test[y_pred == 1, 1], s = 100,
            c = 'orange', label = 'Cluster 1')

# scatter the test points assigned to cluster 2
plt.scatter(X_test[y_pred == 2, 0], X_test[y_pred == 2, 1], s = 100,
            c = 'green', label = 'Cluster 2')

# plot the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:,1], s = 100,
            c = 'red', label = 'Centroids')

plt.legend()
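
> K-means never sees the species labels, so the cluster ids it returns are arbitrary. The self-contained sketch below cross-tabulates cluster ids against the true species and uses the adjusted Rand index, which does not depend on how the clusters are numbered, to measure agreement.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

model = KMeans(n_clusters=3, random_state=0, n_init=25).fit(X_tr)
clusters = model.predict(X_te)

# rows are the true species codes, columns are the cluster ids
table = pd.crosstab(y_te, clusters, rownames=["true species"], colnames=["cluster id"])
print(table)

# agreement between the clustering and the true labels
ari = adjusted_rand_score(y_te, clusters)

# renumbering the clusters leaves the score unchanged
relabeled = (clusters + 1) % 3
ari_relabeled = adjusted_rand_score(y_te, relabeled)
print(ari, ari_relabeled)
```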

## Clustering for Image Segmentation

> Now we will use K-means clustering for image segmentation, the task of partitioning an image into multiple segments. We will use a simple variation called color segmentation, which assigns pixels to the same segment if they have a similar color.

> We will use `matplotlib.image.imread()` to load an image. The image is loaded as a 3D array of shape (height, width, channels), with 3 channels for RGB or 4 channels for RGB with alpha.

> The image we load is a 720x1280 pixel RGB image, so the shape of the array is (720, 1280, 3).

In [None]:
from matplotlib.image import imread
%matplotlib inline

In [None]:
%%bash
if [[ ! -f ./colored-houses.jpg ]]; then
    wget https://ist691.s3.amazonaws.com/images/colored-houses.jpg
fi

In [None]:
image = imread('colored-houses.jpg')
image.shape

In [None]:
plt.imshow(image)

In [None]:
image.shape

>We will reshape the image array as a long list of RGB colors using `reshape()`.

In [None]:
x = image.reshape(-1,3)
x.shape
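
> A tiny made-up array shows what `reshape(-1, 3)` does: every pixel becomes one row of three channel values, and the original layout can be restored afterwards.

```python
import numpy as np

# made-up 2x4 "image" with 3 channels
tiny = np.arange(2 * 4 * 3, dtype=np.uint8).reshape(2, 4, 3)

# -1 lets NumPy infer the number of rows: 2 * 4 = 8 pixels
pixels = tiny.reshape(-1, 3)
print(pixels.shape)

# pixels are stored row by row, so row 5 is the pixel at image row 1, column 1
print(pixels[5], tiny[1, 1])

# reshaping back recovers the original array
restored = pixels.reshape(tiny.shape)
print(np.array_equal(restored, tiny))
```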

>After reshaping the image, we will fit it using `KMeans` for color segmentation. Here the value of K in `KMeans` decides the number of colors in the output image. We will try 4 different variations, with 3, 4, 5 and 8 clusters.

>The algorithm tends to make K clusters of similar sizes. For example, it may group all shades of green into one cluster and compute their mean color. It then replaces every shade in that cluster with the mean color.

In [None]:
# initialize a KMeans object with 3 clusters and fit it on the reshaped image
kmeans = KMeans(n_clusters = 3).fit(x)

# change the value of all the pixels with their cluster center value
segmented_img_3 = kmeans.cluster_centers_[kmeans.labels_]

# reshape the segmented image to the original shape
segmented_img_3 = segmented_img_3.reshape(image.shape)
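
> The same three steps can be checked on a small synthetic image where the answer is known in advance. The sketch below assumes a made-up 20x30 image with a noisy reddish half and a noisy bluish half; with 2 clusters the result should contain exactly 2 distinct colors.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up image: left half reddish, right half bluish, plus some noise
rng = np.random.default_rng(0)
height, width = 20, 30
synthetic = np.zeros((height, width, 3))
synthetic[:, : width // 2] = [200, 30, 30]
synthetic[:, width // 2 :] = [30, 30, 200]
synthetic += rng.normal(scale=10, size=synthetic.shape)
synthetic = np.clip(synthetic, 0, 255)

# one row per pixel, then cluster the colors
pixels = synthetic.reshape(-1, 3)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)

# replace every pixel with the center of its cluster, then restore the shape
quantized = model.cluster_centers_[model.labels_].reshape(synthetic.shape)

# count the distinct colors left in the image
n_colors = len(np.unique(quantized.reshape(-1, 3), axis=0))
print(n_colors)
print(model.cluster_centers_.round(1))
```

> The cluster centers are floats, so the result is cast with `astype(np.uint8)` before being passed to `imshow` as an RGB image.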

In [None]:
kmeans = KMeans(n_clusters = 4).fit(x)
segmented_img_4 = kmeans.cluster_centers_[kmeans.labels_]
segmented_img_4 = segmented_img_4.reshape(image.shape)

In [None]:
kmeans = KMeans(n_clusters = 5).fit(x)
segmented_img_5 = kmeans.cluster_centers_[kmeans.labels_]
segmented_img_5 = segmented_img_5.reshape(image.shape)

In [None]:
kmeans = KMeans(n_clusters = 8).fit(x)
segmented_img_8 = kmeans.cluster_centers_[kmeans.labels_]
segmented_img_8 = segmented_img_8.reshape(image.shape)

>We will now create a 2x2 grid to plot the 4 variations of color segmentation of `kmeans`.

In [None]:
f, ((ax1,ax2), (ax3,ax4)) = plt.subplots(2, 2, sharey = True, figsize=(15,10))

# cluster centers are floats, so cast to uint8 before displaying as RGB
ax1.imshow((segmented_img_3).astype(np.uint8))
ax1.set_title('KMeans with 3 color clusters')

ax2.imshow((segmented_img_4).astype(np.uint8))
ax2.set_title('KMeans with 4 color clusters')

ax3.imshow((segmented_img_5).astype(np.uint8))
ax3.set_title('KMeans with 5 color clusters')

ax4.imshow((segmented_img_8).astype(np.uint8))
ax4.set_title('KMeans with 8 color clusters')

f.suptitle('Color Segmentation using KMeans with different numbers of color clusters')

plt.show();

> Above is the result of color segmentation. In the first picture, with 3 clusters, we notice there are only three colors: a variant of white, a variant of yellow, and black. When we increase the number of clusters by 1, a new color, blue, is introduced in the output. In the original image, only 6-7 houses have some variation of blue. The mean of all these blues was calculated, and every shade of blue was replaced with that mean.

>As we increase the number of clusters, new colors are introduced. `KMeans` tends to keep the clusters of similar size, so a new color is only introduced when its cluster would be large enough in comparison with the other color clusters.