# ML101: A short course on machine learning
#### Malcolm C. A. White and Nori Nakata

*Machine learning* (ML) refers to a diverse set of computer algorithms that are designed to "learn" how to make inferences from input data. ML is a component of Artificial Intelligence (AI), the study of reasoned decision making in machines.

In this short course, we will look at two main problems that ML algorithms attempt to solve: a) **classification** and b) **regression**. ML algorithms address many other problems, but we will limit ourselves to this pair of related problems. For our purposes, classification problems are those where we wish to infer the value of a categorical or discrete numerical variable from a set of independent input variables, and regression problems are those where we wish to infer the value of a continuous numerical variable from a set of independent input variables.

In service of solutions to these classes of problems, we will investigate *clustering*, *supervised learning*, and *dimensionality reduction* techniques. We will also investigate *Artificial Neural Networks* (ANN).

In [None]:
# %matplotlib ipympl
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.cluster
import sklearn.datasets
import sklearn.svm
import seaborn as sns

sns.set_theme()

# **1. Classification**

First, let's load and display the Iris data set, which contains data from three species of Iris flowers.

In [None]:
data = sklearn.datasets.load_iris()
labels = data["target_names"]
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["species"] = labels[data["target"]]
df.columns = ["_".join(column.split()[:2]) for column in df.columns]
df

As you can see, our data set comprises a sample of 150 cases (rows), split evenly between the three species. In each case, four measurements were taken: a) sepal length, b) sepal width, c) petal length, and d) petal width. Let's see if, given only these four measurements, we can accurately classify the species for each case.

We can create a set of plots to show the relationships between the different explanatory variables.

In [None]:
plt.close("all")
sns.pairplot(data=df, hue="species");

Let's separate our data into *training* and *test* data sets. Our training data set will comprise 90% of the original data and will be used to build our models for making useful inferences. The test data set will comprise the remaining 10% and will be used to test the accuracy of our models.

In [None]:
df_train = df.sample(frac=0.9)
df_test = df.loc[~df.index.isin(df_train.index)]

df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
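
The split above samples rows at random, so the three species may not be equally represented in the test set, and the result differs on every run. As a hedged alternative, the sketch below uses scikit-learn's `train_test_split` with `stratify` and a fixed `random_state`. The names `iris_df`, `strat_train`, and `strat_test` are illustrative only and are not used elsewhere in this course.

```python
import pandas as pd
import sklearn.datasets
import sklearn.model_selection

# Rebuild the Iris table so that this sketch runs on its own.
iris = sklearn.datasets.load_iris()
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
iris_df["species"] = iris["target_names"][iris["target"]]

# Hold out 10% of the rows, keeping the species proportions equal
# in both subsets (stratify) and fixing the seed (random_state).
strat_train, strat_test = sklearn.model_selection.train_test_split(
    iris_df,
    test_size=0.1,
    stratify=iris_df["species"],
    random_state=0,
)

print(strat_train["species"].value_counts())
print(strat_test["species"].value_counts())
```

A stratified split is optional here, but with only 15 test cases it keeps any single species from dominating the test set by chance.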

## **1.1 *k*-means clustering**

We see that the different species tend to cluster in different regions of the plots above. The *k*-means clustering algorithm aims to find the center of each cluster; each point is then associated with the nearest cluster center. The cell below will perform *k*-means clustering and plot the resulting cluster assignments.

In [None]:
n_clusters = 3
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

classifier = sklearn.cluster.KMeans(
    n_clusters=n_clusters, 
    random_state=0
).fit(
    df_train[features]
)

df_train["cluster_id"] = classifier.labels_
df_train["cluster_id"] = df_train["cluster_id"].astype(str)

plt.close("all")
sns.pairplot(
    data=df_train, 
    hue="cluster_id",
    hue_order=sorted(df_train["cluster_id"].unique())
);
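
To make the idea of "find the centers, then assign each point to the nearest one" concrete, here is a minimal NumPy sketch of the classic alternating iteration (often called Lloyd's algorithm). It is not scikit-learn's implementation, which uses a more careful initialization and several restarts. The helpers `nearest_center` and `kmeans_sketch` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import sklearn.datasets


def nearest_center(points, centers):
    """Return, for each point, the index of the closest center."""
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans_sketch(points, n_clusters, n_iter=100, seed=0):
    """Toy k-means: alternate assignment and center-update steps."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters data points chosen at random.
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest center.
        labels = nearest_center(points, centers)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, nearest_center(points, centers)


X = sklearn.datasets.load_iris()["data"]
centers, labels = kmeans_sketch(X, n_clusters=3)
print(centers.round(2))
```

Because the starting centers are random, different seeds can converge to different solutions; this is one reason library implementations typically run several initializations and keep the best.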

We can see how many of each species were assigned to each cluster.

In [None]:
df_train.groupby(["cluster_id", "species"]).size()

All Setosa cases were assigned to cluster \#0, and no cases of any other species were. Cluster \#1 comprises primarily Virginica cases, and cluster \#2 comprises primarily Versicolor cases. Because we have a one-to-one mapping between the cluster IDs inferred by the model and the species, we can define this map:

In [None]:
counts = df_train.groupby(["cluster_id", "species"]).size()
counts.name = "size"
counts = counts.reset_index()
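# Sort so that the most common species in each cluster comes first,
# then keep only that row for each cluster.
counts = counts.sort_values(["cluster_id", "size"], ascending=[True, False])
counts = counts.drop_duplicates(subset="cluster_id")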
counts["cluster_id"] = counts["cluster_id"].astype(int)
clust2spec = counts.set_index("cluster_id")["species"]
spec2clust = counts.set_index("species")["cluster_id"]

And we can infer the species for each case in our test data set.

In [None]:
df_test["inferred_species"] = clust2spec.loc[classifier.predict(df_test[features])].values
df_test

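Comparing the `inferred_species` column with the true `species` column above shows how well our simple classifier performs on the test data. Setosa cases are well separated from the rest; misclassified cases, if any, are typically Versicolor or Virginica, whose measurements overlap.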
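
Rather than judging accuracy by eye, we can compute it. The sketch below is a self-contained illustration under a few assumptions of its own: it holds out every tenth row as a fixed test set (instead of the random split used above), maps each cluster to its most common species, and then uses `sklearn.metrics.accuracy_score` and `pd.crosstab` to summarize the result. The names `test_mask`, `majority`, and `inferred` are illustrative.

```python
import pandas as pd
import sklearn.cluster
import sklearn.datasets
import sklearn.metrics

iris = sklearn.datasets.load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
species = pd.Series(iris["target_names"][iris["target"]], name="species")

# Hold out every tenth row as a small, fixed test set (15 rows).
test_mask = X.index % 10 == 0
model = sklearn.cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(X[~test_mask])

# Map each cluster to the species that occurs most often within it.
train_clusters = pd.Series(model.labels_, index=X.index[~test_mask])
majority = species[~test_mask].groupby(train_clusters).agg(lambda s: s.mode().iloc[0])

# Infer a species for each held-out row and score the result.
inferred = majority.loc[model.predict(X[test_mask])].to_numpy()
accuracy = sklearn.metrics.accuracy_score(species[test_mask], inferred)
print(f"accuracy: {accuracy:.2f}")

# Rows are true species, columns are inferred species.
print(pd.crosstab(species[test_mask].to_numpy(), inferred,
                  rownames=["true"], colnames=["inferred"]))
```

The cross-tabulation shows not only how many cases were misclassified but also which species were confused with each other.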

A further drawback to the *k*-means algorithm is that the user must specify the desired number of clusters. Try changing the `n_clusters` variable at the beginning of the exercise in this section to see how the clustering is affected by the chosen number of clusters. Note that if the number of clusters is not equal to the number of species, we will not have a one-to-one map between cluster IDs and species, so the latter part of the analysis above will fail.
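
One common heuristic for exploring the choice of `n_clusters` is to fit the model for a range of values and compare the *inertia*, the sum of squared distances from each point to its assigned cluster center, which scikit-learn exposes as the `inertia_` attribute. The sketch below is only one possible approach and fits on the full Iris data for simplicity; the `inertias` dictionary is an illustrative name.

```python
import sklearn.cluster
import sklearn.datasets

X = sklearn.datasets.load_iris()["data"]

# Fit one model per candidate cluster count and record its inertia.
inertias = {}
for k in range(1, 8):
    model = sklearn.cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Inertia generally decreases as clusters are added, so the value of interest is where the improvement starts to level off rather than where inertia is smallest.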

## **1.2 Support-Vector Machines (SVM)**

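An alternative approach to the classification problem is to use a *Support-Vector Machine* (SVM). Unlike *k*-means, an SVM is a supervised method: it learns decision boundaries from training cases whose species are already known.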

In [None]:
classifier = sklearn.svm.SVC().fit(df_train[features], df_train["species"])

In [None]:
classifier.predict(df_test[features])
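
To put a number on the SVM's performance, we can compare its predictions with the known species. The following self-contained sketch makes its own stratified 90/10 split with a fixed seed (an assumption for reproducibility, not the split used above) and calls the classifier's `score` method, which returns the mean accuracy on the supplied data.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.svm

X, y = sklearn.datasets.load_iris(return_X_y=True)

# Stratified 90/10 split with a fixed seed so the result is repeatable.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# Fit an SVM with scikit-learn's default settings.
svc = sklearn.svm.SVC().fit(X_train, y_train)

# Fraction of test cases whose species was predicted correctly.
test_accuracy = svc.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

With only 15 test cases, each misclassification changes the accuracy by almost seven percentage points, so the figure should be read as a rough indication.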

## **1.3 Random Forests**

# **2. Regression**

## **2.1 Linear regression**

## **2.2 Artificial Neural Networks**

## **2.3**