<a target="_blank" href="https://mybinder.org/v2/gh/joshmaglione/CS3101-Notes/HEAD?labpath=Notes%2Fnotebooks%2FD_Iris_flowers.ipynb">
  <img src="https://mybinder.org/badge_logo.svg" alt="Binder"/>
</a> 
<a target="_blank" href="https://colab.research.google.com/github/joshmaglione/CS3101-Notes/blob/main/Notes/notebooks/D_Iris_flowers.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a> <a target="_blank" href="https://github.com/joshmaglione/CS3101-Notes/blob/main/Notes/notebooks/D_Iris_flowers.ipynb">View on GitHub</a>

# Iris flowers

We will be looking at a famous data set of three iris species. 

This demonstrates the capabilities of Python and packages, including:
- NumPy,
- Matplotlib,
- Scikit-learn,

which have benefited from significant mathematical and statistical advances in data science. 

## Brief background

In the 1930s, botanist [Edgar Anderson](https://en.wikipedia.org/wiki/Edgar_Anderson) gathered data on three species of irises. 

The data was published in 1936 in the [Annals of the Missouri Botanical Garden](https://doi.org/10.2307/2394164).

The data set was made famous by [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) who used it as an example of linear discriminant analysis, also in 1936. It was published in the [Annals of Eugenics](https://doi.org/10.1111/j.1469-1809.1936.tb02137.x).

The data comprises a sample of $150$ points in $\mathbb{R}^4$: $50$ flowers from each of the three species. 

Anderson measured the length and width of both the petals and the sepals of each flower. 

## The data

Let's load the data and take a brief look at it.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
iris
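
Printing the whole object gives a lot of output at once. As a minimal sketch of a more targeted look, the cell below inspects a few attributes of the object returned by `load_iris` (`data`, `target`, `feature_names`, and `target_names`) and counts the flowers per species. The variable name `counts` is ours and is used only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Each row is one flower; each column is one of the four measurements.
print(iris.data.shape)
print(iris.feature_names)
print(iris.target_names)

# iris.target stores the species as integer labels 0, 1, 2.
# np.bincount counts how often each label occurs.
counts = np.bincount(iris.target)
print({str(name): int(c) for name, c in zip(iris.target_names, counts)})
```

The `target` array is what we will later pass to the `c` argument of `scatter`.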

We will exploit the fact that scatter plots in `Matplotlib` can display $4$ dimensions:
- $x$ values
- $y$ values
- color
- size of marker.

We will map the `target` array to color. Since there are only three species, there will be three distinct colors.

We have three more dimensions we can display: $x$, $y$, and size. 

#### Organization

There are four measurements, so there are $\binom{4}{3} = 4$ different sets of three measurements we can consider. 

For each such set, there are essentially only $3$ different plots, one for each choice of the measurement mapped to size, since interchanging $x$ and $y$ isn't really different.

In total, there are $4 \cdot 3 = 12$ potentially distinct plots. Let's plot them all.
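
The count can be double-checked with a short sketch that uses only the standard library. This is not the organization used in the rest of the notebook; the names `measurements` and `plots` are chosen here just for the illustration.

```python
from itertools import combinations

measurements = range(4)   # indices of the four measurements

plots = []
for subset in combinations(measurements, 3):   # choose 3 of the 4 measurements
    for size in subset:                        # choose which one is the marker size
        # The remaining two are x and y; their order is not counted separately.
        x, y = (m for m in subset if m != size)
        plots.append((x, y, size))

print(len(plots))
```

Each entry of `plots` is a tuple `(x, y, size)` of measurement indices.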

Here are two helper functions.

In [None]:
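def triple(i, j, s=False):
    # Drop one of the four measurements, leaving three indices.
    v = list(range(4))
    v.remove(3 - j)
    # Move entry i to the end, where it is used for the marker size.
    u = v[:i] + v[i+1:] + [v[i]]
    if s:
        return ''.join(map(str, u))   # e.g. '120' for [1, 2, 0]
    return u

def scaled(arr, N=200):
    # Rescale so that the largest entry of arr equals N.
    m = arr.max()
    return N/m * arr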
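
To see what `scaled` does, here is a small sketch on made-up numbers (the array `widths` is not part of the iris data). The helper is repeated so that the cell runs on its own.

```python
import numpy as np

def scaled(arr, N=200):
    # Same helper as above: rescale so that the largest entry equals N.
    m = arr.max()
    return N / m * arr

widths = np.array([1.0, 2.0, 4.0])   # made-up values for illustration
sizes = scaled(widths, N=100)
print(sizes)   # the largest entry becomes 100, the others shrink proportionally
```

In `scatter`, the `s` argument is the marker area in points squared, so rescaling keeps the largest markers at a readable size regardless of the units of the measurement.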

Here's how we will organize the plots. The string `'ijk'` represents the list `[i, j, k]`. 

The first two entries of the list are the $x$ and $y$ values, and the last entry is the size.

In [None]:
np.array([[triple(i, j, s=True) for j in range(4)] for i in range(3)])

Notice that each of the $12$ strings indicates a different plot.
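
This can also be checked by machine rather than by eye. The sketch below repeats `triple` (without the string option) and records each plot as an unordered pair of axes together with the size index, so that swapping $x$ and $y$ does not count as a new plot. The name `keys` is ours.

```python
def triple(i, j):
    # Same logic as the helper above, without the string option.
    v = list(range(4))
    v.remove(3 - j)
    return v[:i] + v[i+1:] + [v[i]]

# Identify a plot by its unordered pair of axes and its size index.
keys = set()
for i in range(3):
    for j in range(4):
        a, b, c = triple(i, j)
        keys.add((frozenset((a, b)), c))

print(len(keys))   # 12 means no two grid positions describe the same plot
```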

## Plotting the data

Now let's plot all of the data!

In [None]:
features = iris.data.T
fig, axs = plt.subplots(3, 4, figsize=(12, 6))
for i in range(3):
    for j in range(4):
        a, b, c = triple(i, j)
        axs[i, j].scatter(features[a], features[b], alpha=0.5,
            s=scaled(features[c], N=100), c=iris.target, cmap='viridis')
        axs[i, j].set_xlabel(iris.feature_names[a])
        axs[i, j].set_ylabel(iris.feature_names[b])
        axs[i, j].set_title(iris.feature_names[c])
plt.tight_layout()      # Make the labels fit
_ = plt.show()

If the colors weren't there, would you be able to draw the distinction between the three species? 

## Applying mathematical and statistical tools

One common tool to help visualize high-dimensional data is Principal Component Analysis (PCA).

PCA is a tool from linear algebra, and without explaining any of it, we apply it to the iris data set.

In [None]:
from sklearn.decomposition import PCA

# Perform PCA
pca = PCA(n_components=2)
iris_pca = pca.fit_transform(iris.data)

# Create a scatter plot
plt.scatter(iris_pca[:, 0], iris_pca[:, 1], c=iris.target, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA Projection of Iris Dataset')
plt.show()

Notice that the three species seem to fit into three clusters. 

There are some border cases between the species shown in green and yellow, but otherwise the projection appears to separate the three species. 

Reducing the dimension and then looking for clusters is a common approach to classifying (or labeling) data. 
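
For readers curious about what PCA computes, here is a rough sketch rather than a full explanation. It assumes the standard description of PCA: center the data, take a singular value decomposition, and project onto the leading right singular vectors. The names `Xc`, `by_hand`, and `same` are ours. Since principal components are only determined up to sign, the comparison with scikit-learn uses absolute values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# PCA as computed by scikit-learn.
pca = PCA(n_components=2)
iris_pca = pca.fit_transform(X)

# The same projection by hand: center the data, then take the SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
by_hand = Xc @ Vt[:2].T   # coordinates along the top two right singular vectors

# Compare up to the sign of each component.
same = np.allclose(np.abs(by_hand), np.abs(iris_pca), atol=1e-6)
print(same)

# Fraction of the total variance captured by PC1 and PC2.
print(pca.explained_variance_ratio_)
```

The attribute `explained_variance_ratio_` indicates how much of the spread in the four measurements survives the projection to two dimensions, which helps explain why the two-dimensional picture is still informative.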