# Analyzing the Iris Dataset

#### In this analysis, you will use two machine learning methods implemented in python to predict the species of Iris based on its flower anatomy.

#### Please take your time with this analysis. Run each command (using Shift + Enter) and think about what you are asking the computer to do.

#### The iris dataset was compiled by [R. A. Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) and published in 1936, which is ages ago. Though initially used for fairly naive discrimination methods, it is an early example of machine learning for predictive data in biology. You can read more about the dataset on [Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set).

#### The first thing we will do is load the data, which is included as part of the sklearn python module. If you end up using machine learning in your practical projects, it is likely that sklearn will be of use to you.

In [None]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

#### We can quickly check a few things to ensure our data are loaded.

> #### 1. The data type - in python, we tend to call the data type used by sklearn a ["Bunch"](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html#sklearn.utils.Bunch)
> #### 2. The column headers (keys)
> #### 3. The values of the first 5 rows of data
> #### 4. The species each row's data belong to
> #### 5. The names of those species
> #### 6. The feature (variable) names
> #### 7. The location of the dataset

In [None]:
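# Print the type explicitly: a bare expression is only displayed when it is the last line of a cell
print("Type of iris_dataset:\n{}".format(type(iris_dataset)))
print("\n")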
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))
print("\n")
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))
print("\n")
print("Targets:\n{}".format(iris_dataset['target'][:]))
print("\n")
print("Target names:\n{}".format(iris_dataset['target_names']))
print("\n")
print("Feature names:\n{}".format(iris_dataset['feature_names']))
print("\n")
print("Dataset location:\n{}".format(iris_dataset['filename']))
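
#### As a small aside on the "Bunch" type: it behaves like a python dictionary whose keys can also be read as attributes. The sketch below is only an illustration of that idea (the variable names `data_by_key` and `data_by_attribute` are made up for this example).

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# A Bunch is a dictionary whose keys are also available as attributes,
# so both of these expressions return the same array.
data_by_key = iris_dataset['data']
data_by_attribute = iris_dataset.data
print(data_by_key is data_by_attribute)

# One row per flower and one column per measurement, plus one label per flower.
print(data_by_key.shape)
print(iris_dataset.target.shape)
```

#### This is why later cells can write either `iris_dataset['feature_names']` or `iris_dataset.feature_names`.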


#### We can also look at the description of the dataset provided by one of the hardworking folks at sklearn

In [None]:
print(iris_dataset['DESCR'])


#### OK - we can be fairly confident that our data are loaded correctly.

#### Now we need to set our data up for analysis. We are using the handy python function "train_test_split" from the sklearn module. This splits our data into a training set and a test set. Have a look at how the function is implemented here and consider the arguments. random_state is set to 0 - don't worry about this. It's just the random seed and by fixing it we guarantee that we all get the same result. Normally, this would not be set.


### Questions
>#### 1. Why do we split our dataset into a training set and a test set?
>#### 2. What proportion of the dataset will be included in the training set?
>#### 3. Given our use of train_test_split, do you think the algorithms we will be using are supervised or unsupervised methods?



In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape)) 
print("y_test shape: {}".format(y_test.shape))

#### The output of the above tells you the dimensions of the data. For example, the training dataset is 112 plants with 4 features each (X_train), plus one Y value (species) per plant (y_train).
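
#### If you are curious about what fixing the random seed actually does, here is a minimal sketch. It assumes the same arguments as the cell above and simply repeats the split; the suffixed names (`_a`, `_b`, `_c`) are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Two splits with the same seed select exactly the same flowers.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.25, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(np.array_equal(X_train_a, X_train_b))

# A different seed will, in general, pick a different set of flowers.
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y, test_size=0.25, random_state=1)
print(np.array_equal(X_train_a, X_train_c))

# X has one row per flower and one column per feature; y has one label per flower.
print(X_train_a.shape, y_train_a.shape)
```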

#### Now we will convert the data into a ["pandas"](https://pandas.pydata.org/docs/user_guide/index.html) dataframe. As far as I'm aware, pandas has nothing to do with the [black and white balls of clumsiness that they have in Edinburgh zoo](https://www.wwf.org.uk/learn/fascinating-facts/pandas). More boringly, it is a set of functions that allow us to interpret python data in data-frames, like you might be used to in R. A lot of sklearn functions will happily accept pandas data structures as input.
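
#### To give a feel for what a dataframe adds over a bare array, here is a small sketch. It uses the whole dataset rather than the training set, and the name `full_dataframe` is made up for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Wrap the raw array in a DataFrame so each column carries its feature name.
full_dataframe = pd.DataFrame(iris_dataset['data'],
                              columns=iris_dataset['feature_names'])

# Show the first five flowers as a labelled table.
print(full_dataframe.head())

# Columns can be selected by name rather than by position.
print(full_dataframe['petal length (cm)'].mean())
```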

#### We can then plot the 4 measurement variables against each other, observing how they correlate and how species tend to cluster with each other. We call this a [scatter matrix](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.scatter_matrix.html).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',hist_kwds={'bins': 20}, s=60, alpha=.8)

plt.figure()
plt.imshow([np.unique(iris_dataset.target)])
_ = plt.xticks(ticks=np.unique(iris_dataset.target),labels=iris_dataset.target_names)

### Questions
>#### 4. Given the scatter matrix above, which combination of features will be most likely to create an accurate model with which to predict the species of iris?
>#### 5. Given the scatter matrix above, which species will be the easiest to predict?

#### Right. Now you get to do some actual computation. We will be implementing the [K Neighbours](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification) algorithm to create a model to predict the species of iris from petal length and petal width, which you hopefully identified as the most informative pair of features above. K Neighbours is a very simple method: for a given datapoint, the model simply asks for the identity of the nearest K neighbours in the training data. The majority vote amounts to the prediction for that point. There is one key parameter - n_neighbors (K). This is the number of nearest neighbours used to make a prediction, i.e. K=1 - the prediction is the identity of the nearest neighbour; K=3 - the prediction is the most common identity among the nearest 3 neighbours. In our case, we will set this to one. Have a think about how changing this might affect your predictions.

#### Technically, unless we are estimating K, which we are not here, K nearest neighbours is not machine learning, but a lot of the principles that we learned in the lecture apply.
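
#### To make the majority vote concrete, here is a minimal from-scratch sketch. It is not how sklearn implements the algorithm internally. The function `knn_predict` and the toy data are made up for illustration, and Euclidean distance is assumed (which is also sklearn's default).

```python
from collections import Counter
import numpy as np

def knn_predict(X_known, y_known, new_point, k=1):
    """Hypothetical helper: majority vote among the k nearest known points."""
    # Euclidean distance from the new point to every known point.
    distances = np.sqrt(((X_known - new_point) ** 2).sum(axis=1))
    # Indices of the k closest known points.
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbours is the prediction.
    votes = Counter(y_known[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up): two features per point, two classes.
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [3.0, 3.1], [3.2, 2.9], [1.6, 1.6]])
y_toy = np.array([0, 0, 1, 1, 1])

new_point = np.array([1.5, 1.5])
print(knn_predict(X_toy, y_toy, new_point, k=1))  # nearest single point is class 1
print(knn_predict(X_toy, y_toy, new_point, k=3))  # two of the nearest three are class 0
```

#### Notice that the same point gets a different prediction depending on K, because one nearby point of class 1 is outvoted once two slightly more distant points of class 0 are included.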

#### The way that sklearn works is, to my mind, a bit unintuitive. First, you create a variable that features all of the hyperparameters with which you can make your model (here that variable is knn).

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

#### Then you can fit the model on the data - we have here chosen to fit it to the petal length and petal width `(X_train[:,2:])`

In [None]:
knn.fit(X_train[:,2:], y_train)

#### We can visualise our model using the following code - don't worry too much about the code. Just know that it first sets up the plot area, then colours the plot according to the model estimated, then plots the training data in a colour corresponding to the species according to a legend.

#### The code below produces a warning about colours which is too boring to concern us. Don't worry about it.

In [None]:
import numpy as np
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

n_neighbors = 1
x_min,x_max = X_train[:,2].min() - 1, X_train[:,2].max()+ 1
y_min,y_max = X_train[:,3].min() - 1, X_train[:,3].max()+ 1
h=0.02
xx,yy = np.meshgrid(np.arange(x_min,x_max,h),np.arange(y_min,y_max,h))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])
cmap_light=ListedColormap(['orange', 'cyan', 'cornflowerblue'])

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='gouraud')
for target in iris_dataset.target_names:
    index=np.where(iris_dataset.target_names==target)[0][0]
    ax1.scatter(X_train[:,2][y_train==index],X_train[:,3][y_train==index],
                cmap=cmap_bold,edgecolor='k', s=20, label=target)
ax1.set_xlim(x_min,x_max)
ax1.set_ylim(y_min,y_max)
ax1.legend()
ax1.set_xlabel("petal length (cm)")
ax1.set_ylabel("petal width (cm)")
ax1.set_title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, 'uniform'))
plt.show()

#### Have a look at your "model" and think about what this means for making predictions in the test set.

### Questions
>#### 6. What combinations of petal length and petal width are likely to be difficult to predict in the test set based on the figure?
>#### 7. It is impossible to say for sure without extra analyses, but from the figure, do you think we've overfit or underfit the model to the data?

#### The next step is to predict the value for a new flower. Imagine you found an Iris and took measurements of 4, 3.5, 1.2, and 0.5 cm for sepal length, sepal width, petal length and petal width respectively. What species does our model propose this flower belongs to? Remember, you don't actually know for sure in this case, though it's pretty clear cut here.

In [None]:
new_data=np.array([[4,3.5,1.2,0.5]])
prediction = knn.predict(new_data[:,2:])
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(iris_dataset['target_names'][prediction]))

#### Now is the moment of truth. Does our model predict the species well on an independent dataset (our test set)?

#### To ask this question, we throw the X values (petal length and width) of the test set at our model and return the predictions. After this, you can print the true values of the species (0 = setosa, 1 = versicolor, 2 = virginica) and the predicted values.

In [None]:
y_pred = knn.predict(X_test[:,2:])
print("Test set predictions:\n {}".format(y_pred))
print("Test set true values:\n {}".format(y_test))

### Questions
>#### 8. What is the true species of the flower which was incorrectly assigned a species by our model?
>#### 9. What is the predicted species of the flower which was incorrectly assigned a species by our model?

#### We can also ask how accurate the model is - in this case simply measured as the proportion of predictions that were correct out of the total number of predictions.

In [None]:
print("Test set score: {:.2f}".format(knn.score(X_test[:,2:], y_test)))
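
#### The score above can be reproduced by hand, which may make the definition clearer. This sketch refits the same model so that it runs on its own; the name `manual_accuracy` is made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)

# Same model as above: K=1, fitted on petal length and petal width only.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train[:, 2:], y_train)
y_pred = knn.predict(X_test[:, 2:])

# Accuracy is the fraction of test flowers whose prediction matches the truth.
manual_accuracy = np.mean(y_pred == y_test)
print("Manual accuracy: {:.2f}".format(manual_accuracy))
print("knn.score:       {:.2f}".format(knn.score(X_test[:, 2:], y_test)))
```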

#### We can plot the values for the test set on the same axes that we plotted the training points.

#### As before, ignore the warning

In [None]:
x_min,x_max = X_train[:,2].min() - 1, X_train[:,2].max()+ 1
y_min,y_max = X_train[:,3].min() - 1, X_train[:,3].max()+ 1
h=0.02
xx,yy = np.meshgrid(np.arange(x_min,x_max,h),np.arange(y_min,y_max,h))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])
cmap_light=ListedColormap(['orange', 'cyan', 'cornflowerblue'])

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='gouraud')
for target in iris_dataset.target_names:
    index=np.where(iris_dataset.target_names==target)[0][0]
    ax1.scatter(X_test[:,2][y_test==index],X_test[:,3][y_test==index],
                cmap=cmap_bold,edgecolor='k', s=20, label=target)
ax1.set_xlim(x_min,x_max)
ax1.set_ylim(y_min,y_max)
ax1.legend()
ax1.set_xlabel("petal length (cm)")
ax1.set_ylabel("petal width (cm)")
ax1.set_title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, 'uniform'))
plt.show()

#### Ask yourself how the model performs. Were you right about the range in which prediction would be more difficult?

### Questions
>#### 10. Can you think of any downsides of the K nearest neighbours algorithm for this dataset?
>#### 11. In a hypothetical scenario, you are given a load more data and your accuracy reduces to 0.9. Upon investigation you find that your model is overfit. This means that the model is inferring patterns that are true only of the training set, meaning they are not generalisable. How might you adapt your strategy to reduce overfitting while still using K nearest neighbours?

### Random forests

#### Assuming no dreadful miscalculation of timings on the part of your instructor, you will have heard about [Random Forests](https://towardsdatascience.com/random-forest-3a55c3aca46d) in the lecture part of this taster session. I hope you were paying attention.

#### Just in case you weren't, Random Forest is a machine learning algorithm that asks questions of our data using decision trees. The model is built by slowly splitting the dataset into purer classes (i.e. moving towards all setosa or versicolor etc.) according to the X values. A tree might split the data into those with a petal length over 3cm and those with petal length under 3cm. Those flowers with length over 3cm might then be split into those with petal width over 1.5cm and those with petal width under 1.5cm. This happens sequentially until the nodes of the tree are at some desired purity level (in this case complete purity) or the tree reaches a maximum depth (in this case 8). The random forest generates a given number of trees to make its predictions, here 100.
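
#### One common way to put a number on "purity" is Gini impurity, which is the default splitting criterion of sklearn's RandomForestClassifier. The sketch below is a hand-rolled illustration only: `gini_impurity` is a made-up helper and the labels are toy data, not the iris training set.

```python
import numpy as np

def gini_impurity(labels):
    """Hypothetical helper: Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    # 0 means every label is the same; larger values mean a more mixed set.
    return 1.0 - np.sum(proportions ** 2)

# Toy labels (made up): 0 = setosa, 1 = versicolor, 2 = virginica.
mixed_node = np.array([0, 0, 1, 1, 2, 2])
after_split_left = np.array([0, 0])
after_split_right = np.array([1, 1, 2, 2])

print(gini_impurity(mixed_node))         # three equally common species
print(gini_impurity(after_split_left))   # a pure node
print(gini_impurity(after_split_right))  # two equally common species
```

#### A good split is one whose child nodes are, on average, less impure than the node being split.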

#### Unlike K nearest neighbours above, we will use all 4 features to create our model here. In principle, Random Forests should rely on the most informative features without us specifying them so there isn't much need to worry about which features will be most informative, though too many completely useless features will reduce the efficacy of the model.

#### As we've seen, in sklearn, you set up the model first then fit it to your data, which is done below. Have a look at the parameters and think about how the analysis might change under different hyperparameters. To read what each of these parameters does, you can look at the [sklearn Random Forest classifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 100,
                                       max_depth = 8,
                                       max_features='sqrt',
                                       min_samples_split=2,
                                       n_jobs=1,
                                       random_state=0)
model.fit(X_train, y_train)

#### To give you an idea of what the model actually looks like, here is some code that plots the first of the trees in your forest. Everyone's tree should look identical because you've all been given the same starting seed. In each box you are given:
>#### a) The decision being made (how the dataset is being split). This is missing in terminal nodes.
>#### b) The [GINI impurity](https://sam-black.medium.com/calculating-a-features-importance-with-xgboost-and-gini-impurity-3beb4e003b80) of that node. Don't worry too much about GINI unless you are especially interested. For now, you can think of it as a measure of how mixed the species are at that node: 0 means every flower at the node belongs to the same species.
>#### c) The number of samples (flowers) that have taken the path leading to that node
>#### d) The value - number of samples of each species that have taken the path leading to that node, formatted as [N_setosa, N_versicolor, N_virginica]
>#### e) The class that would be predicted if a new flower took that path through the tree.

#### NOTE: when a flower satisfies the condition set out in the decision in a box, it always passes to the left-hand box below. Flowers that do not satisfy the condition follow the branch on the right-hand side!
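
#### If you would like to see the left/right rule in action, the sketch below follows one test flower through the first tree by hand. It refits the same forest so that it runs on its own, and it relies on the `tree_` attribute of a fitted sklearn decision tree (`children_left`, `children_right`, `feature`, `threshold`). The names `first_tree`, `flower` and `goes_left` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=100, max_depth=8, max_features='sqrt',
                               min_samples_split=2, n_jobs=1, random_state=0)
model.fit(X_train, y_train)

first_tree = model.estimators_[0].tree_
# sklearn trees work on 32-bit floats internally, so we do the same here.
flower = X_test[0].astype(np.float32)

node = 0  # start at the root of the tree
# A node with no left child (marked -1) is a terminal node.
while first_tree.children_left[node] != -1:
    feature = first_tree.feature[node]
    threshold = first_tree.threshold[node]
    goes_left = bool(flower[feature] <= threshold)
    print("{} <= {:.2f}? {}".format(
        iris_dataset['feature_names'][feature], threshold, goes_left))
    # Condition satisfied: left-hand branch. Otherwise: right-hand branch.
    if goes_left:
        node = first_tree.children_left[node]
    else:
        node = first_tree.children_right[node]

print("Reached terminal node {}".format(node))
```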

#### In my view, the tree displayed below will confirm that _Iris setosa_ is the easiest species to discriminate but also shows how random forests can find complex patterns of features that lead to a distinction between _Iris versicolor_ and _Iris virginica_.



In [None]:
from sklearn import tree

fn=iris_dataset['feature_names']
cn=iris_dataset['target_names']
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (4,4), dpi=800)
tree.plot_tree(model.estimators_[0],
               feature_names = fn, 
               class_names=cn,
               filled = True);


### Questions
>#### 12. How many _Iris setosa_ plants in the training set have a petal width > 0.75cm?
>#### 13. Which species would a flower with petal width 2.0cm, petal length 4.1cm, sepal length 6.6cm, and sepal width 2.4cm be predicted to be by the tree displayed here?
>#### 14. What is the route taken through this tree by the majority of _Iris virginica_ plants?

#### Now we predict the species for our test set and compare to the true values

In [None]:
y_pred_test = model.predict(X_test)
print("Test set predictions:\n {}".format(y_pred_test))
print("Test set true values:\n {}".format(y_test))
print("Test set score: {:.2f}".format(model.score(X_test, y_test)))

#### So, with all that extra fuss about trees etc., our result is exactly the same. The same flower is predicted to be the same, incorrect species with all the rest being correct. Clearly, in this case, the extra complexity of the random forest, at least in how it was implemented here, has not improved prediction, though with more data, who knows what would happen?

#### There is still an added benefit of the Random Forest algorithm over K nearest neighbours. Remember when I said you don't need to worry too much about GINI? Well, the commands below will print the GINI importance of each of the 4 different features. Again, we don't need to worry too much about the specifics of GINI, but you can know that it gives you an idea of the relative importance of each of the features in building the model.

In [None]:
for i in range(4):
    print(iris_dataset['feature_names'][i])
    print(model.feature_importances_[i].round(3))
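
#### One property worth knowing is that these importances are normalised, so each value can be read as that feature's share of the total. The sketch below refits the same forest so that it runs on its own and checks the sum; the name `importances` is made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=100, max_depth=8, max_features='sqrt',
                               min_samples_split=2, n_jobs=1, random_state=0)
model.fit(X_train, y_train)

importances = model.feature_importances_

# Pair each feature name with its importance.
for name, importance in zip(iris_dataset['feature_names'], importances):
    print("{:<20} {:.3f}".format(name, importance))

# The importances are normalised, so together they add up to 1.
print("Sum of importances: {:.3f}".format(importances.sum()))
```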

### Questions
>#### 15. Order the variables from most important to least important. Compare with the scatter matrix. Does this fit your expectations?

#### That's the base practical done. Well done. If you are finished but your appetite for machine learning has not yet been satiated, you can play about with the code to your heart's content. Here are some suggestions.
>#### Change the parameters of the K nearest neighbours algorithm and rerun, comparing results with the basic ones.
>#### Change the parameters of the Random Forest part of the practical and see how results change. You might find it interesting to see how results change with only 1 tree of max depth 2 or something. This wouldn't be a good analysis but could show how much is achieved by just one feature.
>#### Implement your own sklearn algorithm on the data - perhaps a neural network. This might be very difficult - I've not done it myself.
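
#### As a possible starting point for the first suggestion, here is a minimal sketch that refits K nearest neighbours for a few values of K and records the test score of each. The list of K values and the `scores` dictionary are arbitrary choices for illustration. Bear in mind that the test set is small, so differences between scores may come down to one or two flowers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)

scores = {}
for k in [1, 3, 5, 7, 9]:
    # Same features as in the practical: petal length and petal width.
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train[:, 2:], y_train)
    scores[k] = knn.score(X_test[:, 2:], y_test)
    print("K = {}: test set score = {:.2f}".format(k, scores[k]))
```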

#### If you need any help with the controls of the jupyter notebook, let the instructor(s) know.