# Python ML Lab

The purpose of this document is to step you through building and running a "simple" machine learning algorithm. Each block of code is commented with explanations of the steps.

We are going to use four numeric features to predict the variety (species) of an iris.

If you have any issues installing libraries, please feel free to reach out by email: l.m.sharp215@gmail.com.

So without further ado, let's get started!

In [1]:
# Standard python libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# numpy for math
# pandas for managing data sets with multiple features and some plotting routines (pairwise plots for example)
# matplotlib for plotting
# Seaborn has advanced plotting routines (violin plot and pairwise plots)

In [None]:
# scikit-learn - this is the machine learning library that will be used

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

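# metrics for evaluating predictions (accuracy, for example)
# KNeighborsClassifier is the k-nearest-neighbors (KNN) classification method
# LogisticRegression is a second, different classifier that we compare against KNN
# train_test_split randomly splits your data into training and testing sets,
# so the model is scored on data it did not see during training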

In [None]:
# The iris data set, csv -> comma-separated values
# The next commands all use the pandas library
# It has multiple columns: sepal length, sepal width, petal length, petal width, and variety (species)

data = pd.read_csv('iris.csv')
# Head prints out the first 5 rows of data
print(data.head())

In [None]:
# Provides information on the data: the features (columns), the number of rows per feature, and the data type of each column
print(data.info())

In [None]:
# describe gives statistical information for each feature
# What are the mean, std, min.. for each feature
print(data.describe())

In [None]:
# data is our iris data, c=data[...] tells the function which feature sets the point color
# A pairwise plot plots each feature against every other feature.
# Diagonals are histograms of a single feature and off-diagonals are scatter plots of two features
#
# The plotted figures show how the data clusters and how each trait is distributed
# Look at the first row, second column: it asks whether there is
# a relationship between sepal length and sepal width

grr = pd.plotting.scatter_matrix(data, c=data["petal.width"], figsize=(15, 15), marker='o',
                        hist_kwds={'bins': 20}, s=60, alpha=.8)
plt.show()

# Do you see any correlations? What are the average values of the histograms?
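
The pairwise plot answers the correlation question by eye. As a rough numeric cross-check, the sketch below computes the Pearson correlation between each pair of features. It assumes the copy of the iris data bundled with scikit-learn (`load_iris`) matches `iris.csv`, and it reuses the column names from this notebook; both are assumptions rather than something the lab itself states.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
iris = load_iris()
columns = ["sepal.length", "sepal.width", "petal.length", "petal.width"]
iris_df = pd.DataFrame(iris.data, columns=columns)

# Pearson correlation between every pair of numeric features
correlations = iris_df.corr()
print(correlations.round(2))
```

Values near +1 or -1 correspond to the tight diagonal bands in the scatter plots; values near 0 correspond to the shapeless clouds.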

In [None]:
# The violin plot shows the distribution of one feature for each variety, side by side
# Useful for spotting medians, quartiles, modes, and unexpected data
# Personally I find these a little easier to read when comparing varieties

# The first figure shows the variation in sepal.length for each variety of iris

g = sns.violinplot(y='variety', x='sepal.length', data=data, inner='quartile')
plt.show()
g = sns.violinplot(y='variety', x='sepal.width', data=data, inner='quartile')
plt.show()
g = sns.violinplot(y='variety', x='petal.length', data=data, inner='quartile')
plt.show()
g = sns.violinplot(y='variety', x='petal.width', data=data, inner='quartile')
plt.show()

# How do the average length/widths change for each variety of Iris?

OK, we've looked through some of the data so far. What we've looked at provides statistical information and compares two or more distributions.

We can make some predictions based on it; however, it may take us time to dig through all of it. Instead, we will implement machine learning (specifically, supervised learning using k-nearest neighbors).

What we want to do now is separate the label (variety) from the features, and use the features to train a machine learning model that predicts the species of an iris.
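
To make the idea of k-nearest neighbors concrete, here is a minimal from-scratch sketch on made-up toy points. The function `knn_predict` and the toy data are hypothetical illustrations, not part of scikit-learn or of the iris data: the function measures the distance from a query point to every training point, keeps the `k` closest, and returns the most common label among them.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Hypothetical helper: majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters with made-up labels
toy_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
toy_y = np.array(["small", "small", "small", "large", "large", "large"])

toy_prediction = knn_predict(toy_X, toy_y, np.array([1.1, 1.0]), k=3)
print(toy_prediction)
```

`KNeighborsClassifier` follows the same idea, with faster neighbor search and more options.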

In [None]:
# X contains all features
# y is the label 

X = data.drop(["variety"],axis=1)
y = data["variety"]

# consider what the shapes are
# are X and y a matrix or a vector?
print(X.shape)
print(y.shape)

In [None]:
# experimenting with different k values
# k is a hyperparameter, so tuning it is important
# Yes, we are tuning and training in the same loop; in this example it saves time

k_range = list(range(1,50))
scores = []
for k in k_range:
    # the model we are using is KNeighborsClassifier - used for classification
    # n_neighbors -> sets the number of neighbors to compare against
    
    # select the model, input hyperparameter
    knn = KNeighborsClassifier(n_neighbors=k)
    
    # train the model,  
    knn.fit(X, y)
    
    # check how well your model predicts your data
    y_pred = knn.predict(X)
    
    # quantify your results: the fraction of accurately predicted values
    # comparing the predicted labels to the training labels (how often is it right?)
    scores.append(metrics.accuracy_score(y, y_pred))
    
plt.plot(k_range, scores)
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for Values of k of k-Nearest-Neighbors')
plt.show()
# How does your accuracy vary with the number of neighbors? What are the best values of k?
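
One caution when reading this curve: the model is scored on the same rows it was trained on. With `k=1`, each training row's nearest neighbor is itself, so the model can simply look up the answer. The sketch below illustrates that, assuming scikit-learn's bundled iris data stands in for `iris.csv`.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
features, labels = load_iris(return_X_y=True)

# Train with k=1 and score on the very same rows
memorizer = KNeighborsClassifier(n_neighbors=1).fit(features, labels)
train_accuracy_k1 = memorizer.score(features, labels)

# Distance from each training row to its single nearest neighbor (itself)
nearest_distances, _ = memorizer.kneighbors(features, n_neighbors=1)

print("training accuracy at k=1:", train_accuracy_k1)
print("largest nearest-neighbor distance:", nearest_distances.max())
```

A near-perfect score here says the model memorized the data, not that it will do well on new flowers.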

In [None]:
# The previous block trained the model with ALL the data

# Using train_test_split will randomly split up your data into
# training and testing sets
# Remember, how you split your data is a choice that affects the results, much like a hyperparameter!

# test_size = fraction of the data held out for testing (yes, 0.4 is a bit high)
# random_state = seed for the random split, so the same split is produced every run

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
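
A purely random split can leave the species unevenly represented in the test set. As a hedged variation on the split above, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both subsets. The sketch assumes scikit-learn's bundled iris data stands in for `iris.csv`; its labels are the integers 0, 1, 2 rather than species names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
features, labels = load_iris(return_X_y=True)

# stratify=labels keeps the share of each species equal in train and test
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.4, random_state=5, stratify=labels
)

# Number of test rows per species (labels are the integers 0, 1, 2)
test_counts = np.bincount(l_test)
print(f_train.shape, f_test.shape)
print(test_counts)
```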

In [None]:
# This block of code is almost the same as above
# The difference is that we train on the training subset only
# and score the predictions on the held-out testing subset

# experimenting with different k values
k_range = list(range(1,50))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))
    
plt.plot(k_range, scores)
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for Values of k of k-Nearest-Neighbors')
plt.show()

# Is the accuracy better or worse? What are the "best" values of k?
# What happens if you use different sizes of X_train (i.e., more or less than what you are currently using)?
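
The "best" k from a single split depends on which rows happened to land in the test set. One common way to reduce that noise is k-fold cross-validation, sketched below with `cross_val_score`. This is a swapped-in technique, not something the lab above uses. The sketch assumes scikit-learn's bundled iris data stands in for `iris.csv`, and the choice of 5 folds and k from 1 to 30 is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
features, labels = load_iris(return_X_y=True)

candidate_ks = list(range(1, 31))
mean_cv_scores = []
for k in candidate_ks:
    # 5 folds: train on 4/5 of the data, score on the remaining 1/5, five times
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), features, labels,
        cv=5, scoring="accuracy",
    )
    mean_cv_scores.append(fold_scores.mean())

# k with the highest average accuracy across the folds
best_k = candidate_ks[int(np.argmax(mean_cv_scores))]
print("best k:", best_k, "mean accuracy:", round(max(mean_cv_scores), 3))
```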

In [None]:
# Logistic regression (logreg) is a separate model, not a test of KNN.
# We train it on the same training data and score it on the same testing data,
# which gives us a second classifier to compare against KNN.
# We are asking: how often does this model predict correctly?

# max_iter is raised from its default so the solver has room to converge
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

# accuracy_score ranges from 0 to 1; the closer to 1, the better
# Try re-running the logreg test after fitting on ALL the data (X, y). How well does it do?
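
A single accuracy number hides which species get confused with which. As a rough sketch, the block below fits both classifiers on the same split and prints a confusion matrix for each (rows are true classes, columns are predicted classes). It assumes scikit-learn's bundled iris data stands in for `iris.csv`; `k=12` and `max_iter=200` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
features, labels = load_iris(return_X_y=True)
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.4, random_state=5
)

models = {
    "knn (k=12)": KNeighborsClassifier(n_neighbors=12),
    "logistic regression": LogisticRegression(max_iter=200),
}

results = {}
for name, model in models.items():
    model.fit(f_train, l_train)
    predictions = model.predict(f_test)
    accuracy = accuracy_score(l_test, predictions)
    # Fix the label order so the matrix is always 3x3
    matrix = confusion_matrix(l_test, predictions, labels=[0, 1, 2])
    results[name] = (accuracy, matrix)
    print(name, "accuracy:", round(accuracy, 3))
    print(matrix)
```

Off-diagonal entries count the mistakes; on iris these typically fall between the two species whose measurements overlap most.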

In [None]:
# Look at that! You've worked to optimize a model to predict 
# iris species!

# Run the prediction block (this) and see how well it guesses your iris species

knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(X, y)

# make a prediction for an example of an out-of-sample observation
pred = knn.predict([[5, 3, 1, .5]])
print(pred)

For the very last test, take a look at the data, choose a species (not the one above), and fill in knn.predict([[v1, v2, v3, v4]]) with numbers representing a different species.

If our model is "OK", it should accurately guess it.

In [None]:
data

In [None]:
# let's confirm that it can choose another variety of iris
# Using the approximate average values for sepal and petal lengths and widths of Virginica,
# can our model predict the right species?
# i.e. ~ [7,3,5,2]

pred = knn.predict([[7,3,5,2]])

print(pred)
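
Rather than estimating the averages by eye, the per-species means can be computed with `groupby` and fed straight back into the model. The sketch below is one way to do that. It assumes scikit-learn's bundled iris data stands in for `iris.csv` (so species names are lowercase, which may differ from the CSV) and reuses this notebook's column names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
iris = load_iris()
columns = ["sepal.length", "sepal.width", "petal.length", "petal.width"]
iris_df = pd.DataFrame(iris.data, columns=columns)
iris_df["variety"] = iris.target_names[iris.target]

# Average of each feature within each species
class_means = iris_df.groupby("variety")[columns].mean()
print(class_means.round(2))

# Fit on all rows, then ask the model to classify each "average flower"
knn_model = KNeighborsClassifier(n_neighbors=12)
knn_model.fit(iris_df[columns], iris_df["variety"])
mean_predictions = knn_model.predict(class_means)

for true_name, predicted_name in zip(class_means.index, mean_predictions):
    print(true_name, "->", predicted_name)
```

Passing a DataFrame with the same column names used in `fit` also avoids the feature-name warning that newer scikit-learn versions emit when a plain list is passed to `predict`.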