<a href="https://colab.research.google.com/github/mtdearing/BIS-intro-to-AI/blob/master/BIS_Intro_to_AI_Hello_Wolrd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A "Hello World" introduction to doing machine learning
## for BIS at Argonne National Laboratory.

### *Matthew T. Dearing, 2019*

> Adapted from [Jason Brownlee](https://machinelearningmastery.com/).



![Purple iris flower](https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/Purple_iris_flower.JPG/320px-Purple_iris_flower.JPG)

The **Iris dataset** was used in R.A. Fisher's classic 1936 paper, [The Use of Multiple Measurements in Taxonomic Problems](http://rcs.chemometrics.ru/Tutorials/classification/Fisher.pdf), and can also be found in the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/).

It includes three iris species with 50 samples each, along with four measurements (sepal and petal length and width) for each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

## **The Iris Dataset**

In this notebook, we will:

1. Load the dataset.
2. Summarize the dataset.
3. Visualize the dataset.
4. Evaluate some algorithms.
5. Make some predictions.

We'll use 5 important libraries:

*   scipy https://www.scipy.org
*   numpy https://numpy.org/
*   matplotlib https://matplotlib.org/
*   pandas https://pandas.pydata.org/
*   sklearn https://scikit-learn.org/

In [0]:
import sys
print('Python: {}'.format(sys.version))

(These print statements are not necessary; they are just to get your feet wet with a bit of Python code and see some output. The *import statements* below, however, are critical.)

In [0]:
import scipy
print('scipy: {}'.format(scipy.__version__))
import numpy
print('numpy: {}'.format(numpy.__version__))
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
import pandas
print('pandas: {}'.format(pandas.__version__))
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

In [0]:
# Load all the modules, functions, and objects we will use:

# pandas for processing our data and a bit of visualization:
from pandas import read_csv
from pandas.plotting import scatter_matrix

# matplotlib for statistical visualizations:
from matplotlib import pyplot

# sklearn to bring in all the machine learning fun:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Now, we will load the Iris dataset, which contains (only) 150 observations of iris flowers. 

There are four columns of measurements in centimeters with a fifth column designating the *species of the flower*.

In [0]:
# Load in the dataset (thanks, Internet!)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class'] # To set labels to the column headers
dataset = read_csv(url, names=names)

In [0]:
# Shape - get a high-level picture of your data: (number of rows, number of columns)
dataset.shape

In [0]:
# Head - take a quick peek at the first 20 rows of data.
dataset.head(20)

In [0]:
# Describe your dataset with some basic statistics
dataset.describe()

In [0]:
# Class distribution - run a count based on one of the columns
dataset.groupby('class').size()

In [0]:
# Visualize the statistics of your data with box-and-whisker plots - to look at medians (middle values), quartiles, and spreads in your data.
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()

In [0]:
# Histograms - to see the distribution of values for each feature
dataset.hist()
pyplot.show()

In [0]:
# Scatter plot matrix - how are the features correlated? Or, how do they interact?

# Points that cluster along a diagonal line suggest a strong, predictable relationship between two features.
scatter_matrix(dataset)
pyplot.show()

We have first tried to understand our data.

So, now we can try to develop a machine learning model of the data.

In [0]:
# Split up your data into training and testing (or validation) datasets.
# We'll build the model with the training sub-set and test the performance of the model on the test sub-set.
array = dataset.values
X = array[:,0:4] # Separate out just the features into a set
Y = array[:,4] # Separate out just the labels into another set

# Automatically split out the training and validation sets, randomly selected into 20% for testing, 80% for training:
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=0.20, random_state=1)
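
To make the 80/20 split concrete, here is a minimal sketch on stand-in data shaped like Iris (150 rows, 4 features, 3 classes of 50). The arrays `X_demo` and `y_demo` are made up for illustration, and the `stratify` argument is an optional extra that the cell above does not use; it asks for the same class proportions in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data shaped like Iris: 150 samples, 4 features, 3 classes of 50.
# (Made-up values, only here to show the sizes of the resulting subsets.)
X_demo = np.arange(150 * 4, dtype=float).reshape(150, 4)
y_demo = np.repeat(['a', 'b', 'c'], 50)

# Same call as above, plus the optional stratify argument.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1, stratify=y_demo)

# 20% of 150 rows go to validation, the remaining 80% to training.
print(X_tr.shape, X_val.shape)

# Count how many samples of each class landed in the validation set.
classes, counts = np.unique(y_val, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```

Without `stratify`, the split is still 120 and 30 rows, but the class counts inside each subset can drift away from an even balance.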

In [0]:
# Spot check multiple algorithms -- remember, No Free Lunch!

# So, we'll build a model using each of these methods in a loop and compare the results based on accuracy at the same time:
models = []
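# Note: newer scikit-learn releases deprecate the multi_class argument. If you see a warning here,
# wrapping LogisticRegression in sklearn.multiclass.OneVsRestClassifier is the suggested replacement.
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))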
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
    # 10-fold cross-validation to score each model. This shuffles the training data, splits it into 10 folds,
    # and trains and scores the model once per fold, holding out a different fold each time.
    # random_state only takes effect with shuffle=True (recent scikit-learn versions raise an error otherwise).
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())) # The average accuracy and the standard deviation.
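
To see what the cross-validation loop does with the data, here is a hand-written sketch of plain k-fold splitting. It is a simplification, not scikit-learn's implementation: the helper `k_fold_indices` is hypothetical, and unlike `StratifiedKFold` it does not balance the classes within each fold. It only shows that every sample is held out exactly once.

```python
import random

def k_fold_indices(n_samples, n_splits, seed=1):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_splits folds.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        # Train on every fold except the one currently held out.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

# 120 samples is the size of the training set after the 80/20 split of 150 rows.
splits = list(k_fold_indices(n_samples=120, n_splits=10))
for train_idx, val_idx in splits[:3]:
    print(len(train_idx), len(val_idx))
```

Each of the 10 rounds trains on 108 samples and scores on the remaining 12, which is why `cross_val_score` returns 10 accuracy values per model.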

In [0]:
# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()

In [0]:
# Make predictions on validation dataset using the model that seemed to do the best
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

In [0]:
# Evaluate predictions
# Accuracy := the ratio of the number of correct predictions to the total number of input samples. 
accuracy_score(Y_validation, predictions)

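But accuracy alone isn't enough to be sure you have a good model. How much it tells you depends on your data: if one class is much more common than the others, a model can score high accuracy while rarely identifying the less common classes.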
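
A small worked example shows why. The labels below are made up for illustration (they are not the Iris data): 95 samples of a `'common'` class, 5 of a `'rare'` class, and a "model" that always predicts the majority class.

```python
# Made-up, heavily imbalanced labels: 95 'common' samples and 5 'rare' ones.
y_true = ['common'] * 95 + ['rare'] * 5
# A useless "model" that always predicts the majority class.
y_pred = ['common'] * 100

# Accuracy: correct predictions divided by total samples.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Of the samples that really are 'rare', what fraction did the model find?
rare_found = sum(t == 'rare' and p == 'rare' for t, p in zip(y_true, y_pred))
rare_fraction_found = rare_found / y_true.count('rare')

print(accuracy)             # 95 of 100 predictions are correct
print(rare_fraction_found)  # none of the 5 rare samples are found
```

The accuracy looks strong even though the model never identifies a single rare sample, so it helps to look at per-class results as well.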

In [0]:
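# Create a table that compares predicted labels to actual labels: each row is an actual class, each column is a predicted class.
# Correct predictions fall on the diagonal; off-diagonal counts are the misclassifications (the false positives and false negatives for each class).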
confusion_matrix(Y_validation, predictions)
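
To help read the matrix, here is a minimal sketch on six made-up labels (the short class names are placeholders, not the exact strings in the dataset). Passing `labels` fixes the row and column order.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
labels = ['setosa', 'versicolor', 'virginica']
y_true = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_pred = ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica', 'virginica']

# Rows are actual classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# The diagonal holds the correct predictions for each class.
print(cm.diagonal().sum(), "of", cm.sum(), "predictions are correct")
```

Here the single off-diagonal count sits in the `versicolor` row and the `virginica` column: one actual versicolor flower was predicted as virginica.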

**Precision** := the number of correct positive results divided by the number of positive results predicted by the classifier.

**Recall** := the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).

**F1 Score** := the harmonic mean (an average suited to rates) of precision and recall. It ranges over [0, 1] and balances how precise the classifier is (how many of its positive predictions are right) against how robust it is (how few of the actual positives it misses).
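
The three definitions can be worked through by hand. The counts below are made up for illustration: for one class, suppose the classifier produced 8 true positives, 2 false positives, and 4 false negatives.

```python
# Made-up counts for a single class, for illustration only.
tp = 8  # predicted positive, actually positive
fp = 2  # predicted positive, actually negative
fn = 4  # predicted negative, actually positive

precision = tp / (tp + fp)  # correct positives / all predicted positives
recall = tp / (tp + fn)     # correct positives / all actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

`classification_report` computes these same three quantities for each class, treating that class as "positive" and all others as "negative".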

In [0]:
print(classification_report(Y_validation, predictions))