<a href="https://colab.research.google.com/github/pkrobinette/workshops/blob/main/workshop_3_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop 3: Machine Learning
**Acknowledgements:** Resources modified from the [scikit-learn tutorial](https://scikit-learn.org/stable/tutorial/basic/tutorial.html#machine-learning-the-problem-setting)

- Directions: Work through the following code blocks in order. To run a code block, do one of the following:
1.   Hover over the block and click the black circle (the run button)
2.   Press Shift+Enter

If more information is needed on Python, check out the following sites:

*   [Python Installation](https://docs.anaconda.com/anaconda/install/) using Anaconda
*   More information about [Python](https://www.python.org/)
*   A [great resource](https://swcarpentry.github.io/python-novice-inflammation/setup.html) for learning Python basics, includes tutorials and setup





## Loading Example Datasets


In [2]:
# Import necessary libraries for downloading datasets
from sklearn import datasets


`scikit-learn` comes with a few standard datasets, for instance the iris and digits datasets for classification and the diabetes dataset for regression.

The iris dataset:

* Consists of petal and sepal measurements for 3 types of iris (Setosa, Versicolour, and Virginica)
*  Rows are the respective samples
*  Columns are Sepal Length, Sepal Width, Petal Length and Petal Width

The digits dataset:
- Consists of 1797 8x8 images
- Each image is of a hand-written digit
- Each 8x8 image is often represented by a vector of length 64
  

In [3]:
# loading the iris dataset
iris = datasets.load_iris()

# loading the digits dataset
digits = datasets.load_digits()
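
As a quick check of the descriptions above, the following sketch prints the array shapes of both datasets. It also uses `digits.images`, an attribute not mentioned above, which holds the same digits as unflattened 8x8 arrays.

```python
from sklearn import datasets

iris = datasets.load_iris()
digits = datasets.load_digits()

# iris: 150 samples, 4 features each
print(iris.data.shape)      # (150, 4)

# digits: 1797 samples, each a flattened 8x8 image (64 features)
print(digits.data.shape)    # (1797, 64)

# the unflattened 8x8 images are also available
print(digits.images.shape)  # (1797, 8, 8)

# flattening the first image reproduces the first row of .data
first_row_matches = bool((digits.images[0].reshape(64) == digits.data[0]).all())
print(first_row_matches)    # True
```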

### Exercise 1:
---
> Using the above notation, load the `wine` dataset into a variable named `wine`. You can check your implementation in the `Answers` section.

In [97]:
# load the wine dataset


## Exploring Data

A dataset is a dictionary-like object that holds the data and some metadata about it. The data is stored in the `.data` member, which is an array of shape `(n_samples, n_features)`. For a supervised problem, one or more response variables are stored in the `.target` member.
- the rows are each of the samples
- the columns are each of the features


| Dataset attribute  | Info                                                 |
|--------------------|------------------------------------------------------|
| .data              | The feature values for each sample                   |
| .feature_names     | The names of the features (columns) of the dataset   |
| .target            | The actual (numeric) label for each sample           |
| .target_names      | The name that corresponds to each numeric label      |
| .data.shape        | The shape of the data array (n_samples, n_features)  |
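
To illustrate what "dictionary-like" means here, the sketch below accesses the same iris arrays both by key and by attribute. It is only a small demonstration; the variable names `same_data` and `same_target` are made up for this example.

```python
from sklearn import datasets

iris = datasets.load_iris()

# a dataset supports dictionary-style access ...
print(sorted(iris.keys()))

# ... and attribute-style access to the very same objects
same_data = iris["data"] is iris.data
same_target = iris["target"] is iris.target
print(same_data, same_target)

# one row per sample, one column per feature
n_samples, n_features = iris.data.shape
print(n_samples, n_features)
```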
 

In [None]:
# the iris data
print(iris.data)

In [6]:
# the feature names
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


The targets are the actual labels of each of the samples. Each numerical label corresponds to flower type (Setosa, Versicolour, Virginica).

In [None]:
# print targets
print(iris.target)

In [20]:
# print target labels and names from iris dataset
print("Target labels and names:")
for item in range(len(iris.target_names)):
  print( "%s = %s" %(item, iris.target_names[item]))
  

Target labels and names:
0 = setosa
1 = versicolor
2 = virginica
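
Because the labels are integers and `target_names` is an array, the two can be combined directly. The sketch below is one possible way to turn numeric labels into names and to count the samples per label; `np.bincount` is a NumPy helper that is not used elsewhere in this workshop.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# index the array of names with the array of numeric labels
names = iris.target_names[iris.target]
print(names[:3])

# count how many samples carry each label
counts = np.bincount(iris.target)
for label, count in enumerate(counts):
    print(label, iris.target_names[label], count)
```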


### Exercise 2:
---

> 1. Print the data of the digits dataset
2. Print the shape of data in the digits dataset
3. Print only the first sample of the digits dataset
4. Print the targets of the digits dataset
5. Print the target names of the digits dataset

Check your implementation in the `Answers` section.

In [96]:
# print the digits data

# print the shape of data in the dataset

#print only the first sample of the digits dataset

#print the targets

# print the target names


## Splitting the Data

To evaluate your model, you need to split your data into a training set and a testing set. Some people also include a validation set. The training set is used to fit the model, and the testing set is used to evaluate it on data it has not seen. A common split is 80% training and 20% testing.

The scikit-learn library gives us an easy way to split datasets with `train_test_split()`. Here we pass it three things: the features, the targets, and the test set size (`test_size`). The records are shuffled before splitting; setting `random_state` to a fixed number makes that shuffle, and therefore the split, reproducible.

In [80]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=4)

This notation can be a little confusing at first. Here is what each of the new arrays holds.

> X_train: the features of the training set.

> y_train: the actual labels of the data in X_train

> X_test: the features of the testing set

> y_test: the actual labels of the data in X_test

It helps to remember that we will use fit(X, y), which finds a function that maps inputs (X) to output labels (y).
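
The effect of `random_state` can be checked directly. The sketch below splits the digits data three times: twice with the same seed and once with a different one (the value 5 is an arbitrary choice for illustration). `train_test_split` returns the four arrays in the order `X_train, X_test, y_train, y_test`, so index 1 is the test features.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

split_a = train_test_split(digits.data, digits.target, test_size=0.2, random_state=4)
split_b = train_test_split(digits.data, digits.target, test_size=0.2, random_state=4)
split_c = train_test_split(digits.data, digits.target, test_size=0.2, random_state=5)

# same seed: the test features are identical
same_seed_equal = bool((split_a[1] == split_b[1]).all())

# different seed: a different set of samples lands in the test set
diff_seed_equal = bool((split_a[1] == split_c[1]).all())
print(same_seed_equal, diff_seed_equal)

# every sample ends up in exactly one of the two sets
n_train, n_test = len(split_a[0]), len(split_a[1])
print(n_train, n_test, n_train + n_test)
```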


### Exercise 3:
---
> 1. print the shape of each of the new datasets
2. Change the test_size and repeat
3. Change the random_state number and repeat

Check your answers in the `Answers` section.

In [95]:
# print the shape of the training and testing data


## Learning and Predicting

In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples of each of the 10 possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the classes to which unseen samples belong.

In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T). *Fit is creating a function that maps inputs (X) to labels (y). Predict then takes this function and returns a predicted label for a given sample T.*
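
To make the `fit`/`predict` interface concrete, here is a deliberately trivial sketch. `MajorityClassifier` is a hypothetical class written for this illustration and is not part of scikit-learn; it only mimics the two methods described above by always predicting the most common training label.

```python
import numpy as np

class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most common training label."""

    def fit(self, X, y):
        # "learning" here is just remembering the most frequent label
        labels, counts = np.unique(y, return_counts=True)
        self.majority_ = labels[np.argmax(counts)]
        return self

    def predict(self, T):
        # return one predicted label per sample in T
        return np.full(len(T), self.majority_)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, 0, 1])

toy = MajorityClassifier().fit(X, y)
toy_pred = toy.predict(np.array([[10.0], [20.0]]))
print(toy_pred)  # [1 1]
```

A real estimator learns a far more useful mapping, but it is called in the same way: `fit` on training data, then `predict` on new samples.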

An example of an estimator is the class sklearn.svm.SVC, which implements support vector classification. The estimator’s constructor takes as arguments the model’s parameters.

For now, we will consider the estimator as a black box:






In [81]:
# import the support vector machine library (svm) from sklearn
from sklearn import svm

# clf = classifier
clf = svm.SVC(kernel = 'linear')

**Training**

In [82]:
# fit a function that maps inputs to outputs
clf.fit(X_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

**Predict**

In [None]:
y_pred = clf.predict(X_test)

# now compare predictions to actual labels.
# Each entry is True if the predicted value equals the actual value,
# and False otherwise.
results = y_pred == y_test

# print the results
print(results)
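
A long list of True/False values is hard to read, so it helps to summarize it. The sketch below repeats the steps above in one self-contained block and then counts the matches; the names `matches`, `n_correct`, and `manual_accuracy` are chosen for this example. It also calls the classifier's `score` method, which reports accuracy for a classifier.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=4)

clf = svm.SVC(kernel="linear").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# True where the prediction equals the actual label
matches = y_pred == y_test

# True counts as 1, so the sum is the number of correct predictions
n_correct = int(matches.sum())
manual_accuracy = matches.mean()
print(n_correct, "of", len(y_test), "correct")
print(manual_accuracy)

# the built-in score should agree with the manual calculation
print(clf.score(X_test, y_test))
```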

## Evaluate Model
Scikit-learn also has some pretty handy built-in functions for evaluating models.

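> `metrics.accuracy_score(y_test, y_pred)` : returns the fraction of predictions that match the actual labels (a value between 0 and 1)

> `metrics.precision_score(y_test, y_pred, average='macro')` : What proportion of positive identifications was actually correct? With more than two classes, as in the digits dataset, an `average` argument such as `'macro'` is needed.

> `metrics.recall_score(y_test, y_pred, average='macro')` : What proportion of actual positives was identified correctly? The same note about `average` applies.

> `metrics.confusion_matrix(y_test, y_pred)` : Each column is a predicted label (0-9), each row is an actual label (0-9). This makes it easy to see where the model commonly makes mistakes.

> `metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)` : a plotted version of the above. Older scikit-learn releases provided `metrics.plot_confusion_matrix(clf, X_test, y_test)` for this, which has since been removed.

> `metrics.classification_report(y_test, y_pred, digits=10)` : precision, recall, and F1-score for every class, plus the overall accuracy. The `digits` argument sets how many decimal places are printed.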
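
The row/column layout of the confusion matrix is easiest to see on a tiny example. The labels below are made up for illustration and have nothing to do with the digits data.

```python
from sklearn import metrics

# made-up labels for three classes (0, 1, 2)
y_true = [0, 0, 1, 1, 2]
y_guess = [0, 1, 1, 1, 0]

cm = metrics.confusion_matrix(y_true, y_guess)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 0]]

# correct predictions sit on the diagonal, so accuracy is diagonal / total
accuracy_from_cm = cm.trace() / cm.sum()
print(accuracy_from_cm)  # 0.6
```

Reading the first row: of the two samples whose actual label is 0, one was predicted as 0 and one as 1.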


In [None]:
from sklearn import metrics

# print accuracy
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))

# Print the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred))

# fancier confusion matrix
# (older scikit-learn releases used metrics.plot_confusion_matrix here)
disp = metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
disp.figure_.suptitle("Confusion Matrix")

# Print the precision and recall, among other metrics
print(metrics.classification_report(y_test, y_pred, digits=10))
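
Precision and recall can also be computed on their own. The sketch below assumes the same digits split and linear classifier as above. Since there are ten classes, it passes `average=None` to get one value per digit and `average="macro"` to get their unweighted mean; other averaging options exist and may suit other problems better.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=4)

clf = svm.SVC(kernel="linear").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# one precision value per digit (0-9)
per_class_precision = metrics.precision_score(y_test, y_pred, average=None)
print(per_class_precision)

# macro average: the plain mean of the per-class values
macro_precision = metrics.precision_score(y_test, y_pred, average="macro")
macro_recall = metrics.recall_score(y_test, y_pred, average="macro")
print("Precision (macro):", macro_precision)
print("Recall (macro):   ", macro_recall)
```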

### Exercise 4: 
---
The above model uses the 'linear' kernel to train, but this parameter can also be:
- polynomial : `'poly'`
- sigmoid : `'sigmoid'`
- radial basis function : `'rbf'`

> 1. Change the kernel parameter to each of these values and see how your accuracy and confusion matrix change.

The degree of the polynomial function can also be set.
> 2. Explore and evaluate different polynomial functions with varying degrees. How do these affect the model? e.g. `clf = svm.SVC(kernel = 'poly', degree = 2)`

A common problem in training is overfitting. This happens when the function that maps inputs to outputs fits the training data too closely, including its noise. The model then scores well on the training set but performs poorly on new samples, because what it learned does not generalize beyond the training scenarios.
> 3. For more information and examples on overfitting and underfitting, check out this [website](https://keeeto.github.io/blog/bias_variance/).
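
One rough way to look for overfitting is to compare accuracy on the training set with accuracy on the testing set: a large gap suggests the model has memorized the training data. The sketch below does this for a few polynomial degrees; the particular degrees are arbitrary choices, and on a dataset as clean as digits the gap may turn out to be small.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=4)

scores = {}
for degree in (1, 2, 3, 5, 8):
    clf = svm.SVC(kernel="poly", degree=degree).fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)  # accuracy on data the model has seen
    test_acc = clf.score(X_test, y_test)     # accuracy on unseen data
    scores[degree] = (train_acc, test_acc)
    print("degree %d: train = %.3f, test = %.3f" % (degree, train_acc, test_acc))
```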

### Exercise 5:
---
> 1. Load the `breast_cancer` dataset into a variable named `cancer`
> 2. Explore the dataset's features and target names
> 3. Split the data into 70% training and 30% testing
> 4. Make an instance of the `svm.SVC` classifier that uses the polynomial kernel with degree 2
> 5. Train the model
> 6. Make predictions on the test set
> 7. Evaluate the model's accuracy and confusion matrix
> 8. In this example, consider the following (the dataset encodes `benign` as label 1, so benign is the "positive" class here):
  - false positives : something that is predicted as benign but is actually malignant
  - false negatives : something that is predicted as malignant but is actually benign
  - Which would you rather avoid in this particular example? Note that you can train a model with these types of situations in mind.


Check answers in the `Answers` section.

In [93]:
# Load the cancer dataset


# explore the data


# split the data into training and testing


# initialize classifier with polynomial kernel of degree 2


# train the model


# predictions


# evaluate the model


---
# Answers
---

## Exercise 1:
> Using the above notation, load the `wine` dataset into a variable named `wine`.

In [None]:
# load the wine dataset
wine = datasets.load_wine()

## Exercise 2:
> 1. Print the data of the digits dataset
2. Print the shape of data in the digits dataset
3. Print only the first sample of the digits dataset
4. Print the targets of the digits dataset
5. Print the target names of the digits dataset

In [None]:
# print the digits data
print(digits.data)
# print the shape of data in the dataset
print(digits.data.shape)
#print only the first sample of the digits dataset
print(digits.data[0])
#print the targets
print(digits.target)
# print the target names
print(digits.target_names)

## Exercise 3: 
> 1. print the shape of each of the new datasets
2. Change the test_size and repeat
3. Change the random_state number and repeat

In [None]:
# print the shape of the training and testing data
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

## Exercise 4: 
The above model uses the 'linear' kernel to train, but this parameter can also be:
- polynomial : `'poly'`
- sigmoid : `'sigmoid'`
- radial basis function : `'rbf'`

> 1. Change the kernel parameter to each of these values and see how your accuracy and confusion matrix change.

The degree of the polynomial function can also be set.
> 2. Explore and evaluate different polynomial functions with varying degrees. How do these affect the model? e.g. `clf = svm.SVC(kernel = 'poly', degree = 2)`

A common problem in training is overfitting. This happens when the function that maps inputs to outputs fits the training data too closely, including its noise. The model then scores well on the training set but performs poorly on new samples, because what it learned does not generalize beyond the training scenarios.
> 3. For more information and examples on overfitting and underfitting, check out this [website](https://keeeto.github.io/blog/bias_variance/).

In [None]:
# examples
clf = svm.SVC(kernel='poly')
clf = svm.SVC(kernel='sigmoid')
clf = svm.SVC(kernel = 'rbf')
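
Rather than editing the kernel by hand each time, the comparison can be written as a loop. This is just one possible way to organize the experiment, assuming the same digits split used earlier; the dictionary name `kernel_accuracy` is made up for this example.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=4)

kernel_accuracy = {}
for kernel in ("linear", "poly", "sigmoid", "rbf"):
    clf = svm.SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    kernel_accuracy[kernel] = metrics.accuracy_score(y_test, y_pred)
    print(kernel, kernel_accuracy[kernel])
```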

## Exercise 5: 
> 1. Load the `breast_cancer` dataset into a variable named `cancer`
> 2. Explore the dataset's features and target names
> 3. Split the data into 70% training and 30% testing
> 4. Make an instance of the `svm.SVC` classifier that uses the polynomial kernel with degree 2
> 5. Train the model
> 6. Make predictions on the test set
> 7. Evaluate the model's accuracy and confusion matrix
> 8. In this example, consider the following (the dataset encodes `benign` as label 1, so benign is the "positive" class here):
  - false positives : something that is predicted as benign but is actually malignant
  - false negatives : something that is predicted as malignant but is actually benign
  - Which would you rather avoid in this particular example? Note that you can train a model with these types of situations in mind.

In [None]:
# Load the cancer dataset
cancer = datasets.load_breast_cancer()

# explore the data
print(cancer.feature_names)
print(cancer.target_names)
print(cancer.data.shape)
print(cancer.data)

# split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size = 0.3, random_state = 10)

# initialize classifier with polynomial kernel of degree 2
clf = svm.SVC(kernel = 'poly', degree=2)

# train the model
clf.fit(X_train, y_train)

# predictions
y_pred = clf.predict(X_test)

# evaluate the model
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
# (older scikit-learn releases used metrics.plot_confusion_matrix here)
disp = metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
disp.figure_.suptitle("Confusion Matrix")

print(metrics.classification_report(y_test, y_pred, digits=2))

8. In this instance, we would want to avoid false positives (malignant cases predicted as benign). It would be much worse to miss something cancerous than to mislabel something benign as cancerous.
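
The two kinds of error can be read straight from the confusion matrix, and one way to train "with these situations in mind" is the `class_weight` parameter of `svm.SVC`, which makes mistakes on one class cost more during training. The sketch below is an illustration only: the weights `{0: 5, 1: 1}` are an arbitrary assumption, the helper `count_errors` is made up for this example, and whether the weighting actually reduces false positives on this split has to be checked by running it.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
print(cancer.target_names)  # label 0 = malignant, label 1 = benign

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=10)

def count_errors(model):
    """Hypothetical helper: fit the model and return (false positives, false negatives)."""
    y_pred = model.fit(X_train, y_train).predict(X_test)
    # for binary labels [0, 1] the flattened matrix is tn, fp, fn, tp
    tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return int(fp), int(fn)

plain = svm.SVC(kernel="poly", degree=2)
# assumed weights: errors on malignant samples (label 0) count 5x as much
weighted = svm.SVC(kernel="poly", degree=2, class_weight={0: 5, 1: 1})

plain_errors = count_errors(plain)
weighted_errors = count_errors(weighted)
print("plain    (fp, fn):", plain_errors)
print("weighted (fp, fn):", weighted_errors)
```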