<a href="https://colab.research.google.com/github/pkrobinette/workshops/blob/main/workshop_3_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop 3: Machine Learning
**Acknowledgements:** Resources modified from the [scikit-learn tutorial](https://scikit-learn.org/stable/tutorial/basic/tutorial.html#machine-learning-the-problem-setting)

**Directions:** Follow along in order with the code blocks below. To run a code block, do one of the following:

1.   Hover over the block and click the black circle (the run button)
2.   Press Shift+Enter

If more information is needed on Python, check out the following sites:

*   [Python Installation](https://docs.anaconda.com/anaconda/install/) using Anaconda
*   More information about [Python](https://www.python.org/)
*   A [great resource](https://swcarpentry.github.io/python-novice-inflammation/setup.html) for learning Python basics, includes tutorials and setup





## Loading Example Datasets


In [25]:
# Import necessary libraries for downloading datasets
from sklearn import datasets


`scikit-learn` comes with a few standard datasets, for instance the iris and digits datasets for classification and the diabetes dataset for regression.

The iris dataset:

* Consists of petal and sepal measurements for 3 different types of irises (Setosa, Versicolour, and Virginica)
* Rows are the respective samples
* Columns are Sepal Length, Sepal Width, Petal Length and Petal Width

The digits dataset:
- Consists of 1797 8x8 images
- Each image is of a hand-written digit
- Each 8x8 image is often represented as a vector of length 64
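
To make the link between the 8x8 images and the length-64 vectors concrete, here is a minimal sketch. It assumes the `images` and `data` attributes of the object returned by `datasets.load_digits()`; the variable name `first_flat` is only for illustration.

```python
from sklearn import datasets

digits = datasets.load_digits()

# digits.images keeps each sample as an 8x8 grid of pixel intensities
print(digits.images.shape)

# digits.data holds the same pixels flattened to 64 values per sample
print(digits.data.shape)

# flattening the first image by hand reproduces the first row of digits.data
first_flat = digits.images[0].reshape(-1)
print((first_flat == digits.data[0]).all())
```

The flattened form is what an estimator typically receives, since most scikit-learn estimators expect one row of features per sample.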
  

In [2]:
# loading the iris dataset
iris = datasets.load_iris()

# loading the digits dataset
digits = datasets.load_digits()

### Exercise 1:
---
> Following the pattern above, load the `wine` dataset into a variable named `wine`. You can check your implementation in the `Completed Workshop 3: Machine Learning` notebook.

In [16]:
# load the wine dataset
wine = datasets.load_wine()

A dataset is a dictionary-like object that holds the data along with some metadata about it. The data is stored in the `.data` member, which is an `(n_samples, n_features)` array. For supervised problems, one or more response variables are stored in the `.target` member. In the `.data` array:
- each row is a sample
- each column is a feature
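
The `.data` and `.target` members described above can be inspected directly. The following is a small sketch using the iris dataset; it assumes the dictionary-style `keys()` method and the `feature_names` attribute that the loaded object provides.

```python
from sklearn import datasets

iris = datasets.load_iris()

# dictionary-style access lists the available fields
print(list(iris.keys()))

# .data is an (n_samples, n_features) array: one row per flower
print(iris.data.shape)

# .target holds one label per sample
print(iris.target.shape)

# feature_names gives the meaning of each column of .data
print(iris.feature_names)
```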

In [None]:
# the iris data
print(iris.data)

The targets are the actual labels of the samples. Each numerical label corresponds to a flower type (0 = Setosa, 1 = Versicolour, 2 = Virginica).

In [None]:
print(iris.target)
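
To see how the numerical labels map to flower types, one option is the `target_names` attribute of the loaded dataset. This is a short sketch; the `counts` variable and the use of `np.bincount` are illustrative additions, not part of the original workshop.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# target_names[i] is the species name for numeric label i
for label, name in enumerate(iris.target_names):
    print(label, name)

# translate the first and last labels into species names
print(iris.target_names[iris.target[0]], iris.target_names[iris.target[-1]])

# count how many samples carry each label
counts = np.bincount(iris.target)
print(counts)
```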

### Exercise 2:
---

> 1. Print the data of the digits dataset
> 2. Print the length of the digits dataset
> 3. Print only the first sample of the digits dataset
> 4. Print the targets of the digits dataset

Check your implementation in the `Completed Workshop 3: Machine Learning` section.

In [None]:
# print the digits data
print(digits.data)
# print the length of the dataset
print(len(digits.data))
# print only the first sample of the digits dataset
print(digits.data[0])
# print the targets
print(digits.target)

## Learning and Predicting

In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples of each of the 10 possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the classes to which unseen samples belong.

In scikit-learn, an estimator for classification is a Python object that implements the methods `fit(X, y)` and `predict(T)`. `fit` learns a function that maps inputs (`X`) to labels (`y`). `predict` then applies this learned function and returns a predicted label for each sample in `T`.
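
To illustrate the `fit`/`predict` pattern without any library machinery, here is a toy sketch. `NearestCentroidSketch` is a hypothetical class written only for this example (it is not the scikit-learn implementation of any estimator): `fit` stores the mean of each class, and `predict` returns the class whose mean is closest.

```python
import numpy as np

class NearestCentroidSketch:
    """Hypothetical toy classifier, written only to illustrate fit/predict."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # store the mean feature vector (centroid) of each class
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, T):
        T = np.asarray(T, dtype=float)
        # distance from every sample in T to every class centroid
        dists = np.linalg.norm(T[:, None, :] - self.centroids_[None, :, :], axis=2)
        # pick the class whose centroid is closest
        return self.classes_[dists.argmin(axis=1)]

# four training samples in two well-separated groups
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y = [0, 0, 1, 1]

toy = NearestCentroidSketch().fit(X, y)
preds = toy.predict([[0.5, 0.5], [5.5, 4.0]])
print(preds)
```

Real estimators learn far more flexible functions, but they are used through the same two calls.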

An example of an estimator is the class `sklearn.svm.SVC`, which implements support vector classification. The estimator’s constructor takes the model’s parameters as arguments.
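
As a small sketch of how constructor arguments become model parameters, the values passed to `SVC` can be read back with `get_params()`. The `params` variable name is just for illustration.

```python
from sklearn import svm

# gamma and C are set explicitly; everything else keeps its default value
clf = svm.SVC(gamma=0.001, C=100.)

# get_params() returns the parameters as a dictionary
params = clf.get_params()
print(params["gamma"], params["C"], params["kernel"])
```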

For now, we will consider the estimator as a black box:






In [40]:
# import the support vector machine library (svm) from sklearn
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)

In [41]:
# train on every sample except the last one
clf.fit(digits.data[:-1], digits.target[:-1])

SVC(C=100.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [42]:
# predict the held-out last sample and compare it with its true label
print(clf.predict(digits.data[-1:])==digits.target[-1:])

[ True]
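
Checking a single held-out sample says little about how well the classifier works in general. One possible extension, sketched here as an assumption rather than as part of the original workshop, is to hold out a larger test set with `train_test_split` and measure accuracy with the estimator's `score` method. The 25% test size and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# keep 25% of the samples aside for testing (an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# same estimator settings as above, fitted on the training portion only
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)

# fraction of test samples whose predicted label matches the true label
accuracy = clf.score(X_test, y_test)
print(accuracy)
```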


In [36]:
# repeat the same steps on the iris dataset
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(iris.data[:-1], iris.target[:-1])

SVC(C=100.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [37]:
# compare the prediction with the iris label (not the digits label) of the last sample
print(clf.predict(iris.data[-1:])==iris.target[-1:])
