## Example problem 
*Identify a flower based on its attributes, such as petal length, sepal width, ...* (sepal: đài hoa, petal: cánh hoa)

Ref: http://scikit-learn.org/stable/tutorial/basic/tutorial.html

![](img/Petal-sepal.jpg)
(*"which flower is this?" - photo from [bing](https://www.bing.com/images/search?view=detailV2&ccid=E8dlW334&id=690CFCA2C5419961A77E0534DC3B6AD1B61FA711&q=sepal&simid=608048017631022285&selectedIndex=5&ajaxhist=0))*
 

**Experience**, i.e. ***training data***: the `iris` dataset, which contains measurements of different types of flowers.
* For demonstration purposes, we use a "clean" dataset (already processed and stored in a structured format), so you don't have to do any data pre-processing. Keep in mind that this is mostly **not** the case in practice.

***Note***: From now on, we will use naming conventions consistent with the math notation
used in the lecture. Click [here](https://hoamle.github.io/articles/17/machine-learning-appendix/#glossary) 
for a summary of the notation. The link also lists common machine learning terminology along with synonyms or closely related terms.

First, we study our data so that we can "eyeball" it, manipulate it, or extract relevant information from it.

In [1]:
from sklearn import datasets 
iris_data = datasets.load_iris()

# print(iris_data)

* Read the "docs"

In [16]:
print(iris_data['DESCR'][:1000])  # the "doc" is quite long, so
                                  # I extract the first 1000 characters
                                  # for demonstration purpose    

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
  


* There are 150 examples (i.e. *data points*).
* Each flower is described by $P=4$ attributes (i.e. ***features*** or *predictors*): `sepal length` ($x_{1}$), 
`sepal width` ($x_{2}$), `petal length` ($x_{3}$), `petal width` ($x_{4}$). We write the features as a feature vector $x=\left(x_{1},x_{2},x_{3},x_{4}\right)^{T}$
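
As a small illustration of this notation, the sketch below pairs each feature name with one component of a feature vector. It uses plain Python lists rather than the `iris_data` object, and the measurement values are an illustrative assumption, not a specific row quoted from the dataset description.

```python
# Feature names in the order used by the text: x_1 .. x_4
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]

# One flower written as a feature vector x = (x_1, x_2, x_3, x_4)^T, in cm.
# These values are an illustrative assumption.
x = [5.8, 4.0, 1.2, 0.2]

P = len(x)  # number of features
for p, (name, value) in enumerate(zip(feature_names, x), start=1):
    print("x_{} ({}): {} cm".format(p, name, value))
```

Stacking one such vector per flower, row by row, gives a data matrix with one row per example and $P$ columns.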



In [4]:
iris_data['data'].shape

(150L, 4L)

* There are $K=3$ types (i.e. ***classes***) of flower, which are `setosa` (id: 0), `versicolor` (id: 1), `virginica`  (id: 2)

In [5]:
iris_data['target_names']

array(['setosa', 'versicolor', 'virginica'], 
      dtype='|S10')

In [6]:
iris_data['target']

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
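
To make the link between class ids and class names concrete, here is a minimal sketch in plain Python. The `target` list is rebuilt by hand to mirror the array printed above (50 examples per class) instead of being read from `iris_data`, so treat it as a stand-in.

```python
from collections import Counter

# Class id -> class name, in the same order as `target_names` above
target_names = ["setosa", "versicolor", "virginica"]

# Stand-in for the printed `target` array: 50 examples per class
target = [0] * 50 + [1] * 50 + [2] * 50

K = len(target_names)  # number of classes
counts = Counter(target_names[y] for y in target)  # examples per class name
print(K, dict(counts))
```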

To summarize, given *any* feature vector $x$, we would like to predict a value $y$ in {`setosa`, `versicolor`, `virginica`} using an *ML model*. Computing the prediction is one of the ***inference tasks*** that a model needs to perform.
![](img/week1-1.png)

To build a model, we need to *train* it using the experience, i.e. the *training data*, that we have. This is called the ***learning task***, and is also commonly known as *fitting*. 
![](img/week1-2.png)
- (*Note*: some ML algorithms perform virtually no learning at all, e.g. k-nearest neighbours, which simply stores the training data and defers all computation to prediction time. These algorithms also have their own strengths and weaknesses.)
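
To see why k-nearest neighbours involves essentially no learning, consider the rough sketch below. It is a simplified illustration in plain Python, not the scikit-learn implementation; the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(X_train, Y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest training points."""
    # There is no fitting step: the training data is used directly at prediction time.
    distances = []
    for x, y in zip(X_train, Y_train):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_new)))  # Euclidean distance
        distances.append((d, y))
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [y for _, y in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: (petal length, petal width) in cm, with class ids
X_toy = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
Y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, Y_toy, (1.6, 0.2)))  # close to the class-0 points
print(knn_predict(X_toy, Y_toy, (4.6, 1.3)))  # close to the class-1 points
```

The trade-off is visible in the code: nothing is computed up front, but every prediction has to scan the whole training set.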


For demonstration purposes, we use only a subset of the `iris` dataset to train our model, and treat the remaining examples as *new data* whose classes we would like to predict. These new data are called ***test data***, and performance on them is the main indicator of how good our model is.

In [7]:
# We leave 20% of the examples as test data, the other 80% as training data

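# note: sklearn 0.18 moved `train_test_split` to the
# `sklearn.model_selection` module; the old
# `sklearn.cross_validation` module was removed in 0.20
try:
    from sklearn.model_selection import train_test_split
except ImportError:  # sklearn < 0.18
    from sklearn.cross_validation import train_test_split
indices = list(range(len(iris_data['data'])))
train_idx, new_idx = train_test_split(indices, train_size=0.8,
                                      random_state=1)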

# I use capital `X` to indicate a set of data points
D_train = {'X': iris_data['data'][train_idx],
           'Y': iris_data['target'][train_idx]
           }
D_new = {'X': iris_data['data'][new_idx],
         'Y': iris_data['target'][new_idx]
        }
print("Test set size: {} examples".format(len(new_idx)))

Test set size: 30 examples
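
Conceptually, a hold-out split like the one above just shuffles the example indices and cuts them in two. The sketch below shows that idea in plain Python. The helper `holdout_split` is hypothetical, and it will not reproduce the exact indices chosen by `train_test_split`, since the two use different random number generators.

```python
import random

def holdout_split(n_examples, train_fraction=0.8, seed=1):
    """Shuffle the example indices, then cut them into a training part and a test part."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_train = int(round(train_fraction * n_examples))
    return indices[:n_train], indices[n_train:]

train_idx_sketch, new_idx_sketch = holdout_split(150)
print(len(train_idx_sketch), len(new_idx_sketch))
```

Shuffling matters here because the `iris` examples are stored sorted by class; cutting without shuffling would leave the last class almost entirely in the test part.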


Now learn our model

In [8]:
# The first lecture is intended to introduce high-level concepts in
# machine learning. Therefore, you do not need to understand e.g.
# "what LogisticRegression is", "how to choose or design a model",
# or "how `fitting` works" yet. These topics are left for the
# next lectures.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()  # initialize `model` i.e. "baby" model
model.fit(D_train['X'], D_train['Y']);  # after `.fit`, the model is trained

Learning finished! Now try identifying a new flower with feature vector $x^{\left(\text{new}\right)}$

In [9]:
x_new = D_new['X'][0]
print(x_new)

[ 5.8  4.   1.2  0.2]


In [10]:
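# `predict` expects a 2D array (one row per example), so we
# reshape the single feature vector into a 1 x 4 array
y_pred = model.predict(x_new.reshape(1, -1))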
print(y_pred)

[0]




Our model `predict`s that the flower whose petal and sepal sizes are described by $x^{\left(\text{new}\right)}$ is a "setosa" (`id: 0`). ***Is this true?***

In [11]:
print(y_pred == D_new['Y'][0])

[ True]


Good news! But what about prediction for other flowers?

In [15]:
Y_pred = model.predict(D_new['X'])

print("FlowerId\tPredict\tActual\tCorrect prediction?")
for i in range(len(new_idx)):
    print("{}\t\t{}\t{}\t{}\t".format(
            new_idx[i], 
            Y_pred[i],
            D_new['Y'][i],
            Y_pred[i] == D_new['Y'][i]
        ))

FlowerId	Predict	Actual	Correct prediction?
14		0	0	True	
98		1	1	True	
75		1	1	True	
16		0	0	True	
131		2	2	True	
56		2	1	False	
141		2	2	True	
44		0	0	True	
29		0	0	True	
120		2	2	True	
94		1	1	True	
5		0	0	True	
102		2	2	True	
51		1	1	True	
78		2	1	False	
42		0	0	True	
92		1	1	True	
66		2	1	False	
31		0	0	True	
35		0	0	True	
90		1	1	True	
84		2	1	False	
77		2	1	False	
40		0	0	True	
125		2	2	True	
99		1	1	True	
33		0	0	True	
19		0	0	True	
73		1	1	True	
146		2	2	True	


Our model correctly identifies 25 of the 30 new examples and makes 5 errors, achieving an [*accuracy*](http://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score) of $\dfrac{1}{N_{\text{test}}}\sum_{n=1}^{N_{\text{test}}}\mathbb{I}\left(\hat{y}^{\left(n\right)}=y^{\left(n\right)}\right)=\dfrac{25}{30}\approx83.3\%$

In [14]:
sum(Y_pred == D_new['Y']) / float(len(new_idx))

0.83333333333333337
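
The accuracy formula can also be checked by hand. The sketch below copies the predicted and actual class ids row by row from the table above into plain Python lists (so it runs without the trained model) and counts the matches; the variable names are illustrative.

```python
from collections import Counter

# Predicted and actual class ids, copied row by row from the table above
predicted = [0, 1, 1, 0, 2, 2, 2, 0, 0, 2, 1, 0, 2, 1, 2,
             0, 1, 2, 0, 0, 1, 2, 2, 0, 2, 1, 0, 0, 1, 2]
actual = [0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1,
          0, 1, 1, 0, 0, 1, 1, 1, 0, 2, 1, 0, 0, 1, 2]

n_test = len(actual)
# Indicator I(y_hat == y): counts 1 for each correct prediction
n_correct = sum(1 for y_hat, y in zip(predicted, actual) if y_hat == y)
accuracy = n_correct / float(n_test)
print("{} / {} correct, accuracy = {:.1%}".format(n_correct, n_test, accuracy))

# Which (actual, predicted) pairs account for the errors?
mistakes = Counter((y, y_hat) for y_hat, y in zip(predicted, actual) if y_hat != y)
print(dict(mistakes))
```

In this table, all five errors are `versicolor` flowers (id: 1) predicted as `virginica` (id: 2).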

* Note: Accuracy is an appropriate measure of performance for this problem. However, other problems may require other ***evaluation metric(s)***. For example, when identifying whether a patient has cancer, we may not worry too much about incorrectly predicting `Yes` (False Positive cases); instead, we want to miss as few patients who actually have cancer as possible, i.e. keep the number of False Negative cases low. In this case, [*Recall*](http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics), the fraction of actual positive cases that the model detects, is a better measure of performance than Accuracy.
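
A small sketch of how accuracy and recall can disagree, written in plain Python. The patient labels and the two sets of predictions are hypothetical, made up purely for illustration; the helper functions are simplified versions of the metrics and not scikit-learn's own code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / float(len(y_true))

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positive cases that were predicted positive: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / float(tp + fn)

# Hypothetical screening results for 10 patients: 1 = has cancer, 0 = healthy
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_no = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # never predicts cancer
cautious = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # two false alarms, no missed cases

for name, y_pred in [("always_no", always_no), ("cautious", cautious)]:
    print(name, accuracy(y_true, y_pred), recall(y_true, y_pred))
```

Both sets of predictions are right for 8 of the 10 patients, so accuracy cannot tell them apart, yet only the second one finds every patient who actually has cancer.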