# At Last!  EENG 4350/5340 version
At last we start some small machine learning.  The first tool we will use is scikit-learn, which is relatively easy to use.  Machine learning can be summarized by two major steps:  training and inference.  Of course, we will cover both thoroughly.  Scikit-learn has two methods that correspond to these steps: **fit** (training) and **predict** (inference).  This simple interface lets us experiment with a variety of learning machines by remembering just this paradigm.  Up, up and away!
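As a preview of that paradigm, here is a minimal sketch that pushes three different learning machines through the same **fit** and **predict** calls.  It borrows the iris data that the walkthrough below loads.  The choice of the two extra classifiers and their settings (`n_neighbors=5`, `random_state=0`) is an assumption made only for illustration.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Three different learning machines, picked only for illustration
machines = [
    SVC(gamma=0.001, C=100.0),
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=0),
]

scores = {}
for clf in machines:
    clf.fit(X, y)            # training
    y_hat = clf.predict(X)   # inference (here on the training data itself)
    scores[type(clf).__name__] = accuracy_score(y, y_hat)

for name in sorted(scores):
    print("{}: {:.3f}".format(name, scores[name]))
```

Nothing in the loop body changes from one machine to the next; only the object does.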

This is an example from the scikit-learn documentation that we walk through step by step.

In [1]:
from sklearn import datasets
from sklearn.metrics import accuracy_score

First, we load up the key modules and functions that we need.

In [2]:
# import some data to play with
iris = datasets.load_iris()
X = iris.data 
y = iris.target

We grab the data from the iris dataset.  Rather than just mechanically going through the exercise, we should look at what the *features* mean.  Remember when we talked about them?  Let's expand on the documentation by looking at the contents and shape of the dataset.  NOTE:  this data is already clean and collated within scikit-learn.  Real data is **NOT** like this.

In [9]:
print iris.keys()
print '*******'
print iris['feature_names']
print '********************'
print iris
print '**********'


['target_names', 'data', 'target', 'DESCR', 'feature_names']
*******
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
********************
{'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='|S10'), 'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. ,

So we see that this is a dictionary with two keys of interest at first: 'target' and 'data'.  Recall that a dictionary is a data structure that can hold heterogeneous data.  We see that here: the values include strings and a "matrix" (a NumPy array).  We want to look at the shape of the data, so we look at the shape attribute of the array stored under the 'data' key of the iris dictionary.  There is a lot of other information that is extremely relevant and serves as the basis of the "science" behind the learning.  The 'target' key holds the class labels.  Each label is numerical and is an index into the array under the 'target_names' key.  Let's take a look at the size of the target array now too.

In [10]:
print iris['data'].shape,'Length: ',len(iris['target'])

(150, 4) Length:  150


So this all looks good.  We have 150 rows, and each row has a target, or class label.  Nice!  This is drudgery, but it is a useful quality check.  How would you like to spend hours (or days) coding and debugging a learning algorithm only to learn that the data was funky in the first place?  Checks like this are usually time well spent.  
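The quality check above can be turned into a few assertions that fail loudly if the data is off.  This is a minimal sketch, assuming the same iris object.  The particular checks (finite values, label range, rows per class) are our own picks for illustration, not a scikit-learn recipe.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Every row of measurements must have exactly one label
assert X.shape[0] == y.shape[0]

# No missing (NaN) or infinite measurements
assert np.isfinite(X).all()

# Every label must be a valid index into target_names
assert y.min() >= 0 and y.max() < len(iris.target_names)

# How many rows does each class have?
counts = np.bincount(y)
for name, n in zip(iris.target_names, counts):
    print("{}: {} rows".format(name, n))
```

If any assertion fails, we know to fix the data before touching the learning machine.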

By the way, here is the meaning of the 4 features.  Does this sound like a lot of fun to collect?
1. Sepal length in cm
2. Sepal width in cm
3. Petal length in cm
4. Petal width in cm
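To connect these names to the numbers, we can pair each feature name with its column of the data matrix.  The sketch below assumes the same iris object; the `summary` dictionary is a hypothetical helper introduced just for this example.

```python
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# X.T iterates over columns, so each column lines up with one feature name
summary = {}
for name, column in zip(iris.feature_names, X.T):
    summary[name] = (column.min(), column.mean(), column.max())
    print("{:<18} min={:.1f} mean={:.2f} max={:.1f}".format(name, *summary[name]))
```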

In [11]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)

We have just created a support vector machine classifier object.  This object has a number of methods; the two most important are fit and predict.

In [12]:
clf.fit(X,y)

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [13]:
y_hat=clf.predict(X)

We have a prediction!  Let's see how it goes.

In [14]:
accuracy_score(y, y_hat)

0.98

We see that we have an accuracy of 98%.  This is not too bad for only 150 patterns.  We must note, however, that we are predicting on the same data we trained on, so 98% is probably the high point.  Once we employ proper methodologies, I would expect the accuracy to go down quite a bit.  Of course, there are other learning machines to try and other parameters to experiment with that may improve things.  The next few lectures will focus on looking at actual performance, evaluating and improving our classifier, and peering deeper into the meaning of the results.  
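One common way to stop scoring on the training data is to hold some rows out.  The following is only a sketch of that idea, not the methodology of the later lectures.  The 70/30 split, `random_state=0`, and the stratified sampling are assumptions chosen for illustration.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the rows; stratify keeps the class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(X_train, y_train)   # train only on the training rows

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print("train accuracy: {:.3f}".format(train_acc))
print("test accuracy:  {:.3f}".format(test_acc))
```

The test accuracy is the number to watch, because the classifier never saw those rows during fit.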

In [16]:
print (y,y_hat)
print y.shape,y_hat.shape

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

In [21]:
no_matches=[]
for i in range(len(y_hat)):
    flag='no_match'
    if (y_hat[i]==y[i]):
        flag='match'
    else:
        no_matches.append(i)
    print i,y[i],y_hat[i],flag

0 0 0 match
1 0 0 match
2 0 0 match
3 0 0 match
4 0 0 match
5 0 0 match
6 0 0 match
7 0 0 match
8 0 0 match
9 0 0 match
10 0 0 match
11 0 0 match
12 0 0 match
13 0 0 match
14 0 0 match
15 0 0 match
16 0 0 match
17 0 0 match
18 0 0 match
19 0 0 match
20 0 0 match
21 0 0 match
22 0 0 match
23 0 0 match
24 0 0 match
25 0 0 match
26 0 0 match
27 0 0 match
28 0 0 match
29 0 0 match
30 0 0 match
31 0 0 match
32 0 0 match
33 0 0 match
34 0 0 match
35 0 0 match
36 0 0 match
37 0 0 match
38 0 0 match
39 0 0 match
40 0 0 match
41 0 0 match
42 0 0 match
43 0 0 match
44 0 0 match
45 0 0 match
46 0 0 match
47 0 0 match
48 0 0 match
49 0 0 match
50 1 1 match
51 1 1 match
52 1 1 match
53 1 1 match
54 1 1 match
55 1 1 match
56 1 1 match
57 1 1 match
58 1 1 match
59 1 1 match
60 1 1 match
61 1 1 match
62 1 1 match
63 1 1 match
64 1 1 match
65 1 1 match
66 1 1 match
67 1 1 match
68 1 1 match
69 1 1 match
70 1 2 no_match
71 1 1 match
72 1 1 match
73 1 1 match
74 1 1 match
75 1 1 match
76 1 1 match
77 1 

In [24]:
print len(no_matches),no_matches

3 [70, 77, 83]
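These three misses tie back to the accuracy score: 147 matches out of 150 is 0.98.  The sketch below finds the same indices without an explicit loop.  To stay self-contained it rebuilds `y` and `y_hat` from the printed output above instead of refitting the classifier, which is an assumption of this example.

```python
import numpy as np

# Labels rebuilt from the printed output: 50 rows per class,
# with rows 70, 77 and 83 predicted as class 2
y = np.repeat([0, 1, 2], 50)
y_hat = y.copy()
y_hat[[70, 77, 83]] = 2

# Indices where prediction and label disagree
mismatches = np.flatnonzero(y != y_hat)

# Accuracy is simply the fraction of matching labels
accuracy = np.mean(y == y_hat)

print(mismatches.tolist())        # [70, 77, 83]
print("{:.2f}".format(accuracy))  # 0.98
```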


In [34]:
for i in no_matches:
    print iris.target_names[y_hat[i]],iris.target_names[y[i]],X[i]

virginica versicolor [5.9 3.2 4.8 1.8]
virginica versicolor [6.7 3.  5.  1.7]
virginica versicolor [6.  2.7 5.1 1.6]
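All three misses are versicolor flowers predicted as virginica.  A confusion matrix summarizes this kind of pattern in one table.  The sketch below uses `confusion_matrix` from `sklearn.metrics` and, as an assumption, rebuilds `y` and `y_hat` from the printed output rather than refitting.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels rebuilt from the printed output above
y = np.repeat([0, 1, 2], 50)
y_hat = y.copy()
y_hat[[70, 77, 83]] = 2

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y, y_hat)
print(cm)
```

The only nonzero entry off the diagonal is in row 1 (versicolor), column 2 (virginica), and it holds the 3 misses.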
