
# Classification

Three main problems in machine learning: regression, clustering and classification.

We have data.

From data, we have features:  X.  Statisticians call these "independent variables".

We also have some kind of target: y.  Statisticians call this "dependent variable".

Case 1: if y is unknown, what we have is "clustering".  All we can do is group data into clusters. We don't know the names/labels of the clusters.

Case 2: if y is known and is continuous, then we have "regression".

Case 3: if y is known and is discrete, then we have "classification".
    - Example: iris species (y).  sepal & petal widths/lengths (X).


In [1]:
import pandas
iris = pandas.read_csv('/Users/vphan/Dropbox/datasets/iris.csv')

In [2]:
X = iris.drop('Species', axis=1)

In [3]:
X.head()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [4]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(X)
print(km.labels_)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
 2 0]


In [5]:
km.cluster_centers_[1]

array([5.006, 3.428, 1.462, 0.246])

# Nearest Centroid

In [6]:
from sklearn.neighbors import NearestCentroid

In [7]:
X = iris.drop('Species', axis=1)
y = iris.Species
nc = NearestCentroid()
nc.fit(X,y)

NearestCentroid(metric='euclidean', shrink_threshold=None)

In [8]:
iris.sample()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
133,6.3,2.8,5.1,1.5,virginica


In [9]:
nc.predict([  [6.1,3.1,4.2,1.33] ])

array(['versicolor'], dtype=object)

In [10]:
iris.sample()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
124,6.7,3.3,5.7,2.1,virginica


In [11]:
nc.predict([  [6.1,3.1,4.2,1.33],  [5.2,3.3,1.5,0.3] ])

array(['versicolor', 'setosa'], dtype=object)
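As its name suggests, a nearest-centroid classifier represents each class by the mean of its training points and assigns a new point to the class whose centroid is closest. The following is a minimal pure-Python sketch of that idea with Euclidean distance. The helper names (`fit_centroids`, `predict_nearest_centroid`) and the toy data are made up for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import defaultdict

def fit_centroids(X, y):
    """Return {label: centroid}; each centroid is the per-feature mean of that class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }

def predict_nearest_centroid(centroids, point):
    """Return the label whose centroid is closest (Euclidean distance) to point."""
    return min(centroids, key=lambda label: math.dist(centroids[label], point))

# Hypothetical toy data: two features, two classes (not the iris data).
X_toy = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
y_toy = ["small", "small", "large", "large"]

centroids = fit_centroids(X_toy, y_toy)
print(predict_nearest_centroid(centroids, [1.1, 1.0]))  # small
print(predict_nearest_centroid(centroids, [4.0, 4.5]))  # large
```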

In [12]:
df = pandas.read_csv('~/Dropbox/datasets/admission.csv')
df.head()

Unnamed: 0,admit,gre,gpa,rank
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


In [13]:
len(df)

400

In [14]:
df['rank'].value_counts()

2    151
3    121
4     67
1     61
Name: rank, dtype: int64

In [15]:
from sklearn.neighbors import KNeighborsClassifier

In [16]:
nc = NearestCentroid()
nc

NearestCentroid(metric='euclidean', shrink_threshold=None)

In [17]:
X = df[ ['gre', 'gpa', 'rank'] ]
y = df.admit
nc.fit(X,y)

NearestCentroid(metric='euclidean', shrink_threshold=None)

In [18]:
nc.predict([ [680, 2.8, 2], [400, 3.1, 3]  ])

array([1, 0])

# K-Nearest Neighbors Classification

In [19]:
knn = KNeighborsClassifier()
knn

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [20]:
knn.fit(X,y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [21]:
knn.predict([ [680, 2.8, 2], [400, 3.1, 3]  ])

array([0, 0])
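K-nearest neighbors predicts the majority label among the k training points closest to the query (k = 5 by default, per `n_neighbors=5` above). Here is a rough pure-Python sketch of that voting rule with Euclidean distance. The function name `knn_predict` and the toy data are hypothetical, and scikit-learn's actual implementation is more elaborate.

```python
import math
from collections import Counter

def knn_predict(X, y, point, k=5):
    """Predict the majority label among the k training points nearest to point."""
    by_distance = sorted(zip(X, y), key=lambda pair: math.dist(pair[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
X_toy = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6], [7, 7]]
y_toy = [0, 0, 0, 1, 1, 1, 1]

print(knn_predict(X_toy, y_toy, [2, 2], k=3))  # 0
print(knn_predict(X_toy, y_toy, [6, 5], k=3))  # 1
```

Because the vote depends on raw distances, a feature with a large numeric range (such as gre) can dominate features with a small range (such as gpa) unless the features are rescaled first.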

# Model evaluation

For classification, people usually use precision and recall to measure the quality of predictions.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

We are talking about outcomes of predictions/tests.

Examples:
- Testing for a flu
    - a positive means you have the flu.
- Prediction of admission to UCLA
    - a positive means you are accepted.


- What is a positive? it means the test/prediction says it's the positive case of the two possible scenarios.

- What is a true positive?
    - the test/prediction says it's positive.  And the truth is also positive.
    
- What is a false positive?
    - the test/prediction says it's positive. But the truth is negative.
    
- What is a negative?
    - it means the test/prediction says it's the negative case of the two possible scenarios.
    
- What is a true negative?
    - the test/prediction says it's negative.  And the truth is also negative.

- What is a false negative?
    - the test/prediction says it's negative.  But the truth is positive.
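The definitions above can be turned into a small counting routine. This is a sketch only; `confusion_counts` and the label lists are made up for illustration, with 1 standing for the positive case.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (TP, FP, TN, FN) by comparing each prediction with the truth."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1   # predicted positive, truly positive
            else:
                fp += 1   # predicted positive, truly negative
        else:
            if truth == positive:
                fn += 1   # predicted negative, truly positive
            else:
                tn += 1   # predicted negative, truly negative
    return tp, fp, tn, fn

# Hypothetical labels for eight data points.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 1, 3, 1)
```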
    
400 data points:
- 230 negatives
- 170 positives
    
A classifier has the following performance:
- 150 are predicted positive: 120 are TP, 30 are FP.
- 250 are predicted negative: 200 are TN, 50 are FN.

Precision = 120 / (120 + 30)

Recall = 120 / (120 + 50)

If the classifier makes zero FP, precision is 1.
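The arithmetic of the example can be checked with two small helpers. They are illustrative only (the function names are not from any library) and take the counts from the example above.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true positives that the classifier found."""
    return tp / (tp + fn)

print(precision(120, 30))  # 0.8
print(recall(120, 50))     # about 0.706
print(precision(120, 0))   # 1.0, since there are no false positives
```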



In [22]:
from sklearn.model_selection import train_test_split

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

In [24]:
len(X_train), len(X_test)

(200, 200)

In [25]:
nc.fit(X_train, y_train)

NearestCentroid(metric='euclidean', shrink_threshold=None)

In [26]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [27]:
y_predicted_nc = nc.predict(X_test)

In [28]:
y_predicted_knn = knn.predict(X_test)

In [29]:
from sklearn.metrics import precision_score, recall_score, classification_report

In [30]:
# scikit-learn expects the true labels first: precision_score(y_true, y_pred)
precision_score(y_test, y_predicted_nc), recall_score(y_test, y_predicted_nc)

(0.4148936170212766, 0.6190476190476191)

In [31]:
# scikit-learn expects the true labels first: precision_score(y_true, y_pred)
precision_score(y_test, y_predicted_knn), recall_score(y_test, y_predicted_knn)

(0.45714285714285713, 0.25396825396825395)

In [32]:
127/400

0.3175

In [33]:
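# Note: classification_report expects (y_true, y_pred). With the arguments in this
# order, the precision and recall columns are swapped and "support" counts the
# predicted labels rather than the true ones.
print(classification_report(y_predicted_nc, y_test))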

             precision    recall  f1-score   support

          0       0.60      0.77      0.67       106
          1       0.62      0.41      0.50        94

avg / total       0.61      0.60      0.59       200



In [34]:
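# Note: classification_report expects (y_true, y_pred). With the arguments in this
# order, the precision and recall columns are swapped and "support" counts the
# predicted labels rather than the true ones.
print(classification_report(y_predicted_knn, y_test))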

             precision    recall  f1-score   support

          0       0.86      0.72      0.78       165
          1       0.25      0.46      0.33        35

avg / total       0.76      0.67      0.70       200



# Decision Tree Classifier

In [35]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_predicted_dt = dt.predict(X_test)

In [36]:
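# Note: classification_report expects (y_true, y_pred). With the arguments in this
# order, the precision and recall columns are swapped and "support" counts the
# predicted labels rather than the true ones.
print(classification_report(y_predicted_dt, y_test))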

             precision    recall  f1-score   support

          0       0.66      0.69      0.67       131
          1       0.35      0.32      0.33        69

avg / total       0.55      0.56      0.55       200



In order to build a decision tree, we have to choose an attribute "to split" the data.

## Important concepts

### Expected value (of a random variable).

What is a random variable?
    - A random variable is a function: f : Events --> Real numbers
    - Examples: 
        (1) Roll two dice and add the points.
            possible values: 2, 3, ..., 12
        (2) Pick a random person and ask for his/her age.
        (3) The score of a football game.
        (4) Time to travel from Memphis to Little Rock.
        (5) Silhouette score of KMeans.
        (6) Toss a coin: if H, win $100; if T, lose $50.

A random variable has an expected value (average/mean).

If we know the probability distribution of the events, we can compute the expected value of the random variable.
   
Example:
(1) Toss a coin: if H, win $100; if T, lose $50.
    P(H) = P(T) = 1/2

Expected Value is the weighted (by probability) sum of values of all events.

Expected Value = 0.5 * 100 + 0.5 * (-50) = 50 - 25 = 25.


(2) Roll two dice (assume fair dice), add the points.
    Each event consists of the faces of two dice; there are 36 equally likely events.
    P( X = 2 ) = (1/6) * (1/6) = 1/36
    P( X = 12 ) = 1/36
    P( X = 7 ) = 6/36, because six pairs of faces add up to 7.
    
Expected Value is the weighted (by probability) sum of values of all events.

Expected Value = 2*1/36 + 3*2/36 + 4*3/36 + .... + 11*2/36 + 12*1/36 = 7
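Both examples can be checked by enumerating the events. The sketch below uses exact fractions; `expected_value` is a helper written for this illustration, not a library function.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def expected_value(distribution):
    """distribution: iterable of (value, probability) pairs."""
    return sum(value * prob for value, prob in distribution)

# Example (1): win $100 on heads, lose $50 on tails.
coin = [(100, Fraction(1, 2)), (-50, Fraction(1, 2))]
print(expected_value(coin))  # 25

# Example (2): enumerate all 36 equally likely pairs of faces and count each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
dice = [(total, Fraction(n, 36)) for total, n in counts.items()]
print(counts[2], counts[7], counts[12])  # 1 6 1
print(expected_value(dice))  # 7
```

Note that the sums are not equally likely: only one pair of faces gives 2, but six pairs give 7.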

### Entropy  (aka Shannon Entropy)

We have a probability distribution X.  For example: tossing a fair coin. In this case the distribution is (1/2, 1/2).

A biased die might have the following distribution (1/2, 1/4, 1/16, 1/16, 1/16, 1/16).

The sum of probabilities of all events must be 1.

In [37]:
sum((1/2, 1/4, 1/16, 1/16, 1/16, 1/16))

1.0

Entropy(X) is the expected value of log2 of the inverse of the probabilities, i.e. the sum of p * log2(1/p) over all events.

Entropy((1/2, 1/2)) = 0.5*log2(2) + 0.5*log2(2) = log2(2) = 1

Entropy( (1/2, 1/4, 1/4) ) = 0.5*log2(2) + 0.25*log2(4) + 0.25*log2(4) = 0.5 + 0.5 + 0.5 = 1.5
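Written as a formula, this is the standard statement of Shannon entropy in bits. The symbols $p_i$ and $n$ are notation introduced here for the probabilities and the number of events.

$$
H(p_1, \dots, p_n) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log_2 p_i
$$

Terms with $p_i = 0$ contribute 0 by convention. For the three-event example:

$$
H\left(\tfrac12, \tfrac14, \tfrac14\right) = \tfrac12 \cdot 1 + \tfrac14 \cdot 2 + \tfrac14 \cdot 2 = \tfrac32
$$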

### What does Entropy tell us?


In [38]:
import math

def Entropy(*probs):
    s = 0
    for p in probs:
        if p > 0:
            s += p * math.log2(1.0 / p)
    return s


In [39]:
Entropy(1/2, 1/2)

1.0

In [40]:
Entropy(1/4, 1/4, 1/4, 1/4)

2.0

In [41]:
Entropy(1, 0)

0.0

In [42]:
Entropy(1, 0, 0, 0)

0.0

In [43]:
Entropy(0, 1, 0, 0)

0.0

In [44]:
Entropy(0, 0.99, 0.01, 0)

0.08079313589591126

In [45]:
Entropy(0.25, 0.25, 0.25, 0.25)

2.0

#### Entropy is used to measure or quantify the level of uncertainty of a distribution.

If the distribution is (1, 0, 0, 0), there is zero uncertainty.



In [46]:
Entropy(0.9, 0.1)

0.46899559358928133

In [47]:
Entropy(0.9, 0.01, 0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01)

0.8011884030780173

In [48]:
sum((0.9, 0.01, 0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01))

1.0

In [49]:
Entropy(1/11, 1/11, 1/11, 1/11, 1/11, 1/11, 1/11, 1/11, 1/11, 1/11, 1/11)

3.459431618637298

In [50]:
math.log2(11)

3.4594316186372973

##### Information

Information is the opposite of uncertainty.

##### Information gain

X is a distribution,  Entropy(X) is the uncertainty of X.

Entropy(X | a) is the uncertainty of X, when event a happens.

Information Gain(X, a) = Entropy(X) - Entropy(X | a)
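When an attribute can take several values, the usual approach is to weight each conditional entropy by how often that value occurs and subtract the weighted sum from the entropy before the split. A minimal sketch of that computation follows, assuming rows are stored as plain dictionaries. The names `entropy_of` and `information_gain` and the toy rows are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy_of(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of target before the split minus the weighted entropy after it."""
    before = entropy_of([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    after = sum(len(g) / len(rows) * entropy_of(g) for g in groups.values())
    return before - after

# Hypothetical toy rows: Weather determines Play completely, Day tells us nothing.
rows = [
    {"Weather": "Sunny", "Day": "Sat", "Play": "Yes"},
    {"Weather": "Sunny", "Day": "Sun", "Play": "Yes"},
    {"Weather": "Rainy", "Day": "Sat", "Play": "No"},
    {"Weather": "Rainy", "Day": "Sun", "Play": "No"},
]
print(information_gain(rows, "Weather", "Play"))  # 1.0
print(information_gain(rows, "Day", "Play"))      # 0.0
```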

In [51]:
import pandas
df = pandas.read_csv('DecisionTreeExampleData.csv')

In [52]:
df

Unnamed: 0,Where,When,Outcome
0,Home,5pm,W
1,Home,7pm,W
2,Away,7pm,L
3,Away,7pm,W
4,Home,5pm,L
5,Away,7pm,W
6,Home,7pm,W
7,Away,7pm,W
8,Home,9pm,L
9,Home,5pm,L


In [53]:
# Let's compute the entropy of the distribution of outcomes.

df.Outcome.value_counts()

W    10
L    10
Name: Outcome, dtype: int64

In [54]:
# Entropy(Outcome)

Entropy(10/20, 10/20)

1.0

In [55]:
# If we "choose" Where to split the data, we will compute two things:
# 1. Entropy(Outcome | where=Home), and 
# 2. Entropy(Outcome | where=Away)

h = df[ df.Where == 'Home' ]
a = df[ df.Where == 'Away' ]
h.Outcome.value_counts()

W    6
L    6
Name: Outcome, dtype: int64

In [56]:
# Entropy( Outcome | where=Home)

Entropy(6/12, 6/12)

1.0

In [57]:
a.Outcome.value_counts()

W    4
L    4
Name: Outcome, dtype: int64

In [58]:
# Entropy( Outcome | where=Away)

Entropy(4/8, 4/8)

1.0

Out of 20 games, there are 12 Home games and 8 Away games.

Expected Entropy of Outcome given Where = E( Entropy(Outcome | Where) ) =
    12/20 * Entropy(6/12, 6/12) + 8/20 * Entropy(4/8, 4/8)
    = 12/20 * 1 + 8/20 * 1 = 20/20 = 1
    

Information Gain = Entropy(Outcome) - ExpectedEntropy(Outcome | Where) = 1 - 1 = 0

    
    

In [59]:
## Now, split using "When".

five_pm = df[ df.When == '5pm' ]
seven_pm = df[ df.When == '7pm' ]
nine_pm = df[ df.When == '9pm' ]

In [60]:
five_pm

Unnamed: 0,Where,When,Outcome
0,Home,5pm,W
4,Home,5pm,L
9,Home,5pm,L
18,Away,5pm,L


In [61]:
# Entropy( Outcome | When=5pm)
Entropy(1/4, 3/4)

0.8112781244591328

In [62]:
seven_pm.Outcome.value_counts()

W    9
L    3
Name: Outcome, dtype: int64

In [63]:
Entropy(9/12, 3/12)

0.8112781244591328

# Review

We discussed how to construct a decision tree.  Each node of a decision tree represents a decision: a Yes/No question.  The answer to the question guides the decision "to the left" or "to the right" of the tree.

Each node represents a set of data points.

We start at the root node and traverse down the tree.  The decision nodes help us determine where a new data point belongs.
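The traversal can be sketched with a tree stored as nested dictionaries. The tree below is hand-built for illustration (it was not produced by any algorithm in these notes), and the dictionary keys `question`, `yes`, `no` and `leaf` are made-up conventions.

```python
# Hypothetical hand-built tree: each internal node asks "is attribute == value?".
tree = {
    "question": ("When", "9pm"),
    "yes": {"leaf": "L"},
    "no": {
        "question": ("When", "7pm"),
        "yes": {"leaf": "W"},
        "no": {"leaf": "L"},
    },
}

def classify(node, point):
    """Walk from the root to a leaf, answering one Yes/No question per node."""
    while "leaf" not in node:
        attribute, value = node["question"]
        node = node["yes"] if point[attribute] == value else node["no"]
    return node["leaf"]

print(classify(tree, {"Where": "Home", "When": "7pm"}))  # W
print(classify(tree, {"Where": "Away", "When": "9pm"}))  # L
```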

To build a decision tree, starting from the root, we must choose an attribute to split the data into two or more sets.  Essentially, we go through all attributes and compute "the score" resulting from using that attribute to split the data.  In the end, we choose the attribute that gives the best score.

An example of scoring is "information gain".  We will choose the attribute that results in the highest information gain.
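The selection step can be sketched directly from win/loss counts. The counts below are taken from the `value_counts()` outputs in these notes, except the 9pm split, which is inferred from the totals (all four 9pm games share one outcome, and the 10 wins are already accounted for by 5pm and 7pm). The helper names are hypothetical.

```python
import math

def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def gain_from_counts(split):
    """split maps each attribute value to its (wins, losses) counts."""
    total = sum(sum(c) for c in split.values())
    overall = [sum(c[i] for c in split.values()) for i in range(2)]
    after = sum(sum(c) / total * entropy_from_counts(c) for c in split.values())
    return entropy_from_counts(overall) - after

splits = {
    "Where": {"Home": (6, 6), "Away": (4, 4)},
    # (0, 4) for 9pm is inferred from the totals, not printed in the notes.
    "When": {"5pm": (1, 3), "7pm": (9, 3), "9pm": (0, 4)},
}

gains = {attribute: gain_from_counts(split) for attribute, split in splits.items()}
best = max(gains, key=gains.get)
print(gains)  # Where: 0.0, When: about 0.351
print(best)   # When
```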

To understand "information gain", we have to understand entropy.  

Entropy measures uncertainty of distributions.  For example, distributions of "Outcomes" of the fake sport in our data.

We also need to understand "information gain".

In [65]:
df.Outcome.value_counts()

W    10
L    10
Name: Outcome, dtype: int64

In [66]:
Entropy(1,0)

0.0

In [67]:
Entropy(0.99, 0.01)

0.08079313589591126

In [68]:
Entropy(0.5, 0.5)

1.0

In [69]:
Entropy(.25, .25, .25, .25)

2.0

In [70]:
Entropy(.25, 0, .5, .25)

1.5

In [71]:
Entropy(.9, 0.05, 0.05)

0.5689955935892814

In [72]:
Entropy(1/3, 1/3, 1/3)

1.5849625007211559

Information Gain = Entropy(before split) - Entropy(after split)

This is a reduction of entropy due to the choice of an attribute to split the data.

More information gained means higher reduction of uncertainty.

In [74]:
def Entropy(df, outcome, condition=None):
	if condition is not None:
		df = df[ condition ]
	outcomes = df[outcome].value_counts()
	total = sum(outcomes.values)
	probs = [ v/total for v in outcomes.values ]
	E = 0
	for p in probs:
		if p > 0:
			E += p * math.log2(1.0 / p)
	print('Values:', outcomes.values, 'Probs:', probs, 'Entropy:', E)
	return E


In [76]:
df.Outcome.value_counts()

W    10
L    10
Name: Outcome, dtype: int64

In [77]:
Entropy(df, 'Outcome')

Values: [10 10] Probs: [0.5, 0.5] Entropy: 1.0


1.0

In [78]:
Entropy(df, 'Outcome', df.Where=='Home')

Values: [6 6] Probs: [0.5, 0.5] Entropy: 1.0


1.0

In [79]:
Entropy(df, 'Outcome', df.Where=='Away')

Values: [4 4] Probs: [0.5, 0.5] Entropy: 1.0


1.0

### How much information is gained from using the attribute "Where" to split the data?

In [80]:
# Information before splitting
Entropy(df, 'Outcome')

Values: [10 10] Probs: [0.5, 0.5] Entropy: 1.0


1.0

We have to compute the expected Entropy of splitting the data using "Where".

Expected Entropy of splitting using "Where" is:

```
Expected Entropy using "Where" = 12/20 * Entropy(Home) + 8/20 * Entropy(Away)
```

In [83]:
# Expected Entropy using "Where" is

EE_where = 12/20 * Entropy(df, 'Outcome', df.Where=='Home') + \
8/20 * Entropy(df, 'Outcome', df.Where=='Away')

Values: [6 6] Probs: [0.5, 0.5] Entropy: 1.0
Values: [4 4] Probs: [0.5, 0.5] Entropy: 1.0


In [84]:
# Information Gain is

Entropy(df, 'Outcome') - EE_where

Values: [10 10] Probs: [0.5, 0.5] Entropy: 1.0


0.0

In [85]:
# Entropy using "When"

Entropy(df, 'Outcome', df.When=='5pm')

Values: [3 1] Probs: [0.75, 0.25] Entropy: 0.8112781244591328


0.8112781244591328

In [86]:
Entropy(df, 'Outcome', df.When=='7pm')

Values: [9 3] Probs: [0.75, 0.25] Entropy: 0.8112781244591328


0.8112781244591328

In [87]:
Entropy(df, 'Outcome', df.When=='9pm')

Values: [4] Probs: [1.0] Entropy: 0.0


0.0

### What is the information gained from using the "When" attribute to split the data?

```
Expected Entropy of using When =
4/20 * 0.8112781244591328 + 12/20 * 0.8112781244591328 + 4/20 * 0
```

In [88]:
4/20 * 0.8112781244591328 + 12/20 * 0.8112781244591328 + 4/20 * 0

0.6490224995673063

Information Gain =  Entropy Before Split - Entropy After Split = 1.0 - 0.649 = 0.351

Conclusion: 

- Using "Where" to split, the information gain is 0.
- Using "When" to split, the information gain is 0.351.
- "When" gives the higher information gain, so it is the better attribute to split on.




# Let's Visualize

In [89]:
admit = pandas.read_csv('~/Dropbox/datasets/admission.csv')

In [90]:
admit.head()

Unnamed: 0,admit,gre,gpa,rank
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


In [122]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy', max_depth=5)
X = admit[ ['gre', 'gpa', 'rank'] ]
y = admit.admit
model.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [109]:
model.predict([ [800,4.0, 1], [750,3.9,4] ])

array([1, 0])

In [138]:
# Visualize this decision tree
visualize_tree(model, X.columns, 'admission')


ValueError: Length of feature_names, 4 does not match number of features, 3

In [119]:
model.feature_importances_

array([0.36721251, 0.51193329, 0.1208542 ])

In [111]:
iris = pandas.read_csv('~/Dropbox/datasets/iris.csv')
iris.head()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [134]:
X = iris[ ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'] ]
y = iris.Species
model2 = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=5)
model2.fit(X,y)


DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [135]:
model2.predict([ [4.5, 3.0, 1.55, 0.25], [6.8, 3.2, 5.3, 1.9] ])

array(['setosa', 'virginica'], dtype=object)

In [136]:
iris.sample()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
9,4.9,3.1,1.5,0.1,setosa


In [141]:
from draw_tree import visualize_tree

# DecisionTreeClassifier stores its class names in classes_ (labels_ belongs to clusterers such as KMeans)
visualize_tree(model2, X.columns, model2.classes_, 'iris')

In [132]:
model2.feature_importances_

array([0.        , 0.        , 0.06844516, 0.93155484])

In [128]:
iris[ iris.PetalWidth < 0.8 ]

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
