<img src="https://juniorworld.github.io/python-workshop-2018/img/portfolio/week10.jpg" width="350px">

---

# Supervised Machine Learning

- Train a model to give predictions based on **labeled data** (X + Y)
- Information Retrieval: KNN
- Regression: 
    - Linear regression
    - Generalized Linear Model: logistic (binary), Poisson (count)
- Classification:
    - Binary Classification: Naive Bayes classifier
    - Multiclass Classification: Multinomial Naive Bayes classifier, KNN
    - (Advanced) Deep Neural Network

## Procedure:
- STEP 1: Split dataset
    - 2 parts: train vs test, e.g. 60:40 or 70:30
    - 3 parts: train vs test vs validation: e.g. 60:30:10 or 70:20:10 [not usual]
- STEP 2: Train the model
- STEP 3: Test the model
- STEP 4: Parameter tuning. If the result is not satisfactory, retrain the model with new parameters and retest the newly trained model.
- STEP 5: Report the model performance with Test/Validation data.
- STEP 6: Apply the model to new data set.
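
The split in STEP 1 can be sketched with scikit-learn's `train_test_split`. This is only an illustration under assumptions: the 70:30 ratio, the `random_state` value, and the variable names are chosen for the example and are not prescribed by the procedure.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# hold out 30% of the records for testing;
# stratify keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

print(X_train.shape, X_test.shape)
```

A three-part split can be obtained the same way by calling `train_test_split` a second time on the held-out part.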

## K-Nearest Neighbors (KNN)
- Distance-based Spatial Voting Model
- Purpose: Retrieve the most similar records from a database and classify a new point according to its nearest neighbors
- Input: a set of data with labels
- Output: K nearest neighbors
    - Assign the category by majority vote among the labels of the K neighbors
- Parameter: K

|Movie|#Fight scenes|#Kiss scenes|Genre|
|-----|:-----------:|:----------:|:---:|
|California Man|3|104|Love|
|He's not that into you|2|100|Love|
|Beautiful Woman|1|81|Love|
|Kevin Longblade|101|10|Action|
|Robo Slayer 3000|99|5|Action|
|Amped II|98|2|Action|
|<font style='color:blue'>XXXXX</font>|<font style='color:blue'>18</font>|<font style='color:blue'>90</font>|<font style='color:red'>?</font>|
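
As a worked example, the unlabeled movie in the last row can be classified directly from the table. The sketch below assumes K=3 and Euclidean distance; both are choices made for this illustration only.

```python
import numpy as np
import pandas as pd

# labeled movies from the table: (#fight scenes, #kiss scenes) and genre
features = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
genres = np.array(['Love', 'Love', 'Love', 'Action', 'Action', 'Action'])

unknown = np.array([18, 90])  # the unlabeled movie in the last row
K = 3                         # assumed value, chosen only for this example

# Euclidean distance from the unknown movie to every labeled movie
distances = np.linalg.norm(features - unknown, axis=1)

# labels of the K closest movies, then a majority vote
nearest = np.argsort(distances)[:K]
prediction = pd.Series(genres[nearest]).value_counts().index[0]
print(prediction)
```

The three closest movies are all labeled Love, so the vote is unanimous.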

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.datasets import load_iris

In [2]:
iris=load_iris()

In [3]:
train_index=np.random.choice(range(150),100,replace=False)
train_X=iris.data[train_index]      #extract 100 data records as our training data
train_Y=iris.target[train_index]    #training labels

In [4]:
test_index=[i for i in range(150) if i not in train_index]
test_X=iris.data[test_index]
test_Y=iris.target[test_index]

In [5]:
print(train_X.shape)
print(test_X.shape)

(100, 4)
(50, 4)


In [6]:
train_X[0] #first data point in training set

array([6. , 2.9, 4.5, 1.5])

In [None]:
test_X[0] #first test point

In [7]:
train_Y[0]

1

In [8]:
#distance between first test data and first train data
distance=np.linalg.norm(test_X[0]-train_X[0])

In [9]:
distance

3.531288716601915

In [10]:
#distances between first test data and ALL train data
distances=[]
for i in train_X:
    distance=np.linalg.norm(test_X[0]-i)
    distances.append(distance)

In [11]:
len(distances)

100

**Suppose K=4**

In [13]:
np.argsort([2,1,3,4])

array([1, 0, 2, 3], dtype=int64)

In [14]:
np.argsort(distances)[:4] #extract the indexes of K smallest distances

array([65, 26, 14, 98], dtype=int64)

In [15]:
#get the labels of those points
KNNs=train_Y[np.argsort(distances)[:4]]

In [16]:
KNNs

array([0, 0, 0, 0])

In [17]:
a=[1,2,3,4,5,5,4,4,4,4,2,1]
pd.Series(a).value_counts()

4    5
5    2
2    2
1    2
3    1
dtype: int64

In [18]:
#find the most frequent label and use it as your predicted label for first testing point
pd.Series(KNNs).value_counts().index[0]

0

In [23]:
#create a for loop to go over every testing data and predict their labels
predict_Y=[]
for j in test_X:
    distances=[]
    for i in train_X:
        distance=np.linalg.norm(j-i)
        distances.append(distance)
    KNN_index=np.argsort(distances)[:4]
    KNNs=train_Y[KNN_index]
    y=pd.Series(KNNs).value_counts().index[0]
    predict_Y.append(y)
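
For comparison, the same idea is available in scikit-learn as `KNeighborsClassifier`. The sketch below is a cross-check under assumptions rather than part of the walkthrough: it draws its own 100/50 split with a fixed seed, so its numbers will not match the outputs shown in this notebook, and the `sk_` variable names are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
rng = np.random.default_rng(0)          # fixed seed (assumed) so the split is repeatable
shuffled = rng.permutation(150)
train_index, test_index = shuffled[:100], shuffled[100:]

# K=4 with Euclidean distance (the default metric), as in the loop above
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(iris.data[train_index], iris.target[train_index])

sk_predict_Y = knn.predict(iris.data[test_index])
sk_accuracy = np.mean(sk_predict_Y == iris.target[test_index])
print(sk_accuracy)
```

Ties among the K neighbors may be broken differently by the library than by `value_counts().index[0]`, so individual predictions can differ from the hand-written loop.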

## Performance diagnosis
- **Accuracy** rate: 
    - formula: true predictions/total sample size
- **Precision** rate:
    - formula: true positive/predicted positive
    - **macro**: calculate the precision of each label and get their means
    - micro: sum the true positives over all labels and divide by the total number of predictions [= accuracy for single-label classification]
- **Recall** rate:
    - formula: true positive/real positive
    - **macro**
    - micro
- **F1** score
<img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/057ffc6b4fa80dc1c0e1f2f1f6b598c38cdd7c23'>
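
The difference between macro and micro averaging is easiest to see on a small example. The labels below are invented for illustration; the sketch computes precision both ways by hand so the results can be compared with `precision_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# toy labels, invented for illustration only
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
classes = np.unique(y_true)

# per-class counts: true positives and number of times each class was predicted
tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in classes])
predicted = np.array([np.sum(y_pred == c) for c in classes])

macro_precision = np.mean(tp / predicted)      # average of per-class precisions
micro_precision = tp.sum() / predicted.sum()   # pool all classes, then divide

print(macro_precision, micro_precision)
```

Macro averaging gives every label the same weight regardless of how many samples it has, while micro averaging gives every sample the same weight.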

In [27]:
diagnosis=pd.crosstab(np.array(predict_Y),test_Y, rownames=['predict'], colnames=['real'])

In [28]:
diagnosis

|predict / real|0|1|2|
|:------------:|:-:|:-:|:-:|
|0|18|0|0|
|1|0|13|1|
|2|0|0|18|


In [29]:
np.diagonal(diagnosis) #extract the numbers on the diagonal (correct predictions per label)

array([18, 13, 18], dtype=int64)

In [30]:
#Metric1: ACC
accuracy_rate=sum(np.diagonal(diagnosis))/len(predict_Y)
accuracy_rate

0.98

In [31]:
#Metric2: precision
precisions=np.diagonal(diagnosis)/np.sum(diagnosis,axis=1) #diagonal divided by row sums (rows = predicted labels)
precision_rate=np.mean(precisions)
precision_rate

0.9761904761904763

In [32]:
#Metric3: recall
recalls=np.diagonal(diagnosis)/np.sum(diagnosis,axis=0) #diagonal divided by column sums (columns = real labels)
recall_rate=np.mean(recalls)
recall_rate

0.9824561403508771

In [33]:
#Metric4: F1
f1s=2*precisions*recalls/(precisions+recalls)
F1=np.mean(f1s)
F1

0.9786453119786453

In [34]:
print(accuracy_score(test_Y, predict_Y))
print(precision_score(test_Y, predict_Y, average='macro'))
print(recall_score(test_Y, predict_Y, average='macro'))
print(f1_score(test_Y, predict_Y, average='macro'))

0.98
0.9761904761904763
0.9824561403508771
0.9786453119786453


In [35]:
print(accuracy_score(test_Y, predict_Y))
print(precision_score(test_Y, predict_Y, average='micro'))
print(recall_score(test_Y, predict_Y, average='micro'))
print(f1_score(test_Y, predict_Y, average='micro'))

0.98
0.98
0.98
0.98


## Practice
Set K=6. Please apply the KNN technique to the digits dataset:
1. Split the data into training set (1000 samples) and testing set (797 samples).
2. Apply KNN
3. Report model performance metrics (accuracy, precision, recall, f1)

In [36]:
from sklearn.datasets import load_digits
digits = load_digits()

In [37]:
digits.data.shape

(1797, 64)

In [39]:
#WRITE YOUR CODE HERE
train_index=np.random.choice(range(1797),1000,replace=False)
train_X=digits.data[train_index]      #extract 1000 data records as our training data
train_Y=digits.target[train_index]   

test_index=[i for i in range(1797) if i not in train_index]
test_X=digits.data[test_index]
test_Y=digits.target[test_index]




In [40]:
predict_Y=[]
for j in test_X:
    distances=[]
    for i in train_X:
        distance=np.linalg.norm(j-i)
        distances.append(distance)
    KNN_index=np.argsort(distances)[:6]
    KNNs=train_Y[KNN_index]
    y=pd.Series(KNNs).value_counts().index[0]
    predict_Y.append(y)

In [41]:
len(predict_Y)

797

In [42]:
print(accuracy_score(test_Y, predict_Y))
print(precision_score(test_Y, predict_Y, average='macro'))
print(recall_score(test_Y, predict_Y, average='macro'))
print(f1_score(test_Y, predict_Y, average='macro'))

0.972396486825596
0.9733726787749613
0.9725595238095238
0.9720897126628149


In [43]:
labels=['dem','rep','dem','rep','dem','dem']

In [46]:
unique_labels=list(np.unique(labels)) #sorted list of distinct labels

In [48]:
labels_num=[]
for label in labels:
    labels_num.append(unique_labels.index(label)) #replace each label with its position in unique_labels

In [49]:
labels_num

[0, 1, 0, 1, 0, 0]
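
The same conversion from text labels to numbers can be sketched in one call with `np.unique` and its `return_inverse` argument. This is an alternative to the loop, shown here on the same toy labels.

```python
import numpy as np

labels = ['dem', 'rep', 'dem', 'rep', 'dem', 'dem']

# unique_labels holds the sorted distinct labels;
# labels_num holds, for each original label, its position in unique_labels
unique_labels, labels_num = np.unique(labels, return_inverse=True)
print(unique_labels)
print(labels_num)
```

Indexing `unique_labels[labels_num]` maps the numbers back to the original text labels.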

## Parameter Tuning
- Purpose: find the best parameter
- For KNN, the only parameter is K

In [51]:
#Wrap previous lines into a function
def KNN(K,train_X,train_Y,test_X,test_Y):
    predict_Y=[]
    for j in test_X:
        distances=[]
        for i in train_X:
            distance=np.linalg.norm(j-i)
            distances.append(distance)
        KNN_index=np.argsort(distances)[:K]
        KNNs=train_Y[KNN_index]
        y=pd.Series(KNNs).value_counts().index[0]
        predict_Y.append(y)
    predict_Y=np.array(predict_Y)
    return(predict_Y)

In [52]:
f1s=[]
for K in range(1,10):
    predict_Y=KNN(K,train_X,train_Y,test_X,test_Y)
    f1s.append(f1_score(test_Y, predict_Y, average='macro'))

In [53]:
f1s #F1 is highest at K=1 here: on this split, the single nearest neighbor predicts the label best

[0.982381858108638,
 0.9773558686173729,
 0.9811926273632473,
 0.9723463773774018,
 0.9723150334233359,
 0.9720897126628149,
 0.9710123114098834,
 0.969552431351873,
 0.9710697234172816]
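
The loop above chooses K by looking at test-set scores, which tends to make the reported score optimistic. A three-part split avoids this: tune on a validation set and report once on the test set. The sketch below is one way to do that under assumptions: the 60:20:20 ratio and the seeds are arbitrary, and scikit-learn's `KNeighborsClassifier` stands in for the hand-written `KNN` function to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()

# 60% train, 20% validation, 20% test (ratios and seeds are assumptions)
X_train, X_rest, y_train, y_rest = train_test_split(
    digits.data, digits.target, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# choose K using the validation set only
val_f1s = []
for K in range(1, 10):
    model = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)
    val_f1s.append(f1_score(y_val, model.predict(X_val), average='macro'))
best_K = int(np.argmax(val_f1s)) + 1   # +1 because K starts at 1

# report the final score once, on the untouched test set
final_model = KNeighborsClassifier(n_neighbors=best_K).fit(X_train, y_train)
test_f1 = f1_score(y_test, final_model.predict(X_test), average='macro')
print(best_K, test_f1)
```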

In [54]:
movie_df=pd.read_csv('doc/movies.csv')

In [56]:
movie_df.head()

| |Movie|Genre|Production Budget (millions)|Box Office (millions)|ROI|Rating IMDB|
|---|-----|:---:|:---:|:---:|:---:|:---:|
|0|Avatar|Action|237|2784|11.7|8.0|
|1|The Blind Side|Drama|29|309|10.7|7.6|
|2|The Chronicles of Narnia: The Lion, the Witch ...|Adventure|180|745|4.1|6.9|
|3|The Dark Knight|Action|185|1005|5.4|9.0|
|4|ET: The Extra-Terrestrial|Drama|11|793|75.5|7.9|


In [62]:
len(np.unique(movie_df['Genre']))

6

In [58]:
train_index=np.random.choice(range(movie_df.shape[0]),20,replace=False)

In [64]:
train_X=movie_df.iloc[train_index,2:].values #numeric columns as features (get_values() was removed in pandas 1.0)
train_Y=movie_df.iloc[train_index,1].values  #genre as label

In [65]:
test_index=[i for i in range(movie_df.shape[0]) if i not in train_index]
test_X=movie_df.iloc[test_index,2:].values #.values replaces get_values(), which was removed in pandas 1.0
test_Y=movie_df.iloc[test_index,1].values

In [66]:
KNN(1,train_X,train_Y,test_X,test_Y)

array(['Drama', 'Adventure', 'Action', 'Drama', 'Action', 'Adventure'],
      dtype='<U9')

In [67]:
test_Y

array(['Drama', 'Adventure', 'Adventure', 'Adventure',
       'Thriller/Suspense', 'Action'], dtype=object)

---
## Break
---

## Decision Tree

<img src='https://cdn-images-1.medium.com/max/900/1*XMId5sJqPtm8-RIwVVz2tg.png' width='300px' align='left'>

- Decision Tree is a non-parametric supervised learning method.
- [Advanced] Random Forest
- STEP 1: Calculate the entropy of labels
- STEP 2: Choose a variable and split the dataset along that variable
- STEP 3: Calculate the entropy after the split
- STEP 4: Repeat STEP 2-3 for all variables
- STEP 5: Find the variable with the greatest information gain (the lowest entropy after splitting)
- STEP 6: Use it as the root node
- STEP 7: Repeat STEP 2-6 within each resulting subset to grow the child nodes
- STEP 8: Build up the tree

### Entropy
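- Entropy: the information load (uncertainty) of a set of labels; it is 0 when all labels are identical and grows as the labels become more evenly mixed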
    - formula: <img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/7de5d59a442f5305853d4392826b1f51dc43f6d0' width='200px'>
    

In [None]:
def entropy(data):
    freq=pd.Series(data).value_counts()
    freq=freq/sum(freq)
    H=sum(-freq*np.log2(freq))
    return(H)

In [None]:
a=[2,2,2,2,2,2,2,2] #one group
b=[1,1,1,1,2,2,2,2] #two balanced groups
c=[1,2,2,2,2,2,2,2] #two unbalanced groups
Ha=entropy(a)
Hb=entropy(b)
Hc=entropy(c)
print(Ha)
print(Hb)
print(Hc)
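
As a hand check of the three cases, assuming the base-2 entropy $H=-\sum_i p_i\log_2 p_i$ that `entropy` implements, where $p_i$ is the share of each distinct value:

$$
\begin{aligned}
H_a &= -1\cdot\log_2 1 = 0 \\
H_b &= -\tfrac{1}{2}\log_2\tfrac{1}{2}-\tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \\
H_c &= -\tfrac{1}{8}\log_2\tfrac{1}{8}-\tfrac{7}{8}\log_2\tfrac{7}{8} \approx 0.375+0.169 = 0.544
\end{aligned}
$$

A single group carries no uncertainty, two balanced groups carry the most (one full bit), and two unbalanced groups fall in between.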

In [None]:
#to what degree one variable can inform us about another variable
var1=['red','blue','red','blue']
var2=[0,1,1,1]
df=pd.DataFrame({'color':var1,'number':var2})

In [None]:
df

In [None]:
#split the data into two parts and check the entropy of 'number'
H1=entropy(df[df['color']=='red']['number'])
H1

In [None]:
H2=entropy(df[df['color']=='blue']['number']) #extremely certain: every blue row has the same number
H2

In [None]:
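H=H1+H2 #combined entropy for two groups (a plain sum; the usual decision-tree criterion weights each group's entropy by its share of the rows)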
H
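
For reference, the conventional information gain weights each group's entropy by its share of the rows before subtracting from the entropy of the unsplit data. The sketch below applies that weighting to the same color/number example; it is a variant shown for comparison, not the plain sum used in this notebook, and the names `H_before`, `H_after`, and `information_gain` are made up for the example.

```python
import numpy as np
import pandas as pd

def entropy(data):
    # base-2 entropy of the value frequencies, same definition as above
    freq = pd.Series(data).value_counts(normalize=True)
    return float(-(freq * np.log2(freq)).sum())

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'blue'],
                   'number': [0, 1, 1, 1]})

H_before = entropy(df['number'])  # entropy before splitting

# weight each group's entropy by the fraction of rows that fall into it
H_after = 0.0
for color, group in df.groupby('color'):
    H_after += len(group) / len(df) * entropy(group['number'])

information_gain = H_before - H_after
print(H_before, H_after, information_gain)
```

With equally sized groups, as here, the weighted value is simply half of the plain sum, so both versions rank candidate splits the same way; they can disagree when the groups differ in size.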

In [None]:
iris_df=pd.DataFrame(iris.data,columns=iris.feature_names)
iris_df['label']=iris.target

In [None]:
iris_df.head()

In [None]:
#Use threshold of sepal length at 5.0 first and find out the entropy of this splitting
threshold=5
H1=entropy(iris_df[iris_df['sepal length (cm)']>=threshold]['label'])
H2=entropy(iris_df[iris_df['sepal length (cm)']<threshold]['label'])
H1+H2

In [None]:
iris_df.iloc[:,0] #reference by column index, rather than column name

In [None]:
#Find out the threshold that can produce minimal entropy
def best_threshold(data,col): #col is the column index that you want to apply threshold searching
    thresholds=np.unique(data.iloc[:,col]) #threshold candidates
    Hs=[]
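    for threshold in thresholds:
        #WRITE YOUR CODE HERE (one possible solution)
        #entropy of labels on each side of the threshold, summed as in the sepal length example
        H1=entropy(data[data.iloc[:,col]>=threshold]['label'])
        H2=entropy(data[data.iloc[:,col]<threshold]['label'])
        Hs.append(H1+H2)
    best_threshold=thresholds[np.argmin(Hs)] #best threshold is the one with minimal entropy
    min_Hs=min(Hs)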
    return(best_threshold,min_Hs)

In [None]:
#run over all columns and find out the best feature with smallest entropy
Hs=[]
thresholds=[]
for i in range(4):
    threshold,H=best_threshold(iris_df,i)
    thresholds.append(threshold)
    Hs.append(H)

In [None]:
Hs

In [None]:
np.argmin(Hs) #the root feature is petal length: it has the smallest entropy, so it tells us the most about the labels

In [None]:
def best_feature(data):
    Hs=[]
    thresholds=[]
    for i in range(4):
        threshold,H=best_threshold(data,i)
        thresholds.append(threshold)
        Hs.append(H)
    return(np.argmin(Hs),thresholds[np.argmin(Hs)])

## Build up Decision Tree
- RULE: At each node, split on the feature whose best split has the lowest entropy
- Create a dictionary about Tree:
    - Four elements
    - 'col' is the column index used at current level
    - 'threshold' is its threshold
    - 'larger' contains a dictionary for the child node that receives data points greater than or equal to the threshold (or a label, if the child is a leaf)
    - 'smaller' contains a dictionary for the child node that receives data points smaller than the threshold (or a label, if the child is a leaf)

```python
{'col':2,
 'threshold':thresholds[2],
 'larger':{ #dictionary for the child node that receives data points greater than or equal to the threshold
     'col': ..., 
     'threshold': ...,
     'larger': ...,
     'smaller': ...
 }, 
 'smaller':{ #dictionary for the child node that receives data points smaller than the threshold
     'col': ..., 
     'threshold': ...,
     'larger': ...,
     'smaller': ...
 }  
}
```

In [None]:
def create_tree(data,max_group_size):
    if len(np.unique(data['label']))>1 and data.shape[0]>max_group_size:
        feature,threshold=best_feature(data)
        feature_name=data.columns[feature]
        tree_dict={'col':feature,'threshold':threshold}
        #split dataset into two parts: larger and smaller than threshold
        #this is the larger part
        subset1=data[data[feature_name]>=threshold]
        if subset1.shape[0] == data.shape[0]: #check whether dataset is really split into two part
            return(data['label'].value_counts().index[0])
        tree_dict['larger']=create_tree(subset1,max_group_size)
        #this is the smaller part
        subset2=data[data[feature_name]<threshold]
        if subset2.shape[0] == data.shape[0]:
            return(data['label'].value_counts().index[0])
        tree_dict['smaller']=create_tree(subset2,max_group_size)
        return(tree_dict)
    else:
        return(data['label'].value_counts().index[0])

In [None]:
train_iris=iris_df.sample(n=100)
test_iris=iris_df.drop(train_iris.index) #the remaining 50 rows, so test data never overlaps training data

In [None]:
tree_dict=create_tree(train_iris,50)
tree_dict

In [None]:
tree_dict=create_tree(train_iris,30)
tree_dict

In [None]:
def apply_tree(point,tree,predicts):
    if point.iloc[tree['col']]>=tree['threshold']: #.iloc because 'col' is a column position, not a column name
        if np.isscalar(tree['larger']):
            predicts.append(tree['larger'])
        else:
            apply_tree(point,tree['larger'],predicts)
    else:
        if np.isscalar(tree['smaller']):
            predicts.append(tree['smaller'])
        else:
            apply_tree(point,tree['smaller'],predicts)
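
To see how a point travels down the tree, the traversal can be tried on a small hand-written tree. Everything in the sketch below is hypothetical: the thresholds, the leaf labels, and the helper `predict_one`, which returns the label instead of appending it to a list.

```python
# hand-written tree in the same dictionary format; thresholds and labels are made up
toy_tree = {'col': 2, 'threshold': 3.0,
            'smaller': 0,                      # leaf: predict label 0
            'larger': {'col': 3, 'threshold': 1.8,
                       'smaller': 1,           # leaf: predict label 1
                       'larger': 2}}           # leaf: predict label 2

def predict_one(point, tree):
    # walk down until we reach something that is not a dictionary (a leaf label)
    while isinstance(tree, dict):
        branch = 'larger' if point[tree['col']] >= tree['threshold'] else 'smaller'
        tree = tree[branch]
    return tree

# three points given as plain lists: [sepal length, sepal width, petal length, petal width]
points = [[5.1, 3.5, 1.4, 0.2], [6.0, 2.9, 4.5, 1.5], [6.3, 3.3, 6.0, 2.5]]
toy_predictions = [predict_one(p, toy_tree) for p in points]
print(toy_predictions)
```

Because the loop stops as soon as it reaches a non-dictionary value, this version also handles a tree that consists of a single leaf label.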

In [None]:
predicts=[]
for i in range(test_iris.shape[0]):
    apply_tree(test_iris.iloc[i],tree_dict,predicts)

In [None]:
accuracy_score(test_iris['label'],predicts) #real labels first, predictions second, as in the earlier cells
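
For a cross-check, scikit-learn provides `DecisionTreeClassifier`. The sketch below is not equivalent to the hand-built tree: it uses the library's own split search, and `min_samples_split=30` is an assumed setting that only loosely mirrors `max_group_size`. The split and seeds are also assumptions, so the score will differ from the one above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, random_state=0)

# criterion='entropy' selects splits by information gain;
# nodes with fewer than min_samples_split samples are not split further
clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=30, random_state=0)
clf.fit(X_train, y_train)

sk_tree_accuracy = accuracy_score(y_test, clf.predict(X_test))
print(sk_tree_accuracy)
```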