# HW2

#### Machine Learning in Korea University
#### COSE362, Fall 2018
#### Due : 11/26 (TUE) 11:59 PM

#### In this assignment, you will learn various classification methods with the given dataset.
* Implementation detail: Anaconda 5.3 with Python 3.7
* Use the given dataset. Please do not change the train / valid / test split.
* Use the numpy, scikit-learn, and matplotlib libraries.
* You don't have to use all of the packages imported below (some are optional). <br>
Also, you can import additional packages in the "(Option) Other Classifiers" part. 
* <b>*DO NOT MODIFY ANY CODE OUTSIDE THE "Your Code Here" SECTIONS*</b>

In [1]:
# Basic packages
%matplotlib inline
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt

# Machine Learning Models
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Additional packages
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

In [2]:
# Import additional packages if you need them (scikit-learn, numpy, pandas only).
# Your Code Here
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import make_scorer
#End Your Code

## Process

> 1. Load "train.csv". It includes all samples' features and labels.
> 2. Train four types of classifiers (logistic regression, decision tree, random forest, support vector machine) and <b>validate</b> them in your own way. <b>(You can't get full credit if you don't conduct validation)</b>
> 3. Optionally, if you train your own classifier (e.g. ensembling or gradient boosting), you can evaluate it on the development data. <br>
> 4. <b>You should submit your predicted results on the test data, using the classifier you selected.</b>

## Task & dataset description
1. 6 Features (1~6)<br>
Feature 2, 4, 6 : Real-valued<br>
Feature 1, 3, 5 : Categorical <br>

2. Samples <br>
>In development set : 2,000 samples <br>
>In test set : 1,500 samples

## Load development dataset
Load your development dataset. You should read <b>"train.csv"</b>. This is a classification task, and you need to preprocess your data for training your model. <br>
> You need to use <b>1-of-K coding scheme</b>, to convert categorical features to one-hot vector. <br>
> For example, if there are 3 categorical values, you can convert these features as [1,0,0], [0,1,0], [0,0,1] by 1-of-K coding scheme. <br>
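The 1-of-K scheme described above can be sketched as follows. This is a minimal illustration, not the required solution: the `CATEGORIES` list and the `one_hot` helper are hypothetical names, and the actual set of values per feature in "train.csv" is an assumption.

```python
import numpy as np

# Hypothetical category list; the real values per feature are an assumption.
CATEGORIES = ['a', 'b', 'c']

def one_hot(values, categories):
    """Return a (len(values), len(categories)) 0/1 matrix using 1-of-K coding."""
    index = {cat: i for i, cat in enumerate(categories)}
    codes = np.array([index[v] for v in values])
    # Row i of the identity matrix is the one-hot vector for category i.
    return np.eye(len(categories), dtype=int)[codes]

encoded = one_hot(['a', 'c', 'b', 'a'], CATEGORIES)
print(encoded)
```

Passing the same category list for both the development and the test data keeps the number of columns identical, even if some value happens to be absent from one of the files.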

In [3]:
# For training your model, you need to convert categorical features to one-hot encoding vectors.
# Your Code Here
df = pd.read_csv('./data/train.csv')
df_X = df.iloc[:, 0:6].values  # .values replaces the deprecated .as_matrix()
df_Y = df.iloc[:, 6:7]

y = df_Y.values
y = y.reshape(-1)

print("class frequency")
print('='*50)
unique, counts = np.unique(y, return_counts=True)
print (np.asarray((unique, counts)).T)
print('='*50)
            
# One-hot encoding
categorical_features = [0, 2, 4]
feature_dict = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7}

for feature_num in categorical_features:
    feature_values_tr = df_X[:, feature_num]
    feature_set = set(feature_values_tr)
    
    for i, value in enumerate(feature_values_tr):
        feature_values_tr[i] = feature_dict[value]
        
    one_hot_matrix_tr = np.eye(len(feature_set))[feature_values_tr.astype(int)]
    df_X = np.concatenate((df_X, one_hot_matrix_tr), axis=1)

df_X = np.delete(df_X, categorical_features, 1)
X = df_X.astype(int)

# Oversampling
up_sample_idx = [4, 5, 8, 9, 10, 11, 14, 16, 17]
for idx in range(len(X)) :
    if y[idx] in up_sample_idx :
        for i in range(int(np.round(100/counts[y[idx]]))) :
            X = np.vstack([X, X[idx]])
            y = np.append(y, y[idx])

X.shape
# End Your Code

class frequency
[[  0 192]
 [  1 105]
 [  2 219]
 [  3 162]
 [  4  30]
 [  5  17]
 [  6 337]
 [  7 124]
 [  8   6]
 [  9  36]
 [ 10  48]
 [ 11  31]
 [ 12 119]
 [ 13 300]
 [ 14   4]
 [ 15 248]
 [ 16  19]
 [ 17   3]]


(2885, 23)
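
One caveat about the oversampling step above: it duplicates rows before any train/validation split is made, so copies of the same row can end up on both sides of a split and make validation scores look better than they are. A hedged sketch of one alternative is to duplicate rare-class rows only inside the training part of each split. The `oversample` helper, the `min_count` threshold, and the toy data below are all hypothetical.

```python
import numpy as np

def oversample(X, y, min_count, rng):
    """Duplicate rows of rare classes until each class has at least min_count rows."""
    X_parts, y_parts = [X], [y]
    labels, counts = np.unique(y, return_counts=True)
    for label, count in zip(labels, counts):
        if count < min_count:
            idx = np.flatnonzero(y == label)
            # Sample existing rows of this class with replacement.
            extra = rng.choice(idx, size=min_count - count, replace=True)
            X_parts.append(X[extra])
            y_parts.append(y[extra])
    return np.vstack(X_parts), np.concatenate(y_parts)

rng = np.random.RandomState(0)
X_demo = np.arange(40).reshape(20, 2)      # toy features, every row is unique
y_demo = np.array([0] * 15 + [1] * 5)      # toy labels, class 1 is rare

# Hypothetical split: even rows for training, odd rows for validation.
train_idx, valid_idx = np.arange(0, 20, 2), np.arange(1, 20, 2)

# Only the training rows are oversampled; validation rows stay untouched.
X_tr, y_tr = oversample(X_demo[train_idx], y_demo[train_idx], min_count=6, rng=rng)
X_va, y_va = X_demo[valid_idx], y_demo[valid_idx]
print(X_tr.shape, np.unique(y_tr, return_counts=True))
```

With this ordering, no validation row is ever a copy of a training row.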

### Logistic Regression
Train and validate your <b>logistic regression classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>
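
To make the metric concrete: macro F1 computes an F1 score per class and takes the unweighted mean, so rare classes count as much as frequent ones. The sketch below uses made-up labels and a hypothetical `macro_f1` helper, and compares the hand computation with `f1_score(..., average='macro')`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0])

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    scores = []
    for label in labels:
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

manual = macro_f1(y_true, y_pred)
library = f1_score(y_true, y_pred, average='macro')
print(manual, library)
```

Class 2 is never predicted here, so its F1 is 0 and it pulls the macro average down by a full third, which is why a model that ignores small classes scores poorly under this metric.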

In [6]:
# Training your logistic regression classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here
valid_split = 1/10
cv = ShuffleSplit(n_splits=10, test_size=valid_split, random_state=0)

coefs = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

max_tr_score = 0
max_va_score = 0
optimum_coef = -1

for coef in coefs:
    print('='*50)
    print("coef value: ", coef)
    model = LogisticRegression(C=coef, solver='lbfgs', max_iter=500)

    train_score = []
    valid_score = []

    for train_index, test_index in cv.split(X) :
        X_tr, X_va = X[train_index], X[test_index]
        y_tr, y_va = y[train_index], y[test_index]
    
        model.fit(X_tr, y_tr)
        train_score.append(f1_score(y_tr, model.predict(X_tr), average='macro'))
        valid_score.append(f1_score(y_va, model.predict(X_va), average='macro'))
        
    print(np.mean(train_score))
    print(np.mean(valid_score))
    
    if np.mean(valid_score) > max_va_score :
        max_tr_score = np.mean(train_score)
        max_va_score = np.mean(valid_score)
        optimum_coef = coef

print('='*50)
print("max_tr_score", (max_tr_score))
print("max_va_score", (max_va_score))
print("optimum coef", (optimum_coef))
    
# End Your Code

coef value:  0.0001


0.011613110035650399
0.011688620301740093
coef value:  0.001
0.06929917160805454
0.05847472223665208
coef value:  0.01
0.26326482815135954
0.22817332762421452
coef value:  0.1
0.40660603602463796
0.3651082709319086
coef value:  1.0
0.4395733182806293
0.402887158032201
coef value:  10.0
0.4501796083928432
0.4100008942879792
coef value:  100.0
0.4529534685532154
0.413422528623253
coef value:  1000.0
0.4532004499041137
0.41420250402680325
max_tr_score 0.4532004499041137
max_va_score 0.41420250402680325
optimum coef 1000.0
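
The train/validate loop used in the cell above is repeated for every classifier in this notebook. A possible refactoring, shown here only as a sketch, wraps it in a helper. The name `mean_macro_f1` is hypothetical, and the data come from `make_classification` as a stand-in for the homework dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import ShuffleSplit

def mean_macro_f1(model, X, y, cv):
    """Fit the model on each split and return mean (train, validation) macro F1."""
    train_scores, valid_scores = [], []
    for train_index, valid_index in cv.split(X):
        model.fit(X[train_index], y[train_index])
        train_scores.append(
            f1_score(y[train_index], model.predict(X[train_index]), average='macro'))
        valid_scores.append(
            f1_score(y[valid_index], model.predict(X[valid_index]), average='macro'))
    return float(np.mean(train_scores)), float(np.mean(valid_scores))

# Synthetic stand-in data (assumption: not the homework dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)
cv_demo = ShuffleSplit(n_splits=5, test_size=0.1, random_state=0)

results = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, solver='lbfgs', max_iter=500)
    results[C] = mean_macro_f1(model, X_demo, y_demo, cv_demo)

# Pick the C with the highest mean validation score.
best_C = max(results, key=lambda C: results[C][1])
print(best_C, results[best_C])
```

The same helper would accept a `DecisionTreeClassifier`, `RandomForestClassifier`, or `SVC` instance, which keeps each model-selection cell down to the hyperparameter loop itself.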


### Decision Tree
Train and validate your <b>decision tree classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>

In [7]:
# Training your decision tree classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here
#tree = DecisionTreeClassifier(criterion='entropy', max_depth=k+1, random_state=0).fit(df_X, df_Y)
#cross_val_score(tree, df_X, df_Y, scoring='f1_macro', cv=cv)

max_tr_score = 0
max_va_score = 0
optimum_max_depth = -1

for k in range(10, 30) :
    print('='*50)
    print("max_depth value: ", k+1)
    model = DecisionTreeClassifier(criterion='entropy', max_depth=k+1, random_state=0)

    train_score = []
    valid_score = []

    for train_index, test_index in cv.split(X) :
        X_tr, X_va = X[train_index], X[test_index]
        y_tr, y_va = y[train_index], y[test_index]

        model.fit(X_tr, y_tr)
        train_score.append(f1_score(y_tr, model.predict(X_tr), average='macro'))
        valid_score.append(f1_score(y_va, model.predict(X_va), average='macro'))

    print(np.mean(train_score))
    print(np.mean(valid_score))
    
    if np.mean(valid_score) > max_va_score :
        max_tr_score = np.mean(train_score)
        max_va_score = np.mean(valid_score)
        optimum_max_depth = k+1

print('='*50)
print("max_tr_score", (max_tr_score))
print("max_va_score", (max_va_score))
print("optimum max_depth", (optimum_max_depth))
# End Your Code

max_depth value:  11
0.873807125795618
0.6577983176905782
max_depth value:  12
0.9148246913137579
0.6704339968529027
max_depth value:  13
0.9423735795394356
0.6834243495053387
max_depth value:  14
0.9634952650614619
0.6900164812800874
max_depth value:  15
0.977760824282085
0.691071805555851
max_depth value:  16
0.9868866535773833
0.6904586457086168
max_depth value:  17
0.9928450626803915
0.6865485610194609
max_depth value:  18
0.9965230959890409
0.693563740448218
max_depth value:  19
0.9986006066318552
0.6951824987538001
max_depth value:  20
0.9993426785603964
0.6889647368002778
max_depth value:  21
0.9996359645902853
0.6916462726305399
max_depth value:  22
0.9998602130911122
0.6909860625050441
max_depth value:  23
0.9999437181431677
0.691605150145566
max_depth value:  24
1.0
0.6934130884466628
max_depth value:  25
1.0
0.6934130884466628
max_depth value:  26
1.0
0.6934130884466628
max_depth value:  27
1.0
0.6934130884466628
max_depth value:  28
1.0
0.6934130884466628
max_depth value:  

### Random Forest
Train and validate your <b>random forest classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>

In [8]:
# Search over n_estimators with the same ShuffleSplit validation used above
max_tr_score = 0
max_va_score = 0
optimum_n_estimators = -1

for k in range(1, 10) :
    print('='*50)
    print("n_estimators value: ", k*50)
    model = RandomForestClassifier(criterion='entropy', n_estimators=k*50, class_weight='balanced')

    train_score = []
    valid_score = []

    for train_index, test_index in cv.split(X) :
        X_tr, X_va = X[train_index], X[test_index]
        y_tr, y_va = y[train_index], y[test_index]

        model.fit(X_tr, y_tr)
        train_score.append(f1_score(y_tr, model.predict(X_tr), average='macro'))
        valid_score.append(f1_score(y_va, model.predict(X_va), average='macro'))

    print(np.mean(train_score))
    print(np.mean(valid_score))
    
    if np.mean(valid_score) > max_va_score :
        max_tr_score = np.mean(train_score)
        max_va_score = np.mean(valid_score)
        optimum_n_estimators = k*50

print('='*50)
print("max_tr", (max_tr_score))
print("max_va", (max_va_score))
print("optimum n_estimators", (optimum_n_estimators))


n_estimators value:  50
1.0
0.6919879629702801
n_estimators value:  100
1.0
0.6901972843359176
n_estimators value:  150
1.0
0.6916987375588611
n_estimators value:  200
1.0
0.694770023439437
n_estimators value:  250
1.0
0.6920249738834208
n_estimators value:  300
1.0
0.6941633814345533
n_estimators value:  350
1.0
0.6942671088868552
n_estimators value:  400
1.0
0.6964777698193587
n_estimators value:  450
1.0
0.6981490299128821
max_tr 1.0
max_va 0.6981490299128821
optimum n_estimators 450
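
The manual search over `n_estimators` above could also be written with scikit-learn's `GridSearchCV`, which accepts a `ShuffleSplit` object as `cv` and `'f1_macro'` as `scoring`. The sketch below is illustrative only: it runs on synthetic data from `make_classification`, and the small grid values are chosen to keep it fast, not to match the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Synthetic stand-in data (assumption: not the homework dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(criterion='entropy', class_weight='balanced', random_state=0),
    param_grid={'n_estimators': [10, 20, 40]},   # illustrative grid
    scoring='f1_macro',                          # macro-averaged F1, as required
    cv=ShuffleSplit(n_splits=3, test_size=0.1, random_state=0),
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Fixing `random_state` on the forest makes repeated runs comparable; without it, small differences between neighbouring `n_estimators` values can be due to random seeding alone.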


### Support Vector Machine
Train and validate your <b>support vector machine classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>

In [9]:
# Training your support vector machine classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here

coefs = [1.0, 10.0, 100.0, 1000.0]

max_tr_score = 0
max_va_score = 0
optimum_coef = -1

for coef in coefs:
    print('='*50)
    print("coef value: ", coef)
    model = SVC(C=coef, class_weight='balanced')

    train_score = []
    valid_score = []

    for train_index, test_index in cv.split(X) :
        X_tr, X_va = X[train_index], X[test_index]
        y_tr, y_va = y[train_index], y[test_index]
    
        model.fit(X_tr, y_tr)
        train_score.append(f1_score(y_tr, model.predict(X_tr), average='macro'))
        valid_score.append(f1_score(y_va, model.predict(X_va), average='macro'))
        
    print(np.mean(train_score))
    print(np.mean(valid_score))
    
    if np.mean(valid_score) > max_va_score :
        max_tr_score = np.mean(train_score)
        max_va_score = np.mean(valid_score)
        optimum_coef = coef

print('='*50)
print("max_tr_score", (max_tr_score))
print("max_va_score", (max_va_score))
print("optimum coef", (optimum_coef))

print('='*50)
params = [-2, -1, 0, 1, 2]

max_tr_score = 0
max_va_score = 0
final_coef = -1

for param in params:
    coef = optimum_coef+(optimum_coef*0.1*param)
    print('='*50)
    print("coef value: ", coef)
    model = SVC(C=coef, class_weight='balanced')

    train_score = []
    valid_score = []

    for train_index, test_index in cv.split(X) :
        X_tr, X_va = X[train_index], X[test_index]
        y_tr, y_va = y[train_index], y[test_index]
    
        model.fit(X_tr, y_tr)
        train_score.append(f1_score(y_tr, model.predict(X_tr), average='macro'))
        valid_score.append(f1_score(y_va, model.predict(X_va), average='macro'))
        
    print(np.mean(train_score))
    print(np.mean(valid_score))
    
    if np.mean(valid_score) > max_va_score :
        max_tr_score = np.mean(train_score)
        max_va_score = np.mean(valid_score)
        final_coef = coef

print('='*50)
print("max_tr_score", (max_tr_score))
print("max_va_score", (max_va_score))
print("final coef", (final_coef))

coef value:  1.0
0.6009615404764975
0.5301542576348145
coef value:  10.0
0.826102262787502
0.6760720035214024
coef value:  100.0
0.9693080586907703
0.7297174460647307
coef value:  1000.0
0.9990514673214937
0.7181679430918539
max_tr_score 0.9693080586907703
max_va_score 0.7297174460647307
optimum coef 100.0
coef value:  80.0
0.9632512447639792
0.7283102492162004
coef value:  90.0
0.9664980281749223
0.7283597183270102
coef value:  100.0
0.9693080586907703
0.7297174460647307
coef value:  110.0
0.9726880277859463
0.7302868778537863
coef value:  120.0
0.975260247250773
0.7282510579631164
max_tr_score 0.9726880277859463
max_va_score 0.7302868778537863
final coef 110.0


### (Option) Other Classifiers.
Train and validate other classifiers in your own way.
> <b> If needed, you can import other models in this cell, from scikit-learn only. </b>

In [349]:
# If you need additional packages, import your own packages below.
# Your Code Here
# End Your Code

## Submit your prediction on the test data.

* Select your model and explain it briefly.
* You should read <b>"test.csv"</b>.
* Output your model's predictions in array form.
* Prediction example <br>
[2, 6, 14, 8, $\cdots$]
* We will rank your result by <b>F1 metric(with 'macro' option)</b>.
* <b> If you don't submit a prediction file, or submit it in the wrong format, you will not receive points for this part. </b>

In [10]:
# Explain your final model

# The data are skewed: the number of samples is uneven across the 18 classes.
# During preprocessing, classes with few samples were oversampled.
# The SVM classifier had the highest F1 score, so it was selected as the final model.
# class_weight is set to 'balanced', and the error-term penalty parameter C is set to a value that avoids overfitting.

print(final_coef)
print(X.shape)
print(y.shape)
final_model = SVC(C=final_coef, class_weight='balanced')
final_model.fit(X, y)

110.0
(2885, 23)
(2885,)


SVC(C=110.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [11]:
# Load test dataset.
# Your Code Here
df = pd.read_csv('./data/test.csv')
df_X_te = df.iloc[:, 0:6].values  # .values replaces the deprecated .as_matrix()
            
# One-hot encoding
categorical_features = [0, 2, 4]
feature_dict = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7}

for feature_num in categorical_features:
    feature_values_te = df_X_te[:, feature_num]
    feature_set = set(feature_values_te)
    
    for i, value in enumerate(feature_values_te):
        feature_values_te[i] = feature_dict[value]
        
    one_hot_matrix_te = np.eye(len(feature_set))[feature_values_te.astype(int)]
    df_X_te = np.concatenate((df_X_te, one_hot_matrix_te), axis=1)
    
df_X_te = np.delete(df_X_te, categorical_features, 1)
X_te = df_X_te.astype(int)

X_te.shape
# End Your Code

(1500, 23)

In [17]:
# Predict target class
# Make variable "my_answer", type of array, and fill this array with your class predictions.
# Modify file name into your student number and your name.
# Your Code Here

my_answer = final_model.predict(X_te)

result_df = pd.read_csv('./data/test.csv')
result_df['my_answer'] = my_answer.tolist()

file_name = "HW2_2015410056_김지윤.csv"

# End Your Code
result_df.head()

Unnamed: 0,feature1,feature2,feature3,feature4,feature5,featrure6,my_answer
0,d,1,h,5,h,6,0
1,d,1,c,1,e,5,3
2,c,1,d,3,e,5,3
3,d,2,a,3,g,8,2
4,d,1,e,1,b,3,13


In [16]:
# This section is for saving predicted answers. DO NOT MODIFY.
pd.Series(my_answer).to_csv("./data/" + file_name, header=None, index=None)