# HW2

#### Machine Learning in Korea University
#### COSE362, Fall 2018
#### Due : 11/26 (TUE) 11:59 PM

#### In this assignment, you will learn various classification methods with given datasets.
* Implementation detail: Anaconda 5.3 with python 3.7
* Use given dataset. Please do not change train / valid / test split.
* Use numpy, scikit-learn, and matplotlib library
* You don't have to use all imported packages below. (some are optional). <br>
Also, you can import additional packages in "(Option) Other Classifiers" part. 
* <b>*DO NOT MODIFY OTHER PARTS OF CODES EXCEPT "Your Code Here"*</b>

In [1]:
# Basic packages
%matplotlib inline
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt

# Machine Learning Models
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Additional packages
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

In [2]:
# Import your own packages if you need(only in scikit-learn, numpy, pandas).
# Your Code Here
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold
#End Your Code

## Process

> 1. Load "train.csv". It includes all samples' features and labels.
> 2. Training four types of classifiers(logistic regression, decision tree, random forest, support vector machine) and <b>validate</b> it in your own way. <b>(You can't get full credit if you don't conduct validation)</b>
> 3. Optionally, if you would train your own classifier(e.g. ensembling or gradient boosting), you can evaluate your own model on the development data. <br>
> 4. <b>You should submit your predicted results on test data with the selected classifier in your own manner.</b>

## Task & dataset description
1. 6 Features (1~6)<br>
Feature 2, 4, 6 : Real-valued<br>
Feature 1, 3, 5 : Categorical <br>

2. Samples <br>
>In development set : 2,000 samples <br>
>In test set : 1,500 samples

## Load development dataset
Load your development dataset. You should read <b>"train.csv"</b>. This is a classification task, and you need to preprocess your data for training your model. <br>
> You need to use <b>1-of-K coding scheme</b>, to convert categorical features to one-hot vector. <br>
> For example, if there are 3 categorical values, you can convert these features as [1,0,0], [0,1,0], [0,0,1] by 1-of-K coding scheme. <br>

In [4]:
# For training your model, you need to convert categorical features to one-hot encoding vectors.
# Your Code Here

# define datasets, dictionaries and functions
lab = LabelEncoder()
train_data = pd.read_csv("data/train.csv", index_col=0)
oneHot = OneHotEncoder()
train_values = train_data.reset_index().values
real_features = [1,3,5]
categorical_features = [0,2,4]
y_feature = 6
train_X = np.asarray(train_values)
train_y = train_values[:,y_feature]
        
# implement OneHotEncoding
for val in categorical_features:
    feature = train_values[:,val]
    feature = oneHot.fit_transform(feature.reshape(-1,1)).toarray()
    train_X = np.concatenate((train_X, feature), axis = 1)
# end OneHotEncoding

# implement delete categorical datas and refine training y data
train_X = np.delete(train_X, categorical_features, 1)
train_X = np.delete(train_X, y_feature, 1)
train_y = lab.fit_transform(train_y)
# End Your Code

### Logistic Regression
Train and validate your <b>logistic regression classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>

In [7]:
# Training your logistic regression classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here

#define variables
opt_LogReg = LogisticRegression()
coefs = [0.01, 0.05, 0.1, 0.5, 1, 10]
max_cross_val = 0
max_f1_score = 0
cv = KFold(5, shuffle=True, random_state=0)

# implement regularization
for coef in coefs:
    LogReg = LogisticRegression(C=coef, solver='lbfgs', multi_class='multinomial', max_iter=5000)
    LogReg.fit(train_X, train_y)
    # implement cross-validation and get a f1_score
    cross_val = cross_val_score(LogReg, train_X, train_y, cv=cv)
    cross_val = np.mean(cross_val)
    cur_f1_score = f1_score(train_y, LogReg.predict(train_X), average='macro', labels=np.unique(LogReg.predict(train_X)))


    # get a minimum cross-validation error by f1_score
    if cur_f1_score > max_f1_score:
        opt_LogReg = LogReg
        max_cross_val = cross_val
        max_f1_score = cur_f1_score
    print("Maximum F1 Score : {}".format(max_f1_score))
    
# print validation error for my model 
validation_error = 1 - max_cross_val
print("My LogisticRegression's Validation error : {}".format(validation_error))

# End Your Code

[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  2  6]
Maximum F1 Score : 0.6969830051365752
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  2  7]
Maximum F1 Score : 0.6969830051365752
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  2  7]
Maximum F1 Score : 0.7325869328825546
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  2  7]
Maximum F1 Score : 0.8557353597367277
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  2  7]
Maximum F1 Score : 0.8873843119023351
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Maximum F1 Score : 0.9917216761519609
My LogisticRegression's Validation error : 0.02949999999999997


### Decision Tree
Train and validate your <b>decision tree classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>

In [8]:
# Training your decision tree classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here

#define variables
opt_DecisionTree = DecisionTreeClassifier()
max_cross_val = 0
max_f1_score = 0
cv = KFold(5, shuffle=True, random_state=0)

# implement regularization
for depth in range(1,10):
    DecisionTree = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=depth)
    DecisionTree.fit(train_X, train_y)
    # implement cross-validation and get a f1_score
    cur_f1_score = f1_score(train_y, DecisionTree.predict(train_X), average='macro', labels=np.unique(DecisionTree.predict(train_X)))
    cross_val = cross_val_score(DecisionTree, train_X, train_y, cv = cv)
    cross_val = np.mean(cross_val)

    # get a minimum cross-validation error by f1_score
    if cur_f1_score > max_f1_score:
        opt_DecisionTree = DecisionTree
        max_cross_val = cross_val
        max_f1_score = cur_f1_score
    print("Maximum F1 Score : {}".format(max_f1_score))

# print validation error for my model 
validation_error = 1 - max_cross_val
print("My DecisionTree's Validation error : {}".format(validation_error))
# End Your Code

[ 3 15  6 ... 15  1  6]
[ 6 13  6 ... 13  6  6]
Maximum F1 Score : 0.4832126801858239
[ 3 15  6 ... 15  1  6]
[ 6 13  6 ... 13  2  6]
Maximum F1 Score : 0.6384802131902313
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  0  6]
Maximum F1 Score : 0.9018430082135607
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Maximum F1 Score : 0.9520794217231852
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Maximum F1 Score : 1.0
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Maximum F1 Score : 1.0
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Maximum F1 Score : 1.0
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Maximum F1 Score : 1.0
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Maximum F1 Score : 1.0
My DecisionTree's Validation error : 0.0015000000000000568


### Random Forest
Train and validate your <b>random forest classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>

In [9]:
# Training your random forest classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here

#define variables
opt_RandomForests = RandomForestClassifier()
max_cross_val = 0
max_f1_score = 0
cv = KFold(5, shuffle=True, random_state=0)

# implement regularization
for depth in range(1,15):
    for estimators in range(1,15):
        RandomForest = RandomForestClassifier(bootstrap=True, criterion='entropy', n_estimators=estimators, random_state=0, max_depth=depth)
        RandomForest.fit(train_X, train_y)
        # implement cross-validation and get a f1_score
        cross_val = cross_val_score(RandomForest, train_X, train_y, cv = cv)
        cross_val = np.mean(cross_val)
        cur_f1_score = f1_score(train_y, RandomForest.predict(train_X), average='macro', labels=np.unique(RandomForest.predict(train_X)))
    
        # get a minimum cross-validation error by f1_score
        if cur_f1_score > max_f1_score:
            opt_RandomForests = RandomForest
            max_cross_val = cross_val
            max_f1_score = cur_f1_score
    print("Maximum F1 Score : {}".format(max_f1_score))

# print validation error for my model 
validation_error = 1 - max_cross_val
print("My RandomForests's Validation error : {}".format(validation_error))
# End Your Code

[ 3 15  6 ... 15  1  6]
[1 1 6 ... 6 6 6]
[ 3 15  6 ... 15  1  6]
[1 6 6 ... 6 6 6]
[ 3 15  6 ... 15  1  6]
[6 6 6 ... 6 6 6]
[ 3 15  6 ... 15  1  6]
[ 6  6  6 ... 13 13 13]
[ 3 15  6 ... 15  1  6]
[ 6  6  6 ... 13 13 13]
[ 3 15  6 ... 15  1  6]
[ 6  6  6 ... 13 13 13]
[ 3 15  6 ... 15  1  6]
[ 6  6  6 ... 13 13 13]
[ 3 15  6 ... 15  1  6]
[ 6 13  6 ... 13  6  6]
[ 3 15  6 ... 15  1  6]
[ 6 13  6 ... 13  6  6]
[ 3 15  6 ... 15  1  6]
[ 6 13  6 ... 13  6  6]
[ 3 15  6 ... 15  1  6]
[ 6 13  6 ... 13  6  6]
[ 3 15  6 ... 15  1  6]
[ 6 13  6 ... 13  6  6]
[ 3 15  6 ... 15  1  6]
[ 6 13  6 ... 13  6  6]
[ 3 15  6 ... 15  1  6]
[ 6 13  6 ... 13  6  6]
Maximum F1 Score : 0.4710198428336308
[ 3 15  6 ... 15  1  6]
[1 1 6 ... 6 6 6]
[ 3 15  6 ... 15  1  6]
[6 6 6 ... 6 2 6]
[ 3 15  6 ... 15  1  6]
[6 6 6 ... 6 2 6]
[ 3 15  6 ... 15  1  6]
[ 6  6  6 ... 13  2 13]
[ 3 15  6 ... 15  1  6]
[ 6  6  6 ... 13  2 13]
[ 3 15  6 ... 15  1  6]
[ 6  6  6 ... 13  2 13]
[ 3 15  6 ... 15  1  6]
[ 6  6  6 ... 

[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Maximum F1 Score : 0.9992647997031194
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  0  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Maximum F1 Score : 0.9995898838004101
[ 3 15  6 ..

### Support Vector Machine
Train and validate your <b>support vector machine classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>

In [10]:
# Training your support vector machine classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here

#define variables
opt_SVM = SVC()
max_cross_val = 0
max_f1_score = 0
kernels = ["linear", "poly", "rbf"]
coefs = [0.001, 0.01, 0.1, 0.0]
Cs = [0.01, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
cv = KFold(5, shuffle=True, random_state=0)

# implement regularization
for kernel in kernels:
    for coef in coefs:
        for C in Cs:
            SVM = SVC(C=C, cache_size=200, class_weight=None, coef0=coef,
            decision_function_shape='ovr', degree=3, gamma='auto', kernel=kernel,
            max_iter=-1, probability=False, random_state=None, shrinking=True,
            tol=0.001, verbose=False)
            SVM.fit(train_X, train_y)
            # implement cross-validation and get a f1_score
            cross_val = cross_val_score(SVM, train_X, train_y, cv = cv)
            cross_val = np.mean(cross_val)
            cur_f1_score = f1_score(train_y, SVM.predict(train_X), average='macro', labels=np.unique(SVM.predict(train_X)))

            # get a minimum cross-validation error by f1_score
            if cur_f1_score > max_f1_score:
                opt_SVM = SVM
                max_cross_val = cross_val
                max_f1_score = cur_f1_score
    print("Maximum F1 Score : {}".format(max_f1_score))

# print validation error for my model 
validation_error = 1 - max_cross_val
print("My SVM's Validation error : {}".format(validation_error))

# End Your Code

[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  2  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  2  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  2  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15

### (Option) Other Classifiers.
Train and validate other classifiers by your own manner.
> <b> If you need, you can import other models only in this cell, only in scikit-learn. </b>

In [11]:
# If you need additional packages, import your own packages below.
# Your Code Here
#define variables
opt_model = GradientBoostingClassifier()
max_cross_val = 0
max_f1_score = 0
myModel = ""
learning_rates = [1.0, 0.1, 0.01]
cv = KFold(5, shuffle=True, random_state=1)

for learning_rate in learning_rates:
    for depth in range(1,5):
        GradientBoosting = GradientBoostingClassifier(learning_rate=learning_rate, max_depth=depth)
        GradientBoosting = GradientBoosting.fit(train_X, train_y)
        # implement cross-validation and get a f1_score
        cross_val = cross_val_score(GradientBoosting, train_X, train_y, cv = cv)
        cross_val = np.mean(cross_val)
        cur_f1_score = f1_score(train_y, GradientBoosting.predict(train_X), average='macro', labels=np.unique(GradientBoosting.predict(train_X)))
        # get a minimum cross-validation error by f1_score
        if cur_f1_score > max_f1_score:
            opt_model = GradientBoosting
            max_cross_val = cross_val
            max_f1_score = cur_f1_score
            myModel = "GradientBoosting"
        
    Bagging = BaggingClassifier()
    Bagging = Bagging.fit(train_X, train_y)
    # implement cross-validation and get a f1_score
    cross_val = cross_val_score(Bagging, train_X, train_y, cv = cv)
    cross_val = np.mean(cross_val)
    cur_f1_score = f1_score(train_y, Bagging.predict(train_X), average='macro', labels=np.unique(Bagging.predict(train_X)))
    # get a minimum cross-validation error by f1_score
    if cur_f1_score > max_f1_score:
        opt_model = Bagging
        max_cross_val = cross_val
        max_f1_score = cur_f1_score
        myModel = "Bagging"
    print (train_y)
    print (GradientBoosting.predict(train_X))
    print ("Current Optimal Model is : {}".format(myModel))
    print("Maximum F1 Score : {}".format(max_f1_score))

# print validation error for my model
validation_error = 1 - max_cross_val
print("My model is : {}".format(myModel))
print("My model's Validation error : {}".format(validation_error))
# End Your Code

[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Current Optimal Model is : GradientBoosting
Maximum F1 Score : 1.0
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Current Optimal Model is : GradientBoosting
Maximum F1 Score : 1.0
[ 3 15  6 ... 15  1  6]
[ 3 15  6 ... 15  1  6]
Current Optimal Model is : GradientBoosting
Maximum F1 Score : 1.0
My model is : GradientBoosting
My model's Validation error : 0.0


## Submit your prediction on the test data.

* Select your model and explain it briefly.
* You should read <b>"test.csv"</b>.
* Prerdict your model in array form.
* Prediction example <br>
[2, 6, 14, 8, $\cdots$]
* We will rank your result by <b>F1 metric(with 'macro' option)</b>.
* <b> If you don't submit prediction file or submit it in wrong format, you can't get the point for this part.

In [67]:
# Explain your final model
'''위의 모델들 중 제가 추가적으로 구현한 모델인 Gradient Boosting, Bagging 중 Gradient Boosting 모델이 validation error
가 가장 낮으므로, Gradient Boosting 모델을 선택했습니다. 모든 모델들을 regularization 및 cross validation 을 사용하여, 최적의 모델을
찾았지만, 그 중에서도 제 모델이 가장 좋은 성능을 보였습니다.'''

'설명'

In [15]:
# Load test dataset.
# Your Code Here
test_data = pd.read_csv("data/test.csv", index_col=0)
test_values = test_data.reset_index().values
real_features = [1,3,5]
categorical_features = [0,2,4]
test_X = np.asarray(test_values)

        
# implement OneHotEncoding
for val in categorical_features:
    feature = test_values[:,val]
    feature = oneHot.fit_transform(feature.reshape(-1,1)).toarray()
    test_X = np.concatenate((test_X, feature), axis = 1)
# end OneHotEncoding

# implement delete categorical datas and refine training y data
test_X = np.delete(test_X, categorical_features, 1)
# End Your Code

In [16]:
# Predict target class
# Make variable "my_answer", type of array, and fill this array with your class predictions.
# Modify file name into your student number and your name.
# Your Code Here
my_answer = opt_model.predict(test_X)
file_name = "HW2_2013190709_김민상.csv"
# End Your Code

In [17]:
# This section is for saving predicted answers. DO NOT MODIFY.
pd.Series(my_answer).to_csv("./data/" + file_name, header=None, index=None)