# Kickstarter Projects
# Day 2 (Rev-1) Homework for a Machine Learning Course
Author: Hiroki Miyamoto

# Links to my homework
- Kaggle
    - Day1
        - Objective of Day 1: Build a supervised machine learning model based on the Day 1 lecture. Accuracy is not a concern at this stage.
        - https://www.kaggle.com/hmiyamoto/day-1-homework-for-a-machine-learning-course/
    - Day2
        - Objective of Day 2: Improve the accuracy of your supervised machine learning model by applying the algorithms introduced on Day 2.
        - https://www.kaggle.com/hmiyamoto/day-2-homework-for-a-machine-learning-course/
    - Day2 Appendix-1
        - Check how much each explanatory variable contributes to the prediction by inspecting the weights
        - https://www.kaggle.com/hmiyamoto/day-2-homework-appendix-1
    - Day2 Appendix-2
        - Check again how much each explanatory variable contributes to the prediction by inspecting the weights
            - "backers" and "usd_pledged_real" are removed from the explanatory variables for success prediction because they are outcomes of the funding.
        - https://www.kaggle.com/hmiyamoto/day-2-homework-appendix-2
- GitHub
    - https://github.com/hmiyamoto1/skillupai_ml

## Objective of Day 2: Improve the accuracy of your supervised machine learning model by applying the algorithms introduced on Day 2.

### "backers" and "usd_pledged_real" are removed from the explanatory variables for success prediction because they are outcomes of the funding, as examined in the notebook Day 2 Appendix-2.

### Two variables, 'category_dummy' and 'usd_goal_real_log10', are used as explanatory variables because they contribute most to the prediction, as examined in the notebook Day 2 Appendix-2.

### Table of Contents (Day 2)
1. Divide dataframe into train data (train & validation) and test data (final check)
1. Parameter study for Logistic Regression (L1)
1. Parameter study for Logistic Regression (L2)
1. Parameter study for SVM (Linear)
1. Parameter study for SVM (RBF)

# Summary

SVM did not reach the accuracy of logistic regression in this case. As the 2D plots in this notebook show, SVM could not draw a distinct decision boundary because the labels of the objective variable overlap in the plane spanned by the selected explanatory variables.

| Model | Final Test Accuracy   |
|------|------|
|   Logistic Regression (L1)  | 68.929% |
|   Logistic Regression (L2)  | 68.932% |
|   SVM (Linear)  | 61.953% |
|   SVM (RBF)  | 60.256% |

- Objective variable
    - state_dummy (successful = 1, other = 0)
- Explanatory variables
    - usd_goal_real_log10
        - Second-largest contributor to the prediction
    - category_dummy
        - Created by replacing each category with its success rate, computed on the training data
        - Largest contributor to the prediction
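
The `category_dummy` encoding described above can be sketched as follows. This is a minimal illustration on made-up data, not the notebook's exact code: the function names are hypothetical, and the fallback to the overall success rate for categories that never appear in the training data is an assumption added here (the notebook does not define one).

```python
import pandas as pd


def fit_success_rate(categories, labels):
    """Return (per-category success rate, overall success rate) from training data."""
    df = pd.DataFrame({"category": categories, "label": labels})
    rates = df.groupby("category")["label"].mean()
    return rates, df["label"].mean()


def apply_success_rate(categories, rates, fallback):
    """Replace each category by its training success rate.

    Categories unseen in training get `fallback` (an assumption of this sketch).
    """
    return pd.Series(categories).map(rates).fillna(fallback)


# Made-up training data: Music succeeds 1 of 2 times, Games 1 of 3 times
train_cat = ["Music", "Music", "Games", "Games", "Games"]
train_y = [1, 0, 0, 0, 1]

rates, overall = fit_success_rate(train_cat, train_y)
# "Film" does not occur in the training data, so it receives the overall rate
encoded = apply_success_rate(["Music", "Games", "Film"], rates, overall)
print(encoded.tolist())
```

Fitting the rates on the training part only, and reusing them for validation and test data, keeps label information from the held-out rows out of the feature.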
        


# 0. Preparation

In [None]:
# data analysis and wrangling
import pandas as pd
import numpy as np

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.svm import SVC

In [None]:
from matplotlib.colors import ListedColormap


def plot_decision_regions(X, y, classifier, resolution=0.02):
    """Plot the classifier's decision regions over two features and overlay the samples (X, y)."""

    # setup marker generator and color map
    markers = ( 'x', '.', 'o', 's', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
                       
    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=colors[idx],
                    edgecolor=None,
                    marker=markers[idx], 
                    label=cl)

# 1. Divide dataframe into train data (train & validation) and test data (final check)
- DataFrame
    - df_TRAIN : 80%
        - This will be further divided into train and validation data in the holdout or cross-validation training phase.
            - train : 80%
            - valid : 20%
    - df_TEST : 20%
        - This is blind data for final test.
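
As a quick check of the proportions above, here is a small sketch that applies the same two-step 80/20 split to 1000 stand-in row indices (the array `rows` and the variable names are placeholders, not the Kickstarter data). Relative to the full data set the result is 64% train, 16% validation and 20% test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the rows of the full DataFrame
rows = np.arange(1000)

# Step 1: hold out 20% as blind test data
rows_TRAIN, rows_TEST = train_test_split(rows, test_size=0.2, random_state=1234)

# Step 2: split the remaining 80% again into train (80%) and validation (20%)
rows_train, rows_valid = train_test_split(rows_TRAIN, test_size=0.2, random_state=1234)

print(len(rows_train), len(rows_valid), len(rows_TEST))
```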

## Acquire data

In [None]:
df_kick = pd.read_csv("../input/ks-projects-201801.csv")

## Preview the data

In [None]:
display(df_kick.head())

## Create dataframe with selected features

In [None]:
# Encode the target: 1 for successful projects, 0 for every other state
df_kick['state_dummy'] = (df_kick['state'] == 'successful').astype(int)

# display(df_kick.head())

In [None]:
# epsilon = 1e-8
# Offset added before log10 so that the argument is always positive
epsilon = 1

df_kick['usd_goal_real_log10'] = np.log10(df_kick['usd_goal_real'] + epsilon)

display(df_kick.head())

In [None]:
df_ALL = df_kick.loc[:, ['state_dummy', 'usd_goal_real_log10', 'category']]
df_ALL.head()

In [None]:
df_TRAIN, df_TEST = train_test_split(df_ALL, test_size=0.2, random_state=1234)
display(df_TRAIN.head())
display(df_TRAIN.describe())
display(df_TEST.describe())

# 2. Parameter study for Logistic Regression (L1)
- Model
    - Variables
        - Objective variable
            - state_dummy (successful = 1, other = 0)
        - Explanatory variables
            - usd_goal_real_log10
                - Second-largest contributor to the prediction
            - category_dummy
                - Created during training by replacing each category with its success rate in the training data
                - Largest contributor to the prediction
    - 5-fold cross-validation (train 80% / validation 20% in each fold) is applied.
    - **L1 regularization is applied here.**
- Cross Validation Result
    - Best parameter
        - **alpha = 1e-3  (1e-8 <= alpha <= 1e-1)**
        - alpha = 1e-2 actually gave the best prediction score, but alpha = 1e-3 is chosen because it appears more stable.
    - Best score
        - **Cross Validation Log-likelihood = -10.886**
        - **Cross Validation Accuracy = 68.483%**
- Final Test Result (Applied best parameter)
    - Test score
        - **Test Log-likelihood = -10.732**
        - **Test Accuracy = 68.929%**
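
The log-likelihood figures above can be related to the accuracy figures. The following is a minimal sketch under two assumptions: that `log_loss` received hard 0/1 predictions (which is what `clf.predict` returns in the cells below), and that it clipped them with a constant of 1e-15, the default in older scikit-learn releases. Under those assumptions every misclassified sample contributes log(1e-15), so the value is essentially the error rate times -34.5. The helper names are hypothetical. The second function shows the mean log-likelihood computed from predicted probabilities (for example from `clf.predict_proba`), which is the more usual definition.

```python
import numpy as np

EPS = 1e-15  # assumed clipping constant (default in older scikit-learn releases)


def log_likelihood_from_hard_labels(accuracy, eps=EPS):
    """Mean log-likelihood when predictions are hard 0/1 labels clipped to [eps, 1 - eps]."""
    wrong = 1.0 - accuracy
    return wrong * np.log(eps) + accuracy * np.log(1.0 - eps)


def mean_log_likelihood(y_true, p_pos):
    """Mean log-likelihood from predicted probabilities of the positive class."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pos, dtype=float), EPS, 1.0 - EPS)
    return float(np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


# Accuracies reported above for the L1 model
cv_value = log_likelihood_from_hard_labels(0.68483)
test_value = log_likelihood_from_hard_labels(0.68929)
print(round(cv_value, 3), round(test_value, 3))

# With probabilities, an uninformative prediction of 0.5 gives log(0.5) per sample
print(mean_log_likelihood([1, 0], [0.5, 0.5]))
```

Under these assumptions the two computed values agree with the reported -10.886 and -10.732 to within rounding, which suggests the log-likelihood here carries the same information as the accuracy.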

## Cross-Validation

In [None]:
penalty = 'l1'

alphas_multiply = np.array(range(-8,0))
alphas = 10.0 ** alphas_multiply

L1_accuracy = []
L1_log_likelihood = []
L1_weight_abs_max = []
L1_weight_abs_min = []


for alpha in alphas:
    
    print('='*100)
    print('penalty =', penalty)
    print('alpha =', alpha)
    print()

    y = df_TRAIN["state_dummy"].values
    X = df_TRAIN[["usd_goal_real_log10", "category"]].values

    n_split = 5 # Number of group

    cross_valid_log_likelihood = 0
    cross_valid_accuracy = 0
    split_num = 1

    # Cross Validation (folds are not shuffled; random_state only applies when shuffle=True)
    for train_idx, valid_idx in KFold(n_splits=n_split).split(X, y):
        X_train, y_train = X[train_idx], y[train_idx] # Train data
        X_valid, y_valid = X[valid_idx], y[valid_idx] # Validation data

        df_X_train = pd.DataFrame(X_train,
                                 columns=["usd_goal_real_log10", "category"])
        df_y_train = pd.DataFrame(y_train,
                                 columns=["state_dummy"])

        df_X_valid = pd.DataFrame(X_valid,
                                 columns=["usd_goal_real_log10", "category"])
        df_y_valid = pd.DataFrame(y_valid,
                                 columns=["state_dummy"])




        # Create dummy variables for category using train data
        # Replace category to category_success_rate
        category_success_rate = {}
        df_category_successful_count = df_X_train['category'][df_y_train['state_dummy'] == 1].value_counts()
        df_category_all_count = df_X_train['category'].value_counts()
        for category in df_category_all_count.keys():
            category_success_rate[category] = df_category_successful_count[category] / df_category_all_count[category]
        df_X_train['category_dummy'] = df_X_train['category'].replace(category_success_rate)
        df_X_valid['category_dummy'] = df_X_valid['category'].replace(category_success_rate)


        print("Fold %s"%split_num)

        X_train = df_X_train[["usd_goal_real_log10", "category_dummy"]].values
        X_valid = df_X_valid[["usd_goal_real_log10", "category_dummy"]].values
        
        # Standardization (zero mean, unit variance), fitted on the training data only
        stdsc = StandardScaler()
        X_train = stdsc.fit_transform(X_train)
        X_valid = stdsc.transform(X_valid)

        clf = SGDClassifier(loss='log', penalty=penalty, alpha=alpha, max_iter=100, fit_intercept=True, random_state=1234)
        clf.fit(X_train, y_train)

        # Weight
        w0 = clf.intercept_[0]
        w1 = clf.coef_[0, 0]
        w2 = clf.coef_[0, 1]
        print('w0 = {:.3f}, w1 = {:.3f}, w2 = {:.3f}'.format(w0, w1, w2))


        # Predict labels
        y_est_valid = clf.predict(X_valid)

        # Log-likelihood
        log_likelihood = - log_loss(y_valid, y_est_valid)    
        cross_valid_log_likelihood += log_likelihood    
        print('Log-likelihood = {:.3f}'.format(log_likelihood))

        # Accuracy
        accuracy = accuracy_score(y_valid, y_est_valid)
        cross_valid_accuracy += accuracy   
        print('Accuracy = {:.3f}%'.format(100 * accuracy))
        print()
        
        if split_num == n_split:
            plot_decision_regions(X_valid, y_valid, classifier=clf)
            plt.title('(Fold %s)  L1 alpha = %s' %(split_num,alpha))
            plt.xlabel('usd_goal_real_log10_stdsc')
            plt.ylabel('category_dummy_stdsc')
            plt.gca().set_aspect('equal', 'datalim')
            plt.legend(loc='upper right')
            plt.tight_layout()
            plt.show()

        split_num += 1

    # Generalization performance
    final_log_likelihood = cross_valid_log_likelihood / n_split
    print("Cross Validation Log-likelihood = %s"%round(final_log_likelihood, 3))
    final_accuracy = cross_valid_accuracy / n_split
    print('Cross Validation Accuracy = {:.3f}%'.format(100 * final_accuracy))
    
    L1_accuracy.append(final_accuracy)
    L1_log_likelihood.append(final_log_likelihood)
    L1_weight_abs_max.append(np.max(np.abs(clf.coef_)))
    L1_weight_abs_min.append(np.min(np.abs(clf.coef_)))
    

In [None]:
plt.plot(alphas_multiply, L1_accuracy, marker='o')
plt.title("L1 alpha")
plt.xlabel("Log10(alpha)")
plt.ylabel("Accuracy")
plt.show()

In [None]:
plt.plot(alphas_multiply, L1_log_likelihood, marker='o')
plt.title("L1 alpha")
plt.xlabel("Log10(alpha)")
plt.ylabel("Log-likelihood")
plt.show()

In [None]:
plt.plot(alphas_multiply, L1_weight_abs_max, marker='o', label='Weight_abs_max')
plt.plot(alphas_multiply, L1_weight_abs_min, marker='o', label='Weight_abs_min')
plt.title("L1 alpha")
plt.xlabel("Log10(alpha)")
plt.ylabel("Weight_abs_max_min")
plt.legend()
plt.show()

## Final Test

In [None]:
penalty = 'l1'

# Best Parameter
alpha = 1e-3


print('penalty =', penalty)
print('alpha =', alpha)
print()

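# TRAIN data (copied so that adding columns below does not touch df_TRAIN)
df_y_train = df_TRAIN[["state_dummy"]].copy()
df_X_train = df_TRAIN[["usd_goal_real_log10", "category"]].copy()

# TEST data (copied for the same reason)
df_y_test = df_TEST[["state_dummy"]].copy()
df_X_test = df_TEST[["usd_goal_real_log10", "category"]].copy()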


# df_X_train = pd.DataFrame(X_train,
#                          columns=["usd_goal_real_log10", "category"])
# df_y_train = pd.DataFrame(y_train,
#                          columns=["state_dummy"])

# df_X_test = pd.DataFrame(X_test,
#                          columns=["usd_goal_real_log10", "category"])
# df_y_test = pd.DataFrame(y_test,
#                          columns=["state_dummy"])


# Create dummy variables for category using train data
# Replace category to category_success_rate
category_success_rate = {}
df_category_successful_count = df_X_train['category'][df_y_train['state_dummy'] == 1].value_counts()
df_category_all_count = df_X_train['category'].value_counts()
for category in df_category_all_count.keys():
    category_success_rate[category] = df_category_successful_count[category] / df_category_all_count[category]
df_X_train['category_dummy'] = df_X_train['category'].replace(category_success_rate)
df_X_test['category_dummy'] = df_X_test['category'].replace(category_success_rate)



X_train = df_X_train[["usd_goal_real_log10", "category_dummy"]].values
X_test = df_X_test[["usd_goal_real_log10", "category_dummy"]].values

# 1-D label arrays, as expected by scikit-learn estimators
y_train = df_y_train["state_dummy"].values
y_test = df_y_test["state_dummy"].values

# Standardization (zero mean, unit variance), fitted on the training data only
stdsc = StandardScaler()
X_train = stdsc.fit_transform(X_train)
X_test = stdsc.transform(X_test)

clf = SGDClassifier(loss='log', penalty=penalty, alpha=alpha, max_iter=100, fit_intercept=True, random_state=1234)
clf.fit(X_train, y_train)

# Weight
w0 = clf.intercept_[0]
w1 = clf.coef_[0, 0]
w2 = clf.coef_[0, 1]
print('w0 = {:.3f}, w1 = {:.3f}, w2 = {:.3f}'.format(w0, w1, w2))


# Predict labels
y_est_test = clf.predict(X_test)

# Log-likelihood
log_likelihood = - log_loss(y_test, y_est_test)       
print('Test Log-likelihood = {:.3f}'.format(log_likelihood))

# Accuracy
accuracy = accuracy_score(y_test, y_est_test)  
print('Test Accuracy = {:.3f}%'.format(100 * accuracy))
print()

plot_decision_regions(X_test, y_test.flatten(), classifier=clf)
plt.title('(Final Test)  L1 alpha = %s' %alpha)
plt.xlabel('usd_goal_real_log10_stdsc')
plt.ylabel('category_dummy_stdsc')
plt.gca().set_aspect('equal', 'datalim')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()

    

# 3. Parameter study for Logistic Regression (L2)
- Model
    - Variables
        - Objective variable
            - state_dummy (successful = 1, other = 0)
        - Explanatory variables
            - usd_goal_real_log10
                - Second-largest contributor to the prediction
            - category_dummy
                - Created during training by replacing each category with its success rate in the training data
                - Largest contributor to the prediction
    - 5-fold cross-validation (train 80% / validation 20% in each fold) is applied.
    - **L2 regularization is applied here.**
- Cross Validation Result
    - Best parameter
        - **alpha = 1e-3  (1e-8 <= alpha <= 1e-1)**
    - Best score
        - **Cross Validation Log-likelihood = -10.885**
        - **Cross Validation Accuracy = 68.484%**
- Final Test Result (Applied best parameter)
    - Test score
        - **Test Log-likelihood = -10.731**
        - **Test Accuracy = 68.932%**
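
For reference, here is a sketch of the two penalty terms that `alpha` scales, following the definitions in the scikit-learn SGD user guide (L1: sum of absolute weights; L2: half the sum of squared weights). The weight values are made up for illustration and are not the fitted weights of this notebook.

```python
import numpy as np


def l1_penalty(w, alpha):
    """alpha * sum(|w_j|): tends to push individual weights to exactly zero."""
    return alpha * np.sum(np.abs(w))


def l2_penalty(w, alpha):
    """alpha * 0.5 * sum(w_j^2): shrinks all weights smoothly."""
    return alpha * 0.5 * np.sum(np.square(w))


w = np.array([-0.4, 0.6])  # hypothetical weights for the two explanatory variables
alpha = 1e-3

print(l1_penalty(w, alpha), l2_penalty(w, alpha))
```

With only two standardized features and a small `alpha`, both penalties are tiny compared with the data term, which fits the nearly identical L1 and L2 scores reported above.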

## Cross-Validation

In [None]:
penalty = 'l2'

alphas_multiply = np.array(range(-8,0))
alphas = 10.0 ** alphas_multiply

L2_accuracy = []
L2_log_likelihood = []
L2_weight_abs_max = []
L2_weight_abs_min = []


for alpha in alphas:
    
    print('='*100)
    print('penalty =', penalty)
    print('alpha =', alpha)
    print()

    y = df_TRAIN["state_dummy"].values
    X = df_TRAIN[["usd_goal_real_log10", "category"]].values

    n_split = 5 # Number of group

    cross_valid_log_likelihood = 0
    cross_valid_accuracy = 0
    split_num = 1

    # Cross Validation (folds are not shuffled; random_state only applies when shuffle=True)
    for train_idx, valid_idx in KFold(n_splits=n_split).split(X, y):
        X_train, y_train = X[train_idx], y[train_idx] # Train data
        X_valid, y_valid = X[valid_idx], y[valid_idx] # Validation data

        df_X_train = pd.DataFrame(X_train,
                                 columns=["usd_goal_real_log10", "category"])
        df_y_train = pd.DataFrame(y_train,
                                 columns=["state_dummy"])

        df_X_valid = pd.DataFrame(X_valid,
                                 columns=["usd_goal_real_log10", "category"])
        df_y_valid = pd.DataFrame(y_valid,
                                 columns=["state_dummy"])




        # Create dummy variables for category using train data
        # Replace category to category_success_rate
        category_success_rate = {}
        df_category_successful_count = df_X_train['category'][df_y_train['state_dummy'] == 1].value_counts()
        df_category_all_count = df_X_train['category'].value_counts()
        for category in df_category_all_count.keys():
            category_success_rate[category] = df_category_successful_count[category] / df_category_all_count[category]
        df_X_train['category_dummy'] = df_X_train['category'].replace(category_success_rate)
        df_X_valid['category_dummy'] = df_X_valid['category'].replace(category_success_rate)


        print("Fold %s"%split_num)

        X_train = df_X_train[["usd_goal_real_log10", "category_dummy"]].values
        X_valid = df_X_valid[["usd_goal_real_log10", "category_dummy"]].values
        
        # Standardization (zero mean, unit variance), fitted on the training data only
        stdsc = StandardScaler()
        X_train = stdsc.fit_transform(X_train)
        X_valid = stdsc.transform(X_valid)

        clf = SGDClassifier(loss='log', penalty=penalty, alpha=alpha, max_iter=100, fit_intercept=True, random_state=1234)
        clf.fit(X_train, y_train)

        # Weight
        w0 = clf.intercept_[0]
        w1 = clf.coef_[0, 0]
        w2 = clf.coef_[0, 1]
        print('w0 = {:.3f}, w1 = {:.3f}, w2 = {:.3f}'.format(w0, w1, w2))


        # Predict labels
        y_est_valid = clf.predict(X_valid)

        # Log-likelihood
        log_likelihood = - log_loss(y_valid, y_est_valid)    
        cross_valid_log_likelihood += log_likelihood    
        print('Log-likelihood = {:.3f}'.format(log_likelihood))

        # Accuracy
        accuracy = accuracy_score(y_valid, y_est_valid)
        cross_valid_accuracy += accuracy   
        print('Accuracy = {:.3f}%'.format(100 * accuracy))
        print()
        
        if split_num == n_split:
            plot_decision_regions(X_valid, y_valid, classifier=clf)
            plt.title('(Fold %s)  L2 alpha = %s' %(split_num,alpha))
            plt.xlabel('usd_goal_real_log10_stdsc')
            plt.ylabel('category_dummy_stdsc')
            plt.gca().set_aspect('equal', 'datalim')
            plt.legend(loc='upper right')
            plt.tight_layout()
            plt.show()

        split_num += 1

    # Generalization performance
    final_log_likelihood = cross_valid_log_likelihood / n_split
    print("Cross Validation Log-likelihood = %s"%round(final_log_likelihood, 3))
    final_accuracy = cross_valid_accuracy / n_split
    print('Cross Validation Accuracy = {:.3f}%'.format(100 * final_accuracy))
    
    L2_accuracy.append(final_accuracy)
    L2_log_likelihood.append(final_log_likelihood)
    L2_weight_abs_max.append(np.max(np.abs(clf.coef_)))
    L2_weight_abs_min.append(np.min(np.abs(clf.coef_)))
    

In [None]:
plt.plot(alphas_multiply, L2_accuracy, marker='o')
plt.title("L2 alpha")
plt.xlabel("Log10(alpha)")
plt.ylabel("Accuracy")
plt.show()

In [None]:
plt.plot(alphas_multiply, L2_log_likelihood, marker='o')
plt.title("L2 alpha")
plt.xlabel("Log10(alpha)")
plt.ylabel("Log-likelihood")
plt.show()

In [None]:
plt.plot(alphas_multiply, L2_weight_abs_max, marker='o', label='Weight_abs_max')
plt.plot(alphas_multiply, L2_weight_abs_min, marker='o', label='Weight_abs_min')
plt.title("L2 alpha")
plt.xlabel("Log10(alpha)")
plt.ylabel("Weight_abs_max_min")
plt.legend()
plt.show()

## Final Test

In [None]:
penalty = 'l2'

# Best Parameter
alpha = 1e-3


print('penalty =', penalty)
print('alpha =', alpha)
print()

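# TRAIN data (copied so that adding columns below does not touch df_TRAIN)
df_y_train = df_TRAIN[["state_dummy"]].copy()
df_X_train = df_TRAIN[["usd_goal_real_log10", "category"]].copy()

# TEST data (copied for the same reason)
df_y_test = df_TEST[["state_dummy"]].copy()
df_X_test = df_TEST[["usd_goal_real_log10", "category"]].copy()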


# df_X_train = pd.DataFrame(X_train,
#                          columns=["usd_goal_real_log10", "category"])
# df_y_train = pd.DataFrame(y_train,
#                          columns=["state_dummy"])

# df_X_test = pd.DataFrame(X_test,
#                          columns=["usd_goal_real_log10", "category"])
# df_y_test = pd.DataFrame(y_test,
#                          columns=["state_dummy"])


# Create dummy variables for category using train data
# Replace category to category_success_rate
category_success_rate = {}
df_category_successful_count = df_X_train['category'][df_y_train['state_dummy'] == 1].value_counts()
df_category_all_count = df_X_train['category'].value_counts()
for category in df_category_all_count.keys():
    category_success_rate[category] = df_category_successful_count[category] / df_category_all_count[category]
df_X_train['category_dummy'] = df_X_train['category'].replace(category_success_rate)
df_X_test['category_dummy'] = df_X_test['category'].replace(category_success_rate)



X_train = df_X_train[["usd_goal_real_log10", "category_dummy"]].values
X_test = df_X_test[["usd_goal_real_log10", "category_dummy"]].values

# 1-D label arrays, as expected by scikit-learn estimators
y_train = df_y_train["state_dummy"].values
y_test = df_y_test["state_dummy"].values

# Standardization (zero mean, unit variance), fitted on the training data only
stdsc = StandardScaler()
X_train = stdsc.fit_transform(X_train)
X_test = stdsc.transform(X_test)

clf = SGDClassifier(loss='log', penalty=penalty, alpha=alpha, max_iter=100, fit_intercept=True, random_state=1234)
clf.fit(X_train, y_train)

# Weight
w0 = clf.intercept_[0]
w1 = clf.coef_[0, 0]
w2 = clf.coef_[0, 1]
print('w0 = {:.3f}, w1 = {:.3f}, w2 = {:.3f}'.format(w0, w1, w2))


# Predict labels
y_est_test = clf.predict(X_test)

# Log-likelihood
log_likelihood = - log_loss(y_test, y_est_test)       
print('Test Log-likelihood = {:.3f}'.format(log_likelihood))

# Accuracy
accuracy = accuracy_score(y_test, y_est_test)  
print('Test Accuracy = {:.3f}%'.format(100 * accuracy))
print()

plot_decision_regions(X_test, y_test.flatten(), classifier=clf)
plt.title('(Final Test)  L2 alpha = %s' %alpha)
plt.xlabel('usd_goal_real_log10_stdsc')
plt.ylabel('category_dummy_stdsc')
plt.gca().set_aspect('equal', 'datalim')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()

    

# 4. Parameter study for SVM (Linear)
- Model
    - Variables
        - Objective variable
            - state_dummy (successful = 1, other = 0)
        - Explanatory variables
            - usd_goal_real_log10
                - Second-largest contributor to the prediction
            - category_dummy
                - Created during training by replacing each category with its success rate in the training data
                - Largest contributor to the prediction
    - **SVM (Linear) is applied here.**
    - **The holdout method is used for the SVM parameter study instead of cross-validation to shorten the computation time.**
- Holdout Result
    - Best parameter
        - **C = 1e-4  (1e-8 <= C <= 1e3)**
    - Best score
        - **Holdout Log-likelihood = -13.752**
        - **Holdout Accuracy = 60.184%**
- Final Test Result (Applied best parameter)
    - Test score
        - **Test Log-likelihood = -13.141**
        - **Test Accuracy = 61.953%**
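
The SVM cells cap the solver at `max_iter=2000`. Whether that cap was reached in these runs is not recorded in the notebook, but scikit-learn emits a `ConvergenceWarning` when the solver stops early, so it can be checked. Below is a minimal sketch on synthetic data; `fit_and_report` is a hypothetical helper, and the tiny `max_iter=5` is chosen only to force the warning.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import SVC


def fit_and_report(clf, X, y):
    """Fit clf and return True if the solver stopped at its iteration cap."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf.fit(X, y)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)


# Synthetic, heavily overlapping two-class data (not the Kickstarter data)
rng = np.random.RandomState(1234)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + rng.normal(scale=2.0, size=300) > 0).astype(int)

capped = fit_and_report(SVC(kernel="linear", C=1.0, max_iter=5), X, y)
uncapped = fit_and_report(SVC(kernel="linear", C=1.0, max_iter=-1), X, y)
print("stopped early with max_iter=5:", capped)
print("stopped early without a cap:", uncapped)
```

If the warning appears for the real data, the fitted boundary is whatever the solver had reached when it stopped, so the scores for that `C` should be read with care.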

## Holdout Method

In [None]:
kernel = 'linear'
# Cs = [0.001, 0.01, 0.1, 1, 10]
Cs = np.logspace(-8, 3, 12, base=10)

SVM_accuracy = []
SVM_log_likelihood = []
SVM_weight_abs_max = []
SVM_weight_abs_min = []


for C in Cs:
    
    print('='*100)
    print('kernel =', kernel)
    print('C =', C)
    print()
    
    y = df_TRAIN["state_dummy"].values
    X = df_TRAIN[["usd_goal_real_log10", "category"]].values

    # Holdout method
    test_size = 0.2        # 20%
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=test_size, random_state=1234) # Holdout


    df_X_train = pd.DataFrame(X_train,
                             columns=["usd_goal_real_log10", "category"])
    df_y_train = pd.DataFrame(y_train,
                             columns=["state_dummy"])

    df_X_valid = pd.DataFrame(X_valid,
                             columns=["usd_goal_real_log10", "category"])
    df_y_valid = pd.DataFrame(y_valid,
                             columns=["state_dummy"])


    # Create dummy variables for category using train data
    # Replace category to category_success_rate
    category_success_rate = {}
    df_category_successful_count = df_X_train['category'][df_y_train['state_dummy'] == 1].value_counts()
    df_category_all_count = df_X_train['category'].value_counts()
    for category in df_category_all_count.keys():
        category_success_rate[category] = df_category_successful_count[category] / df_category_all_count[category]
    df_X_train['category_dummy'] = df_X_train['category'].replace(category_success_rate)
    df_X_valid['category_dummy'] = df_X_valid['category'].replace(category_success_rate)

    X_train = df_X_train[["usd_goal_real_log10", "category_dummy"]].values
    X_valid = df_X_valid[["usd_goal_real_log10", "category_dummy"]].values
    
    # Standardization (zero mean, unit variance), fitted on the training data only
    stdsc = StandardScaler()
    X_train = stdsc.fit_transform(X_train)
    X_valid = stdsc.transform(X_valid)

    clf = SVC(C=C,kernel=kernel, max_iter=2000, random_state=1234)
    clf.fit(X_train, y_train)

    # Predict labels
    y_est_valid = clf.predict(X_valid)

    # Log-likelihood
    holdout_log_likelihood = - log_loss(y_valid, y_est_valid)         

    # Accuracy
    holdout_accuracy = accuracy_score(y_valid, y_est_valid)
    
    plot_decision_regions(X_valid, y_valid, classifier=clf)
    plt.title('(Holdout)  SVM Linear C = %s' %C)
    plt.xlabel('usd_goal_real_log10_stdsc')
    plt.ylabel('category_dummy_stdsc')
    plt.gca().set_aspect('equal', 'datalim')
    plt.legend(loc='upper right')
    plt.tight_layout()
    plt.show()

    # Generalization performance
    final_log_likelihood = holdout_log_likelihood
    print("Holdout Log-likelihood = %s"%round(final_log_likelihood, 3))
    final_accuracy = holdout_accuracy
    print('Holdout Accuracy = {:.3f}%'.format(100 * final_accuracy))
    
    SVM_accuracy.append(final_accuracy)
    SVM_log_likelihood.append(final_log_likelihood)
    SVM_weight_abs_max.append(np.max(np.abs(clf.coef_)))
    SVM_weight_abs_min.append(np.min(np.abs(clf.coef_)))
    

In [None]:
plt.plot(Cs, SVM_accuracy, marker='o')
plt.title("SVM kernel = linear")
plt.xlabel("C")
plt.xscale('log')
plt.ylabel("Accuracy")
plt.show()

In [None]:
plt.plot(Cs, SVM_log_likelihood, marker='o')
plt.title("SVM kernel = linear")
plt.xlabel("C")
plt.xscale('log')
plt.ylabel("Log-likelihood")
plt.show()

In [None]:
plt.plot(Cs, SVM_weight_abs_max, marker='o', label='Weight_abs_max')
plt.plot(Cs, SVM_weight_abs_min, marker='o', label='Weight_abs_min')
plt.title("SVM kernel = linear")
plt.xlabel("C")
plt.xscale('log')
plt.ylabel("Weight_abs_max_min")
plt.legend()
plt.show()

## Final Test

In [None]:
kernel = 'linear'

# Best Parameter
C = 1e-4


print('kernel =', kernel)
print('C =', C)
print()

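# TRAIN data (copied so that adding columns below does not touch df_TRAIN)
df_y_train = df_TRAIN[["state_dummy"]].copy()
df_X_train = df_TRAIN[["usd_goal_real_log10", "category"]].copy()

# TEST data (copied for the same reason)
df_y_test = df_TEST[["state_dummy"]].copy()
df_X_test = df_TEST[["usd_goal_real_log10", "category"]].copy()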

# Create dummy variables for category using train data
# Replace category to category_success_rate
category_success_rate = {}
df_category_successful_count = df_X_train['category'][df_y_train['state_dummy'] == 1].value_counts()
df_category_all_count = df_X_train['category'].value_counts()
for category in df_category_all_count.keys():
    category_success_rate[category] = df_category_successful_count[category] / df_category_all_count[category]
df_X_train['category_dummy'] = df_X_train['category'].replace(category_success_rate)
df_X_test['category_dummy'] = df_X_test['category'].replace(category_success_rate)



X_train = df_X_train[["usd_goal_real_log10", "category_dummy"]].values
X_test = df_X_test[["usd_goal_real_log10", "category_dummy"]].values

# 1-D label arrays, as expected by scikit-learn estimators
y_train = df_y_train["state_dummy"].values
y_test = df_y_test["state_dummy"].values

# print(X_test.shape)
# print(y_test.flatten().shape)

# print(X_valid.shape)
# print(y_valid.shape)

# Standardization (zero mean, unit variance), fitted on the training data only
stdsc = StandardScaler()
X_train = stdsc.fit_transform(X_train)
X_test = stdsc.transform(X_test)

clf = SVC(C=C,kernel=kernel, max_iter=2000, random_state=1234)
clf.fit(X_train, y_train)

# Predict labels
y_est_test = clf.predict(X_test)

# Log-likelihood
log_likelihood = - log_loss(y_test, y_est_test)       
print('Test Log-likelihood = {:.3f}'.format(log_likelihood))

# Accuracy
accuracy = accuracy_score(y_test, y_est_test)  
print('Test Accuracy = {:.3f}%'.format(100 * accuracy))
print()

plot_decision_regions(X_test, y_test.flatten(), classifier=clf)
plt.title('(Final Test)  SVM Linear C = %s' %C)
plt.xlabel('usd_goal_real_log10_stdsc')
plt.ylabel('category_dummy_stdsc')
plt.gca().set_aspect('equal', 'datalim')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()

    

# 5. Parameter study for SVM (RBF)
- Model
    - Variables
        - Objective variable
            - state_dummy (successful = 1, other = 0)
        - Explanatory variables
            - usd_goal_real_log10
                - Second-largest contributor to the prediction
            - category_dummy
                - Created during training by replacing each category with its success rate in the training data
                - Largest contributor to the prediction
    - **SVM (RBF) is applied here.**
    - **The holdout method is used for the SVM parameter study instead of cross-validation to shorten the computation time.**
- Holdout Result
    - Best parameter
        - **C = 1e2  (1e-4 <= C <= 1e2)**
        - **gamma = 1e-8  (1e-10 <= gamma <= 1e-4)**
    - Best score
        - **Holdout Log-likelihood = -10.894**
        - **Holdout Accuracy = 68.460%**
- Final Test Result (Applied best parameter)
    - Test score
        - **Test Log-likelihood = -13.727**
        - **Test Accuracy = 60.256%**
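
To get a feeling for the searched gamma range, here is a small sketch of the RBF kernel value exp(-gamma * ||x - z||^2) for two points that are far apart on the standardized axes. The points and the helper name are illustrative assumptions. For gamma between 1e-10 and 1e-4 the kernel stays very close to 1 for any pair of standardized samples, so the model can hardly tell near points from far ones; this may be one reason the RBF boundary is not sharper than the linear one here.

```python
import numpy as np


def rbf_kernel_value(x, z, gamma):
    """RBF kernel exp(-gamma * squared Euclidean distance) for two points."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


# Two hypothetical standardized samples at opposite corners (squared distance = 32)
x = [-2.0, -2.0]
z = [2.0, 2.0]

for gamma in (1e-8, 1e-4, 1.0):
    print(gamma, rbf_kernel_value(x, z, gamma))
```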

## Holdout Method

In [None]:
kernel = 'rbf'
Cs = np.logspace(-4, 2, 7, base=10)
gammas = np.logspace(-10, -4, 7, base=10)

SVM_accuracy = []
SVM_log_likelihood = []
SVM_weight_abs_max = []
SVM_weight_abs_min = []


for C in Cs:
    for gamma in gammas:

        print('='*100)
        print('kernel =', kernel)
        print('C =', C)
        print('gamma =', gamma)
        print()

        y = df_TRAIN["state_dummy"].values
        X = df_TRAIN[["usd_goal_real_log10", "category"]].values

        # Holdout method
        test_size = 0.2        # 20%
        X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=test_size, random_state=1234) # Holdout


        df_X_train = pd.DataFrame(X_train,
                                 columns=["usd_goal_real_log10", "category"])
        df_y_train = pd.DataFrame(y_train,
                                 columns=["state_dummy"])

        df_X_valid = pd.DataFrame(X_valid,
                                 columns=["usd_goal_real_log10", "category"])
        df_y_valid = pd.DataFrame(y_valid,
                                 columns=["state_dummy"])


        # Create dummy variables for category using train data
        # Replace category to category_success_rate
        category_success_rate = {}
        df_category_successful_count = df_X_train['category'][df_y_train['state_dummy'] == 1].value_counts()
        df_category_all_count = df_X_train['category'].value_counts()
        for category in df_category_all_count.keys():
            category_success_rate[category] = df_category_successful_count[category] / df_category_all_count[category]
        df_X_train['category_dummy'] = df_X_train['category'].replace(category_success_rate)
        df_X_valid['category_dummy'] = df_X_valid['category'].replace(category_success_rate)

        X_train = df_X_train[["usd_goal_real_log10", "category_dummy"]].values
        X_valid = df_X_valid[["usd_goal_real_log10", "category_dummy"]].values

        # Standardization (zero mean, unit variance), fitted on the training data only
        stdsc = StandardScaler()
        X_train = stdsc.fit_transform(X_train)
        X_valid = stdsc.transform(X_valid)

        clf = SVC(C=C,kernel=kernel, gamma=gamma, max_iter=2000, random_state=1234)
        clf.fit(X_train, y_train)

        # Predict labels
        y_est_valid = clf.predict(X_valid)

        # Log-likelihood
        holdout_log_likelihood = - log_loss(y_valid, y_est_valid)         

        # Accuracy
        holdout_accuracy = accuracy_score(y_valid, y_est_valid)

        plot_decision_regions(X_valid, y_valid, classifier=clf)
        plt.title('(Holdout)  SVM RBF C = %s, gamma = %s' %(C, gamma))
        plt.xlabel('usd_goal_real_log10_stdsc')
        plt.ylabel('category_dummy_stdsc')
        plt.gca().set_aspect('equal', 'datalim')
        plt.legend(loc='upper right')
        plt.tight_layout()
        plt.show()

        # Generalization performance
        final_log_likelihood = holdout_log_likelihood
        print("Holdout Log-likelihood = %s"%round(final_log_likelihood, 3))
        final_accuracy = holdout_accuracy
        print('Holdout Accuracy = {:.3f}%'.format(100 * final_accuracy))

        SVM_accuracy.append(final_accuracy)
        SVM_log_likelihood.append(final_log_likelihood)

    

In [None]:
SVM_accuracy_2d = np.array(SVM_accuracy).reshape(len(Cs), len(gammas))  # rows: C, columns: gamma
Cs_str = list(map(str, Cs))
gammas_str = list(map(str, gammas))
df_SVM_accuracy_2d = pd.DataFrame(data=SVM_accuracy_2d, index=Cs_str, columns=gammas_str)
df_SVM_accuracy_2d

In [None]:
sns.heatmap(df_SVM_accuracy_2d, cmap='rainbow')
plt.ylabel("C")
plt.xlabel("gamma")
plt.title('Accuracy')
plt.show()

## Final Test

In [None]:
kernel = 'rbf'

# Best Parameter
C = 100
gamma = 1e-8


print('kernel =', kernel)
print('C =', C)
print('gamma =', gamma)
print()

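# TRAIN data (copied so that adding columns below does not touch df_TRAIN)
df_y_train = df_TRAIN[["state_dummy"]].copy()
df_X_train = df_TRAIN[["usd_goal_real_log10", "category"]].copy()

# TEST data (copied for the same reason)
df_y_test = df_TEST[["state_dummy"]].copy()
df_X_test = df_TEST[["usd_goal_real_log10", "category"]].copy()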

# Create dummy variables for category using train data
# Replace category to category_success_rate
category_success_rate = {}
df_category_successful_count = df_X_train['category'][df_y_train['state_dummy'] == 1].value_counts()
df_category_all_count = df_X_train['category'].value_counts()
for category in df_category_all_count.keys():
    category_success_rate[category] = df_category_successful_count[category] / df_category_all_count[category]
df_X_train['category_dummy'] = df_X_train['category'].replace(category_success_rate)
df_X_test['category_dummy'] = df_X_test['category'].replace(category_success_rate)



X_train = df_X_train[["usd_goal_real_log10", "category_dummy"]].values
X_test = df_X_test[["usd_goal_real_log10", "category_dummy"]].values

# 1-D label arrays, as expected by scikit-learn estimators
y_train = df_y_train["state_dummy"].values
y_test = df_y_test["state_dummy"].values

# print(X_test.shape)
# print(y_test.flatten().shape)

# print(X_valid.shape)
# print(y_valid.shape)

# Standardization (zero mean, unit variance), fitted on the training data only
stdsc = StandardScaler()
X_train = stdsc.fit_transform(X_train)
X_test = stdsc.transform(X_test)

clf = SVC(C=C,kernel=kernel, gamma=gamma, max_iter=2000, random_state=1234)
clf.fit(X_train, y_train)

# Predict labels
y_est_test = clf.predict(X_test)

# Log-likelihood
log_likelihood = - log_loss(y_test, y_est_test)       
print('Test Log-likelihood = {:.3f}'.format(log_likelihood))

# Accuracy
accuracy = accuracy_score(y_test, y_est_test)  
print('Test Accuracy = {:.3f}%'.format(100 * accuracy))
print()

plot_decision_regions(X_test, y_test.flatten(), classifier=clf)
plt.title('(Final Test)  SVM RBF C = %s, gamma = %s' %(C, gamma))
plt.xlabel('usd_goal_real_log10_stdsc')
plt.ylabel('category_dummy_stdsc')
plt.gca().set_aspect('equal', 'datalim')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()

    