# Task Instructions

Step 0. Import **ALL** packages you need in **ONE** cell   

Step 1. Load Data

Step 2. Model Comparison and Discussion 

Step 3. Conclusion

 

# Step 0. Import **ALL** packages you need in **ONE** cell  

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score,accuracy_score
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

import csv
import time

def generate_csv(ypredict, filename = 'result.csv'):
    # Write the header and every prediction through a single file handle
    with open(filename,'w',newline = '') as fd:
        writer = csv.writer(fd)
        writer.writerow(['index','default.payment.next.month'])
        writer.writerows(enumerate(ypredict))

def show_result(ground_truth, prediction):
    # Print the evaluation metrics and return accuracy for model selection
    accuracy = accuracy_score(ground_truth, prediction)
    print("Accuracy: ", accuracy)
    print("Precision: ",precision_score(ground_truth, prediction))
    print("Recall: ",recall_score(ground_truth, prediction))
    print("Confusion Matrix: ")
    print(confusion_matrix(ground_truth, prediction))
    return accuracy

# Step 1. Load Data

In [2]:
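# Features: drop the first and last columns of the training file
train_x = pd.read_csv('data_train.csv')
train_x = train_x.iloc[:,1:-1]
# Labels: the last column of the answer file
train_y = pd.read_csv('answer_train.csv')
train_y = train_y.iloc[:,-1]
# Apply the same column selection to the test file
test_x = pd.read_csv('data_test.csv')
test_x = test_x.iloc[:,1:-1]
# Names of the PAY-related columns (not used by the cells below)
columns_to_keep = [col for col in train_x.columns if 'PAY' in col]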

train_split_x, valid_x, train_split_y, valid_y = train_test_split(train_x, train_y, test_size = 0.2, random_state = 1)

# Step 2. Algorithms Comparison and Discussion 

**"In addition to the parameters listed, please provide an analytical discussion for each model as described below. You can also supplement any other parameters that were found to have an impact on the model during the process."**

* Linear Regression: L1/L2 - How the feature weights differ under L1 and L2 regularization
* Decision Tree: IG/Gini - How the results differ between the two split criteria
* Support Vector Machine: Gamma/C - The effect of each of the two parameters on the model
* K-Nearest Neighbor: K - The effect of different K values on the model


### Linear Regression ###

In [6]:
#L1
model=Lasso(alpha = 0.1)
model.fit(train_split_x, train_split_y)

ypredict = model.predict(valid_x).round()
ypredict[ypredict>1] = 1
ypredict[ypredict<1] = 0
show_result(valid_y, ypredict)

model=Lasso(alpha = 0.1)
model.fit(train_x, train_y)
ypredict=model.predict(test_x).round()
ypredict[ypredict>1] = 1
ypredict[ypredict<1] = 0
generate_csv(ypredict, 'L1_Result.csv')

print("Coefficients:")
for feature, coef in zip(train_x.columns, model.coef_):
    print(f"{feature}: {coef}")

Accuracy:  0.7895833333333333
Precision:  0.0
Recall:  0.0
Confusion Matrix: 
[[3790    0]
 [1010    0]]



Coefficients:
LIMIT_BAL: -3.8621394530226406e-07
SEX: -0.0
EDUCATION: -0.0
MARRIAGE: -0.0
AGE: 0.0003404761198594449
PAY_0: 0.022401881075170705
PAY_2: 0.0
PAY_3: 0.0
PAY_4: 0.0
PAY_5: 0.0
PAY_6: 0.0
BILL_AMT1: -7.330477628529095e-07
BILL_AMT2: 4.883057226018422e-07
BILL_AMT3: 8.583291115355316e-08
BILL_AMT4: -3.255685105450045e-08
BILL_AMT5: -9.743645601268303e-10
BILL_AMT6: 5.131143588389602e-07
PAY_AMT1: -1.2832670245627844e-06
PAY_AMT2: -3.7693352700557333e-07
PAY_AMT3: -1.81737142041726e-07
PAY_AMT4: -5.309567091377829e-07
PAY_AMT5: -8.237826765092356e-07
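
The rounding-and-clipping steps in the cell above turn continuous regression outputs into 0/1 labels. As a minimal sketch, assuming plain Python floats instead of the NumPy array used in the notebook, the same mapping can be compared against a single `> 0.5` cutoff. The helper names and sample scores are illustrative and do not come from the notebook.

```python
def round_and_clip(score):
    """Mimic `.round()` followed by the two clipping assignments."""
    label = round(score)      # rounds half to even, as NumPy's round does
    if label > 1:
        label = 1
    if label < 1:
        label = 0
    return label

def cutoff(score):
    """Single-threshold alternative: positive only above 0.5."""
    return 1 if score > 0.5 else 0

scores = [-0.3, 0.2, 0.5, 0.51, 1.0, 1.5, 2.7]   # illustrative values
labels = [round_and_clip(s) for s in scores]
print(labels)
print([cutoff(s) for s in scores])
```

Both mappings agree on these scores. Since the Lasso confusion matrix contains no positive predictions, every validation score must have landed at or below that cutoff.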


In [15]:
#L2
model=Ridge(alpha = 100)
model.fit(train_split_x, train_split_y)

ypredict = model.predict(valid_x).round()
ypredict[ypredict>1] = 1
ypredict[ypredict<1] = 0
show_result(valid_y, ypredict)

# NOTE: the validation run above uses alpha = 100, while this final model uses alpha = 0.1
model=Ridge(alpha = 0.1)
model.fit(train_x, train_y)
ypredict=model.predict(test_x).round()
ypredict[ypredict>1] = 1
ypredict[ypredict<1] = 0
generate_csv(ypredict, 'L2_Result.csv')

print("Coefficients:")
for feature, coef in zip(train_x.columns, model.coef_):
    print(f"{feature}: {coef}")

Accuracy:  0.80875
Precision:  0.7277227722772277
Recall:  0.14554455445544554
Confusion Matrix: 
[[3735   55]
 [ 863  147]]
Coefficients:
LIMIT_BAL: -7.429199713771168e-08
SEX: -0.01425680187717235
EDUCATION: -0.014601901052716571
MARRIAGE: -0.024057311147190133
AGE: 0.0013108988209303875
PAY_0: 0.09524280305010707
PAY_2: 0.018284892840090366
PAY_3: 0.014216678008888185
PAY_4: 0.0009659921663431533
PAY_5: 0.008554273190546993
PAY_6: 0.0007388271315333011
BILL_AMT1: -7.652112485557357e-07
BILL_AMT2: 2.752711639393941e-07
BILL_AMT3: 7.374113850745717e-08
BILL_AMT4: -1.1944062769853967e-07
BILL_AMT5: -1.672200671223425e-07
BILL_AMT6: 2.736489642666652e-07
PAY_AMT1: -8.473424323207162e-07
PAY_AMT2: -2.391331199043437e-07
PAY_AMT3: 3.884958050195843e-08
PAY_AMT4: -2.005214981109745e-07
PAY_AMT5: -5.806097865513612e-07
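
The two coefficient listings differ in a characteristic way: Lasso sets several weights (SEX, EDUCATION, MARRIAGE, PAY_2 to PAY_6) to exactly zero, while Ridge keeps them small but non-zero. The sketch below gives a simplified view of why, assuming an idealized orthonormal design in which each penalized weight has a closed form. The function names, penalty strength, and starting weights are hypothetical, and this is not how scikit-learn's solvers work internally.

```python
def l1_shrink(w, alpha):
    """Soft-thresholding: the L1-penalized weight in the orthonormal case."""
    if w > alpha:
        return w - alpha
    if w < -alpha:
        return w + alpha
    return 0.0                # weights inside [-alpha, alpha] are zeroed

def l2_shrink(w, alpha):
    """Proportional shrinkage: the L2-penalized weight in the orthonormal case."""
    return w / (1.0 + alpha)

unpenalized = [0.095, 0.018, -0.014, 0.001]   # hypothetical unpenalized weights
alpha = 0.02                                   # hypothetical penalty strength

l1 = [l1_shrink(w, alpha) for w in unpenalized]
l2 = [l2_shrink(w, alpha) for w in unpenalized]
print("L1:", l1)
print("L2:", l2)
```

Under these assumptions, the L1 penalty removes any weight whose magnitude is below the penalty, whereas the L2 penalty only scales weights down and never reaches exactly zero.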


### Decision Tree ###

In [10]:
#information gain
model=DecisionTreeClassifier(criterion = 'entropy', max_depth = 4, random_state=1)
model.fit(train_split_x, train_split_y)

ypredict = model.predict(valid_x)
show_result(valid_y, ypredict)

model=DecisionTreeClassifier(criterion = 'entropy', max_depth = 4, random_state=1)
model.fit(train_x, train_y)
ypredict=model.predict(test_x)
generate_csv(ypredict, 'DTIG_Result.csv')

Accuracy:  0.82875
Precision:  0.6721611721611722
Recall:  0.36336633663366336
Confusion Matrix: 
[[3611  179]
 [ 643  367]]


In [11]:
#gini impurity
model=DecisionTreeClassifier(criterion = 'gini', max_depth = 4, random_state=1)
model.fit(train_split_x, train_split_y)

ypredict = model.predict(valid_x)
show_result(valid_y, ypredict)

model=DecisionTreeClassifier(criterion = 'gini', max_depth = 4, random_state=1)
model.fit(train_x, train_y)
ypredict=model.predict(test_x)
generate_csv(ypredict, 'DTgini_Result.csv')

Accuracy:  0.8277083333333334
Precision:  0.6678899082568808
Recall:  0.3603960396039604
Confusion Matrix: 
[[3609  181]
 [ 646  364]]
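
Both criteria score how mixed a node is, which is one plausible reason the two trees above behave almost identically. A minimal sketch of the two impurity measures for a binary node follows, using the validation split's class counts (3790 negatives, 1010 positives) as the example node. The function names are illustrative.

```python
import math

def gini(p):
    """Gini impurity of a binary node with positive-class fraction p."""
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def entropy(p):
    """Shannon entropy in bits of a binary node; 0 * log(0) is taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

p_root = 1010 / (3790 + 1010)   # positive fraction in the validation split
print("gini:   ", gini(p_root))
print("entropy:", entropy(p_root))
```

Both measures are zero for a pure node and peak at p = 0.5, so they tend to rank candidate splits in a similar order.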


### Support Vector Machine ###

In [None]:
# Training the SVM model
start = time.time()
model = SVC(kernel='linear', random_state=1)
model.fit(train_split_x, train_split_y)
print("Training time: ", time.time() - start)
print()

# Evaluating the SVM model on the validation set
ypredict = model.predict(valid_x)
show_result(valid_y, ypredict)

# Generating CSV file for the SVM model's predictions on the test set
model = SVC(kernel='linear', random_state=1)
model.fit(train_x, train_y)
ypredict = model.predict(test_x)
generate_csv(ypredict, 'SVM_Result.csv')

In [13]:
# Training the SVM model
start = time.time()
model = SVC(kernel='rbf', class_weight = 'balanced', random_state=1)
model.fit(train_split_x, train_split_y)
print("Gamma: ", model._gamma)
print("Training time: ", time.time() - start)
print()

# Evaluating the SVM model on the validation set
ypredict = model.predict(valid_x)
show_result(valid_y, ypredict)

# Generating CSV file for the SVM model's predictions on the test set
model = SVC(kernel='rbf', class_weight = 'balanced', random_state=1)
model.fit(train_x, train_y)
ypredict = model.predict(test_x)
generate_csv(ypredict, 'SVM_rbf_Result.csv')

Gamma:  1.3202465493710435e-11
Training time:  35.70031189918518

Accuracy:  0.5747916666666667
Precision:  0.2872472141972761
Recall:  0.689108910891089
Confusion Matrix: 
[[2063 1727]
 [ 314  696]]


### K-Nearest Neighbor ###

In [4]:
neighbors = [5,10,20,50,100,200]
best_acc = 0
for i in neighbors:
    model=KNeighborsClassifier(n_neighbors = i)
    model.fit(train_split_x, train_split_y)
    print("N =", i)

    ypredict = model.predict(valid_x)
    acc = show_result(valid_y, ypredict)
    print()
    if acc > best_acc:
        best_acc = acc
        best_n = i    

# best_n is tracked in the loop above but not used; the final model is fit with a fixed K of 20
model=KNeighborsClassifier(n_neighbors = 20)
model.fit(train_x, train_y)
ypredict=model.predict(test_x)
generate_csv(ypredict, 'KNN_Result.csv')

N = 5
Accuracy:  0.7666666666666667
Precision:  0.38866396761133604
Recall:  0.1900990099009901
Confusion Matrix: 
[[3488  302]
 [ 818  192]]

N = 10
Accuracy:  0.7875
Precision:  0.4772727272727273
Recall:  0.10396039603960396
Confusion Matrix: 
[[3675  115]
 [ 905  105]]

N = 20
Accuracy:  0.7885416666666667
Precision:  0.48366013071895425
Recall:  0.07326732673267326
Confusion Matrix: 
[[3711   79]
 [ 936   74]]

N = 50
Accuracy:  0.7902083333333333
Precision:  0.5151515151515151
Recall:  0.0504950495049505
Confusion Matrix: 
[[3742   48]
 [ 959   51]]

N = 100
Accuracy:  0.7910416666666666
Precision:  0.5573770491803278
Recall:  0.033663366336633666
Confusion Matrix: 
[[3763   27]
 [ 976   34]]

N = 200
Accuracy:  0.7902083333333333
Precision:  0.6153846153846154
Recall:  0.007920792079207921
Confusion Matrix: 
[[3785    5]
 [1002    8]]
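
Accuracy alone is a weak selection criterion here because it rises as the classifier predicts fewer positives. The sketch below recomputes accuracy and F1 from the confusion matrices printed above and reports which K each metric would select. F1 is an added metric chosen for illustration, not one the notebook reports.

```python
# Confusion matrices copied from the validation output: K -> (TN, FP, FN, TP)
results = {
    5: (3488, 302, 818, 192),
    10: (3675, 115, 905, 105),
    20: (3711, 79, 936, 74),
    50: (3742, 48, 959, 51),
    100: (3763, 27, 976, 34),
    200: (3785, 5, 1002, 8),
}

def accuracy(tn, fp, fn, tp):
    """Fraction of all samples classified correctly."""
    return (tn + tp) / (tn + fp + fn + tp)

def f1(tn, fp, fn, tp):
    """Harmonic mean of precision and recall for the positive class."""
    return 2 * tp / (2 * tp + fp + fn)

best_by_accuracy = max(results, key=lambda k: accuracy(*results[k]))
best_by_f1 = max(results, key=lambda k: f1(*results[k]))
print("best K by accuracy:", best_by_accuracy)
print("best K by F1:      ", best_by_f1)
```

The two metrics pick opposite ends of the grid, so the choice of K depends heavily on which kind of error matters more for the task.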



# Step 3. Conclusion #

Conduct a comparison among the four algorithms, considering factors such as performance, efficiency, and any additional insights you would like to share regarding this assignment.


## Model Comparison

### Linear Regression
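* Because the dataset is imbalanced, the L1 (Lasso) model predicts label 0 for every sample and cannot make useful predictions. Switching to L2 (Ridge) improves this slightly, but the model still predicts 0 in almost all cases, so performance remains poor.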
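
The L1 validation accuracy of about 0.79 is exactly what a constant predictor achieves on this split. A short check follows, using the class counts from the L1 confusion matrix above (3790 negatives, 1010 positives). The variable names are illustrative.

```python
negatives, positives = 3790, 1010   # validation counts from the L1 confusion matrix

# A model that always predicts label 0 gets every negative right
# and every positive wrong.
baseline_accuracy = negatives / (negatives + positives)
baseline_recall = 0 / positives     # no positive sample is ever found

print(f"all-zero accuracy: {baseline_accuracy:.4f}")
print(f"all-zero recall:   {baseline_recall:.4f}")
```

Any model on this split therefore needs to beat this baseline accuracy, and show non-zero recall, before it can be considered better than guessing the majority class.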

### Decision Tree
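* Choosing Gini or Information Gain makes no noticeable difference to model performance.
* Setting the model's max depth to 4 performs better than leaving the depth unrestricted, presumably because the depth limit reduces overfitting.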

### SVM
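* Linear Kernel: Gamma = None, C = 1. Training time is very long; dimensionality reduction such as PCA may be needed to speed it up, and model performance is also poor.
* RBF Kernel: Gamma = 'scale', C = 1. Training takes about 30 sec. Without `class_weight = 'balanced'`, the uneven class distribution leads the model to predict the same label for every sample; with it, performance on the test set improves noticeably.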
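
`class_weight='balanced'` reweights each class by `n_samples / (n_classes * class_count)`, as described in the scikit-learn documentation. The sketch below applies that formula, using the validation split's class counts purely as example numbers, since the actual weights are computed from the training labels. The function name is illustrative.

```python
def balanced_weights(class_counts):
    """Return {label: weight} following n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Example counts taken from the validation confusion matrices
weights = balanced_weights({0: 3790, 1: 1010})
print(weights)
```

With these counts the minority class receives a weight several times larger than the majority class, so each class contributes the same total weight to the loss.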

### K-Nearest Neighbor
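* The larger the value of n, the more the model tends to predict label 0.
* Although n = 100 had the best Accuracy during training, that came from predicting 0 for most samples; choosing n = 50 in the end gives better performance.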

## Result Comparison

### Accuracy
* Decision Tree(64.73%) > SVM(61.14%) > Linear Regression(56.73%) > KNN (54.39%)

### Run-time
* SVM >> others

Based on the experimental results, Decision Tree is the more suitable choice for this task: it has a short training time while achieving the best performance.
