<a href="https://colab.research.google.com/github/jiao-xx/stats/blob/main/1_SVM%26KNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The dataset provided contains information about credit card applications. Each row in the dataset represents an individual credit card application, and the columns represent various features or attributes of those applications (like income, age, etc.). The last column indicates whether the application was approved (positive) or not (negative).



In [1]:
# Third-Party Libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Google Colab-specific
from google.colab import drive


drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd drive/MyDrive/math

/content/drive/MyDrive/math


In [9]:
dataset = pd.read_csv('credit_card_data-headers.txt', sep = '\t')

dataset.head()

Unnamed: 0,A1,A2,A3,A8,A9,A10,A11,A12,A14,A15,R1
0,1,30.83,0.0,1.25,1,0,1,1,202,0,1
1,0,58.67,4.46,3.04,1,0,6,1,43,560,1
2,0,24.5,0.5,1.5,1,1,0,1,280,824,1
3,1,27.83,1.54,3.75,1,0,5,0,100,3,1
4,1,20.17,5.625,1.71,1,1,0,1,120,0,1


# Task 1

Support Vector Machine (SVM):


Purpose: Use an SVM to determine whether a credit card application will be approved based on the given features. In other words, we try to learn a decision boundary (a rule) that separates the approved applications from the rejected ones.


Implication: Using an SVM on this dataset means looking for the best "line" (or, in higher dimensions, a "plane" or "hyperplane") that separates the data into two categories: approved and not approved. Imagine plotting all the applications on a graph; the SVM tries to draw a line so that as many approvals as possible fall on one side and as many rejections as possible fall on the other, with the widest possible margin between the two groups.

1. First, we'll handle the Support Vector Machine (SVM) task using the SVC class from the scikit-learn library.

2. We'll follow the guidelines mentioned in the question, using a linear kernel and experimenting with different values of the hyperparameter C.

3. We'll also scale the features as suggested in the question.

In [10]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Split the dataset into features (X) and labels (y)
# X holds the features: all rows, every column except the last.
# y holds the labels: all rows, last column only.
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create a SVM model with a linear kernel
# Experiment with different values of C (e.g., 100) to find a good classifier
C_value = 100
svm_model = SVC(kernel='linear', C=C_value)
svm_model.fit(X_scaled, y)

# Get the predictions
svm_predictions = svm_model.predict(X_scaled)

# Calculate the accuracy of the model
svm_accuracy = accuracy_score(y, svm_predictions)
svm_accuracy


0.863914373088685

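The Support Vector Machine (SVM) model with a linear kernel and
C = 100 achieved an accuracy of approximately 86.39% on the full dataset. Because the model is scored on the same data it was trained on, this is a training accuracy and may overstate how well the model performs on new applications.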

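C = 100: C is an important SVM parameter, called the regularization parameter. It controls the model's complexity and its tolerance for errors:

A small C: the model is simpler and tolerates some misclassifications, but it is less likely to overfit the data.

A large C: the model tries to minimize misclassification of the training data, but it may overfit and perform poorly on new data.
Here, C = 100 can be considered a relatively large value, which means the model tries hard to minimize misclassification of the training data.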
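
The trade-off described above can be explored by fitting the same linear SVM over a grid of C values. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the credit card file (an assumption, so the block runs on its own), and the names `c_grid` and `c_results` are hypothetical. It records the training accuracy and the number of support vectors for each C.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the credit data (assumption: 10 numeric features, binary label)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Hypothetical grid spanning several orders of magnitude
c_grid = [0.001, 0.01, 0.1, 1, 10, 100]
c_results = {}
for c in c_grid:
    model = SVC(kernel='linear', C=c).fit(X_demo_scaled, y_demo)
    training_accuracy = model.score(X_demo_scaled, y_demo)
    n_support_vectors = int(model.n_support_.sum())
    c_results[c] = (training_accuracy, n_support_vectors)

for c, (acc, n_sv) in c_results.items():
    print(f"C={c:<7} training accuracy={acc:.3f} support vectors={n_sv}")
```

On the real data, `X_scaled` and `y` from the cell above would replace the synthetic arrays. Training accuracy alone cannot reveal overfitting, so a held-out or cross-validated score would be needed to choose C.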

Given the equation:
\[ \text{decision\_function} = a_1 \cdot x_1 + a_2 \cdot x_2 + \dots + a_m \cdot x_m + a_0 \]


Where:
- \( a_i \) (for \( i = 1, 2, \dots, m \)) are the coefficients.
- \( a_0 \) is the intercept.

In [14]:
# Extract the coefficients (a1...am) and intercept (a0) of the model
coefficients = svm_model.coef_[0]
intercept = svm_model.intercept_[0]

# Display the coefficients and intercept
coefficients, intercept


(array([-8.54344736e-04, -1.54748550e-03, -1.37028989e-03,  2.67436548e-03,
         1.00473936e+00, -2.37745609e-03,  8.49701231e-05, -5.32250557e-04,
        -1.53192990e-03,  1.06293893e-01]),
 0.08135935298850694)

The equation of the SVM classifier, where each \( x_i \) is a standardized (scaled) feature, can be represented as:
\begin{align*}
\text{decision\_function} & = (-8.54 \times 10^{-4}) \cdot x_1 + (-1.55 \times 10^{-3}) \cdot x_2 + (-1.37 \times 10^{-3}) \cdot x_3 + (2.67 \times 10^{-3}) \cdot x_4 \\
& \quad + (1.00) \cdot x_5 + (-2.38 \times 10^{-3}) \cdot x_6 + (8.50 \times 10^{-5}) \cdot x_7 + (-5.32 \times 10^{-4}) \cdot x_8 \\
& \quad + (-1.53 \times 10^{-3}) \cdot x_9 + (0.11) \cdot x_{10} + 0.0814
\end{align*}
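
As a sanity check on the linear form above, the decision function can be recomputed by hand from `coef_` and `intercept_` and compared with scikit-learn's own output. This is a minimal sketch on synthetic data (an assumption, so the block is self-contained); the `demo_*` and `manual_*` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the scaled credit features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)
demo_model = SVC(kernel='linear', C=1.0).fit(X_demo_scaled, y_demo)

a = demo_model.coef_[0]        # a_1 ... a_m
a0 = demo_model.intercept_[0]  # a_0

# decision_function = a_1*x_1 + ... + a_m*x_m + a_0, evaluated for every row
manual_scores = X_demo_scaled @ a + a0
# A positive score maps to the second class (label 1 here), otherwise label 0
manual_labels = (manual_scores > 0).astype(int)

scores_match = np.allclose(manual_scores, demo_model.decision_function(X_demo_scaled))
labels_match = np.array_equal(manual_labels, demo_model.predict(X_demo_scaled))
print(scores_match, labels_match)
```

The same check should apply to `svm_model` and `X_scaled` from the notebook. Note that the coefficients refer to scaled inputs, so raw feature values must be passed through the fitted scaler before the equation is used.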

# Task 2

k-Nearest Neighbors (k-NN):

Purpose: This is another method to predict whether a credit card application will be approved or not. It works by looking at the 'k' most similar applications and then making a decision based on their outcomes. For example, if k = 3, and for a particular application, 2 out of the 3 nearest applications were approved and 1 was rejected, the method would predict an approval for this application.


Implication: Using k-NN means you're trying to predict the outcome of an application by looking at the outcomes of the most similar applications in the dataset. It's like asking a group of your closest friends for advice and then going with the majority opinion.
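
The majority-vote idea can be written out in a few lines of plain Python. The sketch below is a toy illustration, not the scikit-learn implementation: the function `knn_predict` and the five two-feature "applications" are hypothetical, and distance is assumed to be Euclidean.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k neighbours wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy applications with two scaled features; 1 = approved, 0 = rejected
toy_points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (3.0, 3.0), (3.2, 2.8)]
toy_labels = [1, 1, 0, 0, 0]

# The three nearest neighbours of (1.0, 1.1) carry labels 1, 0, 1, so the vote is "approved"
prediction = knn_predict(toy_points, toy_labels, query=(1.0, 1.1), k=3)
print(prediction)  # 1
```

Using an odd k avoids ties in a two-class problem, which is one reason odd values are a common choice.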


1. We will use the KNeighborsClassifier class from the scikit-learn library.

2. As mentioned in the question, to estimate the performance, we will perform a leave-one-out cross-validation, where we predict the label for each data point using all other points as the training set.

3. For each value of k, we will calculate the accuracy to determine the best value for k.

However, this is time-consuming: for each of the 50 values of k, the model is refit once for every data point in the dataset.

In [12]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut

# Initialize the LeaveOneOut cross-validator
loo = LeaveOneOut()

# Define a range of k values to test
k_values = list(range(1, 51))  # Testing k values from 1 to 50
accuracies = []

# For each value of k, perform leave-one-out cross-validation
for k in k_values:
    knn_model = KNeighborsClassifier(n_neighbors=k)
    correct_predictions = 0
    for train_index, test_index in loo.split(X_scaled):
        X_train, X_test = X_scaled[train_index], X_scaled[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        knn_model.fit(X_train, y_train)
        y_pred = knn_model.predict(X_test)
        if y_pred[0] == y_test.iloc[0]:  # compare the single held-out prediction with its true label
            correct_predictions += 1
    accuracy = correct_predictions / len(y)
    accuracies.append(accuracy)

# Get the best value of k and its corresponding accuracy
best_k = k_values[accuracies.index(max(accuracies))]
best_accuracy = max(accuracies)

best_k, best_accuracy


(22, 0.8532110091743119)
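
The manual leave-one-out loop above can also be expressed with `cross_val_score`, which handles the splitting and scoring. The following is a sketch under stated assumptions: it runs on a small synthetic dataset rather than the credit card file, tests only a few odd values of k to stay fast, and wraps the scaler in a pipeline so it is refit inside each fold (the notebook instead scales once on the full dataset). The names `loo_scores` and `best_k_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the credit data (assumption)
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

loo_scores = {}
for k in range(1, 16, 2):
    # Scaling lives inside the pipeline, so each fold scales using its training rows only
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Each leave-one-out fold scores 0 or 1; the mean is the LOOCV accuracy
    fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=LeaveOneOut())
    loo_scores[k] = fold_scores.mean()

best_k_demo = max(loo_scores, key=loo_scores.get)
print(best_k_demo, loo_scores[best_k_demo])
```

Passing `n_jobs=-1` to `cross_val_score` runs the folds in parallel, which may shorten the run on the full dataset.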

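As a faster alternative to leave-one-out cross-validation, we can evaluate k-NN with a single train-test split:

1. We'll split the dataset into training and test sets, maintaining a sufficient amount of data for training.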

2. We'll use the training set to train the k-NN model and the test set for validation.

3. We'll still iterate over a range of k values to find the best one.


In [13]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

accuracies_split = []

# For each value of k, train on the training set and validate on the test set
for k in k_values:
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(X_train, y_train)
    y_pred = knn_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies_split.append(accuracy)

# Get the best value of k and its corresponding accuracy from the split dataset approach
best_k_split = k_values[accuracies_split.index(max(accuracies_split))]
best_accuracy_split = max(accuracies_split)

best_k_split, best_accuracy_split


(32, 0.8625954198473282)
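
One caveat with the cell above is that k is chosen using the same test set that reports the final accuracy, so the reported figure can be optimistic. A common remedy is a three-way split: choose k on a validation set and report accuracy once on an untouched test set. The sketch below illustrates this on synthetic data (an assumption); the 60/20/20 proportions and all `demo`-style names are hypothetical choices, and the scaler is fit on the training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)

# 60% train, 20% validation, 20% test
X_tmp, X_te, y_tmp, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

# Fit the scaler on the training rows only, then apply it to every part
demo_scaler = StandardScaler().fit(X_tr)
X_tr_s, X_val_s, X_te_s = (demo_scaler.transform(part) for part in (X_tr, X_val, X_te))

# Choose k using the validation set
val_acc = {}
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr_s, y_tr)
    val_acc[k] = accuracy_score(y_val, knn.predict(X_val_s))
chosen_k = max(val_acc, key=val_acc.get)

# Report performance once, on data that played no role in choosing k
final_knn = KNeighborsClassifier(n_neighbors=chosen_k).fit(X_tr_s, y_tr)
test_acc = accuracy_score(y_te, final_knn.predict(X_te_s))
print(chosen_k, test_acc)
```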

1. **Leave-One-Out Cross-Validation (LOOCV)**:
    - **Pros**: Uses the entire dataset for both training and validation, making the most out of the available data. It's especially useful when the dataset is small.
    - **Cons**: Computationally expensive, especially for algorithms like k-NN. The validation error rate can have high variance since only one observation is left out for validation.
  
2. **Train-Test Split**:
    - **Pros**: Faster and computationally less expensive. Can give a more realistic estimate of the model's performance on unseen data.
    - **Cons**: The validation performance can be sensitive to the split of the data. This can be mitigated using techniques like k-fold cross-validation.
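
The k-fold mitigation mentioned above sits between the two approaches: it averages over several splits but needs far fewer fits than leave-one-out. A minimal sketch, assuming 10 stratified folds, a fixed k of 15, and synthetic data in place of the credit card file (all three are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# 10 shuffled folds that preserve the class balance in each fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))

fold_accuracies = cross_val_score(pipeline, X_demo, y_demo, cv=cv)
print(f"mean accuracy={fold_accuracies.mean():.3f}, std={fold_accuracies.std():.3f}")
```

The standard deviation across folds gives a rough sense of how sensitive the estimate is to the particular split.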

Given the LOOCV result (\( k = 22 \) with 85.32% accuracy) and the train-test split result (\( k = 32 \) with 86.26% accuracy), the choice between them depends on the following considerations:

- **If computational efficiency is a concern**: Consider the train-test split result. It's faster and still gives a reasonably good estimate of model performance.
  
- **If maximizing the use of available data is a priority**: Use the LOOCV result. It uses the entire dataset for evaluation, which might be especially useful if the dataset is not large.

- **Consider the application**: If false positives/negatives have significant consequences in the application of your model, it might be worth investing the computational time in LOOCV or even k-fold cross-validation to get a more reliable estimate of performance.
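
When false positives and false negatives carry different costs, accuracy alone hides which kind of error the model makes. A confusion matrix separates the two. The sketch below uses hypothetical hand-written labels (1 = approved, 0 = rejected), not results from the notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for eight applications
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels=[0, 1], ravel() yields the counts in the order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

print(f"true negatives={tn}, false positives={fp}, false negatives={fn}, true positives={tp}")
# true negatives=3, false positives=1, false negatives=1, true positives=3
```

Here a false positive is a rejected application predicted as approved, and a false negative is an approved application predicted as rejected.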

Both results are reasonably close in terms of accuracy. If new, unseen application data become available, consider validating both models (k = 22 and k = 32) on it to see which one performs better in practice.