#### Import Packages

In [19]:
library(ISLR)
library(regtools)
library(qeML)

options(warn = -1) # Disable warnings globally

#### Import data

In [20]:
data("Credit", package = "ISLR")

In [21]:
# inspect data
head(Credit)

,ID,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,<fct>,<fct>,<int>
1,1,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
2,2,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
3,3,104.593,7075,514,4,71,11,Male,No,No,Asian,580
4,4,148.924,9504,681,3,36,11,Female,No,No,Asian,964
5,5,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331
6,6,80.18,8047,569,4,77,10,Male,No,No,Caucasian,1151


In [22]:
Credit$Utilization <- round((Credit$Balance/Credit$Limit)*100, 2)

#### KNN Methods

##### Method 1: Manual

In [23]:
### let's estimate the utilization rate for an income of 100
inc <- Credit$Income # isolate income feature (full column name, no partial matching)
inc_pred <- 100

dists <- abs(inc-inc_pred) # distances to 100

dist_10 <- order(dists)[1:10] # Identify closest 10 data points

print(dists[dist_10])

 [1] 1.485 1.788 3.893 4.483 4.593 5.807 6.025 6.961 7.614 7.841


In [24]:
closest_10 <- Credit$Utilization[dist_10]
print(closest_10)

 [1] 14.04 10.57  9.02  8.17  8.20 11.17 13.59 10.12 10.97 15.38


In [25]:
util_pred <- mean(closest_10)
print(util_pred)

[1] 11.123
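The three cells above (distances, nearest neighbours, mean) can be folded into a single helper. The sketch below is a minimal illustration, not part of the original notebook: the function name `knn_predict_1d` and the toy vectors are assumptions made for the example.

```r
# Hypothetical helper: one-feature KNN regression by averaging the k nearest y values
knn_predict_1d <- function(x, y, x0, k = 10) {
    stopifnot(length(x) == length(y), k <= length(x))
    dists <- abs(x - x0)                  # distance of every point to the query
    nearest <- order(dists)[seq_len(k)]   # indices of the k closest points
    list(neighbours = nearest,
         distances  = dists[nearest],
         prediction = mean(y[nearest]))
}

# Toy data (made up for illustration only)
x <- c(10, 20, 30, 40, 50)
y <- c(1, 2, 3, 4, 5)
out <- knn_predict_1d(x, y, x0 = 32, k = 3)
print(out$prediction)
```

With the notebook's data loaded, a call such as `knn_predict_1d(Credit$Income, Credit$Utilization, 100, 10)` should follow the same steps as the manual cells.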


Not bad for a manual run of our data. For an income of 100, the average utilization rate of the 10 nearest neighbors is roughly 11%. However, if we review those 10 values, we can see there is a significant range between them, from 8.17% to 15.38%, with six of the ten falling between 8% and 11%.

This raises the question: are we sure we have actually chosen the best k for our estimate? Let's use a loss function, MAPE (mean absolute percentage error), to help identify the best k for determining our output.

In [40]:
mape_function <- function(actual, predict) {
    # mean absolute percentage error; undefined if any actual value is 0
    mean(abs((actual - predict) / actual)) * 100
}

In [41]:
mape <- mape_function(closest_10, util_pred)
print(mape)

[1] 17.86722
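To see where the 17.87 comes from, the sketch below breaks the MAPE into its per-neighbour terms, using the ten utilization values printed above. The names `ape` and `worst` are introduced only for this illustration.

```r
# The ten nearest utilization rates, copied from the printed output above
closest_10 <- c(14.04, 10.57, 9.02, 8.17, 8.20, 11.17, 13.59, 10.12, 10.97, 15.38)
util_pred <- mean(closest_10)

# Absolute percentage error of the mean prediction against each neighbour
ape <- abs((closest_10 - util_pred) / closest_10) * 100
mape <- mean(ape)

# The neighbour that contributes the largest percentage error
worst <- closest_10[which.max(ape)]

print(round(ape, 2))
print(mape)
print(worst)
```

Because the error is divided by the actual value, the smallest utilization rates contribute the largest terms even when their absolute deviation from the mean is similar to that of the largest rates.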


In [28]:
k_range <- c(3, 4, 5, 6, 7, 8, 9, 10)

for (k in k_range) {
    # rerun dist calcs
    dists <- abs(inc-inc_pred) # distances to 100
    dist_k <- order(dists)[1:k] # Identify closest k data points

    # identify closest k utilization rates
    closest_k <- Credit$Utilization[dist_k]
    mean_k_pred <- mean(closest_k)
    
    mape_w_k <- round(mape_function(closest_k, mean_k_pred), 2)
    print(paste0("k at ",k," | MAPE: ", mape_w_k," | Pred Utilization: ", mean_k_pred,"%" ))
}

[1] "k at 3 | MAPE: 16.83 | Pred Utilization: 11.21%"
[1] "k at 4 | MAPE: 17.62 | Pred Utilization: 10.45%"
[1] "k at 5 | MAPE: 17.88 | Pred Utilization: 10%"
[1] "k at 6 | MAPE: 16.97 | Pred Utilization: 10.195%"
[1] "k at 7 | MAPE: 18.59 | Pred Utilization: 10.68%"
[1] "k at 8 | MAPE: 16.68 | Pred Utilization: 10.61%"
[1] "k at 9 | MAPE: 15.29 | Pred Utilization: 10.65%"
[1] "k at 10 | MAPE: 17.87 | Pred Utilization: 11.123%"


As we optimize for the best k, we can use our MAPE measurement as a guide. Looping through k from 3 to 10, the lowest MAPE is 15.29 at k=9. As a result, for this manually constructed KNN, the predicted utilization rate for someone with an income of 100 is 10.65%. Keep in mind that this MAPE compares each neighbor's utilization to the neighbors' own mean, so it measures how tightly the neighbors agree rather than error on unseen data.
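One common alternative for choosing k is to score each k on points the model has not seen, for example with leave-one-out validation. The sketch below is a hedged illustration on synthetic data, not the notebook's method: `knn_mean`, `loo_mape`, and the simulated `x` and `y` are all assumptions made for the example.

```r
# Hypothetical helper: mean of y over the k nearest neighbours of x0
knn_mean <- function(x, y, x0, k) {
    nearest <- order(abs(x - x0))[seq_len(k)]
    mean(y[nearest])
}

# Leave-one-out MAPE: predict each point from all the other points
loo_mape <- function(x, y, k) {
    preds <- vapply(seq_along(x),
                    function(i) knn_mean(x[-i], y[-i], x[i], k),
                    numeric(1))
    # Percentage error is undefined when an actual value is 0,
    # so y must be non-zero for this score to be finite
    mean(abs((y - preds) / y)) * 100
}

# Synthetic data for illustration only (strictly positive y)
set.seed(1)
x <- runif(200, 10, 150)
y <- 10 + 0.02 * x + rnorm(200, sd = 1.5)

k_range <- 3:10
loo <- vapply(k_range, function(k) loo_mape(x, y, k), numeric(1))
best_k <- k_range[which.min(loo)]

print(data.frame(k = k_range, loo_mape = round(loo, 2)))
print(best_k)
```

Applying this to the Credit data would require handling any rows where Utilization is 0, since the percentage error is undefined there.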

##### Method 2: Leverage regtools and qeML

1) Basic Implementation - Single Feature with No Holdout

In [29]:
set.seed(999) #establish a seed for our model

In [30]:
# First, isolate the columns we are interested in
Credit_income <- Credit[c("Income","Utilization")]
Credit_income <- as.data.frame(Credit_income)


In [31]:
# Leverage the qeKNN function from the qeML package for ease of use
knnout <- qeKNN(Credit_income, "Utilization", k=5, holdout=NULL)

In [32]:
inc <- data.frame(Income=100)
inc_pred <- predict(knnout, inc)

print(paste0("At k=5: ", round(inc_pred, 2),"%"))

[1] "At k=5: 10.98%"


Now that we have a pre-packaged algorithm with k = 5, the estimated utilization rate is 10.98%. Because we opted for no holdout, there is no test set on which to calculate a MAPE.

2) Add a second feature (Education, in years) with Holdout

In [33]:
Credit_income_ed <- Credit[c("Income", "Education", "Utilization")]
Credit_income_ed <- as.data.frame(Credit_income_ed)

In [34]:
set.seed(999) #establish a seed for our model
knnout <- qeKNN(Credit_income_ed, "Utilization", k=5)
inc_ed <- data.frame(Income=100, Education=14)
inc_ed_pred <- predict(knnout, inc_ed)
mape <- knnout$testAcc

print(paste0("At k=5: ", round(inc_ed_pred, 2),"% | MAPE: ", round(mape, 2)))

[1] "At k=5: 10.65% | MAPE: 6.08"


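After adding the Education variable and a holdout set, the estimated utilization rate moves from 10.98% to 10.65%, and the holdout error reported in `testAcc` is 6.08. As far as the qeML documentation describes it, `testAcc` for a numeric outcome is the mean absolute prediction error in the units of the outcome, which differs from the percentage-based MAPE defined in Method 1. However, just as we did above, we need to optimize for k.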

3) Optimizing k

In [35]:
set.seed(999) #establish a seed for our model
k_range <- c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

for (k in k_range) {
    set.seed(999) #establish a seed for our model
    knnout <- qeKNN(Credit_income_ed, "Utilization", k=k)
    inc_ed_pred <- predict(knnout, inc_ed)
    mape <- knnout$testAcc
    print(paste0("At k=",k,": ", round(inc_ed_pred, 2),"% | MAPE: ", round(mape, 2)))
}


[1] "At k=5: 10.65% | MAPE: 6.08"
[1] "At k=6: 10.92% | MAPE: 5.73"
[1] "At k=7: 11.4% | MAPE: 5.54"
[1] "At k=8: 11.37% | MAPE: 5.63"
[1] "At k=9: 11.75% | MAPE: 5.57"
[1] "At k=10: 11.58% | MAPE: 5.5"
[1] "At k=11: 11.63% | MAPE: 5.47"
[1] "At k=12: 11.77% | MAPE: 5.47"
[1] "At k=13: 11.38% | MAPE: 5.53"
[1] "At k=14: 10.59% | MAPE: 5.66"
[1] "At k=15: 9.88% | MAPE: 5.73"


After executing this for loop, we see that k = 11 and k = 12 tie for the lowest MAPE at 5.47. Let's go with k = 11, the smaller of the two, as our optimized choice, which gives a predicted utilization rate of 11.63%.
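Rather than reading the minimum off the printed lines, the selection can be done programmatically. The sketch below uses the rounded values printed by the loop above; the data frame `results` and its column names are assumptions made for the example, and ties are broken in favour of the smaller k.

```r
# k, prediction, and holdout error copied from the printed loop output
results <- data.frame(
    k    = 5:15,
    pred = c(10.65, 10.92, 11.40, 11.37, 11.75, 11.58, 11.63, 11.77, 11.38, 10.59, 9.88),
    err  = c(6.08, 5.73, 5.54, 5.63, 5.57, 5.50, 5.47, 5.47, 5.53, 5.66, 5.73)
)

# All rows that share the lowest error, then the smallest k among them
best <- results[results$err == min(results$err), ]
chosen <- best[which.min(best$k), ]

print(best)
print(chosen)
```

In the loop itself, the same idea would mean appending each k's prediction and `testAcc` to a data frame instead of only printing them.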