#### Import Packages

In [1]:
library(ISLR)
library(regtools)
library(qeML)

"package 'ISLR' was built under R version 4.5.1"
"package 'regtools' was built under R version 4.5.1"
Loading required package: FNN
"package 'FNN' was built under R version 4.5.1"
Loading required package: gtools
"package 'gtools' was built under R version 4.5.1"

*********************
Latest version of regtools at GitHub.com/matloff
Type ?regtools to see function list by category

"package 'qeML' was built under R version 4.5.1"
Loading required package: rmarkdown
"package 'rmarkdown' was built under R version 4.5.1"
Loading required package: tufte
"package 'tufte' was built under R version 4.5.1"

*********************
  Navigating qeML:
      Type vignette("Quick_Start") for a quick overview!
      Type vignette("Function_List") for a categorized function list
      Type vignette("ML_Overview") for an introduction to machine learning

#### Import data

In [2]:
data("Credit", package = "ISLR")

In [3]:
# inspect data
head(Credit)

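| | ID | Income | Limit | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<int>` |
| 1 | 1 | 14.891 | 3606 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
| 2 | 2 | 106.025 | 6645 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
| 3 | 3 | 104.593 | 7075 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
| 4 | 4 | 148.924 | 9504 | 681 | 3 | 36 | 11 | Female | No | No | Asian | 964 |
| 5 | 5 | 55.882 | 4897 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |
| 6 | 6 | 80.18 | 8047 | 569 | 4 | 77 | 10 | Male | No | No | Caucasian | 1151 |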


In [4]:
# Utilization: balance as a percentage of the credit limit, rounded to 2 decimals
Credit$Utilization <- round((Credit$Balance/Credit$Limit)*100, 2)

#### KNN Methods

##### Method 1: Manual

In [5]:
# Let's estimate the utilization rate for an income of 100
inc <- Credit$Income # isolate income feature (full column name, no partial matching)
inc_pred <- 100

dists <- abs(inc-inc_pred) # distances to 100

dist_10 <- order(dists)[1:10] # Identify closest 10 data points

print(dists[dist_10])

 [1] 1.485 1.788 3.893 4.483 4.593 5.807 6.025 6.961 7.614 7.841


In [6]:
closest_10 <- Credit$Utilization[dist_10]
print(closest_10)

 [1] 14.04 10.57  9.02  8.17  8.20 11.17 13.59 10.12 10.97 15.38


In [7]:
util_pred <- mean(closest_10)
print(util_pred)

[1] 11.123
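
The three cells above can be folded into one reusable helper. The following is a minimal base-R sketch: `knn_1d` and its argument names are hypothetical (they are not part of regtools or qeML), and the toy vectors are made up so the block runs without the `Credit` data.

```r
# Hypothetical helper: predict y at x0 by averaging y over the k nearest x values
knn_1d <- function(x, y, x0, k) {
  stopifnot(length(x) == length(y), k >= 1, k <= length(x))
  nearest <- order(abs(x - x0))[seq_len(k)]  # indices of the k closest points
  mean(y[nearest])
}

# Made-up toy data (not taken from Credit)
x_toy <- c(10, 20, 30, 40, 50)
y_toy <- c(1, 2, 3, 4, 5)

# The three closest x values to 32 are 30, 40 and 20, so this averages 3, 4 and 2
toy_pred <- knn_1d(x_toy, y_toy, x0 = 32, k = 3)
print(toy_pred)
```

With the notebook's own objects, `knn_1d(Credit$Income, Credit$Utilization, x0 = 100, k = 10)` should give the same prediction as the manual steps, since it uses the same `order()` call.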


Not bad for a manual run. For an income of 100, the estimated utilization rate is roughly 11%. However, the 10 nearest neighbors span a significant range, from 8.17% to 15.38%, and six of them fall between 8% and 11%.

This raises the question: have we actually chosen the best k for our estimate? Let's use a loss function, mean absolute percentage error (MAPE), to compare candidate values of k.

In [8]:
# Mean absolute percentage error, expressed in percent
mape_function <- function(actual, predict) {
    mean(abs((actual - predict) / actual)) * 100
}

In [9]:
mape <- mape_function(closest_10, util_pred)
print(mape)

[1] 17.86722
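
It may help to see what this number is made of. The sketch below recomputes it from the ten neighbor values printed earlier, one neighbor at a time. Note that it measures how far each neighbor sits from the neighbors' own mean, so it describes spread within the neighborhood rather than error on unseen data. The name `ape` is introduced here for illustration.

```r
# The ten neighbor utilization rates printed earlier in this notebook
closest_10 <- c(14.04, 10.57, 9.02, 8.17, 8.20, 11.17, 13.59, 10.12, 10.97, 15.38)
util_pred <- mean(closest_10)

# Absolute percentage error of each neighbor against the shared prediction
ape <- abs((closest_10 - util_pred) / closest_10) * 100
print(round(ape, 2))

# Their average is the MAPE reported above (about 17.87)
print(mean(ape))

# Caveat: MAPE divides by the actual value, so an actual of 0 gives Inf
zero_case <- abs((0 - util_pred) / 0)
print(zero_case)
```

The last line matters for this data set if any customer has a zero balance, because their utilization is 0 and a MAPE that includes them would be infinite.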


In [10]:
k_range <- 3:10

# distances to the target income do not depend on k, so compute them once
dists <- abs(inc - inc_pred)

for (k in k_range) {
    dist_k <- order(dists)[1:k] # identify the k closest data points

    # utilization rates of the k closest points; their mean is the prediction
    closest_k <- Credit$Utilization[dist_k]
    mean_k_pred <- mean(closest_k)

    mape_w_k <- round(mape_function(closest_k, mean_k_pred), 2)
    print(paste0("k at ", k, " | MAPE: ", mape_w_k, " | Pred Utilization: ", mean_k_pred, "%"))
}

[1] "k at 3 | MAPE: 16.83 | Pred Utilization: 11.21%"
[1] "k at 4 | MAPE: 17.62 | Pred Utilization: 10.45%"
[1] "k at 5 | MAPE: 17.88 | Pred Utilization: 10%"
[1] "k at 6 | MAPE: 16.97 | Pred Utilization: 10.195%"
[1] "k at 7 | MAPE: 18.59 | Pred Utilization: 10.68%"
[1] "k at 8 | MAPE: 16.68 | Pred Utilization: 10.61%"
[1] "k at 9 | MAPE: 15.29 | Pred Utilization: 10.65%"
[1] "k at 10 | MAPE: 17.87 | Pred Utilization: 11.123%"
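
The loop above scores each k by how tightly the neighbors cluster around their own mean at a single query point. A more conventional way to compare values of k is to hold out each observation in turn and predict it from the rest (leave-one-out). The sketch below illustrates that idea under stated assumptions: the data are synthetic stand-ins rather than `Credit`, `loo_mae` is a hypothetical helper, and mean absolute error is swapped in for MAPE because MAPE is undefined when an actual value is 0.

```r
set.seed(1)

# Synthetic stand-ins for Income and Utilization (made up for illustration)
x <- runif(200, 10, 180)
y <- 5 + 0.04 * x + rnorm(200, sd = 2)

# Hypothetical helper: leave-one-out mean absolute error for a given k
loo_mae <- function(x, y, k) {
  errs <- vapply(seq_along(x), function(i) {
    d <- abs(x[-i] - x[i])             # distances from point i to all other points
    nearest <- order(d)[seq_len(k)]    # k nearest among the other points
    abs(y[i] - mean(y[-i][nearest]))   # error when point i is held out
  }, numeric(1))
  mean(errs)
}

k_range <- 3:10
scores <- vapply(k_range, function(k) loo_mae(x, y, k), numeric(1))
print(data.frame(k = k_range, loo_mae = round(scores, 3)))

# k with the lowest held-out error on this toy data
best_k <- k_range[which.min(scores)]
print(best_k)
```

Swapping `x` and `y` for `Credit$Income` and `Credit$Utilization` would apply the same procedure to the notebook's data.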


##### Method 2: Leverage regtools and qeML

In [11]:
# First, isolate the columns we are interested in
Credit_income <- Credit[c("Income","Utilization")]

# Use qeKNN from qeML for ease of use
knnout <- qeKNN(Credit_income, "Utilization", k=5)


ERROR: Error in 1:nrow(x): argument of length 0


In [None]:
dim(Credit_income)
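
One plausible cause of the `1:nrow(x)` error, which we have not confirmed against the qeML source, is that `Credit_income` has only a single predictor column. When one column is pulled out of a data frame with the default `drop = TRUE`, R returns a plain vector, and `nrow()` of a vector is `NULL`. The base-R sketch below reproduces that behavior on a made-up toy data frame; it does not call qeML.

```r
# Made-up two-column data frame standing in for Credit_income
toy <- data.frame(Income = c(14.9, 106.0, 104.6),
                  Utilization = c(9.23, 13.59, 8.20))

x_vec <- toy[, "Income"]                # default drop = TRUE: plain numeric vector
x_df  <- toy[, "Income", drop = FALSE]  # stays a one-column data frame

print(is.null(nrow(x_vec)))  # vectors have no row count
print(nrow(x_df))

# 1:nrow() on the vector fails with the same message as the error above
msg <- tryCatch(1:nrow(x_vec), error = function(e) conditionMessage(e))
print(msg)
```

If this is indeed what happens inside the call, passing a data frame with at least two predictor columns would be one way to test the hypothesis.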