## Methods (Pt A - Requires Update)
1. Import the data set. 
2. Clean and wrangle the data set into a tidy data format.
3. Visualize relationships between variables of interest: 

    A. Investment Activity vs Label (<= 50k, > 50k annual income) 
        a. Bar graph 
        b. X-Axis: Label 
        c. Y-Axis: Count of Capital Gains and Capital Losses 
    B. Capital Gains vs Age 
        a. Scatter plot 
        b. X-Axis: Age 
        c. Y-Axis: Capital Gains (USD) 
    C. Working Hours per Week vs Age 
        a. Scatter plot 
        b. X-Axis: Age 
        c. Y-Axis: Working Hours per Week 
4. Summarize the data set and address class imbalance if one label is more prevalent than the other. 

## Methods (Pt B)
1. Tune our classification model (k-nearest neighbours) using predictors of interest.

    A. Our dataset provided training and testing data, so we do not have to split our data set. 
    
    B. Pre-process our training data (standardize, center and upsample for class imbalance). 
    
    C. Create a 5-fold cross-validation data split using vfold_cv. 
    
    D. Determine specifications for the nearest neighbour function. 
    
        a. weight_func = "rectangular" 
        b. neighbors = tune() 
    E. Fit our model for each fold in our cross validation. 
    
        a. tune_grid(resamples = adult_vfold, grid = 20) 
    F. Create a scatter plot of Accuracy vs k to determine the best k 
    
2. Retrain our classification model (k-nearest neighbours) using our tuned k value and predictors of interest. 
3. Predict labels on our testing data set and evaluate the estimated accuracy of our classification model.  
4. Create a bar chart to visualize our results.

## Tuning our Classification Model
### Selecting Relevant Data
From our analysis above, we determined that capital gain is the most suitable predictor of our label. \
Our first step is to select the relevant columns we need from the tidy format of our data.

In [2]:
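set.seed(1000)
## Load Packages
library(tidyverse)  # provides %>%, select(), read_delim() and ggplot2
library(tidymodels) # provides recipes, parsnip, workflows, tune, rsample and yardstick
library(themis)     # assumption: step_upsample() is provided by themis in current tidymodels releases

## Selecting Relevant Data
adult_relevant <- adult_tidy %>%
    select(label, capital_gain)

head(adult_relevant)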


### Standardize, Center and Upsample
In this step we standardize our predictor so that it has a standard deviation of 1 and is centered around 0 (the values are not bounded to [-1,1]). \
We also upsample to mitigate the class imbalance in our data set. 

In [3]:
## Standardize, Center and Upsample
adult_recipe <- recipe(label~capital_gain,data=adult_relevant) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors()) %>%
    step_upsample(label, over_ratio = 1)
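
As a rough illustration of what the scaling, centering and upsampling steps do, here is a minimal base R sketch. The values in `x` and `label` are made-up stand-ins for `capital_gain` and `label`, and the manual resampling only approximates what `step_upsample` does inside the recipe.

```r
set.seed(1000)

# Made-up, right-skewed values standing in for capital_gain
x <- c(0, 0, 0, 0, 2174, 5013, 14084, 99999)
# Made-up, imbalanced labels (6 vs 2)
label <- c("<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K")

# Scale first, then center, in the same order as the recipe
x_scaled <- x / sd(x)
z <- x_scaled - mean(x_scaled)

mean(z)   # approximately 0
sd(z)     # 1
range(z)  # the largest value lies well above 1, so z is not bounded to [-1, 1]

# Upsample: resample minority rows with replacement until both classes are equal
counts <- table(label)
minority <- names(counts)[which.min(counts)]
n_extra <- max(counts) - min(counts)
minority_rows <- which(label == minority)
extra_rows <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
upsampled_label <- c(label, label[extra_rows])

table(upsampled_label)
```

Upsampling duplicates existing minority rows; it does not create new information, which is why it is applied only when fitting and not when predicting.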

In [6]:
## Knn Model
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
knn_spec
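
With weight_func = "rectangular", every one of the K nearest neighbours gets an equal vote. The following base R sketch shows that idea for a single predictor. The function `knn_predict_1d` and the toy values are hypothetical and only illustrate the voting rule; they are not the kknn implementation.

```r
# Hypothetical helper: unweighted (rectangular) K-NN vote on one predictor
knn_predict_1d <- function(train_x, train_label, new_x, k) {
  # Indices of the k training points closest to new_x
  nearest <- order(abs(train_x - new_x))[seq_len(k)]
  # Each neighbour counts once; the most common label wins
  votes <- table(train_label[nearest])
  names(votes)[which.max(votes)]
}

# Made-up standardized capital gain values and labels
train_x <- c(-0.5, -0.4, -0.3, 0.1, 1.8, 2.2, 2.5)
train_label <- c("<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K")

high_gain_prediction <- knn_predict_1d(train_x, train_label, new_x = 2.0, k = 3)
low_gain_prediction <- knn_predict_1d(train_x, train_label, new_x = 0.0, k = 3)

high_gain_prediction
low_gain_prediction
```

Using an odd K with two classes avoids tied votes in this sketch.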

### Cross Validation
For tuning, we will use 5-fold cross-validation. \
This splits our training data into 5 folds; each fold is used once for validation while the model is trained on the other 4, and we average the 5 accuracy estimates. \
Cross-validation gives a more representative accuracy estimate because it does not depend on a single split of the data.

In [8]:
## Cross Validation
# vfold_cv() belongs to the rsample package (part of tidymodels); the namespace
# prefix avoids a "could not find function" error when the package is not attached
adult_vfold <- rsample::vfold_cv(adult_relevant, v = 5, strata = label)
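
To make the fold mechanics concrete, here is a minimal base R sketch of 5-fold cross-validation. The labels are made up, the split is not stratified (unlike `vfold_cv` with `strata = label`), and a majority-class rule stands in for the K-NN model purely to keep the example short.

```r
set.seed(1000)

n_rows <- 20
v <- 5
# Made-up labels: 15 of one class, 5 of the other
labels <- rep(c("<=50K", ">50K"), times = c(15, 5))

# Assign every row to exactly one of the v folds, in random order
fold_id <- sample(rep(seq_len(v), length.out = n_rows))

fold_accuracy <- numeric(v)
for (i in seq_len(v)) {
  analysis_labels <- labels[fold_id != i]    # used to "train"
  assessment_labels <- labels[fold_id == i]  # held out for validation
  # Stand-in model: always predict the most common training label
  majority <- names(which.max(table(analysis_labels)))
  fold_accuracy[i] <- mean(assessment_labels == majority)
}

fold_accuracy
mean(fold_accuracy)  # the cross-validated accuracy estimate
```

Individual folds can give noticeably different accuracies, which is why the average is reported rather than any single fold.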


In [9]:
## Results of Tuning 
knn_results <- workflow() %>%
    add_recipe(adult_recipe)%>%
    add_model(knn_spec)%>%
    tune_grid(resamples = adult_vfold, grid = 20)%>%
    collect_metrics()

### Accuracy vs K Nearest Neighbours
We plot the estimated accuracy against the number of neighbours K to find the smallest K that yields the highest accuracy.
From our plot we choose K = 11.

In [11]:
## Accuracy vs K Plot
accuracies <- knn_results %>%
  filter(.metric == "accuracy")

accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate") +
  theme(text = element_text(size = 18))

accuracy_vs_k
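
The choice of K can also be read off the table instead of the plot. The sketch below uses a made-up data frame, `toy_accuracies`, with the same `neighbors` and `mean` columns as `accuracies`; the numbers are illustrative only and are not our tuning results.

```r
# Made-up accuracy estimates, NOT the results of our tuning
toy_accuracies <- data.frame(
  neighbors = c(1, 3, 5, 7, 9),
  mean      = c(0.70, 0.74, 0.78, 0.78, 0.77)
)

# Keep the rows that reach the highest accuracy, then take the smallest K among them
best_rows <- toy_accuracies[toy_accuracies$mean == max(toy_accuracies$mean), ]
best_k <- min(best_rows$neighbors)

best_k
```

Preferring the smallest K among ties is one possible rule; when accuracies differ only slightly, the plot remains useful for judging how stable the estimate is around the chosen K.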

## Retraining Our Classifier
In this step we will retrain our Knn model on the training data with our chosen K value (K = 11).

In [12]:
## Knn Model
knn_spec_2 <- nearest_neighbor(weight_func="rectangular", neighbors=11) %>%
    set_engine("kknn") %>%
    set_mode("classification")

In [13]:
## Fit Model
# Train on the training data only; the testing data is reserved for evaluation
knn_results_2 <- workflow() %>%
    add_recipe(adult_recipe)%>%
    add_model(knn_spec_2)%>%
    fit(data = adult_relevant)

## Accuracy of Our Classifier
Now that we have our trained classifier, we will predict labels using our tidied test data set.

In [None]:
## Import Testing Data Set
# adult.test starts with a non-data line ("|1x3 Cross validator"), so we skip it
adult_testing <- read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", delim=",", skip = 1, col_names=c("age", "workclass", "fnl_wgt","education",
    "education_num","marital_status","occupation","relationship","race","sex","capital_gain","capital_loss",
    "hrs_per_week","native_country","label"))

head(adult_testing)

In [14]:
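## Cleaning and Wrangling Testing Data
# Labels in adult.test end with a period (e.g. " <=50K."); strip it so the
# factor levels match the labels in the training data
adult_testing_tidy <- adult_testing %>%
    mutate(label = as_factor(str_remove(label, "\\.$")), capital_gain = as.numeric(capital_gain)) %>%
    filter_all(all_vars(. != " ?")) %>%
    select(capital_gain,label)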

head(adult_testing_tidy)

In [15]:
## Predict Labels
adult_test_predictions <- predict(knn_results_2, adult_testing_tidy) %>%
       bind_cols(adult_testing_tidy)

head(adult_test_predictions)

In [16]:
## Compute Accuracy
adult_prediction_accuracy <- adult_test_predictions %>%
         metrics(truth = label, estimate = .pred_class)             

adult_prediction_accuracy
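
Accuracy is the proportion of test observations whose predicted label matches the true label. The base R sketch below computes it by hand on made-up vectors (`truth` and `predicted` are illustrative, not our test results) and builds a confusion table to show where the errors fall.

```r
# Made-up true and predicted labels, NOT our test results
truth     <- c("<=50K", "<=50K", "<=50K", ">50K", ">50K",  "<=50K", ">50K", "<=50K")
predicted <- c("<=50K", "<=50K", ">50K",  ">50K", "<=50K", "<=50K", ">50K", "<=50K")

# Accuracy: share of predictions that match the truth
accuracy <- mean(predicted == truth)

# Confusion table: rows are true labels, columns are predicted labels
confusion <- table(truth = truth, predicted = predicted)

accuracy
confusion
```

With imbalanced labels, the confusion table is worth checking alongside accuracy, since a classifier can score well overall while rarely predicting the less common label correctly.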

In [None]:
## Visualizing Accuracy
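
One possible shape for the bar chart named in the Methods is sketched below: counts of correct and incorrect predictions for each true label. The data frame `toy_predictions` is a made-up stand-in with the same `label` and `.pred_class` columns as `adult_test_predictions`, and the choice of a dodged bar chart is an assumption rather than a fixed design.

```r
library(ggplot2)

# Made-up rows standing in for adult_test_predictions, NOT our test results
toy_predictions <- data.frame(
  label       = c("<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K",  ">50K",  ">50K"),
  .pred_class = c("<=50K", "<=50K", "<=50K", ">50K",  ">50K", "<=50K", "<=50K", ">50K")
)

# Mark each prediction as correct or incorrect
toy_predictions$outcome <- ifelse(
  toy_predictions$label == toy_predictions$.pred_class, "Correct", "Incorrect"
)

# Side-by-side bars of correct vs incorrect predictions for each true label
accuracy_bar_plot <- ggplot(toy_predictions, aes(x = label, fill = outcome)) +
  geom_bar(position = "dodge") +
  labs(x = "True label", y = "Number of observations", fill = "Prediction") +
  theme(text = element_text(size = 18))

accuracy_bar_plot
```

Splitting the bars by true label shows whether errors are concentrated in one class, which a single overall accuracy number would hide.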