# ML Challenge #1: Recommendation
#### by JP Palacios
This notebook details the process of implementing an auto-complete feature for citizen science checklist submissions.

## Introduction
This notebook covers each phase of implementing the auto-complete feature: configuration, input, pre-processing, test case configuration, the recommender algorithm, cross-validation, and output.

## Configuration
Before we begin, we have some file and library dependencies to sort out.
This feature heavily relies on the `numpy` and `numba` libraries for data array operations and faster performance, respectively.
The following Python code imports the necessary libraries and custom scripts.

In [2]:
import numpy as np
from numba           import jit, cuda
from rich.progress   import track

# import custom scripts
from common          import *
from main            import *

# note: added this to suppress numba deprecation warnings
import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

cannot be imported, run as script!


### Hyper-Parameter Tuning
This notebook provides the user with the ability to tune hyper-parameters to test the effects of parameter combination on the output score.
Our approach included three pre-processing methods, three distance metric functions, and a range of k values.

In [3]:
# hyper-parameter lists
all_preprocess_methods = ['normalizing', 'logarithms', 'clipping']
selected_preprocess_method = all_preprocess_methods[1]

metrics  = ['euclidean', 'cosine', 'jaccard']
k_values = [20, 40, 60, 80, 100]
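
As a rough illustration of how these lists could be combined for tuning, the sketch below enumerates every (pre-processing method, metric, k) combination with `itertools.product`. The name `param_grid` is an assumption made for this example and does not come from the project scripts.

```python
from itertools import product

# hyper-parameter lists, copied from the cell above
all_preprocess_methods = ['normalizing', 'logarithms', 'clipping']
metrics  = ['euclidean', 'cosine', 'jaccard']
k_values = [20, 40, 60, 80, 100]

# hypothetical name: one tuple per (pre-processing method, metric, k) combination
param_grid = list(product(all_preprocess_methods, metrics, k_values))

print(len(param_grid))  # 3 methods * 3 metrics * 5 k values = 45
print(param_grid[0])
```

Each tuple can then drive one test run, so every combination is scored under the same conditions.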

## Reading In Training & Testing Data Sets
Now, we can read in our training and testing sets using our custom `read_input` function found in the `common.py` script.
This function returns a dictionary containing a `numpy` array and its dimensions, each under its own key.
We will use `non_changed_test_set` later for cross-validation.
We also initialize a zero-filled array to hold our binary-encoded K-Nearest Neighbor (KNN) graph.

In [4]:
print('reading in our train and test sets...')
train_set = read_input('../train/cv_train/train_cvset_2.csv')
test_set  = read_input('../test/rand_test/test_randset_2.csv')

print('reading in our cross-validated test set...')
non_changed_test_set = read_input('../test/cv_test/test_cvset_2.csv')

knn_graph = np.zeros((train_set['width'], test_set['width']), dtype = int)

reading in our train and test sets...
reading in our cross-validated test set...
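
The real `read_input` lives in `common.py` and is not shown in this notebook. The following is only a minimal sketch of a function with the same shape of return value, assuming a comma-separated file of numbers with no header row. The function name `read_input_sketch` and the `height` key are assumptions; only the `data` and `width` keys appear in the cells above.

```python
import io
import numpy as np

def read_input_sketch(source):
    """Load a numeric CSV into a dict holding the array and its dimensions."""
    # ndmin=2 keeps a single-row file two-dimensional
    data = np.loadtxt(source, delimiter=',', ndmin=2)
    height, width = data.shape
    # 'height' is an assumed key name; 'data' and 'width' are used in the notebook
    return {'data': data, 'height': height, 'width': width}

# an in-memory stand-in for a CSV file, so the example runs without the data sets
demo = read_input_sketch(io.StringIO("0,2,1\n3,0,0\n"))
print(demo['height'], demo['width'])
```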


## Pre-Processing
Once we have our inputs stored, we can apply our first hyper-parameter: the pre-processing method.
This notebook will demonstrate performance using the `logarithms` method.

In [5]:
print(f'pre-processing our data to use {selected_preprocess_method} on our train and test sets...')
train_set['data'] = preprocess(train_set['data'], selected_preprocess_method)
test_set['data']  = preprocess(test_set['data'], selected_preprocess_method)

pre-processing our data to use logarithms on our train and test sets...
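
The body of `preprocess` is defined in the custom scripts rather than in this notebook. As a hedged sketch of what the three method names could mean, the example below assumes `normalizing` divides each row by its sum, `logarithms` applies `log(1 + x)`, and `clipping` caps values at a threshold. The function name `preprocess_sketch` and the `clip_max` parameter are hypothetical.

```python
import numpy as np

def preprocess_sketch(data, method, clip_max=1.0):
    """Apply one assumed pre-processing method to a 2-D array."""
    data = np.asarray(data, dtype=float)
    if method == 'normalizing':
        # assumed: scale each row to sum to 1; all-zero rows stay zero
        totals = data.sum(axis=1, keepdims=True)
        return np.divide(data, totals, out=np.zeros_like(data), where=totals != 0)
    if method == 'logarithms':
        # assumed: log(1 + x), which keeps zeros at zero and dampens large values
        return np.log1p(data)
    if method == 'clipping':
        # assumed: limit every value to the range [0, clip_max]
        return np.clip(data, 0.0, clip_max)
    raise ValueError(f'unknown pre-processing method: {method}')

counts = np.array([[0, 3, 1], [0, 0, 0]])
log_counts = preprocess_sketch(counts, 'logarithms')
print(log_counts)
```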


## Methodology
### Test Case Configuration

### KNN Algorithm
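
The project's KNN implementation is not reproduced here. As a generic illustration only, the sketch below picks the `k` nearest columns for each row of a distance matrix and records them in a binary-encoded graph. The function name and the toy distances are assumptions, and the orientation and shape of the notebook's own `knn_graph` may differ.

```python
import numpy as np

def k_nearest_sketch(distances, k):
    """Return, for each row, the indices of the k smallest distances, nearest first."""
    distances = np.asarray(distances, dtype=float)
    k = min(k, distances.shape[1])  # never request more neighbors than exist
    return np.argsort(distances, axis=1, kind='stable')[:, :k]

# toy distance matrix: 2 query rows against 3 candidates
distances = np.array([[0.9, 0.1, 0.5],
                      [0.2, 0.8, 0.3]])
neighbors = k_nearest_sketch(distances, k=2)

# binary encoding: 1 marks a candidate that is among the row's k nearest
graph = np.zeros(distances.shape, dtype=int)
np.put_along_axis(graph, neighbors, 1, axis=1)
print(neighbors)
print(graph)
```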

### Distance function
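
The three metric names in the hyper-parameter list have standard definitions. The sketch below uses those textbook forms for a pair of vectors and is not taken from the project scripts. Treating any value greater than zero as "present" for the Jaccard distance is an assumption.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def cosine(a, b):
    """One minus the cosine similarity; 1.0 if either vector is all zeros."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 1.0
    return float(1.0 - np.dot(a, b) / denom)

def jaccard(a, b):
    """One minus |intersection| / |union| of the non-zero positions."""
    a_set, b_set = np.asarray(a) > 0, np.asarray(b) > 0
    union = np.logical_or(a_set, b_set).sum()
    if union == 0:
        return 0.0  # two empty sets are treated as identical
    return float(1.0 - np.logical_and(a_set, b_set).sum() / union)

a = np.array([1, 0, 2, 0])
b = np.array([1, 1, 0, 0])
print(euclidean(a, b), cosine(a, b), jaccard(a, b))
```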

## Writing Out Recommendation Dataset
Python makes it easy to write out our recommendation data set for Kaggle submission.
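
The exact submission layout is not described in this notebook. Purely as a sketch, the example below writes a small integer array to a CSV file with `np.savetxt`. The array contents, the file name, and the one-row-per-checklist layout are all assumptions.

```python
import os
import tempfile
import numpy as np

# hypothetical recommendations: one row per checklist, one column per suggested item id
recommendations = np.array([[4, 7, 1],
                            [2, 9, 5]])

# write to a temporary directory so the example leaves the project folder untouched
out_path = os.path.join(tempfile.mkdtemp(), 'recommendations.csv')
np.savetxt(out_path, recommendations, fmt='%d', delimiter=',')

with open(out_path) as handle:
    written = handle.read()
print(written)
```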


## Assessing Performance & Future Work
Cross-validation helped assess the performance of our model locally.
The following image shows the command-line output of the script version when testing three cross-validated data sets.
![assessment](./images/cross_validation.png)
Each test case ran with its own set of parameters to give a better idea of how well our KNN algorithm worked.
The final step is to run the program with the best set of hyper-parameters to submit a Kaggle entry.
![submission](./images/kaggle_ready.png)

## References
[1] “NumPy Reference — NumPy v1.23 Manual,” numpy.org. https://numpy.org/doc/stable/reference/index.html#reference

[2] “Running Python script on GPU.,” GeeksforGeeks, Aug. 21, 2019. https://www.geeksforgeeks.org/running-python-script-on-gpu/
