# CIS 678 Challenge #2: Prediction

## Motivation
The purpose of this challenge is to predict multi-modal data by using information about relationships between features across samples in each modality [citation needed].

The two modalities are ADT and RNA, with 25 and 600+ features, respectively.
Given the training ADT and RNA assays, as well as the testing RNA assay, predict the testing ADT assay.

## Notebook Setup
Before attempting to solve this parallel multivariate problem, a few Python libraries are required.
The `numpy` library provides multi-dimensional arrays and the operations to store and manipulate them.
The `numba` library can speed up numerical code that operates on `np.ndarray` data by JIT-compiling it.
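
As a rough illustration of how `numba` is typically applied, the sketch below compiles a loop-heavy helper with `@njit`. The function `column_means` and the `demo` array are hypothetical and not part of this notebook, and the fallback decorator is only there so the sketch still runs (without any speed-up) if `numba` is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without numba (no speed-up in that case).
    def njit(func):
        return func


@njit
def column_means(matrix):
    """Mean of each column, written with explicit loops that numba can compile."""
    n_rows, n_cols = matrix.shape
    means = np.zeros(n_cols)
    for j in range(n_cols):
        total = 0.0
        for i in range(n_rows):
            total += matrix[i, j]
        means[j] = total / n_rows
    return means


# Hypothetical 4 x 3 demo matrix; the first call triggers compilation.
demo = np.arange(12, dtype=np.float64).reshape(4, 3)
demo_means = column_means(demo)
```

Explicit loops like these are slow in plain Python but are the kind of code `numba` tends to handle well; vectorized `numpy` calls usually gain much less.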


In [None]:
import logging  # explicit import for logging.DEBUG used below

import numpy as np
from numba import jit, cuda

from notebook_utils import *

# message level set to debugging mode
configure_logging(logging.DEBUG)

## Reading Inputs
First, we have to read in our `.csv` files using a custom function imported from our `notebook_utils` module.
We also create our predicted matrix, `test_adt`, which will be used to write out our final Kaggle submission.

In [None]:
# read in our training and testing data sets
train_rna = read_input('../train/training_set_rna.csv')
train_adt = read_input('../train/training_set_adt.csv')
test_rna  = read_input('../test/test_set_rna.csv')

# creating our predicted testing ADT set
test_adt  = create_matrix(train_adt['height'], test_rna['width'])

x = linear_solve(train_rna, train_adt)

#> selecting our model hyper-parameters
models    = ['linear_regression']
parameter = [1, 2, 3]

test_cases = []

for test_case in track(enumerate(test_cases)):
    test_case = run()

#> model.fit(train_rna, train_adt)

# predict
test_adt = predict_using(test_rna)

# write our output
write_output('../out/debug/kaggle_challenge_2.csv', test_adt['data'])


## Processing Data
This section details the procedures used to pre-process our training and testing data sets.

### Solver Method: Linear Algebra (Check this method)
This problem is a multivariate system of linear equations with one right-hand side per ADT feature, so the ADT features can be solved for in parallel.
<!-- Let $A_{0}$ be our `train_rna` data set, $y_{0}$ be our `train_adt` data set, $A_{1}$ be our `test_rna` data set, and $y_{1}$ be our `test_adt`.
We can use a multivariate system of equations to solve for $y_{1}$ such that,

$$A_{0}x = y_{0}$$

and

$$A_{1}x = y_{1}$$

We can solve for $x$ in the least-squares sense by multiplying both sides of the first equation by $A_{0}^{T}$, which gives the relationship,

$$ax=b$$

such that,

$$a := A_{0}^{T} A_{0}$$

and

$$b := A_{0}^{T} y_{0}$$

Now, assuming $a$ is invertible, $x$ can be solved:

$$x = a^{-1} b$$

The final step is to return to our second equation, the one for the testing set, and plug in our newly found $x$:

$$y_{1} = A_{1} a^{-1} b$$

This expands to the following using only data set variables:

$$y_{1} = A_{1} \left(A_{0}^{T} A_{0}\right)^{-1} A_{0}^{T} y_{0}$$

Which solves our missing `test_adt` data set. -->
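
A minimal sketch of this approach with `numpy`, assuming samples are stored as rows and features as columns. It solves the normal equations $A_{0}^{T} A_{0} x = A_{0}^{T} y_{0}$ and then predicts with $A_{1} x$. The helper `solve_normal_equations` and the synthetic arrays are hypothetical stand-ins, not the notebook's `linear_solve` or the real assay data.

```python
import numpy as np


def solve_normal_equations(A0, y0):
    """Least-squares coefficients from the normal equations (A0^T A0) x = A0^T y0."""
    a = A0.T @ A0
    b = A0.T @ y0
    # Solve the linear system directly rather than forming an explicit inverse.
    return np.linalg.solve(a, b)


# Synthetic stand-ins: 50 training samples, 4 input features, 3 output features.
rng = np.random.default_rng(0)
A0 = rng.normal(size=(50, 4))      # plays the role of train_rna
x_true = rng.normal(size=(4, 3))   # coefficients used to generate the targets
y0 = A0 @ x_true                   # plays the role of train_adt
A1 = rng.normal(size=(10, 4))      # plays the role of test_rna

x_hat = solve_normal_equations(A0, y0)
y1_hat = A1 @ x_hat                # plays the role of test_adt
```

If $A_{0}^{T} A_{0}$ is singular or badly conditioned, which can happen with hundreds of correlated features, `np.linalg.lstsq(A0, y0, rcond=None)` is likely a more robust choice than solving the normal equations directly.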

In [None]:
# x = linear_solve(train_rna, train_adt)


### Solver Method: Gradient Descent
This section tests our gradient descent function.

<!-- #### Algorithm
$$x \leftarrow 0$$
$$\forall i \in [0, b'length):$$
$$\beta \leftarrow$$ -->
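
One possible shape for such a function is sketched below, assuming a mean-squared-error loss and samples stored as rows. It starts from $x = 0$ and repeatedly steps against the gradient $\frac{1}{n} A^{T}(Ax - y)$. The name `gradient_descent`, its parameters, and the synthetic data are assumptions for illustration, not the notebook's own implementation.

```python
import numpy as np


def gradient_descent(A, y, learning_rate=0.01, n_iters=2000):
    """Minimize (1 / 2n) * ||A x - y||^2 with fixed-step gradient descent."""
    n_samples = A.shape[0]
    x = np.zeros((A.shape[1], y.shape[1]))  # start from x = 0
    for _ in range(n_iters):
        residual = A @ x - y
        gradient = (A.T @ residual) / n_samples
        x -= learning_rate * gradient
    return x


# Synthetic, noise-free demo: 100 samples, 3 input features, 2 output features.
rng = np.random.default_rng(1)
A_demo = rng.normal(size=(100, 3))
x_demo = rng.normal(size=(3, 2))
y_demo = A_demo @ x_demo

x_gd = gradient_descent(A_demo, y_demo, learning_rate=0.1, n_iters=2000)
```

The learning rate has to be small enough relative to the largest eigenvalue of $A^{T}A / n$ for the iteration to converge, so real data would probably need feature scaling or a tuned step size.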


In [None]:
# todo: finish this task
# todo: figure out how multivariate systems work for this challenge
# todo: cross-validation (higher score is better)
# todo: implement test-case framework
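
For the cross-validation item, a k-fold loop could look roughly like the sketch below. It assumes samples are rows, uses `np.linalg.lstsq` as a placeholder model, and scores each fold with the Pearson correlation between predictions and held-out targets. The scoring metric, the function `k_fold_scores`, and the synthetic data are all assumptions; the actual challenge metric may differ.

```python
import numpy as np


def k_fold_scores(A, y, n_folds=5, seed=0):
    """Return one validation score per fold (Pearson correlation, higher is better)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(A.shape[0])
    folds = np.array_split(indices, n_folds)
    scores = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Placeholder model: ordinary least squares fit on the training folds.
        x, *_ = np.linalg.lstsq(A[train_idx], y[train_idx], rcond=None)
        prediction = A[val_idx] @ x
        score = np.corrcoef(prediction.ravel(), y[val_idx].ravel())[0, 1]
        scores.append(score)
    return np.array(scores)


# Synthetic demo data: a linear signal plus a small amount of noise.
rng = np.random.default_rng(2)
A_cv = rng.normal(size=(60, 5))
y_cv = A_cv @ rng.normal(size=(5, 2)) + 0.1 * rng.normal(size=(60, 2))

cv_scores = k_fold_scores(A_cv, y_cv)
```

Averaging `cv_scores` gives a single number for comparing models or hyper-parameters before submitting.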

## Writing Outputs
Lastly, we have to write our `.csv` Kaggle submission.

In [None]:
# write_output('../out/debug/kaggle_challenge_2.csv', test_adt['data'])