# Classifier Experiment
This notebook serves to explore the provided dataset, as well as introduce options for neural network architecture and the training process.

I have tried to structure the individual components so that you can easily reuse the functions in your own modules, should you wish.

---
## 1. Setup
This notebook comes with a `requirements.txt` file containing dependencies for you to install.
It was run on Python 3.10.6. The dependencies should also work on newer Python versions, but some of the typing features used in the code provided here require **at least** Python 3.10, so the code will not run on older interpreters.
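
If you are unsure which interpreter your environment picked up, a quick check like the following can save some confusion. This is only a sketch: the helper name `python_is_supported` and the constant `MIN_PYTHON` are made up for illustration and are not part of the notebook.

```python
import sys

# Minimum version stated in the text above.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    print(
        f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
        f"found {sys.version.split()[0]}."
    )
```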

To set up an environment to run these experiments (assuming your operating system is Windows), run through the following steps:

1. Create a Python environment.
```bash
python -m venv venv
```

2. Activate the environment and install dependencies.
```bash
venv\Scripts\activate
pip install -r requirements.txt
```
A side note: Windows may ~~bitch and moan about~~ deny access to the provided script without you modifying privileges. If so, follow the instructions provided in the terminal. Thanks, Windows.

3. Have fun with the Jupyter notebook. If you have questions about Jupyter, let me know.

> A quick side note: this setup will download the GPU-enabled binaries for PyTorch, which require you to have CUDA set up (and a CUDA-enabled GPU in your rig, which I think you do; you have an RTX 3060, right?).
> You can either set up CUDA on your machine or not bother; the code will pick the best device available, so it will also run on CPU. It will be significantly slower but since we're working with fairly simple data, I don't anticipate that that will be an issue.
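
Picking the best available device usually comes down to a single check. The sketch below shows one common way to do it; the helper name `pick_device` is hypothetical, and falling back to CPU when PyTorch cannot be imported is an assumption made so the snippet runs anywhere, not necessarily what the training code does.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA device, otherwise "cpu"."""
    try:
        import torch  # expected to be installed via requirements.txt
    except ImportError:
        # Assumption for this sketch: no PyTorch means we report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```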

---
## 2. Exploring the Data
In this section, we will load all the data into memory and explore it a bit.
I will also create an artifact in persistent storage that you can use moving forward so you don't have to apply the same preprocessing steps every time.

In [2]:
import logging
from uuid import UUID
from os import walk, path
from tqdm import tqdm
import pandas as pd

# Note that I have configured the logger not to print debug statements while actually using some.
# If you want to debug further, set to `logging.DEBUG`.
logging.basicConfig(level=logging.INFO)

LOGGER = logging.getLogger("experiment_notebook")
ARTIFACT_NAME = "complete_data.csv"


def load_data(
    root: str,
    dir_positives: str,
    dir_negatives: str,
    cache: bool = True,
) -> pd.DataFrame:
    """
    Load data from the provided directories and store the compiled dataframe in your file system.
    If such a cache is present, the function will load from it instead.
    """
    try:
        data = pd.read_csv(path.join(root, ARTIFACT_NAME))
        LOGGER.info("Loaded dataset from cache.")
        return data
    except FileNotFoundError:
        LOGGER.info(
            "Could not find dataset in cache, will attempt to build from scratch."
        )
    positives = read_raw_data_from_directory(path.join(root, dir_positives))
    positives["cancer"] = 1
    negatives = read_raw_data_from_directory(path.join(root, dir_negatives))
    negatives["cancer"] = 0
    # Renumber the rows so the combined frame has a unique index.
    data = pd.concat([positives, negatives], ignore_index=True)
    LOGGER.info("Finished building dataset from scratch.")
    if cache:
        dataset_filepath = path.join(root, ARTIFACT_NAME)
        if not path.exists(dataset_filepath):
            LOGGER.info("Persisting dataset in '%s'.", dataset_filepath)
            data.to_csv(dataset_filepath, index=False)
    return data


def read_raw_data_from_directory(dirname: str) -> pd.DataFrame:
    """
    Convenience function for traversing all subdirectories from the provided starting directory,
    loading all relevant files into memory, and into a usable format.

    @Josh: You could conceivably use the filepath as the index (or just add it as a column) so you
    can later map individual rows in your dataset onto the source datasets. I just didn't bother
    because I didn't really see a value here.

    Parameters
    ----------
    dirname : str
        The name of the directory (absolute or relative) containing data.

    Returns
    -------
    A single DataFrame containing all data from the directory.
    """
    LOGGER.info("Walking data directory %s and reading in files...", dirname)
    rows = []
    for root, _, files in tqdm(walk(dirname, topdown=False)):
        for filename in files:
            # Ugly workaround for identifying UUID because I couldn't be FUCKED to write a regex rn.
            try:
                _ = UUID(filename.split(".")[0])
            except ValueError:
                LOGGER.debug(
                    "Skipped file because it did not start with a UUID-like string: %s",
                    filename,
                )
                continue
            rows.append(read_txt_data_file(root, filename))
    # Each row is labelled with its Series name; replace that with a plain 0..n-1 index.
    data = pd.DataFrame(rows).reset_index(drop=True)
    return data


def read_txt_data_file(directory: str, filename: str) -> pd.Series:
    """
    Read a miRNA file into memory.

    Parameters
    ----------
    directory : str
        The directory that contains the file. It may be nested, so this string can span several
        path segments; it always includes the root directory.
    filename : str
        The name of the file to be loaded, including file ending.

    Returns
    -------
    A series of floats where each index corresponds to a miRNA ID.
    """
    filepath = path.join(directory, filename)
    LOGGER.debug("Loading data from file %s", filepath)
    # We assume consistently tabular style in all txt files. This is technically a bad assumption to
    # make but I'm not writing code for a product.
    data = pd.read_csv(filepath, sep="\t", index_col="miRNA_ID")
    # We can drop all irrelevant data to return a simple feature vector.
    return data["reads_per_million_miRNA_mapped"]
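
The UUID filename check and the idea of keeping track of source files (see the note in `read_raw_data_from_directory`) can be combined in a small standalone helper. This is a sketch using only the standard library; `is_uuid_filename` and `find_uuid_files` are hypothetical names, not functions from the notebook.

```python
from os import path, walk
from uuid import UUID


def is_uuid_filename(filename: str) -> bool:
    """Return True if the part before the first dot parses as a UUID."""
    try:
        UUID(filename.split(".")[0])
    except ValueError:
        return False
    return True


def find_uuid_files(dirname: str) -> list[str]:
    """Walk `dirname` and return the sorted paths of all files named after a UUID."""
    filepaths = []
    for root, _, files in walk(dirname):
        for filename in files:
            if is_uuid_filename(filename):
                filepaths.append(path.join(root, filename))
    return sorted(filepaths)
```

With the list of paths in hand, you could store each path alongside the row you read from it, which lets you trace any sample back to its source file later.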


Please note that we will use lowercase variable names in the following, even though we are creating variables in the global namespace.

**PEP 8 reserves uppercase names for module-level constants** (such as `LOGGER` and `ARTIFACT_NAME` above); ordinary global variables follow the same lowercase convention as functions. Jupyter notebooks blur this line because almost everything you define in a cell ends up in the global namespace.

I encourage you to be mindful of this fact regardless.

In [6]:
data = load_data(
    root="Databases",
    dir_positives="miRNA Files - Lung Cancer",
    dir_negatives="miRNA Files - Normal",
)

# Let us look at what the data looks like.
data.head()
data.tail()

INFO:experiment_notebook:Loaded dataset from cache.


Unnamed: 0,cancer


Let's now go a little deeper into the data at hand.

In [7]:
print(f"Dataset contains a total of {len(data)} samples.")
label_occurrences = data["cancer"].value_counts()

print(
    "If you use 20% of your data for validation, this will leave you with "
    f"{round(len(data)*0.8)} training samples."
)

num_negatives = label_occurrences[0]
print(
    f"{num_negatives} of the samples ({round(num_negatives/len(data)*100, 2)}%) are negative."
)

num_positives = label_occurrences[1]
print(
    f"{num_positives} of the samples ({round(num_positives/len(data)*100, 2)}%) are positive."
)

Dataset contains a total of 0 samples.
If you use 20% of your data for validation, this will leave you with 0 training samples.


  num_negatives = label_occurrences[0]


IndexError: index 0 is out of bounds for axis 0 with size 0

Okay, we have a perfectly balanced dataset. This is neat! That means we don't need to bother with fancy sampling techniques, class weights or sample weights.

Unfortunately, 3576 samples is... not a lot.
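
Since the classes are balanced, it makes sense to keep them balanced in both halves of the 80/20 split mentioned above. The following is a minimal sketch of a stratified split using only pandas; the function name `stratified_split`, its defaults, and the toy data are assumptions for illustration.

```python
import pandas as pd


def stratified_split(
    data: pd.DataFrame,
    label_col: str = "cancer",
    val_fraction: float = 0.2,
    seed: int = 0,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split `data` into (train, validation), sampling each class separately."""
    # A unique index is needed so that dropping validation rows is unambiguous.
    data = data.reset_index(drop=True)
    validation = data.groupby(label_col).sample(frac=val_fraction, random_state=seed)
    train = data.drop(index=validation.index)
    return train, validation


# Toy data with 10 negatives and 10 positives.
toy = pd.DataFrame({"hsa-a": range(20), "cancer": [0] * 10 + [1] * 10})
train, validation = stratified_split(toy)
print(len(train), len(validation))
```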

Next, let's look at the features.

In [None]:
feature_cols = [col for col in data if col.startswith("hsa")]

print(
    f"Our input feature vector will likely be {len(feature_cols)}-dimensional."
)

Uh-oh. That's... a lot of features, especially given the size of our dataset. Let's check whether any columns are "dead", meaning they have no variance. Such columns carry no information for the classifier to learn from.

In [None]:
dead_cols = []
for col in feature_cols:
    std = data[col].std()
    if std == 0:
        dead_cols.append(col)
print(f"{len(dead_cols)} columns are dead!")

This is good because it reduces the feature space but it still leaves us with

In [None]:
len(feature_cols) - len(dead_cols)

feature columns. That's still a lot!
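
If you want to actually remove the dead columns, the loop above can be replaced with a vectorized version. Note that this sketch swaps the `std == 0` test for a count of distinct values, which avoids comparing floats to zero; `drop_dead_columns` is a hypothetical helper name.

```python
import pandas as pd


def drop_dead_columns(data: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    """Return a copy of `data` without feature columns that hold a single value."""
    # nunique() counts distinct values per column; one or fewer means no variance.
    distinct_counts = data[feature_cols].nunique()
    dead_cols = distinct_counts[distinct_counts <= 1].index
    return data.drop(columns=dead_cols)


toy = pd.DataFrame(
    {"hsa-a": [0.0, 0.0, 0.0], "hsa-b": [1.0, 2.0, 3.0], "cancer": [1, 0, 1]}
)
cleaned = drop_dead_columns(toy, ["hsa-a", "hsa-b"])
print(list(cleaned.columns))
```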

## 2.1 Next Steps
What are our options? Well, you have a few. You could:


### 2.1.1 Just Throw It All In And See What Happens
Honestly, why the fuck not. If it performs like shit, you can still follow a different approach.

### 2.1.2 Do Feature Selection
Considering the ratio of dimensions to samples, your classifier is unlikely to properly cut through the noise (assuming that some miRNA IDs carry no indicative value for cancer). You could compute the correlation coefficient between each individual feature and the label, and define a "dead zone" around zero: any feature whose correlation falls inside it gets kicked out.

Be mindful, though: a _combination_ of miRNA values may relate to the label even when each value on its own does not. A per-feature analysis cannot see that, so you would potentially remove meaningful data.
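
The dead-zone idea can be sketched in a few lines of pandas. The function name `select_by_correlation` and the default threshold of 0.05 are assumptions chosen for illustration; a sensible threshold depends on your data.

```python
import pandas as pd


def select_by_correlation(
    data: pd.DataFrame,
    feature_cols: list[str],
    label_col: str = "cancer",
    dead_zone: float = 0.05,
) -> list[str]:
    """Return the features whose absolute Pearson correlation with the label exceeds `dead_zone`."""
    correlations = data[feature_cols].corrwith(data[label_col])
    # Constant columns yield NaN, and NaN > dead_zone is False, so they are dropped too.
    keep = correlations.abs() > dead_zone
    return list(correlations[keep].index)


toy = pd.DataFrame(
    {
        "hsa-signal": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "hsa-noise": [1.0, -1.0, 1.0, 1.0, -1.0, 1.0],
        "cancer": [0, 0, 0, 1, 1, 1],
    }
)
selected = select_by_correlation(toy, ["hsa-signal", "hsa-noise"])
print(selected)
```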

### 2.1.3 Do Dimensionality Reduction
Technically, neural networks already kinda sorta do this. But you _could_ do some of it yourself as a preprocessing step and then use the resulting data in your classification problem.
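
As one example of such a preprocessing step, here is a minimal sketch of principal component analysis (PCA) written with plain NumPy. PCA is just one option picked for illustration, and `pca_reduce` is a hypothetical helper, not something the notebook provides.

```python
import numpy as np


def pca_reduce(features: np.ndarray, n_components: int) -> np.ndarray:
    """Project `features` (samples x features) onto its top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of `vt` are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))
reduced = pca_reduce(x, 3)
print(reduced.shape)
```

On the real data you would pass something like `data[feature_cols].to_numpy()`. To avoid leaking information, you would normally compute the projection on the training split only and then apply it to the validation split.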

### 2.1.4 Do Fancy Architectures
A [cursory glance at relevant literature in the biomedical space](https://www.nature.com/articles/s42256-023-00744-z) suggests a bunch of options you could follow already. If there is an actual spatial component to the individual miRNA IDs (like, some sort of relative distance to one another) you could, for example, consider a convolutional approach.
There is some intersection with `2.1.3` but whateverrrrrr.

### Final Words
I am very tired. Please excuse any inaccuracies that might be in this bitch. I hope you've found it instructive and I would suggest you just try option `2.1.1` at first and work forward from there.