## Goal
Determine whether a generalized Dino model trained on single-channel (1-D) images can perform as well as the specialized model trained on 4-channel (4-D) images. The HPA FOV 4-channel images are the only dataset used.

### Dataset
Images were downloaded using the `custom_scripts/HPA_IMG_download.py` script, and then were cleaned with scripts `img_corrupted_check.py` and `img_corrupted_check.py`

### Package Environment
I’ve made a Conda environment to freeze the libraries that I’m using: `conda_cuda12.yaml`

Data visualization and analysis needed a much larger set of libraries on top of that, so I made an extended environment: `conda_cuda12_notebook_analysis.yaml`

### Compute Environment
Training was done on Google Cloud using their V100 GPU instances, and the data was then transferred to their CPU-only instances under the free account nickmarveaux@gmail.com.

## Experimental Design
Create a 4-D model and a 1-D model of Dino, trained on the same ~75,000 images resized to 512x512, with all other parameters kept the same (the same number of epochs, 50, and the same transformer, Vit_Tiny). Then generate feature embeddings of those ~75,000 images with each Dino model independently, and train a separate classifier on each set of embeddings, again with all parameters equal. One notable difference: the input to the classifier trained on 4-D images is a vector of size 192, whereas the input to the second classifier is a stack of four 1-D embeddings of that size, i.e. a vector of size 768. Finally, the two classifiers are evaluated on the same test data from Kaggle's HPA competition using the same process: embeddings are generated by the two Dino models, stacked in the case of the 1-D Dino model, then passed to the classifiers.
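The stacking step for the 1-D model can be sketched as below. This is a minimal illustration, not the project's code: the function name `stack_channel_embeddings` is hypothetical, and it assumes that "stacking" means concatenating the four per-channel embeddings in a fixed channel order that is identical at training and test time.

```python
from typing import List, Sequence

EMBED_DIM = 192   # embedding size stated in the experimental design
NUM_CHANNELS = 4  # one embedding per image channel


def stack_channel_embeddings(
    channel_embeddings: Sequence[Sequence[float]],
) -> List[float]:
    """Concatenate per-channel embeddings into one classifier input vector.

    Assumption: the caller always passes channels in the same fixed order.
    """
    if len(channel_embeddings) != NUM_CHANNELS:
        raise ValueError(
            f"expected {NUM_CHANNELS} embeddings, got {len(channel_embeddings)}"
        )
    stacked: List[float] = []
    for emb in channel_embeddings:
        if len(emb) != EMBED_DIM:
            raise ValueError(f"expected size {EMBED_DIM}, got {len(emb)}")
        stacked.extend(emb)
    return stacked


# Four dummy 192-long embeddings become one 768-long vector.
dummy = [[float(c)] * EMBED_DIM for c in range(NUM_CHANNELS)]
vector = stack_channel_embeddings(dummy)
print(len(vector))
```

Checking the sizes up front makes a missing or truncated channel fail loudly instead of silently producing a shorter classifier input.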

### Configs

All configs for this project were placed in the directory `exploratory_configs`.

The 4d Dino model was generated with yaml —  and is located at — on instance-11

The 1d Dino model was generated with yaml —  and is located at — on instance-11

The 1d classifier was generated with yaml and is located at on instance-11

The 4d classifier was generated with yaml and is located at on instance-11

### Evaluation

The program prepare_kaggle_submission.py was run by 

It was modified to output the highest probability class even when there is uncertainty, in this commit: https://github.com/nickdeveaux/Dino4Cells_analysis/commit/e17db1a9815c95de2a4f4d79fb2d83c44df22f9e
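The idea behind that change can be sketched as follows. This is not the code from the commit: the function name `select_labels` and the default threshold of 0.5 are assumptions made for illustration. Classes at or above the threshold are kept, and if none qualify the single most probable class is returned rather than an empty prediction.

```python
from typing import List, Sequence


def select_labels(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return class indices at or above threshold, else the argmax class."""
    if not probs:
        raise ValueError("probs must be non-empty")
    chosen = [i for i, p in enumerate(probs) if p >= threshold]
    if not chosen:
        # No class is confident enough: fall back to the most likely one.
        chosen = [max(range(len(probs)), key=lambda i: probs[i])]
    return chosen


print(select_labels([0.1, 0.7, 0.6]))  # classes 1 and 2 pass the threshold
print(select_labels([0.1, 0.3, 0.2]))  # nothing passes, so class 1 is used
```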

The `submission.csv` files received these scores:

#### 1-D
Score: 0.21374
Private score: 0.189
Submitted: January 18 2024

#### 4-D
Score: 0.21157
Private score: 0.18859
Submitted: January 7 2024

Submitted here on Kaggle https://www.kaggle.com/c/human-protein-atlas-image-classification/data under account nrdeveaux@gmail.com

### Learnings:
* GPU usage in `nvidia-smi` was originally close to zero because the images were too large: the majority of the time was spent on the CPU resizing the images down to 512x512.
* Data cleanliness was also an issue. Because the data was downloaded with a script, corrupted files had to be cleaned out explicitly, and then again after splitting the images into 1-D PNGs. The list of valid 1-D PNGs for training was not 1:1 with the training labels in the training CSV, which led to useless classifiers that would do well on the training set only to output nonsense, because they had been trained on jumbled labels.
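One way to guard against the label mismatch is to filter the label rows down to images whose channel PNGs all survived cleaning, before any embeddings are generated. The sketch below is illustrative only: the file naming scheme (`<id>_<channel>.png`), the channel suffixes, and the `Id` column name are assumptions, not taken from the project's scripts.

```python
from typing import Dict, Iterable, List, Set

# Assumed channel suffixes; adjust to the actual file naming.
CHANNELS = ("red", "green", "blue", "yellow")


def complete_ids(valid_png_names: Iterable[str]) -> Set[str]:
    """Return image IDs for which every channel PNG is present."""
    seen: Dict[str, Set[str]] = {}
    for name in valid_png_names:
        stem = name.rsplit(".", 1)[0]
        # Split on the last underscore so IDs may contain underscores.
        image_id, _, channel = stem.rpartition("_")
        if image_id and channel in CHANNELS:
            seen.setdefault(image_id, set()).add(channel)
    return {i for i, chans in seen.items() if chans == set(CHANNELS)}


def align_labels(
    label_rows: Iterable[Dict[str, str]], valid_png_names: Iterable[str]
) -> List[Dict[str, str]]:
    """Keep only label rows with all channels present, in original order."""
    keep = complete_ids(valid_png_names)
    return [row for row in label_rows if row["Id"] in keep]


names = ["a_1_red.png", "a_1_green.png", "a_1_blue.png", "a_1_yellow.png",
         "b_2_red.png"]  # b_2 lost three channels during cleaning
rows = [{"Id": "a_1", "Target": "0 5"}, {"Id": "b_2", "Target": "3"}]
print(align_labels(rows, names))
```

Generating the embeddings by iterating over the filtered rows, rather than over a directory listing, keeps features and labels indexed by the same image.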

## Appendix

### Useful commands
* `watch -n 0.5 nvidia-smi` for viewing GPU usage.
* `sudo su nick` had to be run on the nickmarveaux account to get to the proper account.
* Branch `origin/ndv_run_end_to_end_without_pretrained_features` was used as of December for modifications to the `run_get_features.py` scripts, since I had low confidence that I fully understood the impact of those code changes.
* `gcloud auth login` is needed to switch between Google Cloud accounts.
* `gcloud compute ssh --zone "us-east4-c" "instance-11" --project "focal-slice-407815"` is the ssh command to get onto the instance; `focal-slice-407815` is the ID of the GCP project.