# NVIDIA RAPIDS Correlation Analysis
NVIDIA RAPIDS' cuDF is a pandas-like DataFrame library that runs on NVIDIA GPUs through CUDA.

The steps below for setting up NVIDIA RAPIDS in Google Colab follow [this guide](https://www.analyticsvidhya.com/blog/2021/06/running-pandas-on-gpu-taking-it-to-the-moon/#:~:text=Pandas%20can%20handle%20a%20significant%20amount%20of%20data,a%20huge%20amount%20of%20data%20on%20the%20fly.).

## Install NVIDIA RAPIDS

In [1]:
! nvidia-smi

Thu Oct 13 13:44:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install pynvml

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pynvml
  Downloading pynvml-11.4.1-py3-none-any.whl (46 kB)
     |████████████████████████████████| 46 kB 2.2 MB/s
Installing collected packages: pynvml
Successfully installed pynvml-11.4.1


In [4]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 300, done.
remote: Counting objects: 100% (129/129), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 300 (delta 74), reused 99 (delta 55), pack-reused 171
Receiving objects: 100% (300/300), 87.58 KiB | 985.00 KiB/s, done.
Resolving deltas: 100% (136/136), done.
***********************************************************************
Woo! Your instance has the right kind of GPU, a Tesla T4!
***********************************************************************



In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

In [1]:
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:32
🔁 Restarting kernel...


In [1]:
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

✨🍰✨ Everything looks OK!


In [None]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'

## Imports

In [18]:
import cudf
import cuml
import pandas as pd
import matplotlib.pyplot as plt
import library as lb

## Connect to Google Drive

In [4]:
from google.colab import drive
drive.mount('/content/drive/')
%cd /content/drive/My\ Drive/DSI_Delta/Capstone/

Mounted at /content/drive/
/content/drive/My Drive/DSI_Delta/Capstone


## Load Data

In [5]:
X_train = cudf.read_hdf('./data/train_test_split/X_train_cite_seq.h5')
X_test = cudf.read_hdf('./data/train_test_split/X_test_cite_seq.h5')
Y_train = cudf.read_hdf('./data/train_test_split/Y_train_cite_seq.h5')
Y_test = cudf.read_hdf('./data/train_test_split/Y_test_cite_seq.h5')

  "Using CPU via Pandas to read HDF dataset, this may "


## Find Correlations between genes and proteins

In [6]:
type(X_train), type(X_test), type(Y_train), type(Y_test)

(cudf.core.dataframe.DataFrame,
 cudf.core.dataframe.DataFrame,
 cudf.core.dataframe.DataFrame,
 cudf.core.dataframe.DataFrame)

In [7]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((56790, 22050), (14198, 22050), (56790, 145), (14198, 145))

### Find Correlations between each protein and the set of genes

In [None]:
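dict_of_protein_gene_corrs = {}

# Skip the first four columns of Y_train; the remaining columns are looped over as proteins
proteins = Y_train.columns[4:]

for loop, protein in enumerate(proteins, start=1):
    # Correlate this protein with every gene, one pair at a time
    gene_corrs = []
    for gene in X_train.columns:
        gene_corrs.append(Y_train[protein].corr(X_train[gene]))

    dict_of_protein_gene_corrs[protein] = gene_corrs

    print(protein)

    # Report progress every 14 proteins, as a share of all proteins processed
    if loop % 14 == 0:
        print(f'{loop / len(proteins):.0%} complete')

# Rows are genes, columns are proteins
df_protein_gene_corrs_train = pd.DataFrame(dict_of_protein_gene_corrs, index = X_train.columns)

df_protein_gene_corrs_train.to_csv('./data/train_test_split/cite_seq_train_protein_gene_corrs.csv')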
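
The nested loop above makes one `corr` call per gene-protein pair, which is 22,050 × 141, or about 3.1 million calls. As a rough sketch of an alternative (not what this notebook ran), the same Pearson correlations can be computed in one matrix product: center every column, multiply the centered matrices, and divide by the product of the column norms. The function name `pearson_corr_matrix` and the random demo arrays are made up for illustration, and the sketch uses NumPy on the CPU so that it runs anywhere.

```python
import numpy as np

def pearson_corr_matrix(X, Y):
    """Pearson correlation between every column of X and every column of Y.

    X has shape (n_samples, n_genes) and Y has shape (n_samples, n_proteins);
    the result has shape (n_genes, n_proteins).
    """
    X = np.asarray(X, dtype=np.float64)
    Y = np.asarray(Y, dtype=np.float64)
    # Center each column so that dot products are proportional to covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Norms of the centered columns; the 1/n factors cancel in the ratio.
    x_norm = np.sqrt((Xc ** 2).sum(axis=0))
    y_norm = np.sqrt((Yc ** 2).sum(axis=0))
    # Constant columns have zero norm and yield NaN, so silence the warnings.
    with np.errstate(divide="ignore", invalid="ignore"):
        return (Xc.T @ Yc) / np.outer(x_norm, y_norm)

# Hypothetical stand-in data: 200 cells, 5 "genes", 3 "proteins".
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
Y_demo = rng.normal(size=(200, 3))

corrs = pearson_corr_matrix(X_demo, Y_demo)
print(corrs.shape)  # (5, 3)
```

On the real data the centered copy of `X_train` is as large as `X_train` itself, so processing the genes in column chunks may be needed to stay within memory. CuPy offers a largely NumPy-compatible API, so a GPU version along the same lines seems plausible, but that is an assumption and was not tested here.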

In [15]:
df_protein_gene_corrs_train.shape

(22050, 141)

This produces a table of correlations with one row per gene (22,050) and one column per protein (141).
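
One possible way to read such a table is to list, for each protein, the genes with the largest absolute correlation. The sketch below is illustrative only: the helper `top_genes_per_protein`, the toy values, and the gene and protein labels are all hypothetical, and it uses plain pandas on a small stand-in table in place of the saved CSV.

```python
import pandas as pd

def top_genes_per_protein(corr_table, k=3):
    """For each protein column, return the k genes with the largest |correlation|."""
    return {
        protein: corr_table[protein].abs().nlargest(k).index.tolist()
        for protein in corr_table.columns
    }

# Toy table with the same layout as above: rows are genes, columns are proteins.
toy = pd.DataFrame(
    {
        "protein_A": [0.9, -0.1, -0.7, 0.2],
        "protein_B": [0.05, 0.6, -0.8, 0.3],
    },
    index=["gene_1", "gene_2", "gene_3", "gene_4"],
)

top = top_genes_per_protein(toy, k=2)
print(top)
# {'protein_A': ['gene_1', 'gene_3'], 'protein_B': ['gene_3', 'gene_2']}
```

Ranking by absolute value keeps strongly negative correlations, such as `gene_3` in the toy data, alongside strongly positive ones.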