<a target="_blank" href="https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/rapids-pip-colab-template.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Install RAPIDS into Colab"/>
</a>

# RAPIDS cuDF is already on your Colab instance!
RAPIDS cuDF is preinstalled on Google Colab and accelerates pandas with zero code changes. [You can quickly get started with our tutorial notebook](https://nvda.ws/rapids-cudf). This notebook template is for users who want to use the full suite of RAPIDS libraries for their workflows on Colab.  

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Run `!nvidia-smi` to see which GPU you have; uncomment the cell below to do that.  Currently, RAPIDS runs on all available Colab GPU instances.

In [None]:
# !nvidia-smi
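
If you would rather check the GPU from Python than read the `nvidia-smi` table, a minimal sketch follows. It assumes `nvidia-smi` is on the `PATH` when a GPU runtime is attached; `gpu_names` is a hypothetical helper name and is not part of RAPIDS.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    # nvidia-smi is typically missing on CPU-only runtimes.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code usually means the driver cannot see a GPU.
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(gpu_names() or "No GPU detected: switch the runtime type to GPU.")
```

An empty list means the runtime type still needs to be changed to _GPU_ as described above.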

# Setup
This setup script:

1. Checks that the GPU is RAPIDS compatible
1. Pip installs the RAPIDS libraries, which are:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuxFilter
  1. cuCIM
  1. xgboost

# Controlling Which RAPIDS Version is Installed
This line in the cell below, `!python rapidsai-csp-utils/colab/pip-install.py`, kicks off the RAPIDS installation script.  You can control which RAPIDS version is installed by appending `latest` or `nightlies`, or by leaving the option blank for the default.  Example:

`!python rapidsai-csp-utils/colab/pip-install.py <option>`

You can now tell the script to install:
1. **RAPIDS + Colab default version**: leave the install script option blank (or give an invalid option) to add the rest of the RAPIDS libraries to the RAPIDS cuDF library preinstalled on Colab.  **This is the default and recommended version.**  Example: `!python rapidsai-csp-utils/colab/pip-install.py`
1. **Latest known working RAPIDS stable version**: use the option `latest` to upgrade all RAPIDS libraries to the latest working RAPIDS stable version.  This is usually early access to future RAPIDS+Colab functionality, so some functionality may not work; it can also be the same as the default version.  Example: `!python rapidsai-csp-utils/colab/pip-install.py latest`
1. **Current nightlies version**: use the option `nightlies` to install the current RAPIDS nightlies.  For RAPIDS developer use; **not recommended/untested**.  Example: `!python rapidsai-csp-utils/colab/pip-install.py nightlies`
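
The three choices differ only in the argument appended to the script call. As a sketch, the hypothetical helper below (`install_command` is not part of `rapidsai-csp-utils`) builds the cell command and mirrors the documented fallback, where a blank or unrecognised option gives the default install. It only builds the string; it does not run the installer.

```python
SCRIPT = "rapidsai-csp-utils/colab/pip-install.py"
KNOWN_OPTIONS = {"latest", "nightlies"}


def install_command(option=""):
    """Build the notebook command for a given install option.

    Blank or unrecognised options map to the default install, which is
    how the text above describes the script's own behaviour.
    """
    option = option.strip()
    if option in KNOWN_OPTIONS:
        return f"!python {SCRIPT} {option}"
    return f"!python {SCRIPT}"


print(install_command())
print(install_command("latest"))
```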


**This will complete in about 5-6 minutes**

In [5]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py


Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 592, done.
remote: Counting objects: 100% (158/158), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 592 (delta 125), reused 82 (delta 82), pack-reused 434 (from 3)
Receiving objects: 100% (592/592), 194.79 KiB | 8.12 MiB/s, done.
Resolving deltas: 100% (299/299), done.
Installing RAPIDS remaining 25.04 libraries
Using Python 3.11.11 environment at: /usr
Resolved 160 packages in 926ms
Downloading cuspatial-cu12 (4.1MiB)
Downloading raft-dask-cu12 (274.9MiB)
Downloading pylibcugraph-cu12 (2.0MiB)
Downloading rmm-cu12 (1.5MiB)
Downloading cudf-cu12 (1.7MiB)
Downloading libcudf-cu12 (538.8MiB)
Downloading libcugraph-cu12 (1.4GiB)
Downloading cugraph-cu12 (3.0MiB)
Downloading cucim-cu12 (5.6MiB)
Downloading cuml-cu12 (9.4MiB)
Downloading librmm-cu12 (2.9MiB)
Downloading libcuml-cu12 (404.9MiB)
Downloading cuproj-cu12 (1.1MiB)
Downloading libcuspatial-cu12 (31.1MiB)
Downloading dask (1.3MiB)
D

# RAPIDS is now installed on Colab.  
You can copy your code into the cells below, or run them as they are to validate your RAPIDS installation and version.  
# Enjoy!

In [6]:
import cudf
cudf.__version__

'25.02.01'

In [10]:
import cuml
cuml.__version__

'25.02.01'

In [11]:
# import cugraph
# cugraph.__version__

In [13]:
# import cuspatial
# cuspatial.__version__

In [15]:
# import cuxfilter
# cuxfilter.__version__
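
The per-library cells above can also be collapsed into one loop. This is a sketch using only the standard library's `importlib`; the module list repeats the libraries named in the setup section, and `installed_versions` is a hypothetical helper name. Libraries that are missing or fail to import are reported as not installed instead of raising.

```python
import importlib

RAPIDS_MODULES = ["cudf", "cuml", "cugraph", "cuspatial", "cuxfilter", "cucim", "xgboost"]


def installed_versions(module_names):
    """Map each module name to its __version__, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Usually ImportError; GPU libraries can also fail differently
            # when no compatible GPU or driver is present.
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in installed_versions(RAPIDS_MODULES).items():
    print(f"{name:10s} {version or 'not installed'}")
```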

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib

In [20]:
import torch
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, silhouette_score
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE  # CPU t-SNE, used by Ensemble.visualize_TSNE
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

from cuml.manifold import TSNE as cuTSNE
import cupy as cp
from cuml.cluster import DBSCAN as cuDBSCAN


import warnings
warnings.filterwarnings('ignore')

# print("Is CUDA available?", torch.cuda.is_available())
# print("Device name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")

In [18]:
class Ensemble:
    def __init__(self):
        self.__df = None     # data on CPU
        self.__tensor = None # data on GPU
        self.__labels = None
        self.__centroids = None
        self.____PCA_components = 2  # four underscores, matching the name the methods below read and write
        self.X_tsne = None
        self.X_pca = None

    def __batched_silhouette_score(self, data, labels, batch_size=5000):
        n_samples = data.shape[0]
        n_batches = (n_samples + batch_size - 1) // batch_size
        scores = []

        for i in range(n_batches):
            start = i * batch_size
            end = min((i + 1) * batch_size, n_samples)
            data_batch = data[start:end]
            labels_batch = labels[start:end]

            # Only compute if at least 2 unique labels in batch
            if len(np.unique(labels_batch)) > 1:
                try:
                    score = silhouette_score(data_batch, labels_batch)
                    scores.append(score)
                except ValueError:
                    # silhouette_score rejects batches with an invalid number of labels
                    continue

        if scores:
            return np.mean(scores)
        else:
            return None

    def __tensorfy_data(self, X):
        # Convert to a float32 PyTorch tensor and move it to the GPU
        data = torch.tensor(X, dtype=torch.float32).cuda()

        self.__tensor = data
        return data

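    def __scale_data(self):
        scaler = StandardScaler()
        # Keep a DataFrame (not a bare NumPy array) so head() and column assignment still work
        scaled = scaler.fit_transform(self.__df)
        self.__df = pd.DataFrame(scaled, columns=self.__df.columns, index=self.__df.index)
        return self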

    def __drop_features(self, features):
        self.__df = self.__df.drop(columns=features)
        return self

    def get_data(self, count=5):
        if count == "*":
            return self.__df

        return self.__df.head(count)

    def get_labels(self):
        return self.__labels

    def get_centroids(self):
        return self.__centroids

    def get_components_count(self):
        return self.____PCA_components

    def load_data(self, filepath):
        df = pd.read_csv(filepath)
        self.__df = df

    def append_lables(self, title="clusters"):
        labels = self.__labels
        # K-means labels are a GPU tensor; DBSCAN labels are already a NumPy array
        if torch.is_tensor(labels):
            labels = labels.cpu().numpy()
        self.__df[title] = labels

    def export_to_excel(self, filepath):
        # Despite the method name, this writes a CSV file (which Excel can open) to filepath
        self.__df.to_csv(filepath, index=False)

    def initial_PCA(self, threshold=0.95):
        # Fit PCA without reducing dimensionality yet
        pca = PCA()

        X = self.__drop_features(["time"]).__scale_data().get_data(count="*")
        print("Time is dropped and the rest of the data is scaled: \n", X)
        pca.fit(X)

        # Cumulative explained variance
        cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

        # Store on the instance the smallest component count whose cumulative variance reaches the threshold
        self.____PCA_components = int(np.argmax(cumulative_variance >= threshold) + 1)

        # Plot
        plt.figure(figsize=(8, 5))
        plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
        plt.axhline(y=0.95, color='r', linestyle='--', label='95% Variance')
        plt.axhline(y=0.99, color='g', linestyle='--', label='99% Variance')
        plt.title('Cumulative Explained Variance by PCA Components')
        plt.xlabel('Number of Principal Components')
        plt.ylabel('Cumulative Explained Variance')
        plt.grid(True)
        plt.legend()
        plt.tight_layout()
        plt.show()

    def visualize_PCA(self, title=""):
        if self.X_pca is None:
            pca = PCA(n_components=self.____PCA_components)
            X_pca = pca.fit_transform(self.__df)
            self.X_pca = X_pca
        else:
            X_pca = self.X_pca

        plt.figure(figsize=(8,6))
        plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.7, edgecolors='k')
        plt.xlabel('Principal Component 1')
        plt.ylabel('Principal Component 2')
        plt.title(title)
        plt.grid(True)
        plt.show()

    def visualize_TSNE(self, title=""):
        if self.X_tsne is None:
            tsne = TSNE(
                n_components=2,
                perplexity=30,
                metric="euclidean",
                n_jobs=-1,           # multicore speed
                random_state=42,
                verbose=True
            )
            X_tsne = tsne.fit_transform(self.__df)
            self.X_tsne = X_tsne
        else:
            X_tsne = self.X_tsne

        # Plot
        plt.figure(figsize=(8,6))
        plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=10, alpha=0.7)
        plt.title(title)
        plt.xlabel("Dim 1")
        plt.ylabel("Dim 2")
        plt.grid(True)
        plt.show()

    def dbscan(self, eps, min_samples=10):
        self.eps = eps  # keep the neighbourhood radius used for this run
        db = DBSCAN(eps=eps, min_samples=min_samples)
        self.__labels = db.fit_predict(self.X_tsne)

    def kmeans_torch(self, num_clusters=5, num_iters=100):
        X = self.__tensorfy_data(self.X_pca)

        N, D = X.shape
        # Initialize centroids randomly from the dataset
        centroids = X[torch.randperm(N)[:num_clusters]]

        for _ in range(num_iters):
            # Compute distances and assign clusters
            distances = torch.cdist(X, centroids)
            labels = torch.argmin(distances, dim=1)

            # Update centroids
            for k in range(num_clusters):
                mask = labels == k
                if mask.sum() == 0:
                    continue  # Avoid empty cluster
                centroids[k] = X[mask].mean(dim=0)

        self.__labels = labels
        self.__centroids = centroids

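    def evaluate(self, model_type, batch_size=5000):
        if model_type == "kmeans":
            # K-means ran on the GPU, so copy its data and labels back to the CPU
            data_cpu = self.__tensor.detach().cpu().numpy()    # shape: (N, D)
            labels_cpu = self.__labels.detach().cpu().numpy()  # shape: (N,)
        else:
            # DBSCAN ran on the t-SNE embedding and returned NumPy labels
            data_cpu = np.asarray(self.X_tsne)
            labels_cpu = np.asarray(self.__labels)

        sil_score = self.__batched_silhouette_score(data_cpu, labels_cpu, batch_size=batch_size)
        if sil_score is None:
            print("Silhouette Score: undefined (no batch could be scored)")
        else:
            print(f"Silhouette Score: {sil_score:.3f}")

        if model_type == "kmeans":
            self.visualize_PCA("K-Means Clusters (PCA projection)")
        else:
            n_clusters = len(set(labels_cpu.tolist())) - (1 if -1 in labels_cpu else 0)
            n_noise = int(np.sum(labels_cpu == -1))
            eps = getattr(self, "eps", None)  # DBSCAN radius, if it was recorded
            eps_note = f" (eps={eps})" if eps is not None else ""
            title = f"DBSCAN Clustering{eps_note}\nClusters: {n_clusters}, Noise: {n_noise}"
            self.visualize_TSNE(title)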
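
The assignment and update steps inside `kmeans_torch` are easier to follow in isolation. The sketch below is a CPU-only, plain-Python version of the same loop on a toy dataset; it is not the notebook's PyTorch code. The function name `kmeans`, the `seed` argument and the sample points are illustrative assumptions, and `math.dist` stands in for `torch.cdist`.

```python
import math
import random


def kmeans(points, num_clusters, num_iters=100, seed=0):
    """Toy K-means: returns (labels, centroids) for a list of coordinate tuples."""
    points = [tuple(p) for p in points]
    rng = random.Random(seed)
    # Initialise centroids with distinct points drawn from the dataset.
    centroids = rng.sample(points, num_clusters)
    labels = [0] * len(points)

    for _ in range(num_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(num_clusters), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(num_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if not members:
                continue  # keep the old centroid for an empty cluster
            centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))

    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, num_clusters=2, num_iters=20)
print(labels)
print(centroids)
```

As in `kmeans_torch`, an empty cluster keeps its previous centroid and the loop runs for a fixed number of iterations with no convergence check. Which cluster receives label 0 depends on the random initialisation.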
