# Environment Setup and Imports

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.

In [1]:
!nvidia-smi
!apt-get install python-cffi

Sun Mar  5 23:13:34 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P0    29W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
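
The same GPU check can be done programmatically instead of reading the `nvidia-smi` table by eye. This is a minimal sketch, assuming the supported models are the ones listed above (T4, P4, P100); the helper names `is_supported_gpu` and `query_gpu_names` are illustrative and not part of any library.

```python
import re
import shutil
import subprocess

# GPU models named in the setup instructions above.
SUPPORTED_PATTERN = re.compile(r"\b(T4|P4|P100)\b")


def is_supported_gpu(name):
    """Return True if the GPU name contains one of the supported model names."""
    return SUPPORTED_PATTERN.search(name) is not None


def query_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


for gpu in query_gpu_names():
    status = "supported" if is_supported_gpu(gpu) else "not in the supported list"
    print(f"{gpu}: {status}")
```

On a machine without an NVIDIA driver the loop prints nothing, since `query_gpu_names` returns an empty list.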

## Setup for RAPIDS (GPU acceleration of UMAP and HDBSCAN for BERTopic)
The setup script:
1. Updates gcc in Colab
1. Installs Conda
1. Installs the current stable version of the RAPIDS libraries, as well as some external libraries, including:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuSignal
  1. BlazingSQL
  1. xgboost
1. Copies the RAPIDS .so files into the current working directory, a necessary workaround for RAPIDS+Colab integration.


In [2]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 333, done.
remote: Counting objects: 100% (162/162), done.
remote: Compressing objects: 100% (107/107), done.
remote: Total 333 (delta 95), reused 98 (delta 55), pack-reused 171
Receiving objects: 100% (333/333), 95.95 KiB | 2.67 MiB/s, done.
Resolving deltas: 100% (157/157), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 KB 2.0 MB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0
Traceback (most recent call last):
  File "rapidsai-csp-utils/colab/env-check.py", line 26, in <module>
    gpu_name = pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(0)).decode('UTF-8')
AttributeError: 'str' object has no attribute 'decode'
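
The traceback above shows `env-check.py` calling `.decode('UTF-8')` on a value that is already a `str`, which suggests the installed pynvml (11.5.0 here) returns the device name as text. Below is a minimal sketch of a tolerant conversion, assuming only that the name may arrive as either `bytes` or `str`. `to_text` is a hypothetical helper, not part of pynvml or rapidsai-csp-utils.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding only when it is actually bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


# Both forms of the device name end up as the same string.
print(to_text(b"Tesla T4"))
print(to_text("Tesla T4"))

# With pynvml initialised, the call from the traceback could be wrapped as:
#   gpu_name = to_text(pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(0)))
```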


In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(0)

Updating your Colab environment.  This will restart your kernel.  Don't Panic!
Get:1 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease [3,622 B]
Hit:2 http://archive.ubuntu.com/ubuntu focal InRelease
Get:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
Hit:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64  Release
Get:7 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Hit:8 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu focal InRelease
Get:9 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Hit:10 http://ppa.launchpad.net/cran/libgit2/ubuntu focal InRelease
Hit:11 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease
Hit:12 http://ppa.launchpad.net/graphics-drivers/p

In [1]:
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:13
🔁 Restarting kernel...


In [1]:
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

✨🍰✨ Everything looks OK!


In [2]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'
!pip uninstall cupy -y

Found existing installation: cffi 1.15.1
Uninstalling cffi-1.15.1:
  Successfully uninstalled cffi-1.15.1
Found existing installation: cryptography 38.0.4
Uninstalling cryptography-38.0.4:
  Successfully uninstalled cryptography-38.0.4
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cffi==1.15.0
  Downloading cffi-1.15.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (446 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 446.7/446.7 kB 8.9 MB/s eta 0:00:00
Installing collected packages: cffi
Successfully installed cffi-1.15.0
Installing RAPIDS Stable 22.12
Starting the RAPIDS install on Colab.  This will take about 15 minutes.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): ...working... done
done

## Package Plan ##

  environment location: /usr/loca

### Imports

In [1]:
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

In [2]:
!pip install bertopic
!sudo apt-get install unzip
!pip install gdown
!pip install --upgrade gdown

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
unzip is already the newest version (6.0-25ubuntu1.1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [3]:

!sudo pip install cffi==1.15.1
# Might need to restart the kernel here

Reading package lists... Done
Building dependency tree       
Reading state information... Done
python-cffi is already the newest version (1.14.0-1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [4]:
import torch
import csv
import gdown
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

from sklearn.decomposition import PCA
import numpy as np

In [None]:
#device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [5]:
torch.cuda.is_available()

True

In [None]:
### You can skip the next 2 lines if data is already in

### Data Import

The data is downloaded from a Google Drive folder.

In [6]:
#!gdown -v --fuzzy https://drive.google.com/file/d/1Xtc9eTSyjQEaGoRgU9VGlz6YB4SvPoMc/view?usp=sharing
#!gdown --fuzzy https://drive.google.com/file/d/1KvbaoIV18t7bMboOAJfkgyrf5ENGiSWH/view?usp=share_link
!gdown --fuzzy https://drive.google.com/file/d/1RmIhftc9tRw4MMC_f_oSdhPsVb1NoqpZ/view?usp=share_link # Current working csv of our data (output_complex.csv.zip)

Downloading...
From: https://drive.google.com/uc?id=1RmIhftc9tRw4MMC_f_oSdhPsVb1NoqpZ
To: /content/output_complex.csv.zip
100% 784M/784M [00:10<00:00, 71.6MB/s]


In [20]:
!unzip /content/output_complex.csv.zip
#!unzip ~/project/CISC499/output_simple.csv.zip
#!unzip ~/project/CISC499/outputfile.csv.zip

Archive:  /content/output_complex.csv.zip
  inflating: output_complex.csv      


In [21]:
!rm -frd /content/output_complex.csv.zip

In [22]:
data = []  # If the system keeps running out of memory, we might have to change this list of strings into a NumPy array

In [23]:
with open(r'/content/output_complex.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        data.append(row[0])

In [24]:
data[0] 

'text'

In [25]:

data[8]

'a lot of big website use cloaking too   i have heard amazon com uses cloaking '

In [26]:

type(data[0]) # This should be string not list

str
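
As `data[0]` shows, the CSV header (`'text'`) is read into the list along with the documents, which is why later cells slice with `data[1:]`. The sketch below shows one way to skip the header while reading. The helper name `read_first_column` and the in-memory sample are illustrative stand-ins for the real file.

```python
import csv
import io


def read_first_column(f, skip_header=True):
    """Read the first column of a CSV file object, optionally dropping the header row."""
    reader = csv.reader(f)
    if skip_header:
        next(reader, None)  # consume the header; the default avoids an error on empty input
    return [row[0] for row in reader if row]


# Small in-memory stand-in for output_complex.csv.
sample_csv = io.StringIO("text\nfirst document\nsecond document\n")
docs = read_first_column(sample_csv)
print(docs)  # ['first document', 'second document']
```

With the real file, the same function could be called inside `with open(path, newline='') as f:`, and the result would not need the `[1:]` slice.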

# Optimizing and running the model

In [27]:
### Precompute Embeddings - Optimization #1

In [None]:
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(data[1:], show_progress_bar=True)
# Takes about 45 minutes?

Batches:   0%|          | 0/311751 [00:00<?, ?it/s]
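
Because encoding is slow, it can help to cache the embeddings on disk so a restarted session does not have to recompute them. The following is a minimal sketch, assuming the embeddings fit in a single NumPy array. `load_or_compute`, `fake_encode`, and the cache path are hypothetical names, and `fake_encode` only stands in for `sentence_model.encode(...)`.

```python
import os
import tempfile

import numpy as np


def load_or_compute(path, compute_fn):
    """Load embeddings from `path` if present; otherwise compute and save them."""
    if os.path.exists(path):
        return np.load(path)
    embeddings = np.asarray(compute_fn())
    np.save(path, embeddings)
    return embeddings


def fake_encode():
    """Stand-in for sentence_model.encode(...); all-MiniLM-L6-v2 yields 384-dim vectors."""
    return np.random.default_rng(0).normal(size=(4, 384)).astype(np.float32)


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
first = load_or_compute(cache_path, fake_encode)   # computed and written to disk
second = load_or_compute(cache_path, fake_encode)  # read back from the cache
print(first.shape, np.array_equal(first, second))
```

In the notebook, `compute_fn` could be something like `lambda: sentence_model.encode(data[1:], show_progress_bar=True)`, with the cache file stored somewhere that survives a runtime restart.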

In [28]:
### Speed up UMAP - Optimization #2

In [29]:
def rescale(x, inplace=False):
    """ Rescale an embedding so optimization will not have convergence issues.
    """
    if not inplace:
        x = np.array(x, copy=True)

    x /= np.std(x[:, 0]) * 10000

    return x
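
As a quick sanity check of `rescale`, the sketch below repeats the function from the cell above and applies it to random stand-in data (not the real PCA output). After rescaling, the standard deviation of the first column should be about 1e-4, and the input array is left untouched when `inplace=False`.

```python
import numpy as np


def rescale(x, inplace=False):
    """Rescale an embedding so optimization will not have convergence issues."""
    if not inplace:
        x = np.array(x, copy=True)  # work on a copy so the caller's array is unchanged
    x /= np.std(x[:, 0]) * 10000    # first column ends up with a std of 1e-4
    return x


rng = np.random.default_rng(0)
fake_pca = rng.normal(size=(1000, 5))  # illustrative stand-in for PCA output
original_std = np.std(fake_pca[:, 0])

scaled = rescale(fake_pca)
print(np.std(scaled[:, 0]))  # approximately 1e-4
```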

In [None]:
### Not sure if this is needed 
#from cuml.preprocessing import normalize

#embeddings = normalize(embeddings)


In [None]:
# Initialize and rescale PCA embeddings
pca_embeddings = rescale(PCA(n_components=5).fit_transform(embeddings))

In [None]:
# Start UMAP from PCA embeddings
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    init=pca_embeddings,
    low_memory=True
)

In [None]:
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

### Putting it all together

In [None]:
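# Train on both the documents and the precomputed embeddings (using the GPU isn't necessarily "correct" on a JupyterLab instance)

# Note: per the BERTopic docs, min_topic_size is not used when a custom hdbscan_model is passed.
topic_model_gpu = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True,
    min_topic_size=400,
    n_gram_range=(1, 2),
)
topics, probs = topic_model_gpu.fit_transform(data[1:], embeddings)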

In [None]:
topic_model_gpu.get_topic_freq().head()

In [None]:
topic_model_gpu.visualize_topics()

In [None]:
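# `sample` was never defined in this notebook; set it to True if the model was trained on a sample of the data (assumed meaning).
sample = False
topic_model_gpu.save("sample_model" if sample else "complex_model")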

In [None]:
np.__version__

In [None]:
!pip show bertopic | grep Version

In [None]:
!pip show tensorflow | grep Version

In [None]:
!python --version 

In [None]:
!pip show torch | grep Version

In [None]:
import cudf  # not imported earlier in this notebook
cudf.__version__
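
The version checks above can also be gathered in one place. This is a small sketch using the standard library's `importlib.metadata`. The helper name `report_versions` and the package list are illustrative, and packages that are not installed are reported as `None` rather than raising.

```python
from importlib import metadata


def report_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages checked individually in the cells above.
for name, version in report_versions(
    ["numpy", "bertopic", "tensorflow", "torch", "cudf"]
).items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```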