<a href="https://colab.research.google.com/github/hcslomeu/MyWork/blob/master/rapids_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.

In [1]:
!nvidia-smi

Wed Jul 15 02:22:50 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
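
The manual check above can also be scripted. The following is a minimal sketch, not part of the original notebook: the helper names `is_supported_gpu` and `detect_gpu_name` are hypothetical, and the model list simply mirrors the three GPUs named above. It relies on `nvidia-smi --query-gpu=name --format=csv,noheader`, which prints one GPU name per line.

```python
import shutil
import subprocess

# GPU models named in the text above (illustrative assumption, not an official list).
SUPPORTED_GPUS = ("T4", "P4", "P100")


def is_supported_gpu(name):
    """Return True if the GPU name contains one of the supported model tokens."""
    # Split names such as "Tesla P100-PCIE-16GB" into separate tokens.
    tokens = name.replace("-", " ").split()
    return any(model in tokens for model in SUPPORTED_GPUS)


def detect_gpu_name():
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None
```

In a Colab cell, `is_supported_gpu(detect_gpu_name() or "")` should evaluate to `True` for the Tesla P4 shown in the output above.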

# Setup #
The setup script:
1. Installs the most recent Miniconda release compatible with Google Colab's Python install (3.6.7)
1. Removes incompatible files
1. Installs the RAPIDS libraries
1. Sets the necessary environment variables
1. Copies the RAPIDS .so files into the current working directory, a workaround for conda/Colab interactions
1. Updates the pyarrow library to 0.15.x if running v0.11 or higher

In [2]:
# Install RAPIDS
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh stable

import sys, os

dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
sys.path
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 165, done.
remote: Counting objects: 100% (165/165), done.
remote: Compressing objects: 100% (160/160), done.
remote: Total 165 (delta 60), reused 20 (delta 4), pack-reused 0
Receiving objects: 100% (165/165), 48.48 KiB | 584.00 KiB/s, done.
Resolving deltas: 100% (60/60), done.
PLEASE READ
********************************************************************************************************
Changes:
1. Default stable version is now 0.14.  Nightly is now 0.15.  We have fixed the long conda install.  Hooray!
2. You can now declare your RAPIDSAI version as a CLI option and skip the user prompts (ex: '0.14' or '0.15', between 0.13 to 0.15, without the quotes): 
        "!bash rapidsai-csp-utils/colab/rapids-colab.sh <version/label>"
        Examples: '!bash rapidsai-csp-utils/colab/rapids-colab.sh 0.14', or '!bash rapidsai-csp-utils/colab/rapids-colab.sh stable', or '!bash rapidsai-csp-utils/colab/rapids-colab
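
The `sys.path` lines in the install cell splice the `site-packages` directory in just ahead of `dist-packages`, so modules found there take precedence on import. Below is a minimal sketch of that splice on a plain list. The helper name `insert_before` is hypothetical, and the example search path is shortened for illustration.

```python
def insert_before(paths, anchor, new_entry):
    """Return a copy of paths with new_entry placed directly before anchor."""
    idx = paths.index(anchor)  # raises ValueError if anchor is not present
    return paths[:idx] + [new_entry] + paths[idx:]


# Shortened, illustrative search path (not the real Colab sys.path).
search_path = ["", "/usr/lib/python3.6", "/usr/local/lib/python3.6/dist-packages"]
patched = insert_before(
    search_path,
    "/usr/local/lib/python3.6/dist-packages",
    "/usr/local/lib/python3.6/site-packages",
)
print(patched)
```

Because the helper returns a new list, the original `search_path` is left untouched; the notebook cell instead assigns the result back to `sys.path`.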

# cuDF and cuML Examples #

Now you can run code! 

What follows are basic examples where all processing takes place on the GPU.

# [cuDF](https://github.com/rapidsai/cudf) #

Load a dataset into a GPU-memory-resident DataFrame and perform a basic calculation.

Everything from CSV parsing to calculating tip percentage and computing a grouped average is done on the GPU.

_Note_: Earlier RAPIDS releases required importing nvstrings and nvcategory before cudf to avoid errors; the cell below imports cudf directly.

In [3]:
import cudf
import io, requests

# download CSV file from GitHub
url="https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')

# read CSV from memory
tips_df = cudf.read_csv(io.StringIO(content))
tips_df['tip_percentage'] = tips_df['tip']/tips_df['total_bill']*100

# display average tip by dining party size
print(tips_df.groupby('size').tip_percentage.mean())

size
1    21.729202
2    16.571919
3    15.215685
4    14.594901
5    14.149549
6    15.622920
Name: tip_percentage, dtype: float64
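
To make the calculation concrete, here is a CPU-only sketch of the same two steps (tip percentage, then the mean per party size) in plain Python. The three rows are made-up illustrative values, not rows from `tips.csv`, and `mean_tip_percentage_by_size` is a hypothetical helper. cuDF performs the equivalent work on the GPU.

```python
from collections import defaultdict


def mean_tip_percentage_by_size(rows):
    """rows: iterable of (total_bill, tip, size). Returns {size: mean tip %}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for total_bill, tip, size in rows:
        sums[size] += tip / total_bill * 100  # tip as a percentage of the bill
        counts[size] += 1
    return {size: sums[size] / counts[size] for size in sorted(sums)}


# Made-up sample rows: (total_bill, tip, size)
sample_rows = [(8.0, 2.0, 2), (16.0, 2.0, 2), (32.0, 4.0, 4)]
result = mean_tip_percentage_by_size(sample_rows)
print(result)
```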


# [cuML](https://github.com/rapidsai/cuml) #

This snippet loads a small set of points into a GPU DataFrame and clusters them with DBSCAN.

As above, all calculations are performed on the GPU.

In [4]:
import cuml

# Create and populate a GPU DataFrame
df_float = cudf.DataFrame()
df_float['0'] = [1.0, 2.0, 5.0]
df_float['1'] = [4.0, 2.0, 1.0]
df_float['2'] = [4.0, 2.0, 1.0]

# Setup and fit clusters
dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(df_float)

print(dbscan_float.labels_)

0    0
1    1
2    2
dtype: int32
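
Each point ends up in its own cluster because no two points lie within `eps=1.0` of each other, and with `min_samples=1` every point qualifies as a core point on its own (assuming cuML follows the usual DBSCAN convention of counting the point itself). A quick CPU-side check of the pairwise Euclidean distances, in plain Python:

```python
import math
from itertools import combinations

# Rows of df_float: column '0', column '1', column '2'
points = [(1.0, 4.0, 4.0), (2.0, 2.0, 2.0), (5.0, 1.0, 1.0)]
eps = 1.0


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Distance for every unordered pair of row indices.
distances = {
    (i, j): euclidean(points[i], points[j])
    for i, j in combinations(range(len(points)), 2)
}
for pair, dist in distances.items():
    print(pair, round(dist, 3), "within eps" if dist <= eps else "farther than eps")
```

Every pair is farther apart than `eps`, which is consistent with the three distinct labels printed above.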


In [6]:
# Import required libraries
import cugraph
import cudf

In [9]:
df_chat = cudf.read_csv("https://raw.githubusercontent.com/hcslomeu/MyWork/master/dataset.csv")
df_chat.head(5)


Unnamed: 0,Num,Inviter,Invitee,MsgCount
0,1,P1,P63,180
1,2,P2,P64,135
2,3,P3,P65,88
3,4,P4,P66,82
4,5,P5,P67,67


In [16]:
# Remove the leading 'P' from the Inviter and Invitee node labels
df_chat['src_node'] = (df_chat['Inviter'].str.split('P'))[1]
df_chat['des_node'] = (df_chat['Invitee'].str.split('P'))[1]
df_chat.head()



Unnamed: 0,Num,Inviter,Invitee,MsgCount,src_node,des_node
0,1,P1,P63,180,1,63
1,2,P2,P64,135,2,64
2,3,P3,P65,88,3,65
3,4,P4,P66,82,4,66
4,5,P5,P67,67,5,67


In [17]:
# Cast the node ids to int32 and the message counts to float64
df_chat["src_node"] = df_chat["src_node"].astype("int32")
df_chat["des_node"] = df_chat["des_node"].astype("int32")
df_chat["MsgCount"] = df_chat["MsgCount"].astype("float64")

In [18]:
# Louvain algorithm in CUGRAPH requires nodes starting at zero
df_chat["src_0"] = df_chat["src_node"] - 1
df_chat["dst_0"] = df_chat["des_node"] - 1
df_chat.head()

Unnamed: 0,Num,Inviter,Invitee,MsgCount,src_node,des_node,src_0,dst_0
0,1,P1,P63,180.0,1,63,0,62
1,2,P2,P64,135.0,2,64,1,63
2,3,P3,P65,88.0,3,65,2,64
3,4,P4,P66,82.0,4,66,3,65
4,5,P5,P67,67.0,5,67,4,66
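
The preprocessing steps above (strip the `P` prefix, cast to integer, shift to zero-based ids) amount to the following per-label transformation. This is a plain-Python sketch with a hypothetical helper name, `to_zero_based`, and it assumes every label has the form `P<positive integer>`, as in the rows shown.

```python
def to_zero_based(label, prefix="P"):
    """Convert a label such as 'P63' to the zero-based integer id 62."""
    if not label.startswith(prefix):
        raise ValueError("unexpected label: {!r}".format(label))
    return int(label[len(prefix):]) - 1


labels = ["P1", "P63", "P5", "P67"]
print([to_zero_based(label) for label in labels])
```

Validating the prefix explicitly makes malformed labels fail loudly instead of silently producing wrong vertex ids.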


In [22]:
# create a Graph 
G = cugraph.Graph()
type(G)

cugraph.structure.graph.Graph

In [24]:
# add_edge_list is deprecated; the warning in the output recommends from_cudf_edgelist
G.add_edge_list(df_chat["src_0"], df_chat["dst_0"], df_chat["MsgCount"])
G.number_of_nodes()

  Use from_cudf_edgelist instead')


130

In [25]:
# Run Louvain on the graph
df_chat_partition, mod = cugraph.louvain(G)
# Print the modularity score
print('Modularity was {}'.format(mod))
df_chat_partition.head()

Modularity was 0.9549491048403094


Unnamed: 0,vertex,partition
0,0,16
1,1,4
2,2,5
3,3,6
4,4,7


In [26]:
# How many partitions were found
part_ids = df_chat_partition["partition"].unique()
print(str(len(part_ids)) + " partition detected")

# Explore the members of each community
for p in range(len(part_ids)):
    part = []
    for i in range(len(df_chat_partition)):
        if (df_chat_partition['partition'][i] == p):
            part.append(df_chat_partition['vertex'][i] +1)
    print("Partition " + str(p) + ":")
    print(part)

46 partition detected
Partition 0:
[20, 27, 91, 101]
Partition 1:
[6, 35, 98, 118]
Partition 2:
[11, 39, 55, 73]
Partition 3:
[22, 31, 53, 96]
Partition 4:
[2, 64, 83]
Partition 5:
[3, 65, 71]
Partition 6:
[4, 66, 85]
Partition 7:
[5, 67]
Partition 8:
[7, 68, 108]
Partition 9:
[8, 69]
Partition 10:
[9, 70]
Partition 11:
[10, 72]
Partition 12:
[13, 74]
Partition 13:
[14, 75]
Partition 14:
[15, 76, 81]
Partition 15:
[17, 77]
Partition 16:
[1, 12, 16, 36, 44, 63, 78, 88, 92, 103, 129]
Partition 17:
[18, 79, 128]
Partition 18:
[21, 60, 82]
Partition 19:
[19, 80, 84]
Partition 20:
[23, 86]
Partition 21:
[24, 87]
Partition 22:
[26, 90]
Partition 23:
[28, 93]
Partition 24:
[32, 97]
Partition 25:
[33, 99]
Partition 26:
[25, 34, 52, 89, 100, 102, 120]
Partition 27:
[37, 104]
Partition 28:
[38, 105]
Partition 29:
[40, 46, 106, 109]
Partition 30:
[41, 107]
Partition 31:
[42, 111]
Partition 32:
[43, 112]
Partition 33:
[45, 113]
Partition 34:
[47, 114]
Partition 35:
[48, 115]
Partition 36:
[49, 116
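
The nested loop above rescans every row once per partition id and assumes the ids run from 0 to `len(part_ids) - 1`. A single pass that groups vertices by partition id avoids both. The sketch below is plain Python on host-side lists (for a cuDF DataFrame, the two columns could first be copied to the host, for example via `to_pandas()`). `group_by_partition` is a hypothetical helper, and the sample values are the five rows of `df_chat_partition.head()` shown earlier.

```python
from collections import defaultdict


def group_by_partition(vertices, partitions, id_offset=1):
    """Map each partition id to its member vertices, shifted back to 1-based ids."""
    groups = defaultdict(list)
    for vertex, partition in zip(vertices, partitions):
        groups[partition].append(vertex + id_offset)
    return dict(sorted(groups.items()))


# The five rows from df_chat_partition.head()
vertices = [0, 1, 2, 3, 4]
partitions = [16, 4, 5, 6, 7]
for partition, members in group_by_partition(vertices, partitions).items():
    print("Partition {}: {}".format(partition, members))
```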

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib