# **How RayDP works together with PyTorch**

RayDP is a distributed data processing library that provides simple APIs for running Spark on Ray and for integrating Spark with distributed deep learning and machine learning frameworks. This tutorial builds an end-to-end deep learning pipeline on a single Ray cluster: Spark handles the data preprocessing, and a distributed estimator built on the RayDP API handles training and evaluation.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oap-project/raydp/blob/master/tutorials/pytorch_example.ipynb)

## 1. Colab environment setup

RayDP requires Ray and PySpark. PyTorch is used to build the deep learning model.

In [74]:
! pip install ray==1.9
# install RayDP nightly build
! pip install raydp-nightly
# or use the released version
# ! pip install raydp
! pip install ray[tune]
! pip install torch==1.8.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


## 2. Get the data file

The dataset comes from https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset, and a copy of the file is stored in the RayDP GitHub repository. It is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides the relevant information about one patient.

In [None]:
! wget https://raw.githubusercontent.com/oap-project/raydp/master/tutorials/dataset/healthcare-dataset-stroke-data.csv -O healthcare-dataset-stroke-data.csv

## 3. Init or connect to a Ray cluster

In [76]:
import ray

ray.init(num_cpus=6)

{'metrics_export_port': 59076,
 'node_id': '4a3abd8dce03a2b3e9e76b935b01c20eb501d3d5135929e1eadeccae',
 'node_ip_address': '172.28.0.2',
 'object_store_address': '/tmp/ray/session_2022-05-20_08-04-44_153620_58/sockets/plasma_store',
 'raylet_ip_address': '172.28.0.2',
 'raylet_socket_name': '/tmp/ray/session_2022-05-20_08-04-44_153620_58/sockets/raylet',
 'redis_address': '172.28.0.2:6379',
 'session_dir': '/tmp/ray/session_2022-05-20_08-04-44_153620_58',
 'webui_url': None}

## 4. Get a Spark session

In [77]:
import raydp

app_name = "Stroke Prediction with RayDP"
num_executors = 1
cores_per_executor = 1
memory_per_executor = "500M"
spark = raydp.init_spark(app_name, num_executors, cores_per_executor, memory_per_executor)





[2m[36m(RayDPSparkMaster pid=4514)[0m 2022-05-20 08:04:52,275 WARN NativeCodeLoader [Thread-2]: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[2m[36m(RayDPSparkMaster pid=4514)[0m 2022-05-20 08:04:52,602 INFO SecurityManager [Thread-2]: Changing view acls to: root
[2m[36m(RayDPSparkMaster pid=4514)[0m 2022-05-20 08:04:52,604 INFO SecurityManager [Thread-2]: Changing modify acls to: root
[2m[36m(RayDPSparkMaster pid=4514)[0m 2022-05-20 08:04:52,605 INFO SecurityManager [Thread-2]: Changing view acls groups to: 
[2m[36m(RayDPSparkMaster pid=4514)[0m 2022-05-20 08:04:52,606 INFO SecurityManager [Thread-2]: Changing modify acls groups to: 
[2m[36m(RayDPSparkMaster pid=4514)[0m 2022-05-20 08:04:52,607 INFO SecurityManager [Thread-2]: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups

## 5. Read the .csv file with the 'spark' session created by **raydp**

In [78]:
data = spark.read.format("csv").option("header", "true") \
        .option("inferSchema", "true") \
        .load("/content/healthcare-dataset-stroke-data.csv")



## 6. Define the data_process function

The dataset is loaded as a `pyspark.sql.dataframe.DataFrame`. Before feeding it into the deep learning model, we can use RayDP to apply some transformations to it.

### 6.1 Data Analysis

Here is part of the data analysis.

In [79]:
# Data overview
data.show(5)
# Summary statistics and missing-value count:
# the 'bmi' column contains 201 'N/A' values,
# which we can replace with the mean of the column
data.describe().show()
data.filter(data.bmi=='N/A').count()
# Observe the distribution of the column 'gender';
# the outlier value 'Other' should then be removed
data.rollup(data.gender).count().show()
# Observe the proportion of positive and negative samples.
data.rollup(data.stroke).count().show()

+-----+------+----+------------+-------------+------------+-------------+--------------+-----------------+----+---------------+------+
|   id|gender| age|hypertension|heart_disease|ever_married|    work_type|Residence_type|avg_glucose_level| bmi| smoking_status|stroke|
+-----+------+----+------------+-------------+------------+-------------+--------------+-----------------+----+---------------+------+
| 9046|  Male|67.0|           0|            1|         Yes|      Private|         Urban|           228.69|36.6|formerly smoked|     1|
|51676|Female|61.0|           0|            0|         Yes|Self-employed|         Rural|           202.21| N/A|   never smoked|     1|
|31112|  Male|80.0|           0|            1|         Yes|      Private|         Rural|           105.92|32.5|   never smoked|     1|
|60182|Female|49.0|           0|            0|         Yes|      Private|         Urban|           171.23|34.4|         smokes|     1|
| 1665|Female|79.0|           1|            0|         
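
The two checks performed in this analysis (counting `'N/A'` markers and counting values per category) can be illustrated without Spark. The following is only a minimal sketch in plain Python: it uses the first four rows printed by `data.show(5)`, and the helper name `count_by` is hypothetical rather than part of RayDP or PySpark.

```python
import csv
import io

# First four rows of the dataset, copied from the `data.show(5)` output.
SAMPLE_CSV = """id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,N/A,never smoked,1
31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
"""


def count_by(rows, column):
    """Count how often each distinct value occurs in `column` (hypothetical helper)."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Analogue of data.filter(data.bmi == 'N/A').count()
missing_bmi = sum(1 for row in rows if row["bmi"] == "N/A")

# Rough analogue of the per-value counts shown by data.rollup(data.gender).count()
gender_counts = count_by(rows, "gender")

print(missing_bmi, gender_counts)
```

On the full dataset the Spark calls do the same work in a distributed fashion; the sample here only makes the logic visible.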

### 6.2 Define operations

Define data processing operations based on data analysis results.

In [80]:
from pyspark.sql.functions import avg, col, udf

In [81]:
# Delete the useless column 'id'
def drop_col(data):
    data = data.drop('id')
    return data

In [82]:
# Replace the 'N/A' values in 'bmi' with the column mean
def replace_nan(data):
    # Mean of the parsable 'bmi' values (with Spark's default settings,
    # 'N/A' casts to null and is skipped by avg)
    bmi_avg = data.agg(avg(col("bmi"))).head()[0]

    @udf("float")
    def fill_bmi(value):
        if value=='N/A':
            return float(bmi_avg)
        else:
            return float(value)

    # Replace the value N/A
    data = data.withColumn('bmi', fill_bmi(col("bmi")))
    return data
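
The imputation rule applied to `bmi` can be sketched in plain Python, assuming the column arrives as strings with `'N/A'` marking missing entries. The function name `impute_mean` and the sample values are illustrative only.

```python
def impute_mean(values, missing="N/A"):
    """Replace `missing` markers with the mean of the remaining values."""
    present = [float(v) for v in values if v != missing]
    mean = sum(present) / len(present)
    return [mean if v == missing else float(v) for v in values]


# 'bmi' values of the first four rows of the dataset
bmi = ["36.6", "N/A", "32.5", "34.4"]
filled = impute_mean(bmi)
print(filled)
```

As in the Spark version, the mean is computed once from the values that are present and then reused for every missing entry.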

In [83]:
# Drop the single 'Other' value in column 'gender'
def clean_value(data):
    data = data.filter(data.gender != 'Other')
    return data

In [84]:
# Transform the category columns
def trans_category(data):
    @udf("int")
    def trans_gender(value):
        gender = {'Female': 0,
                  'Male': 1}
        return int(gender[value])

    @udf("int")
    def trans_ever_married(value):
        ever_married = {'No': 0,
                        'Yes': 1}
        return int(ever_married[value])

    @udf("int")
    def trans_work_type(value):
        work_type = {'children': 0,
                     'Govt_job': 1,
                     'Never_worked': 2,
                     'Private': 3,
                     'Self-employed': 4}
        return int(work_type[value])

    @udf("int")
    def trans_residence_type(value):
        residence_type = {'Rural': 0,
                          'Urban': 1}
        return int(residence_type[value])

    @udf("int")
    def trans_smoking_status(value):
        smoking_status = {'formerly smoked': 0,
                          'never smoked': 1,
                          'smokes': 2,
                          'Unknown': 3}
        return int(smoking_status[value])

    data = data.withColumn('gender', trans_gender(col('gender'))) \
               .withColumn('ever_married', trans_ever_married(col('ever_married'))) \
               .withColumn('work_type', trans_work_type(col('work_type'))) \
               .withColumn('Residence_type', trans_residence_type(col('Residence_type'))) \
               .withColumn('smoking_status', trans_smoking_status(col('smoking_status')))
    return data
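
Each of these UDFs is a dictionary lookup, so a label that is missing from the dictionary (for example `'Other'` in `gender`) raises a `KeyError`. This is why `clean_value` has to run before `trans_category`. A minimal sketch of the lookup with a more descriptive error, using the hypothetical helper `encode`:

```python
GENDER = {"Female": 0, "Male": 1}


def encode(value, mapping):
    """Map a category label to its integer code, failing loudly on unknown labels."""
    try:
        return mapping[value]
    except KeyError:
        raise ValueError(
            f"unexpected category {value!r}; known labels: {sorted(mapping)}"
        ) from None


codes = [encode(v, GENDER) for v in ["Male", "Female", "Female"]]
print(codes)
```

Calling `encode("Other", GENDER)` raises a `ValueError` naming the offending label, which is easier to diagnose than a bare `KeyError` surfacing from inside a Spark task.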

In [85]:
# Add 'age_dis', a discretized version of the column 'age'.
# Ages below 18 and ages of 56 and above both fall into the last bucket (4).
def map_age(data):
    @udf("int")
    def get_value(value):
        if value >= 18 and value < 26:
            return int(0)
        elif value >=26 and value < 36:
            return int(1)
        elif value >=36 and value < 46:
            return int(2)
        elif value >=46 and value < 56:
            return int(3)
        else:
            return int(4)

    data = data.withColumn('age_dis', get_value(col('age')))
    return data
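
The bucketing rule can be checked outside Spark. The sketch below mirrors the thresholds used in `get_value`; the names `EDGES` and `age_bucket` are illustrative and not part of the notebook.

```python
import bisect

# Lower edges of buckets 0..3, followed by the edge where the last bucket starts.
EDGES = [18, 26, 36, 46, 56]


def age_bucket(age):
    """18-25 -> 0, 26-35 -> 1, 36-45 -> 2, 46-55 -> 3, everything else -> 4."""
    idx = bisect.bisect_right(EDGES, age) - 1  # -1 for ages below 18
    return idx if 0 <= idx <= 3 else 4


ages = [10, 18, 30, 45, 55, 56, 80]
buckets = [age_bucket(a) for a in ages]
print(buckets)
```

Note that under this rule children and patients aged 56 or older share bucket 4, which matches the `else` branch of the UDF.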

In [86]:
# Preprocess the data
def data_preprocess(data):
    data = drop_col(data)
    data = replace_nan(data)
    data = clean_value(data)
    data = trans_category(data)
    data = map_age(data)
    return data

## 7. Data processing

In [87]:
import torch
from raydp.utils import random_split

# Transform the dataset
data = data_preprocess(data)
# Split data into train_dataset and test_dataset
train_df, test_df = random_split(data, [0.8, 0.2], 0)
# Balance the samples: oversample the positive class (stroke == 1)
# by appending its rows to the training set twice
train_df_pos = train_df.filter(train_df.stroke == 1)
train_df = train_df.unionByName(train_df_pos)
train_df = train_df.unionByName(train_df_pos)
features = [field.name for field in list(train_df.schema) if field.name != "stroke"]
# Convert spark dataframe into ray Dataset
# Remember to align ``parallelism`` with ``num_workers`` of ray train
train_dataset = ray.data.from_spark(train_df, parallelism = 8)
test_dataset = ray.data.from_spark(test_df, parallelism = 8)
feature_dtype = [torch.float] * len(features)
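
Appending the `stroke == 1` rows twice triples their count in the training set. The effect on the class ratio can be estimated with a few lines of arithmetic. The counts below are hypothetical placeholders, not the actual numbers of this dataset, and `oversampled_counts` is an illustrative helper.

```python
def oversampled_counts(n_neg, n_pos, extra_copies=2):
    """Return (negatives, positives, positive fraction) after appending
    the positive rows `extra_copies` more times."""
    total_pos = n_pos * (1 + extra_copies)
    return n_neg, total_pos, total_pos / (n_neg + total_pos)


# Hypothetical split: 950 negative and 50 positive training rows
neg, pos, frac = oversampled_counts(950, 50)
print(neg, pos, round(frac, 3))
```

With these placeholder counts the positive share rises from 5% to roughly 13.6%, so the classes become less skewed but are still far from evenly balanced.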

## 8. Define a neural network model

In [88]:
import torch.nn as nn
import torch.nn.functional as F

class NET_Model(nn.Module):
    def __init__(self, cols):
        super().__init__()
        self.emb_layer_gender = nn.Embedding(2, 1)           # gender
        self.emb_layer_hypertension = nn.Embedding(2,1)      # hypertension
        self.emb_layer_heart_disease = nn.Embedding(2,1)     # heart_disease
        self.emb_layer_ever_married = nn.Embedding(2, 1)     # ever_married
        self.emb_layer_work = nn.Embedding(5, 1)             # work_type
        self.emb_layer_residence = nn.Embedding(2, 1)        # Residence_type
        self.emb_layer_smoking_status = nn.Embedding(4, 1)   # smoking_status
        self.emb_layer_age = nn.Embedding(5, 1)              # age column after discretization
        self.fc1 = nn.Linear(cols, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 16)
        self.fc5 = nn.Linear(16, 2)
        self.bn1 = nn.BatchNorm1d(256)
        self.bn2 = nn.BatchNorm1d(128)
        self.bn3 = nn.BatchNorm1d(64)
        self.bn4 = nn.BatchNorm1d(16)

    def forward(self, *x):
        x = torch.cat(x, dim=1)
        # pick the dense attribute columns
        dense_columns = x[:, [1,7,8]]
        # Embedding operation on sparse attribute columns
        sparse_col_1 = self.emb_layer_gender(x[:, 0].long())
        sparse_col_2 = self.emb_layer_hypertension(x[:, 2].long())
        sparse_col_3 = self.emb_layer_heart_disease(x[:, 3].long())
        sparse_col_4 = self.emb_layer_ever_married(x[:, 4].long())
        sparse_col_5 = self.emb_layer_work(x[:, 5].long())
        sparse_col_6 = self.emb_layer_residence(x[:, 6].long())
        sparse_col_7 = self.emb_layer_smoking_status(x[:, 9].long())
        sparse_col_8 = self.emb_layer_age(x[:, 10].long())
        # Splice sparse attribute columns and dense attribute columns
        x = torch.cat([dense_columns, sparse_col_1, sparse_col_2, sparse_col_3, sparse_col_4, sparse_col_5, sparse_col_6, sparse_col_7, sparse_col_8], dim=1)

        x = F.relu(self.fc1(x))
        x = self.bn1(x)
        x = F.relu(self.fc2(x))
        x = self.bn2(x)
        x = F.relu(self.fc3(x))
        x = self.bn3(x)
        x = F.relu(self.fc4(x))
        x = self.bn4(x)
        x = self.fc5(x)
        return x
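
The hard-coded column indices in `forward` depend on the order of `features`. The sketch below checks that layout in plain Python, assuming the feature order follows the CSV header with `id` and `stroke` removed and `age_dis` appended at the end. The constant names are illustrative.

```python
# Assumed feature order: CSV header without 'id' and 'stroke', plus 'age_dis'
FEATURES = ["gender", "age", "hypertension", "heart_disease", "ever_married",
            "work_type", "Residence_type", "avg_glucose_level", "bmi",
            "smoking_status", "age_dis"]

DENSE = [1, 7, 8]  # columns passed through unchanged
# column index -> number of categories (first argument of nn.Embedding)
SPARSE = {0: 2, 2: 2, 3: 2, 4: 2, 5: 5, 6: 2, 9: 4, 10: 5}
EMB_DIM = 1        # every embedding in NET_Model has output width 1

dense_names = [FEATURES[i] for i in DENSE]
sparse_names = [FEATURES[i] for i in SPARSE]

# Every input column is used exactly once, either as dense or as sparse
covers_all = sorted(DENSE + list(SPARSE)) == list(range(len(FEATURES)))

# Width of the tensor that reaches fc1 after concatenation
fc1_in = len(DENSE) + EMB_DIM * len(SPARSE)
print(dense_names, fc1_in, covers_all)
```

Because each embedding has width 1, the concatenated tensor has the same width as the raw input, which is why `nn.Linear(cols, 256)` with `cols = len(features)` lines up. Changing an embedding width or the column order would require updating both the indices and the input size of `fc1`.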


## 9. Create model, criterion and optimizer

In [89]:
import torch
import torch.nn as nn

net_model = NET_Model(len(features))
criterion = nn.SmoothL1Loss()
optimizer = torch.optim.Adam(net_model.parameters(), lr=0.001)
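
For reference, `nn.SmoothL1Loss` with its default `beta=1.0` is quadratic for small errors and linear for large ones, averaged over all elements. A dependency-free sketch of that definition follows; the function names are illustrative, and the real implementation operates on tensors.

```python
def smooth_l1(pred, target, beta=1.0):
    """Element-wise Smooth L1: quadratic below `beta`, linear above it."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def smooth_l1_mean(preds, targets, beta=1.0):
    """Mean reduction over paired predictions and targets."""
    losses = [smooth_l1(p, t, beta) for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)


small = smooth_l1(0.5, 0.0)                      # quadratic branch
large = smooth_l1(3.0, 1.0)                      # linear branch
batch = smooth_l1_mean([0.5, 3.0], [0.0, 1.0])   # mean of the two
print(small, large, batch)
```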

## 10. Define the callback that will be executed during training

In [90]:
from ray.train import TrainingCallback
from typing import List, Dict

class PrintingCallback(TrainingCallback):
    def handle_result(self, results: List[Dict], **info):
        print(results)

## 11. Create distributed estimator and train

In [91]:
from raydp.torch import TorchEstimator

estimator = TorchEstimator(num_workers=1, model=net_model, optimizer=optimizer, loss=criterion,
                           feature_columns=features, feature_types=feature_dtype,
                           label_column="stroke", label_type=torch.float,
                           batch_size=64, num_epochs=30, callbacks=[PrintingCallback()])
# Train the model
estimator.fit_on_spark(train_df, test_df)

2022-05-20 08:05:28,725	INFO trainer.py:172 -- Trainer logs will be logged in: /root/ray_results/train_2022-05-20_08-05-28
2022-05-20 08:05:30,430	INFO trainer.py:178 -- Run results will be logged in: /root/ray_results/train_2022-05-20_08-05-28/run_001
[2m[36m(BaseWorkerMixin pid=5164)[0m 2022-05-20 08:05:30,427	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=1]
[2m[36m(BaseWorkerMixin pid=5164)[0m 2022-05-20 08:05:30,701	INFO torch.py:239 -- Moving model to device: cpu
[2m[36m(BaseWorkerMixin pid=5164)[0m   return F.smooth_l1_loss(input, target, reduction=self.reduction, beta=self.beta)


[{'epoch': 0, 'train_acc': 0.0, 'train_loss': 0.09247553693130613, '_timestamp': 1653033932, '_time_this_iter_s': 1.3748996257781982, '_training_iteration': 1}]
[{'epoch': 0, 'evaluate_acc': 0.0, 'test_loss': 0.04560316858046195, '_timestamp': 1653033932, '_time_this_iter_s': 0.09927511215209961, '_training_iteration': 2}]


[2m[36m(BaseWorkerMixin pid=5164)[0m   return F.smooth_l1_loss(input, target, reduction=self.reduction, beta=self.beta)
[2m[36m(BaseWorkerMixin pid=5164)[0m   return F.smooth_l1_loss(input, target, reduction=self.reduction, beta=self.beta)


[{'epoch': 1, 'train_acc': 0.0, 'train_loss': 0.06568372739878084, '_timestamp': 1653033933, '_time_this_iter_s': 1.012242317199707, '_training_iteration': 3}]
[{'epoch': 1, 'evaluate_acc': 0.0, 'test_loss': 0.051923071308171045, '_timestamp': 1653033933, '_time_this_iter_s': 0.10902786254882812, '_training_iteration': 4}]
[{'epoch': 2, 'train_acc': 0.0, 'train_loss': 0.062333274386557086, '_timestamp': 1653033934, '_time_this_iter_s': 0.9877479076385498, '_training_iteration': 5}]
[{'epoch': 2, 'evaluate_acc': 0.0, 'test_loss': 0.042880474863683474, '_timestamp': 1653033934, '_time_this_iter_s': 0.10124373435974121, '_training_iteration': 6}]
[{'epoch': 3, 'train_acc': 0.0, 'train_loss': 0.062327381250049385, '_timestamp': 1653033935, '_time_this_iter_s': 1.005267858505249, '_training_iteration': 7}]
[{'epoch': 3, 'evaluate_acc': 0.0, 'test_loss': 0.04614457756500034, '_timestamp': 1653033935, '_time_this_iter_s': 0.10248780250549316, '_training_iteration': 8}]
[{'epoch': 4, 'train_ac
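
Each record printed by the callback is a list with one dictionary per worker, so with `num_workers=1` it holds a single dictionary. A small sketch of post-processing such records to find the epoch with the lowest test loss follows. The helper `best_epoch` is hypothetical, and the records are the first four evaluation results from the log above, with the timing fields omitted.

```python
def best_epoch(history, key="test_loss"):
    """Return (epoch, value) for the lowest `key` among dicts that contain it."""
    candidates = [(r[key], r["epoch"]) for r in history if key in r]
    value, epoch = min(candidates)
    return epoch, value


# Evaluation records as printed by the callback (one list per call)
printed = [
    [{"epoch": 0, "evaluate_acc": 0.0, "test_loss": 0.04560316858046195}],
    [{"epoch": 1, "evaluate_acc": 0.0, "test_loss": 0.051923071308171045}],
    [{"epoch": 2, "evaluate_acc": 0.0, "test_loss": 0.042880474863683474}],
    [{"epoch": 3, "evaluate_acc": 0.0, "test_loss": 0.04614457756500034}],
]

# Flatten the per-worker lists into one list of result dictionaries
history = [result for batch in printed for result in batch]
epoch, loss = best_epoch(history)
print(epoch, loss)
```

The same function could be called from inside `handle_result` after appending each batch of results to a list held by the callback.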

## 12. Shut down Ray and RayDP

In [92]:
raydp.stop_spark()
ray.shutdown()