# Train a PyTorch model with SparkXShards

Copyright 2016 The BigDL Authors.

SparkXShards in Orca lets users process large-scale datasets with existing Python code in a distributed, data-parallel fashion. This notebook is an example of how to train a PyTorch model on data held in SparkXShards with Orca.

It is adapted from [PyTorch Tutorial: How to Develop Deep Learning Models with Python](https://machinelearningmastery.com/pytorch-tutorial-develop-deep-learning-models/).

In [1]:
# import necessary libraries
from torch.nn import Linear
from torch.nn import ReLU
from torch.nn import Sigmoid
from torch.nn import Module
from torch.optim import SGD
from torch.nn import BCELoss
from torch.nn.init import kaiming_uniform_
from torch.nn.init import xavier_uniform_

import bigdl.orca.data.pandas
from bigdl.orca import init_orca_context, stop_orca_context
from bigdl.orca.learn.pytorch import Estimator
from bigdl.orca.learn.metrics import Accuracy
from bigdl.orca.data.transformer import StringIndexer
import os

# os.environ['KMP_DUPLICATE_LIB_OK'] ='True'
%env PYTHONHOME=/Users/guoqiong/opt/anaconda3/envs/py37tf2_x


env: PYTHONHOME=/Users/guoqiong/opt/anaconda3/envs/py37tf2_x




## Load data in parallel and get general information

Load the data into `data_shards`, a SparkXShards object that can be operated on in parallel. Here each element of `data_shards` is a pandas DataFrame read from a file on the cluster. Users can distribute local code such as `pd.read_csv(dataFile)` by calling `bigdl.orca.data.pandas.read_csv(datapath)` instead.
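
To make the shard idea concrete, here is a minimal local sketch in plain Python. It does not use the Orca API: `partitions`, `read_partition`, and this `transform_shard` function are hypothetical stand-ins for files on a cluster and for SparkXShards, and a thread pool stands in for Spark executors.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for CSV files stored on a cluster.
partitions = [
    "1,0,0.99539,g\n1,0,1.0,b\n",
    "1,0,1.0,g\n",
]

def read_partition(text):
    """Parse one CSV partition into a list of rows (one 'shard')."""
    return list(csv.reader(io.StringIO(text)))

def transform_shard(shards, fn):
    """Apply fn to every shard independently, as a data-parallel map."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fn, shards))

# Each partition is read on its own, then each shard is processed on its own.
shards = transform_shard(partitions, read_partition)
row_counts = transform_shard(shards, len)
print(row_counts, sum(row_counts))
```

The point of the pattern is that the per-shard function is ordinary single-machine code; only the mapping over shards is distributed.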

In [2]:
# start an OrcaContext
init_orca_context(memory="4g")



Initializing orca context
Current pyspark location is : /Users/guoqiong/opt/anaconda3/envs/py37tf2_x/lib/python3.7/site-packages/pyspark/__init__.py
Start to getOrCreate SparkContext
pyspark_submit_args is:  --driver-class-path /Users/guoqiong/opt/anaconda3/envs/py37tf2_x/lib/python3.7/site-packages/bigdl/share/core/lib/all-2.2.0-20220919.010507-1.jar:/Users/guoqiong/opt/anaconda3/envs/py37tf2_x/lib/python3.7/site-packages/bigdl/share/dllib/lib/bigdl-dllib-spark_2.4.6-2.2.0-SNAPSHOT-jar-with-dependencies.jar:/Users/guoqiong/opt/anaconda3/envs/py37tf2_x/lib/python3.7/site-packages/bigdl/share/orca/lib/bigdl-orca-spark_2.4.6-2.2.0-SNAPSHOT-jar-with-dependencies.jar pyspark-shell 
2022-11-20 00:31:33 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


2022-11-20 00:31:33 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2022-11-20 00:31:34,777 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-11-20 00:31:34,780 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-11-20 00:31:34,781 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-11-20 00:31:34,781 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
22-11-20 00:31:34 [Thread-3] INFO  Engine$:122 - Auto detect executor number and executor cores number
22-11-20 00:31:34 [Thread-3] INFO  Engine$:124 - Executor number is 1 and executor cores number is 2



User settings:

   KMP_AFFINITY=granularity=fine,compact,1,0
   KMP_BLOCKTIME=0
   KMP_SETTINGS=1
   OMP_NUM_THREADS=1

Effective settings:

   KMP_ABORT_DELAY=0
   KMP_ADAPTIVE_LOCK_PROPS='1,1024'
   KMP_ALIGN_ALLOC=64
   KMP_ALL_THREADPRIVATE=128
   KMP_ATOMIC_MODE=2
   KMP_BLOCKTIME=0
   KMP_DETERMINISTIC_REDUCTION=false
   KMP_DEVICE_THREAD_LIMIT=2147483647
   KMP_DISP_NUM_BUFFERS=7
   KMP_DUPLICATE_LIB_OK=false
   KMP_FORCE_REDUCTION: value is not defined
   KMP_FOREIGN_THREADS_THREADPRIVATE=true
   KMP_FORKJOIN_BARRIER='2,2'
   KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
   KMP_FORKJOIN_FRAMES=true
   KMP_FORKJOIN_FRAMES_MODE=3
   KMP_GTID_MODE=0
   KMP_HANDLE_SIGNALS=false
   KMP_HOT_TEAMS_MAX_LEVEL=1
   KMP_HOT_TEAMS_MODE=0
   KMP_INIT_AT_FORK=true
   KMP_INIT_WAIT=2048
   KMP_ITT_PREPARE_DELAY=0
   KMP_LIBRARY=throughput
   KMP_LOCK_KIND=queuing
   KMP_MALLOC_POOL_INCR=1M
   KMP_NEXT_WAIT=1024
   KMP_NUM_LOCKS_IN_BLOCK=1
   KMP_PLAIN_BARRIER='2,2'
   KMP_PLAIN_BARRIER_PATTERN=

22-11-20 00:31:35 [Thread-3] INFO  ThreadPool$:95 - Set mkl threads to 1 on thread 14
2022-11-20 00:31:35 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
22-11-20 00:31:35 [Thread-3] INFO  Engine$:461 - Find existing spark context. Checking the spark conf...
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registering: bigdl.dllib.utils.common  Sample
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common  EvaluatedResult
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common  JTensor
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JActivity
BigDLBasePickler registering: bigdl.dllib.utils.common  JActivity
Successfully got a SparkContext


In [3]:
data_shards = bigdl.orca.data.pandas.read_csv('/Users/guoqiong/intelWork/data/ionosphere/ionosphere.csv', header=None)

2022-11-20 00:31:35,738 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-11-20 00:31:35,739 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-11-20 00:31:35,741 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-11-20 00:31:35,742 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
22-11-20 00:31:35 [Thread-3] INFO  Engine$:122 - Auto detect executor number and executor cores number
22-11-20 00:31:35 [Thread-3] INFO  Engine$:124 - Executor number is 1 and executor cores number is 2
22-11-20 00:31:35 [Thread-3] INFO  ThreadPool$:95 - Set mkl threads to 1 on thread 14
2022-11-20 00:31:35 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
22-11-20 00:31:35 [Thread-3] INFO  Engine$:461 - Find existing spark context. Checking the spark conf...
create shards from Spark DataFrame attempted Arrow optimization failed as: name 'df' is not d

                                                                                

In [4]:
# show the first couple of rows in the data_shards
data_shards.head(5)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.99539 | -0.05889 | 0.85243 | 0.02306 | 0.83398 | -0.37708 | 1.0 | 0.0376 | ... | -0.51171 | 0.41078 | -0.46168 | 0.21266 | -0.3409 | 0.42267 | -0.54487 | 0.18641 | -0.453 | g |
| 1 | 1 | 0 | 1.0 | -0.18829 | 0.93035 | -0.36156 | -0.10868 | -0.93597 | 1.0 | -0.04549 | ... | -0.26569 | -0.20468 | -0.18401 | -0.1904 | -0.11593 | -0.16626 | -0.06288 | -0.13738 | -0.02447 | b |
| 2 | 1 | 0 | 1.0 | -0.03365 | 1.0 | 0.00485 | 1.0 | -0.12062 | 0.88965 | 0.01198 | ... | -0.4022 | 0.58984 | -0.22145 | 0.431 | -0.17365 | 0.60436 | -0.2418 | 0.56045 | -0.38238 | g |
| 3 | 1 | 0 | 1.0 | -0.45161 | 1.0 | 1.0 | 0.71216 | -1.0 | 0.0 | 0.0 | ... | 0.90695 | 0.51613 | 1.0 | 1.0 | -0.20099 | 0.25682 | 1.0 | -0.32382 | 1.0 | b |
| 4 | 1 | 0 | 1.0 | -0.02401 | 0.9414 | 0.06531 | 0.92106 | -0.23255 | 0.77152 | -0.16399 | ... | -0.65158 | 0.1329 | -0.53206 | 0.02431 | -0.62197 | -0.05707 | -0.59573 | -0.04608 | -0.65697 | g |


In [5]:
# check the number of partitions in data_shards
data_shards.num_partitions()


1

In [6]:
# count total number of rows in the data_shards
len(data_shards)

351

In [7]:
# get the column names of the elements (pandas DataFrames) in data_shards
columns = data_shards.get_schema()['columns']

## Encode labels

The labels are strings. Users can transform them into integers using `StringIndexer`.

In [8]:
label_encoder = StringIndexer(inputCol=columns[-1])
data_shards = label_encoder.fit_transform(data_shards)

createDataFrame from shards attempted Arrow optimization failed as: 'NoneType' object has no attribute 'json',Will try without Arrow optimization
2022-11-20 00:31:41 WARN  Utils:66 - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.


                                                                                

create shards from Spark DataFrame attempted Arrow optimization failed as: name 'df' is not defined. Will try without Arrow optimization




The encoded labels start from 1, so they need to be shifted to be zero-based.

In [9]:
def update_label_to_zero_base(df):
    df['34'] = df['34'] - 1
    return df
data_shards = data_shards.transform_shard(update_label_to_zero_base)
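
The two label steps above can be sketched in plain Python. This is an illustration only: it assumes the indexer assigns ids by descending label frequency, starting at 1 (the notebook itself only states that the ids start from 1), and `fit_string_index` is a hypothetical helper, not the Orca implementation.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a 1-based id, most frequent first (assumed ordering)."""
    counts = Counter(values)
    return {label: i for i, (label, _) in enumerate(counts.most_common(), start=1)}

labels = ["g", "b", "g", "b", "g"]       # toy labels in the style of the last column
mapping = fit_string_index(labels)        # ids start from 1
encoded = [mapping[v] for v in labels]
zero_based = [i - 1 for i in encoded]     # same shift as update_label_to_zero_base
print(mapping, encoded, zero_based)
```

The shift to zero-based labels matters because the binary cross-entropy loss used later expects targets in the range [0, 1].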

## Assemble features and labels

In [10]:
data_shards = data_shards.assembleFeatureLabelCols(featureCols=list(columns[:-1]),
                                                   labelCols=[columns[-1]])
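
Conceptually, this step splits each row into a feature vector and a label. The sketch below shows the idea in plain Python. The helper `assemble_feature_label_cols`, the toy rows, and the `"x"`/`"y"` keys are assumptions made for illustration; the exact container that Orca produces is not shown in this notebook.

```python
def assemble_feature_label_cols(rows, feature_cols, label_cols):
    """Split dict-like rows into a feature matrix and a label matrix (lists of lists)."""
    x = [[float(row[c]) for c in feature_cols] for row in rows]
    y = [[float(row[c]) for c in label_cols] for row in rows]
    return {"x": x, "y": y}  # key names are an assumption

# Toy data: three feature columns followed by one zero-based label column.
columns = ["0", "1", "2", "3"]
rows = [
    {"0": 1, "1": 0, "2": 0.99539, "3": 0},
    {"0": 1, "1": 0, "2": 1.0, "3": 1},
]

batch = assemble_feature_label_cols(
    rows, feature_cols=columns[:-1], label_cols=[columns[-1]]
)
print(batch)
```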

## Define PyTorch model and train it

Users can build a PyTorch model as usual and use the Orca Estimator to train it.

In [11]:
# define an MLP model
class MLP(Module):
    # define model elements
    def __init__(self, n_inputs):
        super(MLP, self).__init__()
        # input to first hidden layer
        self.hidden1 = Linear(n_inputs, 10)
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        # second hidden layer
        self.hidden2 = Linear(10, 8)
        kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
        self.act2 = ReLU()
        # third hidden layer and output
        self.hidden3 = Linear(8, 1)
        xavier_uniform_(self.hidden3.weight)
        self.act3 = Sigmoid()

    # forward propagate input
    def forward(self, X):
        # input to first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # third hidden layer and output
        X = self.hidden3(X)
        X = self.act3(X)
        return X

model = MLP(34)
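
As a quick sanity check on the architecture above, the number of trainable parameters can be worked out from the layer sizes alone. This small sketch in plain Python uses only the sizes that appear in the `MLP` definition; each `Linear(i, o)` layer holds `i * o` weights plus `o` biases.

```python
# (inputs, outputs) of hidden1, hidden2, hidden3 as defined in MLP(34)
layer_sizes = [(34, 10), (10, 8), (8, 1)]

def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

per_layer = [linear_params(i, o) for i, o in layer_sizes]
total = sum(per_layer)
print(per_layer, total)  # [350, 88, 9] 447
```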

In [12]:
# define criterion and optimizer
criterion = BCELoss()
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
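
For readers who want to see what these two objects compute, here is a minimal sketch in plain Python of the mean binary cross-entropy and of one SGD update with momentum. It is a simplification: it works on scalars, omits the clamping that PyTorch applies to the log terms, and assumes the default `SGD` options (no dampening, weight decay, or Nesterov momentum). The function names are hypothetical.

```python
import math

def bce_loss(probs, targets):
    """Mean binary cross-entropy for predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

def sgd_momentum_step(param, grad, velocity, lr=0.01, momentum=0.9):
    """One update: v <- momentum * v + grad, then param <- param - lr * v."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity

loss = bce_loss([0.9, 0.1], [1, 0])            # both predictions are 90% correct
p1, v1 = sgd_momentum_step(1.0, 0.5, 0.0)      # first step from zero velocity
p2, v2 = sgd_momentum_step(p1, 0.5, v1)        # momentum enlarges the second step
print(loss, (p1, v1), (p2, v2))
```

With the same gradient on both steps, the second update is larger than the first, which is the effect `momentum=0.9` is meant to provide.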

In [13]:
# build Orca Estimator
orca_estimator = Estimator.from_torch(model=model,
                                      optimizer=optimizer,
                                      loss=criterion,
                                      metrics=[Accuracy()],
                                      backend="bigdl")


creating: createTorchLoss
creating: createTorchOptim
creating: createZooKerasAccuracy
creating: createEstimator


In [14]:
# train the model
orca_estimator.fit(data=data_shards, epochs=100, batch_size=32)

creating: createMaxEpoch
22-11-20 00:31:45 [Thread-3] INFO  InternalDistriOptimizer$:1009 - TorchModel[13f70631] isTorch is true
22-11-20 00:31:45 [Thread-3] INFO  InternalDistriOptimizer$:1016 - torch model will use 1 OMP threads.
22-11-20 00:31:45 [Thread-3] INFO  DistriOptimizer$:830 - caching training rdd ...
22-11-20 00:31:46 [Thread-3] INFO  DistriOptimizer$:655 - Cache thread models...
22-11-20 00:31:46 [Executor task launch worker for task 615] INFO  ThreadPool$:95 - Set mkl threads to 1 on thread 53
22-11-20 00:31:46 [Executor task launch worker for task 615] INFO  ThreadPool$:95 - Set mkl threads to 1 on thread 53
22-11-20 00:31:46 [Executor task launch worker for task 615] INFO  DistriOptimizer$:638 - model thread pool size is 1
2022-11-20 00:31:46 WARN  BlockManager:66 - Asked to remove block test_0weights0, which does not exist
2022-11-20 00:31:46 WARN  BlockManager:66 - Asked to remove block test_0gradients0, which does not exist
22-11-20 00:31:46 [Thread-3] INFO  DistriO

[Stage 26:>                                                         (0 + 1) / 1]OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
----------------------------------------ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/Users/guoqiong/opt/anaconda3/envs/py37tf2_x/lib/p

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:60690)
Traceback (most recent call last):
  File "/Users/guoqiong/opt/anaconda3/envs/py37tf2_x/lib/python3.7/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/Users/guoqiong/opt/anaconda3/envs/py37tf2_x/lib/python3.7/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/Users/guoqiong/opt/anaconda3/envs/py37tf2_x/lib/python3.7/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/Users/guoqiong/opt/anaconda3/envs/py37tf2_x/lib/python3.7/socketserver.py", line 720, in __init__
    self.handle()
  File "/Users/guoqiong/opt/anaconda3/envs/py37tf2_x/lib/python3.7/site-packages/pyspark/accumulators.py", line 269, in handle
    poll(accum_updates)
  File "/Users/guoqiong/opt/anaconda3/envs/py37tf2_x/lib/python3.7/site-packag

Py4JError: An error occurred while calling o59.estimatorTrain
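
The OMP hint in the output above names the environment variable `KMP_DUPLICATE_LIB_OK`, and the import cell contains a commented-out line that sets it. Below is a minimal sketch of applying it. It assumes the variable has to be set before `init_orca_context` and before any library that loads the OpenMP runtime is imported. Note that the hint itself calls this an unsafe, unsupported workaround that may cause crashes or incorrect results; the hint recommends ensuring that only a single OpenMP runtime is linked into the process.

```python
import os

# Workaround quoted from the OMP hint; set it first in the notebook so that
# processes started afterwards inherit it (assumption about ordering).
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

print(os.environ["KMP_DUPLICATE_LIB_OK"])
```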

In [None]:
# stop OrcaContext
stop_orca_context()