# Simple example

Copyright 2016 The BigDL Authors.

SparkXShards in Orca lets users process large-scale datasets with existing Python code in a distributed, data-parallel fashion. This notebook walks through a simple deep learning project using Keras and SparkXShards.

It is adapted from [Your First Deep Learning Project in Python with Keras Step-by-Step](https://machinelearningmastery.com/tutorial-first-neural-network-python-keras) on diabetes data. 

In [None]:
# import necessary libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

import bigdl.orca.data.pandas
from bigdl.orca import init_orca_context, stop_orca_context
from bigdl.orca.learn.tf.estimator import Estimator
import warnings

warnings.filterwarnings('ignore')

Start an OrcaContext and set the read backend to "pandas". Orca then uses pandas to read each file into a pandas DataFrame.

In [None]:
sc = init_orca_context(memory="4g")

## Load data in parallel and get general information

Load the data into `data_shards`, a SparkXShards that can be operated on in parallel. Each element of `data_shards` is a pandas DataFrame read from a file on the cluster. To distribute local code that calls `pd.read_csv(dataFile)`, replace the call with `bigdl.orca.data.pandas.read_csv(datapath)`.

In [None]:
datapath = './diabetes/pima-indians-diabetes.data.csv'
data_shards = bigdl.orca.data.pandas.read_csv(datapath, header=None)
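
To make the data-parallel idea concrete, here is a minimal plain-Python sketch. It is not how Orca is implemented: lists of rows stand in for pandas DataFrames, and `split_into_shards` and `count_positive` are hypothetical helpers introduced only for illustration. The five rows are the first rows of the diabetes file.

```python
import csv
import io

# First five rows of the diabetes data (eight features, then the label).
RAW = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""


def read_rows(text):
    """Parse CSV text into a list of rows of floats."""
    return [[float(v) for v in row] for row in csv.reader(io.StringIO(text)) if row]


def split_into_shards(rows, num_shards):
    """Hypothetical helper: split rows into contiguous chunks."""
    size = -(-len(rows) // num_shards)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def count_positive(shard):
    """Per-shard function: count rows whose last column (the label) is 1."""
    return sum(1 for row in shard if row[-1] == 1.0)


rows = read_rows(RAW)
shards = split_into_shards(rows, num_shards=2)

# Each shard is processed independently, then the partial results are combined.
partial_counts = [count_positive(shard) for shard in shards]
total_positive = sum(partial_counts)
print(len(shards), partial_counts, total_positive)
```

The same pattern (apply a function to every shard, then combine) is what lets existing per-DataFrame code run unchanged on each partition.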

In [4]:
# show the first couple of rows in the data_shards
data_shards.head(5)

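|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |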


In [5]:
# get the number of partitions of data_shards
data_shards.num_partitions()


1

In [6]:
# count total number of rows in the data_shards
len(data_shards)

768

## Assemble features and labels

In [7]:
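columns = list(data_shards.get_schema()['columns'])
# all columns except the last are features; the last column is the label
# wrap the single label column in a list (list(columns[-1]) would try to iterate over the name)
data_shards = data_shards.assembleFeatureLabelCols(featureCols=columns[:-1],
                                                   labelCols=[columns[-1]])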
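
As a rough illustration of what the feature/label split means for a single row, and of why a single column name should be wrapped as `[name]` rather than `list(name)`, here is a small plain-Python sketch. It assumes the columns are named by integer position 0..8, as is usual with `header=None`. The string `"outcome"` is a hypothetical column name used only to show the pitfall.

```python
# Assumption: with header=None the columns are named by integer position 0..8.
columns = list(range(9))

feature_cols = columns[:-1]  # the eight input columns
label_cols = [columns[-1]]   # one-element list holding the label column

# First data row from the head() output above.
row = [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1]
x = [row[c] for c in feature_cols]
y = [row[c] for c in label_cols]
print(x, y)

# list() iterates over its argument, so it does not wrap a single name.
chars = list("outcome")  # "outcome" is a hypothetical column name
try:
    list(columns[-1])    # an int is not iterable
    wrapped_int = True
except TypeError:
    wrapped_int = False
print(chars, wrapped_int)
```

With a string name, `list(name)` silently yields its characters. With an integer name, it raises `TypeError`.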

## Define Keras model and train it

Build the model as usual. Here, we'll use a Sequential model with two densely connected hidden layers and an output layer with a sigmoid activation that returns a single value between 0 and 1, the predicted probability of the positive class.


In [None]:
model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
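
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes standard fully connected layers, each with one weight per input-unit pair plus one bias per unit. `dense_params` is a hypothetical helper, not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


# 8 input features -> 12 units -> 8 units -> 1 output unit
layer_sizes = [8, 12, 8, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

The three layers contribute 108, 104, and 9 parameters, 221 in total, which is small relative to the 768 rows in the dataset.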

In [9]:
est = Estimator.from_keras(keras_model=model)

In [None]:
est.fit(data=data_shards,
        batch_size=16,
        epochs=150)
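
A quick back-of-the-envelope calculation shows how much work this `fit` call implies. It assumes `batch_size` is the number of samples consumed per optimizer step over the whole dataset and that any final partial batch is kept. How the batch is divided among workers depends on the Orca backend and is not modeled here.

```python
import math

num_rows = 768   # len(data_shards) reported earlier
batch_size = 16
epochs = 150

steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions each epoch takes 48 steps, so the full run performs 7200 optimizer steps.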

In [11]:
stop_orca_context()

Stopping orca context
