# Linear Regression

In this tutorial, we show how to use BigDL to train a simple linear regression model. The first step is to import the necessary packages and initialize the engine.

In [1]:
%pylab inline
import pandas
import datetime as dt

from bigdl.nn.layer import *
from bigdl.nn.criterion import *
from bigdl.optim.optimizer import *
from bigdl.util.common import *
from bigdl.util.common import Sample
from bigdl.dataset.transformer import *

init_engine()

Populating the interactive namespace from numpy and matplotlib


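Next, we generate a random training dataset. Each sample has FEATURES_DIM features drawn uniformly from [0, 1), and its label is twice the sum of the features plus 0.4.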

In [2]:
FEATURES_DIM = 2
data_len = 100

def gen_rand_sample():
    features = np.random.uniform(0, 1, (FEATURES_DIM))
    label = (2 * features).sum() + 0.4
    return Sample.from_ndarray(features, label)

rdd_train = sc.parallelize(range(0, data_len)).map( lambda i: gen_rand_sample() )
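To make the target of the regression explicit, here is a minimal NumPy-only sketch (no Spark or BigDL involved) that regenerates data with the same rule and solves for the weights in closed form. The fixed seed and the names `design`, `solution`, `weights`, and `bias` are assumptions made for illustration. Because the labels are an exact linear function of the features, the fit should recover weights close to 2 and a bias close to 0.4, which is what the trained model is expected to approach.

```python
import numpy as np

FEATURES_DIM = 2
data_len = 100

# Fixed seed (an assumption, for reproducibility only).
rng = np.random.RandomState(0)
features = rng.uniform(0, 1, (data_len, FEATURES_DIM))
# Same labeling rule as gen_rand_sample(): 2 * sum(features) + 0.4.
labels = (2 * features).sum(axis=1) + 0.4

# Append a column of ones so the bias is fitted together with the weights.
design = np.hstack([features, np.ones((data_len, 1))])
solution, _, _, _ = np.linalg.lstsq(design, labels, rcond=None)
weights, bias = solution[:FEATURES_DIM], solution[FEATURES_DIM]

print(weights, bias)
```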

Next, we specify the necessary parameters and construct a linear regression model using BigDL. Note that batch_size must be divisible by the number of cores you use. In this example it is set to 4, since the example was run on 4 cores.

In [3]:
# Parameters
learning_rate = 0.2
training_epochs = 5
batch_size = 4
n_input = FEATURES_DIM
n_output = 1 

def linear_regression(n_input, n_output):
    # Initialize a sequential container
    model = Sequential()  
    # Add a linear layer
    model.add(Linear(n_input, n_output))
 
    return model

model = linear_regression(n_input, n_output)

creating: createSequential
creating: createLinear
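The divisibility requirement on batch_size mentioned above can be checked or enforced with a small helper. This is only a sketch: `round_up_to_multiple` is a hypothetical function written for this tutorial, not part of BigDL, and the core count of 4 is taken from the text.

```python
def round_up_to_multiple(batch_size, n_cores):
    """Return the smallest multiple of n_cores that is >= batch_size."""
    if batch_size <= 0 or n_cores <= 0:
        raise ValueError("batch_size and n_cores must be positive")
    return ((batch_size + n_cores - 1) // n_cores) * n_cores

n_cores = 4  # core count mentioned in the text
for requested in (4, 6, 8):
    print(requested, "->", round_up_to_multiple(requested, n_cores))
```

Rounding up rather than down keeps the batch at least as large as requested. Either direction would satisfy the constraint.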


Here we construct the optimizer for the linear regression problem. You can specify your own learning rate in the *SGD()* method, and you can replace *SGD()* with another optimizer such as *Adam()*. Click [here](https://github.com/intel-analytics/BigDL/blob/master/pyspark/bigdl/optim/optimizer.py) to see more optimizers.

In [4]:
# Create an Optimizer
optimizer = Optimizer(
    model=model,
    training_rdd=rdd_train,
    criterion=MSECriterion(),
    optim_method=SGD(learningrate=learning_rate),
    end_trigger=MaxEpoch(training_epochs),
    batch_size=batch_size)

creating: createMSECriterion
creating: createDefault
creating: createSGD
creating: createMaxEpoch
creating: createOptimizer


In [5]:
# Start training
trained_model = optimizer.optimize()
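For intuition about what this training step does, the following is a plain NumPy sketch of mini-batch SGD on a mean squared error loss, using the same learning rate, epoch count, and batch size as above. It is not BigDL's implementation: BigDL distributes the work across Spark, and details such as weight initialization, shuffling, and loss averaging may differ. The zero initialization, the fixed seed, and the helper `mse` are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, an assumption
X = rng.uniform(0, 1, (100, 2))
y = (2 * X).sum(axis=1) + 0.4   # same labeling rule as the training data

learning_rate, training_epochs, batch_size = 0.2, 5, 4
w = np.zeros(2)  # zero initialization is an assumption
b = 0.0

def mse(w, b):
    """Mean squared error of the linear model over the full dataset."""
    return float(((X @ w + b - y) ** 2).mean())

initial_loss = mse(w, b)
for epoch in range(training_epochs):
    order = rng.permutation(len(X))  # reshuffle the samples every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w + b - y[idx]
        # Gradients of the mean squared error over the mini-batch.
        grad_w = 2 * X[idx].T @ err / len(idx)
        grad_b = 2 * err.mean()
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
final_loss = mse(w, b)

print(initial_loss, final_loss, w, b)
```

With only five epochs the parameters are not expected to match the generating values exactly, but the loss should drop substantially from its initial value.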

In [6]:
# Print the first five predicted results of training data.
predict_result = trained_model.predict(rdd_train)
p = predict_result.take(5)

print("Predicted results: \n")
for i in p:
    print(str(i) + "\n")

Predicted results: 

[ 2.14816165]

[ 0.74190128]

[ 1.87969053]

[ 3.01176739]

[ 1.80625415]



To test the trained model, we construct a test dataset and print the *mean squared error* of its predictions.

In [7]:
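def test_predict(trained_model):
    np.random.seed(100)
    total_length = 10
    features = np.random.uniform(0, 1, (total_length, 2))
    # Per-sample labels from the same rule used for training: 2 * sum(features) + 0.4.
    # They are only needed to build Sample objects for prediction.
    labels = (2 * features).sum(axis=1) + 0.4
    predict_data = sc.parallelize(range(0, total_length)).map(
        lambda i: Sample.from_ndarray(features[i], labels[i]))
    
    predict_result = trained_model.predict(predict_data)
    p = predict_result.take(6)
    # Reference values are hard-coded here rather than derived from `features`.
    ground_label = np.array([[-0.47596836], [-0.37598032], [-0.00492062],
                             [-0.5906958], [-0.12307882], [-0.77907401]], dtype="float32")
    # Mean squared error between the first six predictions and the reference values.
    mse = ((p - ground_label) ** 2).mean()
    print(mse)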
    
test_predict(trained_model)

8.03806
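The hard-coded reference values in the cell above are all negative, whereas the training rule only produces labels between 0.4 and 4.4, so a large error against them is not surprising. As an alternative check, the sketch below computes the mean squared error against labels derived from the test features themselves. It uses NumPy only: `mean_squared_error` is a hypothetical helper, and `perfect_predictions` is a stand-in for the output of an ideally trained model, not the output of the BigDL model. It does not reproduce the value printed above.

```python
import numpy as np

def mean_squared_error(predictions, targets):
    """Average of squared differences between two equally long sequences."""
    predictions = np.asarray(predictions, dtype="float64").reshape(-1)
    targets = np.asarray(targets, dtype="float64").reshape(-1)
    if predictions.shape != targets.shape:
        raise ValueError("predictions and targets must have the same length")
    return float(((predictions - targets) ** 2).mean())

np.random.seed(100)
features = np.random.uniform(0, 1, (10, 2))
# Targets follow the training rule: 2 * sum(features) + 0.4.
targets = (2 * features).sum(axis=1) + 0.4

# Stand-in for an ideally trained model (weights 2, 2 and bias 0.4).
perfect_predictions = features @ np.array([2.0, 2.0]) + 0.4
print(mean_squared_error(perfect_predictions, targets))
```

In the notebook, the list returned by `predict_result.take(...)` could be passed in place of `perfect_predictions`, together with the matching slice of `targets`.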
