<a href="https://colab.research.google.com/github/madhavamk/computational-data-science/blob/master/MiniProjects/M4_NB_MiniProject_01_MLR_MPI_OpenMP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: Implementation of Multiple Linear Regression using MPI and OpenMP

## Learning Objectives

At the end of the mini-project, you will be able to:

* understand collective communication operations such as scatter, gather, and broadcast
* understand blocking and non-blocking communication
* implement multiple linear regression and run it using MPI
* implement multiple linear regression based predictions using OpenMP

### Dataset

The dataset chosen for this mini-project is [Combined Cycle Power Plant](https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant). The dataset is made up of 9568 records and 5 columns. Each record contains the values for Ambient Temperature, Exhaust Vacuum, Ambient Pressure, Relative Humidity and Energy Output.

Predicting the full-load electrical power output of a base-load power plant is important for maximizing the profit from the available megawatt hours. The base-load operation of a power plant is influenced by four main parameters, which are used as input variables in the dataset: ambient temperature, atmospheric pressure, relative humidity, and exhaust steam pressure. These parameters affect the electrical power output, which is the target variable.

**Note:** The data was collected over a six-year period (2006-2011).

## Information

#### MPI in a Nutshell

MPI stands for "Message Passing Interface". It is a library of functions (in C / Python) or subroutines (in Fortran) that you insert into source code to perform data communication between processes. MPI was developed over two years of discussions led by the MPI Forum, a group of roughly sixty people representing some forty organizations.

To know more about MPI click [here](https://hpc-tutorials.llnl.gov/mpi/)


#### Multiple Linear Regression

Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).

**Note:** We will be using the mpi4py Python package for MPI based code implementation

## Grading = 20 Points

**Run the below code to install mpi4py package**

In [1]:
!pip install mpi4py

Collecting mpi4py
  Downloading mpi4py-4.0.0.tar.gz (464 kB)
     464.8/464.8 kB 6.1 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: mpi4py
  Building wheel for mpi4py (pyproject.toml) ... done
  Created wheel for mpi4py: filename=mpi4py-4.0.0-cp310-cp310-linux_x86_64.whl size=4266264 sha256=9bfd0fb7cb8e0b8b203964f04f57a3f61887fee715189bb25fe6c060001a413c
  Stored in directory: /root/.cache/pip/wheels/96/17/12/83db63ee0ae5c4b040ee87f2e5c813aea4728b55ec6a37317c
Successfully built mpi4py
Installing collected packages: mpi4py
Successfully installed mpi4py-4.0.0


#### Importing Necessary Packages

In [2]:
# Importing pandas
import pandas as pd
# Importing Numpy
import numpy as np
# Importing MPI from mpi4py package
from mpi4py import MPI
# Importing the sqrt function from the math module
from math import sqrt
# Importing the Decimal class and the ROUND_HALF_UP rounding mode from the decimal module
from decimal import Decimal, ROUND_HALF_UP
import time
from sklearn.model_selection import train_test_split

#### Downloading the data

In [3]:
#@title Download the data
!wget -qq https://cdn.iisc.talentsprint.com/CDS/Datasets/PowerPlantData.csv

### Overview

* Load the data and perform data pre-processing
* Identify the features, target and split the data into train and test
* Implement multiple Linear Regression by estimating the coefficients on the given data
* Use MPI package to distribute the data and implement `communicator`
* Define functions for each objective and make a script (.py) file to execute using MPI command
* Use OpenMP component to predict the data and calculate the error on the predicted data
* Implement the Linear Regression from `sklearn` and compare the results

#### Exercise 1: Load data (1 point)

Write a function that takes the filename as input and loads the data in a pandas dataframe with the column names as Ambient Temperature, Exhaust Vacuum, Ambient Pressure, Relative Humidity and Energy Output respectively.

**Hint:** read_csv()


In [4]:
FILENAME = "/content/PowerPlantData.csv" # File path

# YOUR CODE HERE to Define a function to load the data
def load_data(filename):
    data = pd.read_csv(filename)
    return data

#### Exercise 2: Explore data (1 point)

Write a function that takes the data loaded by the function defined above as input and explores it.

**Hint:** You can define and check for following things in the dataset inside a function

- checking for the number of rows and columns
- summary of the dataset
- check for the null values
- check for the duplicate values

In [5]:
# YOUR CODE HERE
def explore_data():
    df = load_data(FILENAME)
    print('Number of rows: {}'.format(df.shape[0]))
    print('Number of columns: {}'.format(df.shape[1]))
    print('Summary of dataset \n {}'.format(df.describe()))
    print('Null values \n {}'.format(df.isnull().sum()))
    print('Duplicate values \n {}'.format(df.duplicated().sum()))
    return df

In [6]:
df = explore_data()

Number of rows: 9568
Number of columns: 5
Summary of dataset 
                 AT            V           AP           RH           PE
count  9568.000000  9568.000000  9568.000000  9568.000000  9568.000000
mean     19.651231    54.305804  1013.259078    73.308978   454.365009
std       7.452473    12.707893     5.938784    14.600269    17.066995
min       1.810000    25.360000   992.890000    25.560000   420.260000
25%      13.510000    41.740000  1009.100000    63.327500   439.750000
50%      20.345000    52.080000  1012.940000    74.975000   451.550000
75%      25.720000    66.540000  1017.260000    84.830000   468.430000
max      37.110000    81.560000  1033.300000   100.160000   495.760000
Null values 
 AT    0
V     0
AP    0
RH    0
PE    0
dtype: int64
Duplicate values 
 41


#### Exercise 3: Handle missing data (1 point)

After exploring the dataset, if any null values are present, define a function that takes the data loaded by the function defined above as input and handles the null values accordingly.

**Hint:**

- Drop the records containing the null values - dropna()
- Replace the null values with the mean/median/mode - fillna()

In [7]:
# Function to handle missing data

# YOUR CODE HERE
def handle_missing_data(df):
    df = df.dropna()
    return df

In [8]:
df = handle_missing_data(df)

#### Exercise 4: Scale the data (1 point)

Write a function that takes the data after handling the missing data as input and returns the standardized data.

**Hint:**

- standardization of the data  can be performed using the below formula

$ (x - mean(x)) / std(x) $
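
As a small worked example of the formula above, the sketch below standardizes a short list by hand using only the standard library. The values are made up for illustration. `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is also the default behaviour of pandas' `std()`.

```python
from statistics import mean, stdev

# Hypothetical values, chosen only for illustration
values = [2.0, 4.0, 6.0, 8.0]

mu = mean(values)       # arithmetic mean of the values
sigma = stdev(values)   # sample standard deviation (divides by n - 1)

# Apply (x - mean(x)) / std(x) to every element
standardized = [(x - mu) / sigma for x in values]

print(standardized)
```

After this transformation the list has mean 0 and sample standard deviation 1, which is the property the scaled columns of the dataframe should also have.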

In [9]:
# Defining a function to standardize the data

# YOUR CODE HERE
def standardize_data(df):
    df = (df - df.mean())/df.std()
    return df

In [10]:
df = standardize_data(df)

In [11]:
df.head()

         AT         V        AP        RH        PE
0 -1.517782 -1.065149 -0.407336  1.143885  1.530146
1  0.535228  0.329260 -0.313040  0.061028 -0.504776
2  1.353748  0.204141 -1.028675 -2.150575 -0.914338
3 -0.077992 -0.363223 -1.016888  0.238422 -0.074706
4 -1.053507 -1.073805  0.651804  1.636341  0.589734


#### Exercise 5: Feature selection (1 point)

Write a function that takes scaled data as input and returns the features and target variable values

**Hint:**

- Features: AmbientTemperature, ExhaustVaccum, AmbientPressure, RelativeHumidity
- Target Variable: EnergyOutput

In [12]:
# Define a function

# YOUR CODE HERE
def feature_target(df):
    features = df.drop('PE', axis=1)
    target = df['PE']
    return features, target

In [13]:
features,target = feature_target(df)

#### Exercise 6: Correlation (1 point)

Calculate correlation between the variables

In [14]:
# YOUR CODE HERE
df.corr()

          AT         V        AP        RH        PE
AT  1.000000  0.844107 -0.507549 -0.542535 -0.948128
V   0.844107  1.000000 -0.413502 -0.312187 -0.869780
AP -0.507549 -0.413502  1.000000  0.099574  0.518429
RH -0.542535 -0.312187  0.099574  1.000000  0.389794
PE -0.948128 -0.869780  0.518429  0.389794  1.000000


#### Exercise 7: Estimate the coefficients (2 points)

Write a function that takes features and target as input and returns the estimated coefficient values

**Hint:**

- Calculate the estimated coefficients using the below formula

$ β = (X^T X)^{-1} X^T y $

- transpose(), np.linalg.inv()
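
The formula in the hint can be checked on a tiny example. The sketch below uses made-up, noiseless data generated from known coefficients (an assumption for illustration, not the power plant data) and compares the explicit-inverse form with `np.linalg.solve`, which solves the same normal equations without forming the inverse.

```python
import numpy as np

# Hypothetical noiseless data: y = 2*x1 - 3*x2 exactly (no intercept)
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 5.0]])
true_beta = np.array([2.0, -3.0])
y = X @ true_beta

# Normal equations: (X^T X) beta = X^T y
XtX = X.T @ X
Xty = X.T @ y

# Textbook form with an explicit inverse, as in the hint
beta_inv = np.linalg.inv(XtX) @ Xty

# Same system solved without forming the inverse
beta_solve = np.linalg.solve(XtX, Xty)

print(beta_inv, beta_solve)
```

Both approaches should recover the coefficients used to generate `y`. Solving the system directly is often preferred numerically, although for four well-scaled features the difference is unlikely to matter.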

In [15]:
# Calculating the coefficients

# YOUR CODE HERE
def estimated_coefficients(features, target):
    X = features
    y = target
    transpose_of_X = X.transpose()
    product_of_X_and_transpose = transpose_of_X.dot(X)
    inverse_of_product = np.linalg.inv(product_of_X_and_transpose)
    product_of_inverse_and_transpose = inverse_of_product.dot(transpose_of_X)
    coefficients = product_of_inverse_and_transpose.dot(y)

    return coefficients


#### Exercise 8: Fit the data to estimate the coefficients (2 points)

Write a function named fit which takes features and targets as input and returns the intercept and coefficient values.

**Hint:**

- create a dummy column in the features dataframe which is made up of all ones
- convert the features dataframe into numpy array
- call the estimated coefficients function which is defined above
- np.ones(), np.concatenate()
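
To see why the dummy column of ones is needed, here is a minimal sketch on made-up data generated with a known intercept. The array names and values are hypothetical. With the ones column in place, the first estimated coefficient plays the role of the intercept.

```python
import numpy as np

# Hypothetical data generated from y = 5 + 2*x1 - 1*x2
features = np.array([[0.0, 1.0],
                     [1.0, 0.0],
                     [2.0, 3.0],
                     [3.0, 1.0],
                     [4.0, 2.0]])
target = 5.0 + 2.0 * features[:, 0] - 1.0 * features[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept
ones = np.ones((features.shape[0], 1))
design = np.concatenate((ones, features), axis=1)

# Same closed-form estimate as in Exercise 7
beta = np.linalg.inv(design.T @ design) @ design.T @ target
intercept, coefficients = beta[0], beta[1:]

print(intercept, coefficients)
```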

In [16]:
# defining a fit function
def fit(features, target):
    # YOUR CODE HERE
    dummy_col = np.ones((features.shape[0], 1))
    features = np.concatenate((dummy_col, features), axis=1)
    converted_features = np.array(features)
    coefficients = estimated_coefficients(features, target)
    intercept = coefficients[0]
    coefficients = coefficients[1:]
    print("Intercept {}, Coefficients {}".format(intercept, coefficients))
    return intercept, coefficients

In [17]:
intercept, coefficients = fit(features,target)

Intercept -1.590307746601738e-15, Coefficients [-0.86350078 -0.17417154  0.02160293 -0.13521023]


#### Exercise 9: Predict the data on estimated coefficients (1 point)

Write a function named predict which takes features, intercept and coefficient values as input and returns the predicted values.

**Hint:**

- Fit the intercept, coefficients values in the below equation

  $y = b_0 + b_1*x_1 + ... + b_i*x_i$
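
As a worked instance of this equation, the sketch below plugs in the intercept and coefficients printed by `fit` in Exercise 8 together with the first standardized record shown by `df.head()`. Because these inputs are copied from rounded printouts, the result should only be expected to agree approximately with the value a full-precision `predict` call produces for that record.

```python
# Intercept and coefficients as printed by fit() on the full standardized data
intercept = -1.590307746601738e-15
coefficients = [-0.86350078, -0.17417154, 0.02160293, -0.13521023]

# First standardized record, in the order AT, V, AP, RH
row = [-1.517782, -1.065149, -0.407336, 1.143885]

# y = b_0 + b_1*x_1 + ... + b_i*x_i
prediction = intercept + sum(b * x for b, x in zip(coefficients, row))

print(round(prediction, 4))
```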

In [18]:
# function to predict the values
def predict(features, intercept, coefficients):
    '''
    y = b_0 + b_1*x_1 + ... + b_i*x_i
    '''
    #YOUR CODE HERE
    predictions = intercept + np.dot(features, coefficients)

    return predictions

In [19]:
features

            AT         V        AP        RH
0    -1.517782 -1.065149 -0.407336  1.143885
1     0.535228  0.329260 -0.313040  0.061028
2     1.353748  0.204141 -1.028675 -2.150575
3    -0.077992 -0.363223 -1.016888  0.238422
4    -1.053507 -1.073805  0.651804  1.636341
...        ...       ...       ...       ...
9563 -0.608017 -0.423816 -0.245686 -0.025957
9564  1.846202  1.860591 -0.498263 -0.930735
9565 -0.491277 -0.862913  0.158437  0.366502
9566 -0.268532  0.437854  0.895962  1.461687


In [20]:
features.iloc[0]

           0
AT -1.517782
V  -1.065149
AP -0.407336
RH  1.143885


In [21]:
predict(features, intercept, coefficients)

array([ 1.33266027, -0.53453122, -0.93596031, ...,  0.52838115,
       -0.02266325, -0.41153611])

#### Exercise 10: Root mean squared error (1 point)

Write a function to calculate the root mean squared error (RMSE).

**Hint:**

- [How to calculate the RMSE](https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e)
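
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below spells out those three steps on made-up numbers using only the standard library.

```python
from math import sqrt

# Hypothetical actual and predicted values
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]

# Squared error for each pair, then the mean, then the square root
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)
rmse = sqrt(mse)

print(round(rmse, 4))
```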

In [22]:
# Define a function to calculate the error

# YOUR CODE HERE
def calc_rmse(target, features, intercept, coefficients):
    rmse = sqrt(np.mean((target - predict(features, intercept, coefficients))**2))
    return round(rmse,4)

In [23]:
calc_rmse(target, features, intercept, coefficients)

0.267

#### Exercise 11: Split the data into train and test (1 point)

Write a function named train_test_split which takes features and targets as input and returns the train and test sets respectively.

**Hint:**

- Shuffle the data
- Consider 70 % of data as a train set and the rest of the data as a test set
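
The two hints can be sketched with NumPy alone: shuffle the row indices once, then cut the shuffled order at the 70 % mark so that features and target stay aligned. The arrays and the seed value below are assumptions made for illustration.

```python
import numpy as np

# Hypothetical data: 10 rows, 2 feature columns, 1 target value per row
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

# A fixed seed (an arbitrary choice here) makes the shuffle repeatable
rng = np.random.default_rng(42)
order = rng.permutation(len(features))

# First 70 % of the shuffled rows go to train, the rest to test;
# integer arithmetic avoids floating-point surprises at the cut point
split_index = (7 * len(features)) // 10
train_idx, test_idx = order[:split_index], order[split_index:]

X_train, X_test = features[train_idx], features[test_idx]
Y_train, Y_test = target[train_idx], target[test_idx]

print(X_train.shape, X_test.shape)
```

Indexing both arrays with the same shuffled indices is what keeps each feature row paired with its own target value.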

In [24]:
# YOUR CODE HERE
def my_train_test_split(features, target, test_size=0.3, shuffle=True):
    if shuffle:
        data = features.copy()
        data['target'] = target
        data = data.sample(frac=1).reset_index(drop=True)
        features = data.drop('target', axis=1)
        target = data['target']
    # Splitting the data
    split_index = int(features.shape[0] * (1 - test_size))
    X_train = features[:split_index]
    X_test = features[split_index:]
    Y_train = target[:split_index]
    Y_test = target[split_index:]
    return X_train, X_test, Y_train, Y_test

def data_split(features, target):
    # YOUR CODE HERE
    X_train, X_test, Y_train, Y_test = my_train_test_split(features, target, test_size=0.3, shuffle=True)
    return X_train, X_test, Y_train, Y_test

In [25]:
X_train, X_test, Y_train, Y_test = data_split(features, target)

#### Exercise 12: Implement predict using OpenMP (1 point)

Get the predictions for the test data and calculate the test error (RMSE) using OpenMP-style parallelism (pymp)

**Hints:**

* Using the pymp.Parallel implement the predict function (use from above)

* Call the predict function by passing test data as an argument

* calculate the error (RMSE) by comparing the Actual test data and predicted test data

In [26]:
!pip install pymp-pypi

Collecting pymp-pypi
  Downloading pymp-pypi-0.5.0.tar.gz (12 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pymp-pypi
  Building wheel for pymp-pypi (setup.py) ... done
  Created wheel for pymp-pypi: filename=pymp_pypi-0.5.0-py3-none-any.whl size=10314 sha256=88cb09915b856c4ab2441d94752622e290ab2145a2806a9f019a42d09fc701cf
  Stored in directory: /root/.cache/pip/wheels/5e/db/4b/4c02f5b91b1abcde14433d1b336ac00a09761383e7bb1013cf
Successfully built pymp-pypi
Installing collected packages: pymp-pypi
Successfully installed pymp-pypi-0.5.0


In [27]:
import pymp

def predict_using_openmp(features, intercept, coefficients):
    # YOUR CODE HERE
    predictions = pymp.shared.array(features.shape[0], dtype='float')
    with pymp.Parallel(4) as p:
        for i in p.range(0, features.shape[0]):
            # Call predict function defined above
            predictions[i] = predict(features.iloc[i], intercept, coefficients)
    return predictions

predictions_openmp = predict_using_openmp(X_test, intercept, coefficients)
# Calculate the RMSE from the OpenMP predictions themselves
rmse = round(sqrt(np.mean((np.asarray(Y_test) - np.asarray(predictions_openmp))**2)), 4)
print('RMSE: {}'.format(rmse))

RMSE: 0.2676


#### Exercise 13: Create a communicator (1 point)

Create a communicator and define the rank and size

In [29]:
# YOUR CODE HERE
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
print('Rank: {}'.format(rank))
print('Size: {}'.format(size))

Rank: 0
Size: 1


#### Exercise 14: Divide the data into slices (1 point)

Write a function named dividing_data which takes train features set, train target set, and size of workers as inputs and returns the sliced data for each worker.

![img](https://cdn.iisc.talentsprint.com/CDS/Images/MiniProject_MPI_DataSlice.JPG)

For example, if there are 4 processes, slice the data into 4 equal parts, each holding 25% of the rows

**Hint:**

- Divide the Data equally among the workers
  - Create an empty list
  - Iterate over the size of workers
  - Append each slice of data to the list
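
When the number of rows is not an exact multiple of the number of workers, fixed-size slices can leave the last worker short or skip trailing rows. One alternative, shown here only as a sketch and not as the approach this notebook requires, is `np.array_split`, which allows slightly uneven pieces. The row count and worker count below are made up.

```python
import numpy as np

# Hypothetical training set with 10 rows, to be shared by 4 workers
rows = np.arange(10)
size_of_workers = 4

# np.array_split allows uneven pieces, so no row is dropped or duplicated
slices = np.array_split(rows, size_of_workers)

for worker, chunk in enumerate(slices):
    print(worker, chunk)
```

Here the first two workers receive three rows each and the last two receive two rows each, and concatenating the slices gives back the original ten rows.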

In [30]:
data_test = pd.concat([features, target], axis=1)

In [31]:
def dividing_data(x_train, y_train, size_of_workers):
    # Size of the slice
    slice_for_each_worker = int(Decimal(x_train.shape[0]/size_of_workers).quantize(Decimal('1.'), rounding = ROUND_HALF_UP))
    print('Slice of data for each worker: {}'.format(slice_for_each_worker))
    # YOUR CODE HERE
    data = pd.concat([x_train, y_train], axis=1)
    data_for_worker = []
    for i in range(size_of_workers):
        data_for_worker.append(data[i*slice_for_each_worker:(i+1)*slice_for_each_worker])
    return data_for_worker

In [36]:
data_for_worker = dividing_data(features,target,2)

Slice of data for each worker: 4784


#### Exercise 15: Prepare the data in root worker to assign data for all the workers (1 point)

- When it is the root worker, perform the below operation:
    - Store the features and target values in separate variables
    - Split the data into train and test sets using the train_test_split function defined above
    - Divide the data among the workers using the dividing_data function above

In [37]:
# YOUR CODE HERE
if rank == 0:
    features, target = feature_target(df)
    X_train, X_test, Y_train, Y_test = data_split(features, target)
    data_for_worker = dividing_data(X_train, Y_train, size)

Slice of data for each worker: 6697


#### Exercise 16: Scatter and gather the data (1 point)

Perform the below operations:

- Send slices of the training set (the features data X and the expected target data Y) to every worker, including the root worker
    - **Hint:** scatter()
    - use `barrier()` to block workers until all workers in the group reach the barrier before scattering from the root worker.
- Every worker should compute the predicted target Y (yhat) for its slice
- Estimate the coefficients on each slice
    - **Hint:** fit function defined above
- Gather the coefficients from each worker
    - **Hint:** gather()
    - Take the mean of the gathered coefficients
- Calculate the root mean square error for the test set

To know more about `scatter`, `gather` and `barrier` click [here](https://nyu-cds.github.io/python-mpi/05-collectives/)
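
The scatter, fit, gather and average pattern described above can be rehearsed in a single process before involving MPI. The sketch below is only a simulation: plain Python lists stand in for what `comm.scatter` and `comm.gather` would move between processes, the data is made up and noiseless, and `fit_slice` is a hypothetical helper that uses `np.linalg.lstsq` rather than the `fit` function from Exercise 8.

```python
import numpy as np

def fit_slice(X, y):
    """Least-squares fit with an intercept column; returns (intercept, coefficients)."""
    design = np.column_stack((np.ones(len(X)), X))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

# Hypothetical noiseless data: y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

n_workers = 4  # stands in for comm.Get_size()

# "Scatter": one slice of rows per simulated worker
X_slices = np.array_split(X, n_workers)
y_slices = np.array_split(y, n_workers)

# Each simulated worker fits its own slice
local_fits = [fit_slice(Xs, ys) for Xs, ys in zip(X_slices, y_slices)]

# "Gather": collect the per-worker estimates and average them
mean_intercept = np.mean([b0 for b0, _ in local_fits])
mean_coefficients = np.mean([b for _, b in local_fits], axis=0)

print(mean_intercept, mean_coefficients)
```

Because this toy data has no noise, every slice recovers the same coefficients and the average matches them exactly. On real data the averaged per-slice estimates are generally close to, but not identical to, a fit on the full training set.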

In [58]:
# YOUR CODE HERE
# Only the root holds the list of slices; every other rank passes None to scatter
if rank != 0:
    data_for_worker = None
data_for_workers = comm.scatter(data_for_worker, root=0)
comm.barrier()
print('Rank: ', rank, ', received data: ' , data_for_workers, '\n')
intercept, coefficients = fit(data_for_workers.iloc[:,:data_for_workers.shape[1]-1], data_for_workers.iloc[:,-1])
comm.barrier()
predictions = predict(data_for_workers.iloc[:,:data_for_workers.shape[1]-1], intercept, coefficients)
comm.barrier()
print(predictions)

# Implement gather
received_coeff = comm.gather(coefficients, root=0)
received_intercepts = comm.gather(intercept, root=0)

if rank == 0:
    print('Intercept: {}'.format(intercept))
    print('Rcvd Coefficients: {}'.format(received_coeff))
    # Average the per-worker estimates and use them to score the test set
    mean_coefficients = np.mean(received_coeff, axis=0)
    mean_intercept = np.mean(received_intercepts)
    rmse = calc_rmse(Y_test, X_test, mean_intercept, mean_coefficients)
    print('RMSE: {}'.format(rmse))

Rank:  0 , received data:              AT         V        AP        RH    target
0    -0.668400  1.324704  1.108126  0.302119  0.259858
1    -0.039078 -0.763762  0.503625 -1.894416  0.597351
2     0.072294  0.358375  0.175275 -0.114311 -0.216500
3     0.036064  0.708551  0.296512  0.643894 -0.634266
4    -0.766354 -1.026591  1.416270  0.074726  0.793051
...        ...       ...       ...       ...       ...
6692 -0.594599 -0.269581  0.156753  1.836338  0.355950
6693  1.526845  1.233422 -0.324827 -1.448533 -1.068437
6694  0.947171  1.774818 -1.751045  0.081575 -0.999298
6695 -0.940793 -1.026591  1.268092  0.708961  0.668834
6696 -0.754277 -0.405717  1.219260  0.708276  0.749692

[6697 rows x 5 columns] 

Intercept 0.0006014118810482467, Coefficients [-0.87287925 -0.17044239  0.01768675 -0.14006168]
[ 0.33553179  0.43913212 -0.10447414 ... -1.17106485  0.91990439
  0.6505081 ]
Intercept: 0.0006014118810482467
Rcvd Coefficients: [array([-0.87287925, -0.17044239,  0.0

In [47]:
data_for_workers.iloc[:,:data_for_workers.shape[1]-1]

            AT         V        AP        RH
0    -0.668400  1.324704  1.108126  0.302119
1    -0.039078 -0.763762  0.503625 -1.894416
2     0.072294  0.358375  0.175275 -0.114311
3     0.036064  0.708551  0.296512  0.643894
4    -0.766354 -1.026591  1.416270  0.074726
...        ...       ...       ...       ...
6692 -0.594599 -0.269581  0.156753  1.836338
6693  1.526845  1.233422 -0.324827 -1.448533
6694  0.947171  1.774818 -1.751045  0.081575
6695 -0.940793 -1.026591  1.268092  0.708961


#### Exercise 17: Make a script and execute everything in one place (1 point)

Write a script (.py) file that contains the code from all the above exercises so that you can run it on multiple processes using MPI.

**Hint:**

- magic commands
- put MPI related code under main function
- !mpirun --allow-run-as-root -np 4 python filename.py

In [53]:
%%writefile LinearRegressionMPI.py
# YOUR CODE HERE for script (.py)
from mpi4py import MPI # Importing the MPI module from the mpi4py package
import numpy as np
import pandas as pd
from decimal import Decimal, ROUND_HALF_UP # Importing the Decimal class and the ROUND_HALF_UP rounding mode from the decimal module
from math import sqrt
from sklearn.model_selection import train_test_split

def handle_missing_data(df):
    df = df.dropna()
    return df

def standardize_data(df):
    df = (df - df.mean())/df.std()
    return df

def feature_target(df):
    features = df.drop('PE', axis=1)
    target = df['PE']
    return features, target

def estimated_coefficients(features, target):
    X = features
    y = target
    transpose_of_X = X.transpose()
    product_of_X_and_transpose = transpose_of_X.dot(X)
    inverse_of_product = np.linalg.inv(product_of_X_and_transpose)
    product_of_inverse_and_transpose = inverse_of_product.dot(transpose_of_X)
    coefficients = product_of_inverse_and_transpose.dot(y)

    return coefficients

def fit(features, target):
    # YOUR CODE HERE
    dummy_col = np.ones((features.shape[0], 1))
    features = np.concatenate((dummy_col, features), axis=1)
    converted_features = np.array(features)
    coefficients = estimated_coefficients(features, target)
    intercept = coefficients[0]
    coefficients = coefficients[1:]
    print("Intercept {}, Coefficients {}".format(intercept, coefficients))
    return intercept, coefficients

def predict(features, intercept, coefficients):
    '''
    y = b_0 + b_1*x_1 + ... + b_i*x_i
    '''
    #YOUR CODE HERE
    predictions = intercept + np.dot(features, coefficients)

    return predictions

def calc_rmse(target, features, intercept, coefficients):
    rmse = sqrt(np.mean((target - predict(features, intercept, coefficients))**2))
    return round(rmse,4)

def data_split(features, target):
    # YOUR CODE HERE
    X_train, X_test, Y_train, Y_test = train_test_split(features, target, test_size=0.3, shuffle=True)
    return X_train, X_test, Y_train, Y_test

FILENAME = "/content/PowerPlantData.csv" # File path
# Defining a function to load the data
def loadData(filename):
    # Loading the dataset; the CSV header supplies the column names (AT, V, AP, RH, PE)
    data = pd.read_csv(filename)
    # Returning the dataframe
    return data

def dividing_data(x_train, y_train, size_of_workers):
    # Size of the slice
    slice_for_each_worker = int(Decimal(x_train.shape[0]/size_of_workers).quantize(Decimal('1.'), rounding = ROUND_HALF_UP))
    print('Slice of data for each worker: {}'.format(slice_for_each_worker))
    # YOUR CODE HERE
    data = pd.concat([x_train, y_train], axis=1)
    data_for_worker = []
    for i in range(size_of_workers):
        data_for_worker.append(data[i*slice_for_each_worker:(i+1)*slice_for_each_worker])
    return data_for_worker

# Defining the main function that holds all MPI-related code
def main():
    # communicator
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # rank (id) of the process running the code
    size = comm.Get_size()   # total number of processes running
    if rank == 0:
        # Only the root loads, cleans, splits and slices the data
        df = loadData(FILENAME)
        df = handle_missing_data(df)
        df = standardize_data(df)
        features, target = feature_target(df)
        X_train, X_test, Y_train, Y_test = data_split(features, target)
        data_for_worker = dividing_data(X_train, Y_train, size)
    else:
        data_for_worker = None
    comm.barrier()
    data_for_workers = comm.scatter(data_for_worker, root=0)
    # The last column is the target; all preceding columns are features
    worker_features = data_for_workers.iloc[:, :-1]
    worker_target = data_for_workers.iloc[:, -1]
    intercept, coefficients = fit(worker_features, worker_target)
    predictions = predict(worker_features, intercept, coefficients)
    # Gather the per-worker estimates on the root
    received_coeff = comm.gather(coefficients, root=0)
    received_intercepts = comm.gather(intercept, root=0)
    if rank == 0:
        # Average the gathered estimates and score the held-out test set
        mean_coefficients = np.mean(received_coeff, axis=0)
        mean_intercept = np.mean(received_intercepts)
        rmse = calc_rmse(Y_test, X_test, mean_intercept, mean_coefficients)
        print('RMSE: {}'.format(rmse))

# Calling the main function
main()

Overwriting LinearRegressionMPI.py


In [54]:
# YOUR CODE HERE for MPI command
!sudo mpirun --allow-run-as-root --oversubscribe -np 4 python LinearRegressionMPI.py

Slice of data for each worker: 1674
Intercept -0.002762110878289703, Coefficients [-0.87505491 -0.1672303   0.03677637 -0.14058572]
Intercept -0.0005470441667893432, Coefficients [-0.87812142 -0.16999168  0.01652695 -0.14883158]
Intercept 0.013929990782059358, Coefficients [-0.8592474  -0.17442962  0.0101222  -0.14043333]
Intercept -0.008673408738021329, Coefficients [-0.85701678 -0.17264999  0.02282158 -0.12158806]
RMSE: 0.2732


#### Exercise 18: Use Sklearn to compare (1 point)

Apply linear regression to the given data using the sklearn package and compare with the above results

**Hint:**
* Split the data into train and test
* Fit the train data and predict the test data using `sklearn Linear Regression`
* Compare the coefficients and intercept with above estimated coefficients
* calculate loss (RMSE) on test data and predictions and compare

In [None]:
# YOUR CODE HERE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr.fit(X_train, Y_train)
predictions_sklearn = lr.predict(X_test)
print('Intercept: {}'.format(lr.intercept_))
print('Coefficients: {}'.format(lr.coef_))
print('RMSE: {}'.format(np.sqrt(mean_squared_error(Y_test, predictions_sklearn))))

Intercept: 0.0022435038154269265
Coefficients: [-0.86011126 -0.17803842  0.01995128 -0.13558947]
RMSE: 0.27118891741380025
