
# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: Implementation of Multiple Linear Regression using MPI and OpenMP

## Learning Objectives

At the end of the mini-project, you will be able to:

* understand collective communication operations such as scatter, gather and broadcast
* understand blocking and non-blocking communication
* implement multiple linear regression and run it using MPI
* implement multiple linear regression based predictions using OpenMP

### Dataset

The dataset chosen for this mini-project is [Combined Cycle Power Plant](https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant). It consists of 9568 records and 5 columns. Each record contains the values for Ambient Temperature, Exhaust Vacuum, Ambient Pressure, Relative Humidity and Energy Output.

Predicting the full-load electrical power output of a base-load power plant is important for maximizing the profit from the available megawatt hours. The base-load operation of a power plant is influenced by four main parameters, which serve as the input variables in this dataset: ambient temperature, atmospheric pressure, relative humidity, and exhaust steam pressure. These parameters affect the electrical power output, which is the target variable.

**Note:** The data was collected over a six-year period (2006-2011).

## Information

#### MPI in a Nutshell

MPI stands for "Message Passing Interface". It is a standardized interface, available as a library of functions (in C, or in Python through bindings such as mpi4py) or subroutines (in Fortran), that you call from source code to perform data communication between processes. MPI was developed over two years of discussions led by the MPI Forum, a group of roughly sixty people representing some forty organizations.

To learn more about MPI, see [this tutorial](https://hpc-tutorials.llnl.gov/mpi/).


#### Multiple Linear Regression

Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).
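
In symbols, a model with $p$ independent variables can be written as below. The notation ($\beta_j$ for the coefficients, $\varepsilon$ for the error term, $p$ for the number of predictors) is introduced here only for illustration; for this dataset $p = 4$.

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon $$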

**Note:** We will be using the mpi4py Python package for the MPI-based code implementation.

## Grading = 20 Points

**Run the below code to install mpi4py package**

In [1]:
!pip install mpi4py

Collecting mpi4py
  Downloading mpi4py-4.0.0.tar.gz (464 kB)
     464.8/464.8 kB 2.6 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: mpi4py
  Building wheel for mpi4py (pyproject.toml) ... done
  Created wheel for mpi4py: filename=mpi4py-4.0.0-cp310-cp310-linux_x86_64.whl size=4266268 sha256=373ecdd98ca9e0a9f4e9fba480ffa9516fcc485f87d8a1011a27349ac4918d83
  Stored in directory: /root/.cache/pip/wheels/96/17/12/83db63ee0ae5c4b040ee87f2e5c813aea4728b55ec6a37317c
Successfully built mpi4py
Installing collected packages: mpi4py
Successfully installed mpi4py-4.0.0


#### Importing Necessary Packages

In [2]:
# Importing pandas
import pandas as pd
# Importing Numpy
import numpy as np
# Importing MPI from mpi4py package
from mpi4py import MPI
# Importing the sqrt function from the math module
from math import sqrt
# Importing Decimal and the ROUND_HALF_UP rounding mode from the decimal module
from decimal import Decimal, ROUND_HALF_UP
import time

#### Downloading the data

In [5]:
#@title Download the data
!wget -qq https://cdn.iisc.talentsprint.com/CDS/Datasets/PowerPlantData.csv

### Overview

* Load the data and perform data pre-processing
* Identify the features and target, and split the data into train and test sets
* Implement multiple linear regression by estimating the coefficients on the given data
* Use the MPI package to distribute the data and set up a `communicator`
* Define functions for each objective and create a script (.py) file to execute with the MPI command
* Use the OpenMP component to predict on the data and calculate the error of the predictions
* Fit the Linear Regression from `sklearn` and compare the results

#### Exercise 1: Load data (1 point)

Write a function that takes the filename as input and loads the data in a pandas dataframe with the column names as Ambient Temperature, Exhaust Vaccum, Ambient Pressure, Relative Humidity and Energy Output respectively.

**Hint:** read_csv()


In [6]:
FILENAME = "/content/PowerPlantData.csv" # File path

# YOUR CODE HERE to Define a function to load the data

In [7]:
import pandas as pd

def load_data(filename):
    """
    This function loads the data from a CSV file into a pandas DataFrame.

    Parameters:
    filename (str): The path to the CSV file.

    Returns:
    pandas.DataFrame: A DataFrame containing the data with the specified column names.
    """
    # Define the column names
    column_names = ['Ambient Temperature', 'Exhaust Vaccum', 'Ambient Pressure', 'Relative Humidity', 'Energy Output']

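    # Load the data into a DataFrame with the specified column names.
    # Note: header=None treats the file's own header line (AT, V, AP, RH, PE)
    # as a data row, so it appears as row 0 of the result.
    data = pd.read_csv(filename, header=None, names=column_names)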

    return data

# Now call the function after it's been defined
FILENAME = "/content/PowerPlantData.csv"  # Update this path if needed
data = load_data(FILENAME)
print(data.head())  # Display the first few rows


  Ambient Temperature Exhaust Vaccum Ambient Pressure Relative Humidity  \
0                  AT              V               AP                RH   
1                8.34          40.77          1010.84             90.01   
2               23.64          58.49           1011.4              74.2   
3               29.74           56.9          1007.15             41.91   
4               19.07          49.69          1007.22             76.79   

  Energy Output  
0            PE  
1        480.48  
2        445.75  
3        438.76  
4        453.09  
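
Row 0 of the output above holds the file's own header (`AT`, `V`, `AP`, `RH`, `PE`), because `header=None` tells pandas to treat every line as data. A minimal sketch of a variant that skips that line with `header=0` follows. The function name `load_data_skip_header` and the in-memory sample are illustrative assumptions, not part of the original notebook; the sample rows are copied from the output above.

```python
import io
import pandas as pd

COLUMN_NAMES = ['Ambient Temperature', 'Exhaust Vaccum', 'Ambient Pressure',
                'Relative Humidity', 'Energy Output']

def load_data_skip_header(source):
    """Read the CSV, replacing its first line (the original header) with COLUMN_NAMES."""
    # header=0 marks line 0 as the header; names=... overrides it with our own labels
    return pd.read_csv(source, header=0, names=COLUMN_NAMES)

# Small in-memory sample (stands in for the real file)
sample_csv = io.StringIO(
    "AT,V,AP,RH,PE\n"
    "8.34,40.77,1010.84,90.01,480.48\n"
    "23.64,58.49,1011.4,74.2,445.75\n"
)
sample = load_data_skip_header(sample_csv)
print(sample.dtypes)   # all columns are parsed as floats
print(sample.shape)    # two data rows, five columns
```

With the header line consumed as a header, the columns are numeric from the start, so no later conversion step is needed to turn the stray text row into `NaN`.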


#### Exercise 2: Explore data (1 point)

Write a function that takes the data loaded by the function defined above as input and explores it.

**Hint:** You can check for the following things in the dataset inside the function:

- checking for the number of rows and columns
- summary of the dataset
- check for the null values
- check for the duplicate values

In [8]:
# YOUR CODE HERE

In [9]:
import pandas as pd

def explore_data(data):
    """
    This function takes a pandas DataFrame as input and provides an exploration summary
    including the number of rows and columns, dataset summary, and checks for null and duplicate values.

    Parameters:
    data (pandas.DataFrame): The input DataFrame to explore.

    Returns:
    None: The function prints the exploration details.
    """
    # Checking the number of rows and columns
    print("Number of rows and columns:", data.shape)

    # Summary of the dataset
    print("\nSummary of the dataset:")
    print(data.describe())

    # Check for null values
    print("\nChecking for null values:")
    print(data.isnull().sum())

    # Check for duplicate values
    duplicate_count = data.duplicated().sum()
    print("\nNumber of duplicate rows:", duplicate_count)

    if duplicate_count > 0:
        print("There are duplicate rows in the dataset.")
    else:
        print("There are no duplicate rows in the dataset.")


In [10]:
# Load the data
FILENAME = "/content/PowerPlantData.csv"  # Update this path if needed
data = load_data(FILENAME)

# Explore the loaded data
explore_data(data)


Number of rows and columns: (9569, 5)

Summary of the dataset:
       Ambient Temperature Exhaust Vaccum Ambient Pressure Relative Humidity  \
count                 9569           9569             9569              9569   
unique                2774            635             2518              4547   
top                  25.21          70.32          1013.88            100.09   
freq                    14             61               16                26   

       Energy Output  
count           9569  
unique          4837  
top            468.8  
freq               9  

Checking for null values:
Ambient Temperature    0
Exhaust Vaccum         0
Ambient Pressure       0
Relative Humidity      0
Energy Output          0
dtype: int64

Number of duplicate rows: 41
There are duplicate rows in the dataset.
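
The summary above lists `count`, `unique`, `top` and `freq` rather than means and quartiles. That is what `describe()` reports for non-numeric columns, and here every column is non-numeric because the header line was read as a data row. The sketch below, on a made-up single-column frame, shows how `pd.to_numeric(..., errors='coerce')` restores the numeric summary and exposes the text entry as a null.

```python
import pandas as pd

# Made-up column: one text entry makes the whole column non-numeric
raw = pd.DataFrame({'Energy Output': ['PE', '480.48', '445.75', '438.76']})
print(raw.describe().index.tolist())      # count, unique, top, freq

# Coerce to numbers; entries that cannot be parsed become NaN
numeric = raw.apply(pd.to_numeric, errors='coerce')
print(numeric.describe().index.tolist())  # count, mean, std, min, quartiles, max
print(numeric.isnull().sum())             # the text entry now counts as a null
```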


#### Exercise 3: Handle missing data (1 point)

After exploring the dataset, if any null values are present, define a function that takes the data loaded by the function defined above as input and handles the null values accordingly.

**Hint:**

- Drop the records containing the null values - dropna()
- Replace the null values with the mean/median/mode - fillna()

In [None]:
# Function to handle missing data

# YOUR CODE HERE

In [11]:
def handle_missing_data(data, method='drop', fill_value=None):
    """
    This function handles missing data in the DataFrame according to the specified method.

    Parameters:
    data (pandas.DataFrame): The input DataFrame to handle missing values.
    method (str): The method to handle missing values. Options are:
                  'drop' - Drop rows with missing values.
                  'mean' - Replace missing values with the mean of the column.
                  'median' - Replace missing values with the median of the column.
                  'mode' - Replace missing values with the mode of the column.
                  'custom' - Replace missing values with fill_value.
    fill_value: The value used to fill missing data; required when method is 'custom'.

    Returns:
    pandas.DataFrame: A DataFrame with missing data handled.
    """
    # Convert columns to numeric, coerce errors to NaN
    data = data.apply(pd.to_numeric, errors='coerce')

    if method == 'drop':
        # Drop rows with missing values
        data = data.dropna()
    elif method == 'mean':
        # Replace missing values with the mean of each column
        data = data.fillna(data.mean())
    elif method == 'median':
        # Replace missing values with the median of each column
        data = data.fillna(data.median())
    elif method == 'mode':
        # Replace missing values with the mode of each column
        data = data.fillna(data.mode().iloc[0])
    elif method == 'custom' and fill_value is not None:
        # Replace missing values with a specific value
        data = data.fillna(fill_value)
    else:
        raise ValueError("Invalid method or fill_value. Please choose from 'drop', 'mean', 'median', 'mode', or 'custom' with a fill value.")

    return data


In [12]:
# Load the data
FILENAME = "/content/PowerPlantData.csv"  # Update this path if needed
data = load_data(FILENAME)

# Explore the loaded data
explore_data(data)

# Handle missing data (e.g., replacing with the mean)
data_handled = handle_missing_data(data, method='mean')

# Explore the data again to ensure missing data is handled
explore_data(data_handled)


Number of rows and columns: (9569, 5)

Summary of the dataset:
       Ambient Temperature Exhaust Vaccum Ambient Pressure Relative Humidity  \
count                 9569           9569             9569              9569   
unique                2774            635             2518              4547   
top                  25.21          70.32          1013.88            100.09   
freq                    14             61               16                26   

       Energy Output  
count           9569  
unique          4837  
top            468.8  
freq               9  

Checking for null values:
Ambient Temperature    0
Exhaust Vaccum         0
Ambient Pressure       0
Relative Humidity      0
Energy Output          0
dtype: int64

Number of duplicate rows: 41
There are duplicate rows in the dataset.
Number of rows and columns: (9569, 5)

Summary of the dataset:
       Ambient Temperature  Exhaust Vaccum  Ambient Pressure  \
count          9569.000000     9569.000000       9569.0000

#### Exercise 4: Scale the data (1 point)

Write a function that takes the data after handling the missing data as input and returns the standardized data.

**Hint:**

- standardization of the data can be performed using the formula below

$ (x - mean(x)) / std(x) $
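
As a quick worked example of the formula above, the sketch below standardizes a made-up column. Note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default, whereas `numpy.std` defaults to the population version (`ddof=0`); the values are illustrative only.

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 8.0])   # made-up values

# pandas: sample standard deviation (ddof=1) by default
z = (x - x.mean()) / x.std()
print(z.round(4).tolist())

# The standardized column has mean 0 and sample standard deviation 1
print(round(z.mean(), 10), round(z.std(), 10))

# numpy's default is the population standard deviation (ddof=0)
print(np.std(x.to_numpy()), np.std(x.to_numpy(), ddof=1))
```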

In [None]:
# Defining a function to standardize the data

# YOUR CODE HERE

In [13]:
def standardize_data(data):
    """
    This function takes a pandas DataFrame as input and returns the standardized data.
    The standardization is performed using the formula: (x - mean(x)) / std(x).

    Parameters:
    data (pandas.DataFrame): The input DataFrame to standardize.

    Returns:
    pandas.DataFrame: A DataFrame with standardized data.
    """
    # Standardizing the data
    standardized_data = (data - data.mean()) / data.std()

    return standardized_data


In [14]:
# Load the data
FILENAME = "/content/PowerPlantData.csv"  # Update this path if needed
data = load_data(FILENAME)

# Handle missing data
data = handle_missing_data(data, method='mean')

# Standardize the data
standardized_data = standardize_data(data)

# Display the first few rows of the standardized data
print(standardized_data.head())


   Ambient Temperature  Exhaust Vaccum  Ambient Pressure  Relative Humidity  \
0        -4.767410e-16    5.591642e-16     -1.914412e-14           0.000000   
1        -1.517862e+00   -1.065205e+00     -4.073569e-01           1.143944   
2         5.352555e-01    3.292768e-01     -3.130566e-01           0.061031   
3         1.353818e+00    2.041512e-01     -1.028729e+00          -2.150688   
4        -7.799579e-02   -3.632424e-01     -1.016941e+00           0.238434   

   Energy Output  
0       0.000000  
1       1.530226  
2      -0.504802  
3      -0.914386  
4      -0.074710  


#### Exercise 5: Feature selection (1 point)

Write a function that takes the scaled data as input and returns the features and target variable values.

**Hint:**

- Features: AmbientTemperature, ExhaustVaccum, AmbientPressure, RelativeHumidity
- Target Variable: EnergyOutput

In [None]:
# Define a function

# YOUR CODE HERE

In [15]:
def feature_selection(data):
    """
    This function takes the standardized data as input and returns the features and target variable values.

    Parameters:
    data (pandas.DataFrame): The input DataFrame containing the standardized data.

    Returns:
    X (pandas.DataFrame): The DataFrame containing the feature variables.
    y (pandas.Series): The Series containing the target variable.
    """
    # Selecting the features
    X = data[['Ambient Temperature', 'Exhaust Vaccum', 'Ambient Pressure', 'Relative Humidity']]

    # Selecting the target variable
    y = data['Energy Output']

    return X, y


In [16]:
# Load the data
FILENAME = "/content/PowerPlantData.csv"  # Update this path if needed
data = load_data(FILENAME)

# Handle missing data
data = handle_missing_data(data, method='mean')

# Standardize the data
standardized_data = standardize_data(data)

# Perform feature selection
X, y = feature_selection(standardized_data)

# Display the first few rows of features and target variable
print(X.head(), y.head())


   Ambient Temperature  Exhaust Vaccum  Ambient Pressure  Relative Humidity
0        -4.767410e-16    5.591642e-16     -1.914412e-14           0.000000
1        -1.517862e+00   -1.065205e+00     -4.073569e-01           1.143944
2         5.352555e-01    3.292768e-01     -3.130566e-01           0.061031
3         1.353818e+00    2.041512e-01     -1.028729e+00          -2.150688
4        -7.799579e-02   -3.632424e-01     -1.016941e+00           0.238434 0    0.000000
1    1.530226
2   -0.504802
3   -0.914386
4   -0.074710
Name: Energy Output, dtype: float64


#### Exercise 6: Correlation (1 point)

Calculate the correlation between the variables.

In [None]:
# YOUR CODE HERE

In [20]:
def calculate_correlation(data):
    """
    This function takes a pandas DataFrame as input and calculates the correlation between variables.

    Parameters:
    data (pandas.DataFrame): The input DataFrame containing the standardized data.

    Returns:
    pandas.DataFrame: A DataFrame showing the correlation matrix of the variables.
    """
    # Calculate the correlation matrix
    correlation_matrix = data.corr()

    return correlation_matrix


In [21]:
# Load the data
FILENAME = "/content/PowerPlantData.csv"  # Update this path if needed
data = load_data(FILENAME)

# Handle missing data
data = handle_missing_data(data, method='mean')

# Standardize the data
standardized_data = standardize_data(data)

# Calculate correlation
correlation_matrix = calculate_correlation(standardized_data)

# Display the correlation matrix
print(correlation_matrix)


                     Ambient Temperature  Exhaust Vaccum  Ambient Pressure  \
Ambient Temperature             1.000000        0.844107         -0.507549   
Exhaust Vaccum                  0.844107        1.000000         -0.413502   
Ambient Pressure               -0.507549       -0.413502          1.000000   
Relative Humidity              -0.542535       -0.312187          0.099574   
Energy Output                  -0.948128       -0.869780          0.518429   

                     Relative Humidity  Energy Output  
Ambient Temperature          -0.542535      -0.948128  
Exhaust Vaccum               -0.312187      -0.869780  
Ambient Pressure              0.099574       0.518429  
Relative Humidity             1.000000       0.389794  
Energy Output                 0.389794       1.000000  


#### Exercise 7: Estimate the coefficients (2 points)

Write a function that takes features and target as input and returns the estimated coefficient values

**Hint:**

- Calculate the estimated coefficients using the below formula

$ \beta = (X^T X)^{-1} X^T y $

- transpose(), np.linalg.inv()
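
The formula above can be tried on synthetic data first. In the sketch below the coefficients are made up so the result can be checked by eye. The `np.linalg.solve` variant is an alternative that this notebook does not use: it solves the same normal equations without forming the explicit inverse, which is generally cheaper and more accurate in floating point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 rows, 2 features, known coefficients (illustrative values)
X = rng.normal(size=(100, 2))
true_beta = np.array([1.0, 2.0, -3.0])          # intercept, b1, b2
X1 = np.c_[np.ones(X.shape[0]), X]              # prepend a column of ones
y = X1 @ true_beta                              # noise-free target

# Normal equations with an explicit inverse, as in the hint
beta_inv = np.linalg.inv(X1.T @ X1) @ X1.T @ y

# Same system solved directly: (X^T X) beta = X^T y
beta_solve = np.linalg.solve(X1.T @ X1, X1.T @ y)

print(beta_inv)
print(beta_solve)
```

Because the target is noise-free, both estimates should recover `true_beta` up to rounding error.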

In [None]:
# Calculating the coefficients

# YOUR CODE HERE

In [22]:
import numpy as np

def estimate_coefficients(X, y):
    """
    This function takes the features and target as input and returns the estimated coefficient values.

    Parameters:
    X (pandas.DataFrame): The input DataFrame containing the feature variables.
    y (pandas.Series): The input Series containing the target variable.

    Returns:
    numpy.ndarray: An array of estimated coefficient values.
    """
    # Add a column of ones to X for the intercept term
    X_with_intercept = np.c_[np.ones(X.shape[0]), X]

    # Convert to numpy arrays
    X_np = np.array(X_with_intercept)
    y_np = np.array(y).reshape(-1, 1)

    # Calculate the coefficients using the formula β = (X^T X)^(-1) X^T y
    X_transpose_X = np.dot(X_np.T, X_np)
    X_transpose_X_inv = np.linalg.inv(X_transpose_X)
    X_transpose_y = np.dot(X_np.T, y_np)

    coefficients = np.dot(X_transpose_X_inv, X_transpose_y)

    return coefficients.flatten()


In [23]:
# Load the data
FILENAME = "/content/PowerPlantData.csv"  # Update this path if needed
data = load_data(FILENAME)

# Handle missing data
data = handle_missing_data(data, method='mean')

# Standardize the data
standardized_data = standardize_data(data)

# Perform feature selection
X, y = feature_selection(standardized_data)

# Estimate the coefficients
coefficients = estimate_coefficients(X, y)

# Display the estimated coefficients
print("Estimated Coefficients:", coefficients)


Estimated Coefficients: [-1.49677674e-15 -8.63500780e-01 -1.74171544e-01  2.16029345e-02
 -1.35210234e-01]


#### Exercise 8: Fit the data to estimate the coefficients (2 points)

Write a function named fit which takes features and targets as input and returns the intercept and coefficient values.

**Hint:**

- create a dummy column in the features dataframe which is made up of all ones
- convert the features dataframe into numpy array
- call the estimated coefficients function which is defined above
- np.ones(), np.concatenate()

In [24]:
# defining a fit function
def fit(x, y):
    # YOUR CODE HERE

SyntaxError: incomplete input (<ipython-input-24-066be1772e5a>, line 3)

In [25]:
import numpy as np

def fit(X, y):
    """
    This function takes the features and target as input and returns the intercept and coefficient values
    for a linear regression model.

    Parameters:
    X (pandas.DataFrame or numpy.ndarray): The input DataFrame or array containing the feature variables.
    y (pandas.Series or numpy.ndarray): The input Series or array containing the target variable.

    Returns:
    tuple: A tuple containing the intercept and an array of coefficients.
    """
    # Add a column of ones to X for the intercept term
    X_with_intercept = np.c_[np.ones(X.shape[0]), X]

    # Convert to numpy arrays if needed
    X_np = np.array(X_with_intercept)
    y_np = np.array(y).reshape(-1, 1)

    # Estimate coefficients using the formula β = (X^T X)^(-1) X^T y
    X_transpose_X = np.dot(X_np.T, X_np)
    X_transpose_X_inv = np.linalg.inv(X_transpose_X)
    X_transpose_y = np.dot(X_np.T, y_np)

    coefficients = np.dot(X_transpose_X_inv, X_transpose_y)

    # The first coefficient is the intercept
    intercept = coefficients[0]

    # The remaining coefficients are for the features
    feature_coefficients = coefficients[1:].flatten()

    return intercept[0], feature_coefficients

# Example usage (assuming you have standardized data and selected features):
# intercept, coefficients = fit(X, y)
# print("Intercept:", intercept)
# print("Coefficients:", coefficients)


#### Exercise 9: Predict the data on estimated coefficients (1 point)

Write a function named predict which takes features, intercept and coefficient values as input and returns the predicted values.

**Hint:**

- Substitute the intercept and coefficient values into the equation below

  $y = b_0 + b_1*x_1 + ... + b_i*x_i$

In [26]:
# function to predict the values
def predict(x, intercept, coefficients):
    '''
    y = b_0 + b_1*x_1 + ... + b_i*x_i
    '''
    #YOUR CODE HERE

    return predictions

In [27]:
import numpy as np
import pandas as pd

def predict(X, intercept, coefficients):
    """
    This function predicts the target values for the given input features using the provided intercept and coefficients.

    Parameters:
    X (pandas.DataFrame or numpy.ndarray): The input DataFrame or array containing the feature variables.
    intercept (float): The intercept value of the regression model.
    coefficients (numpy.ndarray): The array of coefficient values for the regression model.

    Returns:
    numpy.ndarray: An array of predicted values.
    """
    # Convert X to numpy array if it's a DataFrame
    if isinstance(X, pd.DataFrame):
        X = X.values

    # Calculate predictions using the formula: y = intercept + (coefficients * X)
    predictions = intercept + np.dot(X, coefficients)

    return predictions


The pipeline up to this point consists of the following steps:

1. Loading the data.
2. Handling missing data.
3. Standardizing the data.
4. Performing feature selection.
5. Splitting the data into training and test sets.

In [28]:
def handle_missing_data(data, method='mean'):
    # Convert all columns to numeric, forcing non-numeric values to NaN
    data = data.apply(pd.to_numeric, errors='coerce')

    if method == 'mean':
        # Replace missing values with column means
        return data.fillna(data.mean())
    elif method == 'drop':
        # Drop rows with missing values
        return data.dropna()
    else:
        raise ValueError("Invalid method. Use 'mean' or 'drop'.")


In [29]:
import numpy as np

def train_test_split(X, y, test_size=0.3, random_state=42):
    """
    Split the dataset into training and test sets.

    Parameters:
    X (pandas.DataFrame or numpy.ndarray): Feature variables.
    y (pandas.Series or numpy.ndarray): Target variable.
    test_size (float): Proportion of the dataset to include in the test split.
    random_state (int): Seed for reproducibility.

    Returns:
    X_train, X_test, y_train, y_test: Split datasets.
    """
    # Set random seed for reproducibility
    np.random.seed(random_state)

    # Convert to NumPy arrays if they're pandas DataFrame or Series
    if isinstance(X, pd.DataFrame):
        X = X.values
    if isinstance(y, pd.Series):
        y = y.values

    # Concatenate X and y to shuffle them together
    data = np.concatenate((X, y.reshape(-1, 1)), axis=1)

    # Shuffle the data
    np.random.shuffle(data)

    # Split the data into training and test sets
    split_index = int((1 - test_size) * len(data))
    train_data = data[:split_index]
    test_data = data[split_index:]

    # Separate features and target
    X_train, y_train = train_data[:, :-1], train_data[:, -1]
    X_test, y_test = test_data[:, :-1], test_data[:, -1]

    return X_train, X_test, y_train, y_test


In [30]:
# Load the data
FILENAME = "/content/PowerPlantData.csv"  # Ensure the file path is correct
data = load_data(FILENAME)

# Clean and handle missing data
data = handle_missing_data(data, method='mean')

# Standardize the data
standardized_data = standardize_data(data)

# Perform feature selection
X, y = feature_selection(standardized_data)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the model to estimate coefficients
intercept, coefficients = fit(X_train, y_train)

# Predict using test data
predicted_values = predict(X_test, intercept, coefficients)

# Display the first few predicted values
print(predicted_values[:5])


[-0.50300231 -1.11995819 -1.20769725 -0.20023076 -1.20770031]


#### Exercise 10: Root mean squared error (1 point)

Write a function to calculate the root mean squared error (RMSE).

**Hint:**

- [How to calculate the RMSE](https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e)
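
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. A tiny worked example with made-up numbers:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])   # made-up actual values
y_pred = np.array([1.0, 2.0, 5.0])   # made-up predictions

errors = y_true - y_pred             # [0, 0, -2]
mse = np.mean(errors ** 2)           # (0 + 0 + 4) / 3
rmse = np.sqrt(mse)
print(rmse)                          # sqrt(4/3), about 1.1547
```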

In [None]:
# Define a function to calculate the error

# YOUR CODE HERE

In [31]:
import numpy as np

def calculate_rmse(y_true, y_pred):
    """
    Calculate the Root Mean Squared Error (RMSE) between the actual and predicted values.

    Parameters:
    y_true (numpy.ndarray or pandas.Series): The actual target values.
    y_pred (numpy.ndarray or pandas.Series): The predicted target values.

    Returns:
    float: The calculated RMSE value.
    """
    # Convert to numpy arrays if inputs are pandas Series
    if isinstance(y_true, pd.Series):
        y_true = y_true.values
    if isinstance(y_pred, pd.Series):
        y_pred = y_pred.values

    # Calculate RMSE using the formula: sqrt(mean((y_true - y_pred)^2))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

    return rmse


In [32]:
# Assume you have already made predictions with your test data
# predicted_values = predict(X_test, intercept, coefficients)

# Now, calculate the RMSE
rmse = calculate_rmse(y_test, predicted_values)

# Display the RMSE
print("Root Mean Squared Error (RMSE):", rmse)


Root Mean Squared Error (RMSE): 0.2693686279036376


#### Exercise 11: Split the data into train and test (1 point)

Write a function named train_test_split which takes features and targets as input and returns the train and test sets respectively.

**Hint:**

- Shuffle the data
- Consider 70 % of data as a train set and the rest of the data as a test set
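
One way to follow the hint is to shuffle row indices rather than the rows themselves, then cut the index array at 70%. This is a sketch under assumptions: the function name `split_by_index` and the use of `np.random.default_rng` are illustrative choices, whereas the `train_test_split` defined earlier in this notebook shuffles a concatenated array instead.

```python
import numpy as np

def split_by_index(X, y, test_size=0.3, seed=42):
    """Shuffle row indices and split features/target into train and test parts."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)          # local generator, no global state
    order = rng.permutation(len(X))            # shuffled row indices
    n_test = int(round(test_size * len(X)))    # number of rows held out for testing
    cut = len(X) - n_test                      # remaining rows go to training
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up data: 10 rows, 2 features; row i is [2i, 2i + 1] with target i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr, X_te, y_tr, y_te = split_by_index(X_demo, y_demo)
print(len(X_tr), len(X_te))
```

Indexing `X` and `y` with the same shuffled indices keeps each feature row paired with its target value.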

In [None]:
    # YOUR CODE HERE

In [33]:
# Assuming X and y are already defined (from feature selection)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the sizes of the training and testing sets
print("Training set size:", len(X_train))
print("Test set size:", len(X_test))


Training set size: 6698
Test set size: 2871


#### Exercise 12: Implement predict using OpenMP (1 point)

Get the predictions for the test data and calculate the test error (RMSE) using an OpenMP-style parallel loop (pymp).

**Hints:**

* Using pymp.Parallel, implement the predict function (reuse the one above)

* Call the predict function, passing the test data as an argument

* Calculate the error (RMSE) by comparing the actual and predicted test values

In [34]:
!pip install pymp-pypi

Collecting pymp-pypi
  Downloading pymp-pypi-0.5.0.tar.gz (12 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pymp-pypi
  Building wheel for pymp-pypi (setup.py) ... done
  Created wheel for pymp-pypi: filename=pymp_pypi-0.5.0-py3-none-any.whl size=10314 sha256=589a3f89320c0e0bf6b6357baf74fe3f0d67259c53fa5f4ed93f4db658cb79ab
  Stored in directory: /root/.cache/pip/wheels/5e/db/4b/4c02f5b91b1abcde14433d1b336ac00a09761383e7bb1013cf
Successfully built pymp-pypi
Installing collected packages: pymp-pypi
Successfully installed pymp-pypi-0.5.0


In [35]:
import pymp
# YOUR CODE HERE

In [36]:
import numpy as np
import pymp  # Ensure that pymp is installed in your environment

def parallel_predict(X, intercept, coefficients, num_threads=4):
    """
    This function predicts values for input features using parallel processing with pymp.
    """
    X = np.asarray(X)  # accept a DataFrame as well; X[i] below needs positional row indexing
    predictions = pymp.shared.array((X.shape[0],), dtype='float')

    with pymp.Parallel(num_threads) as p:
        for i in p.range(X.shape[0]):
            predictions[i] = intercept + np.dot(X[i], coefficients)

    return np.array(predictions)

def calculate_rmse(y_true, y_pred):
    """
    Calculate the Root Mean Squared Error (RMSE) between actual and predicted values.
    """
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


In [37]:
# Assuming you have your X_train, X_test, y_train, and y_test ready

# Fit your model to get intercept and coefficients
intercept, coefficients = fit(X_train, y_train)  # Ensure the fit function is defined

# Predict using the parallel_predict function
predicted_test_values = parallel_predict(X_test, intercept, coefficients, num_threads=4)

# Calculate the RMSE for the test set
rmse = calculate_rmse(y_test, predicted_test_values)

# Display the RMSE
print("Test RMSE using parallel prediction:", rmse)


Test RMSE using parallel prediction: 0.2693686279036376


#### Exercise 13: Create a communicator (1 point)

Create a communicator and define the rank and size.

In [54]:
# YOUR CODE HERE

In [38]:
!apt-get install -y mpich
!pip install mpi4py



Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  hwloc-nox libmpich-dev libmpich12 libslurm37
Suggested packages:
  mpich-doc
The following NEW packages will be installed:
  hwloc-nox libmpich-dev libmpich12 libslurm37 mpich
0 upgraded, 5 newly installed, 0 to remove and 49 not upgraded.
Need to get 14.2 MB of archives.
After this operation, 102 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libslurm37 amd64 21.08.5-2ubuntu1 [542 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 hwloc-nox amd64 2.7.0-2ubuntu1 [205 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libmpich12 amd64 4.0-3 [5,866 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/universe amd64 mpich amd64 4.0-3 [197 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libmpich-dev amd64 4.0-3 [7,375 kB]
Fetched 14.2 MB in 

In [39]:
from mpi4py import MPI

# Create a communicator
comm = MPI.COMM_WORLD

# Get the rank of the current process
rank = comm.Get_rank()

# Get the total number of processes in the communicator
size = comm.Get_size()

# Output the rank and size for this process
print(f"Process rank: {rank}, Total number of processes: {size}")


Process rank: 0, Total number of processes: 1


Alternative: a parallel loop with pymp

In [40]:
!pip install pymp-pypi




In [41]:
import pymp

# Parallel execution using pymp
with pymp.Parallel(4) as p:
    for i in p.range(10):
        print(f"Hello from thread {p.thread_num}, iteration {i}")


Hello from thread 1, iteration 3
Hello from thread 1, iteration 4
Hello from thread 2, iteration 6
Hello from thread 3, iteration 8
Hello from thread 2, iteration 7
Hello from thread 1, iteration 5
Hello from thread 3, iteration 9
Hello from thread 0, iteration 0
Hello from thread 0, iteration 1
Hello from thread 0, iteration 2


#### Exercise 14: Divide the data into slices (1 point)

Write a function named dividing_data that takes the training features, the training targets, and the number of workers as inputs and returns the sliced data for each worker.

![img](https://cdn.iisc.talentsprint.com/CDS/Images/MiniProject_MPI_DataSlice.JPG)

For example, with 4 processes, slice the data into 4 equal parts of 25% each.

**Hint:**

- Divide the Data equally among the workers
  - Create an empty list
  - Iterate over the size of workers
  - Append each slice of data to the list

In [42]:
def dividing_data(x_train, y_train, size_of_workers):
    # Size of the slice
    slice_for_each_worker = int(Decimal(x_train.shape[0]/size_of_workers).quantize(Decimal('1.'), rounding = ROUND_HALF_UP))
    print('Slice of data for each worker: {}'.format(slice_for_each_worker))
    # YOUR CODE HERE

In [43]:
from decimal import Decimal, ROUND_HALF_UP

def dividing_data(x_train, y_train, size_of_workers):
    # Size of the slice
    slice_for_each_worker = int(Decimal(x_train.shape[0] / size_of_workers).quantize(Decimal('1.'), rounding=ROUND_HALF_UP))

    print('Slice of data for each worker: {}'.format(slice_for_each_worker))

    # Create a list to hold the data slices for each worker
    data_slices = []

    # Iterate over the number of workers and create slices
    for worker in range(size_of_workers):
        start_index = worker * slice_for_each_worker
        end_index = start_index + slice_for_each_worker

        # For the last worker, make sure to include any leftover rows
        if worker == size_of_workers - 1:
            end_index = x_train.shape[0]

        # Append the sliced data for each worker
        x_slice = x_train[start_index:end_index]
        y_slice = y_train[start_index:end_index]

        data_slices.append((x_slice, y_slice))

    return data_slices

# Example usage:
# Assuming x_train and y_train are numpy arrays or pandas DataFrames with your training data.
# size_of_workers = 4
# data_slices = dividing_data(x_train, y_train, size_of_workers)
# print(data_slices)


Example Workflow:

In [44]:
import numpy as np

# Example data
x_train = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Divide data for 4 workers
size_of_workers = 4
data_slices = dividing_data(x_train, y_train, size_of_workers)

# Output each worker's data slice
for i, (x_slice, y_slice) in enumerate(data_slices):
    print(f"Worker {i+1} slice:")
    print("x_slice:", x_slice)
    print("y_slice:", y_slice)


Slice of data for each worker: 2
Worker 1 slice:
x_slice: [[1]
 [2]]
y_slice: [1 0]
Worker 2 slice:
x_slice: [[3]
 [4]]
y_slice: [1 0]
Worker 3 slice:
x_slice: [[5]
 [6]]
y_slice: [1 0]
Worker 4 slice:
x_slice: [[7]
 [8]]
y_slice: [1 0]
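
The rounding-based slice size above works when the row count divides evenly, but `ROUND_HALF_UP` can overshoot: with 9 rows and 6 workers the slice size rounds up to 2, so the last worker starts at index 10 and receives an empty slice. A minimal sketch of an alternative, assuming a hypothetical helper named `balanced_bounds` (not part of the original notebook), spreads the remainder so that slice sizes differ by at most one row. NumPy's `np.array_split` follows the same convention.

```python
def balanced_bounds(n_rows, n_workers):
    """Return (start, end) row bounds so slice sizes differ by at most one."""
    base, extra = divmod(n_rows, n_workers)
    bounds = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers each take one additional row.
        end = start + base + (1 if worker < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


# 10 rows over 4 workers: slice sizes 3, 3, 2, 2
print(balanced_bounds(10, 4))
# 9 rows over 6 workers: every worker still gets at least one row
print(balanced_bounds(9, 6))
```

Each `(start, end)` pair can be used directly as `x_train[start:end]` and `y_train[start:end]` inside `dividing_data`.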


MPI

In [63]:
!apt-get install -y -qq mpich
!pip install mpi4py




In [64]:
!mpirun --version


mpirun (Open MPI) 4.1.2

Report bugs to http://www.open-mpi.org/community/help/


Sample code for MPI

In [65]:
code = """
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Only the root process (rank 0) will create and scatter the data
if rank == 0:
    data = np.linspace(0, 100, size)
    print(f"Root process is scattering data: {data}")
else:
    data = None

# Scatter the data to all processes
data = comm.scatter(data, root=0)
print(f"Process {rank} received data: {data}")

# Each process computes the square of the data it received
result = data ** 2

# Gather the results at the root process
results = comm.gather(result, root=0)

# The root process prints the gathered results
if rank == 0:
    print(f"Root process gathered results: {results}")
"""

with open("mpi_test.py", "w") as f:
    f.write(code)


In [69]:
!mpirun --allow-run-as-root --oversubscribe -np 4 python mpi_test.py


Another sample code for MPI

In [73]:
# Save this simple MPI test script
test_code = """
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

print(f"Process {rank} out of {size} processes is running.")
"""

with open("mpi_test.py", "w") as f:
    f.write(test_code)


In [74]:
!mpirun --allow-run-as-root --oversubscribe -np 4 python mpi_test.py


In [None]:
Running required code.

In [72]:
!mpirun --allow-run-as-root --oversubscribe -np 4 python mpi_scatter_gather.py


#### Exercise 15: Prepare the data in root worker to assign data for all the workers (1 point)

- On the root worker, perform the following operations:
    - Store the features and target values in separate variables
    - Split the data into train and test sets using the train_test_split function defined above
    - Divide the data among the workers using the dividing_data function above

In [45]:
# YOUR CODE HERE

In [61]:
!apt-get install -y -qq mpich
!pip install mpi4py




1. Store features and target values in separate variables.
2. Split the data into training and test sets using the train_test_split function.
3. Divide the data among workers using the dividing_data function.
4. Distribute the data using MPI to assign data to all workers.


The code for the mpi_data_prep.py file is in the cell below.

In [54]:
from mpi4py import MPI
import numpy as np
from sklearn.model_selection import train_test_split

# Example function to divide the data for each worker
def dividing_data(x_train, y_train, size_of_workers):
    slice_size = len(x_train) // size_of_workers
    slices = [(x_train[i * slice_size:(i + 1) * slice_size], y_train[i * slice_size:(i + 1) * slice_size])
              for i in range(size_of_workers)]
    return slices

def prepare_data_for_workers(x, y, test_size, size_of_workers):
    """
    Function to prepare data in root worker to assign to all workers.

    Parameters:
    x (numpy.ndarray or pandas.DataFrame): The input feature data.
    y (numpy.ndarray or pandas.Series): The target labels.
    test_size (float): Proportion of the dataset to include in the test split.
    size_of_workers (int): Number of workers for dividing the data.

    Returns:
    Tuple containing the training and testing data (features and labels).
    """

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        # Step 1: Store features and target values in variables
        print("Root worker: Storing features and target values")

        # Step 2: Split the data into train and test sets
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=42)
        print("Root worker: Split the data into train and test sets")

        # Step 3: Divide the training data among workers
        data_slices = dividing_data(x_train, y_train, size_of_workers)
        print(f"Root worker: Divided the training data into {size_of_workers} slices")

        # Distribute the slices to each worker
        for i in range(1, size_of_workers):
            comm.send(data_slices[i], dest=i, tag=i)
            print(f"Root worker: Sent data slice {i} to worker {i}")

        return data_slices[0], x_test, y_test  # Root worker keeps the first slice

    else:
        # Other workers receive their slice of the data
        data_slice = comm.recv(source=0, tag=rank)
        print(f"Worker {rank}: Received data slice")
        return data_slice, None, None

# Example usage:
if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size_of_workers = comm.Get_size()

    # Assume you have some dataset (x, y)
    if rank == 0:
        # Example data
        x = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
        y = np.array([1, 0, 1, 0, 1, 0, 1, 0])
    else:
        x = None
        y = None

    # Prepare the data
    train_data, x_test, y_test = prepare_data_for_workers(x, y, test_size=0.3, size_of_workers=size_of_workers)

    if rank == 0:
        print("Root worker has its data slice and test set")
        print(f"x_test: {x_test}, y_test: {y_test}")
    else:
        print(f"Worker {rank} received its training data slice.")


Root worker: Storing features and target values
Root worker: Split the data into train and test sets
Root worker: Divided the training data into 1 slices
Root worker has its data slice and test set
x_test: [[2]
 [6]
 [1]], y_test: [0 0 1]


In [75]:
! mpirun --allow-run-as-root --oversubscribe -np 6 python mpi_data_prep.py

#### Exercise 16: Scatter and gather the data (1 point)

Perform the below operations:

- Send slices of the training set (the features X and the expected targets Y) to every worker, including the root worker
    - **Hint:** scatter()
    - Use `barrier()` to block workers until all workers in the group have reached the barrier, so that the scatter from the root worker is synchronized.
- Every worker should compute the predicted target (yhat) for its slice
- Compute the new coefficients from the instances in each slice
    - **Hint:** fit function defined above
- Gather the new coefficients from each worker
    - **Hint:** gather()
    - Take the mean of the gathered coefficients
- Calculate the root mean square error for the test set

To learn more about `scatter`, `gather` and `barrier`, click [here](https://nyu-cds.github.io/python-mpi/05-collectives/)

In [None]:
# YOUR CODE HERE

Key Operations:
1. Scatter: Distribute data slices (features X and target Y) among all workers.
2. Barrier: Ensure all workers have reached a certain point before proceeding.
3. Predictions and Fitting: Each worker fits a model to its slice and predicts values (yhat).
4. Gather: Gather the computed coefficients from all workers to the root worker.
5. Compute Mean of Coefficients: The root worker computes the mean of all coefficients.
6. RMSE Calculation: The root worker calculates the RMSE for the test set.
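
The operations above can be rehearsed in a single process before involving MPI. The sketch below is an illustration under stated assumptions, not the notebook's implementation: the data are hypothetical and noise-free, a plain Python list stands in for `scatter` and `gather`, and the helper name `fit_slice` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free data: y = 2 + 3*x1 - 1*x2
X = rng.normal(size=(40, 2))
y = 2.0 + X @ np.array([3.0, -1.0])

n_workers = 4  # stands in for comm.Get_size()

# "Scatter": one (X, y) slice per simulated worker
slices = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))


def fit_slice(X_slice, y_slice):
    """Least-squares fit with an intercept column prepended."""
    X_aug = np.c_[np.ones(X_slice.shape[0]), X_slice]
    beta, *_ = np.linalg.lstsq(X_aug, y_slice, rcond=None)
    return beta


# Each "worker" fits its own slice; "gather" collects the results
gathered = [fit_slice(Xs, ys) for Xs, ys in slices]

# Root step: average the per-worker coefficient vectors
mean_beta = np.mean(gathered, axis=0)
print(mean_beta)
```

Because the data here contain no noise, every slice recovers the same coefficients and the mean matches them. With noisy data, the average of per-slice fits generally differs from a single fit on all rows, which is worth keeping in mind when comparing against a reference model.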

In [59]:
# YOUR CODE HERE for script(.py)

from mpi4py import MPI
import numpy as np
from sklearn.model_selection import train_test_split

# Example fit function that estimates coefficients using linear regression
def fit(X, y):
    X = np.c_[np.ones(X.shape[0]), X]  # Add intercept term
    beta = np.linalg.inv(X.T @ X) @ (X.T @ y)  # Linear regression formula
    return beta  # Coefficients including intercept

# RMSE calculation function
def calculate_rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def scatter_gather_operations(x_train, y_train, x_test, y_test, size_of_workers):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Step 1: Scatter the training data to all workers
    data_slices = None
    if rank == 0:
        # Prepare slices to scatter
        slice_size = len(x_train) // size_of_workers
        data_slices = [(x_train[i * slice_size:(i + 1) * slice_size], y_train[i * slice_size:(i + 1) * slice_size])
                       for i in range(size_of_workers)]

    # Scatter the slices (x_train and y_train) to all workers
    data_slice = comm.scatter(data_slices, root=0)

    # Step 2: Synchronize all workers using Barrier
    comm.Barrier()

    # Step 3: Each worker computes the coefficients for its slice
    X_slice, y_slice = data_slice
    coefficients = fit(X_slice, y_slice)

    # Step 4: Gather all coefficients at the root worker
    all_coefficients = comm.gather(coefficients, root=0)

    if rank == 0:
        # Step 5: Take the mean of the gathered coefficients
        mean_coefficients = np.mean(all_coefficients, axis=0)

        # Step 6: Use the mean coefficients to predict on the test set
        X_test_augmented = np.c_[np.ones(x_test.shape[0]), x_test]  # Add intercept term
        yhat_test = X_test_augmented @ mean_coefficients

        # Step 7: Calculate the RMSE for the test set
        rmse_test = calculate_rmse(y_test, yhat_test)
        print(f"Root Mean Squared Error (RMSE) on test set: {rmse_test}")

    return

# Example usage:
if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size_of_workers = comm.Get_size()

    if rank == 0:
        # Generate some example data
        x_train = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
        y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0])

        x_test = np.array([[1], [2], [3], [4]])
        y_test = np.array([1, 0, 1, 0])
    else:
        x_train = None
        y_train = None
        x_test = None
        y_test = None

    scatter_gather_operations(x_train, y_train, x_test, y_test, size_of_workers)


Root Mean Squared Error (RMSE) on test set: 0.48795003647426655


In [76]:
! mpirun --allow-run-as-root --oversubscribe -np 6 python mpi_scatter_gather.py

The expected output does not appear when the script is executed.
An example of the expected output is shown below:


```
Root process scattering data slices to 4 workers.
Process 0 received data slice: (array([[1], [2]]), array([1, 0]))
Process 1 received data slice: (array([[3], [4]]), array([1, 0]))
Process 2 received data slice: (array([[5], [6]]), array([1, 0]))
Process 3 received data slice: (array([[7], [8]]), array([1, 0]))
Process 0 computed coefficients: [1.0, -0.5]
Process 1 computed coefficients: [0.5, 0.1]
...
Mean coefficients: [0.75, 0.05]
Root Mean Squared Error (RMSE) on test set: 0.23

```
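
One possible contributor to the missing output, assuming the 8-row toy data is split across 6 processes, is that `8 // 6` leaves each worker a single row. With one row and an intercept column, `X.T @ X` has rank 1, so the explicit inverse used in `fit` cannot be computed. The sketch below reproduces that situation in isolation and shows `np.linalg.lstsq` as one alternative that returns a minimum-norm solution instead of raising.

```python
import numpy as np

# One-row slice, as produced by 8 rows // 6 workers = 1 row per worker
X_slice = np.array([[1.0]])
y_slice = np.array([1.0])

X_aug = np.c_[np.ones(X_slice.shape[0]), X_slice]  # shape (1, 2)
gram = X_aug.T @ X_aug                             # [[1, 1], [1, 1]], rank 1

try:
    np.linalg.inv(gram)
    inverse_failed = False
except np.linalg.LinAlgError:
    inverse_failed = True
print("inverse failed:", inverse_failed)

# lstsq returns the minimum-norm solution instead of raising
beta, *_ = np.linalg.lstsq(X_aug, y_slice, rcond=None)
print(beta)
```

Running with a process count that leaves each worker at least as many rows as there are coefficients (for example 4 processes on this data) avoids the singular case.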




#### Exercise 17: Make a script and execute everything in one place (1 point)

Write a script (.py) file that contains the code from all the exercises above so that you can run it on multiple processes using MPI.

**Hint:**

- magic commands
- put MPI related code under main function
- !mpirun --allow-run-as-root -np 4 python filename.py

In [56]:
# YOUR CODE HERE for script(.py)

Step 1:  Python Script (mpi_exercise.py)

In [None]:
from mpi4py import MPI
import numpy as np
from sklearn.model_selection import train_test_split

# Fit function that estimates coefficients using linear regression
def fit(X, y):
    X = np.c_[np.ones(X.shape[0]), X]  # Add intercept term
    beta = np.linalg.inv(X.T @ X) @ (X.T @ y)  # Linear regression formula
    return beta  # Coefficients including intercept

# RMSE calculation function
def calculate_rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def scatter_gather_operations(x_train, y_train, x_test, y_test, size_of_workers):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Scatter the data from the root to all workers
    data_slices = None
    if rank == 0:
        # Root worker creates slices to scatter
        slice_size = len(x_train) // size_of_workers
        data_slices = [(x_train[i * slice_size:(i + 1) * slice_size], y_train[i * slice_size:(i + 1) * slice_size])
                       for i in range(size_of_workers)]
        print(f"Root process scattering data slices to {size_of_workers} workers.")
    else:
        data_slices = None

    # Scatter the data to all processes (including root)
    data_slice = comm.scatter(data_slices, root=0)
    print(f"Process {rank} received data slice: {data_slice}")

    # Barrier to ensure all processes have received their slices before proceeding
    comm.Barrier()

    # Each worker computes the coefficients for its data slice
    X_slice, y_slice = data_slice
    coefficients = fit(X_slice, y_slice)
    print(f"Process {rank} computed coefficients: {coefficients}")

    # Predict the target (yhat) for the slice using the coefficients
    yhat_slice = X_slice @ coefficients[1:] + coefficients[0]
    print(f"Process {rank} predicted yhat for its slice: {yhat_slice}")

    # Barrier to ensure all processes finish computation before gathering
    comm.Barrier()

    # Gather the coefficients back to the root process
    all_coefficients = comm.gather(coefficients, root=0)

    # Root process calculates the mean of the gathered coefficients and computes RMSE
    if rank == 0:
        mean_coefficients = np.mean(all_coefficients, axis=0)
        print(f"Mean coefficients: {mean_coefficients}")

        # Use the mean coefficients to predict on the test set
        X_test_augmented = np.c_[np.ones(x_test.shape[0]), x_test]  # Add intercept term
        yhat_test = X_test_augmented @ mean_coefficients

        # Calculate the RMSE for the test set
        rmse_test = calculate_rmse(y_test, yhat_test)
        print(f"Root Mean Squared Error (RMSE) on test set: {rmse_test}")

    return

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size_of_workers = comm.Get_size()

    if rank == 0:
        # Generate some example data
        x_train = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
        y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0])

        x_test = np.array([[1], [2], [3], [4]])
        y_test = np.array([1, 0, 1, 0])
    else:
        x_train = None
        y_train = None
        x_test = None
        y_test = None

    scatter_gather_operations(x_train, y_train, x_test, y_test, size_of_workers)


Step 2: Save the Script as mpi_exercise.py

In [77]:
script_code = """
from mpi4py import MPI
import numpy as np
from sklearn.model_selection import train_test_split

# Fit function that estimates coefficients using linear regression
def fit(X, y):
    X = np.c_[np.ones(X.shape[0]), X]  # Add intercept term
    beta = np.linalg.inv(X.T @ X) @ (X.T @ y)  # Linear regression formula
    return beta  # Coefficients including intercept

# RMSE calculation function
def calculate_rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def scatter_gather_operations(x_train, y_train, x_test, y_test, size_of_workers):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Scatter the data from the root to all workers
    data_slices = None
    if rank == 0:
        # Root worker creates slices to scatter
        slice_size = len(x_train) // size_of_workers
        data_slices = [(x_train[i * slice_size:(i + 1) * slice_size], y_train[i * slice_size:(i + 1) * slice_size])
                       for i in range(size_of_workers)]
        print(f"Root process scattering data slices to {size_of_workers} workers.")
    else:
        data_slices = None

    # Scatter the data to all processes (including root)
    data_slice = comm.scatter(data_slices, root=0)
    print(f"Process {rank} received data slice: {data_slice}")

    # Barrier to ensure all processes have received their slices before proceeding
    comm.Barrier()

    # Each worker computes the coefficients for its data slice
    X_slice, y_slice = data_slice
    coefficients = fit(X_slice, y_slice)
    print(f"Process {rank} computed coefficients: {coefficients}")

    # Predict the target (yhat) for the slice using the coefficients
    yhat_slice = X_slice @ coefficients[1:] + coefficients[0]
    print(f"Process {rank} predicted yhat for its slice: {yhat_slice}")

    # Barrier to ensure all processes finish computation before gathering
    comm.Barrier()

    # Gather the coefficients back to the root process
    all_coefficients = comm.gather(coefficients, root=0)

    # Root process calculates the mean of the gathered coefficients and computes RMSE
    if rank == 0:
        mean_coefficients = np.mean(all_coefficients, axis=0)
        print(f"Mean coefficients: {mean_coefficients}")

        # Use the mean coefficients to predict on the test set
        X_test_augmented = np.c_[np.ones(x_test.shape[0]), x_test]  # Add intercept term
        yhat_test = X_test_augmented @ mean_coefficients

        # Calculate the RMSE for the test set
        rmse_test = calculate_rmse(y_test, yhat_test)
        print(f"Root Mean Squared Error (RMSE) on test set: {rmse_test}")

    return

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size_of_workers = comm.Get_size()

    if rank == 0:
        # Generate some example data
        x_train = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
        y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0])

        x_test = np.array([[1], [2], [3], [4]])
        y_test = np.array([1, 0, 1, 0])
    else:
        x_train = None
        y_train = None
        x_test = None
        y_test = None

    scatter_gather_operations(x_train, y_train, x_test, y_test, size_of_workers)
"""

with open("mpi_exercise.py", "w") as f:
    f.write(script_code)


Step 3: Run the Script with mpirun

In [80]:
# YOUR CODE HERE for MPI command

!mpirun --allow-run-as-root --oversubscribe -np 4 python mpi_exercise.py

The expected output does not appear when running on Colab.

#### Exercise 18: Use Sklearn to compare (1 point)

Apply linear regression to the given data using the sklearn package and compare with the results above.

**Hint:**
* Split the data into train and test
* Fit the train data and predict the test data using `sklearn Linear Regression`
* Compare the coefficients and intercept with above estimated coefficients
* Calculate the loss (RMSE) between the test targets and the predictions, and compare

In [None]:
# YOUR CODE HERE

In [81]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Function to compare custom linear regression and sklearn's linear regression
def compare_with_sklearn(x_train, y_train, x_test, y_test):
    # Step 1: Apply Sklearn's Linear Regression
    model = LinearRegression()
    model.fit(x_train, y_train)  # Fit the model using training data

    # Step 2: Get the coefficients and intercept from the sklearn model
    sklearn_coefficients = model.coef_
    sklearn_intercept = model.intercept_
    print(f"Sklearn Coefficients: {sklearn_coefficients}")
    print(f"Sklearn Intercept: {sklearn_intercept}")

    # Step 3: Predict the test data
    yhat_sklearn = model.predict(x_test)

    # Step 4: Calculate RMSE for Sklearn model
    rmse_sklearn = np.sqrt(mean_squared_error(y_test, yhat_sklearn))
    print(f"Sklearn RMSE: {rmse_sklearn}")

    return sklearn_coefficients, sklearn_intercept, rmse_sklearn

if __name__ == "__main__":
    # Example data
    x_train = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
    y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0])

    x_test = np.array([[1], [2], [3], [4]])
    y_test = np.array([1, 0, 1, 0])

    # Step 5: Compare results
    sklearn_coefficients, sklearn_intercept, rmse_sklearn = compare_with_sklearn(x_train, y_train, x_test, y_test)

    # You can now compare this with the custom implementation from Exercise 16


Sklearn Coefficients: [-0.04761905]
Sklearn Intercept: 0.7142857142857144
Sklearn RMSE: 0.4879500364742666
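
As a cross-check that does not depend on sklearn, the same toy data can be fitted with NumPy's least-squares solver. This is a minimal sketch; the helper name `add_intercept` is an assumption introduced for this example. On this data it should reproduce the intercept, coefficient, and RMSE printed above.

```python
import numpy as np

x_train = np.arange(1, 9, dtype=float).reshape(-1, 1)
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
x_test = x_train[:4]
y_test = y_train[:4]


def add_intercept(X):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.c_[np.ones(X.shape[0]), X]


# Ordinary least squares on the full training set
beta, *_ = np.linalg.lstsq(add_intercept(x_train), y_train, rcond=None)
intercept, slope = beta

# RMSE of the fitted line on the test rows
rmse = np.sqrt(np.mean((y_test - add_intercept(x_test) @ beta) ** 2))
print(intercept, slope, rmse)
```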
