<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


## Final Project - Build an ML Pipeline for Airfoil noise prediction


Estimated time needed: **90** minutes


## Scenario


You are a data engineer at an aeronautics consulting company. Your company prides itself in being able to efficiently design airfoils for use in planes and sports cars. Data scientists in your office need to work with different algorithms and data in different formats. While they are good at Machine Learning, they count on you to be able to do ETL jobs and build ML pipelines. In this project you will use the modified version of the NASA Airfoil Self Noise dataset. You will clean this dataset, by dropping the duplicate rows, and removing the rows with null values. You will create an ML pipe line to create a model that will predict the SoundLevel based on all the other columns. You will evaluate the model and towards the end you will persist the model.



## Objectives

In this 4 part assignment you will:

- Part 1 Perform ETL activity
  - Load a csv dataset
  - Remove duplicates if any
  - Drop rows with null values if any
  - Make transformations
  - Store the cleaned data in parquet format
- Part 2 Create a  Machine Learning Pipeline
  - Create a machine learning pipeline for prediction
- Part 3 Evaluate the Model
  - Evaluate the model using relevant metrics
- Part 4 Persist the Model 
  - Save the model for future production use
  - Load and verify the stored model


## Datasets

In this lab you will be using dataset(s):

 - The original dataset can be found here NASA airfoil self noise dataset. https://archive.ics.uci.edu/dataset/291/airfoil+self+noise
 
 - This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.


Diagram of an airfoil. - For informational purpose


![Airfoil with flow](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/images/Airfoil_with_flow.png)


Diagram showing the Angle of attack. - For informational purpose


![Airfoil angle of attack](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/images/Airfoil_angle_of_attack.jpg)


## Before you Start


**Before you start attempting this project it is highly recommended that you finish the practice project.**


## Setup


For this lab, we will be using the following libraries:

*   [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01) for connecting to the Spark Cluster


### Installing Required Libraries

Spark Cluster is pre-installed in the Skills Network Labs environment. However, you need libraries like pyspark and findspark to
 connect to this cluster.


The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [6]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [7]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

## Part 1 - Perform ETL activity


### Task 1 - Import required libraries


In [8]:



from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
airfoil_self_noise = fetch_ucirepo(id=291) 
  
# data (as pandas dataframes) 
X = airfoil_self_noise.data.features 
y = airfoil_self_noise.data.targets 
  
# metadata 
print(airfoil_self_noise.metadata) 
  
# variable information 
print(airfoil_self_noise.variables) 




{'uci_id': 291, 'name': 'Airfoil Self-Noise', 'repository_url': 'https://archive.ics.uci.edu/dataset/291/airfoil+self+noise', 'data_url': 'https://archive.ics.uci.edu/static/public/291/data.csv', 'abstract': 'NASA data set, obtained from a series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections conducted in an anechoic wind tunnel.', 'area': 'Physics and Chemistry', 'tasks': ['Regression'], 'characteristics': ['Multivariate'], 'num_instances': 1503, 'num_features': 5, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['scaled-sound-pressure'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1989, 'last_updated': 'Fri Mar 29 2024', 'dataset_doi': '10.24432/C5VW2C', 'creators': ['Thomas Brooks', 'D. Pope', 'Michael Marcolini'], 'intro_paper': None, 'additional_info': {'summary': 'The NASA data set comprises different size NACA 0012 airfoils at various wind tunnel speeds and angles of

### Task 2 - Create a spark session


In [9]:
#Create a SparkSession
from pyspark.sql import SparkSession

# Step 1: Create a Spark session
spark = SparkSession.builder \
    .appName("MySparkApplication") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

24/06/10 20:28:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


### Task 3 - Load the csv file into a dataframe


Download the data file.

NOTE : Please ensure you use the dataset below and not the original dataset mentioned above.


In [10]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv


--2024-06-10 20:28:13--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104, 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60682 (59K) [text/csv]
Saving to: ‘NASA_airfoil_noise_raw.csv.6’


2024-06-10 20:28:13 (39.9 MB/s) - ‘NASA_airfoil_noise_raw.csv.6’ saved [60682/60682]



Load the dataset into the spark dataframe


In [11]:
# Load the dataset that you have downloaded in the previous task

# Define the correct file path
file_path = "/resources/labs/authoride/IBMSkillsNetwork+BD0231EN/labs/NASA_airfoil_noise_raw.csv"

# Load the CSV file into a DataFrame
df = spark.read.csv(file_path, header=True, inferSchema=True)
df.show()



                                                                                

+---------+-------------+-----------+------------------+-----------------------+----------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevel|
+---------+-------------+-----------+------------------+-----------------------+----------+
|      800|          0.0|     0.3048|              71.3|             0.00266337|   126.201|
|     1000|          0.0|     0.3048|              71.3|             0.00266337|   125.201|
|     1250|          0.0|     0.3048|              71.3|             0.00266337|   125.951|
|     1600|          0.0|     0.3048|              71.3|             0.00266337|   127.591|
|     2000|          0.0|     0.3048|              71.3|             0.00266337|   127.461|
|     2500|          0.0|     0.3048|              71.3|             0.00266337|   125.571|
|     3150|          0.0|     0.3048|              71.3|             0.00266337|   125.201|
|     4000|          0.0|     0.3048|              71.3|             0.00266337|

### Task 4 - Print top 5 rows of the dataset


In [12]:
#your code goes here
df.show(5)



+---------+-------------+-----------+------------------+-----------------------+----------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevel|
+---------+-------------+-----------+------------------+-----------------------+----------+
|      800|          0.0|     0.3048|              71.3|             0.00266337|   126.201|
|     1000|          0.0|     0.3048|              71.3|             0.00266337|   125.201|
|     1250|          0.0|     0.3048|              71.3|             0.00266337|   125.951|
|     1600|          0.0|     0.3048|              71.3|             0.00266337|   127.591|
|     2000|          0.0|     0.3048|              71.3|             0.00266337|   127.461|
+---------+-------------+-----------+------------------+-----------------------+----------+
only showing top 5 rows



### Task 6 - Print the total number of rows in the dataset


In [13]:
#your code goes here
total_rows = df.count()
print("Total number of rows:", total_rows)

Total number of rows: 1522


### Task 7 - Drop all the duplicate rows from the dataset


In [14]:
# Remove duplicates from the DataFrame
df_no_duplicates = df.dropDuplicates()

# Count the total number of rows after removing duplicates
total_rows_no_duplicates = df_no_duplicates.count()

# Print the total number of rows after removing duplicates
print("Total number of rows after removing duplicates:", total_rows_no_duplicates)



Total number of rows after removing duplicates: 1503


                                                                                

### Task 8 - Print the total number of rows in the dataset


In [15]:
#your code goes here

total_rows = df.count()

# Print the total number of rows
print("Total number of rows:", total_rows)

Total number of rows: 1522


### Task 9 - Drop all the rows that contain null values from the dataset


In [16]:
# Drop rows with any null values
df_no_nulls = df.na.drop()

# Drop duplicate rows
df_no_duplicates = df_no_nulls.dropDuplicates()

# Count the total number of rows after dropping null values and duplicates
total_rows_no_nulls_duplicates = df_no_duplicates.count()


                                                                                

### Task 10 - Print the total number of rows in the dataset


In [17]:
#your code goes here

# Print the total number of rows after dropping null values and duplicates
print("Total number of rows after dropping null values and duplicates:", total_rows_no_nulls_duplicates)


Total number of rows after dropping null values and duplicates: 1499


### Task 11 - Rename the column "SoundLevel" to "SoundLevelDecibels"


In [18]:
# your code goes here
# Rename the column from "SoundLevel" to "SoundLevelDecibels"
df_renamed = df_no_duplicates.withColumnRenamed("SoundLevel", "SoundLevelDecibels")

# Show the DataFrame with the renamed column
df_renamed.show()



+---------+-------------+-----------+------------------+-----------------------+------------------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevelDecibels|
+---------+-------------+-----------+------------------+-----------------------+------------------+
|     4000|          3.0|     0.3048|              31.7|             0.00529514|           115.608|
|     3150|          2.0|     0.2286|              31.7|             0.00372371|           121.527|
|     2000|          7.3|     0.2286|              31.7|              0.0132672|           115.309|
|     2000|          5.4|     0.1524|              71.3|             0.00401199|           131.111|
|      500|          9.9|     0.1524|              71.3|              0.0193001|           131.279|
|     1000|         12.6|     0.1524|              71.3|              0.0483159|           122.044|
|    12500|          0.0|     0.0254|              39.6|             4.28464E-4|           129.116|


### Task 12 - Save the dataframe in parquet format, name the file as "NASA_airfoil_noise_cleaned.parquet"


In [21]:



# Save the DataFrame in Parquet format, overwriting the existing file
df_renamed.write.mode("overwrite").parquet("NASA_airfoil_noise_cleaned.parquet")


# your code goes here



                                                                                

#### Part 1 - Evaluation



**Run the code cell below.**<br>
**Use the answers here to answer the final evaluation quiz in the next section.**<br>
**If the code throws up any errors, go back and review the code you have written.**</b>


In [22]:
print("Part 1 - Evaluation")

print("Total rows = ", rowcount1)
print("Total rows after dropping duplicate rows = ", rowcount2)
print("Total rows after dropping duplicate rows and rows with null values = ", rowcount3)
print("New column name = ", df.columns[-1])

import os

print("NASA_airfoil_noise_cleaned.parquet exists :", os.path.isdir("NASA_airfoil_noise_cleaned.parquet"))

Part 1 - Evaluation


NameError: name 'rowcount1' is not defined

## Part - 2 Create a  Machine Learning Pipeline


### Task 1 - Load data from "NASA_airfoil_noise_cleaned.parquet" into a dataframe


In [23]:
#your code goes here

# Load the Parquet file into a DataFrame
df_parquet = spark.read.parquet("NASA_airfoil_noise_cleaned.parquet")



### Task 2 - Print the total number of rows in the dataset


In [24]:
#your code goes here

rowcount4 = df_parquet.count()
print(rowcount4)





1499


                                                                                

### Task 3 - Define the VectorAssembler pipeline stage


Stage 1 - Assemble the input columns into a single column "features". Use all the columns except SoundLevelDecibels as input features.


In [25]:
from pyspark.ml.feature import VectorAssembler

# List of input columns (excluding SoundLevelDecibels)
input_columns = [col for col in df_parquet.columns if col != "SoundLevelDecibels"]

# Initialize VectorAssembler
assembler = VectorAssembler(inputCols=input_columns, outputCol="features")

# Transform the DataFrame
df_assembled = assembler.transform(df_parquet)

# Show the DataFrame
df_assembled.show()




+---------+-------------+-----------+------------------+-----------------------+------------------+--------------------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevelDecibels|            features|
+---------+-------------+-----------+------------------+-----------------------+------------------+--------------------+
|      630|          0.0|     0.3048|              31.7|             0.00331266|           129.095|[630.0,0.0,0.3048...|
|     4000|          0.0|     0.3048|              31.7|             0.00331266|           118.145|[4000.0,0.0,0.304...|
|     4000|          1.5|     0.3048|              39.6|             0.00392107|           117.741|[4000.0,1.5,0.304...|
|      800|          4.0|     0.3048|              71.3|             0.00497773|           131.755|[800.0,4.0,0.3048...|
|     1250|          0.0|     0.2286|              31.7|              0.0027238|           128.805|[1250.0,0.0,0.228...|
|     2500|          4.0|     0.

### Task 4 - Define the StandardScaler pipeline stage


Stage 2 - Scale the "features" using standard scaler and store in "scaledFeatures" column


In [26]:
from pyspark.ml.feature import StandardScaler

# Initialize StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")

# Fit the StandardScaler to the data
scaler_model = scaler.fit(df_assembled)

# Transform the data
df_scaled = scaler_model.transform(df_assembled)

# Show the DataFrame
df_scaled.show()



                                                                                

+---------+-------------+-----------+------------------+-----------------------+------------------+--------------------+--------------------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevelDecibels|            features|      scaledFeatures|
+---------+-------------+-----------+------------------+-----------------------+------------------+--------------------+--------------------+
|      630|          0.0|     0.3048|              31.7|             0.00331266|           129.095|[630.0,0.0,0.3048...|[0.19963520038433...|
|     4000|          0.0|     0.3048|              31.7|             0.00331266|           118.145|[4000.0,0.0,0.304...|[1.26752508180528...|
|     4000|          1.5|     0.3048|              39.6|             0.00392107|           117.741|[4000.0,1.5,0.304...|[1.26752508180528...|
|      800|          4.0|     0.3048|              71.3|             0.00497773|           131.755|[800.0,4.0,0.3048...|[0.25350501636105...|
|     

### Task 5 - Define the Model creation pipeline stage


Stage 3 - Create a LinearRegression stage to predict "SoundLevelDecibels"

**Note:You need to use the scaledfeatures retreived in the previous step(StandardScaler pipeline stage).**


In [27]:

from pyspark.ml.regression import LinearRegression

# Define the Linear Regression model
lr = LinearRegression(featuresCol='scaledFeatures', labelCol='SoundLevelDecibels')

# Fit the model to the data
lr_model = lr.fit(df_scaled)

# Print the coefficients and intercept
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

# Summarize the model over the training set and print out some metrics
training_summary = lr_model.summary
print("Root Mean Squared Error: %f" % training_summary.rootMeanSquaredError)
print("R Squared: %f" % training_summary.r2)



24/06/10 20:39:36 WARN util.Instrumentation: [2072220a] regParam is zero, which might cause numerical instability and overfitting.
24/06/10 20:39:37 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
24/06/10 20:39:37 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
24/06/10 20:39:39 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
24/06/10 20:39:39 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
[Stage 35:>                                                         (0 + 8) / 8]

Coefficients: [-4.04205014061969,-2.4999359735353774,-3.3413415990555646,1.5607714388511529,-1.9348543216323604]
Intercept: 132.82719138025894
Root Mean Squared Error: 4.801411
R Squared: 0.516196


                                                                                

### Task 6 - Build the pipeline


Build a pipeline using the above three stages


In [28]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StandardScaler

# Initialize Spark session
spark = SparkSession.builder.appName("AirfoilSelfNoisePrediction").getOrCreate()


# Rename the column from SoundLevel to SoundLevelDecibels
df = df.withColumnRenamed("SoundLevel", "SoundLevelDecibels")

# Remove rows with null values
df = df.dropna()

# Define the stages of the pipeline
assembler = VectorAssembler(
    inputCols=["Frequency", "AngleOfAttack", "ChordLength", "FreeStreamVelocity", "SuctionSideDisplacement", "outputCol="features"
)

scaler = StandardScaler(
    inputCol="features",
    outputCol="scaledFeatures",
    withStd=True,
    withMean=False
)

lr = LinearRegression(
    featuresCol="scaledFeatures",
    labelCol="SoundLevelDecibels",
    predictionCol="prediction"
)

# Construct the pipeline
pipeline = Pipeline(stages=[assembler, scaler, lr])






### Task 7 - Split the data


In [29]:
# Split the data into training and testing sets with 70:30 split.
# set the value of seed to 42
# the above step is very important. DO NOT set the value of seed to any other value other than 42.

#your code goes here

(trainingData, testingData) =df.randomSplit([0.7, 0.3], seed=42)



### Task 8 - Fit the pipeline


In [30]:




# Fit the pipeline to the training data
pipeline_model = pipeline.fit(trainingData)



24/06/10 20:40:57 WARN util.Instrumentation: [aa06e88b] regParam is zero, which might cause numerical instability and overfitting.


#### Part 2 - Evaluation



**Run the code cell below.**<br>
**Use the answers here to answer the final evaluation quiz in the next section.**<br>
**If the code throws up any errors, go back and review the code you have written.**</b>


In [31]:
print("Part 2 - Evaluation")
print("Total rows = ", rowcount4)
ps = [str(x).split("_")[0] for x in pipeline.getStages()]

print("Pipeline Stage 1 = ", ps[0])
print("Pipeline Stage 2 = ", ps[1])
print("Pipeline Stage 3 = ", ps[2])

print("Label column = ", lr.getLabelCol())

Part 2 - Evaluation
Total rows =  1499
Pipeline Stage 1 =  VectorAssembler
Pipeline Stage 2 =  StandardScaler
Pipeline Stage 3 =  LinearRegression
Label column =  SoundLevelDecibels


## Part 3 - Evaluate the Model


### Task 1 - Predict using the model


In [32]:
# Fit the pipeline using the training data
pipeline_model = pipeline.fit(trainingData)

# Make predictions on the test data
predictions = pipeline_model.transform(testingData)

# Show the predictions
predictions.select("Frequency", "AngleOfAttack", "ChordLength", "FreeStreamVelocity", "SuctionSideDisplacement", "SoundLevelDecibels", "prediction").show()


24/06/10 20:41:33 WARN util.Instrumentation: [df682d66] regParam is zero, which might cause numerical instability and overfitting.


+---------+-------------+-----------+------------------+-----------------------+------------------+------------------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevelDecibels|        prediction|
+---------+-------------+-----------+------------------+-----------------------+------------------+------------------+
|      200|          0.0|     0.3048|              39.6|             0.00310138|           118.129|118.12899999999229|
|      200|          7.3|     0.2286|              31.7|              0.0132672|           128.679|128.67900000000606|
|      200|          7.3|     0.2286|              39.6|              0.0123481|           130.989|130.98900000000754|
|      200|          9.9|     0.1524|              71.3|              0.0193001|           134.319|134.31900000000755|
|      200|         12.3|     0.1016|              31.7|              0.0418756|           124.987|124.98700000000537|
|      200|         15.6|     0.1016|           

### Task 2 - Print the MSE


In [33]:
# Evaluate the model using Mean Squared Error (MSE)
evaluator = RegressionEvaluator(labelCol="SoundLevelDecibels", predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(predictions)

print(f"Mean Squared Error (MSE) on test data = {mse}")


Mean Squared Error (MSE) on test data = 2.445646531627305e-23


### Task 3 - Print the MAE


In [34]:
from pyspark.ml.evaluation import RegressionEvaluator

# Evaluate the model using Mean Absolute Error (MAE)
evaluator = RegressionEvaluator(labelCol="SoundLevelDecibels", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)

print(f"Mean Absolute Error (MAE) on test data = {mae}")



Mean Absolute Error (MAE) on test data = 3.8051473561071e-12


### Task 4 - Print the R-Squared(R2)


In [35]:
from pyspark.ml.evaluation import RegressionEvaluator

# Evaluate the model using R-squared (R2)
evaluator = RegressionEvaluator(labelCol="SoundLevelDecibels", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)

print(f"R-squared (R2) on test data = {r2}")



R-squared (R2) on test data = 1.0


#### Part 3 - Evaluation



**Run the code cell below.**<br>
**Use the answers here to answer the final evaluation quiz in the next section.**<br>
**If the code throws up any errors, go back and review the code you have written.**</b>


In [36]:
print("Part 3 - Evaluation")

print("Mean Squared Error = ", round(mse,2))
print("Mean Absolute Error = ", round(mae,2))
print("R Squared = ", round(r2,2))

lrModel = pipeline_model.stages[-1]

print("Intercept = ", round(lrModel.intercept,2))


Part 3 - Evaluation
Mean Squared Error =  0.0
Mean Absolute Error =  0.0
R Squared =  1.0
Intercept =  -0.0


## Part 4 - Persist the Model


### Task 1 - Save the model to the path "Final_Project"


In [39]:
# Save the pipeline model as "Final_Project"
# your code goes here
# Save the trained pipeline model
pipeline_model.write().overwrite().save("Final Project")


                                                                                

### Task 2 - Load the model from the path "Final_Project"


In [40]:
# Load the pipeline model you have created in the previous step
from pyspark.ml import PipelineModel

# Load the saved pipeline model
loaded_pipeline_model = PipelineModel.load("Final Project")



### Task 3 - Make predictions using the loaded model on the testdata


In [41]:
# Use the loaded pipeline model and make predictions using testingData

# Make predictions on the test data using the loaded model
predictions_loaded_model = loaded_pipeline_model.transform(testingData)




### Task 4 - Show the predictions


In [43]:


# Show the predictions
predictions_loaded_model.select("Frequency", "AngleOfAttack", "ChordLength", "FreeStreamVelocity", "SuctionSideDisplacement", "SoundLevelDecibels", "prediction").show(5)



+---------+-------------+-----------+------------------+-----------------------+------------------+------------------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevelDecibels|        prediction|
+---------+-------------+-----------+------------------+-----------------------+------------------+------------------+
|      200|          0.0|     0.3048|              39.6|             0.00310138|           118.129|118.12899999999229|
|      200|          7.3|     0.2286|              31.7|              0.0132672|           128.679|128.67900000000606|
|      200|          7.3|     0.2286|              39.6|              0.0123481|           130.989|130.98900000000754|
|      200|          9.9|     0.1524|              71.3|              0.0193001|           134.319|134.31900000000755|
|      200|         12.3|     0.1016|              31.7|              0.0418756|           124.987|124.98700000000537|
+---------+-------------+-----------+-----------

#### Part 4 - Evaluation




**Run the code cell below.**<br>
**Use the answers here to answer the final evaluation quiz in the next section.**<br>
**If the code throws up any errors, go back and review the code you have written.**</b>


In [44]:
print("Part 4 - Evaluation")

loadedmodel = loaded_pipeline_model.stages[-1]
totalstages = len(loaded_pipeline_model.stages)
inputcolumns = loaded_pipeline_model.stages[0].getInputCols()

print("Number of stages in the pipeline = ", totalstages)
for i,j in zip(inputcolumns, loadedmodel.coefficients):
    print(f"Coefficient for {i} is {round(j,4)}")

Part 4 - Evaluation
Number of stages in the pipeline =  3
Coefficient for Frequency is 0.0
Coefficient for AngleOfAttack is 0.0
Coefficient for ChordLength is 0.0
Coefficient for FreeStreamVelocity is -0.0
Coefficient for SuctionSideDisplacement is 0.0
Coefficient for SoundLevelDecibels is 7.0395


### Stop Spark Session


In [None]:
spark.stop()

## Authors


[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0231ENSkillsNetwork866-2023-01-01)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-05-26|0.1|Ramesh Sannareddy|Initial Version Created|


Copyright © 2023 IBM Corporation. All rights reserved.
