# Build an ML Pipeline for Airfoil Noise Prediction

**Airfoil**: The cross-sectional shape of a wing, blade, or sail, designed to generate lift when air flows over it. In aeronautics, airfoils are critical components of aircraft wings, helicopter rotors, propellers, and turbine blades. The shape of an airfoil directly affects its aerodynamic performance: how much lift it generates, how much drag it incurs, and how much noise it produces as air flows over its surface. Predicting airfoil noise helps engineers design quieter, more efficient aircraft and reduce environmental noise pollution.

## Setup

In [47]:
%pip install pyspark
%pip install findspark
%pip install numpy

Note: you may need to restart the kernel to use updated packages.
Collecting numpy
  Using cached numpy-2.3.5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (62 kB)
Using cached numpy-2.3.5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (16.6 MB)
Installing collected packages: numpy
Successfully installed numpy-2.3.5
Note: you may need to restart the kernel to use updated packages.


In [None]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

## Part 1 - Perform ETL activity

### Import required libraries

In [4]:
from pyspark.sql import SparkSession

### Create a Spark session

In [5]:
spark = SparkSession.builder.appName("Airfoil Noise Prediction").getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/14 11:16:47 WARN Utils: Your hostname, maishuji, resolves to a loopback address: 127.0.1.1; using 192.168.0.18 instead (on interface wlp4s0)
25/12/14 11:16:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/14 11:16:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Load the CSV file into a DataFrame

In [10]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv -O ./../data/raw/NASA_airfoil_noise_raw.csv


--2025-12-14 11:19:12--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60682 (59K) [text/csv]
Saving to: ‘./../data/raw/NASA_airfoil_noise_raw.csv’


2025-12-14 11:19:13 (384 KB/s) - ‘./../data/raw/NASA_airfoil_noise_raw.csv’ saved [60682/60682]



In [26]:
df = spark.read.csv("./../data/raw/NASA_airfoil_noise_raw.csv", header=True, inferSchema=True)

### Print top 5 rows of the dataset

In [27]:
df.show(5)

+---------+-------------+-----------+------------------+-----------------------+----------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevel|
+---------+-------------+-----------+------------------+-----------------------+----------+
|      800|          0.0|     0.3048|              71.3|             0.00266337|   126.201|
|     1000|          0.0|     0.3048|              71.3|             0.00266337|   125.201|
|     1250|          0.0|     0.3048|              71.3|             0.00266337|   125.951|
|     1600|          0.0|     0.3048|              71.3|             0.00266337|   127.591|
|     2000|          0.0|     0.3048|              71.3|             0.00266337|   127.461|
+---------+-------------+-----------+------------------+-----------------------+----------+
only showing top 5 rows


### Print the total number of rows in the dataset

In [28]:
rowcount1 = df.count()
print(f"Row count before removing duplicates and nulls: {rowcount1}")

Row count before removing duplicates and nulls: 1522


### Drop all the duplicate rows from the dataset

In [29]:
df = df.dropDuplicates()

### Print the total number of rows in the dataset

In [30]:
rowcount2 = df.count()
print(f"Row count after removing duplicates: {rowcount2}")

Row count after removing duplicates: 1503


### Drop all the rows that contain null values from the dataset

In [31]:
df = df.dropna()
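
The two cleaning steps can be pictured on a handful of rows. The sketch below is plain Python rather than Spark, and the `rows` list is made-up illustration data, not taken from the dataset. It assumes the default behaviour of `dropDuplicates()` (rows must match on every column) and of `dropna()` (`how="any"`, so a single null is enough to drop a row). Unlike this sketch, Spark does not guarantee which copy of a duplicate survives or the order of the result.

```python
# Toy rows standing in for the DataFrame; None marks a missing value.
# These values are illustrative only.
rows = [
    (800, 0.0, 126.201),
    (800, 0.0, 126.201),   # exact duplicate of the first row
    (1000, 0.0, None),     # row with a missing value
    (1250, 0.0, 125.951),
]

# Step 1: drop exact duplicates (dict keys are unique and keep insertion order).
deduplicated = list(dict.fromkeys(rows))

# Step 2: drop any row that contains at least one missing value.
cleaned = [row for row in deduplicated if all(value is not None for value in row)]

print(len(rows), len(deduplicated), len(cleaned))
```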

### Print the total number of rows in the dataset

In [32]:
rowcount3 = df.count()
print(f"Row count after removing nulls: {rowcount3}")

Row count after removing nulls: 1499


### Rename the column "SoundLevel" to "SoundLevelDecibels"

In [33]:
df = df.withColumnRenamed("SoundLevel", "SoundLevelDecibels")

### Save the DataFrame in Parquet format as "NASA_airfoil_noise_cleaned.parquet"

In [40]:
df.write.parquet("./../data/processed/NASA_airfoil_noise_cleaned.parquet", mode="overwrite")

### Part 1 - Evaluation

In [41]:
print("Part 1 - Evaluation")

print("Total rows = ", rowcount1)
print("Total rows after dropping duplicate rows = ", rowcount2)
print("Total rows after dropping duplicate rows and rows with null values = ", rowcount3)
print("New column name = ", df.columns[-1])

import os

print("NASA_airfoil_noise_cleaned.parquet exists :", os.path.isdir("./../data/processed/NASA_airfoil_noise_cleaned.parquet"))

Part 1 - Evaluation
Total rows =  1522
Total rows after dropping duplicate rows =  1503
Total rows after dropping duplicate rows and rows with null values =  1499
New column name =  SoundLevelDecibels
NASA_airfoil_noise_cleaned.parquet exists : True
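
As a quick sanity check on the counts printed above, the number of rows removed by each cleaning step follows from simple subtraction. The sketch below hard-codes the three counts reported in this run; the variable names for the differences are introduced here for illustration only.

```python
# Row counts reported by the notebook run above.
rowcount1 = 1522  # raw rows
rowcount2 = 1503  # after dropDuplicates()
rowcount3 = 1499  # after dropna()

duplicates_removed = rowcount1 - rowcount2
null_rows_removed = rowcount2 - rowcount3
retained_fraction = rowcount3 / rowcount1

print("Duplicate rows removed:", duplicates_removed)
print("Rows with nulls removed:", null_rows_removed)
print(f"Fraction of rows retained: {retained_fraction:.3f}")
```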


## Part 2 - Create a Machine Learning Pipeline

In [48]:
from pyspark.ml import Pipeline
from pyspark.ml.pipeline import PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import StandardScaler
from pyspark.ml.evaluation import RegressionEvaluator

### Load data from the Parquet file

In [42]:
df = spark.read.parquet("./../data/processed/NASA_airfoil_noise_cleaned.parquet")
df.show(5)

+---------+-------------+-----------+------------------+-----------------------+------------------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevelDecibels|
+---------+-------------+-----------+------------------+-----------------------+------------------+
|      400|          0.0|     0.3048|              31.7|             0.00331266|           125.045|
|    16000|          1.5|     0.3048|              71.3|             0.00336729|           106.582|
|      630|          3.0|     0.3048|              55.5|             0.00452492|           129.569|
|      315|          4.0|     0.2286|              31.7|             0.00509068|           123.609|
|      630|          4.0|     0.2286|              31.7|             0.00509068|           130.349|
+---------+-------------+-----------+------------------+-----------------------+------------------+
only showing top 5 rows


### Print the total number of rows in the dataset

In [43]:
rowcount4 = df.count()
print(f"Row count after reading the cleaned parquet file: {rowcount4}")

Row count after reading the cleaned parquet file: 1499


### Define the VectorAssembler pipeline stage

Use all the columns except the last one ("SoundLevelDecibels") and assemble them into a single vector column, "features".

In [None]:
df.columns[:-1]

In [49]:

assembler = VectorAssembler(inputCols=df.columns[:-1], outputCol="features")
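
To make the slicing concrete, here is a plain-Python sketch of what "all columns except the last" selects and what one assembled feature vector looks like. It uses the column names and the first data row shown earlier in this notebook; it only mimics the effect of `VectorAssembler` and is not how Spark implements it.

```python
# Column order as shown by df.show() after the rename.
columns = [
    "Frequency",
    "AngleOfAttack",
    "ChordLength",
    "FreeStreamVelocity",
    "SuctionSideDisplacement",
    "SoundLevelDecibels",
]

feature_columns = columns[:-1]  # every column except the label
label_column = columns[-1]

# First row of the raw dataset, as printed earlier.
row = {
    "Frequency": 800,
    "AngleOfAttack": 0.0,
    "ChordLength": 0.3048,
    "FreeStreamVelocity": 71.3,
    "SuctionSideDisplacement": 0.00266337,
    "SoundLevelDecibels": 126.201,
}

# "Assembling" collects the feature values into one numeric vector.
features = [float(row[name]) for name in feature_columns]
label = row[label_column]

print(features, label)
```

Slicing by position works only while the label stays in the last column. Filtering by name, for example keeping every column whose name is not "SoundLevelDecibels", is a more robust alternative if the column order might change.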

### Define the StandardScaler pipeline stage

In [50]:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
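
With its default settings (`withStd=True`, `withMean=False`), Spark's `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean. The sketch below reproduces that arithmetic in plain Python for a single feature, using the five `Frequency` values from the first rows shown earlier. It is an illustration of the idea under those assumed defaults, not Spark's implementation.

```python
import statistics

# Frequency values from the first five rows of the raw dataset.
frequencies = [800.0, 1000.0, 1250.0, 1600.0, 2000.0]

# Sample standard deviation (divides by n - 1).
std = statistics.stdev(frequencies)

# Scale to unit standard deviation without centering,
# which is what withStd=True, withMean=False amounts to.
scaled = [value / std for value in frequencies]

print("std of original:", std)
print("std of scaled:  ", statistics.stdev(scaled))
print("mean of scaled: ", statistics.mean(scaled))
```

The scaled values have unit standard deviation but a non-zero mean. Passing `withMean=True` would also center the features.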

### Define the Model creation pipeline stage

In [54]:
lr = LinearRegression(featuresCol="scaledFeatures", labelCol="SoundLevelDecibels")

### Build the pipeline

In [55]:
pipeline = Pipeline(stages=[assembler, scaler, lr])

### Split the data

In [51]:
(trainingData, testData) = df.randomSplit([0.7, 0.3], seed=42)
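
`randomSplit` assigns each row to a split at random according to the weights, so the resulting sizes are only approximately 70% and 30%. The sketch below imitates that idea in plain Python with a hypothetical helper, `random_split`. It uses Python's own random number generator, so it will not reproduce the exact rows Spark selects for `seed=42`.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one of two splits with probability given by weights."""
    threshold = weights[0] / sum(weights)  # normalise in case weights do not sum to 1
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < threshold else second).append(row)
    return first, second


# Stand-in for the 1499 cleaned rows.
rows = list(range(1499))
train, test = random_split(rows, [0.7, 0.3], seed=42)

print(len(train), len(test))
```

Fixing the seed makes the split repeatable from run to run, which keeps the evaluation metrics comparable.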

### Fit the pipeline

In [58]:
pipelineModel = pipeline.fit(trainingData)

25/12/14 12:23:32 WARN Instrumentation: [6b535e65] regParam is zero, which might cause numerical instability and overfitting.


### Part 2 - Evaluation

In [59]:
print("Part 2 - Evaluation")
print("Total rows = ", rowcount4)
ps = [str(x).split("_")[0] for x in pipeline.getStages()]

print("Pipeline Stage 1 = ", ps[0])
print("Pipeline Stage 2 = ", ps[1])
print("Pipeline Stage 3 = ", ps[2])

print("Label column = ", lr.getLabelCol())

Part 2 - Evaluation
Total rows =  1499
Pipeline Stage 1 =  VectorAssembler
Pipeline Stage 2 =  StandardScaler
Pipeline Stage 3 =  LinearRegression
Label column =  SoundLevelDecibels


## Part 3 - Evaluation of the model

### Predict using the model

In [60]:
predictions = pipelineModel.transform(testData)

In [61]:
predictions.select("features", "SoundLevelDecibels", "prediction").show(5)

+--------------------+------------------+------------------+
|            features|SoundLevelDecibels|        prediction|
+--------------------+------------------+------------------+
|[200.0,7.3,0.2286...|           128.679|122.59722914376778|
|[200.0,8.9,0.1016...|            133.42|127.37968204568838|
|[200.0,9.5,0.0254...|           119.146|130.34077425074506|
|[200.0,9.5,0.0254...|           116.074|131.11016975113537|
|[200.0,9.9,0.1524...|           134.319|127.12627360125096|
+--------------------+------------------+------------------+
only showing top 5 rows
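
The three metrics used in this part can be written out by hand. The sketch below defines them in plain Python and applies them to the five label and prediction pairs displayed above. The function names are introduced here for illustration; `RegressionEvaluator` computes the same quantities over the full test set.

```python
# Label and prediction pairs copied from the five rows shown above.
labels = [128.679, 133.42, 119.146, 116.074, 134.319]
predictions = [
    122.59722914376778,
    127.37968204568838,
    130.34077425074506,
    131.11016975113537,
    127.12627360125096,
]


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot


print("MSE:", mean_squared_error(labels, predictions))
print("MAE:", mean_absolute_error(labels, predictions))
print("R2: ", r_squared(labels, predictions))
```

Five rows are far too few to be representative, so these numbers should not be expected to match the metrics computed on the whole test set.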


### Print the MSE

In [62]:
evaluator = RegressionEvaluator(labelCol="SoundLevelDecibels", predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(predictions)
print("Mean Squared Error (MSE) on test data = ", mse)

Mean Squared Error (MSE) on test data =  24.99766625502418
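
MSE is expressed in squared decibels, which is hard to interpret directly. Taking the square root gives the root mean squared error (RMSE) in the same unit as the label. The sketch below applies that to the MSE value printed above.

```python
import math

# MSE reported on the test data in this run.
mse = 24.99766625502418

# RMSE is in the same unit as SoundLevelDecibels.
rmse = math.sqrt(mse)

print(f"Root Mean Squared Error (RMSE) on test data = {rmse:.2f}")
```

`RegressionEvaluator` can also report this directly with `metricName="rmse"`.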


### Print the MAE

In [63]:
evaluator = RegressionEvaluator(labelCol="SoundLevelDecibels", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("Mean Absolute Error (MAE) on test data = ", mae)

Mean Absolute Error (MAE) on test data =  3.9136790958812044


### Print the R-Squared (R2)

In [64]:
evaluator = RegressionEvaluator(labelCol="SoundLevelDecibels", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R Squared (R2) on test data = ", r2)

R Squared (R2) on test data =  0.4959688408974623


### Part 3 - Evaluation

In [65]:
print("Part 3 - Evaluation")

print("Mean Squared Error = ", round(mse,2))
print("Mean Absolute Error = ", round(mae,2))
print("R Squared = ", round(r2,2))

lrModel = pipelineModel.stages[-1]

print("Intercept = ", round(lrModel.intercept,2))

Part 3 - Evaluation
Mean Squared Error =  25.0
Mean Absolute Error =  3.91
R Squared =  0.5
Intercept =  132.88
