# Build an ML Pipeline for Airfoil Noise Prediction

**Airfoil**: The cross-sectional shape of a wing, blade, or sail, designed to generate lift when air flows over it. In aeronautics, airfoils are critical components of aircraft wings, helicopter rotors, propellers, and turbine blades. The shape of an airfoil directly affects its aerodynamic performance, including lift, drag, and, importantly, the noise it produces as air flows over its surface. Understanding and predicting airfoil noise is essential for designing quieter, more efficient aircraft and reducing environmental noise pollution.

## Setup

In [3]:
%pip install pyspark
%pip install findspark

Collecting pyspark
  Using cached pyspark-4.0.1-py2.py3-none-any.whl
Collecting py4j==0.10.9.9 (from pyspark)
  Using cached py4j-0.10.9.9-py2.py3-none-any.whl.metadata (1.3 kB)
Using cached py4j-0.10.9.9-py2.py3-none-any.whl (203 kB)
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.9 pyspark-4.0.1
Note: you may need to restart the kernel to use updated packages.
Collecting findspark
  Using cached findspark-2.0.1-py2.py3-none-any.whl.metadata (352 bytes)
Using cached findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1
Note: you may need to restart the kernel to use updated packages.


In [None]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

## Part 1 - Perform ETL activity

### Import required libraries

In [4]:
from pyspark.sql import SparkSession

### Create a Spark session

In [5]:
spark = SparkSession.builder.appName("Airfoil Noise Prediction").getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/14 11:16:47 WARN Utils: Your hostname, maishuji, resolves to a loopback address: 127.0.1.1; using 192.168.0.18 instead (on interface wlp4s0)
25/12/14 11:16:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/14 11:16:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Load the CSV file into a DataFrame

In [10]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv -O ./../data/raw/NASA_airfoil_noise_raw.csv


--2025-12-14 11:19:12--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60682 (59K) [text/csv]
Saving to: ‘./../data/raw/NASA_airfoil_noise_raw.csv’


2025-12-14 11:19:13 (384 KB/s) - ‘./../data/raw/NASA_airfoil_noise_raw.csv’ saved [60682/60682]



In [26]:
df = spark.read.csv("./../data/raw/NASA_airfoil_noise_raw.csv", header=True, inferSchema=True)

### Print top 5 rows of the dataset

In [27]:
df.show(5)

+---------+-------------+-----------+------------------+-----------------------+----------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevel|
+---------+-------------+-----------+------------------+-----------------------+----------+
|      800|          0.0|     0.3048|              71.3|             0.00266337|   126.201|
|     1000|          0.0|     0.3048|              71.3|             0.00266337|   125.201|
|     1250|          0.0|     0.3048|              71.3|             0.00266337|   125.951|
|     1600|          0.0|     0.3048|              71.3|             0.00266337|   127.591|
|     2000|          0.0|     0.3048|              71.3|             0.00266337|   127.461|
+---------+-------------+-----------+------------------+-----------------------+----------+
only showing top 5 rows


### Print the total number of rows in the dataset

In [28]:
rowcount1 = df.count()
print(f"Row count before removing duplicates and nulls: {rowcount1}")

Row count before removing duplicates and nulls: 1522


### Drop all the duplicate rows from the dataset

In [29]:
df = df.dropDuplicates()

### Print the total number of rows in the dataset

In [30]:
rowcount2 = df.count()
print(f"Row count after removing duplicates: {rowcount2}")

Row count after removing duplicates: 1503


### Drop all the rows that contain null values from the dataset

In [31]:
df = df.dropna()

### Print the total number of rows in the dataset

In [32]:
rowcount3 = df.count()
print(f"Row count after removing nulls: {rowcount3}")

Row count after removing nulls: 1499


### Rename the column "SoundLevel" to "SoundLevelDecibels"

In [33]:
df = df.withColumnRenamed("SoundLevel", "SoundLevelDecibels")

### Save the DataFrame in Parquet format as "NASA_airfoil_noise_cleaned.parquet"

In [40]:
df.write.parquet("./../data/processed/NASA_airfoil_noise_cleaned.parquet", mode="overwrite")

### Part 1 - Evaluation

In [41]:
print("Part 1 - Evaluation")

print("Total rows = ", rowcount1)
print("Total rows after dropping duplicate rows = ", rowcount2)
print("Total rows after dropping duplicate rows and rows with null values = ", rowcount3)
print("New column name = ", df.columns[-1])

import os

print("NASA_airfoil_noise_cleaned.parquet exists :", os.path.isdir("./../data/processed/NASA_airfoil_noise_cleaned.parquet"))

Part 1 - Evaluation
Total rows =  1522
Total rows after dropping duplicate rows =  1503
Total rows after dropping duplicate rows and rows with null values =  1499
New column name =  SoundLevelDecibels
NASA_airfoil_noise_cleaned.parquet exists : True


## Part 2 - Create a Machine Learning Pipeline

### Load data from the Parquet file

### Print the total number of rows in the dataset

### Define the VectorAssembler pipeline stage

### Define the StandardScaler pipeline stage

### Define the Model creation pipeline stage

### Build the pipeline

### Split the data

### Fit the pipeline

### Part 2 - Evaluation

In [None]:
print("Part 2 - Evaluation")
print("Total rows = ", rowcount4)
ps = [str(x).split("_")[0] for x in pipeline.getStages()]

print("Pipeline Stage 1 = ", ps[0])
print("Pipeline Stage 2 = ", ps[1])
print("Pipeline Stage 3 = ", ps[2])

print("Label column = ", lr.getLabelCol())