# Dataset Coding Challenge

## Objective:
Prepare a dataset for training an ML model on the airfRANS data.

The airfRANS dataset is a collection of airfoils and their RANS simulation solutions. The [documentation](https://airfrans.readthedocs.io/en/latest/notes/dataset.html) describes in detail both the dataset and the functionality of the accompanying Python library.

Ask:
* Create a dataset for training an ML model using the airfRANS dataset.
  * The dataset should provide a sequence of points as the input `(x, y, sdf)`, where `sdf` is the distance from the airfoil, and the velocity `(x, y, v_x, v_y)` as the target. Package the data so that it can be loaded quickly for training a model.
* Provide some dataset statistics to help users understand the data.
* Document your design decisions.

### Note:
This is an intentionally open-ended challenge. The primary objective is to see how you write code. Think of this as an evaluation of your ability to write code for a production environment.

## Time
This challenge is designed to take ~2-3 hours. If it's taking much longer than that, feel free to stop and document what your next steps would have been.

## Deliverables
Your choice. You can send us a GitHub repo, a Jupyter notebook, or just raw files. Do whatever you think will best demonstrate your software engineering skills.

# Data Processing
The code below provides a starting point for basic loading of the airfRANS dataset.

In [None]:
!pip install airfrans --quiet
!pip install pyvista --quiet
!sudo apt install libgl1-mesa-glx xvfb

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libfontenc1 libxfont2 libxkbfile1 x11-xkb-utils xfonts-base xfonts-encodings
  xfonts-utils xserver-common
The following NEW packages will be installed:
  libfontenc1 libgl1-mesa-glx libxfont2 libxkbfile1 x11-xkb-utils xfonts-base
  xfonts-encodings xfonts-utils xserver-common xvfb
0 upgraded, 10 newly installed, 0 to remove and 49 not upgraded.
Need to get 7,820 kB of archives.
After this operation, 12.0 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libfontenc1 amd64 1:1.1.4-1build3 [14.7 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe 

In [None]:
import airfrans as af
import os
from pathlib import Path

In [None]:
# This has already been run
# Download the dataset
# NOTE: Dataset documentation can be found here: https://airfrans.readthedocs.io/en/latest/notes/dataset.html

directory_name = Path('airfrans/')
file_name = 'Dataset'
if not directory_name.exists() or not any(directory_name.iterdir()):
  af.dataset.download(root=str(directory_name), file_name=file_name, unzip=True, OpenFOAM=False)

Downloading AirfRANS: 9.34GB [07:17, 22.9MB/s]                            


Extracting Dataset.zip at airfrans...


In [None]:
# Note: we use the `reynolds` task here to prevent a memory error with Colab. Feel free to use `full` if doing this offline.
dataset_list, dataset_name = af.dataset.load(root=str(directory_name/file_name), task = 'reynolds', train = True)

Loading dataset (task: reynolds, split: train): 100%|██████████| 504/504 [04:30<00:00,  1.86it/s]


In [None]:
# Basic dataset overview

T = 298.15 # Temperature in kelvin, as recommended in the original documentation (not used in this cell)
inlet_velocity = []
angle_attack = []
digits4 = []
digits5 = []

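for item in dataset_name:
    # Names look like `airFoil2D_SST_<inlet velocity>_<angle of attack>_<NACA parameters...>`
    params = item.split('_')
    inlet_velocity.append(float(params[2]))
    angle_attack.append(float(params[3]))

    # 7 fields means 3 trailing NACA parameters (4-digit series); otherwise 4 parameters (5-digit series)
    if len(params) == 7: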
        digits4.append(list(map(float, params[-3:])))
    else:
        digits5.append(list(map(float, params[-4:])))

print("\n--- ♥ Dataset Overview ♥ ---")
print(f"\nDataset_name: {type(dataset_name)} containing elements of type {type(dataset_name[0])}")
print(f"\n\tExample element: {dataset_name[0]}")
print(f"\tLength: {len(dataset_name):,} elements")
print(f"\t- NACA 4-digit count: {len(digits4):,}")
print(f"\t- NACA 5-digit count: {len(digits5):,}")
print(f"\tInlet velocity range: {min(inlet_velocity):.2f} to {max(inlet_velocity):.2f}")
print(f"\tAngle of attack range: {min(angle_attack):.2f} to {max(angle_attack):.2f}")

print(f"\nDataset_list: {type(dataset_list)} containing elements of type {type(dataset_list[0])}")
print(f"\n\tLength: {len(dataset_list):,} elements")
print(f"\tShape of each element: (N, 12)")
print(f"\t-  N: number of points in simulation")
print(f"\t- 12: number of features")
print(f"\tNumber of points in first simulation as reference: {len(dataset_list[0][:]):,}")
print(f"\t- Note: number of points in simulations vary a bit")

features = [
    ["\tPosition x", "Position y", "Inlet velocity x", "Inlet velocity y"],
    ["\tDist to airfoil", "Normal x", "Normal y", "Velocity x"],
    ["\tVelocity y", "Pressure/mass", "Turb. kin. viscosity", "Bool*"]
]

print("\n\tDescription of 12 features, in order: ")
print('\t\n'.join(['\t'.join([str(cell) for cell in row]) for row in features]))
print("\t\n\t*Evals to true if point lies on the airfoil")


--- ♥ Dataset Overview ♥ ---

Dataset_name: <class 'list'> containing elements of type <class 'str'>

	Example element: airFoil2D_SST_58.831_-3.563_2.815_4.916_10.078
	Length: 504 elements
	- NACA 4-digit count: 239
	- NACA 5-digit count: 265
	Inlet velocity range: 46.84 to 77.95
	Angle of attack range: -4.93 to 14.79

Dataset_list: <class 'list'> containing elements of type <class 'numpy.ndarray'>

	Length: 504 elements
	Shape of each element: (N, 12)
	-  N: number of points in simulation
	- 12: number of features
	Number of points in first simulation as reference: 170,180
	- Note: number of points in simulations vary a bit

	Description of 12 features, in order: 
	Position x	Position y	Inlet velocity x	Inlet velocity y	
	Dist to airfoil	Normal x	Normal y	Velocity x	
	Velocity y	Pressure/mass	Turb. kin. viscosity	Bool*
	
	*Evals to true if point lies on the airfoil
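
Beyond the overview above, per-feature statistics (min, max, mean, standard deviation) are a natural next step for the "dataset statistics" part of the ask. The sketch below is one possible way to compute them one simulation at a time, so the list of arrays never has to be concatenated. The function name `streaming_column_stats` and the synthetic `fake_sims` data are illustrative assumptions, not part of the airfrans library.

```python
import numpy as np

def streaming_column_stats(arrays):
    """Per-column min, max, mean and std over a list of (N_i, F) arrays.

    Processes one array at a time, so peak extra memory is one simulation.
    Assumes every array is non-empty and has the same number of columns.
    """
    count = 0
    total = total_sq = col_min = col_max = None
    for arr in arrays:
        arr = np.asarray(arr, dtype=np.float64)  # accumulate in float64 for accuracy
        if total is None:
            n_cols = arr.shape[1]
            total = np.zeros(n_cols)
            total_sq = np.zeros(n_cols)
            col_min = np.full(n_cols, np.inf)
            col_max = np.full(n_cols, -np.inf)
        count += arr.shape[0]
        total += arr.sum(axis=0)
        total_sq += (arr ** 2).sum(axis=0)
        col_min = np.minimum(col_min, arr.min(axis=0))
        col_max = np.maximum(col_max, arr.max(axis=0))
    mean = total / count
    # Population variance via E[x^2] - E[x]^2, clipped at 0 against rounding error
    var = np.maximum(total_sq / count - mean ** 2, 0.0)
    return {"count": count, "min": col_min, "max": col_max,
            "mean": mean, "std": np.sqrt(var)}

# Synthetic stand-in for `dataset_list`: three "simulations" with 12 features each
rng = np.random.default_rng(0)
fake_sims = [rng.normal(size=(n, 12)) for n in (5, 7, 3)]
stats = streaming_column_stats(fake_sims)
print(stats["count"], stats["mean"].shape)
```

With the real data, `streaming_column_stats(dataset_list)` would return one value per feature column. The sum-of-squares formula is simple but can lose precision when a column's mean is large relative to its spread; Welford's algorithm is a more robust alternative in that case.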


In [None]:
# Input:   dataset_list
# Output:  tensor(input: x, y, sdf), tensor(target: x, y, v_x, v_y)
#
# Challenges:
#   1. Memory usage
#      - `af.dataset.load` pulls all data into RAM, leaving little headroom for anything else.
#   2. Data structure
#      - The dataset is a list of ndarrays (one per simulation), so global indexing is not direct.
#        For example, to fetch the 500,000th data point, we first need to find which simulation it belongs to.
#        This is not a simple division, because each simulation has a different number of data points.
#
# Approach:
#   First idea: Build a new dataset holding only the needed features (position, sdf, velocity) and train on that.
#      - Pros: Straightforward, readable code; still a viable option.
#      - Cons: Requires copying data, which quickly exceeded Google Colab memory limits (and I did not want to pay for more).
#      - Next steps: Consider chunking the data and loading only parts at a time to avoid exhausting RAM.
#
# Solution:
#   Avoid making any copies and reference the data already in RAM directly.
#      - Pros: Keeps RAM usage low, so there is no need to spend money.
#      - Cons: Less readable code.
#      - Details: Solved the indexing problem without copying data by mapping a global index to a `sim_id` and a `row`.
#                Stored the cumulative end offset of each simulation in a list (e.g., sim 0 covers indices 0 to 170,179,
#                so sim 1 starts at index 170,180) and used binary search to find `sim_id`.
#      - Next steps: Sanity-check with teammates who understand the physics. Test the current code and explore how
#        geometric deep learning libraries can help.

from bisect import bisect_right

import torch
from torch.utils.data import Dataset, DataLoader

class AirfoilDataset(Dataset):
    """Point-wise view over a list of (N_i, 12) simulation arrays, without copying them."""

    # Column indices of the features we need in each simulation array
    X, Y, SDF, V_X, V_Y = 0, 1, 4, 7, 8

    def __init__(self, dataset_list):
        self.dataset_list = dataset_list
        self.sim_bounds = self._find_sim_bounds()

    def _find_sim_bounds(self):
        # sim_bounds[i] is the global index one past the last point of simulation i
        bound, s_b = 0, []
        for sim in self.dataset_list:
            bound += len(sim)
            s_b.append(bound)
        return s_b

    def __len__(self):
        return self.sim_bounds[-1] if self.sim_bounds else 0

    def __getitem__(self, idx):
        # Support negative indices and fail loudly on out-of-range ones
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(f"index {idx} out of range for dataset of length {len(self)}")

        # Binary search: the first simulation whose end offset is strictly greater than idx
        sim_id = bisect_right(self.sim_bounds, idx)

        # Index of the target datapoint within that simulation
        row = idx - (self.sim_bounds[sim_id - 1] if sim_id > 0 else 0)

        # Used this approach instead of af.Simulation to maintain encapsulation
        point = self.dataset_list[sim_id][row]

        # Put everything in tensors (`inputs` avoids shadowing the built-in `input`)
        inputs = torch.tensor(point[[self.X, self.Y, self.SDF]], dtype=torch.float)
        target = torch.tensor(point[[self.X, self.Y, self.V_X, self.V_Y]], dtype=torch.float)
        return inputs, target
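
The index mapping described in the comments above (global index to `sim_id` and `row` via cumulative bounds and binary search) can be checked on a toy case. The sketch below uses `np.searchsorted`, which does the same binary search for a whole batch of indices at once. The helper name `locate` and the toy simulation lengths are assumptions made for illustration.

```python
import numpy as np

# Toy stand-in: three "simulations" with 4, 2 and 5 points
lengths = np.array([4, 2, 5])
bounds = np.cumsum(lengths)  # cumulative end offsets: [4, 6, 11]

def locate(idx, bounds):
    """Map global point indices to (simulation id, row within that simulation)."""
    idx = np.asarray(idx)
    # side="right" returns the first simulation whose end offset is strictly > idx
    sim_id = np.searchsorted(bounds, idx, side="right")
    # Start offset of each simulation: 0 for the first, previous end offset otherwise
    starts = np.concatenate(([0], bounds[:-1]))
    row = idx - starts[sim_id]
    return sim_id, row

sim_ids, rows = locate([0, 3, 4, 5, 6, 10], bounds)
print(sim_ids)  # [0 0 1 1 2 2]
print(rows)     # [0 3 0 1 0 4]
```

Indices 3 and 4 straddle the first boundary: index 3 is the last row of simulation 0, and index 4 is row 0 of simulation 1. That boundary is where off-by-one mistakes usually appear.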


In [None]:
# Example output

dataset = AirfoilDataset(dataset_list)
print(dataset[0])

(tensor([ 4.2169, -0.1999,  3.2231]), tensor([ 4.2169, -0.1999, 54.5453, -3.3534]))
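
For the "quickly loaded" part of the ask and the chunking idea noted earlier, one option is to write only the five needed columns into a single `.npy` file once, then memory-map it at training time so that only the rows actually read are pulled from disk. The following is a minimal sketch under that assumption. The file names `points.npy` and `offsets.npy`, the helpers `pack_simulations` and `load_simulation`, and the synthetic data are all hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

COLUMNS = [0, 1, 4, 7, 8]  # x, y, sdf, v_x, v_y in the 12-feature layout

def pack_simulations(simulations, out_dir):
    """Write the needed columns of every simulation into one memory-mappable file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # offsets[i]:offsets[i + 1] is the row range of simulation i in the packed file
    offsets = np.concatenate(([0], np.cumsum([len(s) for s in simulations])))
    packed = np.lib.format.open_memmap(
        str(out_dir / "points.npy"), mode="w+", dtype=np.float32,
        shape=(int(offsets[-1]), len(COLUMNS)),
    )
    for start, sim in zip(offsets[:-1], simulations):
        packed[start:start + len(sim)] = sim[:, COLUMNS]
    packed.flush()
    np.save(str(out_dir / "offsets.npy"), offsets)
    return out_dir

def load_simulation(out_dir, sim_id):
    """Return (inputs, targets) for one simulation as a sequence of points."""
    points = np.load(str(Path(out_dir) / "points.npy"), mmap_mode="r")
    offsets = np.load(str(Path(out_dir) / "offsets.npy"))
    block = points[offsets[sim_id]:offsets[sim_id + 1]]
    inputs = np.asarray(block[:, :3])    # x, y, sdf
    targets = block[:, [0, 1, 3, 4]]     # x, y, v_x, v_y
    return inputs, targets

# Synthetic stand-in for `dataset_list`
rng = np.random.default_rng(0)
fake_sims = [rng.normal(size=(n, 12)).astype(np.float32) for n in (4, 6)]
packed_dir = pack_simulations(fake_sims, tempfile.mkdtemp())
inputs, targets = load_simulation(packed_dir, 1)
print(inputs.shape, targets.shape)  # (6, 3) (6, 4)
```

Returning one simulation per call also matches the "sequence of points" wording in the ask. A `Dataset` built on `load_simulation` would yield variable-length sequences, so batching would need padding or a custom `collate_fn`.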
