# Dataset Coding Challenge

## Objective:
Prepare a dataset for training an ML model on the airfRANS data.

The airfRANS dataset is a collection of airfoils and their RANS simulation solutions. The [documentation](https://airfrans.readthedocs.io/en/latest/notes/dataset.html) describes in detail both the dataset and the accompanying Python library.

Ask:
* Create a dataset for training an ML model using the airfRANS dataset.
  * The dataset should provide a sequence of points with their SDF (distance from the airfoil) value as the input `(x, y, sdf)` and the velocity `(x, y, v_x, v_y)` as the target. Package the data so that it can be loaded quickly when training a model.
* Provide some dataset statistics to help users understand the data.
* Document your design decisions.

### Note:
This is an intentionally open-ended challenge. The primary objective is to see how you write code. Think of this as an evaluation of your ability to write code for a production environment.

## Time
This challenge is designed to take ~2-3 hours. If it's taking much longer than that, feel free to stop and document what your next steps would have been.

## Deliverables
Your choice. You can send us a GitHub repo, a Jupyter notebook, or just raw files. Do whatever you think will best demonstrate your software engineering skills.

# Data Processing
The code below is a sample that covers basic loading of the airfRANS dataset.

In [2]:
!pip install airfrans --quiet
!pip install pyvista --quiet
!sudo apt install libgl1-mesa-glx xvfb

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libgl1-mesa-glx is already the newest version (23.0.4-0ubuntu1~22.04.1).
xvfb is already the newest version (2:21.1.4-2ubuntu1.7~22.04.12).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [3]:
import airfrans as af
import os
import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader
from itertools import chain, islice
from pathlib import Path


In [4]:
# This has already been run
# Download the dataset
# NOTE: Dataset documentation can be found here: https://airfrans.readthedocs.io/en/latest/notes/dataset.html

directory_name = Path('airfrans/')
file_name = 'Dataset'
if not directory_name.exists() or not any(directory_name.iterdir()):
    af.dataset.download(root=str(directory_name), file_name=file_name, unzip=True, OpenFOAM=False)

In [5]:
# Note: we use the `reynolds` task here to prevent a memory error with Colab. Feel free to use `full` if doing this offline.
dataset_list, dataset_name = af.dataset.load(root=str(directory_name / file_name), task='reynolds', train=True)

Loading dataset (task: reynolds, split: train): 100%|██████████| 504/504 [04:11<00:00,  2.00it/s]


In [7]:
# Some dataset basics!

T = 298.15 # As recommended in original documentation
inlet_velocity = []
angle_attack = []
digits4 = []
digits5 = []

for item in dataset_name:
    params = item.split('_')
    inlet_velocity.append(float(params[2]))
    angle_attack.append(float(params[3]))

    if len(params) == 7:
        digits4.append(list(map(float, params[-3:])))
    else:
        digits5.append(list(map(float, params[-4:])))

print("\n--- ♥ Dataset Overview ♥ ---")
print(f"\nDataset_name: {type(dataset_name)} containing elements of type {type(dataset_name[0])}")
print(f"\n\tExample element: {dataset_name[0]}")
print(f"\tLength: {len(dataset_name):,} elements")
print(f"\t- NACA 4-digit count: {len(digits4):,}")
print(f"\t- NACA 5-digit count: {len(digits5):,}")
print(f"\tInlet velocity range: {min(inlet_velocity):.2f} to {max(inlet_velocity):.2f}")
print(f"\tAngle of attack range: {min(angle_attack):.2f} to {max(angle_attack):.2f}")

print(f"\nDataset_list: {type(dataset_list)} containing elements of type {type(dataset_list[0])}")
print(f"\n\tLength: {len(dataset_list):,} elements")
print(f"\tShape of each element: (N, 12)")
print(f"\t-  N: number of points in simulation")
print(f"\t- 12: number of features")
print(f"\tNumber of points in first simulation as reference: {len(dataset_list[0][:]):,}")
print(f"\t- Note: number of points in simulations vary a bit")

features = [
    ["\tPosition x", "Position y", "Inlet velocity x", "Inlet velocity y"],
    ["\tDist to airfoil", "Normals x", "Normals y", "Velocity x"],
    ["\tVelocity y", "Pressure/mass", "Turb. kin. viscosity", "Bool*"]
]

print("\n\tDescription of 12 features, in order: ")
print('\t\n'.join(['\t'.join([str(cell) for cell in row]) for row in features]))
print("\t\n\t*Evals to true if point lies on the airfoil")


--- ♥ Dataset Overview ♥ ---

Dataset_name: <class 'list'> containing elements of type <class 'str'>

	Example element: airFoil2D_SST_58.831_-3.563_2.815_4.916_10.078
	Length: 504 elements
	- NACA 4-digit count: 239
	- NACA 5-digit count: 265
	Inlet velocity range: 46.84 to 77.95
	Angle of attack range: -4.93 to 14.79

Dataset_list: <class 'list'> containing elements of type <class 'numpy.ndarray'>

	Length: 504 elements
	Shape of each element: (N, 12)
	-  N: number of points in simulation
	- 12: number of features
	Number of points in first simulation as reference: 170,180
	- Note: number of points in simulations vary a bit

	Description of 12 features, in order: 
	Position x	Position y	Inlet velocity x	Inlet velocity y	
	Dist to airfoil	Normals x	Normals y	Velocity x	
	Velocity y	Pressure/mass	Turb. kin. viscosity	Bool*
	
	*Evals to true if point lies on the airfoil
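
The overview above covers names and shapes; per-feature statistics can be gathered in the same spirit without stacking all simulations into one array. The following is a minimal sketch, not part of the original notebook: `column_statistics` and `COLUMNS` are hypothetical names, the column indices follow the feature order printed above, and the synthetic arrays only stand in for `dataset_list`.

```python
import numpy as np

# Column indices in each (N, 12) simulation array, following the feature order above
COLUMNS = {"x": 0, "y": 1, "sdf": 4, "v_x": 7, "v_y": 8}


def column_statistics(simulations, columns=COLUMNS):
    """Per-column count, mean, std, min and max, accumulated one simulation at a time."""
    idx = list(columns.values())
    count = 0
    total = np.zeros(len(idx), dtype=np.float64)
    total_sq = np.zeros(len(idx), dtype=np.float64)
    minimum = np.full(len(idx), np.inf)
    maximum = np.full(len(idx), -np.inf)

    for sim in simulations:
        if sim.shape[0] == 0:
            continue  # nothing to accumulate for an empty simulation
        values = sim[:, idx].astype(np.float64)  # small per-simulation copy only
        count += values.shape[0]
        total += values.sum(axis=0)
        total_sq += (values ** 2).sum(axis=0)
        minimum = np.minimum(minimum, values.min(axis=0))
        maximum = np.maximum(maximum, values.max(axis=0))

    mean = total / count
    # Population variance; clip tiny negative values caused by rounding
    std = np.sqrt(np.clip(total_sq / count - mean ** 2, 0.0, None))
    return {
        name: {"count": count, "mean": mean[i], "std": std[i],
               "min": minimum[i], "max": maximum[i]}
        for i, name in enumerate(columns)
    }


# Synthetic stand-in for `dataset_list`: three simulations of different lengths
rng = np.random.default_rng(0)
fake_simulations = [rng.normal(size=(n, 12)).astype(np.float32) for n in (5, 8, 3)]
stats = column_statistics(fake_simulations)
for name, s in stats.items():
    print(f"{name:>4}: mean={s['mean']:.3f} std={s['std']:.3f} "
          f"min={s['min']:.3f} max={s['max']:.3f}")
```

On the real data, `column_statistics(dataset_list)` would give the means and standard deviations one would typically use to normalise inputs and targets. The sum-of-squares formula is simple but can lose precision for very large values; Welford's algorithm is a more robust alternative.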


In [14]:
# Input:   dataset_list
# Output:  tensor(input: x, y, sdf), tensor(target: x, y, v_x, v_y)
#
# Challenges:
#   Memory Usage.
#      - `af.dataset.load` pulls all data into RAM, little left to play with.
#   Indexing.
#      - Pulling datapoints from multiple simulations means we want an indexable dataset, but simulations have varying numbers of points (rows)
#      - We could use something like np.vstack, but this would create a copy of the dataset. Back to the memory problem.
#
# Solution:
#   Avoid making any copies and directly reference data in RAM.
#      - Pros: Keeps RAM usage low! No need to spend money.
#      - Next steps: Do sanity check with teammates who understand physics part of it. Test the current code and explore how
#        Geometric Deep Learning libraries could help (mentioned in airfRANS documentation)

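class AirfoilDataset(Dataset):
    """Point-wise view over a list of (N_i, 12) simulation arrays, without copying them."""

    def __init__(self, dataset_list):
        self.dataset_list = dataset_list

        # offsets[i] is the global index of the first point of simulation i;
        # offsets[-1] is the total number of datapoints
        sizes = [sim.shape[0] for sim in dataset_list]
        self.offsets = np.concatenate(([0], np.cumsum(sizes, dtype=np.int64)))
        self.dataset_length = int(self.offsets[-1])

    def __len__(self):
        return self.dataset_length

    def __getitem__(self, idx):
        # A shared chain/islice iterator is consumed by every call, so it cannot
        # serve random access; look the row up from the offsets instead
        if idx < 0:
            idx += self.dataset_length
        if not 0 <= idx < self.dataset_length:
            raise IndexError(f"index {idx} out of range for {self.dataset_length} points")

        # map the global index to (simulation, row) with a binary search
        sim_idx = int(np.searchsorted(self.offsets, idx, side="right")) - 1
        elem = self.dataset_list[sim_idx][idx - self.offsets[sim_idx]]

        # pull features we need
        x, y, sdf = elem[0], elem[1], elem[4]
        v_x, v_y = elem[7], elem[8]

        # put everything in tensors (`inputs` avoids shadowing the built-in `input`)
        inputs = torch.tensor((x, y, sdf), dtype=torch.float)
        target = torch.tensor((x, y, v_x, v_y), dtype=torch.float)
        return inputs, target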
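
Building one small tensor per point is slow when a training batch needs thousands of points. One way to index across simulations of different lengths without concatenating them is to keep cumulative offsets and binary-search them for a whole batch of indices at once. This is a minimal sketch under that assumption; `build_offsets` and `gather_points` are hypothetical helpers, not part of airfRANS or PyTorch, and the arrays are synthetic.

```python
import numpy as np


def build_offsets(simulations):
    """offsets[i] = global index of the first point of simulation i; offsets[-1] = total."""
    sizes = [sim.shape[0] for sim in simulations]
    return np.concatenate(([0], np.cumsum(sizes, dtype=np.int64)))


def gather_points(simulations, offsets, indices):
    """Return inputs (B, 3) = (x, y, sdf) and targets (B, 4) = (x, y, v_x, v_y)."""
    indices = np.asarray(indices, dtype=np.int64)
    if indices.size and (indices.min() < 0 or indices.max() >= offsets[-1]):
        raise IndexError("global point index out of range")

    # Binary search: which simulation holds each global index, and at which row
    sim_ids = np.searchsorted(offsets, indices, side="right") - 1
    rows = indices - offsets[sim_ids]

    inputs = np.empty((len(indices), 3), dtype=np.float32)
    targets = np.empty((len(indices), 4), dtype=np.float32)
    # Visit each touched simulation once and copy only the requested rows
    for sim_id in np.unique(sim_ids):
        mask = sim_ids == sim_id
        picked = simulations[sim_id][rows[mask]]
        inputs[mask] = picked[:, [0, 1, 4]]
        targets[mask] = picked[:, [0, 1, 7, 8]]
    return inputs, targets


# Synthetic stand-in for `dataset_list`: simulations with 2, 3 and 4 points
sims = [np.arange(n * 12, dtype=np.float32).reshape(n, 12) + 1000 * k
        for k, n in enumerate((2, 3, 4))]
offsets = build_offsets(sims)  # [0, 2, 5, 9]
inputs, targets = gather_points(sims, offsets, [0, 2, 8, 4])
print(inputs.shape, targets.shape)
```

The resulting arrays can be handed to `torch.from_numpy`, so a batch sampler that yields lists of indices would only pay the tensor-construction cost once per batch.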

In [15]:
# Example output

dataset = AirfoilDataset(dataset_list)
print(dataset[0])

(tensor([ 4.2169, -0.1999,  3.2231]), tensor([ 4.2169, -0.1999, 54.5453, -3.3534]))
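
Loading through `af.dataset.load` took several minutes above, so for repeated training runs it may help to write the five columns actually used to disk once and memory-map them afterwards. The sketch below is one possible layout, not the notebook's method: `pack_simulations`, `load_packed` and the file names `points.npy` and `offsets.npy` are assumptions for illustration, and the arrays are synthetic.

```python
import tempfile
from pathlib import Path

import numpy as np

KEEP = [0, 1, 4, 7, 8]  # x, y, sdf, v_x, v_y in the 12-column layout


def pack_simulations(simulations, out_dir):
    """Write the kept columns of every simulation into one .npy file plus an offsets file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sizes = [sim.shape[0] for sim in simulations]
    offsets = np.concatenate(([0], np.cumsum(sizes, dtype=np.int64)))

    # open_memmap creates a valid .npy file that is filled in place,
    # one simulation at a time, so no full concatenated copy lives in RAM
    points = np.lib.format.open_memmap(
        str(out_dir / "points.npy"), mode="w+", dtype=np.float32,
        shape=(int(offsets[-1]), len(KEEP)),
    )
    for start, sim in zip(offsets[:-1], simulations):
        points[start:start + sim.shape[0]] = sim[:, KEEP]
    points.flush()
    np.save(str(out_dir / "offsets.npy"), offsets)


def load_packed(out_dir):
    """Memory-map the packed points; data is read from disk only when accessed."""
    out_dir = Path(out_dir)
    points = np.load(str(out_dir / "points.npy"), mmap_mode="r")
    offsets = np.load(str(out_dir / "offsets.npy"))
    return points, offsets


# Synthetic stand-in for `dataset_list`
rng = np.random.default_rng(0)
sims = [rng.normal(size=(n, 12)).astype(np.float32) for n in (4, 6)]
with tempfile.TemporaryDirectory() as tmp:
    pack_simulations(sims, tmp)
    points, offsets = load_packed(tmp)
    # Rows of the second simulation, copied out of the memory map
    second = np.array(points[offsets[1]:offsets[2]])
    del points  # release the memory map before the directory is removed
print(second.shape)
```

In the packed layout, columns 0 to 2 are the input `(x, y, sdf)` and columns 0, 1, 3 and 4 are the target `(x, y, v_x, v_y)`. Storing the offsets alongside the points keeps the per-simulation boundaries, so either point-wise or simulation-wise sampling remains possible.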
