
In [None]:
!pip install sentence-transformers



In [1]:
import numpy as np

# Load the serum parameters
serum_params = np.load('/content/drive/My Drive/Colab Notebooks/synth_data/serum_params.npy')

# Load the preset names
serum_preset_names = np.load('/content/drive/My Drive/Colab Notebooks/synth_data/serum_preset_name.npy')

# Load the preset descriptions
serum_preset_descriptions = np.load('/content/drive/My Drive/Colab Notebooks/synth_data/serum_preset_descriptions.npy')

print("Shape of serum_params:", serum_params.shape)
print("Shape of serum_preset_names:", serum_preset_names.shape)
print("Shape of serum_preset_descriptions:", serum_preset_descriptions.shape)

print("\nFirst 5 preset names:")
print(serum_preset_names[:5])

print("\nFirst row of serum_params (parameters for the first preset):")
print(serum_params[0, :10]) # Print first 10 parameters of the first preset

print("\nFirst 5 preset descriptions:")
print(serum_preset_descriptions[:5])

Shape of serum_params: (501, 215)
Shape of serum_preset_names: (501,)
Shape of serum_preset_descriptions: (501,)

First 5 preset names:
['DS_DPW_bass_sub_bye.fxp' 'TSP_SP_Bass_tone_down.fxp'
 'FSS_SMTEV4_Bass_Jumping.fxp' 'TSP_SP_Bass_grease.fxp'
 'DS_TD_bass_reese_heavy.fxp']

First row of serum_params (parameters for the first preset):
[0.80087715 0.         0.5        1.         0.75       0.5
 0.3888889  0.5        0.5        0.5       ]

First 5 preset descriptions:
['synth bass sub disco pop'
 'synth bass house trap dirty drill drift phonk'
 'synth bass progressive house techno smooth melodic techno'
 'synth bass house trap smooth drill drift phonk'
 'synth bass dubstep reese dirty tearout dubstep bass house']


## Data Cleaning

In [2]:
# Remove the word "synth" from descriptions
# TODO: Run tags through an llm to give different descriptions
serum_preset_descriptions_cleaned = np.array([desc.replace('synth', '').strip() for desc in serum_preset_descriptions])

print("First 5 cleaned preset descriptions:")
print(serum_preset_descriptions_cleaned[:5])

First 5 cleaned preset descriptions:
['bass sub disco pop' 'bass house trap dirty drill drift phonk'
 'bass progressive house techno smooth melodic techno'
 'bass house trap smooth drill drift phonk'
 'bass dubstep reese dirty tearout dubstep bass house']
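
One caveat with the cleaning step above: `str.replace` removes every occurrence of the substring, including inside longer words, and can leave double spaces behind. Below is a minimal sketch of a token-level alternative, assuming the descriptions are whitespace-separated tags. The helper name `remove_tag` and the second example string are hypothetical and are not taken from the dataset.

```python
def remove_tag(description: str, tag: str = "synth") -> str:
    """Drop whole-word occurrences of `tag` and normalise whitespace."""
    tokens = description.split()
    return " ".join(token for token in tokens if token != tag)


examples = [
    "synth bass sub disco pop",     # taken from the output above
    "synth pad synthwave retro",    # hypothetical description
]
cleaned = [remove_tag(description) for description in examples]

for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```

With the substring approach, the hypothetical second description would lose the "synth" inside "synthwave" as well; the token-level version leaves that word intact.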


## Tokenization and Encoding
Encode the synth patch descriptions into vectors.

In [3]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
# You can choose a different model based on your needs (e.g., 'all-mpnet-base-v2', 'sentence-t5-base', 'all-MiniLM-L6-v2')
# 'all-MiniLM-L6-v2' is a good balance of performance and speed.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the cleaned descriptions to get embeddings
description_embeddings = model.encode(serum_preset_descriptions_cleaned)

print("Shape of description_embeddings:", description_embeddings.shape)
print("\nFirst embedding vector (first 10 dimensions):")
print(description_embeddings[0, :10])

Shape of description_embeddings: (501, 384)

First embedding vector (first 10 dimensions):
[-0.09342984 -0.06710391 -0.01668711 -0.03034835 -0.03652424  0.02354898
  0.08669653  0.04516097  0.01856125 -0.05888574]
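
Because the descriptions are short tag lists that share many words, it may be worth checking how close their embeddings are to each other before training on them. The sketch below computes a pairwise cosine similarity matrix with NumPy. It runs on random stand-in vectors so that it is self-contained; in the notebook, `description_embeddings` would be passed instead. The function name `cosine_similarity_matrix` is an assumption made for this example.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the (n, n) matrix of cosine similarities between rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Stand-in data with the same width as the real embeddings (384).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 384))

sims = cosine_similarity_matrix(fake_embeddings)
print("Similarity matrix shape:", sims.shape)

# Off-diagonal entries show how similar different descriptions are.
off_diagonal = sims[~np.eye(len(sims), dtype=bool)]
print("Mean off-diagonal similarity:", float(off_diagonal.mean()))
```

If the real off-diagonal similarities turn out to be high, the model receives very similar inputs for different presets, which would limit how different its outputs can be.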


## Build Output Heads

1. Define the model architecture (takes in text description vector)
2. Create different output heads: parameters of different types (continuous, categorical, boolean) will ultimately need different output layers, or heads.
3. Connect the heads: each head will be a small trainable network (such as an MLP) that takes `description_embeddings` as input and outputs predictions for a subset of the parameters.
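
To make the idea of separate heads concrete, here is a framework-free NumPy sketch of the forward pass only (no training). It assumes a shared hidden layer feeding three heads: sigmoid for continuous values in [0, 1], sigmoid for boolean switches, and softmax for one categorical parameter. The split sizes (`n_continuous`, `n_boolean`, `n_classes`) are hypothetical, since the real parameter types are still a TODO in this notebook, and the weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


embedding_dim, hidden_dim = 384, 128
# Hypothetical split of the outputs; not derived from the Serum data.
n_continuous, n_boolean, n_classes = 200, 10, 4

# Randomly initialised weights stand in for trained ones.
W_shared = rng.normal(scale=0.05, size=(embedding_dim, hidden_dim))
W_cont = rng.normal(scale=0.05, size=(hidden_dim, n_continuous))
W_bool = rng.normal(scale=0.05, size=(hidden_dim, n_boolean))
W_cat = rng.normal(scale=0.05, size=(hidden_dim, n_classes))


def forward(embeddings: np.ndarray) -> dict:
    """Shared ReLU layer followed by one head per parameter type."""
    h = np.maximum(embeddings @ W_shared, 0.0)
    return {
        "continuous": sigmoid(h @ W_cont),    # values in [0, 1]
        "boolean": sigmoid(h @ W_bool),       # probability that a switch is on
        "categorical": softmax(h @ W_cat),    # distribution over classes
    }


batch = rng.normal(size=(2, embedding_dim))
outputs = forward(batch)
for name, value in outputs.items():
    print(name, value.shape)
```

In a trained version, each head would typically get its own loss (for example MSE, binary cross-entropy, and categorical cross-entropy respectively), which is the adjustment the note in the next cell refers to.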

In [4]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

# Define the input layer for the embeddings
embedding_dim = description_embeddings.shape[1]
input_layer = Input(shape=(embedding_dim,), name='text_embedding_input')

# TODO: Determine which of the parameters are continuous, categorical, and boolean
# Assuming all 215 parameters are continuous between 0 and 1
# For now, a single head predicts all parameters: one hidden layer followed by one output layer
# More hidden layers can be stacked here if needed (e.g., 2-3 layers)
x = Dense(128, activation='relu')(input_layer) # Example hidden layer

# Output layer for all 215 continuous parameters
# Sigmoid activation constrains the output to be between 0 and 1
output_layer = Dense(215, activation='sigmoid', name='parameter_outputs')(x)

# Define the model
model = Model(inputs=input_layer, outputs=output_layer)

# Compile the model
# Use an appropriate loss for continuous values (e.g., 'mse' or Huber loss)
model.compile(optimizer='adam', loss='mse') # Using MSE for simplicity now

model.summary()

# Note: This model assumes all parameters are continuous [0, 1].
# If parameters have different types or ranges, the output layer and loss
# function will need to be adjusted accordingly, potentially with multiple outputs.
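
The output of `model.summary()` is not shown here, but the number of trainable weights can be worked out by hand: a Dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The short sketch below applies that formula to the two Dense layers defined above; the helper `dense_param_count` is introduced only for this example.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


embedding_dim = 384   # width of the all-MiniLM-L6-v2 embeddings
hidden_units = 128    # Dense(128, activation='relu')
n_outputs = 215       # Dense(215, activation='sigmoid')

hidden_params = dense_param_count(embedding_dim, hidden_units)
output_params = dense_param_count(hidden_units, n_outputs)
total_params = hidden_params + output_params

print("Hidden layer:", hidden_params)   # 49280
print("Output layer:", output_params)   # 27735
print("Total:", total_params)           # 77015
```

That is roughly 77,000 weights against the 400 training rows used in this notebook, so overfitting is worth watching for when comparing training and validation loss.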

In [5]:
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
# We split both the embeddings and the corresponding serum parameters
X_train, X_val, y_train, y_val = train_test_split(
    description_embeddings,
    serum_params,
    test_size=0.2, # 20% for validation
    random_state=42 # for reproducibility
)

print("Shape of X_train (training embeddings):", X_train.shape)
print("Shape of X_val (validation embeddings):", X_val.shape)
print("Shape of y_train (training parameters):", y_train.shape)
print("Shape of y_val (validation parameters):", y_val.shape)

Shape of X_train (training embeddings): (400, 384)
Shape of X_val (validation embeddings): (101, 384)
Shape of y_train (training parameters): (400, 215)
Shape of y_val (validation parameters): (101, 215)


In [6]:
# Train the model
# TODO: Hyperparameter improvement
history = model.fit(
    X_train,
    y_train,
    epochs=50, # You can adjust the number of epochs
    batch_size=32, # You can adjust the batch size
    validation_data=(X_val, y_val)
)

print("\nModel training finished.")
# You can access training history like loss and validation loss using history.history

Epoch 1/50
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 22ms/step - loss: 0.1149 - val_loss: 0.1035
Epoch 2/50
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - loss: 0.0975 - val_loss: 0.0727
Epoch 3/50
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - loss: 0.0672 - val_loss: 0.0529
Epoch 4/50
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - loss: 0.0522 - val_loss: 0.0486
Epoch 5/50
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - loss: 0.0498 - val_loss: 0.0476
Epoch 6/50
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - loss: 0.0485 - val_loss: 0.0473
Epoch 7/50
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - loss: 0.0486 - val_loss: 0.0471
Epoch 8/50
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - loss: 0.0476 - val_loss: 0.0470
Epoch 9/50
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━

In [None]:
# Evaluate the model on the validation set
loss = model.evaluate(X_val, y_val, verbose=0)

print(f"Validation Loss: {loss}")

Validation Loss: 0.045054543763399124
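
A validation MSE on its own is hard to interpret. One possible reference point, which is not part of the original notebook, is a constant baseline that ignores the text and always predicts the per-parameter mean of the training targets. The sketch below defines such a baseline and runs it on synthetic uniform targets so that it is self-contained; in the notebook, `y_train` and `y_val` would be passed instead. The function name `mean_baseline_mse` is an assumption made for this example.

```python
import numpy as np


def mean_baseline_mse(y_train: np.ndarray, y_val: np.ndarray) -> float:
    """MSE obtained by predicting the training mean of each parameter."""
    baseline = y_train.mean(axis=0, keepdims=True)  # shape (1, n_parameters)
    return float(np.mean((y_val - baseline) ** 2))


# Synthetic stand-ins with the same shapes as the real split.
rng = np.random.default_rng(42)
y_train_demo = rng.uniform(0.0, 1.0, size=(400, 215))
y_val_demo = rng.uniform(0.0, 1.0, size=(101, 215))

baseline_mse = mean_baseline_mse(y_train_demo, y_val_demo)
print(f"Mean-baseline MSE on synthetic targets: {baseline_mse:.4f}")
```

If the baseline MSE on the real data is close to the model's validation loss, the network has mostly learned an average preset rather than a mapping from text to parameters.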


# Task
Perform hyperparameter tuning for the text-to-synth parameter model by experimenting with different numbers of layers and neurons in the heads, learning rates, batch sizes, and epochs.

## Identify hyperparameters to tune

### Subtask:
List the specific hyperparameters you want to experiment with for hyperparameter tuning.


**Reasoning**:
List the hyperparameters to be tuned in a markdown cell.



In [None]:
# Hyperparameters to experiment with for tuning:
# 1. Neural Network Architecture:
#    - Number of Dense Layers in the head (e.g., 1, 2, 3)
#    - Number of Neurons in each Dense Layer (e.g., 64, 128, 256)
# 2. Training Process:
#    - Learning Rate for the optimizer (e.g., 0.01, 0.001, 0.0001)
#    - Batch Size (e.g., 16, 32, 64)
# 3. Training Duration:
#    - Number of Epochs (e.g., 50, 100, 200)

## Define a search strategy

### Subtask:
Choose a method for exploring the hyperparameter space (e.g., manual search, grid search, random search).


**Reasoning**:
Choose a hyperparameter search strategy and explain it in a markdown cell.



In [None]:
# For this task, given the relatively small number of hyperparameters and the dataset size,
# Random Search is a suitable strategy. It is generally more efficient than Grid Search
# for exploring a large search space and is easier to implement than more advanced methods
# like Bayesian Optimization. Random Search samples random combinations of hyperparameters
# from a defined distribution for a fixed number of iterations.

**Reasoning**:
The previous code block explained the chosen strategy in a comment. Now, present this explanation in a markdown cell as requested by the instructions.



In [None]:
# The chosen hyperparameter search strategy is Random Search.
# The search space for this task has 5 hyperparameters with 3 candidate values
# each, i.e. 3^5 = 243 combinations, so an exhaustive Grid Search would need
# 243 training runs. Random Search instead trains a fixed number of randomly
# sampled combinations (10 here). It tends to be more efficient than Grid Search
# when only a few hyperparameters strongly affect performance, because every
# run tries a fresh value for each hyperparameter. The trade-off is that the
# best combination in the grid is not guaranteed to be sampled.
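
The difference between the two strategies can be illustrated without training anything. The sketch below enumerates the full grid for the value ranges used in this notebook and then draws 10 distinct combinations from it with a seeded generator. Sampling without replacement from the enumerated grid is a variation chosen for this example (it avoids testing the same combination twice); the seed value is arbitrary.

```python
import itertools
import random

param_space = {
    "num_layers": [1, 2, 3],
    "neurons": [64, 128, 256],
    "learning_rate": [0.01, 0.001, 0.0001],
    "batch_size": [16, 32, 64],
    "epochs": [50, 100, 200],
}

# Grid search would train one model per element of the Cartesian product.
grid = list(itertools.product(*param_space.values()))
print("Grid size:", len(grid))  # 243

# Random search: draw a fixed number of distinct combinations.
rng = random.Random(42)  # seeded so the draw is reproducible
n_iterations = 10
samples = [
    dict(zip(param_space.keys(), combination))
    for combination in rng.sample(grid, n_iterations)
]

for sample in samples[:3]:
    print(sample)
```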

## Implement the tuning process

### Subtask:
Implement the tuning process using Random Search. This involves setting up a loop to iterate through a defined number of random hyperparameter combinations, configuring and compiling the model for each combination, and training the model.


**Reasoning**:
Implement random search by defining the hyperparameter space, setting the number of iterations, and iterating to build, compile, and train the model for each random combination.



In [7]:
import random
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import os # Import the os module

# 1. Define the hyperparameter space
param_space = {
    'num_layers': [1, 2, 3], # Number of Dense layers in the head
    'neurons': [64, 128, 256], # Number of neurons per Dense layer
    'learning_rate': [0.01, 0.001, 0.0001], # Learning rate for Adam optimizer
    'batch_size': [16, 32, 64], # Batch size for training
    'epochs': [50, 100, 200] # Number of epochs for training
}

# 2. Set the number of iterations for the random search
n_iterations = 10 # Number of random combinations to test

best_loss = float('inf')
best_params = None
history_records = {}
best_model = None # Variable to store the best model

embedding_dim = X_train.shape[1] # Dimension of the input embeddings

# Define the directory to save the best model
save_dir = '/content/drive/My Drive/Colab Notebooks/output_models/'
# Create the directory if it doesn't exist
os.makedirs(save_dir, exist_ok=True)
best_model_path = os.path.join(save_dir, 'best_serum_parameter_model.h5') # Define the path for the best model

# 3. Start a loop for random search iterations
for i in range(n_iterations):
    print(f"Iteration {i+1}/{n_iterations}")

    # 4. Randomly sample a combination of hyperparameters
    hp_combination = {
        param: random.choice(values) for param, values in param_space.items()
    }
    print("Hyperparameters:", hp_combination)

    # 5. Build a new TensorFlow model with the sampled architecture
    input_layer = Input(shape=(embedding_dim,), name='text_embedding_input')
    x = input_layer
    for _ in range(hp_combination['num_layers']):
        x = Dense(hp_combination['neurons'], activation='relu')(x)

    # Output layer for all 215 continuous parameters
    output_layer = Dense(215, activation='sigmoid', name='parameter_outputs')(x)

    model = Model(inputs=input_layer, outputs=output_layer)

    # 6. Compile the model with the sampled learning rate
    optimizer = Adam(learning_rate=hp_combination['learning_rate'])
    model.compile(optimizer=optimizer, loss='mse')

    # 7. Train the model
    history = model.fit(
        X_train,
        y_train,
        epochs=hp_combination['epochs'],
        batch_size=hp_combination['batch_size'],
        validation_data=(X_val, y_val),
        verbose=0 # Set verbose to 0 to reduce output during tuning
    )

    # Record history
    history_records[i] = {
        'hyperparameters': hp_combination,
        'history': history.history
    }

    # Evaluate the model and track the best
    val_loss = model.evaluate(X_val, y_val, verbose=0)
    print(f"Validation Loss: {val_loss}")

    if val_loss < best_loss:
        best_loss = val_loss
        best_params = hp_combination
        best_model = model # Update the best model
        print(f"New best model found with validation loss: {best_loss}")
        # Save the best model immediately
        best_model.save(best_model_path)
        print(f"Best model saved to: {best_model_path}")


print("\nRandom Search finished.")
print(f"Best Validation Loss: {best_loss}")
print(f"Best Hyperparameters: {best_params}")

# The best model is now saved at best_model_path

Iteration 1/10
Hyperparameters: {'num_layers': 3, 'neurons': 128, 'learning_rate': 0.001, 'batch_size': 16, 'epochs': 50}




Validation Loss: 0.046875935047864914
New best model found with validation loss: 0.046875935047864914
Best model saved to: /content/drive/My Drive/Colab Notebooks/output_models/best_serum_parameter_model.h5
Iteration 2/10
Hyperparameters: {'num_layers': 1, 'neurons': 128, 'learning_rate': 0.01, 'batch_size': 32, 'epochs': 200}
Validation Loss: 0.05921739339828491
Iteration 3/10
Hyperparameters: {'num_layers': 2, 'neurons': 256, 'learning_rate': 0.001, 'batch_size': 64, 'epochs': 50}




Validation Loss: 0.04514307528734207
New best model found with validation loss: 0.04514307528734207
Best model saved to: /content/drive/My Drive/Colab Notebooks/output_models/best_serum_parameter_model.h5
Iteration 4/10
Hyperparameters: {'num_layers': 1, 'neurons': 128, 'learning_rate': 0.01, 'batch_size': 16, 'epochs': 50}
Validation Loss: 0.05346076935529709
Iteration 5/10
Hyperparameters: {'num_layers': 2, 'neurons': 64, 'learning_rate': 0.01, 'batch_size': 32, 'epochs': 200}
Validation Loss: 0.05789909139275551
Iteration 6/10
Hyperparameters: {'num_layers': 2, 'neurons': 64, 'learning_rate': 0.0001, 'batch_size': 16, 'epochs': 100}
Validation Loss: 0.04612398520112038
Iteration 7/10
Hyperparameters: {'num_layers': 3, 'neurons': 64, 'learning_rate': 0.001, 'batch_size': 32, 'epochs': 100}
Validation Loss: 0.04633326083421707
Iteration 8/10
Hyperparameters: {'num_layers': 1, 'neurons': 64, 'learning_rate': 0.01, 'batch_size': 16, 'epochs': 200}
Validation Loss: 0.058622654527425766
I

## Select the best model

### Subtask:
Identify the hyperparameter combination that resulted in the best validation performance and report the best model configuration and its corresponding validation loss.


**Reasoning**:
Print the best validation loss and the corresponding hyperparameters found during the random search.



In [None]:
print(f"Best Validation Loss achieved: {best_loss}")
print(f"Hyperparameter combination that resulted in the best performance: {best_params}")

print("\nInterpretation:")
print(f"The best performance on the validation set, with a loss of {best_loss:.4f}, was achieved with the following model configuration:")
print(f"- Number of Dense Layers: {best_params['num_layers']}")
print(f"- Number of Neurons per Layer: {best_params['neurons']}")
print(f"- Learning Rate: {best_params['learning_rate']}")
print(f"- Batch Size: {best_params['batch_size']}")
print(f"- Number of Epochs: {best_params['epochs']}")
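
The search loop compares runs by the validation loss after the final epoch, but `history_records` also holds the per-epoch `val_loss` of every run. The sketch below ranks runs by their best epoch instead, which can show whether a run peaked early and then degraded. The records used here are small hypothetical values with the same structure as `history_records`; they are not results from this notebook, and `summarize_runs` is a name introduced for this example.

```python
# Hypothetical records mirroring the structure of history_records.
demo_records = {
    0: {
        "hyperparameters": {"num_layers": 1, "learning_rate": 0.01},
        "history": {"val_loss": [0.060, 0.047, 0.052]},
    },
    1: {
        "hyperparameters": {"num_layers": 2, "learning_rate": 0.001},
        "history": {"val_loss": [0.058, 0.050, 0.049]},
    },
}


def summarize_runs(records: dict) -> list:
    """Return one row per run, sorted by the lowest validation loss reached."""
    rows = []
    for run_id, record in records.items():
        val_losses = record["history"]["val_loss"]
        best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
        rows.append({
            "run": run_id,
            "final_val_loss": val_losses[-1],
            "best_val_loss": val_losses[best_index],
            "best_epoch": best_index + 1,  # epochs are 1-based in the Keras log
            "hyperparameters": record["hyperparameters"],
        })
    return sorted(rows, key=lambda row: row["best_val_loss"])


rows = summarize_runs(demo_records)
for row in rows:
    print(row)
```

In the hypothetical data, run 0 reaches the lowest loss at epoch 2 but finishes worse than run 1, so the two ranking criteria disagree. Early stopping with restored best weights is a common way to act on that difference.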

## Summary:

### Data Analysis Key Findings

*   The hyperparameter tuning process explored various combinations of the number of dense layers (1, 2, 3), neurons per layer (64, 128, 256), learning rates (0.01, 0.001, 0.0001), batch sizes (16, 32, 64), and epochs (50, 100, 200).
*   Using a Random Search strategy with 10 iterations, the best validation loss achieved was approximately 0.0448.
*   The hyperparameter combination that resulted in the best validation performance was 1 dense layer, 128 neurons per layer, a learning rate of 0.001, a batch size of 32, and training for 50 epochs.

### Insights or Next Steps

*   The tuning process indicates that a relatively simple architecture with 1 dense layer and 128 neurons, trained with a learning rate of 0.001 for 50 epochs, is effective for this task among the tested combinations.
*   Further tuning could involve exploring a wider range of hyperparameter values, particularly around the identified best parameters, or using a more sophisticated search strategy like Bayesian Optimization if computational resources allow.


# Test requests to model

In [None]:
import tensorflow as tf
from sentence_transformers import SentenceTransformer

# Define the path to your saved Keras model file
# Assuming the best model saved from the previous step is at this path
model_save_path = '/content/drive/My Drive/Colab Notebooks/output_models/best_serum_parameter_model.h5'

# 1. Load the trained Keras model from the file, explicitly providing the custom object for 'mse'
loaded_model = tf.keras.models.load_model(model_save_path, custom_objects={'mse': tf.keras.losses.MeanSquaredError()})

print("Model loaded successfully from disk.")
loaded_model.summary()


# Define a new text description string
new_description = "a dark and heavy bass patch" # Example text


# 2. Load the same SentenceTransformer model used during training
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
print("\nSentenceTransformer model loaded successfully.")


# 3. Encode the new text description using the SentenceTransformer model
new_description_embedding = sentence_model.encode([new_description])
print("\nShape of new description embedding:", new_description_embedding.shape)


# 4. Use the loaded Keras model to predict the parameters
predicted_parameters = loaded_model.predict(new_description_embedding)

print("\nShape of predicted parameters:", predicted_parameters.shape)
print("\nPredicted parameters (first 10):")
print(predicted_parameters[0, :10]) # Print the first 10 predicted parameters for the first description



Model loaded successfully from disk.



SentenceTransformer model loaded successfully.

Shape of new description embedding: (1, 384)
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 92ms/step

Shape of predicted parameters: (1, 215)

Predicted parameters (first 10):
[0.727092   0.1166372  0.4955148  0.98266685 0.6230749  0.5076037
 0.31651294 0.508935   0.49349093 0.49719095]


# Use model

In [None]:
# Encode the first description
description1 = "gritty dubstep bass"
embedding1 = sentence_model.encode([description1])

# Predict parameters for the first description using the loaded H5 model
predicted_parameters1 = loaded_model.predict(embedding1)

print(f"\nPredicted parameters for '{description1}':")
print(predicted_parameters1[0])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step

Predicted parameters for 'gritty dubstep bass':
[0.7359698  0.12475629 0.47823995 0.9814059  0.53042084 0.5028018
 0.32706356 0.5183133  0.5169799  0.4997578  0.07369261 0.36718982
 0.66874814 0.7640319  0.5014148  0.5018535  0.01713058 0.22501771
 0.50341904 0.1920801  0.31250378 0.39490378 0.84989035 0.34152108
 0.49708384 0.46396303 0.56828433 0.50383115 0.50480175 0.07562937
 0.3635266  0.7272728  0.7660985  0.4961071  0.5058931  0.01017757
 0.26732317 0.20262669 0.14307897 0.47898933 0.42974764 0.40636918
 0.24020205 0.5305342  0.48911628 0.48932496 0.02629812 0.01405741
 0.57441586 0.5975609  0.51025486 0.15790905 0.37925884 0.89214784
 0.19583263 0.95605266 0.59618145 0.28525996 0.2798812  0.3292616
 0.26230446 0.18457527 0.2241558  0.92420465 0.5023645  0.1323854
 0.02311249 0.46450672 0.71733207 0.29375544 0.13969776 0.01473317
 0.46818137 0.5943218  0.26353458 0.1437683  0.01447416 0.4457488
 0.8443601  

In [None]:
# Encode the second description
description2 = "beautiful heavenly flowing pad"
embedding2 = sentence_model.encode([description2])

# Predict parameters for the second description using the loaded H5 model
predicted_parameters2 = loaded_model.predict(embedding2)

print(f"\nPredicted parameters for '{description2}':")
print(predicted_parameters2[0])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 91ms/step

Predicted parameters for 'beautiful heavenly flowing pad':
[0.67165333 0.1494782  0.49746883 0.9575372  0.6532729  0.52132076
 0.32625952 0.48999205 0.50481033 0.5066188  0.2546978  0.3783476
 0.68394965 0.7581067  0.4908222  0.5098703  0.03036938 0.32719126
 0.19315155 0.1283294  0.635448   0.47241506 0.796423   0.59105074
 0.5050875  0.3426799  0.5028014  0.5217874  0.4993878  0.23650794
 0.3829072  0.65306884 0.7338522  0.50389135 0.49865955 0.01558788
 0.25112757 0.18721646 0.13040067 0.64500475 0.46166736 0.45296288
 0.308663   0.49801588 0.5147565  0.50165796 0.05451537 0.02099846
 0.6334471  0.5630324  0.509438   0.2835813  0.36174342 0.8849129
 0.13487287 0.92299855 0.7088686  0.44045112 0.40950006 0.41864341
 0.17161278 0.2268016  0.1668604  0.9219127  0.49775764 0.21288365
 0.0264078  0.46524897 0.5913123  0.32943872 0.20493153 0.01925381
 0.4867417  0.51191235 0.36926863 0.16849598 0.0284852  0.46470287

In [None]:
# TODO: NEXT STEP IS TO INPUT THIS MODEL,
# THEN ANALYSE WHY PARAMETERS ARE SO SIMILAR
# THEN UPDATE PARAMETER TYPES (NUMERIC, BOOLEAN, CATEGORICAL)
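
As a starting point for the similarity analysis mentioned in the TODO, the sketch below quantifies how far apart two predictions are. It uses only the first 10 values printed above for "gritty dubstep bass" and "beautiful heavenly flowing pad"; in the notebook, the full `predicted_parameters1[0]` and `predicted_parameters2[0]` arrays would be compared. The 0.05 threshold is an arbitrary assumption.

```python
# First 10 predicted values copied from the two outputs above.
pred_bass = [0.7359698, 0.12475629, 0.47823995, 0.9814059, 0.53042084,
             0.5028018, 0.32706356, 0.5183133, 0.5169799, 0.4997578]
pred_pad = [0.67165333, 0.1494782, 0.49746883, 0.9575372, 0.6532729,
            0.52132076, 0.32625952, 0.48999205, 0.50481033, 0.5066188]

abs_diff = [abs(a - b) for a, b in zip(pred_bass, pred_pad)]
mean_abs_diff = sum(abs_diff) / len(abs_diff)

threshold = 0.05  # arbitrary cut-off for a "noticeable" difference
n_large = sum(diff > threshold for diff in abs_diff)

print(f"Mean absolute difference: {mean_abs_diff:.4f}")
print(f"Parameters differing by more than {threshold}: {n_large} of {len(abs_diff)}")
```

On these 10 values, the two very different descriptions produce predictions that differ by only about 0.03 on average, which is consistent with the concern that the model outputs something close to an average preset.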