
element profile/hyperparameter optimization #385

Closed
Rana-Phy opened this issue Jan 13, 2022 · 5 comments

Comments

@Rana-Phy

Dear Developers,

I am trying to optimize the element profile for a multicomponent system.
I am a beginner in Python and am doing this manually with nested 'for' loops.
I am afraid it will take 15 years to finish (200x200x200 searches).
I see that the authors previously did this for several multicomponent systems.

Could you suggest a more efficient and faster way to do it?

####################

# Grid search over the three element cutoffs.
# Imports added below (the maml import path assumes a recent release);
# Ti, Si, C are the element weights, and tsc_train_structures, tsc_df
# and weights are defined earlier in my script.
import numpy as np
from maml.describers import BispectrumCoefficients
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rcut_grid = []
for rc_1 in np.arange(4, 6, 0.01):
    for rc_2 in np.arange(4, 6, 0.01):
        for rc_3 in np.arange(4, 6, 0.01):
            element_profile = {'Ti': {'r': rc_1, 'w': Ti},
                               'Si': {'r': rc_2, 'w': Si},
                               'C': {'r': rc_3, 'w': C}}
            describer = BispectrumCoefficients(rcutfac=0.5, twojmax=6,
                                               element_profile=element_profile,
                                               quadratic=False, pot_fit=True,
                                               include_stress=False, n_jobs=4)
            tsc_features = describer.transform(tsc_train_structures)
            y = tsc_df['y_orig'] / tsc_df['n']
            x = tsc_features
            simple_model = LinearRegression(n_jobs=4)
            simple_model.fit(x, y, sample_weight=weights)
            energy_indices = np.argwhere(np.array(tsc_df["dtype"]) == "energy").ravel()
            forces_indices = np.argwhere(np.array(tsc_df["dtype"]) == "force").ravel()
            simple_predict_y = simple_model.predict(x)
            original_energy = y[energy_indices]
            original_forces = y[forces_indices]
            simple_predict_energy = simple_predict_y[energy_indices]
            simple_predict_forces = simple_predict_y[forces_indices]
            e_e = mean_absolute_error(original_energy, simple_predict_energy) * 10000
            e_f = mean_absolute_error(original_forces, simple_predict_forces)

            rcut_grid.append((rc_1, rc_2, rc_3, e_e, e_f))
@JiQi535
Collaborator

JiQi535 commented Jan 13, 2022

Hi Rana, I can give two pieces of advice:

  1. Try to parallelize your grid search for the best combination of parameters.
    Since each combination of parameters is independent of the others, we can run them in parallel and make the selection afterwards. There are Python packages that help parallelize a grid search, for example the multiprocessing package (see the sketch after this list). If you can divide your search into 24 parallel processes, the search will likely be accelerated several-fold, or even more than 10 times.
  2. Make the search space for the optimal parameters a reasonable size.
    In your case, a 200x200x200 grid seems to include far more cases than are practical or necessary. I won't suggest an exact range for your search, but you can decide the intervals and the total number of searches depending on the resources you have available.

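A minimal sketch of what the parallel search could look like with multiprocessing. The evaluate function just wraps the body of your original loop; Ti, Si, C, tsc_train_structures, tsc_df and weights are assumed to be defined as in your snippet, and the maml import path assumes a recent release:

from itertools import product
from multiprocessing import Pool

import numpy as np
from maml.describers import BispectrumCoefficients
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


def evaluate(rcuts):
    # Fit one (rc_Ti, rc_Si, rc_C) combination and return its energy/force MAEs.
    rc_1, rc_2, rc_3 = rcuts
    element_profile = {'Ti': {'r': rc_1, 'w': Ti},
                       'Si': {'r': rc_2, 'w': Si},
                       'C': {'r': rc_3, 'w': C}}
    describer = BispectrumCoefficients(rcutfac=0.5, twojmax=6,
                                       element_profile=element_profile,
                                       quadratic=False, pot_fit=True,
                                       include_stress=False, n_jobs=1)
    x = describer.transform(tsc_train_structures)
    y = tsc_df['y_orig'] / tsc_df['n']
    model = LinearRegression()
    model.fit(x, y, sample_weight=weights)
    y_pred = model.predict(x)
    e_idx = np.argwhere(np.array(tsc_df["dtype"]) == "energy").ravel()
    f_idx = np.argwhere(np.array(tsc_df["dtype"]) == "force").ravel()
    e_e = mean_absolute_error(y[e_idx], y_pred[e_idx]) * 10000
    e_f = mean_absolute_error(y[f_idx], y_pred[f_idx])
    return rc_1, rc_2, rc_3, e_e, e_f


if __name__ == "__main__":
    # A coarser step (0.1 instead of 0.01) already reduces the grid to 20x20x20 points.
    grid = list(product(np.arange(4, 6, 0.1), repeat=3))
    with Pool(processes=24) as pool:  # match the number of cores on your node
        rcut_grid = pool.map(evaluate, grid)
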

@Rana-Phy
Author

Thanks for your suggestion. It is fast now!
Is there any technical reason behind 'divide your search into 24 parallel processes'?

@JiQi535
Collaborator

JiQi535 commented Jan 14, 2022

> Thanks for your suggestion. It is fast now! Is there any technical reason behind 'divide your search into 24 parallel processes'?

Happy to know that it helps! I used "24" as an example, as there are 24 cores on each node of the computer cluster our group has access to. This value should be adjusted on different machines to achieve the best efficiency.
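
If you don't know the core count in advance, a small sketch (note that os.cpu_count() reports the cores visible to the operating system, which may differ from what a job scheduler actually allots to your job):

import os
from multiprocessing import Pool

n_procs = os.cpu_count() or 1  # fall back to 1 if the count cannot be determined
pool = Pool(processes=n_procs)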

@Rana-Phy
Author

Dear Ji Qi,

OK, now I am seeing that multiprocessing is at least three times slower than n_jobs=24 in scikit-learn for the larger dataset.
Maybe this is because of our cluster setup or my script.
I am trying to understand the maml base model classes, and my understanding could be completely wrong.
So, with
skl_model = SKLModel(describer=describer, model=LinearRegression())
isn't SKLModel the model, where the describer contains the hyperparameters (rcut, weight, twojmax) and the parameters of model=LinearRegression() are learned during training/fitting?
Could I put element_profile into a hyperparameter optimization package like Optuna, Hyperopt, or anything you suggest, instead of using for loops? For some reason I haven't been able to make it work.
I look forward to hearing from you.

Best regards,
Rana

@JiQi535
Collaborator

JiQi535 commented Jan 25, 2022

@Rana-Phy

The describer here is the local environment describer of the SNAP potential, which describes the material structures in a mathematical form. The LinearRegression is the model used in the ML training process to map the local environment descriptors (input) to the target properties, which are energies, forces and stresses.

For parameter tuning, I'm not aware of any existing automatic algorithms for SNAP training. Please let me know if there is one; I would be interested. In previous works from our group, we used the differential evolution implemented in scipy for parameter tuning of a SNAP for Mo (http://dx.doi.org/10.1103/PhysRevMaterials.1.043603), and we also used a stepwise grid search for SNAPs for alloy systems (http://dx.doi.org/10.1038/s41524-020-0339-0). Those may be good references for you.
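
As an illustration, a minimal sketch of tuning the three cutoffs with scipy's differential_evolution, reusing the evaluate function sketched earlier; the relative weighting of energy and force errors below is only an assumption, not the value used in the cited papers:

from scipy.optimize import differential_evolution


def objective(rcuts):
    # Collapse the two errors into one scalar; the factor of 10 on forces is an assumed weighting.
    _, _, _, e_e, e_f = evaluate(tuple(rcuts))
    return e_e + 10.0 * e_f


result = differential_evolution(objective, bounds=[(4.0, 6.0)] * 3,
                                maxiter=20, seed=0)
print(result.x, result.fun)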

@shyuep shyuep closed this as completed Mar 22, 2022