Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new synthesis recipes to API. #257

Merged
merged 19 commits into from
Jun 3, 2021
Merged

Conversation

hhaoyan
Copy link
Contributor

@hhaoyan hhaoyan commented Apr 26, 2021

Adding new schema for synthesis recipes as we want to incorporate the latest solid-state/sol-gel synthesis datasets at https://github.com/CederGroupHub/text-mined-synthesis_public.

Contributor Checklist

  • I have broken down my PR scope into the following TODO tasks
    • Define new pydantic models.
    • Write queries. The following queries will be implemented:
      • Query by target/precursor material composition.
      • Query by synthesis type.
      • Query by paragraph string (keyword matching).
      • Filtering by experimental operations.
      • Filtering by experimental conditions.
    • Implement script that ports the public dataset from here to MP database.
  • I have run the tests locally and they passed.
  • I have added tests, or extended existing tests, to cover any new features or bugs fixed in this PR

@hhaoyan
Copy link
Contributor Author

hhaoyan commented May 6, 2021

Hi @mkhorton could you elaborate on what tests should be implemented for this particular API? Thanks!

@mkhorton
Copy link
Member

mkhorton commented May 6, 2021

Right now, no tests (shocking, I know!). We're just building up our test suite for this repo and figuring out what they should look like, starting with client tests, maybe @munrojm can comment.

In any case, this can be merged without tests if it's otherwise good.

@munrojm
Copy link
Member

munrojm commented May 6, 2021

Hi @hhaoyan, this looks great so far! Thanks for putting it together. As @mkhorton said, I am in the process of getting a full initial suite of tests implemented, so don't worry about that portion.

Let us know when you feel it is ready to merge.

@hhaoyan
Copy link
Contributor Author

hhaoyan commented May 7, 2021

@mkhorton @munrojm I'm thinking to do some post processing in Query classes based on the query parameters. However, the current function def post_process(self, docs: List[Dict]) only accepts a single parameter docs. Is there a proper way to pass the query parameters to post_process?

@munrojm
Copy link
Member

munrojm commented May 7, 2021

@hhaoyan, not currently. What sort of post-processing do you have in mind? Right now the idea is to have the post_process method run for both the search and key name endpoints. This would not support using values of the query parameters that are only seen in the former.

@hhaoyan
Copy link
Contributor Author

hhaoyan commented May 7, 2021

I’m trying to selectively return part of a full text paragraph based on search keywords, which requires the post processor to have access to the search keywords. For example,if a search keyword is preset in the second sentence, then only that sentence will be returned. This is somewhat similar to what google search results look like. Is there a way I can achieve this?

@munrojm
Copy link
Member

munrojm commented May 7, 2021

Have you looked into using an aggregation pipeline similar to what is in query_synth_text? I don't know all of your requirements, but each "passage" that is returned should be a short bit of text from the paragraph that is queried containing the search term. See here, https://docs.atlas.mongodb.com/reference/atlas-search/highlighting/.

Let me know if this looks like it will work. If not, we can figure out something else.

@mkhorton
Copy link
Member

mkhorton commented May 7, 2021

Yes, this is how I've done the highlighting on the current synthesis data -- I just set maxNumPassages to 1 and I believe that's below the 150 limit. Whether we post-process or not, that highlighting functionality is definitely what we want to use. It's fast and we've already written the frontend to handle the output.

@hhaoyan
Copy link
Contributor Author

hhaoyan commented May 8, 2021

Thanks for the info @mkhorton @munrojm ! I looked at what had been done in /text_search/ and I think this should be enough. However I don't have access to a MongoDB Atlas subscription and couldn't test the search feature on my local computer. If that's possible, I'd like to have some access to a test environment so I can test & convert all the data records from our side.

@hhaoyan
Copy link
Contributor Author

hhaoyan commented Jun 3, 2021

@mkhorton I think this could be merged now!

@munrojm
Copy link
Member

munrojm commented Jun 3, 2021

@hhaoyan, can you confirm that you are able to query properly when running the API locally?

@hhaoyan
Copy link
Contributor Author

hhaoyan commented Jun 3, 2021

Just had a conversation with @codytodonnell and it seemed to work well.

@munrojm
Copy link
Member

munrojm commented Jun 3, 2021

@hhaoyan, okay great. Do you mind fixing the mypy linting issues before I merge? Don't worry about the second set of tests.

@hhaoyan
Copy link
Contributor Author

hhaoyan commented Jun 3, 2021

OK sure!

Also, the code has 9 ensure_index function calls when each query is run . Do you think this would slow down the API? Do you have any suggestions on how to make ensure_index to be run just by once?

The code that runs 9 ensure_index is here:

https://github.com/materialsproject/api/pull/257/files#diff-03cb6bf4cb52b7cc45905fd54f7cb1e29314ae92bc32da135f78b526059d2e98R209

@munrojm
Copy link
Member

munrojm commented Jun 3, 2021

I am going to merge main in with another branch that should allow me to define a separate query operator for this custom function. That should allow it to only run when in debug mode. I will take care of that change after this is merged.

For now, can you simply comment out that code block?

@mkhorton
Copy link
Member

mkhorton commented Jun 3, 2021

This looks great, my only comment would be to add a docstring to the data_adaptor files to clarify what they do, what data they take, etc. and for some of the functions in those files too.

@hhaoyan
Copy link
Contributor Author

hhaoyan commented Jun 3, 2021

OK, the mypy lint was fixed and I added docstrings to data adaptors.

I also commented the block of ensuring indexes.

@munrojm
Copy link
Member

munrojm commented Jun 3, 2021

Great, thanks @hhaoyan!

@munrojm munrojm merged commit 80f6599 into materialsproject:main Jun 3, 2021
@mkhorton
Copy link
Member

mkhorton commented Jun 3, 2021 via email

@hhaoyan
Copy link
Contributor Author

hhaoyan commented Jun 3, 2021

Thanks @mkhorton and @munrojm !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants