
54 measuring novelty #59

Merged: 13 commits merged into dev, Mar 6, 2023
Conversation

@beingkk (Contributor) commented Feb 21, 2023

Closes #54

This is the first PR for the novelty measurement component of the project.

It contributes:

  • A utils file: dap_aria_mapping/utils/novelty_utils.py
  • A script to calculate novelty scores for OpenAlex papers: dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py

Usage examples for the script:

All levels, full dataset (takes about 30 mins)

python dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py

One level, test dataset (a few seconds)

python dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py --taxonomy-level 1 --test

The outputs for each taxonomy level are:

  • A table with all documents and their novelty score
  • A table with all topic pairs, the years in which they have occurred, and the commonness score for each topic pair and year

The outputs are presently stored locally. Happy to create new issues for storing them on S3, and for any other improvements you'd like to suggest.

Also, let me know if I need to write tests - I'd be happy to leave that for another issue so I can move on with generating results.

For more context: the novelty score is calculated using the approach described in this paper (Lee et al 2015). This so-called "U-measure" has been shown (Bornmann et al 2019) to have some correlation with what researchers consider novel papers. Note, however, that I have adapted it for combinations of topics, whereas it was originally used for combinations of citations/cited journals. This will likely create some challenges (e.g., citations in a way reflect the full content of the paper, whereas abstracts are much shorter).

From Lee et al 2015:
[Screenshot of the novelty measure definition, omitted]

From Bornmann et al 2019

The results for novelty score U are (mostly) in agreement with our expectations concerning the results for the different tags. We found, for instance, that for a standard deviation increase in novelty score U, the expected number of assignments of the “new finding” tag increases by 7.47% (the result is statistically significant). The results further show that this indicator seems to be especially suited to identifying papers suggesting new targets for drug discovery.
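
For reference, here is a minimal sketch of the commonness calculation adapted to topic pairs (the toy data and column names are illustrative, and the per-document aggregation at the end is an assumption; the actual implementation lives in novelty_utils.py and may differ):

    import numpy as np
    import pandas as pd

    # Toy input: one row per (document, topic) assignment, plus publication year
    topics_df = pd.DataFrame({
        "work_id": ["w1", "w1", "w2", "w2", "w3", "w3"],
        "year":    [2020, 2020, 2020, 2020, 2020, 2020],
        "topic":   ["A", "B", "A", "C", "B", "C"],
    })

    # All unordered within-document topic pairs
    pairs = (
        topics_df.merge(topics_df, on=["work_id", "year"], suffixes=("_1", "_2"))
        .query("topic_1 < topic_2")
    )

    # Counts used in the Lee et al (2015)-style commonness:
    # N_jkt: documents containing both topics j and k in year t
    n_jkt = pairs.groupby(["year", "topic_1", "topic_2"]).size().rename("n_jkt")
    # N_jt: documents containing topic j in year t
    n_jt = topics_df.groupby(["year", "topic"]).size()
    # N_t: total number of topic pairs observed in year t
    n_t = pairs.groupby("year").size()

    # Commonness C_jkt = N_jkt * N_t / (N_jt * N_kt)
    commonness = n_jkt.reset_index()
    commonness["commonness"] = commonness.apply(
        lambda r: r["n_jkt"] * n_t[r["year"]]
        / (n_jt[(r["year"], r["topic_1"])] * n_jt[(r["year"], r["topic_2"])]),
        axis=1,
    )

    # One possible per-document novelty score: minus the log of a low percentile
    # of the document's pair commonness values (the exact aggregation may differ)
    doc_pairs = pairs.merge(commonness, on=["year", "topic_1", "topic_2"])
    novelty = doc_pairs.groupby("work_id")["commonness"].apply(
        lambda c: -np.log(np.percentile(c, 10))
    )
    print(novelty)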

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@beingkk added the enhancement (New feature or request) label on Mar 1, 2023
@beingkk marked this pull request as ready for review on March 2, 2023, 12:27

@emily-bicks (Contributor) left a comment

Looks great!

General comments are:

  • It would be really helpful to save the outputs in S3 and then write getters so we can use them to generate the data for the app
  • Is there a reason we're only doing novelty on publications and not patents?
  • Suggest moving the script out of the notebook directory and into pipelines

@@ -0,0 +1,81 @@
"""
Script/pipeline to calculate novelty scores for OpenAlex papers

Contributor:
would suggest creating pipelines/novelty for this script rather than "notebooks"

Contributor Author (@beingkk):
Good point, will change!

Contributor:
On that note, I see you are using typer - any reason for using it over argparse? I have no preference, but if you are going to build CLIs, we've been doing it exclusively with the latter (plenty of examples throughout the pipeline folder).

Contributor Author (@beingkk), Mar 3, 2023:
I haven't used argparse, but typer is super easy to use (it looks a bit simpler than argparse in terms of the length of boilerplate code).
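
For what it's worth, a rough sketch of how compact the typer entry point can be, with the argparse equivalent in comments (option names mirror the usage examples in the PR description; the real script may differ):

    from typing import Optional

    import typer

    def calculate_openalex_novelty(
        taxonomy_level: Optional[int] = typer.Option(None, help="Taxonomy level (default: all levels)"),
        test: bool = typer.Option(False, help="Run on a small test sample"),
    ) -> None:
        """Calculate novelty scores for OpenAlex papers."""
        ...

    if __name__ == "__main__":
        typer.run(calculate_openalex_novelty)

    # argparse equivalent (explicit parser setup):
    # import argparse
    # parser = argparse.ArgumentParser(description="Calculate novelty scores for OpenAlex papers")
    # parser.add_argument("--taxonomy-level", type=int, default=None)
    # parser.add_argument("--test", action="store_true")
    # args = parser.parse_args()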

Returns:
None (saved novelty scores to file)
"""
# Fetch OpenAlex metadata

Contributor:
Is there a reason we're only doing this for papers? can we do it for patents as well?

Contributor Author (@beingkk):
I haven't gotten to patent data yet - I would like to keep it for another PR. I think I need to first sense-check the results for papers, see if the novelty score actually yields any interesting results and insights.

work_novelty_df, topic_pair_commonness_df = nu.document_novelty(
topics_df, "work_id"
)
# Export novelty scores

Contributor:
Suggest adding a command-line option --local, but otherwise save results to S3 using the nesta_ds_utils saving and loading functions.
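
A rough sketch of what that could look like (the bucket name is a placeholder; writing straight to an s3:// path with pandas requires s3fs, and the nesta_ds_utils helpers would replace the direct call):

    import pandas as pd

    BUCKET = "s3://placeholder-bucket"  # placeholder, not the project's actual bucket

    def save_novelty_scores(work_novelty_df: pd.DataFrame, filename: str, local: bool = False) -> None:
        """Save novelty scores locally or to S3; `local` would be exposed as a --local CLI flag."""
        if local:
            work_novelty_df.to_parquet(f"outputs/interim/{filename}")
        else:
            # pandas can write to S3 directly when s3fs is installed;
            # the nesta_ds_utils saving function could be used here instead
            work_novelty_df.to_parquet(f"{BUCKET}/outputs/novelty/{filename}")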

Contributor Author (@beingkk):
Okay, will try to add that one in this PR.



def preprocess_topics_dict(
topics_dict: dict,

Contributor:
doesn't matter too much, but we've been using Dict from typing so you can specify the types of the keys/values
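
For instance (a simplified signature, and the key/value types here are only a guess at what the dictionary holds):

    from typing import Dict, List

    import pandas as pd

    def preprocess_topics_dict(topics_dict: Dict[str, List[str]]) -> pd.DataFrame:
        """Turn a work_id -> list-of-topic-labels mapping into a long dataframe."""
        rows = [
            (work_id, topic)
            for work_id, topic_list in topics_dict.items()
            for topic in topic_list
        ]
        return pd.DataFrame(rows, columns=["work_id", "topic"])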

Contributor Author (@beingkk):
I see, makes sense, will fix.

dap_aria_mapping/utils/novelty_utils.py (resolved comment thread)
)


def novelty_score(commonness_score: float) -> float:

Contributor:
This function name could be a bit more specific, as it's not really calculating the full novelty score, and I was a bit confused with def document_novelty().
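
For example, something along these lines (the -log transform is an assumption about what the function computes, not a statement of the actual implementation):

    import numpy as np

    def commonness_to_novelty(commonness_score: float) -> float:
        """Convert a single topic-pair commonness score into its novelty contribution.

        Assumes novelty is the negative log of commonness, in the spirit of Lee et al (2015);
        the actual transform in novelty_utils.py may differ.
        """
        return -np.log(commonness_score)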

@beingkk (Contributor Author) commented Mar 2, 2023

Thanks very much @emily-bicks, I will aim to adjust according to your comments and resubmit by end of week or Monday.

OUTPUT_FOLDER = PROJECT_DIR / "outputs/interim"


def main(

Contributor:
If you migrate your code to pipeline, I'd suggest making the function name a bit more informative.

n_test_sample: int = 1000,
):
"""
Calculate novelty scores for OpenAlex papers

Contributor:
It would be quite nice if you briefly described the method / calculation, for code legacy's sake (rather than relying on checking the very nice PR message you wrote about it).

# Export novelty scores
OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)
export_path = OUTPUT_FOLDER / f"openalex_novelty_{level}{test_suffix}.csv"
work_novelty_df.to_csv(export_path, index=False)

Contributor:
We've generally saved everything as parquets. I don't think this is an issue at all, but for consistency's sake maybe it's worth considering changing it to that format.
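
For reference, the change is small, reusing the variables from the snippet above and assuming pyarrow or fastparquet is installed:

    export_path = OUTPUT_FOLDER / f"openalex_novelty_{level}{test_suffix}.parquet"
    work_novelty_df.to_parquet(export_path, index=False)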

Contributor Author (@beingkk):
Nice - I've never used parquet but will try

@ampudia19 (Contributor) commented:

Very nice PR Karlis! I haven't had a chance to run it (my laptop is hovering over 90% CPU & memory usage, unfortunately), but I can't immediately see anything wrong.

A question I do have is where you plan to take this. If you aim to recreate the paper's measure, i.e. not pairs of topics (or journals) but rather pairs of referenced topics (or journals), iterating over pandas DataFrames with iterrows is not going to work, and you'll have to resort either to parallelizable approaches, polars (no idea how this one works), or vectorizing your ops. I'm sure you've given this some thought, but I still want to flag it while we have time.

Being able to recreate the Lee 2015 measure would be nice for validation (as we could then lean on Bornmann's argument that it roughly works). It would also serve as the perfect baseline for our simpler approach (no reference papers, only within-paper topic combinations), in that we could check how much we lose by ignoring citation info.
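
On the performance point, one vectorised route is a sparse document-by-topic incidence matrix whose self-product gives pair counts without any per-row Python loops; a minimal sketch (column names illustrative, per-year grouping omitted):

    import numpy as np
    import pandas as pd
    from scipy.sparse import csr_matrix

    def topic_cooccurrence_counts(topics_df: pd.DataFrame) -> pd.DataFrame:
        """Count how many documents contain each topic pair, without iterrows.

        Expects one row per (work_id, topic); grouping by year is omitted here.
        """
        works = topics_df["work_id"].astype("category")
        topics = topics_df["topic"].astype("category")
        # Binary document x topic incidence matrix (duplicate rows collapse to 1 below)
        incidence = csr_matrix(
            (np.ones(len(topics_df)), (works.cat.codes.to_numpy(), topics.cat.codes.to_numpy())),
            shape=(len(works.cat.categories), len(topics.cat.categories)),
        )
        incidence.data[:] = 1.0
        # Topic x topic co-occurrence counts; the diagonal holds single-topic counts
        cooc = (incidence.T @ incidence).tocoo()
        return pd.DataFrame({
            "topic_1": topics.cat.categories[cooc.row],
            "topic_2": topics.cat.categories[cooc.col],
            "n_docs": cooc.data.astype(int),
        })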

@beingkk (Contributor Author) commented Mar 2, 2023

Thanks very much @ampudia19! I will aim to incorporate the easy-to-do suggestions in this PR by the end of Monday.

In terms of where to take this next, I'd like to approach this iteratively, with developing the minimum viable product and building on top of it as much as time allows.

The present implementation is Step 1 - I think we can see it as an almost ready MVP, as we can use it to spot "uncommon" combinations of taxonomy topics.

Step 2: I suspect it will still need some filtering to remove noise (e.g., low-frequency, uninteresting, random combinations of topics) and light sense-checking (e.g., browsing the most and least "novel" papers and seeing if the ranking makes sense).

Step 3: I would like to focus on aggregating the novelty scores to provide useful input into the dashboard (will need to discuss the details during our catch up)

Step 4: Apply the same pipeline to patents.

Step 5: Only then consider improvements or changes to the novelty score. I think trying to re-implement the citation-based version would be a good bet - it would be interesting to compare the results with the topic-based version (if we can get the journal names of the cited papers). In terms of computational cost, if it's an order-of-magnitude increase then it might still be doable with the present implementation.

I think the other option you suggest, to use the abstracts of the cited papers and detect topics in those, will be a bigger challenge (lots more data, more optimisation) - I'm doubtful I'll be able to complete that by the end of March...

@beingkk (Contributor Author) commented Mar 3, 2023

Hey @ampudia19 and @emily-bicks, thanks again for the quick review!

I've responded to most of your comments (see below). I haven't changed from typer to argparse - if you insist, perhaps I can create a new issue and address it a bit later? If yes, then I'm happy to merge now.

  • Saving the outputs in S3 and writing getters
  • Moving the script out of the notebook directory and into pipelines
  • Adding a command line option for --save-to-local to save locally
  • Typing
  • Missing "Returns" in docstrings
  • More informative function names
  • Briefly describing the method / calculation, for code legacy sake
  • Saving everything as parquets

@emily-bicks (Contributor) commented Mar 3, 2023 via email

@beingkk merged commit ada42bf into dev on Mar 6, 2023
Labels: enhancement (New feature or request)
Projects: none yet
Successfully merging this pull request may close these issues: Measuring novelty
3 participants