54 measuring novelty #59
Conversation
Looks great!
General comments are:
- would be really helpful to save the outputs in S3 and then write getters so we can use them to generate the data for the app
- Is there a reason we're only doing novelty on publications and not patents?
- Suggest moving the script out of the notebook directory and into pipelines
@@ -0,0 +1,81 @@
"""
Script/pipeline to calculate novelty scores for OpenAlex papers
would suggest creating pipelines/novelty for this script rather than "notebooks"
Good point, will change!
On that note, I see you are using typer - any reason for using it over argparse? I have no preference, but if you are going to build CLIs, we've been doing it exclusively with the latter (plenty of examples throughout the pipeline folder).
I haven't used argparse, but typer is super easy to use (it looks a bit simpler than argparse in terms of the amount of boilerplate code).
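For illustration only (not code from this PR), here is roughly what the same toy CLI looks like in both libraries; the function, flag, and option names are made up:

```python
# Hypothetical toy CLI, just to compare boilerplate; not taken from the PR.

# typer: options are inferred from the type-annotated function signature
import typer

def run(level: int = 3, test: bool = False):
    """Invoked as: python script.py --level 1 --test"""
    print(f"level={level}, test={test}")

# argparse: every option is declared explicitly
import argparse

def run_argparse():
    parser = argparse.ArgumentParser(description="Same CLI, argparse style")
    parser.add_argument("--level", type=int, default=3)
    parser.add_argument("--test", action="store_true")
    args = parser.parse_args()
    print(f"level={args.level}, test={args.test}")

if __name__ == "__main__":
    typer.run(run)  # or run_argparse()
```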
    Returns:
        None (saves novelty scores to file)
    """
    # Fetch OpenAlex metadata
Is there a reason we're only doing this for papers? Can we do it for patents as well?
I haven't gotten to patent data yet - I would like to keep it for another PR. I think I need to first sense-check the results for papers, see if the novelty score actually yields any interesting results and insights.
    work_novelty_df, topic_pair_commonness_df = nu.document_novelty(
        topics_df, "work_id"
    )
    # Export novelty scores
Suggest adding a command line option for --local, but otherwise save results to S3 using the nesta_ds_utils saving and loading functions.
Okay, will try to add that one in this PR.
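A rough sketch of what that switch could look like, assuming typer stays in place; the bucket name and S3 paths are placeholders, and the raw boto3 call stands in for whatever nesta_ds_utils function ends up being used:

```python
# Sketch of a --local option; bucket name and paths are placeholders, and the
# boto3 call is a stand-in for the nesta_ds_utils saving/loading helpers.
from pathlib import Path

import boto3
import pandas as pd
import typer

BUCKET = "aria-mapping-placeholder-bucket"  # assumption, not the real bucket
OUTPUT_FOLDER = Path("outputs/interim")

def save_novelty_scores(df: pd.DataFrame, filename: str, local: bool) -> None:
    """Save locally when --local is passed, otherwise upload to S3."""
    if local:
        OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)
        df.to_csv(OUTPUT_FOLDER / filename, index=False)
    else:
        tmp_path = Path("/tmp") / filename
        df.to_csv(tmp_path, index=False)
        boto3.client("s3").upload_file(str(tmp_path), BUCKET, f"outputs/novelty/{filename}")

def main(level: int = 3, local: bool = False):
    # Stand-in for the real novelty results
    work_novelty_df = pd.DataFrame({"work_id": ["W1"], "novelty": [0.5]})
    save_novelty_scores(work_novelty_df, f"openalex_novelty_{level}.csv", local)

if __name__ == "__main__":
    typer.run(main)
```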
def preprocess_topics_dict(
    topics_dict: dict,
doesn't matter too much, but we've been using Dict from typing so you can specify the types of the keys/values
I see, makes sense, will fix.
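For reference, the change just makes the signature self-documenting; the key/value types below are a guess at what the topics dictionary actually holds:

```python
# Hypothetical signature - the real key/value types depend on the topics data
from typing import Dict, List

def preprocess_topics_dict(topics_dict: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Dict[str, List[str]] tells the reader that keys are work IDs and values are topic lists."""
    return {work_id: sorted(set(topics)) for work_id, topics in topics_dict.items()}
```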
)


def novelty_score(commonness_score: float) -> float:
This function name could be a bit more specific, as it's not really calculating the full novelty score, and I was a bit confused by it alongside def document_novelty().
Thanks very much @emily-bicks, I will aim to adjust according to your comments and resubmit by the end of the week or Monday.
OUTPUT_FOLDER = PROJECT_DIR / "outputs/interim"


def main(
If you migrate your code to pipeline, I'd suggest making the function name a bit more informative.
    n_test_sample: int = 1000,
):
    """
    Calculate novelty scores for OpenAlex papers
It would be quite nice if you briefly described the method / calculation here, for code legacy's sake (rather than relying on checking the very nice PR message you wrote about it).
    # Export novelty scores
    OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)
    export_path = OUTPUT_FOLDER / f"openalex_novelty_{level}{test_suffix}.csv"
    work_novelty_df.to_csv(export_path, index=False)
We've generally saved everything as parquets. I don't think this is an issue at all, but for consistency's sake maybe it's worth considering changing it to that format.
Nice - I've never used parquet but will try
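For what it's worth, the switch is essentially a one-liner with pandas, provided a parquet engine such as pyarrow is installed; this mirrors the export step above but with stand-in data:

```python
# Illustrative parquet export, mirroring the CSV export above with stand-in data.
# Requires a parquet engine (e.g. pyarrow) to be installed.
from pathlib import Path
import pandas as pd

OUTPUT_FOLDER = Path("outputs/interim")
OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)

work_novelty_df = pd.DataFrame({"work_id": ["W1", "W2"], "novelty": [0.1, 0.9]})
export_path = OUTPUT_FOLDER / "openalex_novelty_example.parquet"

work_novelty_df.to_parquet(export_path, index=False)
round_trip = pd.read_parquet(export_path)  # reading back is symmetrical
```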
Very nice PR Karlis, I haven't had a chance to run it (my laptop is hovering over 90% CPU & memory usage unfortunately), but I can't immediately see anything wrong. A question I do have is where you plan to take this.

If you aim to recreate the paper's measure, i.e. not pairs of topics (or journals) but rather pairs of referenced topics (or journals), iterating over pandas DataFrames with iterrows is not going to work, and you'll have to resort either to parallelizable approaches, polars (no idea how that one works), or vectorizing your ops. I'm sure you've given this some thought, but I still want to flag it while we have time.

Being able to recreate the Lee 2015 measure would be nice for validation (as we can then lean on Bornmann's argument that it roughly works). It would also serve as the perfect baseline for our simpler approach (no reference papers, only within-paper topic combinations), so that we can potentially check how much we lose by ignoring citation info.
Thanks very much @ampudia19! I will aim to incorporate the easy-to-do suggestions in this PR by the end of Monday. In terms of where to take this next, I'd like to approach it iteratively, developing the minimum viable product and building on top of it as much as time allows.
- Step 1 (the present implementation): I think we can see it as an almost-ready MVP, as we can use it to spot "uncommon" combinations of taxonomy topics.
- Step 2: I suspect it will still need some filtering to remove noise (eg, low-frequency, non-interesting, random combinations of topics) and light sense-checking (eg, browsing the most and least "novel" papers and seeing if the results make sense).
- Step 3: Focus on aggregating the novelty scores to provide useful input into the dashboard (will need to discuss the details during our catch-up).
- Step 4: Apply the same pipeline to patents.
- Step 5: Only then consider improvements or changes to the novelty score. I think trying to re-implement the citation-based version would be a good bet - it would be interesting to compare the results with the topic-based version (if we can get the journal names of the cited papers). In terms of computational cost, if it's an order of magnitude increase then it might still be doable with the present implementation. The other option you suggest, using the abstracts of the cited papers and detecting topics in those, will be a bigger challenge (lots more data, more optimisation) - I'm doubtful I'll be able to complete that by the end of March...
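To make the scaling point above concrete, here is one way the within-document topic-pair counting could be vectorized with a pandas self-merge rather than iterrows; the column names and overall shape are illustrative, not the PR's actual implementation:

```python
# Illustrative vectorized pair counting; not the PR's code, column names assumed.
import pandas as pd

# Long-format input: one row per (work_id, topic)
topics_df = pd.DataFrame(
    {"work_id": ["W1", "W1", "W1", "W2", "W2"],
     "topic":   ["A", "B", "C", "A", "B"]}
)

# Self-merge on work_id to enumerate topic pairs within each document,
# then keep each unordered pair once (topic_x < topic_y).
pairs = topics_df.merge(topics_df, on="work_id")
pairs = pairs[pairs["topic_x"] < pairs["topic_y"]]

# Corpus-wide pair counts - the raw ingredient for a commonness score.
pair_counts = (
    pairs.groupby(["topic_x", "topic_y"]).size().rename("n_pairs").reset_index()
)
print(pair_counts)
```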
Hey @ampudia19 and @emily-bicks, thanks again for the quick review! I've responded to most of your comments (see below). I haven't changed from typer to argparse - if you insist, perhaps I can create a new issue and address it a bit later? If yes, then happy to merge now.
- Saving the outputs in S3 and then writing getters
- Moving the script out of the notebook directory and into pipelines
- Adding a command line option for --save-to-local to save locally
- Typing
- Missing "Returns" in docstrings
- More informative function names
- Briefly describing the method / calculation
- Saving everything as parquets

Merge away!
Closes #54
This is a first PR for the novelty measurement component of the project.
It contributes:
dap_aria_mapping/utils/novelty_utils.py
dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py
The usage examples of the script are:
All levels, full dataset (takes about 30 mins)
One level, test dataset (a few seconds)
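(The exact commands were lost in this extract; the lines below are only a hypothetical reconstruction based on the typer entry point and the parameters visible in the diff, so the flag names are assumptions rather than the PR's documented usage.)

```
# Hypothetical invocations - flag names are guesses, not the PR's documented commands
python dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py
python dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py --taxonomy-level 1 --test
```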
The outputs for each taxonomy level are:
The outputs are presently stored locally. Happy to create new issues for adding storage on S3, and any other improvements you'd like to suggest.
Also let me know if I need to write tests - I would be happy to leave that for another issue so I can move on with generating results.
For more context: the novelty score is calculated using the approach described in this paper (Lee et al 2015). This so-called "U-measure" has been shown (Bornmann et al 2019) to have some correlation with what researchers consider novel papers. However, note that I have adapted it for combinations of topics, whereas it was originally used for combinations of citations/cited journals. This will likely create some challenges (eg, citations in a way reflect the full content of the paper, whereas abstracts are much shorter).
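To sketch the calculation a bit more concretely: in my loose reading of Lee et al 2015, the commonness of a pair is its observed co-occurrence count normalised by what you would expect from the individual frequencies, and a document's novelty is minus the log of a low percentile of its pairs' commonness. The code below is only that reading, adapted to topic pairs; the exact normalisation and percentile are assumptions and not necessarily what novelty_utils.py implements.

```python
# Rough sketch of a Lee et al (2015)-style measure adapted to topic pairs.
# The normalisation and the 10th-percentile choice are assumptions.
import numpy as np
import pandas as pd

def pair_commonness(pair_counts: pd.DataFrame) -> pd.DataFrame:
    """pair_counts: columns topic_i, topic_j, year, n (co-occurrence counts).

    Assumes the table is symmetric (each pair listed in both orders), so that
    grouping on one topic column counts every pair involving that topic.
    """
    df = pair_counts.copy()
    n_t = df.groupby("year")["n"].transform("sum")                # all pairs in year t
    n_it = df.groupby(["year", "topic_i"])["n"].transform("sum")  # pairs involving topic i
    n_jt = df.groupby(["year", "topic_j"])["n"].transform("sum")  # pairs involving topic j
    df["commonness"] = df["n"] * n_t / (n_it * n_jt)
    return df

def novelty_from_commonness(commonness_scores: np.ndarray) -> float:
    """Novelty of one document: -log of a low percentile of its pairs' commonness."""
    return float(-np.log(np.percentile(commonness_scores, 10)))
```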
From Lee et al 2015:
From Bornmann et al 2019
Checklist:
- I have refactored my code out of notebooks/
- I have run pre-commit and addressed any issues not automatically fixed
- I have merged any new changes from dev
- I have updated the README where relevant