54 measuring novelty #59
Conversation
Looks great!
General comments are:
- would be really helpful to save the outputs in S3 and then write getters so we can use them to generate the data for the app
- Is there a reason we're only doing novelty on publications and not patents?
- Suggest moving the script out of the notebook directory and into pipelines
@@ -0,0 +1,81 @@
"""
Script/pipeline to calculate novelty scores for OpenAlex papers
would suggest creating pipelines/novelty for this script rather than "notebooks"
Good point, will change!
On that note, I see you are using typer - any reason for using it over argparse? I have no preference, but if you are going to build CLIs, we've been doing it exclusively with the latter (plenty of examples throughout the pipeline folder).
I haven't used argparse, but typer is super easy to use (it looks a bit simpler than argparse in terms of the amount of boilerplate code).
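For illustration only (not code from this PR), here is roughly what the same toy CLI looks like in both libraries; the function, flag, and option names are made up:

```python
# Hypothetical toy CLI, just to compare boilerplate; not taken from the PR.

# typer: options are inferred from the type-annotated function signature
import typer

def run(level: int = 3, test: bool = False):
    """Invoked as: python script.py --level 1 --test"""
    print(f"level={level}, test={test}")

# argparse: every option is declared explicitly
import argparse

def run_argparse():
    parser = argparse.ArgumentParser(description="Same CLI, argparse style")
    parser.add_argument("--level", type=int, default=3)
    parser.add_argument("--test", action="store_true")
    args = parser.parse_args()
    print(f"level={args.level}, test={args.test}")

if __name__ == "__main__":
    typer.run(run)  # or run_argparse()
```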
    Returns:
        None (saves novelty scores to file)
    """
    # Fetch OpenAlex metadata
Is there a reason we're only doing this for papers? Can we do it for patents as well?
I haven't gotten to patent data yet - I would like to keep it for another PR. I think I need to first sense-check the results for papers, see if the novelty score actually yields any interesting results and insights.
    work_novelty_df, topic_pair_commonness_df = nu.document_novelty(
        topics_df, "work_id"
    )
    # Export novelty scores
Suggest adding a command line option for --local, but otherwise save results to S3 using the nesta_ds_utils saving and loading functions.
Okay, will try to add that one in this PR.
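A rough sketch of what that switch could look like, assuming typer stays in place; the bucket name and S3 paths are placeholders, and the raw boto3 call stands in for whatever nesta_ds_utils function ends up being used:

```python
# Sketch of a --local option; bucket name and paths are placeholders, and the
# boto3 call is a stand-in for the nesta_ds_utils saving/loading helpers.
from pathlib import Path

import boto3
import pandas as pd
import typer

BUCKET = "aria-mapping-placeholder-bucket"  # assumption, not the real bucket
OUTPUT_FOLDER = Path("outputs/interim")

def save_novelty_scores(df: pd.DataFrame, filename: str, local: bool) -> None:
    """Save locally when --local is passed, otherwise upload to S3."""
    if local:
        OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)
        df.to_csv(OUTPUT_FOLDER / filename, index=False)
    else:
        tmp_path = Path("/tmp") / filename
        df.to_csv(tmp_path, index=False)
        boto3.client("s3").upload_file(str(tmp_path), BUCKET, f"outputs/novelty/{filename}")

def main(level: int = 3, local: bool = False):
    # Stand-in for the real novelty results
    work_novelty_df = pd.DataFrame({"work_id": ["W1"], "novelty": [0.5]})
    save_novelty_scores(work_novelty_df, f"openalex_novelty_{level}.csv", local)

if __name__ == "__main__":
    typer.run(main)
```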
def preprocess_topics_dict(
    topics_dict: dict,
doesn't matter too much, but we've been using Dict from typing so you can specify the types of the keys/values
I see, makes sense, will fix.
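For reference, the change just makes the signature self-documenting; the key/value types below are a guess at what the topics dictionary actually holds:

```python
# Hypothetical signature - the real key/value types depend on the topics data
from typing import Dict, List

def preprocess_topics_dict(topics_dict: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Dict[str, List[str]] tells the reader that keys are work IDs and values are topic lists."""
    return {work_id: sorted(set(topics)) for work_id, topics in topics_dict.items()}
```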
)


def novelty_score(commonness_score: float) -> float:
This function name could be a bit more specific, as it's not really calculating the full novelty score, and I was a bit confused by it alongside def document_novelty().
Thanks very much @emily-bicks, I will aim to adjust according to your comments and resubmit by the end of the week or Monday.
OUTPUT_FOLDER = PROJECT_DIR / "outputs/interim"


def main(
If you migrate your code to pipeline, I'd suggest making the function name a bit more informative.
    n_test_sample: int = 1000,
):
    """
    Calculate novelty scores for OpenAlex papers
It would be quite nice if you briefly described the method / calculation here, for code legacy's sake (rather than relying on checking the very nice PR message you wrote about it).
    # Export novelty scores
    OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)
    export_path = OUTPUT_FOLDER / f"openalex_novelty_{level}{test_suffix}.csv"
    work_novelty_df.to_csv(export_path, index=False)
We've generally saved everything as parquets. I don't think this is an issue at all, but for consistency's sake maybe it's worth considering changing it to that format.
Nice - I've never used parquet but will try
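For what it's worth, the switch is essentially a one-liner with pandas, provided a parquet engine such as pyarrow is installed; this mirrors the export step above but with stand-in data:

```python
# Illustrative parquet export, mirroring the CSV export above with stand-in data.
# Requires a parquet engine (e.g. pyarrow) to be installed.
from pathlib import Path
import pandas as pd

OUTPUT_FOLDER = Path("outputs/interim")
OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)

work_novelty_df = pd.DataFrame({"work_id": ["W1", "W2"], "novelty": [0.1, 0.9]})
export_path = OUTPUT_FOLDER / "openalex_novelty_example.parquet"

work_novelty_df.to_parquet(export_path, index=False)
round_trip = pd.read_parquet(export_path)  # reading back is symmetrical
```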
Very nice PR Karlis, I haven't had a chance to run it (my laptop is hovering over 90% CPU & memory usage unfortunately), but I can't immediately see anything wrong. A question I do have is where you plan to take this.

If you aim to recreate the paper's measure, i.e. not pairs of topics (or journals) but rather pairs of referenced topics (or journals), iterating over pandas DataFrames with iterrows is not going to work, and you'll have to resort either to parallelizable approaches, polars (no idea how that one works), or vectorizing your ops. I'm sure you've given this some thought, but I still want to flag it while we have time.

Being able to recreate the Lee 2015 measure would be nice for validation (as we can then lean on Bornmann's argument that it roughly works). It would also serve as the perfect baseline for our simpler approach (no reference papers, only within-paper topic combinations), so that we can potentially check how much we lose by ignoring citation info.
Thanks very much @ampudia19! I will aim to incorporate the easy-to-do suggestions in this PR by the end of Monday. In terms of where to take this next, I'd like to approach it iteratively, developing the minimum viable product and building on top of it as much as time allows.
- Step 1 (the present implementation): I think we can see it as an almost-ready MVP, as we can use it to spot "uncommon" combinations of taxonomy topics.
- Step 2: I suspect it will still need some filtering to remove noise (eg, low-frequency, non-interesting, random combinations of topics) and light sense-checking (eg, browsing the most and least "novel" papers and seeing if the results make sense).
- Step 3: Focus on aggregating the novelty scores to provide useful input into the dashboard (will need to discuss the details during our catch-up).
- Step 4: Apply the same pipeline to patents.
- Step 5: Only then consider improvements or changes to the novelty score. I think trying to re-implement the citation-based version would be a good bet - it would be interesting to compare the results with the topic-based version (if we can get the journal names of the cited papers). In terms of computational cost, if it's an order of magnitude increase then it might still be doable with the present implementation. The other option you suggest, using the abstracts of the cited papers and detecting topics in those, will be a bigger challenge (lots more data, more optimisation) - I'm doubtful I'll be able to complete that by the end of March...
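To make the scaling point above concrete, here is one way the within-document topic-pair counting could be vectorized with a pandas self-merge rather than iterrows; the column names and overall shape are illustrative, not the PR's actual implementation:

```python
# Illustrative vectorized pair counting; not the PR's code, column names assumed.
import pandas as pd

# Long-format input: one row per (work_id, topic)
topics_df = pd.DataFrame(
    {"work_id": ["W1", "W1", "W1", "W2", "W2"],
     "topic":   ["A", "B", "C", "A", "B"]}
)

# Self-merge on work_id to enumerate topic pairs within each document,
# then keep each unordered pair once (topic_x < topic_y).
pairs = topics_df.merge(topics_df, on="work_id")
pairs = pairs[pairs["topic_x"] < pairs["topic_y"]]

# Corpus-wide pair counts - the raw ingredient for a commonness score.
pair_counts = (
    pairs.groupby(["topic_x", "topic_y"]).size().rename("n_pairs").reset_index()
)
print(pair_counts)
```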
Hey @ampudia19 and @emily-bicks, thanks again for the quick review! I've responded to most of your comments (see below). I haven't changed from typer to argparse - if you insist, perhaps I can create a new issue and address it a bit later? If yes, then happy to merge now.
- Saving the outputs in S3 and then writing getters
- Moving the script out of the notebook directory and into pipelines
- Adding a command line option for --save-to-local to save locally
- Typing
- Missing "Returns" in docstrings
- More informative function names
- Briefly describing the method / calculation
- Saving everything as parquets

Merge away!
Closes #54
This is a first PR for the novelty measurement component of the project.
It contributes:
dap_aria_mapping/utils/novelty_utils.py
dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py
The usage examples of the script are:
All levels, full dataset (takes about 30 mins)
One level, test dataset (a few seconds)
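(The exact commands were lost in this extract; the lines below are only a hypothetical reconstruction based on the typer entry point and the parameters visible in the diff, so the flag names are assumptions rather than the PR's documented usage.)

```
# Hypothetical invocations - flag names are guesses, not the PR's documented commands
python dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py
python dap_aria_mapping/notebooks/novelty/pipeline_openalex_novelty.py --taxonomy-level 1 --test
```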
The outputs for each taxonomy level are:
The outputs are presently stored locally. Happy to create new issues for adding storage on S3, and any other improvements you'd like to suggest.
Also let me know if I need to write tests - I would be happy to leave that for another issue so I can move on with generating results.
For more context: the novelty score is calculated using the approach described in this paper (Lee et al 2015). This so-called "U-measure" has been shown (Bornmann et al 2019) to have some correlation with what researchers consider novel papers. However, note that I have adapted it for combinations of topics, whereas it was originally used for combinations of citations/cited journals. This will likely create some challenges (eg, citations in a way reflect the full content of the paper, whereas abstracts are much shorter).
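To sketch the calculation a bit more concretely: in my loose reading of Lee et al 2015, the commonness of a pair is its observed co-occurrence count normalised by what you would expect from the individual frequencies, and a document's novelty is minus the log of a low percentile of its pairs' commonness. The code below is only that reading, adapted to topic pairs; the exact normalisation and percentile are assumptions and not necessarily what novelty_utils.py implements.

```python
# Rough sketch of a Lee et al (2015)-style measure adapted to topic pairs.
# The normalisation and the 10th-percentile choice are assumptions.
import numpy as np
import pandas as pd

def pair_commonness(pair_counts: pd.DataFrame) -> pd.DataFrame:
    """pair_counts: columns topic_i, topic_j, year, n (co-occurrence counts).

    Assumes the table is symmetric (each pair listed in both orders), so that
    grouping on one topic column counts every pair involving that topic.
    """
    df = pair_counts.copy()
    n_t = df.groupby("year")["n"].transform("sum")                # all pairs in year t
    n_it = df.groupby(["year", "topic_i"])["n"].transform("sum")  # pairs involving topic i
    n_jt = df.groupby(["year", "topic_j"])["n"].transform("sum")  # pairs involving topic j
    df["commonness"] = df["n"] * n_t / (n_it * n_jt)
    return df

def novelty_from_commonness(commonness_scores: np.ndarray) -> float:
    """Novelty of one document: -log of a low percentile of its pairs' commonness."""
    return float(-np.log(np.percentile(commonness_scores, 10)))
```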
From Lee et al 2015:
From Bornmann et al 2019
Checklist:
- I have refactored my code out of notebooks/
- I have run pre-commit and addressed any issues not automatically fixed
- I have merged any new changes from dev
- I have updated the README where relevant