
Implement automated storing to db/backend #70

Open
Tracked by #68
jpwahle opened this issue Nov 2, 2021 · 4 comments

Labels: enhancement (Pull Request: A new feature)

Comments

jpwahle (Owner) commented Nov 2, 2021

Is your feature request related to a problem? Please describe.
We need to store author, venue, and publication data into our backend automatically when the next d3 version is released.

Describe the solution you'd like
Implement a backend class that (a rough sketch is included below):

  • creates papers and authors
  • updates papers and authors
  • deletes papers and authors

Additional context
https://www.mongodb.com/docs/database-tools/mongoimport/#std-label-ex-mongoimport-merge
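
For illustration, a minimal sketch of what such a backend client could look like. This is a sketch only: the endpoint paths (/papers, /authors), the use of requests, and the bearer-token authentication are assumptions, not the actual cs-insights-backend API.

```python
# Hypothetical sketch of a backend client; endpoint paths and payload
# shapes are assumptions, not the actual cs-insights-backend API.
import requests


class BackendClient:
    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers.update({"Authorization": f"Bearer {token}"})

    def create_paper(self, paper: dict) -> dict:
        # POST a new paper document to the backend.
        resp = self.session.post(f"{self.base_url}/papers", json=paper)
        resp.raise_for_status()
        return resp.json()

    def update_paper(self, paper_id: str, fields: dict) -> dict:
        # PATCH only the changed fields of an existing paper.
        resp = self.session.patch(f"{self.base_url}/papers/{paper_id}", json=fields)
        resp.raise_for_status()
        return resp.json()

    def delete_paper(self, paper_id: str) -> None:
        # DELETE a paper that no longer appears in the d3 dump.
        self.session.delete(f"{self.base_url}/papers/{paper_id}").raise_for_status()

    # Authors would follow the same create/update/delete pattern
    # against an /authors resource.
```

Whether updates go through a REST client like this or through a bulk mongoimport run in merge/upsert mode (as in the docs linked above) is a design decision for this issue.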

jpwahle added the Epic (Larger stories of issues with sub-issues) and enhancement (Pull Request: A new feature) labels and removed the Epic label Nov 2, 2021
jpwahle self-assigned this Nov 6, 2021

github-actions bot commented Nov 6, 2021

Branch issue-70 created!

jpwahle assigned alexandertv and unassigned jpwahle Nov 16, 2021
jpwahle assigned jpwahle and unassigned alexandertv Nov 21, 2021

jpwahle (Owner) commented Aug 23, 2022

The final layer missing here is the automatic update of the backend.
One of the main issues is that more than 6 million queries would have to be sent to the backend just to check whether each paper already exists and needs to be updated or inserted.

One solution would be to hash all papers and let the backend return a list of all hashes, which the crawler can compare against without sending any further requests. The crawler can then decide what to update or insert, which results in only a few requests per update.
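
A rough sketch of that hash-comparison idea (hypothetical: the /papers/hashes endpoint, the hashed fields, and the paper id key are all assumptions):

```python
# Hypothetical sketch of the hash-based diffing idea; the /papers/hashes
# endpoint and the hashed fields are assumptions, not an existing API.
import hashlib
import json

import requests


def paper_hash(paper: dict) -> str:
    # Hash only the fields that matter for detecting changes.
    relevant = {k: paper.get(k) for k in ("title", "venue", "year", "authors")}
    blob = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def plan_sync(local_papers: list[dict], backend_url: str) -> tuple[list, list]:
    # One bulk request: the backend returns {paper_id: hash} for everything it stores.
    remote_hashes = requests.get(f"{backend_url}/papers/hashes").json()

    to_insert, to_update = [], []
    for paper in local_papers:
        h = paper_hash(paper)
        remote = remote_hashes.get(paper["id"])
        if remote is None:
            to_insert.append(paper)   # unknown to the backend
        elif remote != h:
            to_update.append(paper)   # known but changed
        # equal hashes: nothing to send
    return to_insert, to_update
```

With something like this, the backend answers a single bulk request for the hashes, and the crawler only sends the papers in to_insert/to_update instead of one existence check per paper.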

jpwahle mentioned this issue Aug 23, 2022

trannel (Contributor) commented Aug 23, 2022

So far we have used the code I created on the branch https://github.com/gipplab/cs-insights-crawler/tree/data-upload-full, in the file upload/d3_full.py. There might be some helpful things in there for this issue.

jpwahle changed the title from "Implement backend client" to "Implement automated storing to db/backend" Sep 12, 2022
jpwahle assigned muhammadtalha242 and unassigned jpwahle Sep 18, 2023

jpwahle (Owner) commented Sep 18, 2023

@muhammadtalha242 Is the new data ingestion through SemanticScholar ready yet?

Projects
Status: 🏗 In progress