This repository has been archived by the owner on May 11, 2024. It is now read-only.

Automatic enrichment #66

Open
nileshtrivedi opened this issue Feb 23, 2023 · 8 comments

Comments

@nileshtrivedi
Contributor

Starting from nothing but a URL, we need tooling to automatically determine:

  • Media type (format), such as article, course, video, podcast, game, etc.
  • Topics
  • Creators
  • Expert Reviews
  • Related Items
  • Description
  • Quality Tags
  • Image
  • Year
  • Identifiers (e.g. ISBN)
  • Rating
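
For illustration, an enriched record with these attributes filled in might look something like the following. The field names and values here are made up for the example, not the repository's actual schema:

```js
{
  "url": "https://example.com/intro-to-algorithms",
  "format": "course",
  "topics": ["computer-science", "algorithms"],
  "creators": ["Jane Doe"],
  "description": "A beginner-friendly algorithms course.",
  "year": 2021,
  "identifiers": { "isbn": null },
  "image": "https://example.com/cover.png",
  "quality_tags": ["interactive"],
  "rating": null,
  "related": [],
  "expert_reviews": []
}
```
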
@nileshtrivedi
Contributor Author

nileshtrivedi commented Oct 7, 2023

Here is how it could work:

  • Create a public API that takes the URL of a learning resource and, optionally, an OpenAI API key
  • Use GPT on the page contents to determine the above metadata. Media type and Topics are the only mandatory fields.
  • Programmatically create a commit with this change in your fork
  • Raise a pull request in this repository so that it can be reviewed, merged and deployed on learnawesome.org. Preventing spam and keeping a high-quality bar is an important goal of this project. In any case, you can maintain your own database as you like.
  • This API can then be called by a form, a browser extension, a Discord/Slack bot, etc. (see the sketch below).

This is a realistic approach until somebody invents a "GitHub for Datasets".
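
For illustration, a client (a form, browser extension, or Discord/Slack bot) could call such an endpoint roughly like this. The endpoint path, parameter names, and response shape are assumptions for the sketch, not an existing API:

```js
// Hypothetical client call; /api/enrich and the field names are placeholders.
async function enrich(url, openaiKey) {
  const res = await fetch("https://learnawesome.org/api/enrich", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, openai_key: openaiKey }),
  });
  return res.json(); // e.g. { format: "course", topics: ["programming"], ... }
}
```
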

@Maria-Aidarus
Contributor

Hello, I am working on a university project and would like to try to solve this issue. Could you please assign it to me?

@nileshtrivedi
Contributor Author

@Maria-Aidarus Done. DM me if you'd like to get familiar with the codebase. Can give you a walkthrough.

@nileshtrivedi
Contributor Author

For potential contributors:

  • This requires an API to be created on the server. To keep the infra minimal, we can implement this as a Netlify Function, which lets us use all of NodeJS's capabilities. Cloudflare Workers is another option, but it's more complex because it is not a standard NodeJS environment.

  • This will be implemented as an API that takes two parameters: a URL and an OpenAI API key (see the sketch after this list).

  • First, it obtains the contents of the webpage. This can be done with web-scraping services like ScrapeNinja or Browserless.

  • These contents are simplified and sent to GPT to infer two values: media type (e.g. whether the webpage represents a book, a video, a course, etc.) and topics.

  • Another potential approach is to take a screenshot of the page and send it to the GPT-4 Vision model.

  • The format must be one of these: https://github.com/learn-awesome/learndb/blob/main/src/formats.js

  • Topics can be one of these: https://github.com/learn-awesome/learndb/blob/main/db/topics.json
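
Here is a minimal sketch of what such a Netlify Function could look like, assuming Node 18+ (global fetch) and the classic handler signature. The file name, request shape, model choice, and the hard-coded format/topic subsets are placeholders: in practice the allowed values would be loaded from src/formats.js and db/topics.json, and a scraping service would fetch the page instead of a plain fetch.

```js
// netlify/functions/enrich.js -- illustrative sketch only.

// Small illustrative subsets; the real lists live in src/formats.js and db/topics.json.
const FORMATS = ["article", "book", "video", "course", "podcast", "game"];
const TOPICS = ["programming", "mathematics", "physics", "history", "design"];

exports.handler = async (event) => {
  const { url, openai_key } = JSON.parse(event.body || "{}");
  if (!url || !openai_key) {
    return { statusCode: 400, body: JSON.stringify({ error: "url and openai_key are required" }) };
  }

  // 1. Obtain the page contents. A scraping service (ScrapeNinja, Browserless, ...)
  //    would be more robust for JS-heavy pages; a plain fetch keeps the sketch simple.
  const html = await (await fetch(url)).text();

  // Crude simplification: strip scripts and tags, then truncate to keep the prompt small.
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .slice(0, 8000);

  // 2. Ask GPT to classify, constrained to the known formats and topics.
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${openai_key}`,
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo",
      messages: [
        {
          role: "system",
          content:
            `Classify this learning resource. Reply with JSON like {"format": "...", "topics": ["..."]}. ` +
            `"format" must be one of: ${FORMATS.join(", ")}. ` +
            `"topics" may only contain: ${TOPICS.join(", ")}.`,
        },
        { role: "user", content: text },
      ],
    }),
  });
  const data = await response.json();

  // The model's reply should be JSON; a real implementation would guard this parse.
  const metadata = JSON.parse(data.choices[0].message.content);

  return { statusCode: 200, body: JSON.stringify({ url, ...metadata }) };
};
```
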

You can skip the other attributes for now. Just extracting these two attributes with high quality will be a good contribution.

There is some complexity involved in keeping the topic taxonomy clean. This may be achievable with some prompt engineering plus a validation step, as sketched below.
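
One possible validation step on top of the prompt: check the model's suggested topics against db/topics.json and drop anything that isn't in the taxonomy. The field names below (slug/name) are assumptions about that file's structure:

```js
// Hypothetical post-processing: keep only topics that already exist in the taxonomy.
const topics = require("./db/topics.json");

const allowed = new Set(
  topics.map((t) => String(t.slug || t.name || t).toLowerCase())
);

function cleanTopics(suggested) {
  return suggested
    .map((t) => t.trim().toLowerCase())
    .filter((t) => allowed.has(t));
}
```
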

@rama0711

Hello, I am working with Maria. Could you please assign it to me too?

@aishaalsubaie
Contributor

Hello, could you assign this to me, please?

@Skultrix
Contributor

Hi, I'm working with Maria too. Can you assign me to this? Thanks.

@hjoad
Contributor

hjoad commented Nov 18, 2023

Hello, can you please assign me as well?
