This repository has been archived by the owner on May 11, 2024. It is now read-only.

Automatic enrichment #66

Open
nileshtrivedi opened this issue Feb 23, 2023 · 8 comments

Comments

@nileshtrivedi
Contributor

Starting from nothing but a URL, we need tooling to automatically determine:

  • Media type (format), such as article, course, video, podcast, game, etc.
  • Topics
  • Creators
  • Expert Reviews
  • Related Items
  • Description
  • Quality Tags
  • Image
  • Year
  • Identifiers (e.g. ISBN)
  • Rating
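
For illustration, an enriched record with these attributes filled in might look something like the following. The field names and values here are made up for the example, not the repository's actual schema:

```js
{
  "url": "https://example.com/intro-to-algorithms",
  "format": "course",
  "topics": ["computer-science", "algorithms"],
  "creators": ["Jane Doe"],
  "description": "A beginner-friendly algorithms course.",
  "year": 2021,
  "identifiers": { "isbn": null },
  "image": "https://example.com/cover.png",
  "quality_tags": ["interactive"],
  "rating": null,
  "related": [],
  "expert_reviews": []
}
```
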
@nileshtrivedi
Contributor Author

nileshtrivedi commented Oct 7, 2023

Here is how it could work:

  • Create a public API that takes the URL of a learning resource and, optionally, an OpenAI API key
  • Use GPT on the page contents to determine the above metadata. Media type and Topics are the only mandatory fields.
  • Programmatically create a commit with this change in your fork
  • Raise a pull request in this repository so that it can be reviewed, merged and deployed on learnawesome.org. Preventing spam and keeping a high-quality bar is an important goal of this project. In any case, you can maintain your own database as you like.
  • This API can then be called by a form, a browser extension, a Discord/Slack bot, etc. (see the sketch below).

This is a realistic approach until somebody invents a "GitHub for Datasets".
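
For illustration, a client (a form, browser extension, or Discord/Slack bot) could call such an endpoint roughly like this. The endpoint path, parameter names, and response shape are assumptions for the sketch, not an existing API:

```js
// Hypothetical client call; /api/enrich and the field names are placeholders.
async function enrich(url, openaiKey) {
  const res = await fetch("https://learnawesome.org/api/enrich", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, openai_key: openaiKey }),
  });
  return res.json(); // e.g. { format: "course", topics: ["programming"], ... }
}
```
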

@Maria-Aidarus
Contributor

Hello, I am working on a university project and would like to try to solve this issue. Could you please assign it to me?

@nileshtrivedi
Contributor Author

@Maria-Aidarus Done. DM me if you'd like to get familiar with the codebase. Can give you a walkthrough.

@nileshtrivedi
Contributor Author

For potential contributors:

  • This requires an API to be created on the server. To keep the infra minimal, we can implement this as a Netlify Function, which lets us use all of NodeJS's capabilities. Cloudflare Workers is another option, but it's more complex because it is not a standard NodeJS environment.

  • This will be implemented as an API that takes two parameters: a URL and an OpenAI API key (see the sketch after this list).

  • First, it obtains the contents of the webpage. This can be done with web-scraping services like ScrapeNinja or Browserless.

  • These contents are simplified and sent to GPT to infer two values: media type (e.g. whether the webpage represents a book, a video, a course, etc.) and topics.

  • Another potential approach is to take a screenshot of the page and send it to the GPT-4 Vision model.

  • The format must be one of these: https://github.com/learn-awesome/learndb/blob/main/src/formats.js

  • Topics can be one of these: https://github.com/learn-awesome/learndb/blob/main/db/topics.json
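
Here is a minimal sketch of what such a Netlify Function could look like, assuming Node 18+ (global fetch) and the classic handler signature. The file name, request shape, model choice, and the hard-coded format/topic subsets are placeholders: in practice the allowed values would be loaded from src/formats.js and db/topics.json, and a scraping service would fetch the page instead of a plain fetch.

```js
// netlify/functions/enrich.js -- illustrative sketch only.

// Small illustrative subsets; the real lists live in src/formats.js and db/topics.json.
const FORMATS = ["article", "book", "video", "course", "podcast", "game"];
const TOPICS = ["programming", "mathematics", "physics", "history", "design"];

exports.handler = async (event) => {
  const { url, openai_key } = JSON.parse(event.body || "{}");
  if (!url || !openai_key) {
    return { statusCode: 400, body: JSON.stringify({ error: "url and openai_key are required" }) };
  }

  // 1. Obtain the page contents. A scraping service (ScrapeNinja, Browserless, ...)
  //    would be more robust for JS-heavy pages; a plain fetch keeps the sketch simple.
  const html = await (await fetch(url)).text();

  // Crude simplification: strip scripts and tags, then truncate to keep the prompt small.
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .slice(0, 8000);

  // 2. Ask GPT to classify, constrained to the known formats and topics.
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${openai_key}`,
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo",
      messages: [
        {
          role: "system",
          content:
            `Classify this learning resource. Reply with JSON like {"format": "...", "topics": ["..."]}. ` +
            `"format" must be one of: ${FORMATS.join(", ")}. ` +
            `"topics" may only contain: ${TOPICS.join(", ")}.`,
        },
        { role: "user", content: text },
      ],
    }),
  });
  const data = await response.json();

  // The model's reply should be JSON; a real implementation would guard this parse.
  const metadata = JSON.parse(data.choices[0].message.content);

  return { statusCode: 200, body: JSON.stringify({ url, ...metadata }) };
};
```
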

You can skip the other attributes for now. Just extracting these two attributes with high quality will be a good contribution.

There is some complexity involved in keeping the topic taxonomy clean. This may be achievable with some prompt engineering plus a validation step, as sketched below.
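
One possible validation step on top of the prompt: check the model's suggested topics against db/topics.json and drop anything that isn't in the taxonomy. The field names below (slug/name) are assumptions about that file's structure:

```js
// Hypothetical post-processing: keep only topics that already exist in the taxonomy.
const topics = require("./db/topics.json");

const allowed = new Set(
  topics.map((t) => String(t.slug || t.name || t).toLowerCase())
);

function cleanTopics(suggested) {
  return suggested
    .map((t) => t.trim().toLowerCase())
    .filter((t) => allowed.has(t));
}
```
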

@rama0711

Hello, I am working with Maria. Could you please assign it to me too?

@aishaalsubaie
Contributor

Hello, could you assign this to me, please?

@Skultrix
Contributor

Hi, I'm working with Maria too. Can you assign me to this? Thanks.

@hjoad
Contributor

hjoad commented Nov 18, 2023

Hello, can you please assign me as well?
