Prototype: sitemap to index #4

Open
fsteeg opened this issue Feb 10, 2020 · 2 comments

@fsteeg fsteeg commented Feb 10, 2020

Following up on #2 (comment): Given a sitemap URL like https://www.hoou.de/sitemap.xml, a URL prefix like https://www.hoou.de/materials/, an Elasticsearch index location URL like http://localhost:9200, a Flux like:

open-http | decode-html | fix | encode-json

And a Fix like:

map(html.head.title.value, title)
map(html.body.div.div.div.div.div.div.div.p.value, description)

We want to:

  • Convert all resources listed in the given sitemap whose <loc> URL starts with the given prefix
  • Index the results in Elasticsearch at the given index location
  • Return a sample query against the used index
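To make the first step concrete, here is a minimal Python sketch of the prefix filter, assuming the sitemap follows the standard sitemaps.org schema. The function name list_sitemap mirrors the module idea discussed in this thread but is hypothetical here, and the sample URLs under the prefix are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap(sitemap_xml: str, prefix: str) -> list:
    """Return all <loc> URLs in the sitemap that start with the given prefix."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url.startswith(prefix):
            urls.append(url)
    return urls


# Made-up sample data; the real sitemap would be fetched over HTTP.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.hoou.de/materials/example-a</loc></url>
  <url><loc>https://www.hoou.de/about</loc></url>
  <url><loc>https://www.hoou.de/materials/example-b</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in list_sitemap(SAMPLE_SITEMAP, "https://www.hoou.de/materials/"):
        print(url)
```

Fetching the sitemap and the listed pages is left out on purpose so the filter can be tried without network access.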

The long-term idea is to do all of this in a single new UI. For this prototype, I suggest we provide the config parameters (sitemap, prefix, index, flux, fix) in some kind of config file and run the workflow from the command line. The Flux and the Fix can be tested in http://test.lobid.org/fix.
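One possible shape for such a config file, sketched in Python with JSON as the format. The format, the key names (taken from the parameter list above), and the load_config helper are all assumptions for illustration, not a decided design.

```python
import json

# The five parameters named in this issue; using them as JSON keys is an assumption.
REQUIRED_KEYS = ("sitemap", "prefix", "index", "flux", "fix")


def load_config(text: str) -> dict:
    """Parse a JSON config and fail early if a parameter is missing."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing config keys: " + ", ".join(missing))
    return config


# Example config built from the values given in this issue.
EXAMPLE_CONFIG = json.dumps({
    "sitemap": "https://www.hoou.de/sitemap.xml",
    "prefix": "https://www.hoou.de/materials/",
    "index": "http://localhost:9200",
    "flux": "open-http | decode-html | fix | encode-json",
    "fix": "map(html.head.title.value, title)\n"
           "map(html.body.div.div.div.div.div.div.div.p.value, description)",
}, indent=2)

if __name__ == "__main__":
    config = load_config(EXAMPLE_CONFIG)
    print(config["sitemap"], "->", config["index"])
```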

@fsteeg fsteeg commented Feb 10, 2020

Basically, using Flux as the config file for all the parameters would look something like this:

"https://www.hoou.de/sitemap.xml"
| list-sitemap(prefix="https://www.hoou.de/materials/")
| open-http
| decode-html
| fix("
  map(html.head.title.value, title)
  map(html.body.div.div.div.div.div.div.div.p.value, description)")
| encode-json
| index-elasticsearch("http://localhost:9200/")

This would require new list-sitemap and index-elasticsearch Metafacture modules.

@acka47 acka47 added this to Backlog in OER-Index via automation Feb 10, 2020
@fsteeg fsteeg commented Feb 13, 2020

We just discussed this as a fast way to a first index:

  • Set up a program to index resources from https://www.hoou.de/sitemap.xml
  • Hard-code config into the program (final config should happen in Flux as described above)
  • Create JSON containing the URL, title, description, and license, and index it in Elasticsearch
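The last step could be sketched roughly as follows in Python: build one JSON document per resource, serialize the batch in the newline-delimited format that the Elasticsearch bulk endpoint (POST /_bulk, Content-Type application/x-ndjson) expects, and build a sample query URL. The index name "oer", the sample document values, and both helper names are hypothetical; typeless bulk actions as used here assume Elasticsearch 7 or later.

```python
import json
from urllib.parse import quote


def to_bulk_body(docs: list, index_name: str) -> str:
    """Serialize documents as NDJSON: one action line, then one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""


def sample_query_url(index_location: str, index_name: str, field: str, term: str) -> str:
    """Build a URI search request against the given index (q=field:term)."""
    query = quote(field + ":" + term)
    return index_location.rstrip("/") + "/" + index_name + "/_search?q=" + query


if __name__ == "__main__":
    # Made-up document; real values would come from the converted HTML pages.
    docs = [{
        "url": "https://www.hoou.de/materials/example-a",
        "title": "Example title",
        "description": "Example description",
        "license": "example-license",
    }]
    print(to_bulk_body(docs, "oer"), end="")
    print(sample_query_url("http://localhost:9200/", "oer", "title", "example"))
```

Sending the body is deliberately omitted; any HTTP client can post it to the index location.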
@fsteeg fsteeg self-assigned this Feb 13, 2020
@fsteeg fsteeg moved this from Backlog to Working in OER-Index Feb 13, 2020
fsteeg added a commit to metafacture/metafacture-fix that referenced this issue Feb 14, 2020
To depend on Metafix from hbz/oerindex#4
fsteeg added a commit that referenced this issue Feb 14, 2020
fsteeg added a commit that referenced this issue Feb 14, 2020
fsteeg added a commit that referenced this issue Feb 14, 2020
See #4