Build pages from data source #5074
Currently, Hugo handles internal and external data sources with getJSON/getCSV, which is great for using a data source in a template.
But Hugo cannot take a data set of items and build a page for each of them, plus related list pages, the way it does from the content directory files.
Here is a fresh start on speccing out this important step in the roadmap.
As a user, I can only see the configuration aspect of the task.
I don’t see many configuration-related issues except for the mapping of the keys/values collected from the data source and the obvious external or internal endpoint of the data set. The following are suggestions for how users could manage those configurations, followed by a code block example.
Depending on the use case, there may be a need for one or several URLs/paths.
For many projects, not every page type (post, page, etc…) will be built from the same source. The type could be defined from a data source key or as a source parameter.
I suppose there could be other parameters per source.
Front Matter mapping
Users must be able to map the keys from the data source to Hugo’s commonly used front matter variables (title, permalink, slug, taxonomies, etc…).
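To make the mapping idea concrete, here is a rough Go sketch of applying a front-matter-key → source-path mapping to one JSON-decoded item, with dotted paths for nested values. The helper names (`lookup`, `applyMapping`) are hypothetical, not Hugo's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// lookup walks a dotted path such as "post_meta.city" through a
// JSON-decoded item. It returns nil when any segment is missing.
func lookup(item map[string]interface{}, path string) interface{} {
	var cur interface{} = item
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil
		}
		cur = m[key]
	}
	return cur
}

// applyMapping builds front matter from a data item using a
// front-matter-key -> source-path mapping, as in the config idea above.
func applyMapping(item map[string]interface{}, mapping map[string]string) map[string]interface{} {
	fm := make(map[string]interface{})
	for fmKey, srcPath := range mapping {
		if v := lookup(item, srcPath); v != nil {
			fm[fmKey] = v
		}
	}
	return fm
}

func main() {
	item := map[string]interface{}{
		"post_title": "Lovely loft for sale",
		"post_meta":  map[string]interface{}{"city": "Montreal"},
	}
	fm := applyMapping(item, map[string]string{
		"Title":                "post_title",
		"Params.location.city": "post_meta.city",
	})
	fmt.Println(fm["Title"], fm["Params.location.city"])
}
```

Keys absent from the source would simply be skipped, leaving Hugo's defaults to apply.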
This is a realtor agency, a branch of a bigger one.
Their pages are built with Hugo's local Markdown files.
They have an old WordPress site whose 100+ blog posts they did not want to convert to Markdown, so they load those blog posts from a local data file on top of Hugo's own local Markdown posts.
They use a third-party service to create job posts when they need to fill a new position. They want to host those job listings on their site, though. Their jobs are served by
The most important part of the website is their realty listings. They add their listings to their parent company's own website, whose API in turn serves those at
```yaml
title: George and Son (A MTL Realtors Agency)
dataSources:
  - source: data/old_site_posts.json
    contentPath: blog
    mapping:
      Title: post_title
      Date: post_date
      Type: post_type
      Content: post_content
      Params.location.city: post_meta.city
      Params.location.country: post_meta.country
  - source: https://ourjobs.com/api/client/george-and-son/jobs.json
    contentPath: jobs
    mapping:
      Title: job_label
      Content: job_description
  - source: https://api.mtl-realtors/listings/?branch=george-and-son&status=available
    contentPath: listings/:Type/
    grabAllFrontMatter: true
    mapping:
      Type: amenity_kind
      Title: name
      Content: description
      Params.neighbourhood: geo.neighbour
      Params.city: geo.city
```
This results in a content "shadow" structure: solid-line dirs/files are local, while dashed ones are remote.
Thanks for starting this discussion. I suspect we have to go some rounds on this to get to where we want.
Yes, we need field mapping. But when I thought about this problem, I imagined something more than a 1:1 mapping between an article with a permalink and some content in Hugo. I have thought about it as content adapters. I think it even helps to think of the current filesystem as a filesystem Hugo content adapter.
So, if this is how it looks on disk:
```
content
├── _index.md
├── blog
│   └── first-post
│       ├── index.md
│       └── sunset.jpg
└── logo.png
```
What would the above look like if the data source was JSON or XML? Or even WordPress?
It should, of course, be possible to set the URL "per post" (like it is in content files), but it should also be possible to be part of the content section tree with permalink config per section, translations, etc. So, when you have one content dir plus some other data sources, it ends up as one merged view.
As most data sources are usually flat lists of items, I suppose building the content directory structure will require some more mapping.
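One way to turn a flat list into a directory structure is to let the `contentPath` pattern (like `listings/:Type/` in the config example) contain placeholders that are expanded per item. A minimal Go sketch, with a hypothetical `expandContentPath` helper that is not Hugo's actual permalink code:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// expandContentPath fills :Field placeholders in a contentPath pattern
// (e.g. "listings/:Type/") with values from an item's mapped front
// matter, lowercased the way Hugo slugs usually are.
func expandContentPath(pattern string, fm map[string]string) string {
	segs := strings.Split(strings.Trim(pattern, "/"), "/")
	for i, s := range segs {
		if strings.HasPrefix(s, ":") {
			segs[i] = strings.ToLower(fm[strings.TrimPrefix(s, ":")])
		}
	}
	return path.Join(segs...)
}

func main() {
	fm := map[string]string{"Type": "Loft", "Title": "Sunny corner unit"}
	fmt.Println(expandContentPath("listings/:Type/", fm)) // listings/loft
}
```

Each distinct expanded segment would then also become a section, giving the list pages for free.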
There are the
I suppose there is no way around having many source configuration params/mapping which Hugo may need to best adapt the data source to the desired structure. Maybe even having to use some pattern/regex/glob to best adapt those like the
As for default structure. If there is no configured data source with a type parameter of
@bep now I understand more fully what you meant (I think). The config needs to tell Hugo how to model the content structure so it can build its pages from that.
To reflect this here I added to the desc a better project example to illustrate both configuration possibilities and the resulting "content" structure.
@regisphilibert I have been thinking about this, and I think the challenge with all of this isn't the mapping (we can experiment until we get a "working and good looking scheme"), but more the practical workflow -- esp. how to handle state/updates.
I understand that in a dynamic world with JS APIs etc., the above will not be entirely true, always. But it should be a core requirement whenever possible.
A person in another thread mentioned GatsbyJS's create-source-plugin.
I don't think their approach of emulating the file system is a good match for Hugo, but I'm more curious about how they pull in data.
This is me guessing a little, but if I commit my GatsbyJS with some
Given the above assumptions, the Gatsby approach does not meet the "static content" criteria above. I'm not sure how they can assure that the data is "100% accurate", but the important part here is that you have no way of knowing if the source has changed.
So, I was tinkering about:
The output of 2) is what we use to build the final site.
There are probably some practical holes in the above. But there are several upsides.
I'm not sure about this. And I apologize in advance if some of my lack of understanding of the technology/feature at hand biases my view.
I guess most of the use cases for this will be using Contentful, the WordPress REST API, or Firebase to manage your content, and letting Hugo build the site from this remote source plus maybe a few other ones (remote and local).
But this does not change the fact that we need caching and a way to tell the difference between the cached source and the remote one efficiently.
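One cheap way to tell whether a cached source still matches the remote one (assuming the API offers no ETag/Last-Modified header to lean on) is to fingerprint the response body and compare it with the hash stored from the previous build. A minimal sketch, with made-up function names:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint returns a stable SHA-256 hex digest of a response body,
// suitable for storing next to the cached copy of a data source.
func fingerprint(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

// sourceChanged compares a freshly fetched body against the hash
// recorded on the previous build; an empty lastHash means "never fetched".
func sourceChanged(body []byte, lastHash string) bool {
	return lastHash == "" || fingerprint(body) != lastHash
}

func main() {
	body := []byte(`[{"post_title": "Hello"}]`)
	h := fingerprint(body)
	fmt.Println(sourceChanged(body, h), sourceChanged([]byte("changed"), h)) // false true
}
```

This still costs a full fetch per check, so it only solves the "did it change" question, not the "when to check" one.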
To handle the "when", by which I mean the decision between calling the remote source or using the cached one, I was thinking about a per-source setting indicating how often it should be checked.
I'm not sure I understand the process described with
My talk about "database etc." clutters the discussion. This process cannot be stateless/dumb, was my main point. With 10 remote resources, some of them possibly out of your control, you (or I) would want some kind of control over:
None of the above allows for a simple "pull and push". So, if you do your builds on a CI server (Netlify), but do your editing on your local PC, that state must be handled somehow so Netlify knows ... what. Note that the answer to 1) and 2) could possibly be to "publish everything, always", if that's your cup of tea.
Yeah, maybe some people want it or default to it, but offering more control is definitely a must-have, I think.
True, but I didn't really see it as Hugo's business. In my mind, a CI pipeline would have to be put in place on top of Hugo.
Or a simple cron job (I don't know what to call those in the modern JAMstack) could be set up so the website is built every hour with
OK, I'm possibly overthinking it (at least for a v1 of this). But for the above to work at speed and for big sites, you need a proper cache you can depend on. I notice the GatsbyJS WordPress plugin saying that "this should work for any number of posts", but if you want this to work for your 10K WP blog, you really need to avoid pulling down everything all the time. I will investigate this vs Netlify and CircleCI.
Yes. Time is of the essence!
And this is precisely why big content projects want to turn to Hugo.
After spending some time playing with the friendly competition and its data source solutions.
3 will be unique to each project and potentially to each source.
We could group the settings of 1 and 2 into one Data Source Type (DST).
This way any DST could be potentially:
Rough example of DataSourceType/DataSources settings:
```yaml
DataSourceTypes:
  - name: wordpress
    endpoint_base: wp-json/v2/
    endpoints: ['posts', 'page', 'listings']
    pagination: true
    pagination_param: page=:page
    [...]
DataSources:
  - source: https://api.wordpress.blog.com/
    type: wordpress
    contentPath: blog/
    [...]
```
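Under such a DST scheme, a fetcher could assemble paginated endpoint URLs from the `endpoint_base` and `pagination_param` settings. A Go sketch of that assembly, with the parameter semantics assumed rather than specified anywhere:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageURL assembles one paginated request URL from DST settings,
// substituting :page in pagination_param with the page number.
func pageURL(source, endpointBase, endpoint, paginationParam string, page int) string {
	param := strings.Replace(paginationParam, ":page", strconv.Itoa(page), 1)
	return strings.TrimRight(source, "/") + "/" + endpointBase + endpoint + "?" + param
}

func main() {
	u := pageURL("https://api.wordpress.blog.com/", "wp-json/v2/", "posts", "page=:page", 2)
	fmt.Println(u) // https://api.wordpress.blog.com/wp-json/v2/posts?page=2
}
```

The fetch loop itself would then just increment the page number until the API returns an empty result set.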
I wanted to throw this into the discussion because it demonstrates how I generated temporary .md files from two merged sets of JSON data (Google Sheets API). These .md files are only generated and used during compilation and are never saved into the repository.
This is a fairly simple script, but you can see that I needed to do some filtering of the data source and map the two source JSON data sets to front matter parameters per page.
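The core of such a generator script, rendering one temporary .md file per item with its mapped front matter, can be sketched in Go like this (field names made up, output format assumed to be YAML front matter):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderPage serializes mapped front matter plus content into the
// .md text a generator script would write to disk before the build.
func renderPage(fm map[string]string, content string) string {
	keys := make([]string, 0, len(fm))
	for k := range fm {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output between builds

	var b strings.Builder
	b.WriteString("---\n")
	for _, k := range keys {
		fmt.Fprintf(&b, "%s: %q\n", k, fm[k])
	}
	b.WriteString("---\n\n")
	b.WriteString(content)
	return b.String()
}

func main() {
	page := renderPage(map[string]string{"title": "Open house", "city": "Montreal"}, "Join us Sunday.")
	fmt.Println(page)
}
```

Writing the result with `os.WriteFile` into `content/<section>/<slug>.md` before running `hugo` is then all the "shadow content" plumbing such a script needs.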