Improve populate tools #127
* add normalizeListMarkers function with pre-compiled regex
* process markdown after htmlToMarkdown, before normalizeImageURLs
* preserve indentation for nested lists
* only affect line-start list markers, not mid-sentence hyphens
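The list-marker normalization described in this commit could look roughly like the sketch below. It is not the code from the PR: the regex and the function body are assumptions that only follow the stated behaviour (line-start hyphens become asterisks, indentation is kept, mid-sentence hyphens are left alone).

```go
package main

import (
	"fmt"
	"regexp"
)

// listMarkerRE is compiled once at package level. It matches a hyphen bullet
// at the start of a line: optional indentation (group 1), a single "-",
// then at least one space or tab (group 2). The pattern is an assumption.
var listMarkerRE = regexp.MustCompile(`(?m)^([ \t]*)-([ \t]+)`)

// normalizeListMarkers rewrites "-" bullets to "*" bullets and keeps the
// original indentation and spacing, so nested lists stay nested.
func normalizeListMarkers(md string) string {
	return listMarkerRE.ReplaceAllString(md, "${1}*${2}")
}

func main() {
	in := "- top\n  - nested\nwell-known - not a list\n---\n"
	fmt.Print(normalizeListMarkers(in))
}
```

A pattern this simple does not know about fenced code blocks, so a line starting with "- " inside a fence would be rewritten too; whether the real function guards against that is not visible from this conversation.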
* add commitDateOptions struct for author/committer dates
* include dates field in createFileRequest with RFC3339 format
* ensures commit timestamp reflects when the tool runs
Pull request overview
This PR enhances the wiki2md/article-creator tooling so that Wikipedia-derived Markdown is normalized for Forkana (list markers, internal links) and initial repository commits carry explicit timestamps, while dropping YAML front matter from generated articles.
Changes:
- Add normalizeListMarkers to convert hyphen-based unordered list items to asterisk markers while preserving indentation and leaving non-list hyphens untouched.
- Add normalizeInternalLinks, stripLinkTitle, and extractWikiArticleName to rewrite various Wikipedia-style internal links into /:root/subject/... URLs, with comprehensive unit tests for list and link normalization.
- Update article-creator to send commit date metadata via a dates field when creating README files, adjust the initial commit message, and update copyrights.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| custom/services/wiki2md/main.go | Integrates list marker normalization and internal Wikipedia link rewriting into the article processing pipeline, and comments out front matter injection while adding the supporting regex/utility functions. |
| custom/services/wiki2md/main_test.go | Adds unit tests for normalizeListMarkers, normalizeInternalLinks, stripLinkTitle, and extractWikiArticleName to validate the new normalization behavior and edge cases. |
| custom/services/article-creator/main.go | Refactors request payloads to include commit date options for the initial README commit, updates the commit message, and slightly tidies config struct field alignment. |
| custom/services/article-creator/main_test.go | Updates the file header copyright and removes a trailing blank line; no functional behavior changes. |
@pedrogaudencio it seems counterintuitive to me to choose "*" over "-" as the list marker, i.e. to normalise the latter into the former. Was this just a flip of a coin, i.e. making a choice (any choice) so that the markers are normalised, or was there a reason for favouring asterisks? I looked at it from a typing perspective: I normally use dashes, which seem easier, and asterisks are already used for marking italics and bold.
    contentB64 := base64.StdEncoding.EncodeToString([]byte(content))

    // Set commit timestamp to current time in RFC3339 format
    now := time.Now().Format(time.RFC3339)
Not sure what we want this time to be. If I understand correctly, we first run wiki2md in batches and then later run article-creator on the fetched files. If those run in relatively close succession, the now time is probably good. But the files could also sit for a while in the "staging" phase, and then now does not reflect when we fetched the article from Wikipedia. I see that we have fetched_at data available as an option.
I'd lean towards using fetched_at. What do you think?
That's a good point. This was mostly to address the first commit timestamp since this is what we generally display as "when the article was created":
Instead of:
(26 years ago means 0001-01-01T00:00:00Z, which is the zero value of time.Time in Go)
fetched_at was being used for the front matter (also removed in this PR) in wiki2md:
We could potentially keep it, but at this point it would be only for this.
@pedrogaudencio I think I got that. We don't want to show when the article was created on Wikipedia, but rather when it was imported into our system. My point was more that our import is split into two phases, which don't need to happen in rapid succession, so the shown import date (the commit date) might be off when the second phase happens later. It might say the import is from today while the article was fetched from Wikipedia 2 weeks ago, so the article we tag as from today might diverge from how it looks on Wikipedia as of today. This is a question of historical accuracy (which we might not care about at all, but if it is easy to address, we can do it).
Ah yes, you're right - understood now! I guess a safe fix would be to set the first commit timestamp as the article file creation? Then it would match that first phase of importing the article. @eighttrigrams let me know if this sounds good and I'll create an issue to track it. 🙏
Great work @pedrogaudencio! If we're good to go let me know and I'll merge, otherwise let me know once you've addressed whatever you feel is worth addressing and I'll merge!
Opus 4.5 (no tools) · Details · Bugs
* fix linkTitleRE to use alternation for matching quote pairs, preventing mismatched quotes like "title' from matching
* decode and re-encode URL fragments consistently with article names for uniform encoding
* add test case for images with wiki-like paths (no file extension) to verify ! check works correctly
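The quote-pair fix in this commit can be illustrated as follows. Go's regexp package (RE2 syntax) has no backreferences, so a pattern cannot say "close with the same quote that opened"; spelling out one alternative per quote style achieves the same effect. The pattern and the stripLinkTitle body below are assumptions for illustration, not the code from the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// linkTitleRE matches a trailing Markdown link title whose opening and
// closing quotes are the same character. Each quote style gets its own
// alternative, so a mismatched pair such as "title' does not match.
var linkTitleRE = regexp.MustCompile(`\s+(?:"[^"]*"|'[^']*')\s*$`)

// stripLinkTitle removes a trailing quoted title from a link destination.
func stripLinkTitle(dest string) string {
	return linkTitleRE.ReplaceAllString(dest, "")
}

func main() {
	fmt.Println(stripLinkTitle(`/wiki/Go "Go (language)"`))
	fmt.Println(stripLinkTitle(`/wiki/Go 'Go (language)'`))
	fmt.Println(stripLinkTitle(`/wiki/Go "title'`)) // mismatched quotes: left as-is
}
```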
* remove commented-out addFrontMatter call with spurious nolint directives
* delete unused addFrontMatter function from main code
* remove orphaned TestAddFrontMatter and TestAddFrontMatterSourceURL test functions
* remove unused gopkg.in/yaml.v3 import from test file
@eighttrigrams I too choose to use hyphens when writing markdown (unless it's in git commits), but for some reason the WYSIWYG editor tool we use in Forkana converts the bullet points into asterisks once it renders the content to be edited, and that often "breaks" the diffs (after submitting it as changes):
IMO it doesn't make sense to consider that the user changed any bullet point from hyphen format to asterisk when actually they didn't:
Note: the change here only converts hyphens into asterisks when they are bullet points (i.e. at the beginning of a line, followed by a space).
Of course, another solution is to fix this instead/also directly in the WYSIWYG editor tool we use in the frontend.
Additional context: this tool might only be used in development to populate Forkana with articles, and it doesn't necessarily follow that batch importing articles from other sources to populate production will face the same issues parsing the content. So in the end, this change (when importing the articles from Wikipedia) solves a specific problem and does not mandate the format of bullet points in Forkana, whereas the WYSIWYG editor certainly does.
Addressed
Skipped
@pedrogaudencio Yeah, ok, I wasn't under the impression that this mandates the use of "*" in Forkana. I only wondered whether we care about what the initial pages of Wikipedia imports look like, because there are many of them and people use them as starting points. They would then use "*" for continuity, and they wouldn't necessarily make change requests just to change those to "-". But I think this is really not a pressing issue right now. Rather, it is something to address when we actually do a trial run of the imports, right before we do them.




Closes /issues/69