Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-implement Dotclear importer #512

Merged
merged 7 commits into from
Mar 10, 2023
Merged

Conversation

ashmaroli
Copy link
Member

Re-implement Dotclear importer based on export file provided by @jrfern in #510 (comment).

This drops dependency on activesupport, includes associated tests and adds provided export file for future development.

Closes #510

@ashmaroli
Copy link
Member Author

Hello @jrfern,
The Dotclear importer has been rewritten based the export file you had provided.
I would now like to know the directory structure of the "media folder" (media.zip unpacked) so as to implement the functionality behind --mediafolder and maintain backwards-compatibility.

If you wish to try this out, you may edit your Gemfile as follows:

# Gemfile

gem "jekyll"
gem "jekyll-import", github: "jekyll/jekyll-import", ref: "refs/pull/512/head"

(There is no need to include activesupport or any of the previous dependency gems).


TODO:

  • Implement --mediafolder
  • Document that the imported posts are in _drafts to avoid unintentional overwriting of existing namesake in _posts.
  • Document that comments received for the imported posts will not be imported from the export file.

@jrfern
Copy link

jrfern commented Mar 5, 2023

Great! Thank you very much. Now I get

invalid option: --mediafolder (OptionParser::InvalidOption)

When run without this option (and after following your instructions)

$ bundle exec jekyll import dotclear --datafile path_to_backup.txt
jekyll 4.3.2 | Error:  Illegal quoting in line 1.
/usr/lib/ruby/3.1.0/csv/parser.rb:955:in `parse_quotable_robust': Illegal quoting in line 1. (CSV::MalformedCSVError)

I would now like to know the directory structure of the "media folder" (media.zip unpacked)

Inside the zip archive there's a "img" directory with the image files and subdirectories.

lib/jekyll-import/importers/dotclear.rb Outdated Show resolved Hide resolved
front_matter_data["dotclear_post_url"] = post["post_url"]

Jekyll.logger.info "Creating:", path
File.write(path, "#{YAML.dump(front_matter_data)}---\n\n#{ReverseMarkdown.convert(content).strip}\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReverseMarkdown-- should this be an optional thing? Or do we always want to reverse it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation on this branch creates .md files, so reversing the (X)HTML to Markdown felt necessary.

Do you think it's best to create .html files with markup as in the backup file?

FileUtils.mkdir_p("_drafts") unless posts.empty?

posts.each do |post|
date, title, content = post.values_at("post_creadt", "post_title", "post_content")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post_created? Might have forgotten an e?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The strings here mirror the header keys in the backup file (attached as part of this PR):

[post post_id,blog_id,user_id,cat_id,post_dt,post_tz,post_creadt,post_upddt,post_password,...]

lib/jekyll-import/importers/dotclear.rb Outdated Show resolved Hide resolved
test/test_dotclear_importer.rb Show resolved Hide resolved
@ashmaroli ashmaroli marked this pull request as draft March 6, 2023 12:46
@ashmaroli
Copy link
Member Author

ashmaroli commented Mar 6, 2023

@jrfern The CSV::MalformedCSVError is a bug that needs to be fixed. I would like to take a look at the actual backup file you used to test this branch. You may email the file to me directly instead of exposing it here.
(email address is attached to all of my commits on GitHub)

I would now like to know the directory structure of the "media folder" (media.zip unpacked)

Inside the zip archive there's a "img" directory with the image files and subdirectories.

In the backup file you provided previously, the value to key media.media_file is "MiUser/250px-MonaLisaGraffiti.JPG". So, is the "img" dir parent directory to "MiUser`?

@jrfern
Copy link

jrfern commented Mar 6, 2023

@jrfern The CSV::MalformedCSVError is a bug that needs to be fixed.

Yes, please, @ashmaroli

I would like to take a look at the actual backup file you used to test this branch. You may email the file to me directly instead of exposing it here. (email address is attached to all of my commits on GitHub)

ashmaroli at users.noreply.github.com? Impossible. I'm feeling silly, but I haven't been able to find your email, just your jekyll-talk, github, reddit, linkedin accounts... Mine is jrfern at gmail...

I would now like to know the directory structure of the "media folder" (media.zip unpacked)
In the backup file you provided previously, the value to key media.media_file is "MiUser/250px-MonaLisaGraffiti.JPG". So, is the "img" dir parent directory to "MiUser`?

I don't understand, the MiUser phrase was a reference to the path. I unzipped the media.zip file, and it created media/img/image_files. Then run the command with --mediafolder path/media/img/ (as it never worked I don't know if it should be simply --mediafolder path/media/ ).

Hope this helps. One more thing, for my tests your suggested

gem "jekyll-import", github: "jekyll/jekyll-import", ref: "refs/pull/512/head

Should I change that now that the PR has been approved?

@ashmaroli
Copy link
Member Author

I'm feeling silly, but I haven't been able to find your email..

Ah! I should have just mentioned it right away instead.. it's ashmaroli at gmail..

run the command with --mediafolder path/media/img/ (as it never worked I don't know if it should be simply --mediafolder path/media/ )...

The original implementation (in existing releases) was to expect just path/media/. The importer would then copy the contents into destination path assets/images/. For example, say I provide --mediafolder media. Then the importer would look for media/MiUser/250px-MonaLisaGraffiti.JPG and if found, copy to assets/images/MiUser/250-px....JPG.
The proposed implementation in this branch hasn't actually exposed the --mediafolder yet. (So, it will always fail if you try). But it will eventually have similar behavior to maintain backwards-compatibility.

Should I change that now that the PR has been approved?

The reference is permanent. It would be valid even if the pull request branch gets deleted after the pull request is merged. However, since the pull request is still a work-in-progress, you may have to run bundle update jekyll-import to get the latest state of this branch. (You don't have to update until I ask you for feedback.)

@jrfern
Copy link

jrfern commented Mar 7, 2023

@ashmaroli Real backup file sent privately.
I'm learning so much - thank you again.

@ashmaroli
Copy link
Member Author

Thanks @jrfern
Received the backup file. Will use it to make changes to this branch.

@ashmaroli
Copy link
Member Author

Hello @jrfern
You may update your bundle reference to this branch by running bundle update jekyll-import to test at your end.
I have also updated the importer documentation for better understanding. You may preview the document here.

@jrfern
Copy link

jrfern commented Mar 8, 2023

Recuperated 60 entries into _drafts and their images! Great! I'm fighting at the moment with the paginate-v2 plugin and so can't check but I would say that the import worked.

Thank you again, @ashmaroli

@ashmaroli
Copy link
Member Author

Happy to hear that, @jrfern. Good luck tackling the pagination plugin 🙂
Thank you for testing and giving feedback.

@ashmaroli ashmaroli marked this pull request as ready for review March 8, 2023 16:53
@ashmaroli ashmaroli requested a review from parkr March 8, 2023 16:54
@jrfern
Copy link

jrfern commented Mar 8, 2023

First analysis of the new plugin. I moved the older post ('Informe K-12 Open Minds Conference 2007 - parte I: Europeos') to the posts directory.

Works quite well, not totally well.

The excerpt part is missing (it's in the backup). For example:

"Informe K-12 Open Minds Conference 2007 - parte I: Europeos","<blockquote>\r\n<p><em>I was invited to&nbsp; attend the Conference held in Indianapolis. It was the start of something, I have to say. This is part one of my report in Spanish.</em></p>\r\n\r\n<p>La ventaja de dar tiempo a las cosas para ...
...
... y perfilar matices.</p>\r\n</blockquote>\r\n\r\n<p> </p>","<p style=\"text-align: justify;\">Escribo un informe sobre la K-12 Open Minds Conference....

This is converted into

<p style="text-align: justify;">Escribo un informe sobre la K-12 Open Minds Conference. Si eres impaciente puedes leer ya mucha información sobre lo que allí se habló en el <a href="http://k12openminds.wikispaces.com/" hreflang="es">K-12 Open Minds Conference Resource Site</a>.</p>

The blockquote (the whole header) is missing in the import.

ERROR `/assets/dotclear/img/.dia_1_m.jpg' not found.

In the backup

<p style=\"text-align: justify;\"><a class=\"media-link\" href=\"/dotclear/public/img/dia_1.jpg\"><img alt=\"\" class=\"media\" src=\"/dotclear/public/img/.dia_1_m.jpg\" style=\"float: left; margin: 0 1em 1em 0;\" /></a>

Now it is

<p style="text-align: justify;"><a class="media-link" href="/assets/dotclear/img/dia_1.jpg"><img alt="" class="media" src="/assets/dotclear/img/.dia_1_m.jpg" style="float: left; margin: 0 1em 1em 0;" /></a>

The images are treated as links. That was OK in the sense that there used to be two versions of each image, and the small one is a link to the big one, but there are no names starting with a dot in assets/dotclear and the link shoud be turned into an >img> tag.

So we miss the introductions to the entries and the images are treated as links. Can any of these points be fixed programmatically?

@ashmaroli
Copy link
Member Author

@jrfern Added support for importing excerpts. While I had seen the post_excerpt field earlier, I did not realise that post_content doesn't start with the excerpt. Jekyll-generated HTML generally has excerpt as the first paragraph of the contents. (The exception being when user had supplied a custom excerpt string to Jekyll during the build process).

ERROR /assets/dotclear/img/.dia_1_m.jpg not found.. but there are no names starting with a dot in assets/dotclear..

These files do not have separate identity in the media table in the export file. So they won't be imported / mentioned in the log.

the link shoud be turned into an >img> tag.

They're already valid img tags. You don't see it or a placeholder holder for missing image because of CSS.

@jrfern
Copy link

jrfern commented Mar 9, 2023

Great! The excerpt was the only problem with the import, the issue with the images was a problem with the backup, not the import.

From my side the new code works and I have recuperated the posts from this old blog.

* "Categories" are not currently imported from the export-file.
* "Tags" however will be imported and added to relevant posts' front matter.
* Post URLs are imported from the export-file into front matter with key `original_url`.
* Jekyll doesn't manage timezone for individual posts. Therefore, timezone metadata of individual posts will be ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a timezone offset is present, it would be preferable to preserve it for each post, I think. I always use timezone offsets in my post dates. If it's too hard to extract then we can skip it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this. Unfortunately, the timezone here is the IANA id (e.g. Europe/Madrid) instead of the offset.
In the attached mock export-file, it is "CET".

csv
pp
))
JekyllImport.require_with_fallback(%w())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Omit this if we don't require anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required for compatibility with the local plugin docs/_plugins/importer_metadata.rb.

@ashmaroli
Copy link
Member Author

@jekyllbot: merge +minor

2 similar comments
@ashmaroli
Copy link
Member Author

@jekyllbot: merge +minor

@ashmaroli
Copy link
Member Author

@jekyllbot: merge +minor

@jekyllbot jekyllbot merged commit 06205f7 into jekyll:master Mar 10, 2023
jekyllbot added a commit that referenced this pull request Mar 10, 2023
@ashmaroli ashmaroli deleted the dotclear-reloaded branch March 10, 2023 06:18
github-actions bot pushed a commit that referenced this pull request Mar 10, 2023
Ashwin Maroli: Re-implement Dotclear importer (#512)

Merge pull request 512
@jekyll jekyll locked and limited conversation to collaborators Mar 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

No dotclear command in jekyll/commands/import.rb
4 participants