Re-implement Dotclear importer #512

ashmaroli · 2023-03-05T15:46:57Z

Re-implement Dotclear importer based on export file provided by @jrfern in #510 (comment).

This drops dependency on activesupport, includes associated tests and adds provided export file for future development.

Closes #510

ashmaroli · 2023-03-05T17:13:59Z

Hello @jrfern,
The Dotclear importer has been rewritten based on the export file you provided.
I would now like to know the directory structure of the "media folder" (media.zip unpacked) so as to implement the functionality behind --mediafolder and maintain backwards-compatibility.

If you wish to try this out, you may edit your Gemfile as follows:

# Gemfile

gem "jekyll"
gem "jekyll-import", github: "jekyll/jekyll-import", ref: "refs/pull/512/head"

(There is no need to include activesupport or any of the previous dependency gems).

TODO:

Implement --mediafolder
Document that the imported posts are in _drafts to avoid unintentional overwriting of existing namesake in _posts.
Document that comments received for the imported posts will not be imported from the export file.

jrfern · 2023-03-05T19:29:33Z

Great! Thank you very much. Now I get

invalid option: --mediafolder (OptionParser::InvalidOption)

When run without this option (and after following your instructions)

$ bundle exec jekyll import dotclear --datafile path_to_backup.txt
jekyll 4.3.2 | Error:  Illegal quoting in line 1.
/usr/lib/ruby/3.1.0/csv/parser.rb:955:in `parse_quotable_robust': Illegal quoting in line 1. (CSV::MalformedCSVError)

> I would now like to know the directory structure of the "media folder" (media.zip unpacked)

Inside the zip archive there's a "img" directory with the image files and subdirectories.

parkr · 2023-03-05T19:46:45Z

lib/jekyll-import/importers/dotclear.rb

+            front_matter_data["dotclear_post_url"] = post["post_url"]
+
+            Jekyll.logger.info "Creating:", path
+            File.write(path, "#{YAML.dump(front_matter_data)}---\n\n#{ReverseMarkdown.convert(content).strip}\n")


ReverseMarkdown-- should this be an optional thing? Or do we always want to reverse it?

ashmaroli · 2023-03-06

The current implementation on this branch creates .md files, so reversing the (X)HTML to Markdown felt necessary.

Do you think it's best to create .html files with markup as in the backup file?
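
For readers following the diff above, here is a minimal sketch of how the written file is assembled. It relies only on the fact that `YAML.dump` begins its output with a `---` line and ends with a newline, so only the closing delimiter is added by hand. The helper name `render_draft` is hypothetical, and the ReverseMarkdown step is left out so the sketch runs with the standard library alone.

```ruby
require "yaml"

# Hypothetical helper (not the importer's actual method): build the text of a
# draft file from a front matter hash and an already-converted body.
def render_draft(front_matter, body)
  # YAML.dump emits the opening "---" line itself and ends with a newline,
  # so only the closing "---" delimiter needs to be appended.
  "#{YAML.dump(front_matter)}---\n\n#{body.strip}\n"
end

draft = render_draft({ "title" => "Hello", "layout" => "post" }, "  Body text  ")
puts draft
```

Whether the body is Markdown or raw HTML, the front matter portion would look the same; only the file extension and the conversion step differ.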

parkr · 2023-03-05T19:47:12Z

lib/jekyll-import/importers/dotclear.rb

+          FileUtils.mkdir_p("_drafts") unless posts.empty?
+
+          posts.each do |post|
+            date, title, content = post.values_at("post_creadt", "post_title", "post_content")


post_created? Might have forgotten an e?

ashmaroli · 2023-03-06

The strings here mirror the header keys in the backup file (attached as part of this PR):

[post post_id,blog_id,user_id,cat_id,post_dt,post_tz,post_creadt,post_upddt,post_password,...]
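
To illustrate why keys such as `post_creadt` appear verbatim in the importer, here is a rough sketch of mapping a section header onto a data row. The helper `parse_section_header`, the shortened header and the sample row are all made up for illustration; the real export has many more columns and the importer may parse it differently.

```ruby
require "csv"

# Hypothetical helper: split a section header such as
# "[post post_id,post_title]" into the table name and its column names.
def parse_section_header(line)
  match = line.strip.match(/\A\[(\w+) (.+)\]\z/)
  return nil unless match

  [match[1], match[2].split(",")]
end

# Shortened, made-up header and row; a real export has many more columns.
table, columns = parse_section_header("[post post_id,post_creadt,post_title,post_content]")
values = CSV.parse_line('"1","2007-10-12 09:30:00","Hello","<p>Hi</p>"')

# Pair each column name with the value at the same position.
post = columns.zip(values).to_h

date, title, content = post.values_at("post_creadt", "post_title", "post_content")
```

Because the hash keys come straight from the header line, the abbreviated spelling `post_creadt` has to be used as-is.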

ashmaroli · 2023-03-06T12:59:12Z

@jrfern The CSV::MalformedCSVError is a bug that needs to be fixed. I would like to take a look at the actual backup file you used to test this branch. You may email the file to me directly instead of exposing it here.
(email address is attached to all of my commits on GitHub)
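
As a side note on the error class itself, the sketch below shows one way a `CSV::MalformedCSVError` can be reproduced and caught with Ruby's standard CSV library. The sample string is a generic malformed line, not taken from any backup file, and the wrapper `parse_line_or_report` is hypothetical; it makes no claim about what actually triggered the error reported above.

```ruby
require "csv"

# Hypothetical wrapper: parse one CSV line, or print a readable message and
# return nil when the line is malformed.
def parse_line_or_report(line)
  CSV.parse_line(line)
rescue CSV::MalformedCSVError => e
  # e.message already includes the line number reported by the parser.
  warn "Could not parse CSV data: #{e.message}"
  nil
end

good = parse_line_or_report('"a","b"')
# A quote character inside an unquoted field is rejected by the strict parser.
bad = parse_line_or_report('is,this "three, or four",fields')
```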

> I would now like to know the directory structure of the "media folder" (media.zip unpacked)
>
> Inside the zip archive there's a "img" directory with the image files and subdirectories.

In the backup file you provided previously, the value of the key media.media_file is "MiUser/250px-MonaLisaGraffiti.JPG". So, is the "img" dir the parent directory of "MiUser"?

jrfern · 2023-03-06T18:40:54Z

> @jrfern The CSV::MalformedCSVError is a bug that needs to be fixed.

Yes, please, @ashmaroli

> I would like to take a look at the actual backup file you used to test this branch. You may email the file to me directly instead of exposing it here. (email address is attached to all of my commits on GitHub)

ashmaroli at users.noreply.github.com? Impossible. I'm feeling silly, but I haven't been able to find your email, just your jekyll-talk, github, reddit, linkedin accounts... Mine is jrfern at gmail...

> I would now like to know the directory structure of the "media folder" (media.zip unpacked)
> In the backup file you provided previously, the value to key media.media_file is "MiUser/250px-MonaLisaGraffiti.JPG". So, is the "img" dir parent directory to "MiUser"?

I don't understand; the MiUser part was a reference to the path. I unzipped the media.zip file, and it created media/img/image_files. Then I ran the command with --mediafolder path/media/img/ (as it never worked, I don't know whether it should simply be --mediafolder path/media/).

Hope this helps. One more thing: for my tests you suggested

gem "jekyll-import", github: "jekyll/jekyll-import", ref: "refs/pull/512/head"

Should I change that now that the PR has been approved?

ashmaroli · 2023-03-07T09:35:16Z

> I'm feeling silly, but I haven't been able to find your email..

Ah! I should have just mentioned it right away instead.. it's ashmaroli at gmail..

> run the command with --mediafolder path/media/img/ (as it never worked I don't know if it should be simply --mediafolder path/media/ )...

The original implementation (in existing releases) expected just path/media/. The importer would then copy the contents into the destination path assets/images/. For example, say I provide --mediafolder media. Then the importer would look for media/MiUser/250px-MonaLisaGraffiti.JPG and, if found, copy it to assets/images/MiUser/250px-....JPG.
The proposed implementation in this branch hasn't actually exposed the --mediafolder option yet (so it will always fail if you try). But it will eventually have similar behavior to maintain backwards-compatibility.
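
The copy behavior described for the original implementation can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `copy_media` and its signature are hypothetical, and the destination root is a parameter because the final location used by this branch is not settled in this comment.

```ruby
require "fileutils"

# Hypothetical helper: copy one media file from the media folder into the
# destination root, keeping its relative path. Returns false when the source
# file does not exist, true after a successful copy.
def copy_media(media_folder, relative_path, dest_root = File.join("assets", "images"))
  source = File.join(media_folder, relative_path)
  return false unless File.file?(source)

  target = File.join(dest_root, relative_path)
  # Create intermediate directories such as assets/images/MiUser first.
  FileUtils.mkdir_p(File.dirname(target))
  FileUtils.cp(source, target)
  true
end
```

With `--mediafolder media`, a call such as `copy_media("media", "MiUser/250px-MonaLisaGraffiti.JPG")` would look for the file under `media/` and, if present, place a copy under `assets/images/MiUser/`.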

> Should I change that now that the PR has been approved?

The reference is permanent. It would be valid even if the pull request branch gets deleted after the pull request is merged. However, since the pull request is still a work-in-progress, you may have to run bundle update jekyll-import to get the latest state of this branch. (You don't have to update until I ask you for feedback.)

jrfern · 2023-03-07T11:46:55Z

@ashmaroli Real backup file sent privately.
I'm learning so much - thank you again.

ashmaroli · 2023-03-07T16:19:31Z

Thanks @jrfern
Received the backup file. Will use it to make changes to this branch.

ashmaroli · 2023-03-08T15:12:18Z

Hello @jrfern
You may update your bundle reference to this branch by running bundle update jekyll-import to test at your end.
I have also updated the importer documentation for better understanding. You may preview the document here.

jrfern · 2023-03-08T16:14:34Z

Recovered 60 entries into _drafts, along with their images! Great! I'm fighting with the paginate-v2 plugin at the moment and so can't check, but I would say that the import worked.

Thank you again, @ashmaroli

ashmaroli · 2023-03-08T16:47:27Z

Happy to hear that, @jrfern. Good luck tackling the pagination plugin 🙂
Thank you for testing and giving feedback.

jrfern · 2023-03-08T17:19:47Z

First analysis of the new plugin. I moved the older post ('Informe K-12 Open Minds Conference 2007 - parte I: Europeos') to the posts directory.

Works quite well, not totally well.

The excerpt part is missing (it's in the backup). For example:

"Informe K-12 Open Minds Conference 2007 - parte I: Europeos","<blockquote>\r\n<p><em>I was invited to&nbsp; attend the Conference held in Indianapolis. It was the start of something, I have to say. This is part one of my report in Spanish.</em></p>\r\n\r\n<p>La ventaja de dar tiempo a las cosas para ...
...
... y perfilar matices.</p>\r\n</blockquote>\r\n\r\n<p> </p>","<p style=\"text-align: justify;\">Escribo un informe sobre la K-12 Open Minds Conference....

This is converted into

<p style="text-align: justify;">Escribo un informe sobre la K-12 Open Minds Conference. Si eres impaciente puedes leer ya mucha información sobre lo que allí se habló en el <a href="http://k12openminds.wikispaces.com/" hreflang="es">K-12 Open Minds Conference Resource Site</a>.</p>

The blockquote (the whole header) is missing in the import.

ERROR `/assets/dotclear/img/.dia_1_m.jpg' not found.

In the backup

<p style=\"text-align: justify;\"><a class=\"media-link\" href=\"/dotclear/public/img/dia_1.jpg\"><img alt=\"\" class=\"media\" src=\"/dotclear/public/img/.dia_1_m.jpg\" style=\"float: left; margin: 0 1em 1em 0;\" /></a>

Now it is

<p style="text-align: justify;"><a class="media-link" href="/assets/dotclear/img/dia_1.jpg"><img alt="" class="media" src="/assets/dotclear/img/.dia_1_m.jpg" style="float: left; margin: 0 1em 1em 0;" /></a>

The images are treated as links. That was OK in the sense that there used to be two versions of each image, and the small one is a link to the big one, but there are no names starting with a dot in assets/dotclear and the link should be turned into an <img> tag.
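
The path change visible in the two snippets above can be sketched as a simple prefix substitution. The prefixes are inferred only from those snippets, and the helper `rewrite_media_urls` is hypothetical rather than the importer's actual code. Note that rewriting a URL does not create the file it points to, so a thumbnail that was never copied would still be reported as missing.

```ruby
# Hypothetical helper: rewrite href/src attributes that point into the old
# Dotclear public folder so that they point into the new assets folder.
# Both default prefixes are assumptions based on the snippets quoted above.
def rewrite_media_urls(html, old_prefix = "/dotclear/public/", new_prefix = "/assets/dotclear/")
  pattern = /\b(href|src)="#{Regexp.escape(old_prefix)}/
  html.gsub(pattern) { %(#{Regexp.last_match(1)}="#{new_prefix}) }
end

before = '<a href="/dotclear/public/img/dia_1.jpg"><img src="/dotclear/public/img/.dia_1_m.jpg" /></a>'
after = rewrite_media_urls(before)
```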

So we miss the introductions to the entries and the images are treated as links. Can any of these points be fixed programmatically?

ashmaroli · 2023-03-09T11:52:12Z

@jrfern Added support for importing excerpts. While I had seen the post_excerpt field earlier, I did not realise that post_content doesn't start with the excerpt. Jekyll-generated HTML generally has the excerpt as the first paragraph of the contents. (The exception is when the user has supplied a custom excerpt string to Jekyll during the build process.)
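
A minimal sketch of the idea, assuming the excerpt and the content arrive as separate strings and that either may be empty. The helper name `merge_excerpt` is hypothetical and the importer's actual code may differ.

```ruby
# Hypothetical helper: place a non-empty excerpt in front of the main content,
# separated by a blank line. Nil or blank parts are skipped.
def merge_excerpt(excerpt, content)
  [excerpt, content].map { |part| part.to_s.strip }.reject(&:empty?).join("\n\n")
end

merged = merge_excerpt("<blockquote>Intro</blockquote>", "<p>Body</p>")
```

Posts without an excerpt come through unchanged, since the empty part is dropped before joining.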

> ERROR /assets/dotclear/img/.dia_1_m.jpg not found.. but there are no names starting with a dot in assets/dotclear..

These files do not have separate identity in the media table in the export file. So they won't be imported / mentioned in the log.

> the link should be turned into an <img> tag.

They're already valid img tags. You don't see the image or a placeholder for the missing image because of CSS.

jrfern · 2023-03-09T17:49:24Z

Great! The excerpt was the only problem with the import, the issue with the images was a problem with the backup, not the import.

From my side the new code works, and I have recovered the posts from this old blog.

parkr · 2023-03-10T05:24:12Z

docs/_importers/dotclear.md

+* "Categories" are not currently imported from the export-file.
+* "Tags" however will be imported and added to relevant posts' front matter.
+* Post URLs are imported from the export-file into front matter with key `original_url`.
+* Jekyll doesn't manage timezone for individual posts. Therefore, timezone metadata of individual posts will be ignored.


If a timezone offset is present, it would be preferable to preserve it for each post, I think. I always use timezone offsets in my post dates. If it's too hard to extract then we can skip it.

ashmaroli · 2023-03-10

I agree with this. Unfortunately, the timezone here is the IANA id (e.g. Europe/Madrid) instead of the offset.
In the attached mock export-file, it is "CET".

parkr · 2023-03-10T05:25:25Z

lib/jekyll-import/importers/dotclear.rb

-            csv
-            pp
-          ))
+          JekyllImport.require_with_fallback(%w())


Omit this if we don't require anything.

ashmaroli · 2023-03-10

This is required for compatibility with the local plugin docs/_plugins/importer_metadata.rb.

ashmaroli · 2023-03-10T06:07:11Z

@jekyllbot: merge +minor

ashmaroli · 2023-03-10T06:08:59Z

@jekyllbot: merge +minor

ashmaroli · 2023-03-10T06:18:40Z

@jekyllbot: merge +minor

Ashwin Maroli: Re-implement Dotclear importer (#512) Merge pull request 512

ashmaroli and others added 2 commits March 5, 2023 21:10

Re-implement Dotclear importer

41e2580

Fix RuboCop offenses

bb086aa

parkr approved these changes Mar 5, 2023

ashmaroli mentioned this pull request Mar 6, 2023

Refactor Dotclear importer methods into singleton instance methods #523

Merged

Merge upstream branch 'master' into this branch

4c936e6

ashmaroli force-pushed the dotclear-reloaded branch from 9168da1 to 4c936e6 Compare March 6, 2023 12:23

ashmaroli marked this pull request as draft March 6, 2023 12:46

Better error message for --datafile flag

73135d6

ashmaroli added 2 commits March 8, 2023 18:07

Implement --mediafolder support and refactor

78b1392

Update documentation highlighting relevant details

eeac99e

ashmaroli marked this pull request as ready for review March 8, 2023 16:53

ashmaroli requested a review from parkr March 8, 2023 16:54

Prepend post excerpts onto post content

ee9af77

parkr approved these changes Mar 10, 2023

jekyllbot merged commit 06205f7 into jekyll:master Mar 10, 2023

jekyllbot added the enhancement label Mar 10, 2023

jekyllbot added a commit that referenced this pull request Mar 10, 2023

Update history to reflect merge of #512 [ci skip]

758adc8

ashmaroli deleted the dotclear-reloaded branch March 10, 2023 06:18

github-actions bot pushed a commit that referenced this pull request Mar 10, 2023

Deploy docs from 06205f7

7358f3a

jekyll locked and limited conversation to collaborators Mar 9, 2024

jekyllbot added the frozen-due-to-age label Mar 9, 2024
