[not-issue] usage with SingleFile #1

Closed
andrey-jef opened this issue Sep 10, 2022 · 12 comments
Labels
good first issue Good for newcomers

Comments

@andrey-jef

Hi.

Just wanna drop in and say thank you. I'm using SingleFile to archive websites as single HTML files. Now, with your plugin, I can view the archived HTML files from within Obsidian. On top of that, these notes also have their own metadata in frontmatter, which is awesome.

@nuthrash
Owner

You are welcome! Thanks for the report on testing with SingleFile.
Besides SingleFile, I also use the "Print Edit WE" (with Page Save WE) Firefox/Chrome extensions to archive a web page into a single HTML file. It offers the flexibility to delete or hide unneeded HTML visual elements.

@nuthrash added the "good first issue" (Good for newcomers) label on Sep 12, 2022

@gildas-lormeau

> It offers the flexibility to delete or hide unneeded HTML visual elements.

FYI, SingleFile also includes an annotation editor to do the same thing without installing any additional extension.

@nuthrash
Owner

> FYI, SingleFile also includes an annotation editor to do the same thing without installing any additional extension.

That's great! I think SingleFile's annotation editor is well suited to common cases.
However, in my case, I sometimes prefer to cut out unneeded visual elements without restriction, so that the main text area I want to keep expands to fill almost the full page. It seems that SingleFile cannot meet this need, or maybe I haven't found the right way?

In fact, I often use SingleFile and other tools to save different kinds of HTML files, depending on which tool preserves the data I want to keep. For example, when I want to keep the image source addresses pointing at the original remote site, I simply use the browser's built-in save facility to save the HTML file.

@gildas-lormeau

> However, in my case, I sometimes prefer to cut out unneeded visual elements without restriction, so that the main text area I want to keep expands to fill almost the full page. It seems that SingleFile cannot meet this need, or maybe I haven't found the right way?

I confirm the editor cannot remove margins. The use case I have in mind is rather the removal of ads and other unwanted elements while keeping the style. However, the editor also allows you to save the page as it appears in the reader mode. I think this feature could be suitable for your needs.

@nuthrash
Owner

nuthrash commented Sep 13, 2022

> I confirm the editor cannot remove margins. The use case I have in mind is rather the removal of ads and other unwanted elements while keeping the style. However, the editor also allows you to save the page as it appears in the reader mode. I think this feature could be suitable for your needs.

Oh, SingleFile's reader mode is the best reader mode I have ever seen; it works perfectly in 99% of cases.

By the way, I think you already know that the reader mode has some problems:

  1. The reader mode is not supported by all websites and all pages. (After testing some unsupported pages, I guess SingleFile's reader mode relies on this mechanism.)
  2. It also disables many styling effects.

The first problem can be overcome by using "force enable reader mode" extensions.
But the second problem cannot be resolved, because reader mode by definition disables some styling effects.
Therefore, I cannot use reader mode to capture web content, especially when I want to keep the styling of code snippets.

@gildas-lormeau

gildas-lormeau commented Sep 15, 2022

Actually, SingleFile uses the Firefox implementation for the reader mode, see https://github.com/mozilla/readability. I agree that the reader mode might be too destructive for your use case.

BTW, I added a link to your project here: https://github.com/gildas-lormeau/SingleFile/blob/master/README.MD#projects-usingcompatible-with-singlefile

@nuthrash
Owner

nuthrash commented Sep 15, 2022

> Actually, SingleFile uses the Firefox implementation for the reader mode, see https://github.com/mozilla/readability.

That's a very useful clue! I might find something interesting to investigate in the reader mode.

> BTW, I added a link to your project here: https://github.com/gildas-lormeau/SingleFile/blob/master/README.MD#projects-usingcompatible-with-singlefile

It's an honor for my project to be added to SingleFile's compatibility list. This means a lot to me.
And thank you very much for bringing us such an excellent extension; SingleFile makes capturing information simpler.

@scruel

scruel commented Oct 19, 2022

Can you support Mozilla Archive Format files generated by SingleFileZ? Thanks~

@gildas-lormeau Catch ya! :)

@nuthrash
Owner

> Can you support Mozilla Archive Format files generated by SingleFileZ? Thanks~

Hmm, I think this plugin cannot open the HTML files generated by SingleFileZ.

I am new to the SingleFileZ web extension, so I tried to parse a file generated by it (e.g. abc.zip.html) and a standard .maff (Mozilla Archive Format) file.
I think the file generated by SingleFileZ is not a Mozilla Archive Format file, which means the two use different document formats to store the HTML and related files.

Take this .maff file as a reference: https://www.amadzone.org/mozilla-archive-format/maff-test-cases/test-basic-type-html.maff. It is a pure ZIP file, and its content starts with the "PK" string followed by standard ZIP binary data.

By contrast, abc.zip.html is a pure HTML file, and its content looks like <html> .... <xmp>![CDATA....</xmp></html> . It seems that SingleFileZ compresses the web content to binary data and puts it in the <xmp>...</xmp> section.
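
The comparison above comes down to looking at the first few bytes of each file. Here is a minimal sketch of that check, assuming the file has already been read into a Uint8Array. The function names are hypothetical and are not part of the plugin, SingleFileZ, or zip.js.

```javascript
// ZIP local file header signature: "PK" followed by 0x03 0x04.
const ZIP_LOCAL_HEADER = [0x50, 0x4b, 0x03, 0x04];

// Returns true when the buffer begins with the ZIP local file header signature.
function startsWithZipSignature(bytes) {
  if (bytes.length < ZIP_LOCAL_HEADER.length) return false;
  return ZIP_LOCAL_HEADER.every((value, index) => bytes[index] === value);
}

// Returns true when the first bytes, decoded as UTF-8 and stripped of leading
// whitespace, begin with "<" (i.e. the file starts like markup).
function startsLikeMarkup(bytes) {
  const head = new TextDecoder("utf-8").decode(bytes.subarray(0, 64));
  return head.trimStart().startsWith("<");
}
```

This only inspects the leading bytes, so it tells a plain ZIP/MAFF file apart from a file that begins with markup; it says nothing about what comes later in the file.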

The SingleFileZ project describes itself as a way to "save a webpage as a self-extracting HTML file", which I think explains a lot.
Obsidian blocks many access operations to avoid XSS attacks, which means the "self-extracting" operation would be blocked.

If you really want to view the content of compressed HTML files in Obsidian, I think the simplest way is to re-save them as plain-text HTML files with the original SingleFile browser extension.

@gildas-lormeau

gildas-lormeau commented Oct 19, 2022

@nuthrash Files produced by SingleFileZ are not pure HTML files. They are invalid HTML files (the HTML specification does not allow embedding binary data as is in the markup) but are in fact 100% valid zip files. Indeed, the zip specification does not require a zip file to begin with "PK". It allows arbitrary data to be stored before (and after) the zip data. I know that because I'm the author of zip.js, see https://github.com/gildas-lormeau/zip.js.
So, files produced by SingleFileZ are zip files disguised as HTML files. From a technical point of view, this is actually very similar to self-extracting executable files (e.g. driver installation programs on Windows). The main difference is that instead of embedding an additional binary program to unzip the file, the HTML page embeds a JavaScript script to unzip the file (and display the saved page). Thus, if you/Obsidian allow the JavaScript code to run, the page saved with SingleFileZ should simply work. This is typically what happens when you open https://gildas-lormeau.github.io/. Otherwise, you would need to add the code that unzips the file and displays the page in your plugin.
Finally, there are also options in SingleFileZ to save pages as non-self-extracting zip files (i.e. pure binary files beginning with "PK") that are compatible with the MAFF specification. I guess this is what @scruel is referring to. In this case, you would also need to add the code that unzips the file and displays the page in your plugin.
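
To illustrate why leading data does not invalidate a zip file: readers locate the archive from its end, by searching for the "end of central directory" (EOCD) record. The sketch below only finds a candidate EOCD signature in a Uint8Array; the function name is hypothetical, and it neither validates the record nor parses the archive, which a real reader such as zip.js has to do.

```javascript
// End of central directory (EOCD) signature: "PK" followed by 0x05 0x06.
const EOCD_SIGNATURE = [0x50, 0x4b, 0x05, 0x06];
const EOCD_MIN_LENGTH = 22;        // size of the fixed part of the EOCD record
const MAX_COMMENT_LENGTH = 0xffff; // the trailing comment is at most 65535 bytes

// Scans backwards from the end of the buffer and returns the offset of the
// last EOCD signature found, or -1 when there is none.
function findEndOfCentralDirectory(bytes) {
  const lowest = Math.max(0, bytes.length - EOCD_MIN_LENGTH - MAX_COMMENT_LENGTH);
  for (let offset = bytes.length - EOCD_MIN_LENGTH; offset >= lowest; offset--) {
    if (EOCD_SIGNATURE.every((value, index) => bytes[offset + index] === value)) {
      return offset;
    }
  }
  return -1;
}
```

A non-negative result on a file that starts with markup is a hint that zip data follows the leading content, whatever the first bytes are.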

@nuthrash
Owner

nuthrash commented Oct 20, 2022

> Thus, if you/Obsidian allow the JavaScript code to run, the page saved with SingleFileZ should simply work.

@gildas-lormeau Obsidian blocks such JavaScript operations in external files (such as .html, .md, etc.) by default. I confirmed it with the most dangerous function, insertAdjacentHTML(), and it shows:

Error: Cannot open the page from the filesystem.
    Chrome: Install SingleFileZ and enable the option "Allow access to file URLs" in the details page of the extension (chrome://extensions/?id=offkdfbbigofcgdokjemgjpdockaafjg).
    Microsoft Edge: Install SingleFileZ and enable the option "Allow access to file URLs" in the details page of the extension (edge://extensions/?id=gofneaifncimeglaecpnanbnmnpfjekk).
    Safari: Select "Disable Local File Restrictions" in the "Develop" menu.

I have some questions:

  1. Is the "SingleFileZ" web extension necessary? I opened a xxx.zip.html in Opera, and it shows the same message. (NOTE: Obsidian is based on Electron, which embeds a Chromium browser.)
  2. How can I detect a .html file made by "SingleFileZ"?
  3. Is there an npm package that can convert/decode SingleFileZ's .html content to a standard HTML string?

@gildas-lormeau

gildas-lormeau commented Oct 20, 2022

> I have some questions:
>
> 1. Is the "SingleFileZ" web extension necessary? I opened a xxx.zip.html in Opera, and it shows the same message. (NOTE: Obsidian is based on [Electron](https://www.electronjs.org/), which embeds a Chromium browser.)

Unfortunately, it is necessary to install SingleFileZ to view pages from the filesystem in Chromium-based browsers, because they don't allow running fetch("") (in order to retrieve the displayed page as binary data) when the page is opened from the filesystem. It looks like the same limitation applies in Obsidian.

> 2. How can I detect a .html file made by "SingleFileZ"?

The file can be unzipped and it contains an index.html file in the root folder or the first folder of the zip file (for MAFF files). In addition, for self-extracting pages, the <html> tag contains the attribute data-sfz.
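
As a rough illustration of the second check, the sketch below looks for a data-sfz attribute on the opening <html> tag of some text, for example the decoded start of a file. The function name is hypothetical, and whether the attribute carries a value is an assumption left open here, so both forms are accepted. It is a regex heuristic, not an HTML parser: an attribute value containing ">" or the literal text "data-sfz" would confuse it.

```javascript
// Matches the opening <html ...> tag and captures its attribute text.
const HTML_OPEN_TAG = /<html\b([^>]*)>/i;
// Matches a standalone data-sfz attribute, with or without a value.
const DATA_SFZ_ATTRIBUTE = /(?:^|\s)data-sfz(?=\s|=|$)/i;

// Returns true when the first <html> tag in the text has a data-sfz attribute.
function hasDataSfzAttribute(text) {
  const match = HTML_OPEN_TAG.exec(text);
  return match !== null && DATA_SFZ_ATTRIBUTE.test(match[1]);
}
```

Only the beginning of the file needs to be decoded as text for this, since the <html> tag comes before any embedded binary data.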

> 3. Is there an npm package to convert/decode SingleFileZ's .html content to a standard HTML string?

The function extractPage in the code below (heavily inspired by this gist) should help you.

```js
// Load the SingleFileZ extraction helper and zip.js directly from GitHub.
import { extract } from "https://raw.githubusercontent.com/gildas-lormeau/SingleFileZ/master/src/single-file/processors/compression/compression-extract.js";
import * as zip from "https://raw.githubusercontent.com/gildas-lormeau/zip.js/master/index.js";
// Make zip.js available as a global.
globalThis.zip = zip;

// Extracts a page saved with SingleFileZ and returns its document content.
async function extractPage(zipBlob) {
  const { docContent } = await extract(zipBlob, { noBlobURL: true });
  return docContent;
}
```

You can also use local imports instead of retrieving the scripts from raw.githubusercontent.com: import single-filez-core and zip.js from NPM, then replace "https://raw.githubusercontent.com/gildas-lormeau/SingleFileZ/master/src/single-file" with "single-filez-core" and "https://raw.githubusercontent.com/gildas-lormeau/zip.js/master/index.js" with "@zip.js/zip.js".
