Config revamp #82

Merged
merged 15 commits into from
Oct 4, 2022
80 changes: 65 additions & 15 deletions README.md
@@ -26,30 +26,73 @@ To do this, go to the "Settings and Members" page in Notion. You should see an "

Finally, in Notion you'll need to share the relevant pages with your internal integration---just like you'd share a page with another person.

## Example Usage
## Configuration

N2y is configured using a single YAML file. This file contains a few top-level keys:

| Top-level key | Description |
| --- | --- |
| media_url | Sets the base URL for all downloaded media files (e.g., images, videos, PDFs, etc.) |
| media_root | The directory where media files should be downloaded to |
| exports | A list of export configuration items, indicating how a notion page or database is to be exported. See below for the keys. |
| export_defaults | Default values for the export configuration items. |

The export configuration items may contain the following keys:

| Export key | Description |
| --- | --- |
| id | The notion database or page id, taken from the "share URL". |
| node_type | Either "database_as_yaml", "database_as_files", or "page". |
| output | The path to the output file, or directory, where the data will be written. |
| pandoc_format | The [pandoc format](https://pandoc.org/MANUAL.html#general-options) that we're generating. |
| pandoc_options | A list of strings that are [writer options](https://pandoc.org/MANUAL.html#general-writer-options) for pandoc. |
| content_property | When set, it indicates the property name that will contain the content of the notion pages in that database. If set to `None`, then only the page's properties will be included in the export. (Only applies to the `database_as_files` node type.) |
| id_property | When set, this indicates the property name in which to place the page's underlying notion ID. |
| url_property | When set, this indicates the property name in which to place the page's underlying notion url. |
| filename_property | This key is required for the "database_as_files" node type; when set, it indicates which property to use when generating the file name. |
| plugins | A list of python modules to use as plugins. |
| notion_filter | A [notion filter object](https://developers.notion.com/reference/post-database-query-filter) to be applied to the database. |
| notion_sorts | A [notion sorts object](https://developers.notion.com/reference/post-database-query-sort) to be applied to the database. |
| property_map | A mapping between the name of properties in Notion, and the name of the properties in the exported files. |
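
As a rough illustration of what `property_map` describes, here is a minimal sketch of renaming exported properties. The helper name `apply_property_map` is hypothetical, and the behavior for properties that are missing from the map (they keep their Notion name) is an assumption; the diff does not show how n2y applies the mapping.

```python
def apply_property_map(properties, property_map):
    """Return a copy of `properties` with keys renamed per `property_map`."""
    # Assumption: properties that are not listed in the map keep their
    # original Notion name.
    return {
        property_map.get(name, name): value
        for name, value in properties.items()
    }


notion_properties = {"Name": "Release notes", "Tags": ["docs"]}
exported = apply_property_map(notion_properties, {"Name": "title"})
print(exported)
```

Here `Name` is exported as `title`, while `Tags` is left untouched.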

## Example Configuration Files

The command is run using `n2y configuration.yaml`.

### Convert a Database to YAML

Copy the link for the database you'd like to export to YAML. Note that linked databases aren't supported. Then run:
A notion database (e.g., with a share URL like this https://www.notion.so/176fa24d4b7f4256877e60a1035b45a4?v=130ffd3224fd4512871bb45dbceaa7b2) could be exported into a YAML file using this minimal configuration file:

```
n2y DATABASE_LINK > database.yml
exports:
  - id: 176fa24d4b7f4256877e60a1035b45a4
    node_type: database_as_yaml
    output: database.yml
```

### Convert a Database to a set of Markdown Files

The same database could be exported into a set of markdown files as follows:

```
n2y -f markdown DATABASE_LINK
exports:
  - id: 176fa24d4b7f4256877e60a1035b45a4
    node_type: database_as_files
    output: directory
    filename_property: "Name"
```

This process will automatically skip untitled pages or pages with duplicate names.
Each page in the database will generate a single markdown file, named according to the `filename_property`. This process will automatically skip pages whose "Name" property is empty.

### Convert a Page to a Markdown File

If the page is in a database, then its properties will be included in the YAML front matter. If the page is not in a database, then the title of the page will be included in the YAML front matter.
An individual notion page (e.g., with a share URL like this https://www.notion.so/All-Blocks-Test-Page-5f18c7d7eda44986ae7d938a12817cc0) could be exported to markdown with this minimal configuration file:

```
n2y PAGE_LINK > page.md
exports:
  - id: 5f18c7d7eda44986ae7d938a12817cc0
    node_type: page
    output: page.md
```

### Audit a Page and its Children For External Links
@@ -85,7 +128,7 @@ The default implementation of these classes can be modified using a plugin syste

1. Create a new Python module
2. Subclass the various notion classes, modifying their constructor or `to_pandoc` method as desired
3. Run n2y with the `--plugin` argument pointing to your python module
3. Set the `plugins` property in your export config to the module name (e.g., `n2y.plugins.deepheaders`)

See the [builtin plugins](https://github.com/innolitics/n2y/tree/main/n2y/plugins) for examples.

@@ -95,6 +138,10 @@ You can use multiple plugins. If two plugins provide classes for the same notion

Often you'll want to use a different class only in certain situations. For example, you may want to use a different Page class with its own unique behavior only for pages in a particular database. To accomplish this you can use the `n2y.errors.UseNextClass` exception. If your plugin class raises the `n2y.errors.UseNextClass` exception in its constructor, then n2y will move on to the next class (which may be the builtin class if only one plugin was used).
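
The fall-through behavior can be sketched in a self-contained way. Everything below is a stand-in written for illustration: the local `UseNextClass` mimics `n2y.errors.UseNextClass`, and `DefaultPage`, `BlogPage`, and `instantiate` are hypothetical names, not n2y's actual classes or dispatch code.

```python
class UseNextClass(Exception):
    """Stand-in for n2y.errors.UseNextClass (assumption: a plain exception)."""


class DefaultPage:
    """Hypothetical builtin class that accepts any page."""

    def __init__(self, notion_data):
        self.notion_data = notion_data


class BlogPage(DefaultPage):
    """Hypothetical plugin class that only handles pages from one database."""

    def __init__(self, notion_data):
        if notion_data.get("database") != "blog":
            # Decline this page so that the next class in line is tried.
            raise UseNextClass()
        super().__init__(notion_data)


def instantiate(classes, notion_data):
    """Try each class in order, moving on whenever one raises UseNextClass."""
    for cls in classes:
        try:
            return cls(notion_data)
        except UseNextClass:
            continue
    raise RuntimeError("no class accepted the notion data")


blog = instantiate([BlogPage, DefaultPage], {"database": "blog"})
other = instantiate([BlogPage, DefaultPage], {"database": "wiki"})
print(type(blog).__name__, type(other).__name__)
```

The plugin class comes first in the list, so it gets the first chance to claim a page; anything it declines falls through to the default class.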

### Different Plugins for Different Exports

You may use different plugins for different export items, but keep in mind that the plugin module is imported only once. Also, if you export the same `Page` or `Database` multiple times with different plugins, due to an internal cache, the plugins that were enabled during the first run will be used.

### Default Block Classes

Here are the default block classes that can be extended:
@@ -132,7 +179,6 @@ Here are the default block classes that can be extended:
| ToggleBlock | Convert the toggles into a bulleted list. |
| VideoBlock | Acts the same way as the Image block |


Most of the Notion blocks can generate their pandoc AST from _only_ their own data. The one exception is the list item blocks; pandoc, unlike Notion, has an encompassing node in the AST for the entire list. The `ListItemBlock.list_to_pandoc` class method is responsible for generating this top-level node.
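
To illustrate why list items need an encompassing node, here is a minimal sketch that groups runs of consecutive list items under one list node. It uses plain dictionaries rather than pandoc AST types, and `group_list_items` is a hypothetical helper; it is not how `ListItemBlock.list_to_pandoc` is implemented.

```python
from itertools import groupby


def group_list_items(blocks):
    """Wrap each run of consecutive list items in a single list node."""
    nodes = []
    runs = groupby(blocks, key=lambda b: b["type"] == "bulleted_list_item")
    for is_list_item, run in runs:
        run = list(run)
        if is_list_item:
            # One enclosing node for the whole run, as pandoc expects.
            nodes.append({"type": "BulletList", "items": [b["text"] for b in run]})
        else:
            nodes.extend({"type": "Para", "text": b["text"]} for b in run)
    return nodes


blocks = [
    {"type": "paragraph", "text": "Intro"},
    {"type": "bulleted_list_item", "text": "first"},
    {"type": "bulleted_list_item", "text": "second"},
    {"type": "paragraph", "text": "Outro"},
]
nodes = group_list_items(blocks)
print(nodes)
```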

## Built-in Plugins
@@ -175,12 +221,12 @@ Note that any link to a page that the integration doesn't have access to will be

## Architecture

N2y's architecture is divided into four main steps:
An n2y run is divided into four stages:

1. Configuration
1. Loading the configuration (mostly in `config.py`)
2. Retrieve data from Notion (by instantiating various Notion object instances, e.g., `Page`, `Block`, `RichText`, etc.)
3. Convert to the pandoc AST (by calling `block.to_pandoc()`)
4. Writing the pandoc AST into markdown or YAML
4. Writing the pandoc AST into one of the various output formats (mostly in `export.py`)

Every page object has a `parent` property, which may be a page, a database, or a workspace.

@@ -219,12 +265,16 @@ Here are some features we're planning to add in the future:
- Add support for recursively dumping sets of pages and preserving links between them
- Add some sort of Notion API caching mechanism
- Add more examples to the documentation
- Make it so that plugins and other configuration can be set for only a sub-set
of the exported pages, that way multiple configurations can be applied in a
single export

## Changelog

### v0.6.0

- The export is now configured using a single YAML file instead of the growing list of commandline arguments. Using a configuration file allows multiple page and database exports to be made in a single run, which in turn improves caching and will enable future improvements, like preserving links between generated HTML or markdown pages.
- Added the `pandoc_format` and `pandoc_options` fields, making it possible to output to any format that pandoc supports.
- Removed the ability to export a set of related databases (this is less useful now that we have a configuration file).
- Add support for remapping property names in the exports using the `property_map` option

### v0.5.0

- Add support for dumping the notion urls using `--url-property`.
7 changes: 6 additions & 1 deletion n2y/blocks.py
@@ -582,8 +582,13 @@ def __init__(self, client, notion_data, page, get_children=True):
    def to_pandoc(self):
        # TODO: in the future, if we are exporting the linked page too, then add
        # a link to the page. For now, we just display the text of the page.
        if self.link_type == "page_id":
Member

Is there a reason we're inlining this logic?

Contributor Author

I had thought that, since we know what type of link it is, it would be better to avoid the unnecessary API request that's present if we use get_page_or_database.

            node = self.client.get_page(self.linked_page_id)
        elif self.link_type == "database_id":
            node = self.client.get_database(self.linked_page_id)
        else:
            raise NotImplementedError(f"Unknown link type: {self.link_type}")

        node = self.client.get_page_or_database(self.linked_page_id)
        if node is None:
            msg = "Permission denied when attempting to access linked node [%s]"
            logger.warning(msg, self.notion_url)
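
The trade-off discussed in the thread above (dispatching on the known link type so that only one request is made) can be sketched as follows. `FakeClient` and `get_linked_node` are hypothetical stand-ins that only count simulated requests; they are not the n2y client API.

```python
class FakeClient:
    """Hypothetical client that counts simulated API requests."""

    def __init__(self):
        self.requests = 0

    def get_page(self, notion_id):
        self.requests += 1
        return {"object": "page", "id": notion_id}

    def get_database(self, notion_id):
        self.requests += 1
        return {"object": "database", "id": notion_id}


def get_linked_node(client, link_type, linked_id):
    """Fetch the linked node with a single request, chosen by link type."""
    if link_type == "page_id":
        return client.get_page(linked_id)
    if link_type == "database_id":
        return client.get_database(linked_id)
    raise NotImplementedError(f"Unknown link type: {link_type}")


client = FakeClient()
node = get_linked_node(client, "database_id", "176fa24d4b7f4256877e60a1035b45a4")
print(node["object"], client.requests)
```

Because the link type is already known, the database lookup costs exactly one request and there is no need to try a page lookup first.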
139 changes: 116 additions & 23 deletions n2y/config.py
@@ -1,41 +1,134 @@
import json
import logging
import copy

import yaml

from n2y.utils import strip_hyphens


logger = logging.getLogger(__name__)


def database_config_json_to_dict(config_json):
DEFAULTS = {
    "media_root": "media",
    "media_url": "./media/",
}


EXPORT_DEFAULTS = {
    "id_property": None,
    "content_property": None,
    "url_property": None,
    "notion_filter": [],
    "notion_sorts": [],
    "pandoc_format": "gfm+tex_math_dollars+raw_attribute",
    "pandoc_options": [
        '--wrap', 'none',  # don't hard line-wrap
        '--eol', 'lf',  # use linux-style line endings
    ],
    "plugins": [],
    "property_map": {},
}
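
A rough sketch of how the `pandoc_format` and `pandoc_options` defaults above could map onto a pandoc command line. This diff does not show how n2y actually invokes pandoc, so the helper `build_pandoc_command` and the `--from json` input format are assumptions made purely for illustration.

```python
def build_pandoc_command(pandoc_format, pandoc_options):
    """Assemble a pandoc argument list (hypothetical helper, not n2y code)."""
    # Assumption: the pandoc AST is passed to pandoc as JSON on stdin.
    return ["pandoc", "--from", "json", "--to", pandoc_format, *pandoc_options]


command = build_pandoc_command(
    "gfm+tex_math_dollars+raw_attribute",
    ["--wrap", "none", "--eol", "lf"],
)
print(" ".join(command))
```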


def load_config(path):
    try:
        config = json.loads(config_json)
    except json.JSONDecodeError as exc:
        logger.error("Error parsing the data config JSON: %s", exc.msg)
        with open(path, "r") as config_file:
            config = yaml.safe_load(config_file)
    except yaml.YAMLError as exc:
        logger.error("Error parsing the config file: %s", exc)
        return None
    except FileNotFoundError:
        logger.error("The config file '%s' does not exist", path)
        return None
    if not validate_database_config(config):
    if not validate_config(config):
        logger.error("Invalid config file: %s", path)
        return None
Comment on lines 52 to 63
Member

Maybe we can extract this try/except out?

def load_config(path):
    config = load_config_yaml(config_file)
    if config is None:
        return None

Contributor Author

Good idea! Done!


    defaults_copy = copy.deepcopy(DEFAULTS)
    config = {**defaults_copy, **config}

    merged_exports = merge_config(
        config.get("exports", []),
        EXPORT_DEFAULTS,
        config.get("export_defaults", {}),
    )
    config["exports"] = merged_exports
    return config


def validate_database_config(config):
    try:
        for database_id, config_values in config.items():
            if not _valid_id(database_id):
                logger.error("Invalid database id in database config: %s", database_id)
                return False
            for key, values in config_values.items():
                if key not in ["sorts", "filter"]:
                    logger.error("Invalid key in database config: %s", key)
                    return False
                if not isinstance(values, dict) and not isinstance(values, list):
                    logger.error(
                        "Invalid value of type '%s' for key '%s' in database config, "
                        "expected dict or list", type(values), key,
                    )
                    return False
    except AttributeError:
def merge_config(config_items, builtin_defaults, defaults):
    """
    For each config item, merge in both the user provided defaults and the
    builtin defaults for each key value pair.
    """
    merged_config_items = []
    for config_item in config_items:
        master_defaults_copy = copy.deepcopy(builtin_defaults)
        defaults_copy = copy.deepcopy(defaults)
        config_item_copy = copy.deepcopy(config_item)
        merged_config_item = {**master_defaults_copy, **defaults_copy, **config_item_copy}
Member

Didn't know you could update dictionaries this way. Cool!

Member

I'm sure you looked into this but I assume that with this syntax, the dictionaries get updated in the order they appear right? So it's equivalent to:

merged_config_item = master_defaults_copy
merged_config_item = merged_config_item.update(defaults_copy)
merged_config_item = merged_config_item.update(config_item_copy)

Contributor Author

That's right, the updates apply from left to right!

        merged_config_items.append(merged_config_item)
    return merged_config_items
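
The merge order discussed in the thread above can be checked with a small standalone sketch. The dictionaries below are made-up sample values, not n2y's real defaults. One detail worth noting is that `dict.update` mutates in place and returns `None`, so the equivalent `update` form must not assign its result back.

```python
import copy

# Made-up sample values for illustration only.
builtin_defaults = {"pandoc_format": "gfm", "plugins": [], "url_property": None}
user_defaults = {"pandoc_format": "html"}
export_item = {"id": "abc", "url_property": "url"}

# Later dictionaries win: builtin defaults < user defaults < export item.
merged = {
    **copy.deepcopy(builtin_defaults),
    **copy.deepcopy(user_defaults),
    **copy.deepcopy(export_item),
}

# Equivalent in-place form; update() returns None, so it is called for
# its side effect rather than assigned.
same = copy.deepcopy(builtin_defaults)
same.update(copy.deepcopy(user_defaults))
same.update(copy.deepcopy(export_item))

# The deep copies keep mutable defaults from being shared between items.
merged["plugins"].append("n2y.plugins.deepheaders")
print(merged)
print(builtin_defaults["plugins"])
```

The merge is shallow: a nested value such as `property_map` in an export item replaces the default wholesale instead of being combined with it key by key.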


def validate_config(config):
    if "exports" not in config:
        logger.error("Config missing the 'exports' key")
        return False
    # Reject anything that is not a list, as well as an empty list
    if not (isinstance(config["exports"], list) and len(config["exports"]) > 0):
        logger.error("Config 'exports' key must be a non-empty list")
        return False
    for export in config["exports"]:
        if not _validate_config_item(export):
            return False
    # TODO: validate the export defaults key
    return True


def _validate_config_item(config_item):
    if "id" not in config_item:
        logger.error("Export config item missing the 'id' key")
        return False
    if not _valid_id(config_item["id"]):
        logger.error("Invalid id in export config item: %s", config_item["id"])
Member

Do we need a return False here?

Contributor Author

Yes! good catch!

    if "node_type" not in config_item:
        logger.error("Export config item missing the 'node_type' key")
        return False
    if config_item["node_type"] not in ["page", "database_as_yaml", "database_as_files"]:
        logger.error("Invalid node_type in export config item: %s", config_item["node_type"])
        return False
    if config_item["node_type"] == "database_as_files" and "filename_property" not in config_item:
        logger.error("Missing the 'filename_property' key when node_type is 'database_as_files'")
        return False
    if "output" not in config_item:
        logger.error("Export config item missing the 'output' key")
        return False
    if "notion_filter" in config_item:
        if not _valid_notion_filter(config_item["notion_filter"]):
            return False
    if "notion_sorts" in config_item:
        if not _valid_notion_sort(config_item["notion_sorts"]):
            return False
    # TODO: validate pandoc_format
    # TODO: validate pandoc_options
    # TODO: property map
    return True
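
For the missing `return False` raised in the review thread on the id check, a minimal sketch of the intended early-return pattern might look like this. `looks_like_notion_id` is a hypothetical stand-in for `_valid_id`, whose body is not shown in this diff; the rule it applies (32 hexadecimal digits once hyphens are removed) is an assumption.

```python
import logging
import re

logger = logging.getLogger(__name__)


def looks_like_notion_id(value):
    """Hypothetical id check: 32 hex digits, with hyphens being optional."""
    if not isinstance(value, str):
        return False
    return re.fullmatch(r"[0-9a-fA-F]{32}", value.replace("-", "")) is not None


def validate_export_id(config_item):
    """Log and return False as soon as the id is found to be unusable."""
    if "id" not in config_item:
        logger.error("Export config item missing the 'id' key")
        return False
    if not looks_like_notion_id(config_item["id"]):
        logger.error("Invalid id in export config item: %s", config_item["id"])
        # Without this return, an invalid id would only be logged and the
        # item could still pass validation.
        return False
    return True


print(validate_export_id({"id": "176fa24d4b7f4256877e60a1035b45a4"}))
print(validate_export_id({"id": "not-an-id"}))
```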


def _valid_notion_filter(notion_filter):
    if not (isinstance(notion_filter, list) or isinstance(notion_filter, dict)):
        logger.error("notion_filter must be a list or dict")
        return False
    # TODO validate keys and values
    return True


def _valid_notion_sort(notion_sorts):
    if not (isinstance(notion_sorts, list) or isinstance(notion_sorts, dict)):
        logger.error("notion_sorts must be a list or dict")
        return False
    # TODO validate keys and values
    return True

