
Microsoft OneNote import script

Ashley Cawley edited this page Sep 5, 2019 · 2 revisions

by: Ashley Cawley, updated: September 2019

This script is for anyone who is looking to move away from Microsoft OneNote to Zim Desktop.

This script can convert hundreds of OneNote HTML pages (exported via Azure's API using danmou's method, https://superuser.com/a/1449705) to ZimWiki format for use with Zim Desktop. Credit to user danmou on superuser.com, who suggested a working method for extracting Microsoft OneNote notebooks via Azure's API platform.

The resulting HTML files are great in that they're more versatile and open than Microsoft's proprietary format; however, they are still not ideal for Zim Desktop, which does not allow you to import .html files.

I created this script to convert those HTML files into ZimWiki markup using pandoc. The script also fixes a whole host of formatting and structural issues to make the process work.
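As a rough illustration of the conversion step on its own (a minimal sketch, not the full script), a single page could be converted like this. The function name html_to_zim and the file names are hypothetical; the pandoc flags are the same ones the script below uses.

```shell
#!/bin/bash
# Hypothetical helper for illustration only; it is not part of the original script.
html_to_zim() {
    local input="$1" output="$2"

    # Bail out early if pandoc is not installed (127 is the usual "command not found" code).
    if ! command -v pandoc >/dev/null 2>&1; then
        echo "pandoc was not found, please install it first." >&2
        return 127
    fi

    # Same flags as the script: standalone output (-s), read HTML (-r html), write ZimWiki (-t zimwiki).
    pandoc -s -r html "$input" -t zimwiki -o "$output"
}

# Example usage (file names are placeholders):
#   html_to_zim "My_Page.html" "My_Page.txt"
```

Trying this on one exported page first is a cheap way to see what pandoc produces before running the whole script over a notebook.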

I hope you find this useful. I've commented the code so you can see what is going on. I'm no bash guru, so please be kind and feel free to contribute improvements if you wish; you can find the Git repository at https://github.com/ashleycawley/Convert-HTML-to-ZimWiki.

For the sake of completeness I will include danmou's guide at the bottom of this article. It describes how to export your OneNote Notebook via the Azure API (just in case anything ever happens to the SuperUser page).

One setting to configure: you will need to update the ONENOTE_EXPORT_FOLDER variable in the script so that it points to the directory that holds your exported Notebook from danmou's method.
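Before running the conversion, it may be worth confirming that the folder you configured actually exists. The check below is a minimal sketch and not part of the original script; the function name check_export_folder, the example path and the script file name convert.sh are all hypothetical. The script appears to list folders using relative paths, so it seems to expect to be run from inside the export folder; treat that as an assumption too.

```shell
#!/bin/bash
# Hypothetical pre-flight check; not part of the original script.
check_export_folder() {
    local folder="$1"

    # Refuse to continue if the path is missing or is not a directory.
    if [ ! -d "$folder" ]; then
        echo "Export folder not found: $folder" >&2
        return 1
    fi

    echo "Using export folder: $folder"
}

# Example usage (the path and script name are placeholders):
#   check_export_folder "$HOME/Notebooks/my-notebook" \
#       && cd "$HOME/Notebooks/my-notebook" \
#       && ./convert.sh
```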

The Bash Script

Project: https://github.com/ashleycawley/Convert-HTML-to-ZimWiki.

#!/bin/bash

# Author: Ashley Cawley // ash@ashleycawley.co.uk // @ashleycawley
#
# Description: This script is designed to convert (OneNote) HTML files into a Zim Desktop format (ZimWiki).
# Specifically this is designed for people trying to move away from OneNote to Zim Desktop - https://zim-wiki.org/
#
# Follow this guide: https://superuser.com/a/1449705 to export your OneNote Notebook as HTML files and then you can
# use this script to convert those HTML files into a format compatible with Zim Desktop (ZimWiki Markup).
#
# This script backs-up your HTML files before doing its work. It fixes formatting/layout issues, moves & renames
# files & folders into the correct hierarchy, embeds document titles in articles, cleans up files afterwards and
# much more. I have commented code where I can to explain the process step by step.

# OneNote Export Folder
ONENOTE_EXPORT_FOLDER='/home/acawley/Notebooks/testing/cloudabove' # Configure to point at your Exported OneNote Notebook

# Tests to see if you are logged in as root, if you are it stops. (A standard user is preferred)
if
[ `whoami` == root ]
then
	echo "Please do not run this script as root. Please re-run the script as a normal user."
	exit 1
fi

# A check to ensure the correct software is already installed
declare -r U_CMDS="pandoc tar grep rsync find sed"
for the_command in $U_CMDS
do
        type -P $the_command >> /dev/null && : || {
        echo -e "$the_command command was not found, please install $the_command and then try using this script again." >&2
        exit 1
    }
done

# Functions

# Reads input from stdin (via a pipe) and swaps out any spaces for underscores
function SWAP_SPACES_FOR_UNDERSCORES () {
	sed 's/ /_/g'
}

SAVEIFS=$IFS

# Changing the delimiter used by arrays from a space to a new line, this allows my for loops to iterate through a vertical list provided by the likes of ls -1
IFS=$'\n'

echo "" # Just creating a space from the last line

# This stores script-name.sh inside the variable $SCRIPTNAME
SCRIPTNAME=`basename "$0"`

# Archive / Backup of original file set before processing
# The backup name is computed once so that --exclude and the output path always match, even if the clock ticks over between calls
BACKUP_FILE="$ONENOTE_EXPORT_FOLDER/pre-conversion-backup-`date +"%d-%m-%y_%T"`.tar.gz"
tar --exclude="$BACKUP_FILE" -zcvf "$BACKUP_FILE" "$ONENOTE_EXPORT_FOLDER"

# Gathers list of top level folder names whilst excluding this script from the list
LVL1_FOLDER_NAMES=(`ls -1 $ONENOTE_EXPORT_FOLDER | grep -v $SCRIPTNAME`)

# Main for loop which iterates through all of the various folders to do the work
for FOLDER_NAME in "${LVL1_FOLDER_NAMES[@]}"
do
	SECOND_LVL_FOLDERS=(`ls -1 $FOLDER_NAME`)

	for SECOND_FOLDER in "${SECOND_LVL_FOLDERS[@]}"
	do
		# Removing number prefixes like: 23_ 24_ etc.
		SECOND_FOLDER_SANITISED=(`echo $SECOND_FOLDER | cut -d '_' -f2-`)

		# Renaming document title / file from main.html to whatever the folder name is
		mv "$FOLDER_NAME/$SECOND_FOLDER/main.html" "$FOLDER_NAME/$SECOND_FOLDER/$SECOND_FOLDER_SANITISED.html" &>/dev/null

		STATUS=(`echo $?`)

		# Setting up communal image directory
		mkdir -p $FOLDER_NAME/$SECOND_FOLDER/images/

		# Copies image files out of subdirectory and into one central image folder that Zim can use and reference
		rsync -ah --remove-source-files $FOLDER_NAME/$SECOND_FOLDER/images/ $FOLDER_NAME/images/ &>/dev/null

		if [ $STATUS != 0 ]
		then

			THIRD_LEVEL_FOLDER=(`ls -1 $FOLDER_NAME/$SECOND_FOLDER/`)

			# Removing number prefixes like: 23_ 24_ etc.
			THIRD_LEVEL_FOLDER=(`echo $THIRD_LEVEL_FOLDER | cut -d '_' -f2-`)

			# echo "Going to a Third Level:"

			# Sanitises the filename replacing spaces with underscores
			FILENAME_WITH_UNDERSCORES=(`echo "$THIRD_LEVEL_FOLDER" | SWAP_SPACES_FOR_UNDERSCORES`)

			# echo "mv $FOLDER_NAME/$SECOND_FOLDER/$THIRD_LEVEL_FOLDER/main.html $FOLDER_NAME/$SECOND_FOLDER/$THIRD_LEVEL_FOLDER/$FILENAME_WITH_UNDERSCORES.html"
			mv "$FOLDER_NAME/$SECOND_FOLDER/$THIRD_LEVEL_FOLDER/main.html" "$FOLDER_NAME/$SECOND_FOLDER/$THIRD_LEVEL_FOLDER/$FILENAME_WITH_UNDERSCORES.html"

			# Moves the document up one level as it doesn't want to be in a subfolder for itself to work with Zim nicely
			mv "$FOLDER_NAME/$SECOND_FOLDER/$THIRD_LEVEL_FOLDER/$FILENAME_WITH_UNDERSCORES.html" "$FOLDER_NAME/$SECOND_FOLDER/$FILENAME_WITH_UNDERSCORES.html"

			# Converts the document from HTML to Zim Wiki Format
			pandoc -s -r html "$FOLDER_NAME/$SECOND_FOLDER/$FILENAME_WITH_UNDERSCORES.html" -t zimwiki -o "$FOLDER_NAME/$SECOND_FOLDER/$FILENAME_WITH_UNDERSCORES.txt" # Reads from the location the file was just moved to

			# Inserts a title into the article itself
			#echo "sed -i "3i====== $SECOND_FOLDER_SANITISED ======" "$FOLDER_NAME/$SECOND_FOLDER/$FILENAME_WITH_UNDERSCORES.txt""
			sed -i "3i====== $SECOND_FOLDER_SANITISED ======" "$FOLDER_NAME/$SECOND_FOLDER/$FILENAME_WITH_UNDERSCORES.txt" # &>/dev/null
			#read -p "Pausing - Enter to continue" INPUT

		fi

		if [ $STATUS == 0 ]
		then
			# Sanitises the filename replacing spaces with underscores
			FILENAME_WITH_UNDERSCORES=(`echo "$SECOND_FOLDER_SANITISED" | SWAP_SPACES_FOR_UNDERSCORES`)
			mv "$FOLDER_NAME/$SECOND_FOLDER/$SECOND_FOLDER_SANITISED.html" "$FOLDER_NAME/$SECOND_FOLDER/$FILENAME_WITH_UNDERSCORES.html"

			# Moves the document up one level as it doesn't want to be in a subfolder for itself to work with Zim nicely
			mv "$FOLDER_NAME/$SECOND_FOLDER/$FILENAME_WITH_UNDERSCORES.html" "$FOLDER_NAME/$FILENAME_WITH_UNDERSCORES.html"

			# Converts the document from HTML to Zim Wiki Format
			pandoc -s -r html "$FOLDER_NAME/$FILENAME_WITH_UNDERSCORES.html" -t zimwiki -o "$FOLDER_NAME/$FILENAME_WITH_UNDERSCORES.txt"

			# Inserts a title into the article itself
			#echo "sed -i "3i====== $SECOND_FOLDER_SANITISED ======" "$FOLDER_NAME/$FILENAME_WITH_UNDERSCORES.txt""
			sed -i "3i====== $SECOND_FOLDER_SANITISED ======" "$FOLDER_NAME/$FILENAME_WITH_UNDERSCORES.txt" # &>/dev/null
			#read -p "Pausing - Enter to continue" INPUT
		fi
	done

	echo ""
done
echo ""

# The three commands below attempt to remove all spaces from directory names; the problem is made harder by trailing spaces or spaces at the start of a filename
find $ONENOTE_EXPORT_FOLDER -name "* *" -print0 | sort -rz | while read -d $'\0' f; do mv -v "$f" "$(dirname "$f")/$(basename "${f// /_}")"; done
find $ONENOTE_EXPORT_FOLDER -name "* " -print0 | sort -rz | while read -d $'\0' f; do mv -v "$f" "$(dirname "$f")/$(basename "${f// /_}")"; done
find $ONENOTE_EXPORT_FOLDER -name " *" -print0 | sort -rz | while read -d $'\0' f; do mv -v "$f" "$(dirname "$f")/$(basename "${f// /_}")"; done

# Fix Broken Image Paths, the default syntax pandoc seems to be inserting for zimwiki format doesn't seem to be working, so I'm replacing it with working syntax
IMAGE_PATHS=(`grep -ril "{{:images" $ONENOTE_EXPORT_FOLDER/ | grep -v "$SCRIPTNAME"`)
for IMAGE in "${IMAGE_PATHS[@]}"
do
	sed -i 's,{{:images,{{../images,g' "$IMAGE"
done

FILE_LIST=(`find $ONENOTE_EXPORT_FOLDER -type f -name "*.txt"`)

# Replaces doubled-up (two newlines) for just one, which makes the layout more sensible in Zim
for FILE in "${FILE_LIST[@]}"
do
	sed -i '/^$/N;/^\n$/D' $FILE
done

# Deletes empty folders after things have been moved around
find $ONENOTE_EXPORT_FOLDER -type d -empty -delete

# Deletes OneNote HTML files which are no longer needed (they've already been backed up toward the top of this script)
find $ONENOTE_EXPORT_FOLDER -type f -name "*.html" -delete

# Resets $IFS this changes the delimiter that arrays use from new lines (\n) back to just spaces (which is what it normally is)
IFS=$SAVEIFS
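The renaming the script applies to each page folder (dropping the numeric prefix with cut, then swapping spaces for underscores with sed) can be tried in isolation. This is a minimal sketch using the same two commands; the function name sanitise_name and the sample folder name are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper that combines the two renaming steps; not part of the original script.
sanitise_name() {
    # cut drops everything up to and including the first underscore (the "23_" style prefix);
    # sed then replaces every remaining space with an underscore.
    printf '%s\n' "$1" | cut -d '_' -f2- | sed 's/ /_/g'
}

sanitise_name "23_Meeting Notes"   # prints: Meeting_Notes
```

One thing to be aware of: cut removes whatever comes before the first underscore, whether or not it is a number, so a folder called "My_Page" with no numeric prefix would come out as "Page". A name with no underscore at all is passed through unchanged.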

Exporting OneNote Notebooks from Azure API

Source: Posted by danmou in June 2019 (https://superuser.com/a/1449705).

I found a solution using Microsoft's Graph API. This means you don't even have to run OneNote, it just requires that your notes are synced to your Microsoft account and then you can get your notes as perfectly formatted HTML (which you can view in the browser or convert to whatever format you prefer using Pandoc).

The magic happens in this Python script. It runs a simple local web server that you can use to log in to your Microsoft account. Once you do that, it downloads all your notes as HTML, plus images and attachments in their original formats, and stores them in a file hierarchy that preserves the original structure of your notebooks (including page order and subpages).

Before you can run the script, you have to register an "app" in Microsoft Azure so it can access the Graph API:

Go to https://aad.portal.azure.com/ and log in with your Microsoft account.

Select "Azure Active Directory" and then "App registrations" under "Manage".

Select "New registration". Choose any name, set "Supported account types" to "Accounts in any organizational directory and personal Microsoft accounts" and under "Redirect URI", select Web and enter http://localhost:5000/getToken. Register.

Copy the "Application (client) ID" and paste it as client_id in the beginning of the Python script.

Select "Certificates & secrets" under "Manage". Press "New client secret", choose a name and confirm.

Copy the client secret and paste it as secret in the Python script.

Select "API permissions" under "Manage". Press "Add a permission", scroll down and select OneNote, choose "Delegated permissions" and check "Notes.Read" and "Notes.Read.All". Press "Add permissions".

Then you need to install the Python dependencies. Make sure you have Python 3.7 (or newer) installed and install the dependencies using the command pip install flask msal requests_oauthlib.

Now you can run the script. In a terminal, navigate to the directory where the script is located and run it using python onenote_export.py. This will start a local web server on port 5000.

In your browser, navigate to http://localhost:5000 and log in to your Microsoft account. The first time you do this, you will also have to accept that the app can read your OneNote notes. (This does not give any third parties access to your data, as long as you don't share the client ID and secret you created on the Azure portal.) After this, go back to the terminal to follow the progress.

Note: Microsoft limits how many requests you can do within a given time period. Therefore, if you have many notes you might eventually see messages like this in the terminal: Too many requests, waiting 20s and trying again. This is not a problem, but it means the entire process can take a while. Also, the login session can expire after a while, which results in a TokenExpiredError. If this happens, simply reload http://localhost:5000 and the script will continue (skipping the files it already downloaded).
