From 4748b248b28cc69d31b73cb9d07c9174cf18052d Mon Sep 17 00:00:00 2001 From: Matt Nowzari Date: Wed, 5 Mar 2025 10:15:48 -0500 Subject: [PATCH] App Search to Open Crawler Migration Notebook (#416) --- bin/find-notebooks-to-test.sh | 1 + ...ch-crawler-to-open-crawler-migration.ipynb | 387 ++++++++++++++++++ 2 files changed, 388 insertions(+) create mode 100644 notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb diff --git a/bin/find-notebooks-to-test.sh b/bin/find-notebooks-to-test.sh index 01a410ed8..cb65fbb88 100755 --- a/bin/find-notebooks-to-test.sh +++ b/bin/find-notebooks-to-test.sh @@ -31,6 +31,7 @@ EXEMPT_NOTEBOOKS=( "notebooks/integrations/azure-openai/vector-search-azure-openai-elastic.ipynb" "notebooks/enterprise-search/app-search-engine-exporter.ipynb", "notebooks/enterprise-search/elastic-crawler-to-open-crawler-migration.ipynb", + "notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb", "notebooks/playground-examples/bedrock-anthropic-elasticsearch-client.ipynb", "notebooks/playground-examples/openai-elasticsearch-client.ipynb", "notebooks/integrations/hugging-face/huggingface-integration-millions-of-documents-with-cohere-reranking.ipynb", diff --git a/notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb b/notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb new file mode 100644 index 000000000..ec1ef4e96 --- /dev/null +++ b/notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb @@ -0,0 +1,387 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0cccdff9-5ef4-4bc8-a139-6aefa7609f1e", + "metadata": {}, + "source": [ + "## Hello, future Open Crawler user!\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb)\n", + "\n", + "This notebook is designed to help you migrate your [App Search Web Crawler](https://www.elastic.co/guide/en/app-search/current/web-crawler.html) configurations to Open Crawler-friendly YAML!\n", + "\n", + "We recommend running each cell individually in a sequential fashion, as each cell is dependent on previous cells having been run. Furthermore, we recommend that you only run each cell once as re-running cells may result in errors or incorrect YAML files.\n", + "\n", + "### Setup\n", + "First, let's start by making sure `elasticsearch` and other required dependencies are installed and imported by running the following cell:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db796f1b-ce29-432b-879b-d517b84fbd9f", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install elasticsearch\n", + "\n", + "from getpass import getpass\n", + "from elasticsearch import Elasticsearch\n", + "\n", + "import os\n", + "import yaml" + ] + }, + { + "cell_type": "markdown", + "id": "08b84987-19c5-4716-9534-1225ea993a9c", + "metadata": {}, + "source": [ + "We are going to need a few things from your Elasticsearch deployment before we can migrate your configurations:\n", + "- Your **Elasticsearch Endpoint URL**\n", + "- Your **Elasticsearch Endpoint Port number**\n", + "- An **API key**\n", + "\n", + "You can find your Endpoint URL and port number by visiting your Elasticsearch Overview page in Kibana.\n", + "\n", + "You can create a new API key from the Stack Management -> API keys menu in Kibana. 
Be sure to copy or write down your key in a safe place, as it will be displayed only once upon creation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5de84ecd-a055-4e73-ae4a-1cc2ad4be0a7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ELASTIC_ENDPOINT = getpass(\"Elastic Endpoint: \")\n",
+ "ELASTIC_PORT = getpass(\"Port: \")\n",
+ "API_KEY = getpass(\"Elastic API Key: \")\n",
+ "\n",
+ "es_client = Elasticsearch(\n",
+ "    \":\".join([ELASTIC_ENDPOINT, ELASTIC_PORT]),\n",
+ "    api_key=API_KEY,\n",
+ ")\n",
+ "\n",
+ "# ping Elasticsearch to make sure we have a positive connection\n",
+ "es_client.info()[\"tagline\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "aa8dfff3-713b-44a4-b694-2699eebe665e",
+ "metadata": {},
+ "source": [
+ "Hopefully you received our tagline 'You Know, for Search'. If so, we are connected and ready to go!\n",
+ "\n",
+ "If not, please double-check the endpoint URL, port number, and API key you provided above."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4756ffaa-678d-41aa-865b-909038034104",
+ "metadata": {},
+ "source": [
+ "### Step 1: Get information on all App Search engines and their Web Crawlers\n",
+ "\n",
+ "First, we need to establish which Crawlers you have and their basic configuration details.\n",
+ "The next cell will attempt to pull the configuration for every distinct App Search engine in your Elasticsearch instance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2e6d2f86-7aea-451f-bed2-107aaa1d4783",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# in-memory data structure that maintains the current state of the configs we've pulled\n",
+ "inflight_configuration_data = {}\n",
+ "\n",
+ "# get each engine's engine_oid\n",
+ "app_search_engines = es_client.search(\n",
+ "    index=\".ent-search-actastic-engines_v26\",\n",
+ ")\n",
+ "\n",
+ "engine_counter = 1\n",
+ "for engine in app_search_engines[\"hits\"][\"hits\"]:\n",
+ "    source = engine[\"_source\"]\n",
+ "    if not source[\"queued_for_deletion\"]:\n",
+ "        engine_oid = source[\"id\"]\n",
+ "        output_index = source[\"name\"]\n",
+ "\n",
+ "        # populate a temporary hashmap\n",
+ "        temp_conf_map = {\"output_index\": output_index}\n",
+ "        # pre-populate some necessary fields in preparation for upcoming steps\n",
+ "        temp_conf_map[\"domains_temp\"] = {}\n",
+ "        temp_conf_map[\"output_sink\"] = \"elasticsearch\"\n",
+ "        temp_conf_map[\"full_html_extraction_enabled\"] = False\n",
+ "        temp_conf_map[\"elasticsearch\"] = {\n",
+ "            \"host\": \"\",\n",
+ "            \"port\": \"\",\n",
+ "            \"api_key\": \"\",\n",
+ "        }\n",
+ "        # populate the in-memory data structure\n",
+ "        inflight_configuration_data[engine_oid] = temp_conf_map\n",
+ "        print(f\"{engine_counter}.) {output_index}\")\n",
+ "        engine_counter += 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "98657282-4027-40bb-9287-02b4a0381aec",
+ "metadata": {},
+ "source": [
+ "### Step 2: URLs, Sitemaps, and Crawl Rules\n",
+ "\n",
+ "In the next cell, we will query Elasticsearch for each Crawler's domain URLs, seed URLs, sitemaps, and crawl rules.\n",
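+ "\n",
+ "To make the parsing below easier to follow, here is a rough sketch of the shape we assume each domain document's `_source` will take (the field names match the `_source` filter in the query below; the values are made up):\n",
+ "\n",
+ "```python\n",
+ "{\n",
+ "    \"id\": \"66e0fe0e1234abcd5678ef90\",  # the domain's ID (illustrative value)\n",
+ "    \"name\": \"https://www.example.com\",  # the domain URL\n",
+ "    \"seed_urls\": [{\"url\": \"https://www.example.com/blog/\"}],\n",
+ "    \"sitemaps\": [{\"url\": \"https://www.example.com/sitemap.xml\"}],\n",
+ "    \"crawl_rules\": [{\"policy\": \"allow\", \"rule\": \"begins\", \"pattern\": \"/blog\"}],\n",
+ "}\n",
+ "```"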
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "94121748-5944-4dc4-9456-cdc754c7126c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "crawler_counter = 1\n",
+ "for engine_oid, crawler_config in inflight_configuration_data.items():\n",
+ "    # get each crawler's domain details\n",
+ "    crawler_domains = es_client.search(\n",
+ "        index=\".ent-search-actastic-crawler_domains_v6\",\n",
+ "        query={\"match\": {\"engine_oid\": engine_oid}},\n",
+ "        _source=[\"crawl_rules\", \"id\", \"name\", \"seed_urls\", \"sitemaps\"],\n",
+ "    )\n",
+ "\n",
+ "    print(f\"{crawler_counter}.) Engine ID {engine_oid}\")\n",
+ "    crawler_counter += 1\n",
+ "\n",
+ "    for domain_info in crawler_domains[\"hits\"][\"hits\"]:\n",
+ "        source = domain_info[\"_source\"]\n",
+ "\n",
+ "        # extract values\n",
+ "        domain_oid = str(source[\"id\"])\n",
+ "        domain_url = source[\"name\"]\n",
+ "        seed_urls = source[\"seed_urls\"]\n",
+ "        sitemap_urls = source[\"sitemaps\"]\n",
+ "        crawl_rules = source[\"crawl_rules\"]\n",
+ "\n",
+ "        print(f\"  Domain {domain_url} found!\")\n",
+ "\n",
+ "        # transform seed URLs, sitemap URLs, and crawl rules into lists\n",
+ "        seed_urls_list = []\n",
+ "        for seed_obj in seed_urls:\n",
+ "            seed_urls_list.append(seed_obj[\"url\"])\n",
+ "\n",
+ "        sitemap_urls_list = []\n",
+ "        for sitemap_obj in sitemap_urls:\n",
+ "            sitemap_urls_list.append(sitemap_obj[\"url\"])\n",
+ "\n",
+ "        crawl_rules_list = []\n",
+ "        for crawl_rules_obj in crawl_rules:\n",
+ "            crawl_rules_list.append(\n",
+ "                {\n",
+ "                    \"policy\": crawl_rules_obj[\"policy\"],\n",
+ "                    \"type\": crawl_rules_obj[\"rule\"],\n",
+ "                    \"pattern\": crawl_rules_obj[\"pattern\"],\n",
+ "                }\n",
+ "            )\n",
+ "\n",
+ "        # populate a temporary hashmap\n",
+ "        temp_domain_conf = {\"url\": domain_url}\n",
+ "        if seed_urls_list:\n",
+ "            temp_domain_conf[\"seed_urls\"] = seed_urls_list\n",
+ "            print(f\"    Seed URLs found: {seed_urls_list}\")\n",
+ "        if sitemap_urls_list:\n",
+ "            temp_domain_conf[\"sitemap_urls\"] = sitemap_urls_list\n",
+ "            print(f\"    Sitemap URLs found: {sitemap_urls_list}\")\n",
+ "        if crawl_rules_list:\n",
+ "            temp_domain_conf[\"crawl_rules\"] = crawl_rules_list\n",
+ "            print(f\"    Crawl rules found: {crawl_rules_list}\")\n",
+ "\n",
+ "        # populate the in-memory data structure\n",
+ "        inflight_configuration_data[engine_oid][\"domains_temp\"][\n",
+ "            domain_oid\n",
+ "        ] = temp_domain_conf\n",
+ "    print()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cd711055-17a5-47c6-ba73-ff7e386a393b",
+ "metadata": {},
+ "source": [
+ "### Step 3: Creating the Open Crawler YAML configuration files\n",
+ "\n",
+ "In this final step, we will create the actual YAML files you need to get up and running with Open Crawler!\n",
+ "\n",
+ "The next cell performs some final transformations on the in-memory data structure that has been keeping track of your configurations.\n",
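+ "\n",
+ "As a rough sketch (illustrative values only), each configuration goes from\n",
+ "\n",
+ "```python\n",
+ "{\"output_index\": \"my-search-index\", \"domains_temp\": {\"66e0fe0e1234abcd5678ef90\": {\"url\": \"https://www.example.com\"}}}\n",
+ "```\n",
+ "\n",
+ "to\n",
+ "\n",
+ "```python\n",
+ "{\"output_index\": \"my-search-index\", \"domains\": [{\"url\": \"https://www.example.com\"}]}\n",
+ "```\n",
+ "\n",
+ "so that the domain configurations become a plain list under a `domains` key, with no domain IDs as keys."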
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e3e080c5-a528-47fd-9b79-0718d17d2560",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Final transform of the in-memory data structure into a form we can dump to YAML:\n",
+ "# for each crawler, collect all of its domain configurations into a list\n",
+ "for engine_oid, crawler_config in inflight_configuration_data.items():\n",
+ "    all_crawler_domains = []\n",
+ "\n",
+ "    for domain_config in crawler_config[\"domains_temp\"].values():\n",
+ "        all_crawler_domains.append(domain_config)\n",
+ "    # create a new key called \"domains\" that points to a list of domain configs only - no domain_oid values as keys\n",
+ "    crawler_config[\"domains\"] = all_crawler_domains\n",
+ "    # delete the temporary domains key\n",
+ "    del crawler_config[\"domains_temp\"]\n",
+ "    print(f\"Transform for {engine_oid} complete!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7681f3c5-2f88-4e06-ad6a-65bedb5634e0",
+ "metadata": {},
+ "source": [
+ "#### **Wait! Before we continue on to creating our YAML files, we're going to need your input on a few things.**\n",
+ "\n",
+ "In the next cell, please enter the following details about the _Elasticsearch instance you will be using with Open Crawler_. This instance can be Elastic Cloud Hosted, Serverless, or a local instance.\n",
+ "\n",
+ "- The Elasticsearch endpoint URL\n",
+ "- The port number of your Elasticsearch endpoint _(optional; defaults to 443 if left blank)_\n",
+ "- An API key"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e6cfab1b-cf2e-4a41-9b54-492f59f7f61a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ENDPOINT = getpass(\"Elasticsearch endpoint URL: \")\n",
+ "PORT = getpass(\"[OPTIONAL] Elasticsearch endpoint port number: \")\n",
+ "OUTPUT_API_KEY = getpass(\"Elasticsearch API key: \")\n",
+ "\n",
+ "# set the above values in each Crawler's configuration\n",
+ "for crawler_config in inflight_configuration_data.values():\n",
+ "    crawler_config[\"elasticsearch\"][\"host\"] = ENDPOINT\n",
+ "    crawler_config[\"elasticsearch\"][\"port\"] = int(PORT) if PORT else 443\n",
+ "    crawler_config[\"elasticsearch\"][\"api_key\"] = OUTPUT_API_KEY\n",
+ "\n",
+ "# ping Elasticsearch to make sure we have a positive connection\n",
+ "es_client = Elasticsearch(\n",
+ "    \":\".join([ENDPOINT, PORT if PORT else \"443\"]),  # fall back to the default port if none was given\n",
+ "    api_key=OUTPUT_API_KEY,\n",
+ ")\n",
+ "\n",
+ "es_client.info()[\"tagline\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "14ed1150-a965-4739-9111-6dfc435ced4e",
+ "metadata": {},
+ "source": [
+ "#### **This is the final step! You have two options here:**\n",
+ "\n",
+ "- The \"Write to YAML\" cell will create one YAML file for each Crawler you have.\n",
+ "- The \"Print to output\" cell will print each Crawler's configuration YAML in the Notebook, so you can copy-paste them into your Open Crawler YAML files manually.\n",
+ "\n",
+ "Feel free to run both! You can run Option 2 first to see the output before running Option 1 to save the configs into YAML files.\n",
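+ "\n",
+ "Either way, each Crawler's configuration should come out of `yaml.safe_dump` looking roughly like the sketch below (the values are illustrative, abbreviated to a single domain; the key order follows the structure built in the previous steps):\n",
+ "\n",
+ "```yaml\n",
+ "output_index: my-search-index\n",
+ "output_sink: elasticsearch\n",
+ "full_html_extraction_enabled: false\n",
+ "elasticsearch:\n",
+ "  host: https://my-deployment.es.us-east-1.aws.elastic.cloud\n",
+ "  port: 443\n",
+ "  api_key: <your API key>\n",
+ "domains:\n",
+ "  - url: https://www.example.com\n",
+ "    seed_urls:\n",
+ "      - https://www.example.com/blog/\n",
+ "```"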
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a203432-d0a6-4929-a109-5fbf5d3f5e56",
+ "metadata": {},
+ "source": [
+ "#### Option 1: Write to YAML files"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0d6df1ff-1b47-45b8-a057-fc10cadf7dc5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Dump each Crawler's configuration into its own YAML file\n",
+ "for crawler_config in inflight_configuration_data.values():\n",
+ "    base_dir = os.getcwd()\n",
+ "    file_name = (\n",
+ "        f\"{crawler_config['output_index']}-config.yml\"  # autogenerate a filename from the output index\n",
+ "    )\n",
+ "    output_path = os.path.join(base_dir, file_name)\n",
+ "\n",
+ "    with open(output_path, \"w\") as file:\n",
+ "        yaml.safe_dump(crawler_config, file, sort_keys=False)\n",
+ "    print(f\"Wrote {file_name} to {output_path}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "49cedb05-9d4e-4c09-87a8-e15fb326ce12",
+ "metadata": {},
+ "source": [
+ "#### Option 2: Print to output"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e99d6c91-914b-4b58-ad7d-3af2db8fff4b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for crawler_config in inflight_configuration_data.values():\n",
+ "    yaml_out = yaml.safe_dump(crawler_config, sort_keys=False)\n",
+ "\n",
+ "    print(f\"YAML config => {crawler_config['output_index']}-config.yml\\n--------\")\n",
+ "    print(yaml_out)\n",
+ "    print(\"-\" * 80)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "26b35570-f05d-4aa2-82fb-89ec37f84015",
+ "metadata": {},
+ "source": [
+ "### Next Steps\n",
+ "\n",
+ "Now that the YAML files have been generated, you can visit the Open Crawler GitHub repository to learn more about how to deploy Open Crawler: https://github.com/elastic/crawler#quickstart\n",
+ "\n",
+ "Additionally, you can learn more about Open Crawler via the following blog posts:\n",
+ "- [Open Crawler's promotion to beta release](https://www.elastic.co/search-labs/blog/elastic-open-crawler-beta-release)\n",
+ "- [How to use Open Crawler with Semantic Text](https://www.elastic.co/search-labs/blog/semantic-search-open-crawler) to easily crawl websites and make them semantically searchable\n",
+ "\n",
+ "If you find any problems with this Notebook, please feel free to create an issue in the elasticsearch-labs repository: https://github.com/elastic/elasticsearch-labs/issues"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}