Skip to content

Commit

Permalink
FEATURE: Add OneNote document loader (#13841)
Browse files Browse the repository at this point in the history
- **Description:** Added OneNote document loader
  - **Issue:** #12125
  - **Dependencies:** msal

Co-authored-by: Bagatur <baskaryan@gmail.com>
  • Loading branch information
Dylan20XX and baskaryan committed Nov 27, 2023
1 parent ff7d4d9 commit 1983a39
Show file tree
Hide file tree
Showing 5 changed files with 506 additions and 5 deletions.
118 changes: 118 additions & 0 deletions docs/docs/integrations/document_loaders/onenote.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6125a85e",
"metadata": {},
"source": [
"# Microsoft OneNote\n",
"\n",
"This notebook covers how to load documents from `OneNote`.\n",
"\n",
"## Prerequisites\n",
"1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
"2. When registration finishes, the Azure portal displays the app registration's Overview pane. You see the Application (client) ID. Also called the `client ID`, this value uniquely identifies your application in the Microsoft identity platform.\n",
"3. During the steps you will be following at **item 1**, you can set the redirect URI as `http://localhost:8000/callback`\n",
"4. During the steps you will be following at **item 1**, generate a new password (`client_secret`) under Application Secrets section.\n",
"5. Follow the instructions at this [document](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-configure-app-expose-web-apis#add-a-scope) to add the following `SCOPES` (`Notes.Read`) to your application.\n",
"6. You need to install the msal and bs4 packages using the commands `pip install msal` and `pip install beautifulsoup4`.\n",
"7. At the end of the steps you must have the following values: \n",
"- `CLIENT_ID`\n",
"- `CLIENT_SECRET`\n",
"\n",
"## 🧑 Instructions for ingesting your documents from OneNote\n",
"\n",
"### 🔑 Authentication\n",
"\n",
"By default, the `OneNoteLoader` expects that the values of `CLIENT_ID` and `CLIENT_SECRET` must be stored as environment variables named `MS_GRAPH_CLIENT_ID` and `MS_GRAPH_CLIENT_SECRET` respectively. You could pass those environment variables through a `.env` file at the root of your application or using the following command in your script.\n",
"\n",
"```python\n",
"os.environ['MS_GRAPH_CLIENT_ID'] = \"YOUR CLIENT ID\"\n",
"os.environ['MS_GRAPH_CLIENT_SECRET'] = \"YOUR CLIENT SECRET\"\n",
"```\n",
"\n",
"This loader uses an authentication called [*on behalf of a user*](https://learn.microsoft.com/en-us/graph/auth-v2-user?context=graph%2Fapi%2F1.0&view=graph-rest-1.0). It is a 2 step authentication with user consent. When you instantiate the loader, it will call will print a url that the user must visit to give consent to the app on the required permissions. The user must then visit this url and give consent to the application. Then the user must copy the resulting page url and paste it back on the console. The method will then return True if the login attempt was successful.\n",
"\n",
"\n",
"```python\n",
"from langchain.document_loaders.onenote import OneNoteLoader\n",
"\n",
"loader = OneNoteLoader(notebook_name=\"NOTEBOOK NAME\", section_name=\"SECTION NAME\", page_title=\"PAGE TITLE\")\n",
"```\n",
"\n",
"Once the authentication has been done, the loader will store a token (`onenote_graph_token.txt`) at `~/.credentials/` folder. This token could be used later to authenticate without the copy/paste steps explained earlier. To use this token for authentication, you need to change the `auth_with_token` parameter to True in the instantiation of the loader.\n",
"\n",
"```python\n",
"from langchain.document_loaders.onenote import OneNoteLoader\n",
"\n",
"loader = OneNoteLoader(notebook_name=\"NOTEBOOK NAME\", section_name=\"SECTION NAME\", page_title=\"PAGE TITLE\", auth_with_token=True)\n",
"```\n",
"\n",
"Alternatively, you can also pass the token directly to the loader. This is useful when you want to authenticate with a token that was generated by another application. For instance, you can use the [Microsoft Graph Explorer](https://developer.microsoft.com/en-us/graph/graph-explorer) to generate a token and then pass it to the loader.\n",
"\n",
"```python\n",
"from langchain.document_loaders.onenote import OneNoteLoader\n",
"\n",
"loader = OneNoteLoader(notebook_name=\"NOTEBOOK NAME\", section_name=\"SECTION NAME\", page_title=\"PAGE TITLE\", access_token=\"TOKEN\")\n",
"```\n",
"\n",
"### 🗂️ Documents loader\n",
"\n",
"#### 📑 Loading pages from a OneNote Notebook\n",
"\n",
"`OneNoteLoader` can load pages from OneNote notebooks stored in OneDrive. You can specify any combination of `notebook_name`, `section_name`, `page_title` to filter for pages under a specific notebook, under a specific section, or with a specific title respectively. For instance, you want to load all pages that are stored under a section called `Recipes` within any of your notebooks OneDrive.\n",
"\n",
"\n",
"```python\n",
"from langchain.document_loaders.onenote import OneNoteLoader\n",
"\n",
"loader = OneNoteLoader(section_name=\"Recipes\", auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n",
"\n",
"#### 📑 Loading pages from a list of Page IDs\n",
"\n",
"Another possibility is to provide a list of `object_ids` for each page you want to load. For that, you will need to query the [Microsoft Graph API](https://developer.microsoft.com/en-us/graph/graph-explorer) to find all the documents ID that you are interested in. This [link](https://learn.microsoft.com/en-us/graph/onenote-get-content#page-collection) provides a list of endpoints that will be helpful to retrieve the documents ID.\n",
"\n",
"For instance, to retrieve information about all pages that are stored in your notebooks, you need make a request to: `https://graph.microsoft.com/v1.0/me/onenote/pages`. Once you have the list of IDs that you are interested in, then you can instantiate the loader with the following parameters.\n",
"\n",
"\n",
"```python\n",
"from langchain.document_loaders.onenote import OneNoteLoader\n",
"\n",
"loader = OneNoteLoader(object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb36fe41",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
216 changes: 216 additions & 0 deletions libs/langchain/langchain/document_loaders/onenote.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,216 @@
"""Loads data from OneNote Notebooks"""
from pathlib import Path
from typing import Dict, Iterator, List, Optional

import requests

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from langchain.pydantic_v1 import BaseModel, BaseSettings, Field, FilePath, SecretStr


class _OneNoteGraphSettings(BaseSettings):
client_id: str = Field(..., env="MS_GRAPH_CLIENT_ID")
client_secret: SecretStr = Field(..., env="MS_GRAPH_CLIENT_SECRET")

class Config:
"""Config for OneNoteGraphSettings."""

env_prefix = ""
case_sentive = False
env_file = ".env"


class OneNoteLoader(BaseLoader, BaseModel):
"""Load pages from OneNote notebooks."""

settings: _OneNoteGraphSettings = Field(default_factory=_OneNoteGraphSettings)
"""Settings for the Microsoft Graph API client."""
auth_with_token: bool = False
"""Whether to authenticate with a token or not. Defaults to False."""
access_token: str = ""
"""Personal access token"""
onenote_api_base_url: str = "https://graph.microsoft.com/v1.0/me/onenote"
"""URL of Microsoft Graph API for OneNote"""
authority_url = "https://login.microsoftonline.com/consumers/"
"""A URL that identifies a token authority"""
token_path: FilePath = Path.home() / ".credentials" / "onenote_graph_token.txt"
"""Path to the file where the access token is stored"""
notebook_name: Optional[str] = None
"""Filter on notebook name"""
section_name: Optional[str] = None
"""Filter on section name"""
page_title: Optional[str] = None
"""Filter on section name"""
object_ids: Optional[List[str]] = None
""" The IDs of the objects to load data from."""

def lazy_load(self) -> Iterator[Document]:
"""
Get pages from OneNote notebooks.
Returns:
A list of Documents with attributes:
- page_content
- metadata
- title
"""
self._auth()

try:
from bs4 import BeautifulSoup
except ImportError:
raise ImportError(
"beautifulsoup4 package not found, please install it with "
"`pip install bs4`"
)

if self.object_ids is not None:
for object_id in self.object_ids:
page_content_html = self._get_page_content(object_id)
soup = BeautifulSoup(page_content_html, "html.parser")
page_title = ""
title_tag = soup.title
if title_tag:
page_title = title_tag.get_text(strip=True)
page_content = soup.get_text(separator="\n", strip=True)
yield Document(
page_content=page_content, metadata={"title": page_title}
)
else:
request_url = self._url

while request_url != "":
response = requests.get(request_url, headers=self._headers, timeout=10)
response.raise_for_status()
pages = response.json()

for page in pages["value"]:
page_id = page["id"]
page_content_html = self._get_page_content(page_id)
soup = BeautifulSoup(page_content_html, "html.parser")
page_title = ""
title_tag = soup.title
if title_tag:
page_content = soup.get_text(separator="\n", strip=True)
yield Document(
page_content=page_content, metadata={"title": page_title}
)

if "@odata.nextLink" in pages:
request_url = pages["@odata.nextLink"]
else:
request_url = ""

def load(self) -> List[Document]:
"""
Get pages from OneNote notebooks.
Returns:
A list of Documents with attributes:
- page_content
- metadata
- title
"""
return list(self.lazy_load())

def _get_page_content(self, page_id: str) -> str:
"""Get page content from OneNote API"""
request_url = self.onenote_api_base_url + f"/pages/{page_id}/content"
response = requests.get(request_url, headers=self._headers, timeout=10)
response.raise_for_status()
return response.text

@property
def _headers(self) -> Dict[str, str]:
"""Return headers for requests to OneNote API"""
return {
"Authorization": f"Bearer {self.access_token}",
}

@property
def _scopes(self) -> List[str]:
"""Return required scopes."""
return ["Notes.Read"]

def _auth(self) -> None:
"""Authenticate with Microsoft Graph API"""
if self.access_token != "":
return

if self.auth_with_token:
with self.token_path.open("r") as token_file:
self.access_token = token_file.read()
else:
try:
from msal import ConfidentialClientApplication
except ImportError as e:
raise ImportError(
"MSAL package not found, please install it with `pip install msal`"
) from e

client_instance = ConfidentialClientApplication(
client_id=self.settings.client_id,
client_credential=self.settings.client_secret.get_secret_value(),
authority=self.authority_url,
)

authorization_request_url = client_instance.get_authorization_request_url(
self._scopes
)
print("Visit the following url to give consent:")
print(authorization_request_url)
authorization_url = input("Paste the authenticated url here:\n")

authorization_code = authorization_url.split("code=")[1].split("&")[0]
access_token_json = client_instance.acquire_token_by_authorization_code(
code=authorization_code, scopes=self._scopes
)
self.access_token = access_token_json["access_token"]

try:
if not self.token_path.parent.exists():
self.token_path.parent.mkdir(parents=True)
except Exception as e:
raise Exception(
f"Could not create the folder {self.token_path.parent} "
+ "to store the access token."
) from e

with self.token_path.open("w") as token_file:
token_file.write(self.access_token)

@property
def _url(self) -> str:
"""Create URL for getting page ids from the OneNoteApi API."""
query_params_list = []
filter_list = []
expand_list = []

query_params_list.append("$select=id")
if self.notebook_name is not None:
filter_list.append(
"parentNotebook/displayName%20eq%20"
+ f"'{self.notebook_name.replace(' ', '%20')}'"
)
expand_list.append("parentNotebook")
if self.section_name is not None:
filter_list.append(
"parentSection/displayName%20eq%20"
+ f"'{self.section_name.replace(' ', '%20')}'"
)
expand_list.append("parentSection")
if self.page_title is not None:
filter_list.append(
"title%20eq%20" + f"'{self.page_title.replace(' ', '%20')}'"
)

if len(expand_list) > 0:
query_params_list.append("$expand=" + ",".join(expand_list))
if len(filter_list) > 0:
query_params_list.append("$filter=" + "%20and%20".join(filter_list))

query_params = "&".join(query_params_list)
if query_params != "":
query_params = "?" + query_params
return f"{self.onenote_api_base_url}/pages{query_params}"
10 changes: 5 additions & 5 deletions libs/langchain/poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions libs/langchain/pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -143,6 +143,7 @@ azure-ai-textanalytics = {version = "^5.3.0", optional = true}
google-cloud-documentai = {version = "^2.20.1", optional = true}
fireworks-ai = {version = "^0.6.0", optional = true, python = ">=3.9,<4.0"}
javelin-sdk = {version = "^0.1.8", optional = true}
msal = {version = "^1.25.0", optional = true}


[tool.poetry.group.test.dependencies]
Expand Down Expand Up @@ -341,6 +342,7 @@ extended_testing = [
"atlassian-python-api",
"mwparserfromhell",
"mwxml",
"msal",
"pandas",
"telethon",
"psychicapi",
Expand Down

0 comments on commit 1983a39

Please sign in to comment.