# Microsoft PowerPoint

## Overview

[Microsoft PowerPoint](https://en.wikipedia.org/wiki/Microsoft_PowerPoint) is a presentation program developed by Microsoft.

This tutorial demonstrates two approaches to processing PowerPoint documents for downstream use:
1. Using ```Unstructured``` to load and parse PowerPoint files into document elements
2. Using ```MarkItDown``` to convert PowerPoint files into Markdown and then into LangChain Document objects

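Both methods extract the text of a presentation, but with different strengths: ```Unstructured``` returns typed elements with rich metadata, while ```MarkItDown``` produces per-slide Markdown that keeps headings and tables.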

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Converting PPTX to LangChain Documents Using Unstructured](#converting-pptx-to-langchain-documents-using-unstructured)
- [Converting PPTX to LangChain Documents Using MarkItDown](#converting-pptx-to-langchain-documents-using-markitdown)

### References

- [Unstructured: official documentation](https://docs.unstructured.io/open-source/core-functionality/overview)
- [MarkItDown: GitHub Repository](https://github.com/microsoft/markitdown)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- ```langchain-opentutorial``` is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
- See [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial 

In [1]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain-community",
        "langchain-core",
        "unstructured",
        "markitdown"
    ],
    verbose=False,
    upgrade=False,
)

## Converting PPTX to LangChain Documents Using Unstructured

```Unstructured``` is a robust document processing library that excels at converting various document formats into clean, structured text. <br/>It is well integrated with LangChain's ecosystem and provides reliable document parsing capabilities. 

The library includes:

- Local processing with open-source package
- Remote processing via Unstructured API
- Comprehensive document format support
- Built-in OCR capabilities

In [2]:
from langchain_community.document_loaders import UnstructuredPowerPointLoader

# Initialize UnstructuredPowerPointLoader
loader = UnstructuredPowerPointLoader("data/07-ppt-loader-sample.pptx")

# Load PowerPoint document
docs = loader.load()

# Print number of loaded documents
print(len(docs))

1


In [3]:
print(docs[0].page_content[:100])

Natural Language Processing with Deep Learning

CS224N/Ling284

Christopher Manning

Lecture 2: Word


```Unstructured``` generates various "elements" for different **chunks** of text.

By default, they are combined and returned as a single document, but elements can be easily separated by specifying ```mode="elements"```.

In [4]:
# Create UnstructuredPowerPointLoader with elements mode
loader = UnstructuredPowerPointLoader("data/07-ppt-loader-sample.pptx", mode="elements")

# Load PowerPoint elements
docs = loader.load()

# Print number of elements extracted
print(len(docs))

498


In [5]:
docs[0]

Document(metadata={'source': 'data/07-ppt-loader-sample.pptx', 'category_depth': 0, 'file_directory': 'data', 'filename': '07-ppt-loader-sample.pptx', 'last_modified': '2025-01-16T21:42:19', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'category': 'Title', 'element_id': 'bb6cdc142e5062b564541bfbc10f7f8c'}, page_content='Natural Language Processing with Deep Learning')

In [6]:
# Get and display the first element
first_element = docs[0]
print(first_element)

# To see its metadata and content separately, you could do:
print("Content:", first_element.page_content)
print("Metadata:", first_element.metadata)

page_content='Natural Language Processing with Deep Learning' metadata={'source': 'data/07-ppt-loader-sample.pptx', 'category_depth': 0, 'file_directory': 'data', 'filename': '07-ppt-loader-sample.pptx', 'last_modified': '2025-01-16T21:42:19', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'category': 'Title', 'element_id': 'bb6cdc142e5062b564541bfbc10f7f8c'}
Content: Natural Language Processing with Deep Learning
Metadata: {'source': 'data/07-ppt-loader-sample.pptx', 'category_depth': 0, 'file_directory': 'data', 'filename': '07-ppt-loader-sample.pptx', 'last_modified': '2025-01-16T21:42:19', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'category': 'Title', 'element_id': 'bb6cdc142e5062b564541bfbc10f7f8c'}


In [7]:
# Print elements with formatted output and enumerate for easy reference
for idx, doc in enumerate(docs[:3], 1):
    print(f"\nElement {idx}/{len(docs)}")
    print(f"Category: {doc.metadata['category']}")
    print("="*50)
    print(f"Content:\n{doc.page_content.strip()}")
    print("="*50)


Element 1/498
Category: Title
Content:
Natural Language Processing with Deep Learning

Element 2/498
Category: Title
Content:
CS224N/Ling284

Element 3/498
Category: Title
Content:
Christopher Manning
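
Because each element carries ```page_number``` and ```category``` metadata, the elements can be regrouped per slide or summarized by type. The following is a minimal sketch rather than part of the loader API: ```group_by_slide``` and ```count_categories``` are hypothetical helpers, and ```SimpleNamespace``` objects stand in for the loaded elements so the snippet runs without the sample file.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace


def group_by_slide(elements):
    """Group element texts by their `page_number` metadata (hypothetical helper)."""
    slides = defaultdict(list)
    for element in elements:
        # Elements without a page number are collected under the key None
        page = element.metadata.get("page_number")
        slides[page].append(element.page_content)
    return dict(slides)


def count_categories(elements):
    """Count how many elements fall into each `category` (hypothetical helper)."""
    return Counter(
        element.metadata.get("category", "Unknown") for element in elements
    )


# Stand-ins for the first three elements shown above; with the real loader,
# pass the `docs` list returned by `loader.load()` instead.
sample_elements = [
    SimpleNamespace(
        page_content="Natural Language Processing with Deep Learning",
        metadata={"page_number": 1, "category": "Title"},
    ),
    SimpleNamespace(
        page_content="CS224N/Ling284",
        metadata={"page_number": 1, "category": "Title"},
    ),
    SimpleNamespace(
        page_content="Christopher Manning",
        metadata={"page_number": 1, "category": "Title"},
    ),
]

slides = group_by_slide(sample_elements)
categories = count_categories(sample_elements)
print(slides[1])
print(categories)
```

Both helpers only read ```page_content``` and ```metadata```, so they should work unchanged on the ```Document``` objects produced in ```mode="elements"```.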


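## Converting PPTX to LangChain Documents Using MarkItDown

[```MarkItDown```](https://github.com/microsoft/markitdown "Visit the GitHub page") is an open-source library from Microsoft that converts unstructured documents into structured Markdown, a format that LLMs can easily process and understand. This makes it particularly valuable for RAG (Retrieval-Augmented Generation) systems, where it enables a clean, semantic text representation.

```MarkItDown``` supports PDF, PowerPoint, Word, Excel, images (with EXIF/OCR), audio (with transcription), HTML, and more. It preserves semantic structure and handles complex data such as tables accurately, which supports high retrieval quality and improves an LLM's ability to extract insights from diverse content types.

> ⚠️ **Note**: MarkItDown does not interpret the content of images embedded in PowerPoint files. Instead, it extracts the images as they are, so their semantic meaning is not available to LLMs.

For example, an object in a slide is processed like this: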

```![object #](object#.jpg)```
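
If these placeholders add noise to downstream text, they can be listed or removed with a regular expression. The following is a minimal sketch, assuming the placeholders always use standard Markdown image syntax; ```find_image_placeholders``` and ```strip_image_placeholders``` are hypothetical helpers, not part of MarkItDown.

```python
import re

# Matches Markdown image syntax such as ![object 2](object2.jpg)
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)]*)\)")


def find_image_placeholders(markdown_text):
    """Return (alt_text, target) pairs for every image reference."""
    return IMAGE_PATTERN.findall(markdown_text)


def strip_image_placeholders(markdown_text):
    """Remove image references, along with the newline that follows each one."""
    return re.sub(IMAGE_PATTERN.pattern + r"\n?", "", markdown_text)


# Sample text taken from the first slide of the converted presentation
sample = "![object 2](object2.jpg)\n# Natural Language Processing with Deep Learning"
images = find_image_placeholders(sample)
cleaned = strip_image_placeholders(sample)
print(images)
print(cleaned)
```

Keeping the list of references before stripping them makes it possible to record, for example, how many images each slide contained.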

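**Details:**

**Core strengths of MarkItDown:**

1. **Semantic preservation**
   - Maintains the logical structure and hierarchy of the document
   - Keeps formatting elements such as headings, paragraphs, and lists
   - Produces output that is friendly to LLMs

2. **Multi-format support**
   - **Office documents**: full support for PowerPoint, Word, and Excel
   - **PDF processing**: text extraction and structured representation
   - **Image processing**: OCR and EXIF data extraction
   - **Multimedia**: audio transcription and HTML content parsing

3. **RAG system optimization**
   - **High-quality retrieval**: structured Markdown improves semantic search precision
   - **Context retention**: maintains the original logical flow of the document
   - **Chunking-friendly**: well suited to document splitting and vectorization

**Special considerations for PowerPoint:**
- **Text content**: slide titles, body text, and notes are extracted in full
- **Tabular data**: table structure is kept as Markdown tables
- **Image limitation**: images are kept only as file references; their content is not parsed
- **Layout information**: the hierarchical structure of the slides is preserved

**Use within the LangChain ecosystem:**
- Integrates cleanly with `Document` objects
- Supports chunking strategies and metadata preservation
- Simplifies building vector databases and knowledge retrieval systems

This conversion capability is especially important for enterprise RAG applications, which must handle the many document formats used inside an organization and consolidate them into a unified knowledge base.

Installation is straightforward: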
In [8]:
# %pip install markitdown

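### Extracting Text from PPTX Using MarkItDown
In this section, we use ```MarkItDown``` to:
* Convert PowerPoint slides to Markdown
* Preserve semantic structure and visual formatting
* Maintain slide numbers and titles
* Generate clean, readable text output

First, we initialize ```MarkItDown``` and run the ```convert``` function to load a local ```.pptx``` file.

**Process overview:**

**Initialization steps:**
1. **Import MarkItDown**: import the required class from the library
2. **Create a converter instance**: initialize a MarkItDown object
3. **Load the PPTX file**: specify the path to the local PowerPoint file

**Conversion process:**
- **Slide parsing**: the content of each slide is analyzed in turn
- **Structure recognition**: titles, body text, lists, tables, and other elements are identified
- **Format preservation**: the original formatting is converted to the corresponding Markdown syntax
- **Numbering**: slide order and slide numbers are kept

**Output characteristics:**
- **Hierarchy**: Markdown heading levels reflect the slide structure
- **Readability**: the format suits both human readers and LLM processing
- **Semantic completeness**: the logical relationships and key information of the original content are retained

**Practical use cases:**
- **Knowledge management**: turn training slides into searchable documents
- **Content analysis**: prepare structured input for RAG systems
- **Document consolidation**: unify enterprise content stored in different formats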

In [9]:
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data/07-ppt-loader-sample.pptx")
result_text = result.text_content
print(result_text[:500])

<!-- Slide number: 1 -->

![object 2](object2.jpg)
# Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 2: Word Vectors, Word Senses, and Neural Classifiers

<!-- Slide number: 2 -->
# Lecture Plan
10
Lecture 2: Word Vectors, Word Senses, and Neural Network Classifiers
Course organization (3 mins)
Optimization basics (5 mins)
Review of word2vec and looking at word vectors (12 mins)
More on word2vec (8 mins)
Can we capture the essence of word meaning more ef


### Convert Markdown Format to LangChain Document Format

The code below splits the Markdown text at the slide markers and processes each slide separately. <br/>Each slide is converted into a LangChain Document object with metadata including the ```slide number``` and ```title```. 

In [10]:
from langchain_core.documents import Document
import re

# Initialize document processing for PowerPoint slides
# Format: <!-- Slide number: X --> where X is the slide number

# Split the input text into individual slides using HTML comment markers
slides = re.split(r'<!--\s*Slide number:\s*(\d+)\s*-->', result_text)

# Initialize list to store Document objects
documents = []

# Process each slide:
# - Start from index 1 since slides[0] is empty from the initial split
# - Step by 2 because the split result alternates between:
#   1. slide number (odd indices)
#   2. slide content (even indices)
# Example: ['', '1', 'content1', '2', 'content2', '3', 'content3']
for i in range(1, len(slides), 2):
    # Extract slide number and content
    slide_number = slides[i]
    content = slides[i + 1].strip() if i + 1 < len(slides) else ""
    
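    # Extract slide title from the first Markdown header line, if present
    # (anchored to the start of a line so a stray '#' inside text is ignored)
    title_match = re.search(r'^#+\s+(.+?)\s*$', content, flags=re.MULTILINE)
    title = title_match.group(1).strip() if title_match else ""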
    
    # Create Document object with slide metadata
    doc = Document(
        page_content=content,
        metadata={
            "source": "data/07-ppt-loader-sample.pptx",
            "slide_number": int(slide_number),
            "slide_title": title
        }
    )
    documents.append(doc)

documents[:2]

[Document(metadata={'source': 'data/07-ppt-loader-sample.pptx', 'slide_number': 1, 'slide_title': 'Natural Language Processing with Deep Learning'}, page_content='![object 2](object2.jpg)\n# Natural Language Processing with Deep Learning\nCS224N/Ling284\nChristopher Manning\nLecture 2: Word Vectors, Word Senses, and Neural Classifiers'),
 Document(metadata={'source': 'data/07-ppt-loader-sample.pptx', 'slide_number': 2, 'slide_title': 'Lecture Plan'}, page_content='# Lecture Plan\n10\nLecture 2: Word Vectors, Word Senses, and Neural Network Classifiers\nCourse organization (3 mins)\nOptimization basics (5 mins)\nReview of word2vec and looking at word vectors (12 mins)\nMore on word2vec (8 mins)\nCan we capture the essence of word meaning more effectively by counting? (12m)\nEvaluating word vectors (10 mins)\nWord senses (10 mins)\nReview of classification and how neural nets differ (10 mins)\nIntroducing neural networks (10 mins)\n\nKey Goal: To be able to read and understand word embed
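
To see why the loop above starts at index 1 and steps by 2, it can help to run the same split on a short string. The following is a minimal sketch using only the standard library; ```split_slides``` is a hypothetical helper, and the sample text is a shortened version of the output shown earlier.

```python
import re

# Same marker format as above: <!-- Slide number: X -->
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_slides(markdown_text):
    """Return a list of (slide_number, content) tuples (hypothetical helper)."""
    parts = SLIDE_MARKER.split(markdown_text)
    # parts[0] is whatever precedes the first marker; the rest alternates
    # between captured slide numbers and slide bodies.
    numbers = parts[1::2]
    bodies = parts[2::2]
    return [(int(number), body.strip()) for number, body in zip(numbers, bodies)]


sample = (
    "<!-- Slide number: 1 -->\n"
    "# Natural Language Processing with Deep Learning\n"
    "CS224N/Ling284\n"
    "\n"
    "<!-- Slide number: 2 -->\n"
    "# Lecture Plan\n"
)
parts = SLIDE_MARKER.split(sample)
slides = split_slides(sample)
print(parts)
print(slides)
```

Because the pattern contains a capturing group, ```re.split``` keeps the captured slide numbers in the result, which is what produces the alternating number/content layout.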

```MarkItDown``` efficiently handles tables in PowerPoint slides by converting them into clean Markdown table syntax. 

This makes tabular data easily accessible for LLMs while preserving the original structure and formatting.

In [11]:
print(documents[15].page_content)

# Example: Window based co-occurrence matrix
10
Window length 1 (more common: 5–10)
Symmetric (irrelevant whether left or right context)

Example corpus:
I like deep learning
I like NLP
I enjoy flying

| counts | I | like | enjoy | deep | learning | NLP | flying | . |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| like | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| . | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
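
Since the table arrives as plain Markdown, it can also be parsed back into rows when structured access is needed. The following is a minimal sketch, assuming a simple pipe-delimited table with one header row, one separator row, and no escaped pipes inside cells; ```parse_markdown_table``` is a hypothetical helper, and the sample is a shortened excerpt of the table above.

```python
def parse_markdown_table(table_text):
    """Parse a simple pipe-delimited Markdown table into a list of row dicts."""
    lines = [line.strip() for line in table_text.strip().splitlines() if line.strip()]
    # Trim the outer pipes, then split each line into cells
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    # rows[0] is the header and rows[1] is the --- separator line
    header, body = rows[0], rows[2:]
    return [dict(zip(header, row)) for row in body]


# Shortened excerpt: first four columns and first two data rows of the table
sample_table = """
| counts | I | like | enjoy |
| --- | --- | --- | --- |
| I | 0 | 2 | 1 |
| like | 2 | 0 | 0 |
"""

records = parse_markdown_table(sample_table)
for record in records:
    print(record)
```

All cell values are returned as strings; converting the counts to integers is left to the caller.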
