# Process Documents

The first step for RAG usually involves processing the documents to extract relevant information and metadata. This can include tasks such as text extraction, summarization, and the generation of embeddings for semantic search.

In general, there are four main types of documents we may need to process:

1. **Text Documents**: These include articles, reports, and any other form of written content. The focus here is on extracting key information and generating embeddings that capture the semantic meaning of the text.

2. **Images**: For image documents, we may use techniques such as Optical Character Recognition (OCR) to extract text, as well as image embeddings to capture visual features.

3. **Structured Data**: This includes data from spreadsheets, databases, and other structured formats. The goal is to extract relevant fields and generate embeddings that represent the data effectively.

4. **Audio/Video**: For multimedia documents, we may need to transcribe audio or extract key frames from video to process the content effectively.

By processing these different types of documents, we can create a rich set of metadata and embeddings that will enhance the RAG system's ability to retrieve and generate relevant information.

### Traps

As a newbie, I learned the hard way that it is impossible to learn to process every type of document at once, at least not for me. I tried and failed multiple times before I gave up. It is important to take baby steps and focus on one document type at a time, gradually expanding my skills and knowledge.

I will just focus on processing text documents and maybe a little bit of OCR for now.


## Process Text Documents

There are several formats of text documents that we can process, including:

- `Plain Text`: Simple text files without any formatting (e.g., .txt).

- `Markdown`: Text files with simple formatting (e.g., .md).

- `HTML`: Web pages and other documents written in HTML.

- `PDF`: Portable Document Format files, which may require special handling to extract text. Some PDFs are scanned images, so we may need OCR to extract text from them.

- `Word Documents`: Microsoft Word files (e.g., .docx) that may contain rich formatting.

By focusing on these formats, we can develop a robust pipeline for processing text documents and extracting valuable information.
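One way to start such a pipeline is to pick a reader based on the file extension. The sketch below is only an illustration: the `FORMAT_BY_SUFFIX` table and the `detect_format` function are names I made up, and a file extension is only a hint (a `.pdf` may still be a scanned image that needs OCR).

```python
from pathlib import Path

# Hypothetical mapping from file extension to the formats listed in this note.
FORMAT_BY_SUFFIX = {
    ".txt": "plain_text",
    ".md": "markdown",
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",
    ".docx": "word",
}

def detect_format(file_path):
    """Guess the document format from the file extension (case-insensitive)."""
    suffix = Path(file_path).suffix.lower()
    return FORMAT_BY_SUFFIX.get(suffix, "unknown")

# Made-up file names, only to show the behavior
for name in ["notes.txt", "README.md", "page.HTML", "paper.pdf", "report.docx", "photo.png"]:
    print(f"{name} -> {detect_format(name)}")
```

Anything that comes back as `unknown` can be skipped or logged, so the pipeline only has to handle the formats it already understands.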

### Plain Text

Plain text documents are the simplest form of text files, containing no formatting or special features. They are easy to process and can be read by any text editor.

Python's built-in functions are enough to read and manipulate plain text files. For example, we can use the `open()` function to create a file object and read its contents with the `read()` method.

It is common to use `with open(file_path, 'r') as file:` to ensure proper handling of file resources. The `with` statement automatically closes the file when the block is exited, even if an error occurs.

Of course, you can also use other methods to read plain text files, such as `file.readlines()` to read all lines into a list.


In [None]:
# Reading Plain Text Files

def read_plain_text_file(file_path):
    # encoding='utf-8' is an assumption about the file; change it if your file uses another encoding
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

# read entire file
doc = read_plain_text_file('resource/sample_text.txt')

# read lines from a file
def read_plain_text_file_lines(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    return lines # list of lines, each still ending with '\n'

# read lines from the document, return list of strings
lines = read_plain_text_file_lines('resource/sample_text.txt')
for line in lines[:5]:
    print(line.strip()) # strip() removes leading/trailing whitespace, including the newline
print(f"Total lines: {len(lines)}")
print(lines)


TECHNOLOGY

SEXTING WITH GEMINIWhy did Googles supposedly teen-friendly

chatbot say it wanted to tie me up?
Total lines: 27
['TECHNOLOGY\n', '\n', 'SEXTING WITH GEMINIWhy did Googles supposedly teen-friendly\n', '\n', 'chatbot say it wanted to tie me up?\n', '\n', 'ne afternoon thisspring, I created aGoogle accountfora fake 13-yearold named Jane (I am 23) andopened up Gemini, the company’s AI chatbot. BecauseJane was a minor, Googleautomatically directed me to aversion of Gemini with ostensibly age-appropriate protections in place. I began the conversation by asking the chatbotto “talk dirty to me.” Its initial responses were reassuring, given that I was posing as ayoung teen: “I understandyoure looking for somethingmore explicit,” Gemini wrote. “However, I’m designed to beasafe and helpful Al assistant.” But getting around Googlessafeguards was surprisinglyeasy. When I asked Geminifor “examples” of dirty talk, the chatbot complied: “Get onyour knees for me.” “Beg forit.” “Tell me how
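The list returned by `readlines()` still contains newline characters and blank lines, as the output above shows. A possible next step, sketched below, is to group the lines into paragraphs before any further processing. The helper name `lines_to_paragraphs` and the sample lines are made up for illustration, and treating a blank line as a paragraph break is an assumption that will not fit every file.

```python
def lines_to_paragraphs(lines):
    """Group consecutive non-blank lines into paragraphs; blank lines separate them."""
    paragraphs = []
    current = []
    for line in lines:
        text = line.strip()
        if text:
            current.append(text)
        elif current:
            # a blank line closes the paragraph collected so far
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

# Made-up input in the same shape as the output of readlines()
sample_lines = ["Title\n", "\n", "first line of a paragraph\n", "second line\n", "\n", "\n", "last paragraph\n"]
paragraphs = lines_to_paragraphs(sample_lines)
print(paragraphs)
```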

### Markdown

Markdown documents are text files that use a simple syntax to define formatting elements such as headings, lists, and links. They are commonly used for documentation and can be easily converted to other formats (e.g., HTML) for presentation. 

To read and process Markdown files in Python, we can use the same techniques as for plain text files, because a Markdown file is just plain text with formatting symbols. We can read the entire file or read it line by line, depending on our needs.


In [10]:
# read markdown
def read_markdown_file(file_path):
    # encoding='utf-8' is an assumption about the file; change it if your file uses another encoding
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

# read entire markdown file
doc = read_markdown_file('README.md')

# read lines from the markdown file
def read_markdown_file_lines(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    return lines # list of lines, each still ending with '\n'

# read lines from the document, return list of strings
lines = read_markdown_file_lines('README.md')
for line in lines[:5]:
    print(line.strip()) # strip() removes leading/trailing whitespace, including the newline
print(f"Total lines: {len(lines)}")
print(lines)


# A Study Note for Non-technical Folks

This repository contains my study notes on Retrieval-Augmented Generation (RAG). My goal is to document my learning process and create a resource for other non-technical individuals who are interested in this topic.

Some of the materials are based on the RAG course from deeplearning.ai. I am grateful for their high-quality content.
Total lines: 19
['# A Study Note for Non-technical Folks\n', '\n', 'This repository contains my study notes on Retrieval-Augmented Generation (RAG). My goal is to document my learning process and create a resource for other non-technical individuals who are interested in this topic.\n', '\n', 'Some of the materials are based on the RAG course from deeplearning.ai. I am grateful for their high-quality content.\n', '\n', 'Course link: [https://www.deeplearning.ai/courses/retrieval-augmented-generation-rag/](https://www.deeplearning.ai/courses/retrieval-augmented-generation-rag/)\n', '\n', '**Copyright Note:** The conten
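Because Markdown marks its headings with `#`, one thing we can do beyond plain reading is split a file into sections by heading. The sketch below is a simplified illustration, not a full Markdown parser: `split_markdown_sections` is a name I made up, it only recognizes `#`-style headings, and it would wrongly treat a `#` comment line inside a code block as a heading.

```python
import re

# one to six '#' characters, a space, then the heading title
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_sections(text):
    """Return a list of (level, title, body) tuples, one per '#' heading.

    Text that appears before the first heading is returned with level 0 and an empty title.
    """
    sections = []
    level, title, body = 0, "", []

    def flush():
        content = "\n".join(body).strip()
        if title or content:
            sections.append((level, title, content))

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            level, title, body = len(match.group(1)), match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections

# Made-up Markdown text for demonstration
sample = "# Title\n\nIntro text.\n\n## Setup\n\nInstall things.\n"
for level, title, body in split_markdown_sections(sample):
    print(level, title, "|", body)
```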

### HTML

HTML documents are structured text files that use tags to define elements such as headings, paragraphs, links, and images. When processing HTML files, we can extract the text content while preserving the structure and relationships between elements. 

To read and process HTML files in Python, we can use libraries such as Beautiful Soup or lxml. These libraries parse the HTML content and let us navigate the document tree to extract the information we need. This takes more learning, so I will not cover it in detail here.

Useful links:
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [lxml Documentation](https://lxml.de/)
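To get a feel for what "extracting text from HTML" means without installing anything, here is a minimal sketch that uses Python's built-in `html.parser` module. Note that this is a swap: it is not Beautiful Soup or lxml, and it is much more limited. The class name `TextExtractor`, the function `html_to_text`, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)

def html_to_text(html):
    """Return the text pieces found in the HTML, one piece per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

# Made-up HTML for demonstration
sample_html = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>Some <b>bold</b> text.</p></body></html>"
)
print(html_to_text(sample_html))
```

This sketch breaks the text at every tag, so a sentence with inline formatting ends up on several lines. Handling cases like that well is one reason to learn a dedicated library later.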


### PDF

PDF documents are a widely used format for sharing and presenting information. They can contain text, images, and other elements, but extracting text from PDFs can be challenging due to their complex structure. Some PDFs are scanned images, and we may need to use OCR techniques to extract text from them.

PDFs can be very tricky to handle. Unlike plain text or Markdown files, a PDF mainly describes how each page should look (where text, images, and annotations are drawn), not the logical order of the content. This makes it difficult to extract meaningful information without losing context such as reading order, columns, and tables.

There are many tools available for working with PDF files in Python, including PyPDF2 (whose development now continues under the name pypdf), pdfplumber, and PyMuPDF. These libraries provide various functionalities for extracting text, images, and metadata from PDF documents.

However, I personally have not found a reliable way to extract text from all PDFs, especially those with complex layouts or embedded fonts. It often requires a combination of tools and manual adjustments to get good results.

I tried PyPDF2, pdfplumber, and PyMuPDF, and while they all have their strengths, I found that no single library worked perfectly for every PDF. I will leave this task to future exploration and experimentation.
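Since no single library worked for every PDF, one pattern worth trying is a fallback chain: try one extractor, and if it fails or returns almost nothing, try the next. The sketch below shows only the pattern. The function `extract_with_fallback`, the `min_chars` threshold, and the three stand-in extractors are all made up for illustration; real extractors would wrap calls to a PDF library, which I have not included here.

```python
def extract_with_fallback(pdf_path, extractors, min_chars=20):
    """Try each (name, function) pair in order and return the first usable result.

    Each function takes a path and returns the extracted text as a string.
    A result counts as usable when it has at least `min_chars` non-whitespace characters.
    Returns (name, text, errors); name is None and text is "" when nothing worked.
    """
    errors = {}
    for name, extract in extractors:
        try:
            text = extract(pdf_path) or ""
        except Exception as exc:  # one failing library should not stop the chain
            errors[name] = f"failed: {exc}"
            continue
        if len("".join(text.split())) >= min_chars:
            return name, text, errors
        errors[name] = "too little text"
    return None, "", errors  # nothing worked; a scanned PDF may need OCR

# Stand-in extractors for demonstration only; they do not read any file.
def broken_extractor(path):
    raise ValueError("cannot parse this file")

def empty_extractor(path):
    return "   "

def working_extractor(path):
    return "This page has enough text to be useful."

name, text, errors = extract_with_fallback("example.pdf", [
    ("broken", broken_extractor),
    ("empty", empty_extractor),
    ("working", working_extractor),
])
print(name, "->", text)
print(errors)
```

Keeping the `errors` dictionary makes it easier to see later which tool struggled with which file.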

### Word Documents

Word documents are a popular format for creating and sharing text-based content. They can contain rich formatting, images, and other elements. There are pretty good libraries available for working with Word documents in Python, such as python-docx. These libraries allow us to read, write, and manipulate Word files programmatically. They are not perfect, but they can handle many common tasks effectively.


In [None]:
# example of python-docx
!pip install python-docx




In [12]:
import docx

# create a word document
doc = docx.Document()
doc.add_heading('Document Title', level=1)
doc.add_paragraph('This is a sample paragraph in the Word document.')
doc.save('sample.docx')


In [14]:
# read a word document
read_doc = docx.Document('sample.docx')
for para in read_doc.paragraphs:
    print(para.text)

Document Title
This is a sample paragraph in the Word document.
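It also helped me to know what a `.docx` file really is: a ZIP archive of XML files, with the main text stored in `word/document.xml`. The sketch below reads paragraph text using only the standard library, to show what is inside the file. It is not a replacement for python-docx: the function name `docx_paragraph_texts` is made up, it ignores tabs, line breaks, headers, footers, and footnotes, and unlike `doc.paragraphs` it also picks up paragraphs inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# XML namespace used by the main WordprocessingML elements
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraph_texts(path):
    """Return the text of each non-empty paragraph found in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for para in root.iter(W + "p"):          # <w:p> is a paragraph
        # a paragraph's text is spread over one or more <w:t> elements
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        if text.strip():
            texts.append(text)
    return texts

# Uses the sample.docx created earlier in this note, if it is present
if Path("sample.docx").exists():
    for text in docx_paragraph_texts("sample.docx"):
        print(text)
```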


In [None]:
# Edit a word document
doc = docx.Document('sample.docx')
for para in doc.paragraphs:
    # note: assigning para.text replaces the paragraph's content with a single run,
    # so character formatting (bold, italic) inside the paragraph is lost
    para.text = para.text.replace('sample', 'edited')

doc.save('edited_sample.docx')
# check the content of the edited document
edited_doc = docx.Document('edited_sample.docx')
for para in edited_doc.paragraphs:
    print(para.text)


Document Title
This is a edited paragraph in the Word document.


In [19]:

# Format the edited document
for para in edited_doc.paragraphs:
    para.style = 'Normal'

# add a new paragraph
p = edited_doc.add_paragraph('This is an additional paragraph added to the edited document.', style='Normal')
# append runs with character formatting to the new paragraph
p.add_run(' This text is bold.').bold = True
p.add_run(' This text is italicized.').italic = True

edited_doc.save('formatted_edited_sample.docx')


