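# Basic PDF functionalities

This notebook demonstrates a set of helper functions for PDF processing: extracting the words, blocks and annotations of a page, and plotting single pages or groups of pages.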

In [None]:
%load_ext autoreload
%autoreload 2

import fitz
from obsidianizer.pdf_tools.annotations import extract_annotation
from obsidianizer.pdf_tools.pages import extract_page_annotations, get_blocks_summary, get_words_data_frame
from obsidianizer.pdf_tools.plots import get_rectangles_from_data_frame
from IPython.display import display

import plotly.graph_objects as go

from obsidianizer.pdf_tools.page_plots import get_page_figure_widget, get_book_figure_widget
from obsidianizer import EXAMPLE_ECCE_HOMMO_PDF_PATH

## 1. Loading the PDF document

In [None]:
doc = fitz.open(EXAMPLE_ECCE_HOMMO_PDF_PATH)

## 2. Page functionalities

Functions that operate on a single page. First we select a page by index. PyMuPDF page indices are zero-based, so `doc[231]` is the 232nd page of the document.

In [None]:
page = doc[231]

### Get individual words in a dataframe

For each word we also get its rectangle coordinates and the block, line and word numbers it belongs to.

In [None]:
df_words = get_words_data_frame(page)
df_words
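For orientation, PyMuPDF's `page.get_text("words")` returns one tuple per word in the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The sketch below turns such tuples into row dictionaries, which is roughly the information the data frame above carries. The column names, the helper `words_to_records` and the sample tuples are assumptions made for illustration; the actual columns returned by `get_words_data_frame` may differ.

```python
# Column names are illustrative assumptions, not necessarily those used by obsidianizer.
WORD_COLUMNS = ("x0", "y0", "x1", "y1", "text", "block_no", "line_no", "word_no")


def words_to_records(words):
    """Convert PyMuPDF-style word tuples into a list of row dictionaries."""
    return [dict(zip(WORD_COLUMNS, word)) for word in words]


# Hypothetical sample in the layout returned by page.get_text("words").
sample_words = [
    (72.0, 90.0, 100.0, 102.0, "Ecce", 0, 0, 0),
    (104.0, 90.0, 140.0, 102.0, "Homo", 0, 0, 1),
    (72.0, 120.0, 110.0, 132.0, "Why", 1, 0, 0),
]

records = words_to_records(sample_words)
print(records[0]["text"], records[0]["block_no"])
```

Passing `records` to `pandas.DataFrame` would give a table with one row per word.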

### Get summary statistics of the blocks

For each block, the summary contains:
- The words it contains.
- The rectangle that encloses the entire block: x0, x1, y0, y1.
- The number of lines it contains.
- The height and width of the block.

In [None]:
block_summary = get_blocks_summary(page)
block_summary
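To make the statistics listed above concrete, here is a minimal sketch that computes them from PyMuPDF-style word tuples `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. It is not the implementation of `get_blocks_summary`; the function `summarize_blocks`, the dictionary keys and the sample data are hypothetical.

```python
from collections import defaultdict


def summarize_blocks(words):
    """Group PyMuPDF-style word tuples by block and compute per-block statistics."""
    grouped = defaultdict(list)
    for word in words:
        grouped[word[5]].append(word)  # index 5 is the block number

    summary = {}
    for block_no, block_words in sorted(grouped.items()):
        # The enclosing rectangle is the union of all word rectangles in the block.
        x0 = min(w[0] for w in block_words)
        y0 = min(w[1] for w in block_words)
        x1 = max(w[2] for w in block_words)
        y1 = max(w[3] for w in block_words)
        summary[block_no] = {
            "text": " ".join(w[4] for w in block_words),
            "x0": x0, "y0": y0, "x1": x1, "y1": y1,
            "n_lines": len({w[6] for w in block_words}),  # distinct line numbers
            "width": x1 - x0,
            "height": y1 - y0,
        }
    return summary


# Hypothetical sample: block 0 spans two lines, block 1 has a single word.
sample_words = [
    (72.0, 90.0, 100.0, 102.0, "Ecce", 0, 0, 0),
    (104.0, 90.0, 140.0, 102.0, "Homo", 0, 0, 1),
    (72.0, 104.0, 130.0, 116.0, "Preface", 0, 1, 0),
    (72.0, 120.0, 110.0, 132.0, "Why", 1, 0, 0),
]

block_stats = summarize_blocks(sample_words)
print(block_stats[0])
```

The sketch assumes the word tuples arrive in reading order, so joining them with spaces reproduces the block text.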

### Get annotations in a page

Get the annotations within a page and the rectangle that surrounds each one. Each annotation comes with:
- highlighted_text: The original text in the PDF document that was highlighted.
- annotation_text: The text associated with the annotation.

In [None]:
annotations_df = extract_page_annotations(page)
annotations_df
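One common way to recover the highlighted text is to keep the words whose rectangles mostly fall inside the annotation rectangle. The sketch below illustrates that idea on plain tuples; whether `extract_page_annotations` works this way is an assumption, and the helpers `rect_overlap_ratio` and `highlighted_text`, the `min_ratio` threshold and the sample data are hypothetical.

```python
def rect_overlap_ratio(word_rect, annot_rect):
    """Fraction of the word rectangle's area that lies inside the annotation rectangle."""
    wx0, wy0, wx1, wy1 = word_rect
    ax0, ay0, ax1, ay1 = annot_rect
    overlap_w = max(0.0, min(wx1, ax1) - max(wx0, ax0))
    overlap_h = max(0.0, min(wy1, ay1) - max(wy0, ay0))
    word_area = (wx1 - wx0) * (wy1 - wy0)
    return (overlap_w * overlap_h) / word_area if word_area > 0 else 0.0


def highlighted_text(words, annot_rect, min_ratio=0.5):
    """Join the words that are covered by the annotation rectangle by at least min_ratio."""
    selected = [w[4] for w in words if rect_overlap_ratio(w[:4], annot_rect) >= min_ratio]
    return " ".join(selected)


# Hypothetical word tuples: (x0, y0, x1, y1, word, block_no, line_no, word_no).
sample_words = [
    (72.0, 90.0, 100.0, 102.0, "Ecce", 0, 0, 0),
    (104.0, 90.0, 140.0, 102.0, "Homo", 0, 0, 1),
    (72.0, 120.0, 110.0, 132.0, "Why", 1, 0, 0),
]
annot_rect = (70.0, 88.0, 142.0, 104.0)  # hypothetical highlight covering the first line

print(highlighted_text(sample_words, annot_rect))
```

Using an overlap ratio rather than strict containment tolerates highlights that are drawn slightly narrower than the text they mark.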

### Plot Page figure

The following plots the blocks, words and annotations of the page.

In [None]:
fig = get_page_figure_widget(page, width=600)

In [None]:
fig.show()
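As a rough idea of how rectangles can be drawn on such a figure, the sketch below builds Plotly layout shape dictionaries from word rectangles. It does not reproduce `get_page_figure_widget`; the helper `words_to_plotly_shapes`, the colour and the sample tuples are assumptions for illustration.

```python
def words_to_plotly_shapes(words, color="royalblue"):
    """Build Plotly layout shape dictionaries (type 'rect') from word tuples."""
    return [
        {
            "type": "rect",
            "x0": w[0], "y0": w[1], "x1": w[2], "y1": w[3],
            "line": {"color": color, "width": 1},
        }
        for w in words
    ]


# Hypothetical word tuples: (x0, y0, x1, y1, word, block_no, line_no, word_no).
sample_words = [
    (72.0, 90.0, 100.0, 102.0, "Ecce", 0, 0, 0),
    (104.0, 90.0, 140.0, 102.0, "Homo", 0, 0, 1),
]

shapes = words_to_plotly_shapes(sample_words)
print(shapes[0])
```

With Plotly available, the list could be applied with `fig.update_layout(shapes=shapes)`. PyMuPDF page coordinates start at the top-left corner with y growing downward, so `fig.update_yaxes(autorange="reversed")` keeps the page upright.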

## 3. Document functions

Functions that operate on a whole document.
A document is just a list of pages, but working with several pages at once opens up more possibilities, for example when something has to be inferred across page boundaries.

In [None]:
book = [doc[i] for i in range(230, 235)]
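The range above takes five pages, with zero-based indices 230 to 234. When the same cell is reused with a shorter PDF, a small guard avoids indexing past the last page. The helper `select_page_indices` and the page counts below are hypothetical.

```python
def select_page_indices(page_count, start, stop):
    """Return the zero-based page indices in [start, stop), clamped to the document."""
    start = max(0, start)
    stop = min(page_count, stop)
    return list(range(start, stop))


# Hypothetical page counts: one long enough for the full range, one too short.
full_range = select_page_indices(300, 230, 235)
clamped_range = select_page_indices(232, 230, 235)
print(full_range)
print(clamped_range)
```

With the open document this could be used as `book = [doc[i] for i in select_page_indices(len(doc), 230, 235)]`.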

### Plot the pages of the document

In [None]:
book_tabs = get_book_figure_widget(book, width=600)

In [None]:
display(book_tabs)