# Ecce Homo analysis

In this example we use Walter Kaufmann's edition of *Ecce Homo*.
The PDF is a scanned work, so the OCR output carries some uncertainty, which we work around with a few rules of thumb and basic statistics.


In [None]:
%load_ext autoreload
%autoreload 2

import fitz
from obsidianizer.pdf_tools.annotations import extract_annotation
from obsidianizer.pdf_tools.pages import extract_page_annotations, get_blocks_summary, get_words_data_frame
from obsidianizer.pdf_tools.plots import get_rectangles_from_data_frame
from obsidianizer.pdf_tools.documents import get_book_filtered_blocks, extract_book_annotations

from obsidianizer.obsidian.vault import load_vault, save_vault
from IPython.display import display

import plotly.graph_objects as go

from obsidianizer.pdf_tools.page_plots import get_page_figure_widget, get_book_figure_widget
from obsidianizer.pdf_tools.ecce_homo import is_ecce_hommo_chapter, is_ecce_hommo_subsection
from obsidianizer.obsidian.pdf_tools import get_vault_df_from_pdf
from obsidianizer.machine_learning.outliers import get_outlier_series,get_and_join_outlier_series,modify_predictor
import plotly.express as px

from obsidianizer import EXAMPLE_ECCE_HOMMO_PDF_PATH,EXAMPLE_ECCE_HOMMO_VAULT_PATH

In [None]:
filepath = EXAMPLE_ECCE_HOMMO_PDF_PATH

In [None]:
doc = fitz.open(filepath) 

## Load the book and select the pages of the Ecce Homo part

In [None]:
# Pages 224-333 (0-based) of the PDF hold the Ecce Homo part
book = [doc[i] for i in range(224, 334)]
# First 50 pages of that range
book_subset = book[:50]

# 1. Initial exploratory analysis

### Plot a few pages

In [None]:
pages_to_display = range(5)
book_tabs = get_book_figure_widget([book[i] for i in pages_to_display], width = 600)

In [None]:
display(book_tabs)

### Show the annotations and blocks of a page

In [None]:
page_number = 0
block_summary = get_blocks_summary(book[page_number])
block_summary

In [None]:
page_widget = get_page_figure_widget(book[page_number], width = 600)
page_widget

# 3. Guessing chapters and subsections

One of the main things we need is logic that tells us where chapters and subsections start. This matters later, when we organize the Obsidian notes accordingly.

## 3.1 Initial rule-of-thumb round of prediction

In this round we start with a rule of thumb that identifies chapters and subsections by looking at individual blocks. Later we use machine learning to refine these predictions.
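
To illustrate what such a rule of thumb can look like, here is a minimal sketch. It does not reproduce `is_ecce_hommo_chapter`: the `Block` record, its fields, the sample texts, and the thresholds are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block; the fields are illustrative, not the library's schema."""
    text: str
    y0: float  # top of the block on the page
    y1: float  # bottom of the block on the page


def looks_like_chapter_title(block: Block, max_words: int = 8, max_y1: float = 200.0) -> bool:
    """Rule of thumb: a short, all-uppercase block that ends near the top of the page."""
    words = block.text.split()
    letters = [c for c in block.text if c.isalpha()]
    return (
        0 < len(words) <= max_words
        and bool(letters)
        and all(c.isupper() for c in letters)
        and block.y1 <= max_y1
    )


# Made-up blocks: a title, a body paragraph, and an uppercase running footer.
blocks = [
    Block("WHY I AM SO WISE", y0=90.0, y1=110.0),
    Block("Some body paragraph text that runs over several lines.", y0=150.0, y1=400.0),
    Block("ECCE HOMO", y0=560.0, y1=575.0),  # uppercase, but too low on the page
]
chapter_candidates = [b for b in blocks if looks_like_chapter_title(b)]
```

A rule like this will usually let through a few false positives (for example, short uppercase lines produced by OCR noise), which is why a second filtering round is useful.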

### Get initial chapter blocks

In [None]:
chapter_blocks = get_book_filtered_blocks(book, is_ecce_hommo_chapter)
chapter_blocks

### Get initial subsection blocks

In [None]:
subsection_blocks = get_book_filtered_blocks(book, is_ecce_hommo_subsection)
subsection_blocks

## 3.2 Use machine learning to improve the individual block predictions


Once we have a reasonably good set of candidate blocks, we can filter out the outliers and rerun the search with the new configuration.
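
The idea of filtering outliers can be sketched on a single feature. The example below is not how `get_and_join_outlier_series` works (its implementation is not shown here); it swaps in a simple modified z-score based on the median absolute deviation, and the heights and the 3.5 threshold are assumptions chosen for illustration.

```python
from statistics import median


def mad_outlier_flags(values, threshold=3.5):
    """Flag values whose modified z-score (median absolute deviation based) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values sit exactly on the median: flag anything that differs.
        return [v != med for v in values]
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Hypothetical block heights: real titles are similar, one false positive is much taller.
heights = [20.0, 21.0, 19.5, 20.5, 20.0, 95.0]
flags = mad_outlier_flags(heights)
inliers = [h for h, is_outlier in zip(heights, flags) if not is_outlier]
```

Here only the last height is flagged. The range spanned by the remaining inliers could then serve as a tighter acceptance rule when the blocks are searched again.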


### 3.2.1 Filter out outlier chapters

Select the columns used for chapter outlier detection.

In [None]:
columns_od_chapter = ["y1", "block_no", "height"] # Columns to be used for prediction

In [None]:
chapter_predictor, chapter_blocks_outlier_dataset = get_and_join_outlier_series(chapter_blocks[columns_od_chapter])

#### Plot outliers found

In [None]:
fig = px.scatter_matrix(chapter_blocks_outlier_dataset,color="outliers")
fig.show()

#### Re-process the blocks with the learned model

In [None]:
is_valid_chapter_block = modify_predictor(chapter_predictor, columns_od_chapter)
chapter_blocks_machine_learning = get_book_filtered_blocks(book, is_valid_chapter_block)
chapter_blocks_machine_learning

### 3.2.2 Filter out outlier subsections

In [None]:
columns_od_subsection = ["x0","x1", "height"] # Columns to be used for prediction

In [None]:
subsection_predictor, subsection_blocks_outlier_dataset = get_and_join_outlier_series(subsection_blocks[columns_od_subsection])

#### Plot outliers found

In [None]:
fig = px.scatter_matrix(subsection_blocks_outlier_dataset,color="outliers")
fig.show()

#### Re-process the blocks with the learned model

In [None]:
is_valid_subsection_block = modify_predictor(subsection_predictor, columns_od_subsection)
subsection_blocks_machine_learning = get_book_filtered_blocks(book, is_valid_subsection_block)
subsection_blocks_machine_learning

# Create a vault with the quotes

Assuming the chapters and subsections have been identified successfully (possibly with some human help at the end), we can now generate the file structure.

## 1. Get the (chapter_blocks, subsections_blocks, annotations_blocks)

In [None]:
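# Keep the rule-of-thumb chapter blocks as they are
chapter_blocks = chapter_blocks
# Use the subsection blocks re-detected with the learned model
subsections_blocks = subsection_blocks_machine_learning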

In [None]:
annotations_blocks = extract_book_annotations(book)

In [None]:
annotations_blocks

## 2. Create folder structure

This step:
- Computes which chapter and subsection each annotation belongs to.
- For each (chapter, subsection):
    - Creates the corresponding path "chapter/subsection/"
    - Adds the quote to the file "chapter/subsection/subsection.md"
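
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the logic of `get_vault_df_from_pdf`: positions are modelled as hypothetical `(page, y)` tuples, the titles and quotes are made up, and the `"unsorted"` fallback name is invented for this example.

```python
from bisect import bisect_right
from pathlib import PurePosixPath


def latest_heading(headings, position):
    """Return the title of the last heading at or before position, or None if there is none.

    headings: list of ((page, y), title) pairs sorted by (page, y).
    """
    keys = [key for key, _ in headings]
    index = bisect_right(keys, position) - 1
    return headings[index][1] if index >= 0 else None


# Hypothetical data: each entry is ((page, y), text).
chapters = [((0, 100.0), "Chapter A"), ((12, 90.0), "Chapter B")]
subsections = [((0, 150.0), "1"), ((3, 200.0), "2"), ((12, 140.0), "1")]
annotations = [((1, 300.0), "first quote"), ((3, 250.0), "second quote"), ((12, 500.0), "third quote")]

notes = {}
for position, quote in annotations:
    # Fall back to a placeholder name when an annotation precedes every heading.
    chapter = latest_heading(chapters, position) or "unsorted"
    subsection = latest_heading(subsections, position) or "unsorted"
    note_path = PurePosixPath(chapter) / subsection / f"{subsection}.md"
    notes.setdefault(str(note_path), []).append(quote)
```

One limitation of this sketch: the subsection lookup does not reset at chapter boundaries, so an annotation placed after a new chapter title but before its first subsection would be filed under the previous chapter's last subsection name.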
   

In [None]:
vault_path = EXAMPLE_ECCE_HOMMO_VAULT_PATH

In [None]:
ecce_hommo_vault = get_vault_df_from_pdf(chapter_blocks, subsections_blocks, annotations_blocks, vault_path)

In [None]:
ecce_hommo_vault

In [None]:
save_vault(ecce_hommo_vault)

## 3. Load folder structure

In [None]:
vault_files = load_vault(vault_path)

In [None]:
vault_files