## **0. Dataset Overview**

The data is extracted from a Vietnamese traditional medicine text and is organized into three tables:

1. **Medicinal herbs** (vị thuốc): Detailed information about individual Oriental medicinal herbs

2. **Prescriptions** (bài thuốc): Traditional medicine remedies

3. **Formulas** (công thức): Dosages and combinations of medicinal herbs

## **Processing Pipeline**

1. PDF Chunking: Split the 1300-page PDF into 200-page chunks with a 10-page overlap (as shown by the chunk page ranges in the extraction log)
2. Gemini API: Upload each chunk and extract in groups of 7 pages with a 3-page overlap
3. Deduplication: Fuzzy matching with a similarity threshold of 0.85
4. Export: JSON + CSV format
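
Steps 1 to 3 can be sketched with standard-library Python. This is an illustrative sketch, not the `DataExtractor` implementation: `chunk_ranges` and `dedupe_fuzzy` are hypothetical helpers, the 10-page chunk overlap is read off the chunk filenames in the extraction log of section 2 (pages 0001-0200, 0191-0390, ...), treating the 7-page groups as windows inside a single chunk is an assumption, and `difflib.SequenceMatcher` stands in for whatever similarity measure the pipeline actually uses.

```python
from difflib import SequenceMatcher


def chunk_ranges(total_pages, size, overlap):
    """Return 1-based inclusive (start, end) page ranges covering total_pages."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    ranges = []
    start = 1
    while start <= total_pages:
        end = min(start + size - 1, total_pages)
        ranges.append((start, end))
        if end == total_pages:
            break
        start += size - overlap  # consecutive ranges share `overlap` pages
    return ranges


def dedupe_fuzzy(names, threshold=0.85):
    """Drop any name whose similarity to an already kept name is >= threshold."""
    kept = []
    for name in names:
        key = name.strip().lower()
        is_duplicate = any(
            SequenceMatcher(None, key, other.strip().lower()).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(name)
    return kept


# Step 1: 200-page chunks over the 1300-page PDF (overlap assumed from the log)
pdf_chunks = chunk_ranges(1300, size=200, overlap=10)

# Step 2: 7-page groups with 3 pages of overlap inside one 200-page chunk
page_groups = chunk_ranges(200, size=7, overlap=3)

# Step 3: near-identical names collapse to the first one seen (toy input)
unique_names = dedupe_fuzzy(["cam thao bac", "cam thao bac.", "gung"])
```

With these parameters the first five entries of `pdf_chunks` match the chunk filenames in the log. How the real pipeline handles the final, shorter chunk is not shown in the source, so the last range here is only one plausible choice.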

## **1. Import libraries**

In [1]:
import sys
import json
import warnings
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime
from pathlib import Path

warnings.filterwarnings('ignore')

# Setup plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
# %matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)  
pd.set_option('display.max_colwidth', 100)

# Add project root to path
root_path = str(Path().resolve().parent)
if root_path not in sys.path:
    sys.path.insert(0, root_path)

In [2]:
from src.settings import settings
from src.dataset import DataExtractor
from src.utils import safe_path

## **2. Run extraction**

In [None]:
extractor = DataExtractor(
    settings=settings,
    logger_name="HerbExtractor",
    logger_path=f"{settings.LOG_PATH}/extraction.log"
)

pdf_file = f"{settings.RAW_DATA_PATH}/sach-nhung-cay-thuoc-va-vi-thuoc-viet-nam.pdf"

start_time = datetime.now()
print(f"Started at: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Processing: {safe_path(pdf_file)}\n")

try:
    results = extractor.process_pdf_file(str(pdf_file))
    
    end_time = datetime.now()
    duration = (end_time - start_time).total_seconds()
    
    print("Extraction completed successfully!")
    print(f"Duration: {duration/60:.1f} minutes ({duration:.0f} seconds)")
    print(f"Results saved to: {settings.PROCESSED_DATA_PATH}")
    
except Exception as e:
    print(f"Error during extraction: {e}")
    raise

Started at: 2025-11-23 05:04:20
Processing: data/raw/sach-nhung-cay-thuoc-va-vi-thuoc-viet-nam.pdf

2025-11-23 05:04:20 - HerbExtractor - INFO - Processing: data/raw/sach-nhung-cay-thuoc-va-vi-thuoc-viet-nam.pdf
2025-11-23 05:04:20 - HerbExtractor - INFO - Model for OCR: gemini-2.0-flash-lite
2025-11-23 05:04:20 - HerbExtractor - INFO - Model for text extraction: gemini-2.5-flash-lite
2025-11-23 05:04:20 - HerbExtractor - INFO - Rate limit: 8 RPM
2025-11-23 05:04:21 - HerbExtractor - INFO - Created: chunk_000_pages_0001-0200.pdf (200 pages)
2025-11-23 05:04:21 - HerbExtractor - INFO - Created: chunk_001_pages_0191-0390.pdf (200 pages)
2025-11-23 05:04:22 - HerbExtractor - INFO - Created: chunk_002_pages_0381-0580.pdf (200 pages)
2025-11-23 05:04:22 - HerbExtractor - INFO - Created: chunk_003_pages_0571-0770.pdf (200 pages)
2025-11-23 05:04:23 - HerbExtractor - INFO - Created: chunk_004_pages_0761-0960.pdf (200 pages)
2025-11-23 05:04:23 - HerbExtractor - INFO - Created: chunk_005_pages
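
The chunk filenames in the log encode their page ranges, so the overlap between consecutive chunks can be checked directly. A minimal sketch, assuming the `chunk_NNN_pages_SSSS-EEEE.pdf` naming seen above; `chunk_overlaps` is a hypothetical helper, and only the five complete filenames are used because the last log line is truncated.

```python
import re

# Naming pattern assumed from the log lines, e.g. chunk_001_pages_0191-0390.pdf
CHUNK_NAME = re.compile(r"chunk_(\d+)_pages_(\d+)-(\d+)\.pdf")


def chunk_overlaps(filenames):
    """Return the number of shared pages between each pair of consecutive chunks."""
    spans = []
    for name in filenames:
        match = CHUNK_NAME.fullmatch(name)
        if match:  # silently skip names that do not follow the pattern
            spans.append((int(match.group(2)), int(match.group(3))))
    spans.sort()
    return [
        prev_end - next_start + 1
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    ]


logged = [
    "chunk_000_pages_0001-0200.pdf",
    "chunk_001_pages_0191-0390.pdf",
    "chunk_002_pages_0381-0580.pdf",
    "chunk_003_pages_0571-0770.pdf",
    "chunk_004_pages_0761-0960.pdf",
]
overlaps = chunk_overlaps(logged)
print(overlaps)
```

Each consecutive pair of logged chunks shares 10 pages.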

## **3. Load processed data**

In [None]:
print("Loading processed datasets...\n")

# Load all three datasets
df_vi_thuoc = pd.read_csv(settings.OUTPUT_CSV_VI_THUOC)
df_bai_thuoc = pd.read_csv(settings.OUTPUT_CSV_BAI_THUOC)
df_cong_thuc = pd.read_csv(settings.OUTPUT_CSV_CONG_THUC)

# Also load JSON for reference
with open(settings.OUTPUT_JSON, 'r', encoding='utf-8') as f:
    data_json = json.load(f)

print(f"Loaded {len(df_vi_thuoc):,} medicinal herbs")
print(f"Loaded {len(df_bai_thuoc):,} prescriptions")
print(f"Loaded {len(df_cong_thuc):,} formulas")

## **4. Dataset dimensions and info**

In [None]:
datasets = {
    'Vị thuốc (Herbs)': df_vi_thuoc,
    'Bài thuốc (Prescriptions)': df_bai_thuoc,
    'Công thức (Formulas)': df_cong_thuc
}

for name, df in datasets.items():
    print(f"{name}:")
    print(f"Rows: {len(df):,}")
    print(f"Columns: {len(df.columns)}")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print(f"Column names: {', '.join(df.columns.tolist())}\n")
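
The same figures can also be collected into one table, which is easier to compare across datasets than printed lines. A small sketch under stated assumptions: `summarize_datasets` is a hypothetical helper, and the toy frames and their column names below are made up for illustration, not taken from the real CSV files.

```python
import pandas as pd


def summarize_datasets(datasets):
    """Build one summary row (rows, columns, memory) per named DataFrame."""
    rows = []
    for name, df in datasets.items():
        rows.append({
            "Dataset": name,
            "Rows": len(df),
            "Columns": df.shape[1],
            "Memory (MB)": df.memory_usage(deep=True).sum() / 1024**2,
        })
    return pd.DataFrame(rows)


# Toy stand-ins for df_vi_thuoc / df_cong_thuc (hypothetical columns)
toy_datasets = {
    "herbs": pd.DataFrame({"name": ["a", "b"], "part": ["root", "leaf"]}),
    "formulas": pd.DataFrame({"name": ["x"]}),
}
summary = summarize_datasets(toy_datasets)
print(summary.to_string(index=False))
```

In the notebook, passing the `datasets` dictionary defined in the cell above would produce one row per table.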

In [None]:
def analyze_missing_data(df, name):
    print(f"\n{name}")
    
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    
    missing_df = pd.DataFrame({
        'Column': missing.index,
        'Missing Count': missing.values,
        'Missing %': missing_pct.values
    }).sort_values('Missing %', ascending=False)
    
    # Filter only columns with missing data
    missing_df = missing_df[missing_df['Missing Count'] > 0]
    
    if len(missing_df) > 0:
        print(missing_df.to_string(index=False))
    else:
        print("✓ No missing values detected!")
    
    return missing_df

missing_herbs = analyze_missing_data(df_vi_thuoc, "Vị thuốc")
missing_prescriptions = analyze_missing_data(df_bai_thuoc, "Bài thuốc")
missing_formulas = analyze_missing_data(df_cong_thuc, "Công thức")
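
To show what the missing-data report contains, here is a compact variant of the same computation run on toy data. It is a sketch, not a replacement for `analyze_missing_data`: `missing_summary` is a hypothetical helper, the column names are invented for illustration, and it relies on `DataFrame.isna().mean()` giving the fraction of missing values per column.

```python
import pandas as pd


def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    is_missing = df.isna()
    out = pd.DataFrame({
        "Missing Count": is_missing.sum(),
        "Missing %": is_missing.mean() * 100,  # mean of booleans = fraction missing
    })
    out = out[out["Missing Count"] > 0].sort_values("Missing %", ascending=False)
    return out.rename_axis("Column").reset_index()


# Toy frame with hypothetical columns: one complete, two with gaps
toy = pd.DataFrame({
    "ten": ["a", "b", "c", "d"],
    "lieu_luong": [None, "3g", None, None],
    "ghi_chu": ["x", None, "y", "z"],
})
toy_missing = missing_summary(toy)
print(toy_missing.to_string(index=False))
```

Columns without missing values are filtered out, so an empty result means the table is complete.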
