# Processing HTML Files

We will use the **html2parquet transform**.

References
- [html2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/html2parquet/python)

## Step-1: Data

We will process data that is downloaded using [1_crawl_site.ipynb](1_crawl_site.ipynb).

The crawled HTML files (96 in this run) are in the crawl directory, `MY_CONFIG.CRAWL_DIR` (`workspace/crawled` here).

## Step-2: Configuration

In [1]:
## All config is defined here
from my_config import MY_CONFIG
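
`my_config.py` is not shown in this notebook. As a rough sketch, assuming the directory layout implied by the paths printed later in this notebook (`workspace/crawled`, `workspace/parquet`, `workspace/processed`), it might look something like the following. Only the attribute names `WORKSPACE_DIR`, `CRAWL_DIR`, and `PROCESSED_DATA_DIR` come from the notebook; the class name `MyConfig` and the concrete values are assumptions.

```python
import os


class MyConfig:
    """Hypothetical stand-in for the real my_config module."""
    pass


MY_CONFIG = MyConfig()

# Root folder for everything this pipeline reads and writes (assumed value)
MY_CONFIG.WORKSPACE_DIR = "workspace"
# Where the crawler notebook saves the HTML files (assumed value)
MY_CONFIG.CRAWL_DIR = os.path.join(MY_CONFIG.WORKSPACE_DIR, "crawled")
# Where the final markdown files are written (assumed value)
MY_CONFIG.PROCESSED_DATA_DIR = os.path.join(MY_CONFIG.WORKSPACE_DIR, "processed")
```

Keeping every path in one object like this means the later cells never hard-code a directory name.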

In [2]:
import os
import shutil

# Recreate the intermediate parquet directory from scratch
PQ_DIR = os.path.join(MY_CONFIG.WORKSPACE_DIR, "parquet")
shutil.rmtree(PQ_DIR, ignore_errors=True)
os.makedirs(PQ_DIR, exist_ok=True)
print(f"✅ Cleared  intermediate parquet directory :  {PQ_DIR}")

# Recreate the processed data directory from scratch
shutil.rmtree(MY_CONFIG.PROCESSED_DATA_DIR, ignore_errors=True)
os.makedirs(MY_CONFIG.PROCESSED_DATA_DIR, exist_ok=True)
print(f"✅ Cleared  processed data directory :  {MY_CONFIG.PROCESSED_DATA_DIR}")

✅ Cleared  intermediate parquet directory :  workspace/parquet
✅ Cleared  processed data directory :  workspace/processed
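
Both directories are cleared and recreated the same way, so the pattern could be folded into a small helper. This is only a sketch; `reset_dir` is a hypothetical name and is not part of data-prep-kit.

```python
import os
import shutil


def reset_dir(path: str) -> str:
    """Delete `path` if it exists, recreate it empty, and return the path."""
    # ignore_errors=True makes this a no-op when the directory is missing
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)
    return path
```

With that in place, the setup cell would reduce to calls such as `PQ_DIR = reset_dir(os.path.join(MY_CONFIG.WORKSPACE_DIR, "parquet"))`.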


## Step-3: HTML2Parquet

Process the HTML documents and extract their text in Markdown format.

In [3]:
from dpk_html2parquet.transform_python import Html2Parquet


result = Html2Parquet(
    input_folder=MY_CONFIG.CRAWL_DIR,
    output_folder=PQ_DIR,
    data_files_to_use=['.html'],
    html2parquet_output_format="markdown",
).transform()

23:26:24 INFO - html2parquet parameters are : {'output_format': <html2parquet_output_format.MARKDOWN: 'markdown'>, 'favor_precision': <html2parquet_favor_precision.TRUE: 'True'>, 'favor_recall': <html2parquet_favor_recall.TRUE: 'True'>}
23:26:24 INFO - pipeline id pipeline_id
23:26:24 INFO - code location None
23:26:24 INFO - data factory data_ is using local data access: input_folder - workspace/crawled output_folder - workspace/parquet
23:26:24 INFO - data factory data_ max_files -1, n_sample -1
23:26:24 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.html'], files to checkpoint ['.parquet']
23:26:24 INFO - orchestrator html2parquet started at 2025-05-12 23:26:24
23:26:24 INFO - Number of files is 96, source profile {'max_file_size': 0.3377838134765625, 'min_file_size': 0.10031604766845703, 'total_file_size': 12.527679443359375}
23:26:25 INFO - Completed 1 files (1.04%) in 0.003 min
23:26:25 INFO - Completed 2 files
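
The log above reports 96 input files. One way to cross-check that number is to count the files that match the same extension filter passed in `data_files_to_use`. The following is a minimal standard-library sketch; `count_files` is a hypothetical helper, and it assumes the crawl folder is flat (no subdirectories), which this notebook does not state.

```python
from pathlib import Path


def count_files(folder: str, extension: str = ".html") -> int:
    """Count regular files directly inside `folder` whose name ends with `extension`."""
    return sum(
        1
        for p in Path(folder).iterdir()
        if p.is_file() and p.name.endswith(extension)
    )
```

If the flat-folder assumption holds, `count_files(MY_CONFIG.CRAWL_DIR)` should agree with the "Number of files" value in the log.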

## Step-4: Inspect the Output


In [4]:
from file_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(PQ_DIR)

print("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display only certain columns
# output_df[['title', 'contents', 'size']].head(5)

Output dimensions (rows x columns)=  (96, 6)


| | title | document | contents | document_id | size | date_acquired |
|---|---|---|---|---|---|---|
| 0 | thealliance_ai_working-groups-hardware-enablem... | thealliance_ai_working-groups-hardware-enablem... | [Hardware Enablement Focus Area](/focus-areas/... | 698eddd25c4e6e9f172a19ebec695247c0a72e6ec88c66... | 1553 | 2025-05-12T23:26:26.582991 |
| 1 | thealliance_ai_blog-open-source-ai-demo-night-... | thealliance_ai_blog-open-source-ai-demo-night-... | On August 8th, The AI Alliance, in collaborati... | 7802bb7e50653e6b21f571b28843fd9a4bcf5023eaab3a... | 3151 | 2025-05-12T23:26:25.864172 |
| 2 | thealliance_ai_working-groups-applications-and... | thealliance_ai_working-groups-applications-and... | [Applications and Tools Focus Area](/focus-are... | 1aaa9d752f74d7abd233abbd8688884c99ea64f575162b... | 1565 | 2025-05-12T23:26:26.541915 |
| 3 | thealliance_ai_blog-open-innovation-day-tokyo_... | thealliance_ai_blog-open-innovation-day-tokyo_... | Open innovation in AI software, algorithms, da... | 2f82c2d26c751fcb2528eb7c9273ebf3fac4d21b842787... | 1304 | 2025-05-12T23:26:25.854249 |
| 4 | thealliance_ai_blog-ai-alliance-skills-and-edu... | thealliance_ai_blog-ai-alliance-skills-and-edu... | By Rebekkah Hogan (Meta), Sowmya Kannan (IBM),... | a8c21ef29afc54923a30393be621693674a1ad23965998... | 3615 | 2025-05-12T23:26:25.588102 |


In [5]:
output_df.iloc[0]['title']

'thealliance_ai_working-groups-hardware-enablement_text.html'

In [6]:
output_df.iloc[0]['document']

'thealliance_ai_working-groups-hardware-enablement_text.html'

In [7]:
## Display markdown text
contents = output_df.iloc[0]['contents']
print('content length:', len(contents), '\n')
print(contents[:200])


content length: 1553 

[Hardware Enablement Focus Area](/focus-areas/hardware-enablement)

# Hardware Enablement Working Group

## Co-leads

- Adam Pingel (IBM)
- Amit Sangani (Meta)

## Frequently Asked Questions (FAQ)

**


In [8]:
## display markdown in pretty format
# from IPython.display import Markdown
# display(Markdown(output_df.iloc[0,]['contents']))


## Step-5: Save the markdown

In [9]:
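import os

for _, row in output_df.iterrows():
    html_file = row['document']
    # Swap the .html extension for .md, keeping the base file name
    base_name = os.path.splitext(os.path.basename(html_file))[0]
    md_output_file = os.path.join(MY_CONFIG.PROCESSED_DATA_DIR, base_name + '.md')

    # Write as UTF-8 so non-ASCII text does not depend on the platform default
    with open(md_output_file, 'w', encoding='utf-8') as md_output_file_handle:
        md_output_file_handle.write(row['contents'])
# -- end loop ---

# Use the row count rather than the last loop index, which fails on an empty frame
print(f"✅ Saved {len(output_df)} md files into '{MY_CONFIG.PROCESSED_DATA_DIR}'")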

✅ Saved 96 md files into 'workspace/processed'
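
To make the file-naming step in the loop concrete, here is the same `basename`/`splitext` logic applied to the `document` value of the first row shown earlier. The function name `md_path_for` and the `"workspace/processed"` default are illustrative assumptions that mirror the paths printed in this notebook.

```python
import os


def md_path_for(html_file: str, out_dir: str = "workspace/processed") -> str:
    """Map an HTML file name (or path) to the markdown path used when saving."""
    # Drop any leading directories, then strip the final extension
    base_name = os.path.splitext(os.path.basename(html_file))[0]
    return os.path.join(out_dir, base_name + ".md")


print(md_path_for("thealliance_ai_working-groups-hardware-enablement_text.html"))
```

Only the final extension is replaced, so the `_text` suffix in the crawled file names carries over to the markdown file names.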
