# Processing HTML Files

We will use the **html2parquet** transform to extract markdown text from crawled HTML files.

References
- [html2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/html2parquet/python)

## Step-1: Data

We will process the data downloaded by [1_crawl_site.ipynb](1_crawl_site.ipynb).

The crawled HTML files (20 in the run shown here) are in the `input` directory.
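
Before running the transform, it can help to confirm what is actually in the input folder. The following is a minimal sketch using only the standard library; `list_html_files` is a hypothetical helper (not part of this project), and the folder name `input` is taken from the text above.

```python
from pathlib import Path


def list_html_files(input_dir):
    """Return (file name, size in bytes) for each .html file directly under input_dir."""
    folder = Path(input_dir)
    if not folder.is_dir():
        # A missing folder is reported as "no files" rather than raising
        return []
    return sorted((p.name, p.stat().st_size) for p in folder.glob("*.html") if p.is_file())


# "input" is the folder named in this notebook; adjust if yours differs
for name, size in list_html_files("input"):
    print(f"{size:>10,d}  {name}")
```

An empty listing at this point usually means the crawl step has not been run yet.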

## Step-2: Configuration

In [1]:
## All config is defined here
from my_config import MY_CONFIG
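
The contents of `my_config.py` are not shown in this notebook. The sketch below is one guess at what it might contain, built from the paths that appear in the logs and output further down (`input`, `output/1-html2parquet`, `output/2-markdown`). The class name `MyConfig` and the value of `OUTPUT_DIR` are assumptions.

```python
import os


class MyConfig:
    """Plain container for the settings this notebook reads (hypothetical)."""
    pass


MY_CONFIG = MyConfig()

# Folder holding the crawled HTML files
MY_CONFIG.INPUT_DIR = "input"

# Root output folder (assumed) and the two sub-folders seen in the notebook output
MY_CONFIG.OUTPUT_DIR = "output"
MY_CONFIG.OUTPUT_DIR_HTML = os.path.join(MY_CONFIG.OUTPUT_DIR, "1-html2parquet")
MY_CONFIG.OUTPUT_DIR_MARKDOWN = os.path.join(MY_CONFIG.OUTPUT_DIR, "2-markdown")
```

Keeping every path in one object means the later cells never hard-code a folder name.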

In [2]:
import os
import shutil

# Start from a clean output directory, then recreate the folders used below
shutil.rmtree(MY_CONFIG.OUTPUT_DIR, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_DIR, exist_ok=True)
os.makedirs(MY_CONFIG.OUTPUT_DIR_HTML, exist_ok=True)
os.makedirs(MY_CONFIG.OUTPUT_DIR_MARKDOWN, exist_ok=True)

print ("✅ Cleared  output directory")

✅ Cleared  output directory


## Step-3: HTML2Parquet

Process the HTML documents and extract their text in markdown format.

In [3]:
from dpk_html2parquet.transform_python import Html2Parquet

x = Html2Parquet(input_folder=MY_CONFIG.INPUT_DIR,
                 output_folder=MY_CONFIG.OUTPUT_DIR_HTML,
                 data_files_to_use=['.html'],
                 html2parquet_output_format="markdown"
                 ).transform()

21:46:03 INFO - html2parquet parameters are : {'output_format': <html2parquet_output_format.MARKDOWN: 'markdown'>, 'favor_precision': <html2parquet_favor_precision.TRUE: 'True'>, 'favor_recall': <html2parquet_favor_recall.TRUE: 'True'>}
21:46:03 INFO - pipeline id pipeline_id
21:46:03 INFO - code location None
21:46:03 INFO - data factory data_ is using local data access: input_folder - input output_folder - output/1-html2parquet
21:46:03 INFO - data factory data_ max_files -1, n_sample -1
21:46:03 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.html'], files to checkpoint ['.parquet']
21:46:03 INFO - orchestrator html2parquet started at 2025-01-16 21:46:03
21:46:03 INFO - Number of files is 20, source profile {'max_file_size': 0.2696037292480469, 'min_file_size': 0.10027694702148438, 'total_file_size': 2.5929641723632812}
21:46:03 INFO - Completed 1 files (5.0%) in 0.003 min
21:46:03 INFO - Completed 2 files (10.0%) 

## Step-4: Inspect the Output


In [4]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(MY_CONFIG.OUTPUT_DIR_HTML)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display only certain columns
#output_df[['column1', 'column2', 'column3']].head(5)

Output dimensions (rows x columns)=  (20, 6)


| | title | document | contents | document_id | size | date_acquired |
|---|---|---|---|---|---|---|
| 0 | thealliance_ai_focus-areas-foundation-models-d... | thealliance_ai_focus-areas-foundation-models-d... | # Open Foundation Models and Datasets\n\n### E... | a3b82b0f98d1458175b52e597115f3b329b5d7ba79e224... | 5076 | 2025-01-16T21:46:04.163443 |
| 1 | thealliance_ai_focus-areas-skills-education_te... | thealliance_ai_focus-areas-skills-education_te... | # Skills & Education\n\n### Supporting global ... | d98ef830df5e293bb7903e021b60194e8b4e529ef4824b... | 334 | 2025-01-16T21:46:04.189883 |
| 2 | thealliance_ai_core-projects-trust-and-safety-... | thealliance_ai_core-projects-trust-and-safety-... | Much like other software, generative AI (“GenA... | 49b5d414088c84ed87d57b0f58ad01756f86a108eff666... | 770 | 2025-01-16T21:46:04.105742 |
| 3 | thealliance_ai_governance_text.html | thealliance_ai_governance_text.html | # AI Alliance Program Governance\n\n### The fo... | ca287a7c5a81e4f8c10d5c9cd082b4774fe2dfa4c1093a... | 14877 | 2025-01-16T21:46:04.213902 |
| 4 | thealliance_ai_core-projects-the-living-guide-... | thealliance_ai_core-projects-the-living-guide-... | # The Living Guide to Applying AI\n\nProjectAp... | 2baeeebebd08f790702e5f9eddd659f601e8187f27aad1... | 859 | 2025-01-16T21:46:04.094472 |
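
The helper `read_parquet_files_as_df` comes from `my_utils`, whose source is not shown here. If that module is unavailable, a stand-in along the following lines may be enough. This is a sketch, not the project's implementation: the name `read_files_as_df` and its `pattern` and `reader` parameters are hypothetical, and `pd.read_parquet` needs a parquet engine such as `pyarrow` installed.

```python
import glob
import os

import pandas as pd


def read_files_as_df(folder, pattern="*.parquet", reader=pd.read_parquet):
    """Read every file in `folder` matching `pattern` and stack the rows into one DataFrame."""
    files = sorted(glob.glob(os.path.join(folder, pattern)))
    if not files:
        # Nothing to read: return an empty frame instead of failing in pd.concat
        return pd.DataFrame()
    frames = [reader(path) for path in files]
    return pd.concat(frames, ignore_index=True)
```

With the defaults, `read_files_as_df(MY_CONFIG.OUTPUT_DIR_HTML)` would load the parquet files written in Step-3. The `reader` argument exists only so the same function can be pointed at other formats, for example `reader=pd.read_csv` with `pattern="*.csv"`.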


In [5]:
output_df.iloc[0]['title']

'thealliance_ai_focus-areas-foundation-models-datasets_text.html'

In [6]:
output_df.iloc[0]['document']

'thealliance_ai_focus-areas-foundation-models-datasets_text.html'

In [7]:
## Display the markdown text of the first document
print ('content length:', len(output_df.iloc[0]['contents']), '\n')
print (output_df.iloc[0]['contents'])


content length: 5076 

# Open Foundation Models and Datasets

### Enabling an ecosystem of open foundation models, including those with multilingual and multi-modal capabilities, and open datasets.

We are responsibly enhancing the ecosystem of open foundation models and datasets. We are embracing multilingual and multimodal models, as well as science models tackling broad societal issues like climate change and education.

To aid AI model builders and application developers, we’re collaborating to develop and promote open-source tools for model training, tuning, and inference. We are also launching programs to foster the open development of AI in safe and beneficial ways, and hosting events to explore AI use cases.

Without good datasets, model training and tuning would be impossible. We are promoting the development of open datasets with clear governance and provenance controls so they can be used without concerns for legal and other risks.

## Current or recent projects

![](https:/
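
For the first row, the `size` column (5076) equals the character count printed above. Whether that holds for every row is worth checking rather than assuming. The sketch below lists any rows where the two differ; `rows_with_size_mismatch` is a hypothetical helper, and it assumes the `contents` and `size` columns shown in the table earlier.

```python
import pandas as pd


def rows_with_size_mismatch(df):
    """Return the rows whose 'size' differs from the character length of 'contents'."""
    lengths = df["contents"].str.len()
    return df[lengths != df["size"]]
```

Calling `rows_with_size_mismatch(output_df)` and getting an empty DataFrame back would suggest that `size` is a character count of the extracted markdown.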

In [8]:
## Display the markdown as rendered output (uncomment to use)
# from IPython.display import Markdown
# display(Markdown(output_df.iloc[0]['contents']))


## Step-5: Save the markdown

In [9]:
import os

for index, row in output_df.iterrows():
    html_file = row['document']
    # Derive the markdown file name from the HTML file name
    base_name = os.path.splitext(os.path.basename(html_file))[0]
    md_output_file = os.path.join(MY_CONFIG.OUTPUT_DIR_MARKDOWN, base_name + '.md')

    # Write as UTF-8 so non-ASCII characters are saved the same way on every platform
    with open(md_output_file, 'w', encoding='utf-8') as md_output_file_handle:
        md_output_file_handle.write(row['contents'])
# -- end loop ---

print (f"✅ Saved {len(output_df)} md files into '{MY_CONFIG.OUTPUT_DIR_MARKDOWN}'")

✅ Saved 20 md files into 'output/2-markdown'
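
The save loop names each markdown file after the base name of its HTML document, so two documents that share a base name would silently overwrite each other. The sketch below mirrors that naming rule and reports any collisions. Both function names are hypothetical and introduced here only for illustration.

```python
import os
from collections import Counter


def markdown_name_for(document):
    """Map an HTML file name or path to its markdown file name, as the save loop does."""
    base_name = os.path.splitext(os.path.basename(document))[0]
    return base_name + ".md"


def find_name_collisions(documents):
    """Return the markdown names that more than one document would map to."""
    counts = Counter(markdown_name_for(doc) for doc in documents)
    return sorted(name for name, n in counts.items() if n > 1)
```

Running `find_name_collisions(output_df['document'])` before saving should return an empty list when every document gets its own file.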
