j-szulc/html-cleaning-dataset-merged

Merged datasets from various HTML cleaning competitions

Note that I do not own any rights to these datasets. I only merged datasets that were already available, though some were hard to find, and mirrored them for easier access. All rights belong to their respective owners.

For all licensing information, please refer to the original sources.

This repo contains:

which have all been transformed into the same folder structure.

To get a version stripped of any HTML tags, use the strip.py script, which is my only contribution. To run it, just do:

pip3 install bs4
python3 strip.py dataset-name/GoldStandard dataset-name/stripped
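
For readers who want a rough idea of what tag stripping involves, here is a minimal sketch. It is not the repository's strip.py: that script depends on bs4, whereas this illustration uses only Python's standard-library `html.parser`. The names `TextExtractor` and `strip_tags` are hypothetical, and skipping `<script>` and `<style>` contents and collapsing whitespace are assumptions that may not match what strip.py actually does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}  # assumption: these carry no readable text

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b></p><script>x = 1</script>"))
```

For real use, prefer strip.py itself so the stripped output stays consistent across datasets.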

To strip all datasets at once, run:

pip3 install bs4
find . -maxdepth 1 -type d ! -name ".*" -exec bash -c 'echo Processing "$0" ; python3 strip.py "$0"/GoldStandard "$0"/stripped' {} \;

To merge them into one dataset, just do:

#!/bin/bash

rm -r merged || true
# Create a new directory to store the merged directories
mkdir -p merged/stripped
mkdir -p merged/GoldStandard
mkdir -p merged/input

# Loop through each folder in the current directory
for folder in */; do
  # Skip the "merged" folder
  if [ "$folder" == "merged/" ]; then
    continue
  fi
  # Check if the GoldStandard directory exists inside the current folder
  if [ -d "$folder/GoldStandard" ]; then
    # Copy the contents of the GoldStandard directory to the merged GoldStandard directory
    cp -R "$folder/GoldStandard/." merged/GoldStandard/
  fi
  if [ -d "$folder/stripped" ]; then
    cp -R "$folder/stripped/." merged/stripped/
  fi
  if [ -d "$folder/input" ]; then
    cp -R "$folder/input/." merged/input/
  fi
done
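
One thing to keep in mind with the script above: `cp -R` overwrites files that share a name, so if two datasets contain a file with the same name, only the last one copied ends up in `merged`. Whether such collisions occur in these datasets is not something this README states. The sketch below is a hypothetical Python alternative (the function name `merge_datasets` is an assumption, not part of this repo) that copies the same three subfolders but skips and reports collisions instead of overwriting. Unlike the shell script, it does not delete an existing `merged` folder first.

```python
import shutil
from pathlib import Path

# Subfolders that every dataset is expected to share
SUBDIRS = ("GoldStandard", "stripped", "input")


def merge_datasets(root, merged_name="merged"):
    """Copy dataset files into root/merged_name without overwriting.

    Returns a list of "dataset/subdir/file" paths that were skipped
    because a file with the same relative path was already merged.
    """
    root = Path(root)
    merged = root / merged_name
    collisions = []
    # Sort for a deterministic order: the first dataset to provide a name wins
    for dataset in sorted(p for p in root.iterdir() if p.is_dir()):
        if dataset.name == merged_name or dataset.name.startswith("."):
            continue
        for sub in SUBDIRS:
            src_dir = dataset / sub
            if not src_dir.is_dir():
                continue
            for src in sorted(src_dir.rglob("*")):
                if not src.is_file():
                    continue
                rel = src.relative_to(src_dir)
                dest = merged / sub / rel
                if dest.exists():
                    collisions.append(f"{dataset.name}/{sub}/{rel.as_posix()}")
                    continue
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)
    return collisions


if __name__ == "__main__":
    for path in merge_datasets("."):
        print("skipped (name already present in merged):", path)
```

Run from the repository root, it prints any files that were skipped, which makes it easy to decide whether they need renaming before merging.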
