This module provides a set of utility functions to clean text data, primarily intended for processing license text. The cleaning process involves three main steps:

1.  **HTML Tag Removal:** The `remove_html_tags` function uses `BeautifulSoup` to parse HTML and extract the text content, effectively removing all HTML markup.
2.  **Special Character Cleaning:** The `clean_special_chars` function first strips URLs (any token beginning with `http`), then uses a regular expression to remove every character that is not an ASCII letter, a digit, whitespace, or one of the punctuation marks `. , ! ? ; : ' " -`. This standardizes the text and drops symbols that could interfere with later processing.
3.  **Whitespace Normalization:** The `normalize_whitespace` function collapses every run of whitespace (spaces, tabs, newlines) into a single space and strips leading and trailing whitespace, giving consistent spacing.

The `preprocess_text` function applies these three steps in order and then lowercases the result, so all cleaning can be done in a single call. It returns an empty string if the input is `None`.
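
To make the effect of the pipeline concrete, here is a minimal sketch that mirrors the steps on a made-up sample string. It is an illustration under assumptions, not the module's code: the HTML step uses the standard library's `HTMLParser` as a stand-in for `BeautifulSoup`, and the names `_TextCollector`, `sketch_preprocess`, and `sample` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Stdlib stand-in for BeautifulSoup's get_text (illustration only)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; character references arrive already decoded
        self.parts.append(data)


def sketch_preprocess(text):
    """Hypothetical re-creation of the cleaning pipeline, stdlib only."""
    if text is None:
        return ""
    # Step 1: drop HTML markup, keeping the text content
    collector = _TextCollector()
    collector.feed(text)
    collector.close()
    text = " ".join(collector.parts)
    # Step 2: remove URLs, then anything outside the allowed character set
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"[^a-zA-Z0-9\s.,!?;:'\"-]", "", text)
    # Step 3: collapse whitespace runs and trim the ends
    text = " ".join(text.split())
    # Final step: lowercase
    return text.lower()


# Made-up input for demonstration purposes
sample = (
    "<p>Copyright &copy; 2024 <b>Example</b> Corp.</p>\n\n"
    "  See https://example.com/license  for details."
)
result = sketch_preprocess(sample)
print(result)
```

Note that the copyright sign and the URL disappear entirely, which may or may not be desirable depending on how the cleaned text is used downstream.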

In [1]:
import json
import os
from bs4 import BeautifulSoup
import re


def remove_html_tags(text):
    """
    Removes HTML tags from the given text.

    Args:
        text (str): The input text containing HTML tags.

    Returns:
        str: The text with HTML tags removed. Returns empty string if input is None.
    """
    if text is None:
        return ""
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text(separator=" ")

def clean_special_chars(text):
    """
    Removes URLs and special characters from the given text.

    URLs (tokens starting with "http") are removed first. Then every character
    that is not an ASCII letter, digit, whitespace, or basic punctuation
    (. , ! ? ; : ' " -) is removed.

    Args:
        text (str): The input text containing special characters.

    Returns:
        str: The text with special characters removed. Returns empty string if input is None.
    """
    if text is None:
        return ""
    # Remove URLs: any run of non-whitespace starting with "http" (covers "https" too)
    text = re.sub(r"http\S+", "", text)

    # Remove characters that are not ASCII alphanumeric, whitespace, or basic punctuation
    cleaned_text = re.sub(r"[^a-zA-Z0-9\s.,!?;:'\"-]", "", text)

    return cleaned_text

def normalize_whitespace(text):
    """
    Normalizes whitespace in the given text.

    This function collapses each run of whitespace (spaces, tabs, newlines) into a
    single space and removes leading/trailing whitespace.

    Args:
        text (str): The input text with potentially excessive whitespace.

    Returns:
        str: The text with normalized whitespace. Returns empty string if input is None.
    """
    if text is None:
        return ""
    # split() with no arguments splits on any whitespace run and drops empty
    # leading/trailing pieces, so no separate strip() is needed
    return " ".join(text.split())

def preprocess_text(text):
    """
    Combines all cleaning functions to process license text.

    This function sequentially applies the following cleaning operations:
    1. Removes HTML tags.
    2. Removes URLs and special characters.
    3. Normalizes whitespace.
    4. Converts the text to lowercase.

    Args:
        text (str): The input license text.

    Returns:
        str: The cleaned license text. Returns empty string if input is None.
    """
    if text is None:
        return ""
    text = remove_html_tags(text)
    text = clean_special_chars(text)
    text = normalize_whitespace(text)
    text = text.lower()
    return text

This script preprocesses license text files from a specified input directory and saves the cleaned text to a designated output directory.

The preprocessing steps involve:
  1. Reading license text from .txt files in the input directory.
  2. Calling the `preprocess_text` function to clean the text.
  3. Saving the preprocessed text to the output directory under the original base name with "_preprocessed.txt" appended (for example, `Apache-2.0.txt` becomes `Apache-2.0_preprocessed.txt`).

The script creates the output directory if it does not already exist, so the files can always be written.
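
The same loop can be wrapped in a function so it is easy to try on a throwaway directory first. The sketch below is a hedged illustration: `preprocess_directory` and its `clean` parameter are hypothetical names, `str.lower` merely stands in for `preprocess_text`, and the files are assumed to be UTF-8 encoded.

```python
import os
import tempfile


def preprocess_directory(input_dir, output_dir, clean):
    """Apply `clean` to every .txt file in input_dir; return the written paths."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for filename in sorted(os.listdir(input_dir)):
        if not filename.endswith(".txt"):
            continue  # skip anything that is not a .txt file
        with open(os.path.join(input_dir, filename), "r", encoding="utf-8") as f:
            cleaned = clean(f.read())
        # Same naming rule as the script: base name + "_preprocessed.txt"
        out_name = os.path.splitext(filename)[0] + "_preprocessed.txt"
        out_path = os.path.join(output_dir, out_name)
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(cleaned)
        written.append(out_path)
    return written


# Demo on a temporary directory; str.lower stands in for preprocess_text
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "text")
    dst = os.path.join(tmp, "out")
    os.makedirs(src)
    with open(os.path.join(src, "MIT.txt"), "w", encoding="utf-8") as f:
        f.write("Permission is hereby GRANTED")
    with open(os.path.join(src, "notes.md"), "w", encoding="utf-8") as f:
        f.write("ignored")
    written = preprocess_directory(src, dst, str.lower)
    output_names = [os.path.basename(p) for p in written]
    with open(written[0], encoding="utf-8") as f:
        output_text = f.read()

print(output_names)
```

Passing the cleaning function as an argument keeps the file handling separate from the text cleaning, so either part can be tested on its own.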

In [None]:
# Input directory containing the TXT files
input_dir = "../data/text"  # Update with the correct path to your .txt files

# Output directory to save the preprocessed licenses
output_dir = "../preprocessing/preprocessed_licenses_txt"  # Update if needed
os.makedirs(output_dir, exist_ok=True)

# Loop through all TXT files in the input directory
for filename in os.listdir(input_dir):
    if filename.endswith(".txt"):
        filepath = os.path.join(input_dir, filename)

        # Read the license text (assumes the files are UTF-8 encoded)
        with open(filepath, "r", encoding="utf-8") as f:
            license_text = f.read()

        preprocessed_text = preprocess_text(license_text)

        # Save the preprocessed text
        output_filename = os.path.splitext(filename)[0] + "_preprocessed.txt"
        output_filepath = os.path.join(output_dir, output_filename)

        with open(output_filepath, "w", encoding="utf-8") as outfile:
            outfile.write(preprocessed_text)

This additional script processes JSON files containing license information: it extracts the license text, cleans it, and saves the cleaned text to individual text files.

It iterates through the JSON files in a specified input directory, extracts the value associated with the "licenseText" key, preprocesses this text using the `preprocess_text` function defined above, and saves the result as a .txt file in an output directory. The output filename is the input JSON filename with its extension replaced by ".txt".

The script also creates the output directory if it doesn't already exist.
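
The extraction and naming steps can be sketched in isolation as follows. This is an illustrative sketch only: `extract_license_text` and `output_name_for` are hypothetical helper names, and the sample JSON strings (including the `licenseId` key) are made up for the demonstration.

```python
import json
import os


def extract_license_text(raw_json):
    """Return the 'licenseText' value, or an empty string if it is missing or null."""
    data = json.loads(raw_json)
    # `or ""` also covers an explicit JSON null, which .get() would return as None
    return data.get("licenseText") or ""


def output_name_for(json_filename):
    """Replace the .json extension with .txt, keeping the base name."""
    return os.path.splitext(json_filename)[0] + ".txt"


# Made-up sample records; "licenseId" is a hypothetical key used only here
with_text = extract_license_text('{"licenseId": "Apache-2.0", "licenseText": "Apache License"}')
missing = extract_license_text('{"licenseId": "Unknown"}')
null_text = extract_license_text('{"licenseText": null}')
name = output_name_for("Apache-2.0.json")

print(repr(with_text), repr(missing), repr(null_text), name)
```

Because `os.path.splitext` splits at the last dot only, a base name that itself contains dots, such as `Apache-2.0`, is preserved intact.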

In [None]:
# Input directory containing the JSON files
input_dir = "../data/json/details"  # Replace with the actual path

# Output directory to save the cleaned licenses
output_dir = "../preprocessing/cleaned_licenses"  # Replace with the actual path

os.makedirs(output_dir, exist_ok=True)
for filename in os.listdir(input_dir):
    if filename.endswith(".json"):
        filepath = os.path.join(input_dir, filename)

        # Load the JSON file (JSON is read as UTF-8)
        with open(filepath, "r", encoding="utf-8") as f:
            license_data = json.load(f)

        # Extract the license text
        license_text = license_data.get("licenseText", "")

        # Clean the license text
        cleaned_text = preprocess_text(license_text)

        # Save the cleaned license text to a new file
        output_filename = os.path.splitext(filename)[0] + ".txt"  # e.g., Apache-2.0.txt
        output_filepath = os.path.join(output_dir, output_filename)

        with open(output_filepath, "w", encoding="utf-8") as outfile:
            outfile.write(cleaned_text)

In [13]:
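# Directory containing the preprocessed license JSON files
input_dir = "../../data/processed/preprocessed_licenses_json_2"
count = 0  # Number of licenses whose "text" field is empty

for filename in os.listdir(input_dir):
    if filename.endswith(".json"):
        filepath = os.path.join(input_dir, filename)
        # Load the license record
        with open(filepath, "r") as f:
            license_data = json.load(f)
        # Report every license whose text is empty
        if license_data["text"] == "":
            count += 1
            print(license_data["name"])

print(f"Licenses with empty text: {count}")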