
In [2]:
import re
import unicodedata

# Messy sample text standing in for the contents of 'text_sample.txt'.
# The 'fullContent' field holds the raw text to clean.
text_sample_data = {
    "type": "uploaded file",
    "fileName": "text_sample.txt",
    "fullContent": """
[source: 1] Thís ís à próblemátic téxt fíle!! It contains    **extra spaces** ,,,,, special characters!!!💥🔥🚀
[source: 2] Some words are misspelleddd,   and encoding  issues   liké thís cäusé problëms.
Prices are inconsistent:  $29.99, 29.99 USD, 29,99$.

Emails & phone numbers may be embedded: contact@domain.com, (123)-456-7890.

[source: 3] Repeated punctuations!!!!!
should be removed, along with **random symbols** like @@,##.

stopwords like "the", "is", and "a" appear often.
[source: 4] HTML tags might be present: <div>This is inside a div</div>

And sometimes, contractions won't expand: "can't", "won't", "shouldn't".
[source: 5] Random numeric values: 123456, 98765, 2024.
"""
}

# Extract the full content from the provided dictionary.
# This variable now holds the raw, messy text from 'text_sample.txt'.
messy_text_from_sample = text_sample_data['fullContent']

def clean_text(text):
    """
    Cleans the input text by applying a series of text processing techniques.

    Args:
        text (str): The messy input text string.

    Returns:
        str: The cleaned and standardized text string.
    """

    # Step 1: Remove "[source: X]" tags. These are specific to the provided sample format.
    text = re.sub(r'\[source:\s*\d+\]', '', text)

    # Step 2: Strip accents and other non-ASCII characters.
    # unicodedata.normalize('NFKD', text) decomposes each character into its base form
    # plus combining marks (e.g., 'á' becomes 'a' followed by a combining acute accent).
    # .encode('ascii', 'ignore') then drops every non-ASCII code point: the combining
    # marks, and also characters with no ASCII decomposition, such as emojis.
    # .decode('ascii') converts the resulting bytes back to a str.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

    # Step 3: Standardize case by converting all text to lowercase.
    # This ensures consistency, so "The" and "the" are treated identically.
    text = text.lower()

    # Step 4: Remove HTML tags (e.g., <div>...</div>).
    # The regex '<[^>]*>' matches any text starting with '<', followed by any characters
    # (except '>'), and ending with '>'.
    text = re.sub(r'<[^>]*>', '', text)

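    # Step 5: Remove email addresses (e.g., contact@domain.com).
    # \S* matches zero or more non-whitespace characters.
    # @ matches the literal '@' symbol.
    # \s? matches an optional whitespace character after the email.
    # Note: the pattern removes ANY whitespace-delimited token containing '@',
    # including attached punctuation, so it also deletes '@@,##.' in the sample.
    text = re.sub(r'\S*@\S*\s?', '', text)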

    # Step 6: Remove phone numbers (e.g., (123)-456-7890).
    # This regex matches only the exact layout "(ddd)-ddd-dddd"; phone numbers
    # written in any other format are not recognized by this step.
    text = re.sub(r'\(\d{3}\)-\d{3}-\d{4}', '', text)

    # Step 7: Remove all remaining punctuation and symbols.
    # `[^a-z0-9\s]` matches any character that is NOT a lowercase letter (a-z),
    # a digit (0-9), or a whitespace character (\s).
    # This removes:
    # - Markup-like symbols such as '**' and '&'
    # - Repeated punctuation like '!!!!!' and ',,,,,'
    # - Currency symbols like '$'
    # - Decimal separators, so '29.99' and '29,99' both become '2999'
    # - Apostrophes from contractions (e.g., "can't" becomes "cant"; contractions
    #   are not expanded)
    # Emojis were already dropped in Step 2, and '@@,##.' was already removed by
    # the email pattern in Step 5.
    text = re.sub(r'[^a-z0-9\s]', '', text)

    # Step 8: Remove extra spaces.
    # `\s+` matches one or more whitespace characters (spaces, tabs, newlines).
    # It replaces them with a single space.
    # `.strip()` removes any leading or trailing whitespace from the entire string.
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning function to the content from 'text_sample.txt'.
cleaned_text_output = clean_text(messy_text_from_sample)

# Print the cleaned text to the console.
print("--- Cleaned Text from text_sample.txt ---")
print(cleaned_text_output)

# Save the cleaned text to a new file named 'cleaned_text_sample.txt'.
# Mode 'w' creates the file if it does not exist and overwrites it if it does.
# encoding="utf-8" makes the output encoding explicit instead of platform-dependent.
output_filename = "cleaned_text_sample.txt"
with open(output_filename, "w", encoding="utf-8") as f:
    f.write(cleaned_text_output)

# Confirm to the user that the file has been saved.
print(f"\nCleaned text saved to: {output_filename}")

--- Cleaned Text from text_sample.txt ---
this is a problematic text file it contains extra spaces special characters some words are misspelleddd and encoding issues like this cause problems prices are inconsistent 2999 2999 usd 2999 emails phone numbers may be embedded repeated punctuations should be removed along with random symbols like stopwords like the is and a appear often html tags might be present this is inside a div and sometimes contractions wont expand cant wont shouldnt random numeric values 123456 98765 2024

Cleaned text saved to: cleaned_text_sample.txt
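
The sample text points out that contractions are not expanded, and Step 7 simply drops their apostrophes ("can't" becomes "cant"). A minimal sketch of one way to expand them before punctuation is stripped is shown here. The `CONTRACTIONS` table and the `expand_contractions` helper are hypothetical names that do not appear in the notebook above, and the table covers only the three contractions found in the sample.

```python
import re

# Hypothetical, deliberately tiny lookup table (an assumption for illustration);
# a real project would need a much fuller list.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "shouldn't": "should not",
}

# One alternation of all known contractions, bounded by word boundaries.
_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded (lowercase) form."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


sample = 'And sometimes, contractions won\'t expand: "can\'t", "won\'t", "shouldn\'t".'
expanded = expand_contractions(sample)

# Same lowercase + punctuation-stripping idea as Steps 3 and 7 above.
stripped = re.sub(r"[^a-z0-9\s]", "", expanded.lower())
print(stripped)
# and sometimes contractions will not expand cannot will not should not
```

To have any effect, a helper like this would need to run before the step that removes apostrophes; after that step the contractions can no longer be told apart from ordinary words.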

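
The sample text also mentions stopwords such as "the", "is", and "a", which the cleaned output still contains. As a rough sketch, they could be filtered from the already cleaned string by splitting on whitespace. The `STOPWORDS` set and the `remove_stopwords` function are hypothetical names introduced for illustration, and the set holds only the three words the sample names.

```python
# Hypothetical stopword set (an assumption for illustration), limited to the
# three words named in the sample text.
STOPWORDS = {"the", "is", "a"}


def remove_stopwords(cleaned_text, stopwords=STOPWORDS):
    """Drop stopwords from text that is already lowercased and space-separated."""
    return " ".join(word for word in cleaned_text.split() if word not in stopwords)


print(remove_stopwords("this is a problematic text file"))
# this problematic text file
```

Because the filter compares whole lowercase tokens, it assumes its input has already been lowercased and stripped of punctuation.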