### Using NLP for Text Data Quality
**Objective**: Enhance text data quality using NLP techniques.

**Task**: Handling Noisy Text Data

**Steps**:
1. Data Set: Obtain a dataset of customer reviews containing noise (e.g., random characters, repeated punctuation, emojis, digits used in place of letters).
2. Clean Data: Use regex patterns to remove the noise from the text.
3. Evaluate: Compare the text before and after cleaning to check that the noise is gone.
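
The evaluation step can also be made measurable instead of relying on visual comparison. The following is a minimal sketch, assuming that "noise" means any character that is neither an ASCII letter nor whitespace; `noise_ratio` is a hypothetical helper introduced here for illustration and is not part of the notebook code.

```python
import re

def noise_ratio(text: str) -> float:
    """Fraction of characters that are neither ASCII letters nor whitespace."""
    if not text:
        return 0.0
    # Assumption: punctuation, symbols, digits, and emojis all count as noise.
    noisy = re.findall(r'[^a-zA-Z\s]', text)
    return len(noisy) / len(text)

before = "Terrible serv!ce@#%^. Never @coming again..."
after = "terrible servce never coming again"
print(f"before: {noise_ratio(before):.2f}, after: {noise_ratio(after):.2f}")
```

A ratio of 0.0 after cleaning indicates that no characters matching this definition of noise remain. It says nothing about whether the remaining words are still spelled correctly.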

In [2]:
import pandas as pd
import re

# ------------------------------
# Step 1: Create a sample dataset
# ------------------------------
data = {
    'ReviewID': [1, 2, 3, 4, 5],
    'CustomerReview': [
        "Loooved it!!! 😍😍 will buy again...!!!$$$$",
        "Terrible serv!ce@#%^. Never @coming again...",
        "5 stars!!!*****     Awesome product \n\n\n",
        "w0rst experienc3 eveR...!!! :(",
        "Th1s pr0duct is S0000 G00D!!! <3 <3 <3"
    ]
}

df = pd.DataFrame(data)

# ------------------------------
# Step 2: Define a text cleaning function with basic error handling
# ------------------------------
def clean_text(text):
    """
    Clean noisy text: lowercase it, then remove special characters,
    emojis, digits, and excess whitespace.

    Parameters:
    text (str): Raw text to clean.

    Returns:
    str: Cleaned text, or an "Error: ..." message if the input is not a string.
    """
    try:
        if not isinstance(text, str):
            raise ValueError("Input is not a string.")

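        text = text.lower()
        # Drop punctuation, symbols, and emojis (anything but a-z, 0-9, whitespace)
        text = re.sub(r'[^a-z0-9\s]', '', text)
        # Drop digits; note this also strips digits used as letters ("w0rst" -> "wrst")
        text = re.sub(r'\d+', '', text)
        # Collapse runs of whitespace (including newlines) and trim both ends
        text = re.sub(r'\s+', ' ', text).strip()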

        return text
    except ValueError as e:
        # Return a marker string instead of raising so that apply() keeps going
        return f"Error: {e}"

# ------------------------------
# Step 3: Apply cleaning to dataset
# ------------------------------
df['CleanedReview'] = df['CustomerReview'].apply(clean_text)

# ------------------------------
# Step 4: Display original vs cleaned
# ------------------------------
print("Original vs Cleaned Reviews:\n")
for _, row in df.iterrows():
    print(f"Original: {row['CustomerReview']}")
    print(f"Cleaned : {row['CleanedReview']}")
    print("-" * 60)

# ------------------------------
# Step 5: Unit Tests
# ------------------------------
def test_clean_text():
    assert clean_text("He!!o 123") == "heo", "Special characters and digits failed"
    assert clean_text("   Hello    World! ") == "hello world", "Whitespace removal failed"
    assert clean_text("") == "", "Empty string test failed"
    assert "Error" in clean_text(None), "Non-string input handling failed"
    print("All unit tests passed!")

test_clean_text()

Original vs Cleaned Reviews:

Original: Loooved it!!! 😍😍 will buy again...!!!$$$$
Cleaned : loooved it will buy again
------------------------------------------------------------
Original: Terrible serv!ce@#%^. Never @coming again...
Cleaned : terrible servce never coming again
------------------------------------------------------------
Original: 5 stars!!!*****     Awesome product 



Cleaned : stars awesome product
------------------------------------------------------------
Original: w0rst experienc3 eveR...!!! :(
Cleaned : wrst experienc ever
------------------------------------------------------------
Original: Th1s pr0duct is S0000 G00D!!! <3 <3 <3
Cleaned : ths prduct is s gd
------------------------------------------------------------
All unit tests passed!
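
The output shows a side effect of deleting digits: reviews that use digits in place of letters lose characters ("w0rst experienc3" becomes "wrst experienc"). One possible mitigation, sketched here under the assumption that a digit inside a token that also contains letters stands for a look-alike letter, is to map such digits back before cleaning. The mapping `LEET_MAP` and the helpers `normalize_leet` and `clean_text_keep_words` are hypothetical names introduced for illustration, and the mapping would need tuning on real data.

```python
import re

# Hypothetical look-alike mapping; an assumption, not derived from the dataset.
LEET_MAP = str.maketrans({'0': 'o', '1': 'i', '3': 'e'})

def normalize_leet(text):
    """Replace digits with look-alike letters, only in tokens that contain a letter."""
    def fix(match):
        token = match.group(0)
        # Leave purely numeric tokens (e.g. the "5" in "5 stars") untouched.
        if any(ch.isalpha() for ch in token):
            return token.translate(LEET_MAP)
        return token
    return re.sub(r'[A-Za-z0-9]+', fix, text)

def clean_text_keep_words(text):
    """Variant of the cleaning steps that restores look-alike digits first."""
    text = normalize_leet(text).lower()
    text = re.sub(r'[^a-z\s]', '', text)       # drop everything but letters and whitespace
    return re.sub(r'\s+', ' ', text).strip()   # collapse whitespace and trim

print(clean_text_keep_words("w0rst experienc3 eveR...!!! :("))
print(clean_text_keep_words("Th1s pr0duct is S0000 G00D!!! <3 <3 <3"))
```

This sketch does not repair symbols embedded in words (for example "serv!ce"), and a fixed mapping can guess wrong when a digit inside a token is meant literally.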
