# 📌 Goal: Clean unstructured text by removing:

**URLs (e.g. https://google.com)**

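**Mention markers (the @ in @username; the name itself is kept)**

**Hashtag markers (the # in #topic; the word itself is kept)**

**Emojis (along with any other non-ASCII characters)**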

**Extra spaces**

**Stopwords (optional)**

# 📦 Tools used:

**regex (re): to search and replace patterns like URLs, hashtags, etc.**

**nltk: for stopword removal**

In [1]:
import re
import nltk
from nltk.corpus import stopwords

In [2]:
# Download stopwords (only the first time)
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [5]:
# Set of English stopwords
stop_words = set(stopwords.words('english'))
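
One caveat with set-based stopword filtering: tokens produced by `str.split()` keep any attached punctuation, so `"is,"` does not match the stopword `"is"`. The sketch below illustrates this with a small hand-written stand-in set (`demo_stop_words`, an assumption so the snippet runs without the NLTK download) and a hypothetical helper, `filter_stopwords`, that can strip surrounding punctuation before the lookup.

```python
import string

# Stand-in for stopwords.words('english'); an assumption so the snippet runs offline.
demo_stop_words = {"this", "is", "the", "a", "and", "out"}

def filter_stopwords(text, stop_words, strip_punct=False):
    """Drop tokens found in stop_words; optionally ignore surrounding punctuation when matching."""
    kept = []
    for token in text.split():
        key = token.lower()
        if strip_punct:
            # Compare "is," as "is" while keeping the original token untouched
            key = key.strip(string.punctuation)
        if key not in stop_words:
            kept.append(token)
    return " ".join(kept)

sentence = "This is, frankly, the best and simplest demo"
plain = filter_stopwords(sentence, demo_stop_words)
stripped = filter_stopwords(sentence, demo_stop_words, strip_punct=True)
print(plain)     # "is," survives because of the trailing comma
print(stripped)  # "is," is dropped once punctuation is ignored
```

Whether to strip punctuation before matching depends on the downstream task; the cleaning function in this notebook matches whole tokens as they are.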

In [8]:
def clean_social_media_text(text, remove_stopwords=True):
    # Remove URLs (any run of non-space characters starting with http or www)
    text = re.sub(r"http\S+|www\S+", "", text)

    # Drop the @ from mentions but keep the name
    text = re.sub(r"@(\w+)", r"\1", text)

    # Drop the # from hashtags but keep the word
    text = re.sub(r"#(\w+)", r"\1", text)

    # Remove emojis and all other non-ASCII characters
    text = re.sub(r"[^\x00-\x7F]+", "", text)

    # Collapse runs of whitespace into one space and trim both ends
    text = re.sub(r"\s+", " ", text).strip()

    # Optionally remove stopwords (case-insensitive match on whole tokens)
    if remove_stopwords:
        words = [word for word in text.split() if word.lower() not in stop_words]
        text = " ".join(words)

    return text

In [9]:
# Example usage
sample_text = "Hey @john! Check this out 😎🔥: https://t.co/example #awesome #AIrocks"
cleaned = clean_social_media_text(sample_text)
print("Cleaned Text:", cleaned)

Cleaned Text: Hey john! Check : awesome AIrocks
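
To see what each substitution contributes, the regex steps can be applied one at a time to the same sample. This is a minimal sketch: the `steps` list and the `trace_cleaning` helper are names introduced here for illustration, and the stopword step is left out so the snippet does not depend on NLTK.

```python
import re

# (label, pattern, replacement) triples mirroring the regex steps in clean_social_media_text
steps = [
    ("urls", r"http\S+|www\S+", ""),
    ("mentions", r"@(\w+)", r"\1"),
    ("hashtags", r"#(\w+)", r"\1"),
    ("non_ascii", r"[^\x00-\x7F]+", ""),
    ("whitespace", r"\s+", " "),
]

def trace_cleaning(text):
    """Apply each substitution in order and record the text after every step."""
    history = []
    for label, pattern, replacement in steps:
        text = re.sub(pattern, replacement, text)
        history.append((label, text))
    return text.strip(), history

sample = "Hey @john! Check this out 😎🔥: https://t.co/example #awesome #AIrocks"
result, history = trace_cleaning(sample)
for label, intermediate in history:
    print(f"{label:>10}: {intermediate!r}")
print("result:", result)
```

The final value here is `Hey john! Check this out : awesome AIrocks`. The words "this" and "out" are still present because they are only dropped by the stopword step, which explains the shorter output printed above.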


## 🔍 Why We Use These Functions:
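### ***Function: Purpose***
* `re.sub(r"http\S+|www\S+", "", text)`: Removes URLs, i.e. any run of non-space characters beginning with http or www
* `re.sub(r"@(\w+)", r"\1", text)`: Removes the @ symbol but keeps the name (e.g., "@user123" → "user123")
* `re.sub(r"#(\w+)", r"\1", text)`: Removes the # symbol but keeps the word (e.g., "#AI" → "AI")
* `re.sub(r"[^\x00-\x7F]+", "", text)`: Removes emojis and every other non-ASCII character (accented letters included)
* `re.sub(r"\s+", " ", text).strip()`: Collapses runs of whitespace into a single space and trims both ends
* `stopwords.words('english')`: Provides the list of common low-information words such as "the", "is", "and", which are then filtered out of the text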
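
The non-ASCII pattern `[^\x00-\x7F]+` is broad: besides emojis it also deletes accented letters, so "Café" becomes "Caf". If that matters for your data, one possible alternative is to target emoji code points only. The sketch below compares the two; `EMOJI_PATTERN` and its ranges are an assumption for illustration and do not cover every emoji defined by Unicode.

```python
import re

# Non-exhaustive emoji ranges (an assumption for illustration, not a complete Unicode emoji list)
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)

def strip_non_ascii(text):
    """Same substitution as in the notebook: drop every non-ASCII character."""
    return re.sub(r"[^\x00-\x7F]+", "", text)

def strip_emojis_only(text):
    """Drop only characters in the emoji ranges above; other non-ASCII text is kept."""
    return EMOJI_PATTERN.sub("", text)

sample = "Café in São Paulo 😎🔥"
print(repr(strip_non_ascii(sample)))    # accented letters are removed along with the emojis
print(repr(strip_emojis_only(sample)))  # accented letters are preserved
```

Both calls leave a trailing space where the emojis were, so the whitespace step (`re.sub(r"\s+", " ", text).strip()`) is still needed afterwards.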