<h1 style="text-align:center; font-size:50px;">Data Cleaning</h1>


<br/><br/>
### Purpose: Loading Libraries

 - **pandas**: For handling datasets (loading, cleaning, saving).
 - **re**: For text pattern matching and removal (regex-based cleaning).
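 - **ENGLISH_STOP_WORDS**: scikit-learn's built-in English stopword list, used to remove common stopwords.
 - **WordNetLemmatizer**: NLTK lemmatizer for reducing words to their dictionary form (lemmatization). It needs the WordNet corpus, which can be fetched once with `nltk.download("wordnet")`.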
<br/><br/>


In [1]:
import pandas as pd
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import WordNetLemmatizer

<br/>

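### Purpose: Cleans raw text by

 - Removing URLs, mentions, hashtags, single letters, punctuation, and numbers.
 - Converting text to lowercase.
 - Expanding contractions (e.g., "n't" → " not").
 - Removing stopwords (like "the", "is"). scikit-learn's list also contains negations such as "not", so an expanded "not" is dropped at this step.
 - Lemmatizing words if `lemmatize=True` (e.g., "cars" → "car"). `lemmatize()` treats every token as a noun by default, so verb forms such as "running" are left unchanged.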

<br>
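The order of two of these steps interacts. If single letters are stripped before contractions are expanded, the `t` in `don't` disappears first and the `n't` pattern no longer matches. The snippet below is a minimal sketch of that interaction; the helper names `expand_contractions` and `strip_single_letters` are hypothetical and exist only for this illustration.

```python
import re

def expand_contractions(text):
    # Plain string replacements, matching the contraction step described above.
    return (text.replace("n't", " not")
                .replace("'re", " are")
                .replace("'m", " am")
                .replace("'s", " is"))

def strip_single_letters(text):
    # Drop any letter that stands alone between two word boundaries.
    return re.sub(r"\b[a-z]\b", "", text)

sample = "i don't think it's bad"

expanded_first = strip_single_letters(expand_contractions(sample))
stripped_first = expand_contractions(strip_single_letters(sample))

print(expanded_first.split())  # ['do', 'not', 'think', 'it', 'is', 'bad']
print(stripped_first.split())  # ["don'", 'think', "it'", 'bad']
```

Expanding first keeps the negation as its own token; stripping first leaves fragments such as `don'` that later collapse to `don` once punctuation is removed.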

In [2]:
# -----------------------------------------------------------------------
# Preprocessing function
# -----------------------------------------------------------------------
def preprocess_text(text, lemmatize=False):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)      # Remove URLs
    text = re.sub(r"@\w+", "", text)         # Remove mentions
    text = re.sub(r"#\w+", "", text)         # Remove hashtags
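    # Expand contractions first; otherwise the "t" in "don't" is stripped as a single letter and "n't" never matches
    text = text.replace("n't", " not").replace("'re", " are").replace("'m", " am").replace("'s", " is")
    text = re.sub(r'\b[a-zA-Z]\b', '', text) # Remove single characters (like s, t, m)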
    text = re.sub(r'[^a-z\s]', '', text)     # Remove punctuation/numbers
    tokens = re.findall(r"\b\w+\b", text)
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return " ".join(tokens)
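
To get a feel for what the cleaned text looks like, the sketch below traces one made-up tweet through a trimmed-down stand-in for `preprocess_text`. It is an illustration under stated assumptions: `clean_tweet` and the tiny `STOP_WORDS` set are hypothetical (the real function uses scikit-learn's much larger `ENGLISH_STOP_WORDS`), and the contraction, single-letter, and lemmatization steps are left out to keep it short.

```python
import re

# Hypothetical stand-in for ENGLISH_STOP_WORDS; the real list is far larger.
STOP_WORDS = {"the", "is", "a", "to", "and", "at"}

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)    # remove URLs
    text = re.sub(r"@\w+", "", text)       # remove mentions
    text = re.sub(r"#\w+", "", text)       # remove hashtags
    text = re.sub(r"[^a-z\s]", "", text)   # remove punctuation and digits
    tokens = re.findall(r"\b\w+\b", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS)

sample = "@bob The new phone is AMAZING!!! https://example.com/x #tech 100%"
print(clean_tweet(sample))  # new phone amazing
```

The hashtag is removed together with its word (`#tech` disappears entirely), which may or may not be desirable when hashtags carry sentiment.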


<br/><br/>
### Purpose:

 - Defines file paths for raw datasets and cleaned datasets.
 - Ensures consistent references across cleaning functions.
<br/><br/>

In [3]:
# -------------------------------------------------------------------------
# File paths (raw and cleaned)
# -------------------------------------------------------------------------
sentiment140_path = "Sentiment140.csv"
balanced_path = "balanced_sentiment_dataset.csv"
million_path = "milliondataset.csv"

clean_sentiment140_path = "Sentiment140_clean.csv"
clean_balanced_path = "balanced_sentiment_dataset_clean.csv"
clean_million_path = "milliondataset_clean.csv"
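
Each cleaned path above is the raw file name with `_clean` added before the extension. If more datasets are added later, that convention could be derived instead of typed by hand. A minimal sketch, assuming a hypothetical helper named `cleaned_path` that is not part of the notebook:

```python
from pathlib import Path

def cleaned_path(raw_path, suffix="_clean"):
    # Insert the suffix between the file stem and its extension.
    p = Path(raw_path)
    return str(p.with_name(f"{p.stem}{suffix}{p.suffix}"))

print(cleaned_path("Sentiment140.csv"))                # Sentiment140_clean.csv
print(cleaned_path("balanced_sentiment_dataset.csv"))  # balanced_sentiment_dataset_clean.csv
```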

<br/><br/>
### Purpose:

 - Loads the raw Sentiment140 dataset and keeps only the columns needed for analysis, to save processing space.
 - Maps target values so that 0 → negative and 1 → positive (the raw file encodes positive as 4).
 - Cleans text using preprocess_text.
 - Saves the cleaned dataset with only essential columns.
   - New Dataset Name: **"*Sentiment140_clean.csv*"**
<br/><br/>

In [4]:
# -------------------------------------------------------------------------
# 1. Clean Sentiment140
# -------------------------------------------------------------------------

def clean_sentiment140():
    # Load entire dataset
    df = pd.read_csv(sentiment140_path, encoding='latin-1', header=None)
    # Rename columns
    df = df.rename(columns={0: "target", 5: "text"})
    # Map 4 → 1 (positive)
    df["target"] = df["target"].replace(4, 1)
    # Clean text
    df["Cleaned text"] = df["text"].apply(preprocess_text)
    # Keep only required columns
    df = df[["Cleaned text", "target"]]
    # Save cleaned dataset
    df.to_csv(clean_sentiment140_path, index=False)
    print(f"Cleaned Sentiment140 saved to {clean_sentiment140_path}")
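
If the full file does not fit comfortably in memory, the same steps could be applied chunk by chunk using the `chunksize` and `usecols` parameters of `pd.read_csv`. The sketch below is one possible way to do that, not the notebook's method: `clean_csv_in_chunks` is a hypothetical helper, and the demo uses a two-row temporary file with `str.lower` standing in for `preprocess_text`.

```python
import os
import tempfile

import pandas as pd

def clean_csv_in_chunks(src, dst, clean_fn, chunksize=100_000):
    """Clean a headerless Sentiment140-style CSV one chunk at a time."""
    reader = pd.read_csv(src, encoding="latin-1", header=None,
                         usecols=[0, 5], chunksize=chunksize)
    for i, chunk in enumerate(reader):
        chunk = chunk.rename(columns={0: "target", 5: "text"})
        chunk["target"] = chunk["target"].replace(4, 1)
        chunk["Cleaned text"] = chunk["text"].apply(clean_fn)
        # Write the header with the first chunk, then append the rest.
        chunk[["Cleaned text", "target"]].to_csv(
            dst, mode="w" if i == 0 else "a", header=(i == 0), index=False)

# Tiny demo: two made-up rows, processed one row per chunk.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "raw.csv")
    dst = os.path.join(tmp, "clean.csv")
    with open(src, "w", encoding="latin-1") as f:
        f.write('0,1,d,NO_QUERY,u1,"Bad day"\n4,2,d,NO_QUERY,u2,"Great day"\n')
    clean_csv_in_chunks(src, dst, str.lower, chunksize=1)
    result = pd.read_csv(dst)

print(result)
```

Reading only columns 0 and 5 also avoids loading the four unused columns in the first place.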

<br/><br/>
### Purpose:

 - Cleans the Balanced sentiment dataset.
 - Renames the `sentiment` column to `target` and preprocesses the text.
 - Saves cleaned file with only required columns.
    - New Dataset Name : **"*balanced_sentiment_dataset_clean.csv*"**

<br/><br/>

In [5]:
# ------------------------------------------------------------------------
# 2. Clean Balanced dataset
# ------------------------------------------------------------------------
def clean_balanced():
    df = pd.read_csv(balanced_path)
    df = df.rename(columns={"sentiment": "target"})
    df["Cleaned text"] = df["text"].apply(preprocess_text)
    df[["Cleaned text", "target"]].to_csv(clean_balanced_path, index=False)
    print(f"Cleaned Balanced dataset saved to {clean_balanced_path}")

<br/><br/>
### Purpose:

 - Cleans the Million dataset.
 - Keeps only English rows whose label is negative or positive.
 - Maps sentiment strings to numeric targets (negative=0, positive=1).
 - Preprocesses text and saves cleaned dataset.
    - New Dataset Name : **"*milliondataset_clean.csv*"**
<br/><br/>


In [6]:
# -----------------------------------------------------------------------
# 3. Clean Million dataset
# -----------------------------------------------------------------------
def clean_million():
    df = pd.read_csv(million_path)
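    # Keep English rows with a usable label; .copy() avoids SettingWithCopyWarning on the assignments below
    keep = (df["Language"] == "en") & df["Label"].isin(["negative", "positive"])
    df = df[keep].copy()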
    df["target"] = df["Label"].map({"negative": 0, "positive": 1})
    df["Cleaned text"] = df["Text"].apply(preprocess_text)
    df[["Cleaned text", "target"]].to_csv(clean_million_path, index=False)
    print(f"Cleaned Million dataset saved to {clean_million_path}")
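
The filter-then-map logic can be checked on a small in-memory frame before running it on the full file. The rows below are made up for illustration, and the `"neutral"` label is an assumption about what other values the `Label` column might hold.

```python
import pandas as pd

# Made-up rows using the same column names as the Million dataset.
df = pd.DataFrame({
    "Text": ["love it", "hate it", "me encanta", "it exists"],
    "Language": ["en", "en", "es", "en"],
    "Label": ["positive", "negative", "positive", "neutral"],
})

# Keep English rows with a binary sentiment label, then map labels to 0/1.
keep = (df["Language"] == "en") & df["Label"].isin(["negative", "positive"])
subset = df[keep].copy()
subset["target"] = subset["Label"].map({"negative": 0, "positive": 1})

print(subset[["Text", "target"]])
```

Only the first two rows survive. Without the `isin` filter, any label missing from the mapping would become `NaN` in `target`.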



<br/><br/>
## Execute All Cleaning Functions
<br/><br/>

In [7]:
# -------------------
# Run cleaning
# -------------------
clean_sentiment140()
clean_balanced()
clean_million()


Cleaned Sentiment140 saved to Sentiment140_clean.csv
Cleaned Balanced dataset saved to balanced_sentiment_dataset_clean.csv
Cleaned Million dataset saved to milliondataset_clean.csv
