# 02. Data Preprocessing
This notebook focuses on cleaning and transforming raw tweets into a structured format suitable for analysis.

### Why Preprocess Before Labeling?
Due to the "noisy" nature of Twitter data, preprocessing is performed **before** sentiment labeling. This ensures the pretrained model receives clean text, reducing misclassification caused by links, tags, and irregular characters.

### Preprocessing Steps:
1. **Remove Missing Values:** Dropping rows with empty text to ensure data integrity.
2. **Cleaning & Casefolding:** Stripping out URLs, mentions (@), hashtags (#), and non-alphabetic characters, then converting all text to lowercase.
3. **Remove Duplicate Data:** Dropping redundant entries to prevent bias in the modeling stage.
4. **Normalization:** Converting informal words (slang) to standard Indonesian.
5. **Stopword Removal:** Filtering out common words that don't carry significant meaning (using `Sastrawi`).
6. **Tokenization:** Breaking sentences into individual words (tokens) for granular processing.
7. **Stemming:** Reducing each word to its base form (root word) using the `Sastrawi` stemmer to maintain consistency across the dataset.

In [2]:
import re
import pandas as pd

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import (
    StopWordRemoverFactory,
    ArrayDictionary,
    StopWordRemover
)

In [None]:
FILE_PATH = '../data/'
df = pd.read_csv(FILE_PATH + 'crawling_sample.csv', sep=';')
df.head()

In [None]:
df['full_text_original'] = df['full_text']
df = df[['created_at', 'full_text_original', 'full_text']]
df.head()

## Remove Missing Values

In [None]:
df = df.dropna(subset=["full_text"]).reset_index(drop=True)
df.shape

## Cleaning + Casefolding

In [None]:
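def clean_text(text):
    text = re.sub(r"http\S+", "", text)         # URLs
    text = re.sub(r"@\w+", "", text)            # mentions
    text = re.sub(r"#\w+", "", text)            # hashtags
    text = re.sub(r"\bRT\s+", "", text)         # retweet marker; \b keeps words ending in "RT" intact
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # anything that is not a letter or whitespace
    text = re.sub(r"\s+", " ", text)            # collapse repeated whitespace
    return text.strip().lower()

df["full_text"] = df["full_text"].apply(clean_text)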
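
To see what the cleaning rules do to a single tweet, here is a minimal, self-contained sketch. The sample tweets are made up for illustration. The sketch matches `RT` only at a word boundary (`\bRT\s+`); a pattern without the boundary would also strip the tail of uppercase words such as `SMART`.

```python
import re

def clean_text(text):
    text = re.sub(r"http\S+", "", text)         # URLs
    text = re.sub(r"@\w+", "", text)            # mentions
    text = re.sub(r"#\w+", "", text)            # hashtags
    text = re.sub(r"\bRT\s+", "", text)         # retweet marker at a word boundary
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # digits, punctuation, emoji, etc.
    text = re.sub(r"\s+", " ", text)            # collapse repeated whitespace
    return text.strip().lower()

# Hypothetical sample tweets, not taken from the dataset
sample = "RT @user123: Harga BBM naik lagi!!! https://t.co/abc123 #BBM"
print(clean_text(sample))            # harga bbm naik lagi
print(clean_text("SMART tv bagus"))  # smart tv bagus
```

Note that the hashtag rule removes the whole tag, word included, so any sentiment carried by a hashtag is lost at this step.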

## Remove Duplicate Data

In [None]:
df = df.drop_duplicates(subset=["full_text"]).reset_index(drop=True)
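
Tweets that consist only of links, mentions, or hashtags become empty strings after cleaning. An empty string is not a missing value, so the earlier `dropna` does not catch it, and `drop_duplicates` keeps one such row. A minimal sketch of filtering them out, using a toy DataFrame invented for illustration:

```python
import pandas as pd

# Toy data standing in for the cleaned column; values are hypothetical
toy = pd.DataFrame({"full_text": ["harga naik", "", "harga naik", "   ", "bagus"]})

# Drop rows whose text is empty or whitespace-only, then drop duplicates
toy = toy[toy["full_text"].str.strip() != ""]
toy = toy.drop_duplicates(subset=["full_text"]).reset_index(drop=True)

print(toy["full_text"].tolist())  # ['harga naik', 'bagus']
```

Whether empty rows should be dropped here or at a later stage is a design choice; the sketch only shows one way to do it.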

## Normalization

In [None]:
norm = {}  # normalization dictionary intentionally omitted

def add_spaces(text):
    return f" {text} "

def normalization(str_text):
    for key, value in norm.items():
        str_text = str_text.replace(key, value)
    return str_text

df["full_text"] = df["full_text"].apply(add_spaces)
df["full_text"] = df["full_text"].apply(normalization)
df["full_text"] = df["full_text"].str.strip()
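
Because the normalization dictionary is omitted, the sketch below uses a small hypothetical `slang_map`. Padding the text with spaces suggests the keys are written with surrounding spaces (for example `" yg "`), which is an assumption here. With that scheme, `str.replace` can miss a slang word that directly follows another match, because the shared space is consumed by the first replacement. A token-level lookup, shown as an alternative rather than the notebook's method, avoids this:

```python
# Hypothetical slang dictionary for illustration; the notebook's real `norm` is not shown
slang_map = {"yg": "yang", "gk": "tidak", "bgt": "sangat"}

def normalize_by_replace(text, mapping):
    """Space-padded substring replacement, mirroring the approach in this section."""
    text = f" {text} "
    for key, value in mapping.items():
        text = text.replace(f" {key} ", f" {value} ")
    return text.strip()

def normalize_tokens(text, mapping):
    """Look up each whitespace-separated token in the mapping."""
    return " ".join(mapping.get(token, token) for token in text.split())

sample = "harga yg yg naik bgt"
print(normalize_by_replace(sample, slang_map))  # harga yang yg naik sangat
print(normalize_tokens(sample, slang_map))      # harga yang yang naik sangat
```

The token-level version only handles single-word keys; multi-word slang phrases would still need substring or n-gram matching.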

## Remove Stopwords

In [None]:
stop_words = StopWordRemoverFactory().get_stop_words()

retain_words = []  # words to keep in the text (removed from the stopword list), e.g. "tidak", "tanpa"

for word in retain_words:
    if word in stop_words:
        stop_words.remove(word)

stopword_remover = StopWordRemover(ArrayDictionary(stop_words))

df["full_text"] = df["full_text"].apply(stopword_remover.remove)
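
Why retaining words matters for sentiment can be shown with a minimal sketch. It uses a tiny hypothetical stopword list and a plain-Python filter as a stand-in for the `Sastrawi` remover, so the names and word lists below are assumptions, not the library's actual contents:

```python
# Hypothetical stopword list for illustration; not the Sastrawi list
stop_words = ["yang", "di", "dan", "tidak", "ini"]
retain_words = ["tidak"]  # negation word to keep in the text

# Stopword list with the retained words taken out
filtered_stop_words = [w for w in stop_words if w not in retain_words]

def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated token found in `stopwords`."""
    stopword_set = set(stopwords)
    return " ".join(t for t in text.split() if t not in stopword_set)

sample = "pelayanan ini tidak bagus dan lambat"
print(remove_stopwords(sample, stop_words))           # pelayanan bagus lambat
print(remove_stopwords(sample, filtered_stop_words))  # pelayanan tidak bagus lambat
```

With the full list, "tidak bagus" (not good) collapses to "bagus" (good), which flips the apparent sentiment; keeping the negation word preserves it.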

## Tokenization

In [None]:
df['full_text'] = df['full_text'].apply(lambda x: x.split())

## Stemming

In [None]:
stemmer = StemmerFactory().create_stemmer()

def stemming(tokens):
    return " ".join(stemmer.stem(token) for token in tokens)

df["full_text"] = df["full_text"].apply(stemming)

In [None]:
df.to_csv(FILE_PATH + 'processed_sample.csv', index=False, sep=';')