# Exploratory Data Analysis: Online Gambling Promotion Detection

This notebook performs professional-grade EDA for a binary text classification dataset with two columns: `comment` and `label`.

Objectives:
- Verify schema, integrity, and class balance
- Explore text characteristics (lengths, language artifacts)
- Visualize most frequent tokens and n-grams per class
- Generate word clouds per class
- Surface actionable insights for modeling and data quality

Conventions:
- Visuals use Plotly/Altair (no matplotlib)
- Reproducible cells; minimal mutability
- Virtual environment: `source .venv/bin/activate` outside notebook if needed


In [1]:
# Runtime & dependency checks (no matplotlib usage)
import sys, platform, os
import importlib
from pathlib import Path

REQUIRED = [
    "pandas", "numpy", "plotly", "altair", "wordcloud", "scikit_learn", "regex"
]
missing = []
for pkg in REQUIRED:
    try:
        importlib.import_module(pkg.replace("-", "_"))
    except Exception as e:
        missing.append((pkg, str(e)))

print({
    "python": sys.version,
    "platform": platform.platform(),
    "venv": os.environ.get("VIRTUAL_ENV"),
    "pwd": os.getcwd(),
    "missing": missing,
})

DATA_PATH = Path("dataset/train.csv")
assert DATA_PATH.exists(), f"Missing dataset at {DATA_PATH}"

# Optional: set pandas display
import pandas as pd
pd.set_option("display.max_colwidth", 200)


{'python': '3.13.5 (v3.13.5:6cb20a219a8, Jun 11 2025, 12:23:45) [Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-15.1.1-arm64-arm-64bit-Mach-O', 'venv': '/Users/user/code/penambangan-data/.venv', 'pwd': '/Users/user/code/penambangan-data', 'missing': [('plotly', "No module named 'plotly'"), ('altair', "No module named 'altair'"), ('scikit_learn', "No module named 'scikit_learn'")]}


In [2]:
# Load data
import pandas as pd

df = pd.read_csv(DATA_PATH)
assert set(df.columns) >= {"comment", "label"}, f"Unexpected columns: {df.columns}"

# Normalize column names
df = df.rename(columns={"comment": "comment", "label": "label"})

print(df.shape)
df.head(5)


(8171, 2)


Unnamed: 0,comment,label
0,aamiin ya rabb,0
1,terima kasih mengajak jalan2 virtual raja ampat maasya allah banget 😍,0
2,bener prabu,0
3,tonton video ya hehe,0
4,coach nova plis suruh pda bljr sepak penalti asli biar gk bapuk2 😢,0


In [3]:
# Structural checks & data quality
import numpy as np

overview = {
    "num_rows": len(df),
    "num_columns": df.shape[1],
    "columns": df.columns.tolist(),
    "null_counts": df.isna().sum().to_dict(),
    "label_unique": sorted(df["label"].dropna().unique().tolist()),
}
overview


{'num_rows': 8171,
 'num_columns': 2,
 'columns': ['comment', 'label'],
 'null_counts': {'comment': 0, 'label': 0},
 'label_unique': [0, 1]}

In [5]:
# Label balance visualization (Plotly)
import plotly.express as px

label_counts = df["label"].value_counts(dropna=False).rename_axis("label").reset_index(name="count")
fig = px.bar(
    label_counts,
    x="label",
    y="count",
    text="count",
    color="label",
    title="Label Distribution",
)
fig.update_traces(textposition="outside")
fig.update_layout(yaxis_title="Count", xaxis_title="Label", uniformtext_minsize=8, uniformtext_mode="hide")
fig


In [6]:
# Label diagnostics table
label_stats = df.groupby("label").agg(
    n=("label", "size"),
    avg_length=("comment", lambda s: s.fillna("").str.len().mean()),
    pct_empty=("comment", lambda s: (s.isna() | (s.str.strip()=="")).mean()*100),
).reset_index()
label_stats


Unnamed: 0,label,n,avg_length,pct_empty
0,0,7454,66.092031,0.0
1,1,717,46.944212,0.0


In [20]:
# Text feature engineering
import re
import pandas as pd

def count_pattern(series: pd.Series, pattern: str):
    return series.fillna("").str.count(pattern)

def count_emoji(text):
    # Use emoji library to count emoji characters
    import emoji

    def count_emoji(text):
        return sum(1 for char in text if emoji.is_emoji(char))

def count_ord_gt_127(text):
    # Count number of characters with ord > 127
    return sum(ord(c) > 127 for c in text)

def count_unique_ord_gt_127(text):
    # Count unique characters with ord > 127
    return len(set(c for c in text if ord(c) > 127))

clean = df.copy()
clean["comment"] = clean["comment"].fillna("")
clean["length"] = clean["comment"].str.len()
clean["num_words"] = clean["comment"].str.split().str.len()
clean["num_digits"] = count_pattern(clean["comment"], r"\d")
clean["num_emoji"] = clean["comment"].apply(count_emoji)
clean["num_homoglyph"] = clean["comment"].apply(count_ord_gt_127)
clean["num_unique_homoglyph"] = clean["comment"].apply(count_unique_ord_gt_127)

clean_feats = clean[["label", "length", "num_words", "num_digits", "num_emoji", "num_homoglyph", "num_unique_homoglyph"]]
clean_feats.head()


ModuleNotFoundError: No module named 'emoji'

In [18]:
# Plot distributions of text features by label (Plotly)
import plotly.express as px

melted = clean_feats.melt(id_vars=["label"], var_name="feature", value_name="value")
fig = px.violin(
    melted,
    x="feature",
    y="value",
    color="label",
    box=True,
    points=False,
    title="Text Feature Distributions by Label",
)
fig.update_layout(xaxis_title="Feature", yaxis_title="Value")
fig


In [None]:
# Top n-grams per class (1-3 grams)
import pandas as pd
import regex as re
from collections import Counter
import plotly.express as px

STOPWORDS = set([
    "yang","dan","di","ke","dari","itu","ini","ya","ga","gak","nggak","aja","juga","lah","kok","nih","sih","deh","kayak","karena","untuk","pada","terus","udah","udah","banget","bgt","bang","kak","mas","mba","bro","sis","gue","gua","aku","kamu","anda","dia","mereka","kita","kami","the","a","an","to","of","is","are","am","be","in","on","for","with","at","as","it","this","that"
])

TOKEN_PATTERN = re.compile(r"[\p{L}\p{N}]+", re.IGNORECASE)

def tokenize(text: str):
    return [t.lower() for t in TOKEN_PATTERN.findall(text or "")]

def generate_ngrams(tokens, n):
    return [" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

def top_ngrams(series: pd.Series, n: int, top_k: int = 20):
    counts = Counter()
    for text in series.fillna(""):
        toks = [t for t in tokenize(text) if t not in STOPWORDS]
        counts.update(generate_ngrams(toks, n))
    items = [(k, v) for k, v in counts.most_common(top_k) if k.strip()]
    return pd.DataFrame(items, columns=["ngram", "count"])

plots = []
for label_value in sorted(df["label"].unique()):
    subset = df[df["label"]==label_value]["comment"]
    for n in [1,2,3]:
        tbl = top_ngrams(subset, n=n, top_k=20)
        tbl["label"] = label_value
        tbl["n"] = n
        fig = px.bar(tbl.head(20), x="count", y="ngram", orientation="h", title=f"Top {n}-grams for label {label_value}")
        plots.append(fig)

plots[0] if plots else None


In [None]:
# Word clouds per class (using wordcloud lib; rendered as images via Plotly)
from wordcloud import WordCloud
import numpy as np
import plotly.express as px

# Build frequency dict per class using same tokenizer/stopwords
freqs = {}
for label_value in sorted(df["label"].unique()):
    counts = Counter()
    for text in df[df["label"]==label_value]["comment"].fillna(""):
        toks = [t for t in tokenize(text) if t not in STOPWORDS]
        counts.update(toks)
    freqs[label_value] = dict(counts)

figs = []
for label_value, freq in freqs.items():
    if not freq:
        continue
    wc = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(freq)
    img = np.array(wc.to_image())
    fig = px.imshow(img, binary_format="png")
    fig.update_layout(title=f"Word Cloud - label {label_value}", coloraxis_showscale=False)
    fig.update_xaxes(showticklabels=False).update_yaxes(showticklabels=False)
    figs.append(fig)

figs[0] if figs else None


In [None]:
# URLs/domains/handles analysis
import pandas as pd
import regex as re
import plotly.express as px

URL_RE = re.compile(r"https?://[^\s]+|www\.[^\s]+", re.IGNORECASE)
HANDLE_RE = re.compile(r"@[A-Za-z0-9_]+")
DOMAIN_RE = re.compile(r"https?://([^/]+)")

records = []
for row in df[["comment","label"]].itertuples(index=False):
    text, label_value = row
    text = text or ""
    urls = URL_RE.findall(text)
    handles = HANDLE_RE.findall(text)
    domains = []
    for u in urls:
        m = DOMAIN_RE.search(u if u.startswith("http") else f"http://{u}")
        if m:
            domains.append(m.group(1).lower())
    if urls or handles or domains:
        records.append({"label": label_value, "num_urls": len(urls), "num_handles": len(handles), "domains": domains})

entity_df = pd.DataFrame(records)
summary = {
    "rows_with_entities": len(entity_df),
    "pct_rows_with_entities": (len(entity_df) / len(df) * 100) if len(df) else 0,
}
summary



In [None]:
# Visualize top domains and handles by label
import itertools

if not entity_df.empty:
    # explode domains
    domains_exp = entity_df.explode("domains")
    domains_exp = domains_exp.dropna(subset=["domains"])  # remove empty
    top_domains = (domains_exp.groupby(["label","domains"]).size().reset_index(name="count")
                   .sort_values("count", ascending=False).groupby("label").head(20))
    fig_dom = px.bar(top_domains, x="count", y="domains", color="label", orientation="h", title="Top Domains by Label")
    fig_dom.show()

    # handles summary (counts only)
    handles_counts = entity_df.groupby("label")["num_handles"].sum().reset_index()
    fig_handles = px.bar(handles_counts, x="label", y="num_handles", title="Total Handles by Label")
    fig_handles.show()
else:
    print("No URL/handle entities detected.")


## Key Insights

- Class balance: visualize counts to detect imbalance; consider stratified sampling if skewed.
- Text length/words: differences between classes can guide feature scaling and model choice.
- Frequent n-grams: gambling indicators (e.g., domain names, promo words, referral codes) appear prominently in label=1.
- Entities: presence of URLs/domains/handles may be predictive; consider regex-based features.
- Data quality: empty or near-empty comments should be filtered or handled.

Next steps:
- Build minimal baseline models (e.g., TF-IDF + Logistic Regression) using these insights.
- Add language normalization (case-folding, Unicode normalization, slang handling) aligned with modeling pipeline.
- Monitor false positives around polysemy (e.g., ambiguous tokens not referring to gambling).
