# TikTok Claims Classification — 04: Hypothesis Testing & Statistical Analysis

***Statistical evaluation of verification status impact on video engagement using hypothesis testing***

**Author:** Katherine Ygbuhay  
**Updated:** 2025-10-04  
**Stage:** 04 — Statistical Analysis  
**Runtime:** ~20 minutes  

## Objective

Inspect and clean the raw data, then test whether TikTok account verification status is associated with a statistically significant difference in mean video view counts.

## Scope & Approach

- **Data quality assessment** with systematic inspection and missing value handling
- **Statistical hypothesis formulation** comparing verified vs. unverified account engagement
- **Welch two-sample t-test** to account for unequal variances between groups
- **Effect size analysis** using Hedges' g to quantify practical significance
- **Confidence interval estimation** for mean difference with proper statistical interpretation

## Key Outputs

- Clean dataset with systematic missing value removal and quality validation
- Statistical significance testing results with p-values and test statistics
- Effect size quantification revealing moderate practical differences between groups
- 95% confidence intervals providing range estimates for population parameters
- Business interpretation of verification status impact on content reach

## Prerequisites

- Raw TikTok dataset with verification status and engagement metrics
- Understanding of hypothesis testing principles and effect size interpretation
- Familiarity with assumptions underlying two-sample t-tests and statistical inference

---

## Imports & Readability

In [1]:
# Core data analysis packages
import pandas as pd
import numpy as np

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis / hypothesis testing
from scipy import stats

In [2]:
# Pandas display settings (improves readability in notebooks)
pd.options.display.float_format = '{:.3f}'.format
pd.set_option("display.max_columns", None)   # show all columns
pd.set_option("display.max_rows", 100)       # up to 100 rows

# Seaborn theme for readability + accessibility
# - "whitegrid" for clarity
# - "colorblind" = Okabe–Ito palette (colorblind-friendly)
sns.set_theme(style="whitegrid", palette="colorblind")

# Matplotlib defaults for consistent figure sizing & typography
import matplotlib as mpl
mpl.rcParams["figure.figsize"] = (7, 5)
mpl.rcParams["axes.titlesize"] = 14
mpl.rcParams["axes.labelsize"] = 12
mpl.rcParams["legend.title_fontsize"] = 11
mpl.rcParams["legend.fontsize"] = 10

# Consistent color palettes for categorical variables
claim_palette = {"claim": "#0072B2", "opinion": "#E69F00"}                 
verified_palette = {"verified": "#009E73", "not verified": "#0072B2"}      
ban_palette = {
    "active": "#0072B2", 
    "under review": "#E69F00", 
    "banned": "#D55E00"
}

In [3]:
# Resolve the case-study root so paths work from any launch directory
from pathlib import Path

def find_case_root(start: Path | None = None) -> Path:
    p = start or Path.cwd()
    for q in [p, *p.parents]:
        if (q / "notebooks").exists() and (q / "data").exists():
            return q
    return p  # fallback

CASE_ROOT = find_case_root()
DATA_FILE = CASE_ROOT / "data" / "raw" / "tiktok_dataset.csv"
assert DATA_FILE.exists(), f"Missing data file: {DATA_FILE}"

In [4]:
# Load dataset
df = pd.read_csv(DATA_FILE)

## Data Inspection Utilities
Small helper to summarize a DataFrame (shape, head, missingness, numeric stats, and top categories).

In [5]:
# DataFrame summary utility (run once, reuse below)
def df_summary(df, head=5, top_k=5):
    """
    Structured DataFrame summary:
      - Shape (rows, columns)
      - Head (first N rows)
      - Column info: dtype, non-null, missing count, % missing
      - Descriptive stats (numeric)
      - Categorical overview: top-K values per object/category column
    """
    import numpy as np
    import pandas as pd
    from IPython.display import display

    # Shape
    print("=== Shape ===")
    print(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]:,}\n")

    # Head
    print(f"=== Head (first {head} rows) ===")
    display(df.head(head))
    print()

    # Column info (missingness)
    print("=== Column Info ===")
    rows, info = df.shape[0], []
    for col in df.columns:
        non_null = df[col].notna().sum()
        nulls = rows - non_null
        pct_missing = (nulls / rows * 100) if rows else 0.0
        info.append([col, df[col].dtype, non_null, nulls, f"{pct_missing:.2f}%"])
    info_df = pd.DataFrame(
        info, columns=["Column", "Dtype", "Non-Null Count", "Missing Count", "% Missing"]
    )
    display(info_df)
    print()

    # Numeric stats
    print("=== Descriptive Statistics (Numeric) ===")
    display(df.describe(include=[np.number]).T.round(3))

    # Categorical overview
    cat_cols = [c for c in df.columns if df[c].dtype == "object" or str(df[c].dtype) == "category"]
    if cat_cols:
        print()
        print(f"=== Categorical Overview (top {top_k}) ===")
        cat_rows = []
        for c in cat_cols:
            vc = df[c].value_counts(dropna=False)
            total = int(vc.sum())
            for val, cnt in vc.head(top_k).items():
                pct = (cnt / total * 100) if total else 0.0
                cat_rows.append([c, str(val), int(cnt), f"{pct:.2f}%"])
        cat_df = pd.DataFrame(cat_rows, columns=["Column", "Value", "Count", "Percent"])
        display(cat_df)

## Raw Data: Initial Inspection

In [6]:
# Inspect the raw dataset
df_summary(df)

=== Shape ===
Rows: 19,382 | Columns: 12

=== Head (first 5 rows) ===


| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |



=== Column Info ===


| | Column | Dtype | Non-Null Count | Missing Count | % Missing |
|---|---|---|---|---|---|
| 0 | # | int64 | 19382 | 0 | 0.00% |
| 1 | claim_status | object | 19084 | 298 | 1.54% |
| 2 | video_id | int64 | 19382 | 0 | 0.00% |
| 3 | video_duration_sec | int64 | 19382 | 0 | 0.00% |
| 4 | video_transcription_text | object | 19084 | 298 | 1.54% |
| 5 | verified_status | object | 19382 | 0 | 0.00% |
| 6 | author_ban_status | object | 19382 | 0 | 0.00% |
| 7 | video_view_count | float64 | 19084 | 298 | 1.54% |
| 8 | video_like_count | float64 | 19084 | 298 | 1.54% |
| 9 | video_share_count | float64 | 19084 | 298 | 1.54% |



=== Descriptive Statistics (Numeric) ===


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| # | 19382.0 | 9691.5 | 5595.246 | 1.0 | 4846.25 | 9691.5 | 14536.75 | 19382.0 |
| video_id | 19382.0 | 5627454067.339 | 2536440464.169 | 1234959018.0 | 3430416807.25 | 5618663579.0 | 7843960211.25 | 9999873075.0 |
| video_duration_sec | 19382.0 | 32.422 | 16.23 | 5.0 | 18.0 | 32.0 | 47.0 | 60.0 |
| video_view_count | 19084.0 | 254708.559 | 322893.281 | 20.0 | 4942.5 | 9954.5 | 504327.0 | 999817.0 |
| video_like_count | 19084.0 | 84304.636 | 133420.547 | 0.0 | 810.75 | 3403.5 | 125020.0 | 657830.0 |
| video_share_count | 19084.0 | 16735.248 | 32036.174 | 0.0 | 115.0 | 717.0 | 18222.0 | 256130.0 |
| video_download_count | 19084.0 | 1049.43 | 2004.3 | 0.0 | 7.0 | 46.0 | 1156.25 | 14994.0 |
| video_comment_count | 19084.0 | 349.312 | 799.639 | 0.0 | 1.0 | 9.0 | 292.0 | 9599.0 |



=== Categorical Overview (top 5) ===


| | Column | Value | Count | Percent |
|---|---|---|---|---|
| 0 | claim_status | claim | 9608 | 49.57% |
| 1 | claim_status | opinion | 9476 | 48.89% |
| 2 | claim_status | nan | 298 | 1.54% |
| 3 | video_transcription_text | nan | 298 | 1.54% |
| 4 | video_transcription_text | a colleague learned from the media that chihu... | 2 | 0.01% |
| 5 | video_transcription_text | someone learned from the media that halley’s ... | 2 | 0.01% |
| 6 | video_transcription_text | i read in the media that a candle’s flame is ... | 2 | 0.01% |
| 7 | video_transcription_text | a friend read in the media a claim that icela... | 2 | 0.01% |
| 8 | verified_status | not verified | 18142 | 93.60% |
| 9 | verified_status | verified | 1240 | 6.40% |


## Cleaning Step: Drop Rows with Missing Values

In [7]:
# Drop rows with any missing values (row-wise) and re-inspect
df_clean = df.dropna(axis=0)
df_summary(df_clean)

=== Shape ===
Rows: 19,084 | Columns: 12

=== Head (first 5 rows) ===


| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |



=== Column Info ===


| | Column | Dtype | Non-Null Count | Missing Count | % Missing |
|---|---|---|---|---|---|
| 0 | # | int64 | 19084 | 0 | 0.00% |
| 1 | claim_status | object | 19084 | 0 | 0.00% |
| 2 | video_id | int64 | 19084 | 0 | 0.00% |
| 3 | video_duration_sec | int64 | 19084 | 0 | 0.00% |
| 4 | video_transcription_text | object | 19084 | 0 | 0.00% |
| 5 | verified_status | object | 19084 | 0 | 0.00% |
| 6 | author_ban_status | object | 19084 | 0 | 0.00% |
| 7 | video_view_count | float64 | 19084 | 0 | 0.00% |
| 8 | video_like_count | float64 | 19084 | 0 | 0.00% |
| 9 | video_share_count | float64 | 19084 | 0 | 0.00% |



=== Descriptive Statistics (Numeric) ===


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| # | 19084.0 | 9542.5 | 5509.221 | 1.0 | 4771.75 | 9542.5 | 14313.25 | 19084.0 |
| video_id | 19084.0 | 5624839917.874 | 2537030180.259 | 1234959018.0 | 3425100251.25 | 5609500370.0 | 7840823300.5 | 9999873075.0 |
| video_duration_sec | 19084.0 | 32.424 | 16.226 | 5.0 | 18.0 | 32.0 | 47.0 | 60.0 |
| video_view_count | 19084.0 | 254708.559 | 322893.281 | 20.0 | 4942.5 | 9954.5 | 504327.0 | 999817.0 |
| video_like_count | 19084.0 | 84304.636 | 133420.547 | 0.0 | 810.75 | 3403.5 | 125020.0 | 657830.0 |
| video_share_count | 19084.0 | 16735.248 | 32036.174 | 0.0 | 115.0 | 717.0 | 18222.0 | 256130.0 |
| video_download_count | 19084.0 | 1049.43 | 2004.3 | 0.0 | 7.0 | 46.0 | 1156.25 | 14994.0 |
| video_comment_count | 19084.0 | 349.312 | 799.639 | 0.0 | 1.0 | 9.0 | 292.0 | 9599.0 |



=== Categorical Overview (top 5) ===


| | Column | Value | Count | Percent |
|---|---|---|---|---|
| 0 | claim_status | claim | 9608 | 50.35% |
| 1 | claim_status | opinion | 9476 | 49.65% |
| 2 | video_transcription_text | a colleague learned from the media a claim th... | 2 | 0.01% |
| 3 | video_transcription_text | a friend read in the media that badminton is ... | 2 | 0.01% |
| 4 | video_transcription_text | a colleague learned from the media a claim th... | 2 | 0.01% |
| 5 | video_transcription_text | a colleague read in the media that earth days... | 2 | 0.01% |
| 6 | video_transcription_text | someone learned from the media a claim that t... | 2 | 0.01% |
| 7 | verified_status | not verified | 17884 | 93.71% |
| 8 | verified_status | verified | 1200 | 6.29% |
| 9 | author_ban_status | active | 15383 | 80.61% |

## Group-Level Metric: Mean Views by Verification Status

In [8]:
# Mean views by verification status (cleaned data)
views_by_verification = (
    df_clean.groupby("verified_status")["video_view_count"]
            .mean()
            .round(2)
            .sort_values(ascending=False)
)
print("=== Mean Views by Verification Status ===")
print(views_by_verification)

=== Mean Views by Verification Status ===
verified_status
not verified   265663.790
verified        91439.160
Name: video_view_count, dtype: float64
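
Group means alone can hide the shape of the data. In the cleaned summary the overall median view count (9,954.5) is far below the mean (254,708.6), which points to a strongly right-skewed distribution. A minimal sketch of a per-group profile that puts the median and spread next to the mean follows; the helper name `views_profile` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def views_profile(df: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Per-group count, mean, median, and standard deviation of a numeric column."""
    return (
        df.groupby(group_col)[value_col]
          .agg(n="count", mean="mean", median="median", std="std")
          .sort_values("mean", ascending=False)
    )

# Toy data (hypothetical values, not the TikTok dataset)
toy = pd.DataFrame({
    "verified_status": ["not verified"] * 4 + ["verified"] * 3,
    "video_view_count": [10.0, 20.0, 30.0, 940.0, 5.0, 15.0, 40.0],
})
profile = views_profile(toy, "verified_status", "video_view_count")
print(profile)
```

Applied to the notebook's data, this would be `views_profile(df_clean, "verified_status", "video_view_count")`. A large gap between mean and median within a group is a cue to read the t-test on means alongside a rank-based or log-scale comparison.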


## Hypothesis Testing

We test whether account verification status is associated with a difference in mean video views using a two-sided Welch two-sample t-test (unequal variances) at α = 0.05.

- **Null hypothesis (H₀):** There is no difference in mean view counts between verified and unverified accounts. Any observed difference is due to sampling variability.  
- **Alternative hypothesis (Hₐ):** There is a difference in mean view counts between verified and unverified accounts.
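
For reference, the hypotheses and the Welch statistic can be written out as follows. The subscripts $nv$ (not verified) and $v$ (verified) are notation chosen here to mirror the notebook's variable names; $\bar{x}$, $s$, and $n$ denote each group's sample mean, sample standard deviation, and size.

$$
H_0:\ \mu_{nv} = \mu_{v} \qquad\qquad H_a:\ \mu_{nv} \neq \mu_{v}
$$

$$
t = \frac{\bar{x}_{nv} - \bar{x}_{v}}{\sqrt{\dfrac{s_{nv}^{2}}{n_{nv}} + \dfrac{s_{v}^{2}}{n_{v}}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_{nv}^{2}}{n_{nv}} + \dfrac{s_{v}^{2}}{n_{v}}\right)^{2}}{\dfrac{\left(s_{nv}^{2}/n_{nv}\right)^{2}}{n_{nv}-1} + \dfrac{\left(s_{v}^{2}/n_{v}\right)^{2}}{n_{v}-1}}
$$

Here $\nu$ is the Welch–Satterthwaite approximation to the degrees of freedom. Unlike the pooled-variance t-test, the denominator keeps the two group variances separate, which is why the test tolerates unequal variances and unequal group sizes.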

In [9]:
# ------------------------------------------------------------
# Welch two-sample t-test: mean views by verification status
# ------------------------------------------------------------

# Pre-conditions
assert "verified_status" in df_clean.columns, "Missing column: verified_status"
assert "video_view_count" in df_clean.columns, "Missing column: video_view_count"

# Split samples
not_verified = df_clean.loc[df_clean["verified_status"] == "not verified", "video_view_count"]
verified     = df_clean.loc[df_clean["verified_status"] == "verified", "video_view_count"]

# Basic sample diagnostics
n_nv, n_v = len(not_verified), len(verified)
m_nv, m_v = not_verified.mean(), verified.mean()

# Welch t-test (unequal variances)
t_stat, p_val = stats.ttest_ind(a=not_verified, b=verified, equal_var=False)

print("Two-Sample t-Test (Welch)")
print(f"n (not verified) = {n_nv:,}, mean = {m_nv:,.2f}")
print(f"n (verified)     = {n_v:,}, mean = {m_v:,.2f}")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_val:.3e}")

Two-Sample t-Test (Welch)
n (not verified) = 17,884, mean = 265,663.79
n (verified)     = 1,200, mean = 91,439.16
T-statistic: 25.499
P-value: 2.609e-120
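
Because view counts appear heavily right-skewed, a common sanity check is to see whether the conclusion holds under a rank-based test or on a log scale. The sketch below runs Welch's t-test on raw values, Welch's t-test on `log1p`-transformed values, and a two-sided Mann-Whitney U test. It uses synthetic log-normal samples; the function name `robustness_checks` and all distribution parameters are assumptions, and nothing here reproduces the notebook's results.

```python
import numpy as np
from scipy import stats

def robustness_checks(a, b) -> dict:
    """p-values from three two-sample comparisons of the same pair of samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    welch_raw = stats.ttest_ind(a, b, equal_var=False)                      # means, raw scale
    welch_log = stats.ttest_ind(np.log1p(a), np.log1p(b), equal_var=False)  # means, log scale
    mwu = stats.mannwhitneyu(a, b, alternative="two-sided")                 # rank-based
    return {
        "welch_raw_p": float(welch_raw.pvalue),
        "welch_log_p": float(welch_log.pvalue),
        "mannwhitney_p": float(mwu.pvalue),
    }

# Synthetic, right-skewed samples (hypothetical; not the TikTok data)
rng = np.random.default_rng(0)
group_a = rng.lognormal(mean=10.0, sigma=1.5, size=2000)
group_b = rng.lognormal(mean=9.0, sigma=1.5, size=300)
results = robustness_checks(group_a, group_b)
print(results)
```

In the notebook, the equivalent call would be `robustness_checks(not_verified, verified)`. Agreement across the three p-values would support the Welch result; disagreement would suggest that the difference in means is driven by a small number of very high-view videos.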


In [10]:
# ------------------------------------------------------------
# Effect size (Hedges' g) + 95% CI for mean difference (Welch)
# ------------------------------------------------------------
import numpy as np
from scipy import stats as _stats  # avoid name shadowing

s_nv = not_verified.std(ddof=1)
s_v  = verified.std(ddof=1)

# Cohen's d with pooled SD, then small-sample correction to Hedges' g
sp2 = ((n_nv - 1) * s_nv**2 + (n_v - 1) * s_v**2) / (n_nv + n_v - 2)
d   = (m_nv - m_v) / np.sqrt(sp2)
J   = 1 - 3 / (4 * (n_nv + n_v) - 9)
g   = d * J

# Welch 95% CI for mean difference
diff = m_nv - m_v
se   = np.sqrt(s_nv**2 / n_nv + s_v**2 / n_v)
df_w = (s_nv**2 / n_nv + s_v**2 / n_v)**2 / ((s_nv**2 / n_nv)**2 / (n_nv - 1) + (s_v**2 / n_v)**2 / (n_v - 1))
tcrit = _stats.t.ppf(0.975, df_w)
ci_lo, ci_hi = diff - tcrit * se, diff + tcrit * se

print(f"Mean difference (not verified − verified): {diff:,.2f}")
print(f"95% CI (Welch): [{ci_lo:,.2f}, {ci_hi:,.2f}]")
print(f"Hedges' g: {g:.2f}")

Mean difference (not verified − verified): 174,224.62
95% CI (Welch): [160,822.87, 187,626.37]
Hedges' g: 0.54
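
The effect-size arithmetic above can be wrapped in a small reusable function, which makes it easier to test on numbers that can be checked by hand. This is a sketch using the same pooled-SD and correction-factor formulas as the cell above; the function name `hedges_g` and the toy inputs are illustrative assumptions.

```python
import numpy as np

def hedges_g(x, y) -> float:
    """Hedges' g: Cohen's d (pooled SD) multiplied by the small-sample correction J."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    # Pooled variance weights each group's sample variance by its degrees of freedom
    pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    d = (x.mean() - y.mean()) / np.sqrt(pooled_var)
    j = 1 - 3 / (4 * (n1 + n2) - 9)  # correction factor; approaches 1 as samples grow
    return float(d * j)

# Toy example (hypothetical values): means differ by 1, pooled SD is sqrt(20/3)
g_toy = hedges_g([2, 4, 6, 8], [1, 3, 5, 7])
print(round(g_toy, 4))
```

With group sizes as large as those in this notebook (17,884 and 1,200), the correction factor is essentially 1, so Hedges' g and Cohen's d are nearly identical; the correction matters mainly for small samples like the toy example.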


## Conclusion

After cleaning, the dataset contained **19,084 videos × 12 features** with no missing values. Most videos came from accounts that were **not verified (~94%)** and **active (~81%)**. Videos from unverified accounts averaged ~266K views, while videos from verified accounts averaged ~91K.

A Welch two-sample t-test showed the difference in mean views is **statistically significant** at α = 0.05 (t ≈ 25.5, p ≈ 2.6 × 10⁻¹²⁰). Effect size analysis (Hedges’ g ≈ 0.54) indicates a **moderate practical difference**, with videos from unverified accounts averaging ~174K more views (95% CI: 161K–188K). We therefore **reject H₀**: mean view counts differ between verified and unverified accounts.

**Implication:** Verification status is associated with audience reach in this dataset, with videos from unverified accounts reaching more viewers on average. Next steps: model verification alongside content and behavioral features (e.g., content type, posting cadence, follower count) to determine whether verification itself drives visibility or proxies for other factors.